STAT8007 - Statistical Meth for Big Data

Title: Statistical Meth for Big Data
Long Title: Statistical Methods for Big Data
Module Code: STAT8007
Credits: 5
NFQ Level: Advanced
Field of Study: Statistics
Valid From: Semester 2 - 2012/13 (February 2013)
Module Delivered in no programmes
Module Coordinator: David Goulding
Module Author: Catherine Palmer
Module Description: In this module the learner will study statistical techniques, with particular emphasis on their application to big data sets. Statistical software such as R will be used in the labs.
Learning Outcomes
On successful completion of this module the learner will be able to:
LO1 Explore data sets and select appropriate statistical methods for data science problems.
LO2 Interpret the results of statistical analyses performed by a software package or presented in research papers.
LO3 Distinguish between parametric and non-parametric methods and decide when the most commonly used non-parametric methods should be applied.
LO4 Analyse data sets with categorical response variables using logistic regression.
Pre-requisite learning
Module Recommendations
This is prior learning (or a practical skill) that is strongly recommended before enrolment in this module. You may enrol in this module if you have not acquired the recommended learning but you will have considerable difficulty in passing (i.e. achieving the learning outcomes of) the module. While the prior learning is expressed as named CIT module(s) it also allows for learning (in another module or modules) which is equivalent to the learning specified in the named module(s).
No recommendations listed
Incompatible Modules
These are modules which have learning outcomes that are too similar to the learning outcomes of this module. You may not earn additional credit for the same learning and therefore you may not enrol in this module if you have successfully completed any modules in the incompatible list.
No incompatible modules listed
Co-requisite Modules
No Co-requisite modules listed

This is prior learning (or a practical skill) that is mandatory before enrolment in this module is allowed. You may not enrol on this module if you have not acquired the learning specified in this section.

No requirements listed

Module Content & Assessment

Indicative Content
Data Pre-Processing and Exploration
Graphical and numerical methods to explore categorical and continuous data sets. Anomaly detection, missing values, data reduction techniques. Normality tests: histograms, Q-Q plots, Kolmogorov-Smirnov, Chi-square. Homogeneity of variance, F-test. Cluster analysis, transformation of variables.
Statistical Inference
Hypothesis testing, Chi-square distribution, analysis of variance (ANOVA), experimental design, observational versus experimental data.
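As an illustration of the Chi-square topic listed above, the following is a minimal sketch of Pearson's Chi-square statistic for a contingency table of observed counts. The function name and the example counts are hypothetical and are not taken from the module materials; in the labs a statistical package would normally perform this calculation.

```python
def chi_square_statistic(table):
    """Pearson's Chi-square statistic for a table of observed counts.

    `table` is a list of rows, each a list of non-negative counts.
    The function name and interface are illustrative assumptions.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical 2x2 table: rows are groups, columns are outcomes.
example_table = [[10, 20], [20, 10]]
example_statistic = chi_square_statistic(example_table)
```

The statistic is compared with a Chi-square distribution on (rows - 1) x (columns - 1) degrees of freedom; a table whose rows are exactly proportional gives a statistic of zero.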
Non-Parametric Methods
Non-parametric versus parametric methods. Typical non-parametric methods: the sign test, Mann-Whitney test, Wilcoxon test, Spearman's rank correlation coefficient, Kruskal-Wallis analysis of ranks.
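To show how a rank-based method differs from a parametric one, the sketch below computes the Mann-Whitney U statistics for two samples using only the ranks of the pooled observations, with tied values sharing their average rank. The helper names and the sample values are hypothetical; this is an illustrative sketch, not the module's prescribed procedure.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) computed from the rank sum of sample_a."""
    pooled = list(sample_a) + list(sample_b)
    ranks = average_ranks(pooled)
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical samples: every value in the first is below every value in the second.
u_a, u_b = mann_whitney_u([1, 2, 3], [4, 5, 6])
```

Because only the ordering of the observations enters the calculation, the result is unchanged by any monotonic transformation of the data, which is one reason such tests are preferred when normality is doubtful.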
Generalised Linear Models
Definition of a generalised linear model: link functions. Overview of different types of generalised linear models and their uses, with a focus on logistic regression for binary data.
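For logistic regression the link function is the logit, which maps a probability to the scale of the linear predictor; its inverse maps the linear predictor back to a probability. The sketch below illustrates this for a single explanatory variable. The coefficient values and function names are hypothetical, chosen only for illustration, and no model fitting is shown.

```python
import math


def logit(p):
    """Logit link: log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))


def inverse_logit(eta):
    """Inverse link: map a linear predictor eta to a probability."""
    return 1 / (1 + math.exp(-eta))


def predict_probability(intercept, slope, x):
    """Probability of the response being 1 for a single covariate value x."""
    return inverse_logit(intercept + slope * x)


# Hypothetical coefficients: the linear predictor is -1.0 + 0.5 * x.
probability = predict_probability(-1.0, 0.5, 2.0)
```

With these illustrative coefficients the linear predictor at x = 2 is zero, which corresponds to a probability of one half; each unit increase in x adds the slope to the log-odds.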
Software analysis
SPSS, R, Excel
Assessment Breakdown | %
Course Work | 100.00%
Course Work
Assessment Type | Assessment Description | Outcome addressed | % of total | Assessment Date
Short Answer Questions | Theory Assessment - Data Exploration | 1,2 | 25.0 | Week 7
Short Answer Questions | Theory Assessment - Generalised Linear Models | 3,4 | 25.0 | Week 11
Project | Analyse (large) data set(s) and report results. | 1,2,3,4 | 50.0 | Sem End
No End of Module Formal Examination
Reassessment Requirement
Coursework Only
This module is reassessed solely on the basis of re-submitted coursework. There is no repeat written examination.

The institute reserves the right to alter the nature and timings of assessment.


Module Workload

Workload: Full Time
Workload Type | Workload Description | Hours | Frequency | Average Weekly Learner Workload
Lecture | Lectures | 2.0 | Every Week | 2.00
Lab | Labs | 2.0 | Every Week | 2.00
Independent Learning | Independent learning | 3.0 | Every Week | 3.00
Total Hours 7.00
Total Weekly Learner Workload 7.00
Total Weekly Contact Hours 4.00
Workload: Part Time
Workload Type | Workload Description | Hours | Frequency | Average Weekly Learner Workload
Lecture | Lecture | 1.5 | Every Week | 1.50
Lab | Lab | 1.5 | Every Week | 1.50
Lecturer Supervised Learning (Non-contact) | Lecturer Supervised Learning | 4.0 | Every Week | 4.00
Total Hours 7.00
Total Weekly Learner Workload 7.00
Total Weekly Contact Hours 3.00

Module Resources

Recommended Book Resources
  • Michael J. Crawley 2012, The R Book, Wiley-Blackwell [ISBN: 978-0470973929]
  • Pang-Ning Tan, Michael Steinbach, Vipin Kumar 2006, Introduction to Data Mining, Pearson Addison Wesley, Boston [ISBN: 978-0321321367]
Supplementary Book Resources
  • Annette J. Dobson 2002, An Introduction to Generalized Linear Models, Second Edition, Chapman and Hall [ISBN: 978-1584881650]
  • Luis Torgo 2010, Data Mining With R, Chapman & Hall [ISBN: 978-1439810187]
  • Colin Gray, Paul R Kinnear 2011, IBM SPSS Statistics 19 Made Simple [ISBN: 1848720696]
This module does not have any article/paper resources
This module does not have any other resources

Cork Institute of Technology
Rossa Avenue, Bishopstown, Cork