DATA8005 - Distributed Data Management

Title: Distributed Data Management
Long Title: Distributed Data Management
Module Code: DATA8005
 
Credits: 5
NFQ Level: Advanced
Field of Study: Data Format
Valid From: Semester 1 - 2018/19 ( September 2018 )
Module Delivered in 1 programme(s)
Module Coordinator: TIM HORGAN
Module Author: Ignacio Castineiras
Module Description: Big data analytics turns big datasets into high-quality information, providing deeper insights that enable better decisions. However, big data requires novel data storage and data processing techniques. In this module the learner will survey the main NoSQL-based data models as an alternative to the traditional relational model. The learner will also explore the ecosystem of a big data framework, with a special focus on data storage and the application of large-scale data analysis libraries.
Learning Outcomes
On successful completion of this module the learner will be able to:
LO1 Appraise the challenges posed by big data and the new infrastructure, data models and processing techniques it demands.
LO2 Survey the main NoSQL-based data models, exploring the best-fit for different use-cases.
LO3 Query a range of NoSQL databases using a high-level programming language.
LO4 Explore the scalability, flexibility and reliability of a distributed data cluster supporting large data sets.
LO5 Implement an analytical solution over a large-scale dataset using MapReduce and Spark.
Pre-requisite learning
Module Recommendations
This is prior learning (or a practical skill) that is strongly recommended before enrolment in this module. You may enrol in this module if you have not acquired the recommended learning but you will have considerable difficulty in passing (i.e. achieving the learning outcomes of) the module. While the prior learning is expressed as named CIT module(s) it also allows for learning (in another module or modules) which is equivalent to the learning specified in the named module(s).
No recommendations listed
Incompatible Modules
These are modules which have learning outcomes that are too similar to the learning outcomes of this module. You may not earn additional credit for the same learning and therefore you may not enrol in this module if you have successfully completed any modules in the incompatible list.
No incompatible modules listed
Co-requisite Modules
No Co-requisite modules listed
Requirements
This is prior learning (or a practical skill) that is mandatory before enrolment in this module is allowed. You may not enrol on this module if you have not acquired the learning specified in this section.
No requirements listed
Co-requisites
No Co Requisites listed
 

Module Content & Assessment

Indicative Content
The Big Data Revolution.
Data storage and data processing: Historical evolution. New infrastructure, data models and processing techniques required to deal with big data. Main challenges: Capture, store, search, analyse and visualise the data.
Big Data Framework.
Dataset characterisation: Variety, velocity and volume. Data Framework ecosystem overview: Tools to ingest, store, analyse and manage data. Data integration: Extracting, transforming and loading relational and non-relational data.
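As a rough illustration of the data integration item above, the following Python sketch extracts records from a relational source and from a JSON document source, transforms both to one shape, and loads them into a target table. It uses only the standard library. The table names (`customers`, `customers_clean`), the field names and the sample records are hypothetical and are not part of the module specification.

```python
import json
import sqlite3


def extract_relational(conn):
    """Extract rows from a relational source (hypothetical `customers` table)."""
    cur = conn.execute("SELECT id, name, city FROM customers")
    return [{"id": r[0], "name": r[1], "city": r[2]} for r in cur.fetchall()]


def extract_documents(raw_json):
    """Extract records from a non-relational source: a JSON array of documents."""
    return json.loads(raw_json)


def transform(record):
    """Normalise records from either source to one flat shape."""
    address = record.get("address", {})  # documents may nest the city
    return {
        "id": int(record["id"]),
        "name": record["name"].strip().title(),
        "city": record.get("city") or address.get("city"),
    }


def load(conn, records):
    """Load the cleaned records into the target table."""
    conn.executemany(
        "INSERT INTO customers_clean (id, name, city) VALUES (:id, :name, :city)",
        records,
    )
    conn.commit()


def run_etl():
    """Run the three steps against an in-memory database with sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
    conn.execute(
        "CREATE TABLE customers_clean (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )
    conn.execute("INSERT INTO customers VALUES (1, ' ada lovelace ', 'London')")
    docs = '[{"id": "2", "name": "alan turing", "address": {"city": "Manchester"}}]'

    extracted = extract_relational(conn) + extract_documents(docs)
    load(conn, [transform(r) for r in extracted])
    return conn.execute(
        "SELECT id, name, city FROM customers_clean ORDER BY id"
    ).fetchall()
```

Calling `run_etl()` returns the cleaned rows from both sources in a single relational table. A production pipeline would typically use a dedicated ingestion tool rather than hand-written functions.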
Data Storage.
Distributed File System. Data nodes vs. name nodes. Data replication and fault tolerance. Cluster manager: components and roles. Large file splitting and distribution algorithms.
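The ideas of file splitting, replication and fault tolerance listed above can be sketched in a few lines of Python. This is a toy model, not the algorithm of any particular file system: blocks are placed round-robin, whereas real distributed file systems usually apply rack-aware policies. The function names and node names (`dn1`, `dn2`, ...) are hypothetical.

```python
def split_into_blocks(data, block_size):
    """Split a large file (as bytes) into fixed-size blocks; the last may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, data_nodes, replication=3):
    """Build name-node-style metadata: block index -> data nodes holding a replica.

    Assumption: simple round-robin placement, chosen only for illustration.
    """
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds the number of data nodes")
    n = len(data_nodes)
    return {
        block: [data_nodes[(block + r) % n] for r in range(replication)]
        for block in range(num_blocks)
    }


def surviving_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return {
        block
        for block, nodes in placement.items()
        if any(node not in failed_nodes for node in nodes)
    }
```

For example, with four data nodes and a replication factor of 2, losing one node leaves every block readable, while losing two adjacent nodes can make a block unavailable:

```python
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=2)
surviving_blocks(placement, {"dn2"})          # all three blocks survive
surviving_blocks(placement, {"dn2", "dn3"})   # block 1 is lost
```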
NoSQL Data Models.
NoSQL databases arising to tackle problems that RDBMSs are not good at: Schema-less, high-level data representation, scale-out distributed infrastructure. CAP theorem. Loss of transactional properties: ACID relational properties vs. BASE for NoSQL. Wide range of data models: Pure key/value, column-based, document-oriented and graph-based. Trade-off between their expressiveness and efficiency. Polyglot persistence: Combining different NoSQL data models into a fit-for-purpose multi-component system.
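To make the four data models above more concrete, the sketch below represents similar data in each of them using plain Python structures. It does not use any real database API. All keys, field names and sample values are hypothetical, and the two helper functions only hint at the kind of query each model makes natural.

```python
import json

# Key/value: the value is opaque to the store; the only access path is the key.
kv_store = {"user:1": json.dumps({"name": "Ada", "city": "Cork"})}

# Column-based: row key -> column family -> column -> value.
column_store = {
    "user:1": {
        "profile": {"name": "Ada", "city": "Cork"},
        "activity": {"last_login": "2018-09-01"},
    }
}

# Document-oriented: self-describing documents that can be queried by field.
documents = [
    {"_id": 1, "name": "Ada", "city": "Cork", "follows": [2]},
    {"_id": 2, "name": "Alan", "city": "Dublin", "follows": []},
]

# Graph-based: nodes plus labelled edges, suited to traversals.
nodes = {1: {"name": "Ada"}, 2: {"name": "Alan"}, 3: {"name": "Grace"}}
edges = [(1, "FOLLOWS", 2), (2, "FOLLOWS", 3)]


def find_documents(collection, **criteria):
    """Return the documents whose fields match every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


def reachable(start, label):
    """Return the sorted names of nodes reachable from `start` via `label` edges."""
    seen, frontier = set(), [start]
    while frontier:
        current = frontier.pop()
        for src, edge_label, dst in edges:
            if src == current and edge_label == label and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return sorted(nodes[n]["name"] for n in seen)
```

The key/value store answers only "give me the value for this key", which is cheap to distribute. The document and graph models support richer queries, such as `find_documents(documents, city="Cork")` or `reachable(1, "FOLLOWS")`, at the cost of more work per query. This is one way to read the expressiveness versus efficiency trade-off.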
Data Processing: Large-scale Analytics.
Text, temporal and geospatial-based datasets. Execution plan: Cluster node collaboration, parallel processing, job scheduling, network transfer, key/value-based communication. Large-scale analytics libraries: MapReduce and Spark. Comparing their expressiveness and efficiency.
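The key/value-based communication behind MapReduce can be illustrated with a word count written in plain Python. The sketch below runs in a single process and only mimics the map, shuffle and reduce phases; it does not use the Hadoop or Spark APIs, and the function names are hypothetical. Each inner list stands in for the lines held by one cluster node.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would
    when moving data across the network between nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(partitions):
    """Run the three phases over a list of partitions (lists of lines)."""
    mapped = [
        pair
        for partition in partitions
        for line in partition
        for pair in map_phase(line)
    ]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

For example, `word_count([["big data big"], ["data store"]])` counts words across both partitions. In Spark's RDD API the same pipeline is commonly expressed with `flatMap`, `map` and `reduceByKey`, which is one basis for comparing the expressiveness of the two libraries.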
Assessment Breakdown | %
Course Work | 100.00%
Course Work
Assessment Type | Assessment Description | Outcome addressed | % of total | Assessment Date
Practical/Skills Evaluation | Implement queries for a large dataset stored in both a document-oriented and a graph-based database. | 1,2,3 | 50.0 | Week 7
Practical/Skills Evaluation | Implement a solution for analysing a large dataset using MapReduce and Spark. | 1,4,5 | 50.0 | Week 12
No End of Module Formal Examination
Reassessment Requirement
Coursework Only
This module is reassessed solely on the basis of re-submitted coursework. There is no repeat written examination.

The institute reserves the right to alter the nature and timings of assessment

 

Module Workload

Workload: Full Time
Workload Type | Workload Description | Hours | Frequency | Average Weekly Learner Workload
Lecture | Lecture based on Indicative Content | 1.0 | Every Week | 1.00
Lab | Lab based on Indicative Content | 3.0 | Every Week | 3.00
Independent Learning | Student undertakes independent study and develops relevant programming skills | 3.0 | Every Week | 3.00
Total Hours 7.00
Total Weekly Learner Workload 7.00
Total Weekly Contact Hours 4.00
Workload: Part Time
Workload Type | Workload Description | Hours | Frequency | Average Weekly Learner Workload
Lecture | Lecture based on Indicative Content | 1.0 | Every Week | 1.00
Lab | Lab based on Indicative Content | 3.0 | Every Week | 3.00
Independent Learning | Student undertakes independent study and develops relevant programming skills | 3.0 | Every Week | 3.00
Total Hours 7.00
Total Weekly Learner Workload 7.00
Total Weekly Contact Hours 4.00
 

Module Resources

Recommended Book Resources
  • Pramod J. Sadalage and Martin Fowler 2013, NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, Addison-Wesley [ISBN: 9780321826626]
  • Ofer Mendelevitch, Casey Stella and Douglas Eadline 2017, Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Pearson Education [ISBN: 9780134024141]
Supplementary Book Resources
  • John Sharp et al. 2013, Data Access for Highly-Scalable Solutions: Using SQL, NoSQL, and Polyglot Persistence, Microsoft patterns & practices [ISBN: 9781621140306]
  • Kristina Chodorow 2013, MongoDB: The Definitive Guide, O'Reilly Media [ISBN: 9781449344689]
  • Srinath Perera and Thilina Gunarathne 2013, Hadoop MapReduce Cookbook, Packt Publishing [ISBN: 9781849517294]
This module does not have any article/paper resources
Other Resources
 

Module Delivered in

Programme Code | Programme | Semester | Delivery
CR_SDAAN_8 | Higher Diploma in Science in Data Science & Analytics | 2 | Mandatory

Cork Institute of Technology
Rossa Avenue, Bishopstown, Cork

Tel: 021-4326100     Fax: 021-4545343
Email: help@cit.edu.ie