CS 6240 - Parallel Data Processing with Map-Reduce

Explore the underlying principles of the distributed processing of large data sets. Gain an understanding of the performance and usability tradeoffs of various data analytics infrastructures. Work with large data sets and conduct practical experiments with machine learning techniques. Gain a working knowledge of technologies such as Hadoop and Spark and an insight into their implementation. The class builds on known principles such as the design recipe, testing, and code reviews.

Notes / Data

Office Hours

Nat Tuck WVH 314 Wed 3-4pm
Monisha Singh CCIS Lab Wed noon-2pm
Yogendra Miraje CCIS Lab Wed 3:30 - 5:30pm
Shreyas Mahimkar CCIS Lab Tues 1-3pm

Work For This Class

This course will require 15hrs/week on average.

Weekly Assignments

Each week you will be given a homework assignment, due on Thursday at midnight. The initial assignments will be solo; later assignments will move to teams of 2 and then 4 as the semester progresses.

Once you submit your assignment, it will be assigned to another team for code review. Code reviews are due the following Monday at midnight. Late submissions are not accepted.

Code Walks

Each week some teams will be randomly selected for in-class code walks. Both the team that wrote the code and the team that reviewed it will be asked to explain and justify their work.

Quizzes

At the start of each class there may be a 5-10 minute quiz. The quiz will test your understanding of key points in the course material and the assigned reading.

In-Class Coding

There will be weekly in-class coding assignments. These should take about an hour, and are expected to be completed in class. You will work alone on these assignments, although verbal discussion is allowed. Make sure to bring a laptop.

Blackboard Modules

NU Online Blackboard: http://nuonline.neu.edu

Each week you are expected to review the lesson material and complete the online quiz for a module on NU Online Blackboard. This material is from the online version of the class, and covers slightly different material from a slightly different perspective.

Final Project

Once you've learned how to use map-reduce, you'll have to build it.

Grading

Homework & Peer Code Reviews 20%
Code Walks 10%
Final Project 40%
In-Class Coding 10%
Blackboard Modules 10%
Quizzes 10%

Grades will be assigned on the following scale:

95+  90+  85+  80+  75+  70+  60+
A    A-   B+   B    B-   C    D
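
As a rough illustration of how the weights and the scale above combine, here is a minimal sketch in Python. The component names, the function names, and the sample scores are made up for the example. It also assumes that each component is scored as a percentage from 0 to 100, that no rounding or curving is applied, and that a total below 60 is an F; the syllabus does not state any of these.

```python
# Weights from the Grading section, as whole percentages (they sum to 100).
# The dictionary keys are hypothetical labels, not official names.
WEIGHTS = {
    "homework_and_reviews": 20,
    "code_walks": 10,
    "final_project": 40,
    "in_class_coding": 10,
    "blackboard_modules": 10,
    "quizzes": 10,
}

# Cutoffs from the grade scale, highest first.
SCALE = [(95, "A"), (90, "A-"), (85, "B+"), (80, "B"),
         (75, "B-"), (70, "C"), (60, "D")]


def weighted_total(scores):
    """Combine per-component percentages (0-100) into one overall percentage."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(total):
    """Return the first letter whose cutoff the total meets."""
    for cutoff, letter in SCALE:
        if total >= cutoff:
            return letter
    return "F"  # Assumption: the scale does not list a grade below 60.


# Made-up scores for illustration only.
example = {
    "homework_and_reviews": 90,
    "code_walks": 80,
    "final_project": 85,
    "in_class_coding": 100,
    "blackboard_modules": 95,
    "quizzes": 70,
}

total = weighted_total(example)
print(total, letter_grade(total))
```

Because the final project carries 40% of the weight, it moves the total twice as much as homework and four times as much as any other single component.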

Schedule

Jan 12
Jan 15
  • Distributed Processing
  • Concurrency: Threads in Java
  • Read: HTDG Ch 1-2
  • Friday: HW0, BB1 Due
Jan 19
Jan 22
  • Databases and Big Data
  • SQL and NoSQL
  • Read: CAP Theorem
  • Tuesday: HW0 Review Due
  • Thursday: HW1, BB2 Due
Jan 26
Jan 29
  • Map Reduce
  • Data Analytics in the Small
  • Read: HTDG Ch 5, 6, 7; MR04
  • Monday: HW1 Code Review Due
  • Thursday: HW2, BB3 Due
Feb 2
Feb 5
  • Map-Reduce in Depth
  • Combiners, In-Map Combining, Custom Partitioners
  • Read: HTDG Ch 5, 6, 7, 8, 16
  • Monday: HW2 Code Review Due
  • Thursday: HW3, BB4 Due
Feb 9
Feb 12
  • Measuring Performance
  • Distributed Sort & Join
  • Read: MAS11; VLDB12; O+14
  • Monday: HW3 Code Review Due
  • Thursday: HW4, BB5 Due
Feb 16
Feb 19
  • Hadoop & Beyond
  • Introducing Spark
  • Read: S14; SK12
  • Monday: HW4 Code Review Due
  • Thursday: HW5, BB6 Due
Feb 22
Feb 25
  • Data-Parallel Pipelines
  • BigTable / HBase
  • Read: FJ10; R+12; OS06; M3R12
  • Monday: HW5 Code Review Due
  • Thursday: HW6, BB7 Due
Mar 1
Mar 4
  • Resilient Distributed Data
  • Relational Data Processing
  • Read: RDD12, A+15
  • Monday: HW6 Code Review Due
  • Thursday: HW7, BB8 Due
Mar 7

Spring Break

Mar 15
Mar 18
  • Scaling Spark
  • Project Planning: Building Map-Reduce
  • Read: AD15
  • Monday: HW7 Code Review Due
  • Thursday: HW8, BB9 Due
Mar 22
Mar 25
  • Building a Distributed System
Mar 29
Apr 1
  • Application: K-Means
  • Thursday: HW9 Review, BB12 Due
Apr 5
Apr 8
  • Application: Shortest Path
  • Thursday: BB13 Due
  • Work on project.
Apr 12
Apr 15
  • OpenCL
  • K-Means in OpenCL
  • Thursday: BB14 Due
  • Work on project.
Apr 19
Apr 22
  • Bonus Topic
  • Thursday: Project Report Due
Apr 25-29

Final Project Presentations

Reading

HTDG Hadoop The Definitive Guide, White, O'Reilly
MR04 MapReduce: Simplified Data Processing on Large Clusters, Dean, Ghemawat, OSDI04, PDF
FJ10 FlumeJava: Easy, Efficient Data-Parallel Pipelines, Chambers+, PLDI'10, PDF
SK12 Possible Hadoop Trajectories, Stonebraker, Kepner, CACM'12, Text
S14 Hadoop at a Crossroads?, Stonebraker, CACM'14, Text
M3R12 Increased Performance for In-Memory Hadoop Jobs, Shinnar+, VLDB'12 PDF
RDD12 A Fault-Tolerant Abstraction for In-Memory Cluster Comp, Zaharia+, NSDI'12 PDF
VLDB12 The Performance of MapReduce: An In-depth Study, Jiang+, VLDB'10, PDF
MAS11 Evaluating MapReduce Performance Using Workload Suites, Chen+, MASCOTS'11, PDF
OS06 Bigtable: A Distributed Storage System for Structured Data, Chang+, OSDI06 PDF
R+12 Nobody ever got fired for using Hadoop on a cluster, Rowstron+, PDF
A+15 Spark SQL: Relational Data Processing in Spark, Armbrust+, SIGMOD15 PDF
O+14 Anti-Combining for MapReduce, Okcan+, SIGMOD14 PDF
AD15 Scaling Spark in the Real World: Performance and Usability, Armbrust+, VLDB15 PDF

Policies

Sharing Code

Here is the code sharing / plagiarism policy for the class:

Grade Challenges

If you want to contest a grade once you've received it, this class uses a variant of the "coach's challenge" system to resolve such challenges. Here are the rules: