CS 6240 - Parallel Data Processing with Map-Reduce

Explore the underlying principles of the distributed processing of large data sets. Gain an understanding of the performance and usability tradeoffs of various data analytics infrastructures. Work with large data sets and conduct practical experiments with machine learning techniques. Gain a working knowledge of technologies such as Hadoop and Spark and an insight into their implementation. The class builds on known principles such as the design recipe, testing, and code reviews.

Course Resources

Notes / Data

Office Hours

Nat Tuck WVH 314 Thursday, 2-4pm ntuck ⚓ ccs.neu.edu
Mirek Riedewald WVH 332 TBA  
Joe Sackett WVH 462 Tuesday, 3:30 - 4:30pm jsackett ⚓ ccs.neu.edu
Ankur Shanbhag WVH First Floor Wednesday, 4-5pm ankurs ⚓ ccs.neu.edu
Swapnil Mahajan WVH First Floor Friday, 4-5pm swapm31 ⚓ ccs.neu.edu

Work For This Class

Homework Assignments

Every week or two you will be given a homework assignment to complete. Homework assignments will be posted to Bottlenose, and work should be submitted there as well.

The last assignment will be a small project, worth the same points as two homeworks.

Assignments are due at 11pm on the specified day. Late submissions will receive an automatic 50% point deduction. Submissions more than a day late will receive a 100% point deduction.
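
As a rough illustration of the late policy above, the sketch below computes the fraction of points a submission keeps. It assumes that "a day late" means 24 hours after the 11pm deadline and that a submission exactly at the deadline counts as on time; the syllabus does not spell out either detail. The function name `late_multiplier` and the example date are hypothetical.

```python
from datetime import datetime, timedelta


def late_multiplier(due: datetime, submitted: datetime) -> float:
    """Return the fraction of earned points kept under the late policy.

    Assumptions (not stated in the syllabus): "a day" is exactly 24 hours,
    and submitting exactly at the deadline is on time.
    """
    if submitted <= due:
        return 1.0  # on time: no deduction
    if submitted - due <= timedelta(days=1):
        return 0.5  # late by up to one day: 50% deduction
    return 0.0      # more than a day late: 100% deduction


# Hypothetical deadline: 11pm on some due date.
due = datetime(2016, 9, 23, 23, 0)
on_time = late_multiplier(due, due - timedelta(minutes=5))
one_hour_late = late_multiplier(due, due + timedelta(hours=1))
two_days_late = late_multiplier(due, due + timedelta(days=2))
```

Check with the course staff for how edge cases are actually handled; Bottlenose may apply the deduction differently.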

Blackboard Modules

NU Online Blackboard: http://nuonline.neu.edu

Each week you are expected to review the lesson material and complete the online quiz for a module on NU Online Blackboard. This should be completed before class so you are prepared for the lecture and any in-class questions.

In-Class Coding

There will be in-class coding assignments approximately weekly. Make sure to bring your laptop to class. These are due at the end of class, and should be submitted online through Bottlenose.

Participation

You are expected to use the online discussion forum, and to answer questions asked by your classmates. This will be graded by looking at the total number of posts and the number of good answers.

Questions will occasionally be asked in class of a random student. Being present and answering will contribute slightly to your in-class coding grade. Not being present will hurt your grade slightly.

Grading

Homework 40%
Participation & In-Class Coding 10%
Blackboard Modules 10%
Exam 40%

Grades will be assigned on the following scale:

93+ 90+ 87+ 83+ 80+ 75+ 70+ 60+
A A- B+ B B- C+ C D
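
To make the arithmetic concrete, here is a minimal sketch of how the category weights and the scale above could combine into a letter grade. It assumes each category score is a percentage from 0 to 100, that no rounding is applied before comparing against a cutoff, and that totals below 60 map to an F; none of these details are stated in the syllabus. The names `WEIGHTS`, `SCALE`, `weighted_total`, and `letter_grade` are hypothetical.

```python
# Category weights in percent, taken from the grading breakdown.
WEIGHTS = {
    "homework": 40,
    "participation_and_in_class": 10,
    "blackboard_modules": 10,
    "exam": 40,
}

# Cutoffs from the grading scale, highest first ("93+" is read as >= 93).
SCALE = [
    (93, "A"), (90, "A-"), (87, "B+"), (83, "B"),
    (80, "B-"), (75, "C+"), (70, "C"), (60, "D"),
]


def weighted_total(scores: dict) -> float:
    """Combine per-category percentages (0-100) into an overall percentage."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(total: float) -> str:
    """Map an overall percentage to a letter using the first cutoff it meets."""
    for cutoff, letter in SCALE:
        if total >= cutoff:
            return letter
    return "F"  # assumption: the scale does not list a grade below 60


# Example with made-up scores: (40*90 + 10*80 + 10*100 + 40*85) / 100 = 88.0
example = {
    "homework": 90,
    "participation_and_in_class": 80,
    "blackboard_modules": 100,
    "exam": 85,
}
total = weighted_total(example)
grade = letter_grade(total)  # 88.0 falls in the 87+ band, so "B+"
```

The weights are kept as integer percentages so the example total comes out exact rather than picking up floating-point noise near a cutoff.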

Schedule

Here's how the semester is likely to play out. Details subject to change.

Dates | Topics | BB Module | Work Due
Sep 9 | Introduction; Big Picture: Parallel Computing | Intro | Read: HTG Ch 1-2
Sep 13, Sep 16 | Threads in Java; Mutexes and Deadlock | Parallel Programs, HDFS |
Sep 20, Sep 23 | Map Reduce | Map-Reduce | Read: HTDG Ch 5, 6, 7; HW1 due
Sep 27, Sep 30 | Map-Reduce in Depth; Combiners, In-Map Combining, Custom Partitioners | M-R Fundamental Techniques | Read: HTDG Ch 5, 6, 7, 8, 16
Oct 4, Oct 7 | Algorithms in Map-Reduce | M-R Basic Algorithms | HW2 Due
Oct 11, Oct 14 | Graph Algorithms; Iterative Algorithms | M-R Graph Algorithms |
Oct 18, Oct 21 | More Algorithms in M-R; Introducing Spark | Advanced Algorithms | HW3 Due
Oct 25, Oct 28 | Data Partitioning; More Spark | Partitioning |
Nov 1, Nov 4 | Data Mining | Data Mining I | HW4 Due
Nov 8 (No class Friday) | Random Forests | Data Mining II; Matrix Multiplication |
Nov 15, Nov 18 | Pig Latin; Exam Review | Pig Latin | HW5 Due
Nov 22 (No class Friday) | EXAM | Exam |
Nov 29, Dec 2 | Databases | SQL Databases |
Dec 6, Dec 9 | NoSQL Databases | HBase, CAP | Project Due
Dec 12 - 16

Data Mining Presentations

Policies

Collaboration & Sharing Code

Grade Challenges

If you want to contest a grade once you've received it, this class uses a variant of the "coach's challenge" system to resolve such challenges. Here are the rules:

Deadline Extensions

Deadline extensions, makeup assignments, and extra credit assignments will not be given on request. Exceptions may be made for major emergencies. Examples of non-emergencies include heavy load in other courses, interviews, and job fairs.