CS 6240 - Parallel Data Processing in MapReduce

Exercise Data

Scala / Spark Links

Overview

Graduate course. This course covers techniques for analyzing very large data sets. We introduce the MapReduce programming model and the core technologies it relies on in practice, such as a distributed file system. Related approaches and technologies from distributed databases and Cloud Computing will also be introduced. Particular emphasis is placed on practical examples and hands-on programming experience. Both plain MapReduce and database-inspired advanced programming models running on top of a MapReduce infrastructure will be used.
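
As a rough illustration of the MapReduce programming model mentioned above, the sketch below counts words in plain, single-process Python. It is not Hadoop code and does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Combine all values collected for one key into a single result."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on an in-memory list of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample input, two tiny "documents".
counts = word_count(["big data big clusters", "data pipelines"])
```

In a real cluster the map and reduce calls would run in parallel on different machines and the grouping step would move data over the network; here everything runs in one process so the data flow is easy to follow.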

Essential Resources

Instructor Info

  • Instructor: Nat Tuck
  • Email: ntuck@ccs.neu.edu
  • Office: 314 WVH
  • Office Hours: TBA
  • TA: TBA

Office Hours

Name | Time | Room | Email
Nat Tuck | Mondays, 3:00-4:00 PM | WVH 314 | ntuck@ccs
Rundong Li | Wednesdays, 3:00-4:00 PM | WVH Room 472 | rundong@ccs
Pooja Chitrakar | Thursdays, 6:30-7:30 PM | CCIS Lab | chitrap@ccs
Nikite Gulve | Fridays, 12:00-1:00 PM | WVH Main Lab | nik2709@ccs.neu.edu

Policies

  • There are no deadline extensions or make-up assignments/exams unless you have a major emergency with appropriate documentation.
  • You may not share homework solutions with others or copy anyone else’s homework, in whole or in part. We will check for originality during grading. Violations will be reported to both the Office of Student Conduct and Conflict Resolution (OSCCR) and the college, and will likely result in an F for the course.

Class Split

This is a "partially flipped" class. In most weeks you need to attend only one lecture; you will be randomly assigned to either the Tuesday or the Friday session. The "Split?" column of the schedule shows which weeks are split.

Schedule

Week # | Dates | Topics | Assignments Due | Split?
1 | Sep 11 | Course Intro | - | No
2 | Sep 15, 18 | Parallel Processing | - | Yes
3 | Sep 22, 25 | Map-Reduce Overview | HW1 | Yes
4 | Sep 29, Oct 2 | Fundamental Techniques | - | Yes
5 | Oct 6, 9 | Basic Algorithms | HW2 | Yes
6 | Oct 13, 16 | Applications of Basic Algorithms | - | Yes
7 | Oct 20, 23 | Pig | Project Proposal | Yes
8 | Oct 27, 30 | Databases | - | Yes
9 | Nov 3, 6 | CAP Theorem, HBase & Hive | HW3 | Yes
10 | Nov 10, 13 | Midterm Exam | - | No
11 | Nov 17, 20 | Graph Algorithms | HW4 | Yes
12 | Nov 24 | Intelligent Partitioning | - | No
13 | Dec 1, 4 | Data Mining, Spark | Final Project | Yes
14 | Dec 8, 11 | Project Presentations | - | No

Pre-reqs

CS 5800 or CS 7800, or consent of instructor

Grading

Blackboard Modules 10%
Participation 20%
Mid-term Exam 20%
Homeworks 20%
Project Proposal 10%
Project 20%
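
To see how the weights above combine, here is a small sketch that computes a weighted course total. It assumes each component is scored from 0 to 100 and that components are combined linearly; the syllabus does not spell out either detail, and the example scores are made up.

```python
# Weights in percent, copied from the grading table above.
WEIGHTS = {
    "Blackboard Modules": 10,
    "Participation": 20,
    "Mid-term Exam": 20,
    "Homeworks": 20,
    "Project Proposal": 10,
    "Project": 20,
}


def weighted_total(scores):
    """Return the weighted total for per-component scores on a 0-100 scale."""
    assert sum(WEIGHTS.values()) == 100, "weights must add up to 100%"
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores for one student (assumption, for illustration only).
example_scores = {
    "Blackboard Modules": 100,
    "Participation": 90,
    "Mid-term Exam": 80,
    "Homeworks": 85,
    "Project Proposal": 90,
    "Project": 95,
}
total = weighted_total(example_scores)
```

With these made-up scores the total is 89 out of 100. How a total maps to a letter grade is not stated here.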

Recommended Textbooks & Materials

To gain a deeper understanding of the material covered in this course, we recommend the following books. Most of them are available online, free of charge, to Northeastern University students through Safari Books Online.

  • Hadoop: The Definitive Guide by Tom White
  • MapReduce Design Patterns by Donald Miner and Adam Shook
  • Programming Elastic MapReduce by Kevin Schmidt and Christopher Phillips
  • HBase: The Definitive Guide by Lars George
  • Programming Hive by Edward Capriolo, Dean Wampler, and Jason Rutherglen
  • Hadoop in Practice by Alex Holmes
  • Hadoop in Action by Chuck Lam

For a nice compact summary of MapReduce and some design patterns, read Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, which is available for free at http://www.umiacs.umd.edu/~jimmylin/book.html.

For some topics we will work with research papers or other online resources. One important resource will be the Hadoop API.

Special Accommodations

If the Disability Resource Center has formally approved you for an academic accommodation in this class, please present the instructor with your “Professor Notification Letter” during the first week of the semester, so that we can address your specific needs as early as possible.

Academic Integrity Policy

A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.

For more information, please refer to the Academic Integrity web page.