Management and Processing of Big Data – Level I

Course Title Management and Processing of Big Data – Level I
Course Number 900-058-EQ-01
Platform Linux
Duration 48 hours
Gouvernement du Québec fee (taxes incl.)
General Public fee (taxes incl.)
Schedule Saturdays 9:00 a.m. – 4:00 p.m. (30-minute lunch); Sunday, November 1: 9:30 a.m. – 12:00 p.m.
Dates September 12, 19, 26; October 3, 17, 24, 31; and Sunday, November 1 (no class on October 10)
Prerequisites Comfort in a Linux environment and some programming experience are required (ideally Java)
Target Audience Developers
Instructor Deepak Parameshwarappa
Location Online Format

NB: This is a non-credit course. A certificate is provided to all participants who complete at least 80% of course hours.

Course Description:

This course provides practical, foundation-level training that enables participation in big data projects. Participants are introduced to big data technologies and tools, including MapReduce and Hadoop. They will learn how to install and configure Hadoop in a cluster environment, how to write complex MapReduce programs, and how to analyze big data using Pig and Hive.


Topics Covered in this Course

Introduction to Hadoop

  • What is Big Data?
    • Introduction to Big Data and its challenges
  • Importance of Big Data and Hadoop in industry
  • The various dimensions of Big Data
  • Big Data use cases in domains such as telecom, banking, and social media
  • Traditional systems vs. Big Data tools
  • Scope and future of Big Data in industry

Components of Hadoop: Basic Concepts and HDFS

  • Installation, configuration and setup of Hadoop cluster
  • Working with Hadoop in pseudo-distributed mode
  • Brief introduction to Hadoop and its ecosystem
  • Hadoop cluster Architecture
  • Hadoop Administration and troubleshooting
  • Running basic HDFS commands
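
As a taste of the hands-on HDFS work, a few of the basic commands covered look like this (the paths are illustrative, and a running Hadoop installation is assumed):

```shell
hdfs dfs -mkdir -p /user/student/input       # create a directory in HDFS
hdfs dfs -put data.txt /user/student/input   # copy a local file into HDFS
hdfs dfs -ls /user/student/input             # list the directory's contents
hdfs dfs -cat /user/student/input/data.txt   # print a file stored in HDFS
hdfs dfs -rm -r /user/student/input          # remove the directory recursively
```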

Hadoop MapReduce Framework

  • MapReduce Architecture
  • What is YARN?
    • YARN components
    • YARN Architecture
  • Input splits, HDFS blocks
  • Distributed Cache
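
The map/shuffle/reduce flow that the framework automates can be sketched in plain Python. This is a toy word count, not the Hadoop API; the input lines are made up for illustration:

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs, as a Hadoop Mapper would for each input record.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key, as the framework's shuffle/sort step does.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the counts for each word, as a Reducer would.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big ideas", "data tools"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'tools': 1}
```

In real Hadoop the map and reduce steps run as distributed tasks over HDFS blocks, and the shuffle moves data between nodes; the logic per key, however, is exactly this.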

Apache Sqoop

  1. What is Sqoop? Why do we need it?
  2. Sqoop Architecture
  3. Importing and exporting data from RDBMS
  4. Hands-on Sqoop commands
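
A typical import/export pair of the kind practiced in this module might look like the following (the JDBC URL, database, and table names are hypothetical):

```shell
# Import a MySQL table into HDFS (connection details are made up)
sqoop import \
  --connect jdbc:mysql://dbhost/retail \
  --username student \
  --password-file /user/student/.sqoop.pw \
  --table orders \
  --target-dir /user/student/orders \
  --num-mappers 4

# Export processed results back to an RDBMS table
sqoop export \
  --connect jdbc:mysql://dbhost/retail \
  --username student \
  --password-file /user/student/.sqoop.pw \
  --table orders_summary \
  --export-dir /user/student/orders_summary
```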

Apache Pig – Alternative to MapReduce

  1. Drawbacks of MapReduce
  2. What is Pig? – Architecture Overview
  3. Hands-on Pig scripts
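
A short Pig Latin script of the kind written in the hands-on portion, here a word count (the input path is assumed):

```pig
-- Word count in Pig Latin; the input path is illustrative
lines  = LOAD '/user/student/input/data.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
```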

Pig Use case – Olympic Data analysis using Apache Pig

Apache Hive – Analysis and Data Warehousing tool in Hadoop Ecosystem

  1. What is Hive? – Architecture Overview
  2. Hive history and its components
  3. Hands-on Hive – orderBy, clusterBy, distributeBy and so on
  4. Hive SerDe
  5. Hive – internal vs external tables
  6. Different file formats – Avro, JSON, Parquet, RCFile, ORC
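
To give a flavour of the hands-on Hive work, here is a sketch of an external table over delimited text files already in HDFS, followed by a simple aggregate query (the table and column names are assumptions for illustration):

```sql
-- External table: Hive reads the files in place and does not own the data
CREATE EXTERNAL TABLE athletes (
  name    STRING,
  country STRING,
  sport   STRING,
  medals  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/student/olympics';

-- ORDER BY imposes a total order; DISTRIBUTE BY / SORT BY order per reducer
SELECT country, SUM(medals) AS total_medals
FROM athletes
GROUP BY country
ORDER BY total_medals DESC;
```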

Hive Use case – Olympic Data analysis

Apache Hive Advanced – Optimization

  1. Partitioning and Bucketing
  2. Optimized advanced Joins
  3. Indexes and Views
  4. Analytical and window functions in Hive – rank, dense_rank, lag, lead
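
The optimization and analytics topics above can be illustrated with queries like these (reusing the hypothetical athletes table; all names are illustrative):

```sql
-- Partitioned, bucketed table: partitions prune scans, buckets speed up joins
CREATE TABLE medals_by_year (name STRING, sport STRING, medals INT)
PARTITIONED BY (year INT)
CLUSTERED BY (sport) INTO 8 BUCKETS;

-- Window functions: rank athletes within each sport by medal count
SELECT sport, name, medals,
       rank()       OVER (PARTITION BY sport ORDER BY medals DESC) AS rnk,
       dense_rank() OVER (PARTITION BY sport ORDER BY medals DESC) AS dense_rnk,
       lag(medals)  OVER (PARTITION BY sport ORDER BY medals DESC) AS prev_medals
FROM athletes;
```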

Hive Use case – Retail data analysis, Consumer Complaints analysis

Apache HBase – Introduction to NoSQL Databases

  1. Differences between RDBMS and NoSQL
  2. HBase architecture
  3. HBase – Hands-on
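
The hands-on HBase portion uses the HBase shell; a minimal session might look like this (the table and column-family names are made up):

```shell
# Run inside `hbase shell` against a running HBase instance
create 'users', 'info'                    # table with one column family
put 'users', 'u1', 'info:name', 'Alice'   # write a single cell
get 'users', 'u1'                         # read one row
scan 'users'                              # scan the whole table
disable 'users'                           # required before dropping
drop 'users'
```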