Management and Processing of Big Data – Level II


Course Title Management and Processing of Big Data – Level II
Course Number 900-068-EQ-01
Platform Linux
Duration 45 hours
Gouvernement du Québec fee (taxes incl.) $90
General Public fee (taxes incl.) $812.82
Schedule Saturday: 9 a.m. – 4 p.m.; Sunday, December 6: 9 a.m. – 3:30 p.m. (30-minute lunch)
Dates Saturday: November 7, 14, 21, 28, December 5, 12 & Sunday, December 6
Prerequisites Absolute prerequisite: Management and Processing of Big Data – Level I
Target Audience Developers
Instructor Deepak Parameshwarappa
Location Online format

NB: This is a non-credit course. A certificate is provided to all participants who have completed 80% of course hours.


Course Description

A continuation of the course Management and Processing of Big Data – Level I, this course provides practical, foundation-level training that enables participation in big data projects. Participants will be introduced to big data technology and tools, including MapReduce and Hadoop. They will learn how to install and configure Hadoop in a cluster environment, write complex MapReduce programs, and analyze big data using Pig and Hive.

Topics covered in this course

Introduction to PySpark

  • What is Apache Spark, and why is it needed?
  • Spark vs. Hadoop
    • How does Spark fit into the Hadoop ecosystem?
  • Spark use cases

Brief Introduction to Python for Apache Spark

  • Data types, variables
  • Conditional statements and Loops
  • Python files I/O Functions
  • Numbers, Strings and related operations
  • Lists, Tuples, Dictionaries, Sets and related operations
  • Hands-On
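
As a preview of this module, the sketch below walks through the core Python collection types listed above. The variable names and values are invented for illustration and are not from the course materials.

```python
# Core Python collection types: tuple, list, dict, set.
# All values here are made up for the example.

point = (3, 4)                   # tuple: immutable, fixed-size
scores = [72, 85, 91]            # list: mutable, ordered
scores.append(60)                # lists grow in place
ages = {"ana": 31, "raj": 28}    # dict: key-value lookup
ages["lee"] = 45                 # add a new key
unique = set([1, 2, 2, 3])       # set: duplicates removed automatically

total = sum(scores)              # 72 + 85 + 91 + 60 = 308
print(point, total, ages["lee"], sorted(unique))
```

Running this prints the tuple, the list total, the newly added dict value, and the de-duplicated set in sorted order.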

Functions and OOP in Python

  • Lambda functions
  • Object Oriented Concepts
  • Standard Libraries and Modules used in Python
  • Hands-On
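
The snippet below sketches the two ideas named in this module: lambda functions and object-oriented concepts. The `BankAccount` class and all values are hypothetical examples, not course code.

```python
# Lambda functions: small anonymous functions, often used with filter/map.
square = lambda x: x * x
evens = list(filter(lambda n: n % 2 == 0, range(10)))

# A minimal class illustrating encapsulation: the balance is changed
# only through a method. (Class name and values are invented.)
class BankAccount:
    def __init__(self, owner, balance=0):
        self.owner = owner
        self.balance = balance

    def deposit(self, amount):
        self.balance += amount
        return self.balance

acct = BankAccount("ana")
acct.deposit(100)
acct.deposit(50)
print(square(7), evens, acct.balance)
```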

Deep Dive into Apache Spark

  • Spark Architecture and its Components
    • Understanding SparkContext, SparkSession, Driver, Stages and Tasks
  • Spark features and characteristics
  • Introduction to the PySpark Shell and how to submit a PySpark Job
  • Installation of Apache Spark on Windows
  • Brief discussion of the Spark Web UI
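
For context on job submission, a typical `spark-submit` invocation looks like the fragment below. The script name, input path, and job name are hypothetical; running it requires a local Spark installation with `spark-submit` on the PATH.

```shell
# Hypothetical example of submitting a PySpark job.
# --master local[4] runs the job locally with 4 worker threads;
# my_job.py and input.txt are placeholder names for this sketch.
spark-submit --master "local[4]" --name wordcount-demo my_job.py input.txt
```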

Spark RDDs

  • What are RDDs? Features and characteristics
  • RDDs Transformations and Actions
  • Key-Value Pair RDDs and operations
  • Brief discussion on RDD Lineage
  • RDD Partitioning
    • Coalesce
    • Repartition
  • Hands-On
    • Read and write data using RDDs
    • Use cases on map, flatMap, reduceByKey, aggregateByKey, groupByKey, fold, sortByKey
    • UBER data analysis
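
Since PySpark may not be installed on every machine, the plain-Python sketch below models the semantics of the word-count pipeline from the hands-on: flatMap turns each line into many words, map pairs each word with 1, and reduceByKey merges the values for each key with an associative function. The sample lines are invented.

```python
# Plain-Python model of the classic RDD word count.
from collections import defaultdict

lines = ["spark makes big data simple", "big data big insights"]

# flatMap: one input line -> many output words
words = [w for line in lines for w in line.split()]

# map: word -> (word, 1)
pairs = [(w, 1) for w in words]

# reduceByKey: combine values per key with an associative function (+)
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))
```

In PySpark the same chain is written against an RDD: `rdd.flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`.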

DataFrames and Spark SQL

  • Drawbacks of RDDs
  • What are DataFrames and Spark SQL, and why are they needed?
  • Spark SQL Architecture and Catalyst optimizers
  • A tale of three Spark APIs: RDDs vs. Datasets vs. DataFrames
  • Interoperating with RDDs
  • Brief overview of different file formats
    • Avro
    • Parquet
    • JSON
    • CSV
    • RC and ORC
  • Partitioning and Bucketing in DataFrames
  • Advanced Analytical Windowing functions
  • Performance tuning in Spark
  • Spark – Hive integration
  • Hands-On
    • Stock market data analysis
    • UBER data analysis
    • Aviation data analysis
    • Using complex joins for analysis
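
To illustrate the analytical windowing functions covered in this module: Spark SQL uses the standard SQL `OVER (PARTITION BY ... ORDER BY ...)` syntax. Since Spark may not be installed locally, the sketch below runs the same kind of query in SQLite (stdlib `sqlite3`; window functions require SQLite 3.25+). The stock quotes are invented sample data; in Spark the same query text could be passed to `spark.sql(...)`.

```python
# Running maximum of a stock's closing price per symbol,
# computed with a SQL window function over in-memory sample data.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE quotes (symbol TEXT, day INTEGER, close REAL)")
con.executemany(
    "INSERT INTO quotes VALUES (?, ?, ?)",
    [("AAA", 1, 10.0), ("AAA", 2, 12.0), ("AAA", 3, 11.0),
     ("BBB", 1, 20.0), ("BBB", 2, 18.0)],
)

rows = con.execute("""
    SELECT symbol, day,
           MAX(close) OVER (PARTITION BY symbol ORDER BY day) AS running_max
    FROM quotes
    ORDER BY symbol, day
""").fetchall()

for r in rows:
    print(r)
```

Each symbol's window is processed independently, so AAA's running maximum is unaffected by BBB's prices.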