Course Outline

About Spark


Prerequisites

  • Basic knowledge of cluster computing systems
  • Working knowledge of Scala, Java, Python, or R


Duration

  • 2-5 days

Lab Setup

  • Hardware Configuration
    • A minimum of 20 GB of free disk space
    • Ensure that all participants have a properly functioning Internet connection
  • Software Configuration
    • Ubuntu 16.04 Desktop/Server edition

Course Outline for Spark

Big Data
  • Dealing with web-scale data
  • How big is big data?
  • Where is the data?
Map/Reduce Algorithms
  • The traditional parallel programming paradigm
  • Issues with traditional parallel programming
  • Introduction to Map/Reduce
  • Thinking in Map/Reduce
  • Map-Only
  • Sorting
  • Inverted Indexes
  • Counting and Summing
  • Filtering
  • Trying out a few examples
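The word-count problem is the classic illustration of thinking in Map/Reduce. The sketch below uses plain Python (no Hadoop or Spark assumed) to show the two phases: a map phase that emits (key, value) pairs and a reduce phase that groups by key and aggregates.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle + Reduce: group the pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

print(reduce_phase(map_phase(["to be or not to be", "not to be"])))
```

In a real cluster the map tasks run on different machines and the shuffle moves pairs with the same key to the same reducer; the logic per phase is the same.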
Introduction to Spark
  • What is Spark?
  • Differences from Pig and Hive
  • Installing Spark
  • Where to use Spark
  • Linking with Spark
  • Using the Shell
Resilient Distributed Datasets (RDDs)
  • Parallelized Collections
  • External Datasets
  • RDD Operations
    • Basics
    • Passing functions to Spark
    • Closures
    • Working with Key-Value Pairs
    • Transformations
    • Actions
    • Shuffle operations
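A key point in the transformations/actions distinction is that transformations are lazy: they only describe a computation, and nothing runs until an action forces evaluation. Python's built-in `map` and `filter` behave the same way, so this plain-Python sketch (no Spark installation assumed) mirrors the idea without a cluster:

```python
# Lazy pipeline, analogous to chained RDD transformations.
nums = range(1, 6)
squares = map(lambda x: x * x, nums)           # like a transformation: lazy
evens = filter(lambda x: x % 2 == 0, squares)  # still lazy, nothing computed
total = sum(evens)                             # like an action: triggers work
print(total)
```

In PySpark the equivalent would be `rdd.map(...).filter(...)` followed by an action such as `reduce` or `collect`; until the action is called, no job is launched.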
RDD Persistence
  • Which Storage Level to Choose?
  • Removing Data
Shared Variables
  • Broadcast Variables
  • Accumulators
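Accumulators are write-only from the tasks' point of view: workers may only add to them, and only the driver reads the final value. The toy class below (the name and implementation are illustrative, not Spark's API) captures that contract in plain Python:

```python
class ToyAccumulator:
    """Illustrative stand-in for a Spark accumulator: tasks call add(),
    only the driver reads .value. (This class is hypothetical.)"""
    def __init__(self, initial=0):
        self._value = initial

    def add(self, n):
        self._value += n

    @property
    def value(self):
        return self._value

# Typical use: counting bad records as a side effect of processing.
blank_lines = ToyAccumulator()
for line in ["alpha", "", "beta", "", ""]:
    if not line:
        blank_lines.add(1)
print(blank_lines.value)
```

In Spark the addition happens inside tasks on the executors and the values are merged back to the driver, which is why only associative, commutative additions are safe.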
Deploying to a Cluster
  • Launching Applications with spark-submit
  • Advanced Dependency Management
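A typical `spark-submit` invocation might look like the command-line sketch below; the master URL, file names, and application arguments are placeholders to be adapted to the classroom cluster:

```shell
# Submit a Python application to a standalone cluster (values illustrative).
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode client \
  --py-files deps.zip \
  my_app.py arg1 arg2
```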
Launching Spark jobs from Java/Scala
  • Spark Launcher
  • SparkAppHandle
Using Spark with Different Languages
  • Scala
  • Java
  • Python
  • R
PySpark and Jupyter
  • Introduction to PySpark
  • Using the Jupyter notebook
  • Using the PySpark shell
  • PySpark in Jupyter
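One common way to run the PySpark shell inside a Jupyter notebook is to point PySpark's driver at Jupyter via environment variables before launching; the exact paths depend on the participant's installation:

```shell
# Launch PySpark with Jupyter as the driver front end (illustrative setup).
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
pyspark
```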

The classroom training will be provided in Bangalore (Bengaluru), Chennai, Hyderabad, or Mumbai and will be conducted on the client's premises. All necessary hardware and software infrastructure must be provided by the client.