Course outline for Spark

Technology: Spark
Duration*: 2-5 days
Level: Prerequisites listed

Prerequisites

  • Basic knowledge of cluster computing systems
  • Working knowledge of Scala, Java, Python or R

Lab Setup

  • Hardware Configuration
    • A minimum of 20GB of disk space and at least
    • Ensure that all participants have a properly functioning Internet connection
  • Software Configuration
    • Ubuntu 20.04/22.04 Desktop/Server edition

How we train

Online training for Spark

  • Instructor-led live cohorts
  • Self-paced learning with expert coaches
  • 24x7 cloud labs with end-to-end examples

All sessions are 100% hands-on. Labs and activities are derived from real-world work our engineers deliver.

Classroom training

Available for corporate teams in:

  • Bengaluru
  • Chennai
  • Hyderabad
  • Mumbai
  • Delhi/Gurgaon/NCR
  • Pune

Note: Classroom training is for corporate clients only.

Self-paced hands-on sessions are delivered via VirtualCoach.

Detailed Course Outline

Hands-on

Big Data

  • Dealing with web-scale data
  • How big is big data?
  • Where is the data?

Map/Reduce Algorithms

  • Map-Only
  • Sorting
  • Inverted Indexes
  • Counting and Summing
  • Filtering
  • Trying out a few examples
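
As a taste of the patterns listed above, here is a minimal sketch of "Counting and Summing" and "Inverted Indexes" written in plain Python, with no Spark or Hadoop involved. The helper names (`run_job`, `shuffle`, and the mapper/reducer functions) and the sample documents are hypothetical, chosen only for illustration; a real framework would distribute these phases across a cluster.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def run_job(records, mapper, reducer):
    """Run map, shuffle, and reduce in sequence on a single machine."""
    groups = shuffle(map_phase(records, mapper))
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical sample input: (document id, text) pairs.
docs = [
    (1, "spark makes big data simple"),
    (2, "big data needs big clusters"),
]


def count_mapper(record):
    """Counting and summing: emit (word, 1) for every word occurrence."""
    _, text = record
    for word in text.split():
        yield word, 1


def count_reducer(_key, values):
    return sum(values)


def index_mapper(record):
    """Inverted index: emit (word, doc_id) once per distinct word."""
    doc_id, text = record
    for word in set(text.split()):
        yield word, doc_id


def index_reducer(_key, values):
    return sorted(values)


word_counts = run_job(docs, count_mapper, count_reducer)
inverted_index = run_job(docs, index_mapper, index_reducer)

print(word_counts["big"])      # 3
print(inverted_index["data"])  # [1, 2]
```

Only the mapper and reducer change between the two jobs; the map, shuffle, and reduce plumbing stays the same.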

Map/Reduce

  • The traditional parallel programming paradigm
  • Issues with traditional parallel programming
  • Introduction to Map/Reduce
  • Thinking in Map/Reduce

Introduction to Spark

  • What is Spark?
  • Differences from Pig and Hive
  • Installing Spark
  • Where to use Spark
  • Linking with Spark
  • Using the Shell

Resilient Distributed Datasets (RDDs)

  • Parallelized Collections
  • External Datasets
  • RDD Operations
    • Basics
    • Passing functions to Spark
    • Closures
    • Working with Key-Value Pairs
    • Transformations
    • Actions
    • Shuffle operations

RDD Persistence

  • Which Storage Level to Choose?
  • Removing Data

Shared Variables

  • Broadcast Variables
  • Accumulators

Deploying to a Cluster

  • Launching Applications with spark-submit
  • Advanced Dependency Management

Launching Spark jobs from Java/Scala

  • Spark Launcher
  • Spark App handle

Using Spark with Different Languages

  • Scala
  • Java
  • Python
  • R

PySpark and Jupyter

  • Introduction to PySpark
  • Using Jupyter Notebook
  • Using the PySpark shell
  • PySpark in Jupyter