Course outline for Spark

Pre-requisites for learning Spark

  • Basic knowledge of cluster computing systems
  • Working knowledge of Scala, Java, Python, or R

Lab Setup

  • Hardware Configuration
    • A minimum of 20GB of free disk space
    • A working Internet connection for every participant
  • Software Configuration
    • Ubuntu 20.04/22.04 Desktop/Server edition

Duration

  • 2-5 days

Training Mode

Online training for Spark

We provide:

  • Instructor-led live training
  • Self-paced learning with access to expert coaches
  • 24x7 access to cloud labs with end-to-end working examples

All jnaapti sessions are 100% hands-on. All our instructors are engineers at heart. Activities are derived from real-life problems faced by our expert faculty. Self-paced hands-on sessions are delivered via Virtual Coach.

Classroom training for Spark

Classroom sessions are conducted at client locations in:

  • Bengaluru
  • Chennai
  • Hyderabad
  • Mumbai
  • Delhi/Gurgaon/NCR

Note: Classroom training is for corporate clients only

Detailed Course Outline for Spark

Big Data

  • Dealing with web-scale data
  • How big is big data?
  • Where is the data?

Map/Reduce Algorithms

  • Map-Only
  • Sorting
  • Inverted Indexes
  • Counting and Summing
  • Filtering
  • Trying out a few examples (see the sketch after this list)
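
As a warm-up, here is a minimal local sketch of two of these patterns, counting/summing (the classic word count) and filtering, using plain Scala collections; the sample data and names are ours, but the same map/group/reduce shape carries over to a distributed runtime:

    object MapReducePatterns {
      def main(args: Array[String]): Unit = {
        val lines = Seq("to be or not to be", "to see or not to see")

        // Counting and summing: the classic word count.
        // Map: emit (word, 1) pairs; reduce: sum the counts per key.
        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .groupBy(_._1)
          .map { case (word, ones) => (word, ones.map(_._2).sum) }

        // Filtering: a map-only pattern that keeps matching records.
        val shortWords = lines.flatMap(_.split("\\s+")).filter(_.length <= 2)

        println(counts)     // e.g. Map(to -> 4, be -> 2, or -> 2, ...)
        println(shortWords) // List(to, be, or, to, be, to, or, to)
      }
    }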

Map/Reduce

  • The traditional parallel programming paradigm
  • Issues with traditional parallel programming
  • Introduction to Map/Reduce
  • Thinking in Map/Reduce

Introduction to Spark

  • What is Spark?
  • Differences from Pig and Hive
  • Installing Spark
  • Where to use Spark
  • Linking with Spark
  • Using the Shell
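
For linking and shell use, a minimal sketch of a standalone Scala application (assumes the spark-sql artifact is on the classpath); in spark-shell these objects are pre-created for you as spark and sc:

    import org.apache.spark.sql.SparkSession

    object SparkIntro {
      def main(args: Array[String]): Unit = {
        // spark-shell creates `spark` and `sc`; applications build their own.
        val spark = SparkSession.builder()
          .appName("SparkIntro")
          .master("local[*]") // run locally, one worker thread per CPU core
          .getOrCreate()
        val sc = spark.sparkContext

        println(s"Running Spark ${spark.version}")
        spark.stop()
      }
    }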

Resilient Distributed Datasets (RDDs)

  • Parallelized Collections
  • External Datasets
  • RDD Operations
    • Basics
    • Passing functions to Spark
    • Closures
    • Working with Key-Value Pairs
    • Transformations
    • Actions
    • Shuffle operations
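
A short sketch tying these pieces together; the app name and sample data are illustrative:

    import org.apache.spark.sql.SparkSession

    object RddBasics {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("RddBasics").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        // Parallelized collection: distribute a local range as an RDD.
        val nums = sc.parallelize(1 to 10)

        // Transformations (map, filter) are lazy; nothing runs yet.
        val evenSquares = nums.map(n => n * n).filter(_ % 2 == 0)

        // Actions (collect, reduce) trigger the actual computation.
        println(evenSquares.collect().mkString(", ")) // 4, 16, 36, 64, 100
        println(evenSquares.reduce(_ + _))            // 220

        // Key-value pairs: reduceByKey combines values per key via a shuffle.
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
        println(pairs.reduceByKey(_ + _).collect().toMap) // Map(a -> 4, b -> 2)

        spark.stop()
      }
    }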

RDD Persistence

  • Which Storage Level to Choose?
  • Removing Data
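
A sketch of persisting a derived RDD; the input path is a placeholder:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object PersistenceDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("PersistenceDemo").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        val logs = sc.textFile("data/logs.txt") // placeholder path
        val errors = logs.filter(_.contains("ERROR"))

        // MEMORY_ONLY (the cache() default) recomputes partitions that do not
        // fit in memory; MEMORY_AND_DISK spills them to disk instead.
        errors.persist(StorageLevel.MEMORY_AND_DISK)

        println(errors.count()) // first action materializes and caches the RDD
        println(errors.count()) // later actions reuse the cached partitions

        errors.unpersist() // removing data: drop it from the cache explicitly
        spark.stop()
      }
    }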

Shared Variables

  • Broadcast Variables
  • Accumulators
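
A sketch combining both kinds of shared variable; the lookup table and input are illustrative:

    import org.apache.spark.sql.SparkSession

    object SharedVariablesDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("SharedVariablesDemo").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        // Broadcast: ship a read-only lookup table to each executor once.
        val countries = sc.broadcast(Map("in" -> "India", "us" -> "United States"))

        // Accumulator: tasks add to it; only the driver reads the result.
        val misses = sc.longAccumulator("lookup misses")

        val names = sc.parallelize(Seq("in", "us", "uk")).map { code =>
          countries.value.getOrElse(code, { misses.add(1); "unknown" })
        }

        println(names.collect().mkString(", ")) // India, United States, unknown
        println(s"misses: ${misses.value}")     // 1 (read after the action runs)

        spark.stop()
      }
    }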

Deploying to a Cluster

  • Launching Applications with spark-submit
  • Advanced Dependency Management
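
A representative spark-submit invocation; the main class, master URL, and jar path below are placeholders:

    # Class name, cluster URL, and jar path are placeholders.
    spark-submit \
      --class com.example.WordCount \
      --master spark://host:7077 \
      --deploy-mode cluster \
      --conf spark.executor.memory=2g \
      path/to/app.jar arg1 arg2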

Launching Spark jobs from Java/Scala

  • SparkLauncher
  • SparkAppHandle
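
A sketch of launching a job programmatically from Scala (assumes the spark-launcher artifact is on the classpath and SPARK_HOME points at a Spark installation; paths and class names are placeholders):

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    object LaunchFromScala {
      def main(args: Array[String]): Unit = {
        val handle: SparkAppHandle = new SparkLauncher()
          .setAppResource("path/to/app.jar")     // placeholder jar
          .setMainClass("com.example.WordCount") // placeholder class
          .setMaster("local[*]")
          .setConf(SparkLauncher.EXECUTOR_MEMORY, "2g")
          .startApplication()

        // Poll the handle until the application reaches a terminal state.
        while (!handle.getState.isFinal) Thread.sleep(1000)
        println(s"Final state: ${handle.getState}")
      }
    }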

Using Spark with Different Languages

  • Scala
  • Java
  • Python
  • R

PySpark and Jupyter

  • Introduction to PySpark
  • Using the Jupyter notebook
  • Using the PySpark shell
  • PySpark in Jupyter
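
One common recipe for opening the PySpark shell inside Jupyter is to point the PySpark driver at Jupyter via environment variables (assumes Spark and Jupyter are already installed):

    # Notebooks then start with a SparkContext, as in the PySpark shell.
    export PYSPARK_DRIVER_PYTHON=jupyter
    export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
    pyspark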