Course Outline


Notes

The course is targeted at application developers who are currently evaluating the use of Hadoop in their projects or intend to work with data at scale. The course also covers some aspects of Hadoop administration, including an overview of Hadoop cluster setups.

Training Duration

5 days

Participant Skill Pre-Requisites

  • Good knowledge of Java and Eclipse
  • Must be comfortable in a Linux environment
  • Knowledge of web-scale data challenges is preferred

Lab Setup

Hardware pre-requisites

  • A minimum of 20 GB of disk space and at least 4 GB of RAM
  • Ensure that all participants have a properly functioning Internet connection

Software pre-requisites

  • Ubuntu 16.04 Desktop Edition
  • JDK 1.8
  • Eclipse Oxygen Java or JEE Edition

Course Outline for Hadoop, HBase, Pig, Hive

Overview
  • Dealing with web-scale data
  • How big is big data?
  • Where is the data?
  • A few use-cases for Hadoop
  • When to and when not to use Hadoop
Introduction to the Hadoop Ecosystem
  • Hadoop architectural overview
  • Companies and products related to Hadoop
  • Hadoop Sub-projects
  • Downloading and installing Hadoop
Hadoop "Hello World"
  • Setting up a single-node cluster
  • Running a simple example
  • Using Eclipse to develop Hadoop programs
  • Executing the provided examples
Beyond a Single System
  • Setting up a 2-node cluster
  • Setting up multi-node clusters
HDFS
  • What is a distributed filesystem?
  • Introduction to Hadoop DFS
  • HDFS Concepts – Blocks, NameNodes and DataNodes
  • Configuring HDFS
  • Interacting with HDFS
  • Using the HDFS web interface
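
To give a feel for the "Interacting with HDFS" topic above, here is a minimal Java sketch using the org.apache.hadoop.fs.FileSystem API. The fs.defaultFS URI and the paths are illustrative assumptions; in the lab environment these values would normally come from the cluster's core-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsTour {
      public static void main(String[] args) throws Exception {
        // Normally picked up from core-site.xml; the URI below is an
        // assumption for a local single-node cluster.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);

        // Create a directory and copy a local file into it (paths are illustrative)
        Path dir = new Path("/user/student/demo");
        fs.mkdirs(dir);
        fs.copyFromLocalFile(new Path("/tmp/sample.txt"), new Path(dir, "sample.txt"));

        // List the directory contents, much like `hdfs dfs -ls /user/student/demo`
        for (FileStatus status : fs.listStatus(dir)) {
          System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        fs.close();
      }
    }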
Map/Reduce
  • The traditional parallel programming paradigm
  • Issues with traditional parallel programming
  • Introduction to Map/Reduce
  • Thinking in Map/Reduce
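
The canonical exercise for learning to "think in Map/Reduce" is word counting: the mapper emits a (word, 1) pair for every token, and the reducer sums the counts per word. The sketch below uses the standard org.apache.hadoop.mapreduce API; the input and output paths are taken from the command line and are illustrative. Packaged into a jar, it would typically be submitted with the hadoop jar command.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every token in an input line
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer (also used as combiner): sums the counts for each word
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }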
Map/Reduce Algorithms
  • Map-Only
  • Sorting
  • Inverted Indexes
  • Counting and Summing
  • Filtering
  • Trying out a few examples
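
As a concrete instance of the Map-Only and Filtering patterns listed above, the sketch below keeps only the input lines containing the marker string "ERROR" (the marker is an assumption) and writes them out unchanged. Setting the number of reduce tasks to zero removes the shuffle and reduce phases entirely.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ErrorLineFilter {

      // Map-only job: keep lines containing "ERROR", drop everything else
      public static class FilterMapper extends Mapper<Object, Text, Text, NullWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          if (value.toString().contains("ERROR")) {
            context.write(value, NullWritable.get());
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "error line filter");
        job.setJarByClass(ErrorLineFilter.class);
        job.setMapperClass(FilterMapper.class);
        job.setNumReduceTasks(0);                 // map-only: no shuffle, no reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }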
Using Hadoop Commands
  • Commands overview
  • User commands
  • Admin commands
Hadoop Best Practices and Use Cases
  • A look at some high-level use cases and how Hadoop and Map/Reduce can be used to solve them
  • Twitter Stream Analysis
Hadoop Cluster Setup
  • SSH Configuration
  • Hadoop Configuration
  • Verifying the cluster setup
  • Using Hadoop via Amazon EMR
Pig
  • Why Pig?
  • Installing Pig
  • Running Pig locally
  • Running Pig on a Hadoop Cluster
  • The Pig Console (grunt)
  • The Pig Data Model
  • Pig Latin – Input and Output, Relational Operations and User-Defined Functions
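
Although the focus is on Pig Latin itself, Pig scripts can also be embedded in Java via the PigServer class, which ties this module back to the Java-centric labs. The sketch below runs a small group-and-count pipeline in local mode; the input path and the schema are assumptions for illustration, and ExecType.MAPREDUCE would run the same pipeline on the cluster.

    import java.util.Iterator;

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;
    import org.apache.pig.data.Tuple;

    public class EmbeddedPigExample {
      public static void main(String[] args) throws Exception {
        // Local mode for experimentation; ExecType.MAPREDUCE runs on the cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Input path and schema are illustrative assumptions
        pig.registerQuery("logs = LOAD '/tmp/access.log' USING PigStorage(' ') "
            + "AS (ip:chararray, ts:chararray, url:chararray);");
        pig.registerQuery("by_ip = GROUP logs BY ip;");
        pig.registerQuery("hits = FOREACH by_ip GENERATE group AS ip, COUNT(logs) AS n;");

        // Pull the results back into the Java program
        Iterator<Tuple> it = pig.openIterator("hits");
        while (it.hasNext()) {
          System.out.println(it.next());
        }

        pig.shutdown();
      }
    }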
Hive
  • What is Hive?
  • Installation
  • Configuration
  • Data Definition Language – Tables, Views, Indexes
  • Data Manipulation Language
  • Handling Joins in Hive
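
Because the labs are Java based, one convenient way to exercise Hive's DDL and DML from code is the HiveServer2 JDBC driver, sketched below. The connection URL, user name and table definition are assumptions; the same statements can equally be run from the Hive CLI or Beeline.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
      public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port, user and table are assumptions
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "student", "");
             Statement stmt = conn.createStatement()) {

          // DDL: a simple managed table
          stmt.execute("CREATE TABLE IF NOT EXISTS pageviews "
              + "(ip STRING, url STRING, ts STRING) "
              + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

          // DML/query: Hive compiles this into MapReduce jobs behind the scenes
          ResultSet rs = stmt.executeQuery(
              "SELECT url, COUNT(*) AS hits FROM pageviews GROUP BY url");
          while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
          }
        }
      }
    }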
HBase
  • Installation of HBase
  • Running HBase
  • Map/Reduce integration
  • Understanding the HBase architecture
  • Cluster setup
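
A first read/write against a running HBase instance using the Java client API might look like the sketch below. The table name "users", the column family "info" and the cell values are assumptions; the table is expected to exist already (for example, created from the HBase shell).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClientExample {
      public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath; table and column family
        // names below are assumptions and the table must already exist
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

          // Write one cell: row key "user1", column family "info", qualifier "name"
          Put put = new Put(Bytes.toBytes("user1"));
          put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
          table.put(put);

          // Read the same cell back
          Result result = table.get(new Get(Bytes.toBytes("user1")));
          byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
          System.out.println("name = " + Bytes.toString(name));
        }
      }
    }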

The classroom training will be provided in Bangalore (Bengaluru), Chennai, Hyderabad or Mumbai and will be conducted on the client's premises. All necessary hardware and software infrastructure must be provided by the client.