Introduction to Linux and Big Data Virtual Machine (VM)

Introduction to and installation of VirtualBox and the Big Data VM; introduction to Linux.

    • Why Linux?
    • Windows and the Linux equivalents
    • Different flavors of Linux
    • Unity Shell (Ubuntu UI)
    • Basic Linux Commands (enough to get started with Hadoop)

Understanding Big Data

    • 3V (Volume-Variety-Velocity) characteristics
    • Structured and Unstructured Data
    • Application and use cases of Big Data

Limitations of traditional large-scale systems
How a distributed approach to computing is superior (cost and scale)
Opportunities and challenges with Big Data
HDFS (The Hadoop Distributed File System)
HDFS Overview and Architecture

  • Deployment Architecture
  • Name Node, Data Node and Checkpoint Node (aka Secondary Name Node)
  • Safe mode
  • Configuration files
  • HDFS Data Flows (Read vs Write)

How HDFS addresses fault tolerance

  • CRC checksums
  • Data replication
  • Rack awareness and Block placement policy
  • Small files problem
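HDFS computes a CRC32 checksum for each chunk of data (512 bytes by default) and verifies it on read; a corrupted replica is re-fetched from another DataNode. A minimal Python sketch of the idea, illustrative of the mechanism rather than HDFS's actual on-disk format:

```python
import zlib

def checksum(data: bytes) -> int:
    """Compute a CRC32 checksum, as HDFS does per fixed-size chunk."""
    return zlib.crc32(data) & 0xFFFFFFFF

# Writer stores data alongside its checksum.
block = b"hello hdfs"
stored_crc = checksum(block)

# Reader recomputes the checksum; a mismatch signals corruption,
# and HDFS would then read the block from another replica.
assert checksum(block) == stored_crc       # intact copy
corrupted = b"hellx hdfs"
assert checksum(corrupted) != stored_crc   # corruption detected
```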

HDFS Interfaces

  • Command Line Interface
  • File System
  • Administrative
  • Web Interface

Advanced HDFS features

  • Load Balancer
  • DistCp
  • HDFS Federation
  • HDFS High Availability

MapReduce – 1 (Theoretical Concepts)
MapReduce overview

  • Functional Programming paradigms
  • How to think the MapReduce way

MapReduce Architecture

  • Legacy MR vs Next Generation MapReduce (aka YARN/MRv2)
  • Slots vs Containers
  • Schedulers
  • Shuffling, Sorting
  • Hadoop Data Types
  • Input and Output Formats
  • Input Splits
  • Partitioning (Hash Partitioner vs Custom Partitioner)
  • Configuration files
  • Distributed Cache
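Partitioning decides which reducer receives each key. A small Python sketch contrasting the default hash scheme with a custom range-style partitioner; the toy hash function is an illustration, not Hadoop's actual hashCode:

```python
def hash_partition(key: str, num_reducers: int) -> int:
    """Mimics Hadoop's HashPartitioner: the same key always lands on
    the same reducer, so all values for a key meet in one reduce task."""
    # A stable toy hash (Python's built-in hash() is salted per process).
    h = sum(ord(c) for c in key)
    return h % num_reducers

def range_partition(key: str, num_reducers: int) -> int:
    """A custom partitioner: route keys by first letter, e.g. so the
    concatenated reducer outputs are globally sorted."""
    bucket = (ord(key[0].lower()) - ord("a")) * num_reducers // 26
    return min(max(bucket, 0), num_reducers - 1)

assert hash_partition("apple", 4) == hash_partition("apple", 4)
assert range_partition("apple", 2) == 0   # a-m go to reducer 0
assert range_partition("zebra", 2) == 1   # n-z go to reducer 1
```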

MR Algorithm and Data Flow

  • Word Count
  • Indexing
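The word-count data flow can be simulated in plain Python: map emits (word, 1) pairs, the framework's shuffle-and-sort groups them by key, and reduce sums each group. A single-process sketch of that flow:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map: emit a (word, 1) pair for every word.
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce: sum all the counts for one word.
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase
pairs = [kv for line in lines for kv in mapper(line)]
# Shuffle & sort: group pairs by key (the framework does this)
pairs.sort(key=itemgetter(0))
# Reduce phase
result = dict(reducer(w, (c for _, c in grp))
              for w, grp in groupby(pairs, key=itemgetter(0)))

assert result["the"] == 3 and result["fox"] == 2 and result["dog"] == 1
```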

MapReduce – 2 (Practice)
Developing, debugging and deploying MR programs

  • Standalone mode (in Eclipse)
  • Pseudo distributed mode (as in the Big Data VM)
  • Fully distributed mode (as in Production)


  • Old and the new MR API
  • Java Client API
  • Hadoop data types and custom Writables/WritableComparables
  • Different input and output formats

Hadoop Streaming (developing and debugging non-Java MR programs in Ruby and Python)
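With Hadoop Streaming, the mapper and reducer are ordinary scripts that read stdin and write tab-separated key/value lines; the framework sorts the mapper output before feeding the reducer. A Python sketch that simulates the pipeline in-process (the hadoop command at the end is illustrative, with placeholder paths):

```python
from itertools import groupby

def map_stream(lines):
    """mapper.py body: emit 'word<TAB>1' for every word on stdin."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_stream(lines):
    """reducer.py body: input arrives sorted by key, so equal keys
    are adjacent; sum the counts for each key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

# Simulate the framework: map, sort (the shuffle), reduce.
mapped = sorted(map_stream(["to be or", "not to be"]))
counts = dict(line.split("\t") for line in reduce_stream(mapped))
assert counts["to"] == "2" and counts["be"] == "2" and counts["or"] == "1"

# On a real cluster the same scripts would run roughly as:
#   hadoop jar hadoop-streaming.jar \
#     -mapper mapper.py -reducer reducer.py \
#     -input /in -output /out        # paths are illustrative
```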
Optimization techniques

  • Speculative execution
  • Combiners
  • JVM Reuse
  • Compression
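A combiner is a local reduce run on each mapper's output before the shuffle, shrinking the data sent over the network; it is only safe when the reduce function is associative and commutative, as summation is. A small sketch:

```python
from collections import Counter

# Per-mapper output for a word-count job: many (word, 1) pairs.
mapper1 = [("the", 1), ("fox", 1), ("the", 1), ("the", 1)]
mapper2 = [("the", 1), ("fox", 1)]

def combine(pairs):
    # Local reduce: one (word, partial_sum) per distinct word.
    acc = Counter()
    for word, n in pairs:
        acc[word] += n
    return sorted(acc.items())

shuffled = combine(mapper1) + combine(mapper2)  # 4 pairs cross the wire
without = mapper1 + mapper2                     # 6 pairs would otherwise

assert len(shuffled) < len(without)
# The final reducer gets the same answer either way:
assert combine(shuffled) == combine(without) == [("fox", 2), ("the", 4)]
```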

MR algorithms

  • Sorting
  • Term Frequency – Inverse Document Frequency
  • Student Database
  • Max Temperature
  • Different ways of joining data
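On a cluster, TF-IDF is typically computed with a chain of MR jobs (term counts, document lengths, document frequencies); the scoring formula itself fits in a few lines. A single-process Python sketch over made-up documents:

```python
import math

docs = {
    "d1": "hadoop stores big data".split(),
    "d2": "spark processes big data fast".split(),
    "d3": "hadoop and spark".split(),
}

def tf_idf(term, doc_id):
    # TF: relative frequency of the term within this document.
    words = docs[doc_id]
    tf = words.count(term) / len(words)
    # IDF: log of (total documents / documents containing the term).
    df = sum(1 for w in docs.values() if term in w)
    return tf * math.log(len(docs) / df)

# 'big' appears in 2 of 3 docs (low IDF); 'stores' in only 1 (high IDF).
assert tf_idf("big", "d1") < tf_idf("stores", "d1")
assert tf_idf("and", "d1") == 0.0   # term absent from d1: TF is 0
```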

Higher Level Abstractions for MR (Pig)

  • Introduction and Architecture
  • Different Modes of executing Pig constructs
  • Data Types
  • Dynamic invokers
  • Pig streaming
  • Macros
  • Pig Latin language constructs (LOAD, STORE, DUMP, SPLIT, etc.)
  • User Defined Functions
  • Use Cases

Higher Level Abstractions for MR (Hive)

  • Introduction and Architecture
  • Different Modes of executing Hive queries
  • Metastore Implementations
  • HiveQL (DDL & DML Operations)
  • External vs Managed Tables
  • Views
  • Partitions & Buckets
  • User Defined Functions
  • Transformations using non-Java languages
  • Use Cases

Comparison of Pig and Hive
NoSQL Databases – 1 (Theoretical Concepts)
NoSQL Concepts

  • Review of RDBMS
  • Need for NoSQL
  • Brewer's CAP Theorem
  • ACID vs BASE
  • Schema on Read vs. Schema on Write
  • Different levels of consistency
  • Bloom filters
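A Bloom filter answers "possibly present" or "definitely absent" using a bit array and several hash functions: false positives are possible, false negatives are not. HBase, for example, uses them to skip store files that cannot contain a requested row key. A minimal Python sketch (size and hash-count parameters are illustrative):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("row-42")
assert bf.might_contain("row-42")   # an added key is never missed
# "row-99" is *probably* reported absent; a True here would be a
# false positive, whose rate depends on size and hash count.
```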

Different types of NoSQL databases

  • Key Value
  • Columnar
  • Document
  • Graph

Columnar Databases concepts
NoSQL Databases – 2 (Practice)
HBase Architecture

  • Master and the Region Server
  • Catalog tables (ROOT and META)
  • Major and Minor compaction
  • Configuration files
  • HBase vs Cassandra

Interfaces to HBase (for DDL and DML operations)

  • Java API
  • Client API
  • Filters
  • Scan Caching and Batching
  • Command Line Interface

Advanced HBase Features

  • HBase Data Modeling
  • Bulk loading data in HBase
  • HBase Coprocessors – EndPoints (similar to Stored Procedures in RDBMS)
  • HBase Coprocessors – Observers (similar to Triggers in RDBMS)

Setting up a Hadoop Cluster using Apache Hadoop
Brief introduction to what Cloud is and AWS

Cloudera Hadoop cluster on the Amazon Cloud (Practice)

  • Using EMR (Elastic MapReduce)
  • Using EC2 (Elastic Compute Cloud)

SSH Configuration

Standalone mode (Theory)
Distributed mode (Theory)

  • Pseudo distributed
  • Fully distributed

Getting started with Apache Spark

  • Limitations of the MR model and how Spark/RDD addresses them
  • Spark Installation demo
  • Different modes of running Spark
  • What are RDDs?
  • Different transformations and actions on RDDs
  • Integrating Spark with PyCharm
  • Developing Spark programs in PyCharm, the shell, etc.
  • Spark Streaming overview and demo
  • Spark SQL overview and demo
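RDD transformations (map, filter, etc.) are lazy: they only record a lineage, and nothing executes until an action such as collect forces evaluation. The commented lines show the PySpark form (which needs a Spark installation); the runnable part mimics the laziness with plain Python generators:

```python
# PySpark equivalent (requires a Spark installation and a SparkContext sc):
#   rdd = sc.parallelize(range(10))
#   rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()

# Pure-Python analogue of the same pipeline, using generators to
# mimic an RDD's laziness.
data = range(10)                             # like sc.parallelize
squared = (x * x for x in data)              # transformation: map
evens = (x for x in squared if x % 2 == 0)   # transformation: filter
# Nothing has been computed yet; generators are lazy, like RDDs.
result = list(evens)                         # action: collect
assert result == [0, 4, 16, 36, 64]
```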

Hadoop Ecosystem and Use Cases

  • Hadoop industry solutions
  • Importing/exporting data across RDBMS and HDFS using Sqoop
  • Getting real-time events into HDFS using Flume
  • Creating workflows in Oozie
  • Graph processing with Neo4J
  • NoSQL databases Cassandra and Mongo
  • Distributed coordination using ZooKeeper

Proof of concepts and use cases

  • Two projects closely modeled on real-life work
  • Further ideas for data analysis


Hadoop Administrator

Take your knowledge to the next level with Hadoop Training.

This 24-hour instructor-led training course provides system administrators with a comprehensive understanding of all the steps necessary to operate and manage Hadoop clusters. The course covers installing, configuring, load balancing, and tuning your cluster.

Upon completion of the course, attendees are prepared to clear the Hadoop administrator certification from Cloudera or Hortonworks. Certification is a great differentiator: it helps establish individuals as leaders in their field and provides customers with tangible evidence of skills and expertise.


→ Introduction

  • What is Cloud Computing
  • What is Grid Computing
  • What is Virtualization
  • How the above three relate to each other
  • What is Big Data
  • Introduction to Analytics and the need for big data analytics
  • Hadoop Solutions – Big Picture
  • Hadoop distributions
  • Comparing Hadoop vs. traditional systems
  • Volunteer Computing
  • Data Retrieval – Random Access vs. Sequential Access
  • NoSQL Databases

→ The Motivation for Hadoop

  • Problems with traditional large-scale systems
  • Requirements for a new approach

→ Hadoop: Basic Concepts

  • What is Hadoop?
  • The Hadoop Distributed File System
  • How MapReduce Works
  • Anatomy of a Hadoop Cluster

→ Hadoop daemons

  • Namenode
  • Datanode
  • Secondary namenode
  • Job tracker
  • Task tracker

→ HDFS in detail

  • Blocks and Splits
  • Replication
  • Data high availability
  • Data Integrity
  • Cluster architecture and block placement

→ Programming Practices & Performance Tuning

  • Developing MapReduce Programs in
    • Local Mode
    • Pseudo-distributed Mode
    • Fully distributed mode

→ Writing a MapReduce Program

  • Examining a Sample MapReduce Program
  • Basic API Concepts
  • The Driver Code
  • The Mapper
  • The Reducer
  • Hadoop’s Streaming API

→ Setup Hadoop cluster

  • Install and configure Apache Hadoop
  • Make a fully distributed Hadoop cluster on a single laptop/desktop
  • Install and configure Cloudera Hadoop distribution in fully distributed mode
  • Install and configure the Hortonworks Hadoop distribution in fully distributed mode
  • Monitoring the cluster
  • Getting used to the management consoles of Cloudera and Hortonworks

→ Hadoop Security

  • Why Hadoop Security Is Important
  • Hadoop’s Security System Concepts
  • What Kerberos Is and How it Works
  • Configuring Kerberos Security
  • Integrating a Secure Cluster with Other Systems

→ Managing and Scheduling Jobs

  • Managing Running Jobs
  • Hands-On Exercise
  • The FIFO Scheduler
  • The FairScheduler
  • Configuring the FairScheduler
  • Hands-On Exercise

→ Cluster Maintenance

  • Checking HDFS Status
  • Hands-On Exercise
  • Copying Data Between Clusters
  • Adding and Removing Cluster Nodes
  • Rebalancing the Cluster
  • Hands-On Exercise
  • NameNode Metadata Backup

→ Cluster Monitoring and Troubleshooting

  • General System Monitoring
  • Managing Hadoop’s Log Files
  • Using the NameNode and JobTracker Web UIs
  • Hands-On Exercise
  • Cluster Monitoring with Ganglia
  • Common Troubleshooting Issues
  • Benchmarking Your Cluster

Hadoop Ecosystem covered as part of Hadoop Administrator

→ Ecosystem component: Ganglia

  • Install and configure Ganglia on a cluster
  • Configure and use Ganglia
  • Use Ganglia graphs to monitor cluster metrics

→ Ecosystem component: Nagios

  • Nagios concepts
  • Install and configure Nagios on the cluster
  • Use Nagios for sample alerts and monitoring

→ Ecosystem component: Hive

  • Hive concepts
  • Install and configure Hive on the cluster
  • Create a database and access it from the console
  • Develop and run sample applications in Java/Python to access Hive

→ Ecosystem component: Sqoop

  • Install and configure Sqoop on the cluster
  • Import data from Oracle/MySQL into Hive

→ Overview of other Ecosystem components:

  • Oozie, Avro, Thrift, REST, Mahout, Cassandra, YARN, MR2, etc.