
Welcome to Big Data Hadoop

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

Why Hadoop?

  • Capturing data

Hadoop Architecture

  • History of Hadoop – Facebook, Dynamo, Yahoo, Google
  • Hadoop Core
  • YARN architecture, Hadoop 2.0

Hadoop Distributed File System (HDFS)

  • HDFS Clusters – NameNodes, DataNodes & Clients
  • Metadata
  • Web-based Administration


MapReduce

  • Processing & Generating large data sets
  • Map functions
  • Programming MapReduce using SQL / Bash / Python
  • Parallel Processing
  • Failover
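
The module above covers Hadoop Streaming-style MapReduce, where map and reduce functions can be written in Python. A minimal single-machine sketch of the map → shuffle/sort → reduce flow (the `run_job` driver is illustrative; a real job would read stdin and run under the Hadoop Streaming jar):

```python
import itertools

def mapper(line):
    # Map phase: emit a (key, value) pair for every word in the line.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum all counts the shuffle grouped under one key.
    return word, sum(counts)

def run_job(lines):
    # Simulate map -> shuffle/sort -> reduce on a single machine.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(
        reducer(word, (c for _, c in group))
        for word, group in itertools.groupby(pairs, key=lambda kv: kv[0])
    )
```

Running `run_job(["big data big ideas", "data pipelines"])` counts each word across both lines, exactly the classic word-count job, with the sort step standing in for Hadoop's distributed shuffle.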

Data Warehousing with Hive

  • Data Summarisation
  • Ad-hoc queries
  • Analysing large datasets
  • HiveQL (SQL-like Query Language)
  • Integration with SQL databases
  • n-grams analysis
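
To make the n-grams bullet concrete: the computation behind an n-gram frequency query can be sketched in plain Python (this is a local illustration of the idea, not Hive's own estimator, which works approximately at scale):

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(sentences, n, k):
    # Count every n-gram across all sentences and keep the k most
    # frequent, the result an n-grams analysis query reports.
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts.most_common(k)
```

For example, `top_ngrams(["big data", "big data tools"], 2, 1)` reports the bigram `("big", "data")` with a count of 2.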

Parallel Processing with Pig

  • Parallel evaluation
  • Query language interface
  • Relational Algebra

Data Mining with Mahout

  • Clustering
  • Classification
  • Batch-based collaborative filtering
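
Mahout runs algorithms like k-means clustering as distributed jobs; the underlying algorithm itself is simple enough to sketch on one machine (a toy version on 2-D points, purely to show what the clustering module computes):

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    # Plain k-means: assign each point to its nearest centroid, then
    # move each centroid to the mean of its cluster, and repeat.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids
```

On two well-separated groups of points the centroids converge to the group means after a few iterations; Mahout's contribution is running this kind of iteration over data too large for one machine.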

Searching with Elasticsearch

  • Elasticsearch concepts
  • Installation and data import
  • API demonstration and sample queries

Structured Data Storage with HBase

  • Big Data: How big is big?
  • Optimised Real-time read/write access

Cassandra multi-master database

  • The Cassandra Data Model
  • Eventual Consistency
  • When to use Cassandra
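
One concrete way eventual consistency plays out: Cassandra reconciles divergent replicas per column using last-write-wins timestamps. A toy sketch of that reconciliation rule (replica layout and field names here are illustrative):

```python
def merge(replica_a, replica_b):
    # Last-write-wins reconciliation: each replica maps a key to a
    # (value, timestamp) pair; for every key keep whichever write
    # carries the newest timestamp, so all replicas converge.
    merged = dict(replica_a)
    for key, (value, ts) in replica_b.items():
        if key not in merged or ts > merged[key][1]:
            merged[key] = (value, ts)
    return merged
```

Merging is order-insensitive, which is why replicas that exchange writes in any order eventually agree on the same state.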


Redis in-memory key-value store

  • Redis Data Model
  • When to use Redis


MongoDB document database

  • MongoDB data model
  • Installation of MongoDB
  • When to use MongoDB


Messaging with Kafka

  • Kafka architecture
  • Installation
  • Example usage
  • When to use Kafka
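
The heart of Kafka's architecture is the topic partition: an ordered, append-only log of records addressed by offset, with each consumer tracking its own read position. A toy model of that idea (this `Log` class is illustrative, not Kafka's API):

```python
class Log:
    # Toy append-only log, modelling one Kafka topic partition.
    def __init__(self):
        self.records = []

    def produce(self, record):
        # Append a record and return its offset in the partition.
        self.records.append(record)
        return len(self.records) - 1

    def consume(self, offset):
        # Consumers keep their own offset and read from it onward;
        # records are never modified or removed by reading.
        return self.records[offset:]
```

Because reads never remove records, many independent consumers can replay the same partition from different offsets, which is what decouples Kafka's producers from its consumers.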

Lambda Architecture

  • Concept
  • Hadoop + Stream processing integration
  • Architecture examples
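
The core concept above is that queries merge a batch layer's precomputed views over the full event log with a speed layer's incremental view over recent, not-yet-batched events. A minimal sketch of that merge for a running-count view (function names are illustrative):

```python
def batch_view(all_events):
    # Batch layer: recompute the aggregate view over the full,
    # immutable event log (the slow, periodic Hadoop job).
    view = {}
    for key, value in all_events:
        view[key] = view.get(key, 0) + value
    return view

def speed_view(recent_events):
    # Speed layer: the same aggregation, applied only to events
    # that arrived after the last batch run.
    return batch_view(recent_events)

def query(key, batch, speed):
    # Serving layer: merge the (stale) batch view with the
    # real-time delta to answer with up-to-date results.
    return batch.get(key, 0) + speed.get(key, 0)
```

For additive aggregates like counts the merge is a simple sum; the speed layer's view is discarded each time the batch layer catches up.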

Big Data in the Cloud

  • Amazon Web Services
  • Concepts: Pay-per-use model
  • Amazon S3, EC2, EMR
  • Google Cloud Platform
  • Google BigQuery