
Introduction to Apache Hadoop

Updated on Oct 03, 2020

Apache Hadoop is the leading big data platform. It is an open-source, Java-based software framework for reliable, scalable, and distributed computing. Hadoop enables distributed processing of very large data sets across clusters of commodity machines (low-cost hardware) using simple programming models.

Design Paradigm

The Hadoop framework library is designed to detect and handle failures at the application layer, on top of a cluster of computers, each of which may be prone to failure. In this way the software library delivers high availability and resilience without relying on high-end hardware.

It is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Hadoop enables a computing solution that is scalable, flexible, fault-tolerant, and cost-effective. It can process big data ranging in size from gigabytes to petabytes, making it a complete solution for large-scale analytics. Hadoop changed the economics and the dynamics of large-scale computing.

Hadoop Modules

Hadoop is composed of four core components: Hadoop Common, the Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce.

  • Hadoop Common: The library module of common utilities that support the other Hadoop components. It provides file-system and OS-level abstractions.
  • Hadoop Distributed File System (HDFS): A Java-based distributed, scalable, and portable file system that provides reliable storage of diverse data and high-throughput access to application data across all the nodes in a Hadoop cluster. It links the file systems on many local nodes together into a single file system.
  • Hadoop YARN (Yet Another Resource Negotiator): The next-generation framework for job scheduling and cluster resource management. It assigns CPU, memory, and storage to applications running on a Hadoop cluster, and it allows application frameworks other than MapReduce to run on Hadoop, opening up new possibilities.
  • Hadoop MapReduce: A YARN-based framework for writing applications that process large amounts of structured and unstructured data in parallel, on clusters of thousands of machines, in a reliable and fault-tolerant manner. A minimal example follows this list.
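To make the programming model concrete, below is the classic word-count job written against the Hadoop MapReduce Java API. It is a minimal sketch rather than production code: the class name, the job name, and the input and output paths passed on the command line are all illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Compiled into a jar, such a job would typically be submitted with the hadoop jar command, for example hadoop jar wordcount.jar WordCount /input /output (paths hypothetical); YARN then schedules the map and reduce tasks across the cluster.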

In theory, Hadoop can be used for any sort of work that is batch-oriented rather than real-time, is very data-intensive, and benefits from parallel processing of data.

Essentially, it accomplishes two tasks: massive data storage and fast, parallel processing.

  • Distributed Data Storage: HDFS
  • Distributed Data Processing: MapReduce
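As a taste of the storage half, the sketch below writes a small file to HDFS and reads it back through the Hadoop FileSystem Java API. The NameNode address (hdfs://namenode:9000) and the file path are placeholder assumptions; in a real deployment fs.defaultFS is normally picked up from core-site.xml rather than set in code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address; usually supplied by core-site.xml
    conf.set("fs.defaultFS", "hdfs://namenode:9000");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/hello.txt"); // placeholder path

    // Write a file: HDFS splits it into blocks and replicates them across DataNodes
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back through the same single-file-system view of the whole cluster
    try (BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    fs.close();
  }
}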

Benefits of Hadoop

  • Low cost: The open-source framework is free and uses commodity hardware to store and process very large volumes of varied data.
  • Massive storage: The Hadoop framework can store huge amounts of data by breaking it into blocks and storing the blocks on clusters of low-cost commodity hardware.
  • Computing power: Its distributed computing model uses local computation and storage to process large volumes of data quickly. We can increase the processing power simply by adding computing nodes to the cluster.
  • Scalability: We can easily ramp up the system simply by adding more nodes to a cluster, with little administration required.
  • Storage flexibility: We can store a variety of data, whether structured, semi-structured, or unstructured. We can store as much data as we want and decide how to use it later.
  • Resilient framework: Data and application processing are protected against hardware failure by inherent data protection and self-healing capabilities. If a node goes down, jobs are automatically redirected to other nodes so that the distributed computation does not fail, and multiple copies of all data are stored automatically; the sketch after this list shows how that replication can be inspected and raised per file.
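HDFS block replication is what makes the framework resilient to disk and node loss, and it is exposed directly through the FileSystem API. The sketch below reads a file's current replication factor and raises it; the file path is a placeholder, and dfs.replication defaults to 3 in a stock configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
  public static void main(String[] args) throws Exception {
    // Loads fs.defaultFS and dfs.replication from the cluster configuration
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/hello.txt"); // placeholder path

    // Every HDFS file carries its own replication factor (default 3)
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Current replication: " + status.getReplication());

    // Raise it for a critical file; the NameNode schedules the extra block copies
    fs.setReplication(file, (short) 5);

    fs.close();
  }
}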