Hadoop was one of the first popular open source big data technologies. It is a scalable, fault-tolerant system
for processing large datasets across a cluster of commodity servers. It provides a simple programming
framework for large-scale data processing using the resources of that cluster.
Hadoop was inspired by systems that Google built to create the inverted index for its search product. Google
engineers published two papers describing these systems. The first, "MapReduce: Simplified Data Processing
on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat (2004), is available at
research.google.com/archive/mapreduce.html. The second, "The Google File System" (2003), is available at
research.google.com/archive/gfs.html. Inspired by these papers, Doug Cutting and Mike Cafarella
developed an open source implementation, which later became Hadoop.
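To give a concrete sense of the programming model that Hadoop adopted from the MapReduce paper, the following is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API. The WordCount class name and the input and output paths taken from the command line are illustrative, not part of any particular application.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The map phase runs in parallel across the cluster; each call
  // receives one line of input and emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // The reduce phase receives all counts emitted for a given word
  // and sums them to produce the final (word, total) pair.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // The driver configures and submits the job; the framework handles
  // splitting the input, scheduling tasks, and recovering from failures.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The map function turns each input line into intermediate (word, 1) pairs, the framework groups those pairs by key across the cluster, and the reduce function sums the counts for each word. Handling more data is mostly a matter of adding servers rather than changing the code.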
Many organizations have replaced expensive proprietary products with Hadoop for processing large
datasets. The first reason is cost. Hadoop is open source and runs on clusters of commodity hardware, so
you can scale it easily by adding inexpensive servers. Because Hadoop itself provides high availability and
fault tolerance, you don't need to buy expensive hardware to get them. The second reason is that Hadoop is
better suited for certain types of data processing tasks, such as batch processing and ETL (extract,
transform, load) of large-scale data.
Hadoop is built on a few important ideas. First, it is cheaper to use a cluster of commodity servers for
both storing and processing large amounts of data than to use a few high-end, powerful servers. In other
words, Hadoop uses a scale-out architecture instead of a scale-up architecture.