HDFS, as the name implies, is a distributed file system. It stores a file across a cluster of commodity servers.
It was designed to store and provide fast access to big files and large datasets. It is scalable and fault tolerant.
HDFS is a block-structured file system. Just like Linux file systems, HDFS splits a file into fixed-size
blocks, also known as partitions or splits. The default block size is 128 MB, but it is configurable. Given the large block size, it should be clear that HDFS is not designed for storing small files. Where possible, HDFS spreads the blocks of a file across different machines. An application can therefore parallelize read and write operations on a file, making it much faster to read or write a large HDFS file spread across many disks on different computers than to read or write a file of the same size stored on a single disk.
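The block layout of a file can be inspected programmatically. The following is a minimal sketch using the Hadoop Java client (the FileSystem API); it assumes a reachable HDFS cluster whose configuration files are on the classpath, and the file path /data/large-file.txt is hypothetical, used only for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS and related settings from core-site.xml/hdfs-site.xml
    // on the classpath; assumes the client is configured to reach an HDFS cluster.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file path used for illustration.
    Path file = new Path("/data/large-file.txt");
    FileStatus status = fs.getFileStatus(file);

    // Block size used for this file (128 MB by default, configurable via dfs.blocksize).
    System.out.println("Block size: " + status.getBlockSize() + " bytes");

    // Each BlockLocation lists the hosts that store a replica of that block.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("Offset " + block.getOffset()
          + ", length " + block.getLength()
          + ", hosts " + String.join(",", block.getHosts()));
    }
    fs.close();
  }
}
```

Printing the block locations of a large file makes the earlier point concrete: successive blocks typically report different sets of hosts, which is what allows reads and writes to proceed in parallel across machines.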
Distributing a file across multiple machines increases the risk of the file becoming unavailable if one of those machines fails. HDFS mitigates this risk by replicating each file block on multiple machines; the default replication factor is 3. Thus, even if one or two of the machines storing a block fail, the file can still be read. HDFS was designed with the assumption that machines fail on a regular basis, so it can handle the failure of one or more machines in a cluster.
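The replication factor can also be read and changed per file through the same FileSystem API. The sketch below makes the same assumptions as the previous one (cluster configuration on the classpath, hypothetical path /data/large-file.txt); the cluster-wide default comes from the dfs.replication property.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationInfo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file path used for illustration.
    Path file = new Path("/data/large-file.txt");

    // Report the current replication factor (3 unless dfs.replication was changed).
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Replication factor: " + status.getReplication());

    // Request a higher replication factor for this file; HDFS re-replicates
    // the blocks in the background.
    fs.setReplication(file, (short) 5);

    fs.close();
  }
}
```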