Apache YARN (Yet Another Resource Negotiator) is the cluster resource management layer of Hadoop. YARN was introduced in Hadoop 2.x.

YARN allows different data processing engines, such as graph processing, interactive processing, stream processing, and batch processing, to run and process data stored in HDFS (Hadoop Distributed File System). Apart from resource management, YARN also handles job scheduling.

YARN extends the power of Hadoop to other evolving technologies, so they can take advantage of HDFS (a reliable and widely used storage system) and economical clusters.

YARN Daemons

1. Global Resource Manager

The main job…

One of the major advantages of using Hadoop is its ability to handle failures and allow jobs to complete successfully. In this article we discuss the different types of failures that can occur in Hadoop and how they are handled.

Let us begin with HDFS failures and then discuss YARN failures. HDFS has two main daemons, the Namenode and the Datanode.

Namenode Failure:

The Namenode is the master node, which stores metadata such as file names, the number of blocks, the number of replicas, block locations, and block IDs.

In Hadoop 1.x, the Namenode is the single point…

In Hadoop 2, MapReduce jobs are executed using YARN (Yet Another Resource Negotiator). Let us understand the different IDs that are created while executing a MapReduce application.

Application Id:

When an MR job is submitted by the client, the resource manager first creates the application ID. The application ID is composed of the time at which the resource manager was started and an incrementing counter maintained by the RM to uniquely identify the application.
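The composition described above can be sketched in a few lines of Python. This is an illustrative parser, not part of any Hadoop client library: the function name `parse_application_id` and the class `ApplicationId` are hypothetical, and the sample ID assumes the usual `application_<RM start time>_<counter>` layout.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class ApplicationId(NamedTuple):
    rm_start_millis: int  # resource manager start time, epoch milliseconds
    sequence: int         # incrementing counter maintained by the RM


def parse_application_id(app_id: str) -> ApplicationId:
    """Split an ID such as 'application_1622829088382_0005' into its parts."""
    prefix, start_millis, counter = app_id.split("_")
    if prefix != "application":
        raise ValueError(f"not a YARN application ID: {app_id!r}")
    return ApplicationId(int(start_millis), int(counter))


parsed = parse_application_id("application_1622829088382_0005")
# Convert the millisecond timestamp into a readable UTC datetime.
rm_started = datetime.fromtimestamp(parsed.rm_start_millis / 1000, tz=timezone.utc)
print(parsed.sequence)  # 5
print(rm_started.isoformat())
```

Because the timestamp belongs to the resource manager rather than to the job, every application submitted between two RM restarts shares the same middle component and differs only in the counter.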


In an application ID such as application_1622829088382_0005, 1622829088382 refers to the start time of the resource manager (not the application) and 0005 indicates that it is the fifth…

In the real world, clusters are busy and resources are limited; as a result, applications often need to wait for some of their resource requests to be fulfilled. The YARN scheduler is responsible for allocating resources to applications based on defined policies. In this article we discuss the three scheduling options available in YARN.

1. FIFO Scheduler:

In the FIFO (First In, First Out) scheduler, applications are placed in a queue and run in the order of submission. This scheduling option is simple to understand and doesn’t require any configuration, but it is not…

Before we discuss what exactly erasure coding is, let us understand the two terms below and see how HDFS achieves them.

  1. Durability: How many simultaneous failures can be tolerated? It is also known as fault tolerance.
  2. Storage Efficiency: What portion of the storage holds useful data?

In HDFS, durability, reliability, read bandwidth, and write bandwidth are achieved through replication.

To provide fault tolerance, HDFS replicates blocks of a file on different DataNodes depending on the replication factor. …
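The two terms defined above can be made concrete for plain replication with a small calculation. This is a minimal sketch: the function names are hypothetical, and the factor of 3 is assumed because it is the usual HDFS default (`dfs.replication`).

```python
def replication_durability(replication_factor: int) -> int:
    """Simultaneous replica losses a block can survive with N copies."""
    return replication_factor - 1


def replication_storage_efficiency(replication_factor: int) -> float:
    """Fraction of raw storage that holds unique (useful) data."""
    return 1 / replication_factor


factor = 3  # assumed: the usual HDFS default replication factor
print(replication_durability(factor))                   # 2
print(f"{replication_storage_efficiency(factor):.0%}")  # 33%
```

With three copies, a block survives the loss of any two replicas, but only about a third of the raw capacity stores unique data.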

Hadoop Distributed File System (HDFS) is the file system of Hadoop, designed for storing very large files on clusters of commodity hardware. Generally, when a dataset outgrows the storage capacity of a single machine, it is necessary to partition it across a number of separate machines. File systems that manage storage across a network of machines are called distributed file systems.

In this topic, we discuss the different HDFS commands with examples. Most of the commands work much like their Unix counterparts.
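As a rough illustration of how closely the HDFS shell mirrors Unix, the sketch below builds two `hdfs dfs` commands from Python and runs them only if a Hadoop client is on the PATH. The helper name `hdfs_dfs` and the path `/user/demo` are made up for the example.

```python
import shutil
import subprocess


def hdfs_dfs(*args):
    """Build the argument list for an `hdfs dfs` shell command."""
    return ["hdfs", "dfs", *args]


# Unix equivalent: mkdir -p /user/demo/input  (hypothetical path)
mkdir_cmd = hdfs_dfs("-mkdir", "-p", "/user/demo/input")
# Unix equivalent: ls /user/demo
ls_cmd = hdfs_dfs("-ls", "/user/demo")

print(" ".join(mkdir_cmd))  # hdfs dfs -mkdir -p /user/demo/input

# Only run the commands when a Hadoop client is actually installed.
if shutil.which("hdfs"):
    subprocess.run(mkdir_cmd, check=True)
    subprocess.run(ls_cmd, check=True)
```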

1. mkdir:

This command is similar to that of Unix mkdir and is…

In this blog we discuss how to integrate Apache Kafka with Spark using Python, along with the required configuration.

How does Kafka work?

Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol. It can be deployed on bare-metal hardware, virtual machines, and containers in on-premise as well as cloud environments.



Introduction to Parquet

Apache Parquet is an open source columnar storage format that can efficiently store nested data and is widely used in Hadoop and Spark. It was initially developed by Twitter and Cloudera. Columnar formats are attractive because they enable greater efficiency, in terms of both file size and query performance. File sizes are usually smaller than row-oriented equivalents because in a columnar format the values from one column are stored next to each other, which usually allows a very efficient encoding. …
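The reason adjacent column values encode well can be seen in a toy example. The sketch below pivots a few hypothetical records into columns and applies simple run-length encoding. It is not Parquet's actual implementation, which combines dictionary encoding with an RLE and bit-packing hybrid; the field names and data are invented for illustration.

```python
from itertools import groupby

# Hypothetical sample records; the field names are made up for illustration.
rows = [
    {"country": "IN", "status": "active"},
    {"country": "IN", "status": "active"},
    {"country": "IN", "status": "inactive"},
    {"country": "US", "status": "inactive"},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    return {name: [r[name] for r in records] for name in records[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)
encoded = {name: run_length_encode(values) for name, values in columns.items()}
print(encoded["country"])  # [('IN', 3), ('US', 1)]
print(encoded["status"])   # [('active', 2), ('inactive', 2)]
```

In a row-oriented layout the same values would be interleaved (`IN, active, IN, active, ...`), so there would be no consecutive repeats to collapse.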

Karthik Sharma

Data Engineer
