Hadoop
is one of the most powerful and widely used big data frameworks and has become
the default choice for many organizations. However, many developers still
haven't had an opportunity to learn it.
It's easy to assume that because Hadoop is a complex framework, using it
effectively and efficiently requires specialized programming knowledge... but
that's not true. Even with little or no prior programming experience, you can
still learn how to use it.
The reason for this assumption is that before you can get started with Hadoop
development, you first need to know how the different components of the
framework work and what role each plays in its overall functionality. Only
then can you learn and use it effectively.
With a structured approach and some basic technical knowledge and programming
skills, anyone who wants to use Hadoop should be able to do so without much
effort, provided they have the right tools and resources.
Let's go through how Hadoop works and what roles the various components play
in that functionality.
Understanding how Hadoop works will help you use it more effectively. Knowing
what role each component plays will also help you keep up with new features
added to Hadoop core and decide which ones best fit your use cases. Read on to
get a better understanding of these topics.
What is Hadoop?
Hadoop is an open-source framework for distributed computing and storage.
It was created by Doug Cutting and Mike Cafarella, and developed extensively
at Yahoo to handle their data needs across a distributed infrastructure.
This framework consists of HDFS (Hadoop Distributed File System), MapReduce,
and YARN. These are the three primary components of the Hadoop ecosystem, which
were developed to address the common challenges faced while dealing with
large-scale distributed data sets, such as low-cost storage, managing massive
amounts of data, scalability, and fault tolerance. Each component plays a
specific role in how Hadoop works.
Let's explore them one by one:
• HDFS is the storage layer of Apache Hadoop, and it consists of two
primary components: the NameNode, which holds the file system metadata, and
the DataNodes, which store the actual data blocks. All data written to HDFS is
stored in the form of files, and files are organized into a directory
hierarchy.
• YARN (sometimes referred to as MapReduce 2) is the cluster resource
manager. Its ResourceManager acts as a scheduler, distributing tasks among the
available compute resources, while a NodeManager service on each node launches
and monitors the work assigned to that node.
• MapReduce is the processing layer. "Map" refers to the phase that
transforms input records into intermediate key-value pairs, and "Reduce"
refers to the phase that aggregates all values sharing the same key into a
final output. MapReduce programs are typically written in Java, though other
languages can be used as well, and they are executed against datasets stored
in HDFS.
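To make the Map and Reduce phases concrete, here is a minimal single-process
word count sketch written against the plain Java standard library. It requires
no Hadoop cluster; the class and method names are illustrative only, not part
of the Hadoop API, but the two phases mirror how a real MapReduce job shapes
its data.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch of the MapReduce model (not the Hadoop API):
// map() emits (word, 1) pairs, reduce() groups by key and sums the values.
public class WordCountSketch {

    // "Map" phase: turn one input line into (word, 1) key-value pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // "Reduce" phase: after grouping by key, sum the values for each word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("big data is big", "data is everywhere");
        // Run the map phase over every line, collecting intermediate pairs.
        List<Map.Entry<String, Integer>> intermediate = lines.stream()
                .flatMap(l -> map(l).stream())
                .collect(Collectors.toList());
        Map<String, Integer> counts = reduce(intermediate);
        System.out.println(counts.get("big"));        // 2
        System.out.println(counts.get("data"));       // 2
        System.out.println(counts.get("everywhere")); // 1
    }
}
```

In a real Hadoop job, the map tasks run in parallel on the nodes where the
HDFS blocks live, and the framework shuffles the intermediate pairs so that
all values for a key reach the same reducer; the grouping step above stands in
for that shuffle.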
