This document is intended for anyone seeking work in Big Data. It contains the most frequently asked interview questions that I encountered between November 2023 and January 2024, covering Hadoop, Spark, and Hive.
An easy way to crack the Big Data interview:
Topics covered:
1. Hadoop
2. Spark
3. Hive
1. HADOOP
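Q1. What is Hadoop, and why is it used?
Hadoop is an open-source framework that manages the storage and processing of large amounts of data for applications. It is used because it distributes both storage and computation across a cluster of machines, so capacity grows by adding nodes.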
Q2. What are the main components of Hadoop?
Storage – HDFS
Batch processing – MapReduce
Resource Management – YARN
Q3. What is HDFS? What are the functions of the NameNode and the DataNode?
HDFS stands for Hadoop Distributed File System. Instead of keeping all data on a single node (machine), HDFS splits files into blocks and distributes them across multiple nodes, storing each block with a default replication factor of 3.
It follows a master/slave topology.
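To make the distribution concrete, here is a minimal sketch of how a file might be split into blocks and replicated. It assumes a 128 MB block size and a simple round-robin placement; the function name `plan_blocks` and the DataNode names are hypothetical, and real HDFS placement is rack-aware rather than round-robin.

```python
import math

BLOCK_SIZE_MB = 128      # assumed block size for this sketch
REPLICATION_FACTOR = 3   # default replication factor mentioned above


def plan_blocks(file_size_mb, datanodes):
    """Split a file into blocks and assign each block to distinct DataNodes."""
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    placement = {}
    for block_id in range(num_blocks):
        # Round-robin placement (an assumption; not the real HDFS policy).
        # Replicas are distinct as long as there are at least 3 DataNodes.
        placement[block_id] = [
            datanodes[(block_id + i) % len(datanodes)]
            for i in range(REPLICATION_FACTOR)
        ]
    return placement


placement = plan_blocks(500, ["dn1", "dn2", "dn3", "dn4"])
raw_storage_mb = 500 * REPLICATION_FACTOR  # storage consumed across the cluster

for block_id, replicas in placement.items():
    print(block_id, replicas)
print("raw storage (MB):", raw_storage_mb)
```

With replication, a 500 MB file occupies three times its size in raw cluster storage, which is the price paid for surviving the loss of a node.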
The NameNode works as the master in a Hadoop cluster. Main functions performed by the NameNode:
1. Stores the metadata of the actual data.
2. Manages the file system namespace and executes operations such as opening and closing files and renaming files and directories.
3. Regulates client access requests for the actual file data.
4. Assigns work to the slaves (DataNodes).
The DataNode works as a slave in a Hadoop cluster. Main functions performed by the DataNode:
1. Stores the actual business data.
2. Acts as the worker node where read, write, and data-processing requests are handled.
3. Upon instruction from the master, performs creation, replication, and deletion of data blocks.
4. Requires a large amount of storage, because all the business data is stored on the DataNodes.
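The division of labour described above can be illustrated with a toy model. This is only a sketch: the `NameNode` and `DataNode` classes, their methods, and the file path are hypothetical and bear no relation to the real Hadoop APIs. The point is that the master keeps only the mapping from files to block locations, while the workers hold the bytes.

```python
class DataNode:
    """Worker: stores the actual block bytes."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Master: stores only metadata, never file contents."""

    def __init__(self):
        self.namespace = {}  # file path -> list of (block_id, DataNode name)

    def register_block(self, path, block_id, datanode_name):
        self.namespace.setdefault(path, []).append((block_id, datanode_name))

    def locate(self, path):
        return self.namespace[path]


datanodes = {"dn1": DataNode("dn1"), "dn2": DataNode("dn2")}
namenode = NameNode()

# Write: the data goes to DataNodes; the NameNode records where it went.
chunks = [("dn1", b"hello "), ("dn2", b"world")]
for block_id, (dn_name, chunk) in enumerate(chunks):
    datanodes[dn_name].write_block(block_id, chunk)
    namenode.register_block("/demo/file.txt", block_id, dn_name)

# Read: ask the NameNode for locations, then fetch bytes from the DataNodes.
content = b"".join(
    datanodes[dn_name].read_block(block_id)
    for block_id, dn_name in namenode.locate("/demo/file.txt")
)
print(content)
```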
Q4. What happens to a NameNode that has no data?
A NameNode without data does not exist. Any node acting as a NameNode holds at least some data: the metadata that describes the file system namespace.
Q5. What happens if the NameNode fails?
Since Hadoop 2.x, an HDFS cluster can run two NameNodes for high availability: active and passive. The active NameNode is the one that serves the running Hadoop cluster.
The passive NameNode, also known as the standby NameNode, comes into action only when the active NameNode fails.
Whenever the active NameNode fails, the standby NameNode takes over the responsibility of the failed NameNode and keeps HDFS up and running. The passive NameNode also takes the edit logs (the record of metadata changes) from the active NameNode and merges them with the FsImage (file system image) to produce an updated FsImage and to prevent the edit logs from becoming too large.
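The merge of edit logs into the FsImage can be pictured as replaying a list of logged operations onto a snapshot. The sketch below is a deliberately simplified model: the FsImage is a plain set of paths, and the operation names and the `checkpoint` function are hypothetical, not the on-disk format or API that HDFS uses.

```python
def apply_edit(fsimage, edit):
    """Apply one logged namespace operation to an in-memory FsImage."""
    op = edit[0]
    if op == "create":
        fsimage.add(edit[1])
    elif op == "delete":
        fsimage.discard(edit[1])
    elif op == "rename":
        fsimage.discard(edit[1])
        fsimage.add(edit[2])
    return fsimage


def checkpoint(fsimage, edit_log):
    """Replay the edit log onto a copy of the FsImage.

    Returns the updated image and an empty log, since the replayed
    edits no longer need to be kept.
    """
    new_image = set(fsimage)
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


fsimage = {"/a.txt", "/b.txt"}
edit_log = [
    ("create", "/c.txt"),
    ("delete", "/a.txt"),
    ("rename", "/b.txt", "/d.txt"),
]
fsimage, edit_log = checkpoint(fsimage, edit_log)
print(sorted(fsimage), edit_log)
```

Because the log is emptied after each checkpoint, it stays small, and a NameNode that restarts only has to replay the edits made since the last merge.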
Q6. What are the phases of MapReduce?
Map Phase:
The input data is divided into smaller chunks called "splits."
A "Mapper" function is applied to each split independently. The Mapper takes the input data and produces a
set of key-value pairs.
Shuffle and Sort Phase:
The output key-value pairs from all Mappers are shuffled and sorted by key to ensure that all values with the
same key are grouped together. This is essential for the subsequent Reduce phase.
Reduce Phase:
The sorted key-value pairs are passed to a set of "Reducer" functions. Each Reducer receives a group of key-value pairs with the same key.
The Reducer processes this data and produces an output, typically aggregating or summarizing the values
associated with each key.
The output of the Reduce phase is typically written to an external storage system, like HDFS (Hadoop
Distributed File System).
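The three phases can be sketched with a word count, the usual introductory example. This is plain single-process Python rather than Hadoop code: the function names and the input splits are made up for illustration, and a real job would run mappers and reducers in parallel across the cluster.

```python
from itertools import groupby
from operator import itemgetter


def mapper(split):
    """Map phase: emit a (word, 1) pair for every word in one split."""
    for word in split.split():
        yield (word.lower(), 1)


def shuffle_and_sort(pairs):
    """Shuffle and sort phase: sort by key and group values per key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reducer(key, values):
    """Reduce phase: aggregate all values that share a key."""
    return key, sum(values)


splits = ["big data big", "data spark"]  # hypothetical input splits
mapped = [pair for split in splits for pair in mapper(split)]
result = dict(reducer(key, values) for key, values in shuffle_and_sort(mapped))
print(result)
```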
2. SPARK
Q7. What are the features of Apache Spark?
High Processing Speed
In-Memory Computation
Reusability
Fault Tolerance
Stream Processing
Lazy Evaluation
Support for Multiple Languages
Hadoop Integration
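Of the features listed above, lazy evaluation is the one most easily shown in code. The sketch below uses plain Python generators as a stand-in and does not call Spark at all. Building the pipeline plays the role of transformations and consuming it plays the role of an action; the `log` list is only there to show when work actually happens.

```python
log = []  # records when elements are actually read


def numbers():
    """A data source that notes every element it produces."""
    for n in range(1, 6):
        log.append(f"read {n}")
        yield n


# "Transformations": building the pipeline does no work yet.
squared = (n * n for n in numbers())
evens = (n for n in squared if n % 2 == 0)
work_before_action = len(log)

# "Action": consuming the pipeline triggers the computation.
result = list(evens)
work_after_action = len(log)

print(work_before_action, result, work_after_action)
```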
Q8. What does DAG refer to in Apache Spark?
DAG stands for Directed Acyclic Graph: a graph with a finite number of vertices and edges and no directed cycles. Each edge is directed from one vertex to another in a sequential manner. The vertices refer to the RDDs of Spark, and the edges represent the operations to be performed on those RDDs.
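A small lineage graph shows why the absence of cycles matters: it guarantees an order in which every RDD can be computed after the RDDs it depends on. The sketch below uses the standard-library `graphlib` module (Python 3.9 or later), not Spark's own scheduler, and the RDD names and the operations in the comments are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is an RDD; its value is the set of parent RDDs it is derived from.
lineage = {
    "lines": set(),                        # e.g. read from a text file
    "words": {"lines"},                    # e.g. flatMap
    "pairs": {"words"},                    # e.g. map
    "counts": {"pairs"},                   # e.g. reduceByKey
    "stopwords": set(),                    # e.g. a second input
    "filtered": {"counts", "stopwords"},   # e.g. subtractByKey
}

# A valid execution order exists only because the graph has no cycles.
order = list(TopologicalSorter(lineage).static_order())
print(order)
```

If the graph contained a cycle, `static_order()` would raise `graphlib.CycleError`, since no RDD in the cycle could be computed first.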
Q9. How is Apache Spark different from MapReduce?
MapReduce spark