Big Data Processing on Hadoop: Internal Mechanics and Data Pipeline Architecture
Hadoop processes large datasets through a series of well-orchestrated steps. Here’s a detailed step-by-step explanation of how Hadoop works internally to handle big data:
1. Data Ingestion
Methods of Data Ingestion
- Flume: Used for ingesting large amounts of streaming data into HDFS.
- Sqoop: Used for importing data from relational databases into HDFS.
- Direct File Upload: Data can be uploaded directly to HDFS using Hadoop shell commands, the web interface, or the HDFS Java API (see the sketch below).
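For the direct-upload path, a file can also be written to HDFS programmatically. The following is a minimal sketch using the standard HDFS Java FileSystem API; the local and HDFS paths are hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        // Connects to the default filesystem (HDFS in a typical cluster setup).
        FileSystem fs = FileSystem.get(conf);
        // Copy a local file into HDFS; both paths are example placeholders.
        fs.copyFromLocalFile(new Path("/tmp/events.log"), new Path("/data/raw/events.log"));
        fs.close();
    }
}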
2. Data Storage (HDFS)
HDFS is designed to store very large datasets reliably and to stream those datasets at high bandwidth to user applications.
HDFS Components
- NameNode:
  - Manages the filesystem namespace and metadata.
  - Keeps track of files, directories, and blocks in HDFS.
- DataNodes:
  - Store the actual data blocks.
  - Regularly report to the NameNode with block information.
Data Storage Mechanism
- Data Splitting: When a file is ingested into HDFS, it is split into blocks (default 128 MB per block).
- Block Replication: Each block is replicated across multiple DataNodes (default replication factor is 3) for fault tolerance.
- Metadata Storage: The NameNode keeps metadata about the blocks and their locations; the sketch after this list shows how that information can be queried.
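As an illustration of how splitting, replication, and metadata fit together, the sketch below (hypothetical file path, standard HDFS Java API) asks the NameNode for a file's block size, replication factor, and the DataNodes holding each block.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/raw/events.log");   // hypothetical file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:  " + status.getBlockSize());    // e.g. 134217728 (128 MB)
        System.out.println("Replication: " + status.getReplication());  // e.g. 3
        // Each BlockLocation lists the DataNodes that hold replicas of one block.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}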
3. Data Processing (MapReduce)
MapReduce is a programming model and an associated implementation for processing and generating large datasets with a parallel, distributed algorithm on a cluster. It was introduced by Google and is the core processing engine of Apache Hadoop.
Key Concepts
- Map: The Map function takes a set of input key/value pairs and produces a set of intermediate key/value pairs: map(k1, v1) → list(k2, v2).
- Reduce: The Reduce function takes an intermediate key and the list of values associated with it and merges them to produce the final result: reduce(k2, list(v2)) → list(v3).
MapReduce Components
- JobTracker (Master Node):
  - Manages resources and job scheduling.
  - Coordinates the execution of MapReduce jobs.
- TaskTracker (Slave Node):
  - Executes individual tasks as directed by the JobTracker.
Note: JobTracker and TaskTracker are the Hadoop 1.x (MRv1) daemons; in Hadoop 2 and later their roles are taken over by YARN's ResourceManager, ApplicationMaster, and NodeManager (see section 4), while the MapReduce programming model itself is unchanged.
MapReduce Processing Steps
- Job Submission:
  - A client submits a job (a set of MapReduce tasks) to the JobTracker.
- Input Splits:
  - The input data is split into logical input splits, which are assigned to mappers.
- Mapping Phase:
  - Map Task: The map function processes the input split and outputs key-value pairs.
  - Intermediate data is stored on local disks.
- Shuffling and Sorting Phase:
  - The system sorts the intermediate data and transfers it from the map outputs to the reducers.
- Reducing Phase:
  - Reduce Task: The reduce function processes the sorted key-value pairs and outputs the final result.
  - The output is stored in HDFS.
Scenario: Word Count
One of the classic examples to explain MapReduce is the “Word Count” problem. Let’s say you have a large number of documents, and you want to count the number of occurrences of each word across all documents.
Steps in MapReduce
Step 1: Splitting
The input data is split into fixed-size pieces called “splits” or “chunks”. Each chunk will be processed independently.
Step 2: Mapping
The Map function processes each chunk of data and produces intermediate key/value pairs. For the Word Count example, the Map function will output each word as a key and the number 1 as the value.
Step 3: Shuffling and Sorting
The framework sorts and transfers the intermediate data to the reducers. All values associated with the same key are sent to the same reducer.
Step 4: Reducing
The Reduce function processes each key and its associated values, combining them to produce the final result. For the Word Count example, the Reduce function will sum the values for each word.
Step 5: Output
The final output is written to the output files.
Detailed Example: Word Count
Input Data
Assume we have three documents:
- Document 1: “Hello world”
- Document 2: “Hello Hadoop”
- Document 3: “Hello world Hello”
Step-by-Step Execution
1. Map Phase
Each document is split into words, and each word is emitted as a key with a value of 1.
| Input Split (Document) | Map Output (Intermediate Key/Value Pairs) |
|---|---|
| Document 1 | (“Hello”, 1), (“world”, 1) |
| Document 2 | (“Hello”, 1), (“Hadoop”, 1) |
| Document 3 | (“Hello”, 1), (“world”, 1), (“Hello”, 1) |
2. Shuffle and Sort Phase
The framework sorts the intermediate data and groups all values associated with the same key.
| Intermediate Key | Grouped Values |
|---|---|
| “Hello” | [1, 1, 1, 1] |
| “world” | [1, 1] |
| “Hadoop” | [1] |
3. Reduce Phase
The Reduce function sums the values for each key to get the final word count.
| Intermediate Key | Final Output (Key/Value) |
|---|---|
| “Hello” | (“Hello”, 4) |
| “world” | (“world”, 2) |
| “Hadoop” | (“Hadoop”, 1) |
MapReduce Code Example (Hadoop)
Mapper Code (Java)
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Split the input line on whitespace and emit (word, 1) for every token.
        String[] words = value.toString().split("\\s+");
        for (String str : words) {
            word.set(str);
            context.write(word, one);
        }
    }
}
Reducer Code (Java)
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum all of the 1s emitted for this word and write the total.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Driver Code (Java)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // combiner pre-aggregates map output locally
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not already exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Execution
- Compile the code:
javac -classpath $(hadoop classpath) -d wordcount_classes WordCountMapper.java WordCountReducer.java WordCount.java
- Create a JAR file:
jar -cvf wordcount.jar -C wordcount_classes/ .
- Run the job:
hadoop jar wordcount.jar WordCount /input/path /output/path
MapReduce is a powerful model for processing large datasets in parallel across a cluster. The Word Count example demonstrates how the Map and Reduce phases work together to transform and aggregate data. By understanding this fundamental example, you can apply MapReduce concepts to solve more complex big data problems.
4. Resource Management (YARN)
YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop.
YARN Components
- ResourceManager:
  - Allocates cluster resources to various applications.
  - Maintains the list of available resources and applications.
- NodeManager:
  - Monitors resource usage (CPU, memory, disk) of individual nodes.
  - Reports to the ResourceManager.
- ApplicationMaster:
  - Manages the lifecycle of a specific application.
  - Negotiates resources from the ResourceManager and works with the NodeManager to execute tasks.
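As a small illustration of the ResourceManager's view of the cluster, the sketch below uses the YarnClient API to list the applications YARN currently knows about. It assumes a reachable ResourceManager configured through yarn-site.xml on the classpath.

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml for the ResourceManager address
        yarnClient.start();
        // Ask the ResourceManager for all known applications and print their state.
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getApplicationId() + "  " + app.getName()
                    + "  " + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}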
5. Data Analysis (Using Hive, Pig, Spark)
Hive
- Data Warehousing: Allows querying and managing large datasets residing in distributed storage.
- HiveQL: SQL-like language for querying data.
Pig
- Data Flow Language: Uses Pig Latin for expressing data analysis programs.
- Execution Engine: Converts Pig Latin scripts into MapReduce jobs.
Spark
- In-Memory Processing: Offers in-memory computation for faster data processing.
- Rich API: Provides APIs in Java, Scala, Python, and R for different operations.
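To make the contrast with MapReduce concrete, here is a minimal sketch of the same word count written against Spark's Java RDD API (Spark 2.x style). The input and output paths are passed as arguments and are assumed to be HDFS directories.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark word count");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> lines = sc.textFile(args[0]);                         // e.g. an HDFS input path
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())  // split lines into words
                .mapToPair(word -> new Tuple2<>(word, 1))                       // emit (word, 1) pairs
                .reduceByKey(Integer::sum);                                     // sum counts per word
        counts.saveAsTextFile(args[1]);                                       // e.g. an HDFS output path
        sc.stop();
    }
}

Because intermediate datasets can be kept in memory, multi-stage or iterative jobs avoid the repeated HDFS writes that a chain of MapReduce jobs would incur.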
6. Data Export (Sqoop, Flume)
- Sqoop: Exports data from HDFS back to relational databases.
- Flume: Primarily an ingestion tool; its sinks can also deliver collected log data to stores other than HDFS (for example, HBase), so it is sometimes used to move data onward in a pipeline.
Detailed Data Pipeline Workflow
Step-by-Step Data Pipeline
- Data Ingestion
  - Use Flume to collect log data and store it in HDFS.
  - Use Sqoop to import relational database tables into HDFS.
- Data Storage
  - The ingested data is stored as HDFS blocks across multiple DataNodes.
  - The NameNode maintains metadata about the stored data.
- Data Processing
  - Submit a MapReduce job to process the stored data.
  - The JobTracker divides the job into tasks and assigns them to TaskTrackers.
  - The mapping phase processes the input splits, generating intermediate key-value pairs.
  - Intermediate data is shuffled and sorted before being passed to the reducers.
  - The reducing phase processes the intermediate data and stores the final output in HDFS.
- Resource Management
  - YARN manages the resources required for executing the MapReduce job.
  - The ResourceManager allocates resources, while the NodeManager monitors resource usage.
- Data Analysis
  - Use Hive to run SQL-like queries on the processed data.
  - Use Pig to write data flow scripts for further analysis.
  - Use Spark for in-memory data processing and advanced analytics.
- Data Export
  - Use Sqoop to export the processed data from HDFS to a relational database for reporting.
  - Use Flume to deliver collected log data to other downstream stores (for example, HBase) where needed.
Hadoop’s architecture enables the processing of large datasets through distributed storage and parallel processing. By using HDFS for storage, MapReduce for processing, and YARN for resource management, Hadoop provides a robust and scalable framework for big data applications. The integration with tools like Hive, Pig, and Spark further enhances its capability for data analysis and processing.
Summary:
- HDFS (Hadoop Distributed File System): A distributed file system for storing data.
- MapReduce: A programming model for processing data in parallel.
- YARN (Yet Another Resource Negotiator): A resource management layer for managing resources and scheduling jobs.
Architecture:
- NameNode: The master node that manages HDFS metadata.
- DataNode: Slave nodes that store data in HDFS.
- ResourceManager: The master node that manages resources and schedules jobs in YARN.
- NodeManager: Slave nodes that manage resources and execute jobs in YARN.
- ApplicationMaster: A component that manages the execution of a job.
Running Jobs:
- Submit a job: Use the hadoop jar command to submit a job to YARN.
- ResourceManager schedules the job: The ResourceManager schedules the job and allocates resources.
- ApplicationMaster manages the job: The ApplicationMaster manages the execution of the job.
- MapReduce executes the job: MapReduce executes the job in parallel across the cluster.
- Output is written to HDFS: The output of the job is written to HDFS.
Steps to run a job:
- Prepare the input data and store it in HDFS.
- Write a MapReduce program to process the data.
- Package the program into a JAR file.
- Submit the job to YARN using the hadoop jar command.
- Monitor the job’s progress using the YARN web interface.
- Retrieve the output from HDFS.
Example command to run a job:
hadoop jar myjob.jar input output
This command submits a job to YARN, which executes the myjob.jar program on the input data and writes the output to HDFS.