Apache Hive- Overview, Components, Architecture, Step by Step Execution Via Apache Tez or Spark

elow is the Hive Deep Dive Series, delivered inline, one module at a time with real-world relevance, use cases, syntax, and interview insights.

✅ Module 1: 📘 Hive Basics & Architecture

🔹 What is Hive?

Apache Hive is a data warehouse system built on top of Hadoop for querying and analyzing structured data using a SQL-like language (HiveQL).

Converts SQL queries into MapReduce, Tez, or Spark jobs
Stores metadata in an RDBMS (Metastore)
Ideal for batch processing, data summarization, and ETL workflows

📌 Use Case: Run daily/weekly summary reports on clickstream data stored in HDFS.

🔹 Hive Architecture Overview

+--------------+     +-----------+     +----------+     +--------------+
|   Hive CLI   | --> |   Driver  | --> | Compiler | --> | Execution Eng|
+--------------+     +-----------+     +----------+     +--------------+
                                                |              |
                                     +----------+              |
                                     |                         v
                               +-----------+           +---------------+
                               | Metastore | <-------- | Hadoop / Tez  |
                               +-----------+           +---------------+

📦 Components:

Component	Role
Driver	Manages lifecycle of HiveQL statement (parsing → execution)
Compiler	Converts HiveQL → DAG of stages (MapReduce/Tez)
Metastore	Stores metadata (schemas, table locations, partitions)
Execution Engine	Executes plan using MR, Tez, or Spark

🔹 Hive vs RDBMS

Feature	Hive	RDBMS (e.g., MySQL, Oracle)
Data model	Schema-on-Read	Schema-on-Write
Latency	High (batch-oriented)	Low (OLTP)
Indexing	Limited	Extensive indexing, constraints
Update/Delete	ACID (since Hive 0.14+, with config)	Native support
Joins	Expensive (uses MapReduce or Tez)	Optimized joins with indexes
Transactions	Supported via ORC + ACID tables	Fully transactional
Use case	Big Data ETL, Reporting	OLTP, Fast Lookup

📌 Use Case: Hive for offline reporting, MySQL for real-time application needs.

🔹 Hive vs Impala vs Presto vs Spark SQL

Engine	Query Latency	Execution	Storage	Best For
Hive	High (seconds–mins)	MR/Tez/Spark	HDFS/Hive Tables	Batch ETL, complex joins
Impala	Low (sub-seconds)	Native	HDFS/Hive	Interactive SQL, BI dashboards
Presto	Very low (ms-sec)	In-memory	HDFS, S3	Ad-hoc queries, federated sources
Spark SQL	Medium (sec)	Spark	HDFS, Hive	ETL + Machine Learning integration

📌 Use Case:
Use Hive for daily aggregations, Presto for ad-hoc joins from Hive + MySQL + Kafka.

🧠 Interview Highlights

Question	Key Points to Answer
Explain Hive Architecture	Metastore, Driver, Compiler, Execution Engine
Difference between Hive and traditional RDBMS	Schema-on-read, batch, slow vs fast, indexing
Which execution engines does Hive support?	MR (legacy), Tez (default), Spark (configurable)
What is the role of Metastore?	Metadata: DBs, Tables, Partitions, File Paths
When to use Hive vs Spark SQL vs Presto?	Hive = ETL, SparkSQL = ML+ETL, Presto = ad-hoc BI

📂 Real-World Design Tip

💡 In a typical Enterprise Data Lake, Hive is used to store data in curated and partitioned format (ORC/Parquet), while:

Presto is used by analysts for fast querying
Spark is used to read/write Hive tables for transformation and ML

Online Transaction Processing (OLTP) vs. Online Analytical Processing (OLAP)

OLTP:
- OLTP systems are designed for managing transactional data.
- Characteristics include:
  - Low-latency and fast query response times.
  - High volume of short online transactions.
  - Examples include banking systems, e-commerce websites, and customer relationship management (CRM) systems.
OLAP (where Hive excels):
- OLAP systems are designed for analyzing and querying large volumes of data.
- Characteristics include:
  - High latency but capable of processing complex queries over large datasets.
  - Batch processing and data warehousing.
  - Examples include data warehousing solutions, business intelligence tools, and big data analytics platforms.

Apache Hive is not suited for OLTP due to its high latency and batch-oriented processing model. It is best used for traditional data warehousing tasks, where the emphasis is on reading and analyzing large datasets rather than handling numerous short transactions. By leveraging Hive for OLAP, organizations can efficiently perform complex queries and data analysis on vast amounts of data stored in Hadoop.

Hive Architecture

Hive's architecture is designed to provide a user-friendly SQL interface to the complex underlying Hadoop infrastructure. Here’s a detailed breakdown of its components and architecture:

1. HiveQL

HiveQL is the query language used by Hive, similar to SQL. It allows users to perform data manipulation and querying operations. HiveQL queries are converted into MapReduce or Tez jobs for execution on a Hadoop cluster.

HiveQL Components

Parser: Breaks down HiveQL queries into understandable components.

Semantic Analyzer: Checks query syntax and semantics.

Optimizer: Optimizes queries for efficient execution.

Executor: Executes optimized queries on the Hive database.

HiveQL Functions

Data Definition: Creates, alters, and drops databases, tables, views, and stored procedures.

Data Manipulation: Inserts, updates, and deletes data in Hive tables.

Data Querying: Retrieves data from Hive tables using SELECT statements.

Data Analysis: Performs data analysis using aggregate functions, joins, and subqueries.

2. MetaStore

The MetaStore is a critical component- a central repository in Hive, responsible for storing metadata about tables, columns, partitions, storage locations and the data types in the Hive environment. It can use relational databases like MySQL or Derby to store this metadata. It acts as a single source of truth for Hive metadata, enabling efficient data management and query optimization.

Hive Metastore Components

Metastore Database: A relational database that stores Hive metadata, such as MySQL or PostgreSQL.

Metastore Server: A service that manages access to the metastore database and provides metadata to Hive clients.

Hive Client: A component that connects to the metastore server to retrieve and update metadata.

Hive Metastore Functions

Metadata Storage: Stores metadata about Hive databases, tables, partitions, and storage locations.

Metadata Retrieval: Provides metadata to Hive clients for query optimization and data management.

Metadata Updates: Updates metadata in response to changes in the Hive database, such as new table creation or partition addition.

Hive Metastore Benefits

Improved Query Performance: Optimizes queries by providing metadata for efficient data retrieval.

Data Management: Enables efficient data management by storing metadata about data locations and formats.

Scalability: Supports large-scale data storage and processing by providing a centralized metadata repository.

Hive Metastore Configuration

Metastore Database Configuration: Configures the metastore database connection, such as username, password, and JDBC URL.

Metastore Server Configuration: Configures the metastore server settings, such as port number and memory allocation.

Hive Client Configuration: Configures the Hive client settings, such as metastore server connection and metadata retrieval settings.

Hive Metastore Best Practices

Secure Metastore Access: Controls access to metastore data and ensures secure metadata management.

Regularly Backup Metastore Data: Ensures data recovery in case of metastore failure or corruption.

Monitor Metastore Performance: Identifies performance bottlenecks and optimizes metastore configuration.

Components of MetaStore:

Database: Stores metadata for tables and partitions.

Tables: Metadata information about tables.

Partitions: Data split into segments for easy querying.

Buckets: Sub-partitions for fine-grained data organization.

3. Driver

The Hive driver is a component of the Hive architecture that allows users to interact with Hive databases and execute queries. The Driver manages the lifecycle of a HiveQL statement. It is responsible for creating sessions, parsing, compiling, optimizing, and executing the queries. It also manages the metadata and interactions with the MetaStore.It acts as an interface between the user and the Hive metastore, enabling users to perform various operations such as:

Query execution: Execute Hive queries, including SELECT, INSERT, UPDATE, and DELETE statements.

Metadata management: Create, alter, and drop databases, tables, views, and stored procedures.

Data manipulation: Load data into Hive tables, export data from Hive tables, and manage data partitions.

Types of Hive Drivers

Hive JDBC Driver: A Java-based driver for connecting to Hive databases using JDBC.

Hive ODBC Driver: A driver for connecting to Hive databases using ODBC.

Hive Thrift Driver: A driver for connecting to Hive databases using Thrift.

Hive Driver Configuration

To use the Hive driver, you need to configure the following:

Hive metastore connection: Specify the Hive metastore URI, username, and password.

Hive execution engine: Choose the execution engine, such as MapReduce or Tez.

Query optimization: Configure query optimization settings, such as caching and indexing.

Hive Driver Usage

Java applications: Use the Hive JDBC driver to connect to Hive databases from Java applications.

Python applications: Use the Hive Thrift driver to connect to Hive databases from Python applications.

SQL clients: Use the Hive ODBC driver to connect to Hive databases from SQL clients, such as SQL Server Management Studio.

Hive CLI: Use the Hive command-line interface to execute queries and manage Hive databases.

By understanding the Hive driver and its configuration, you can effectively interact with Hive databases and execute queries to analyze and manage your data.

4. Compiler

The Compiler translates HiveQL statements into a directed acyclic graph (DAG) of MapReduce jobs. It breaks down the high-level queries into smaller tasks and ensures they are optimized for execution. The Hive compiler is a crucial component of the Apache Hive architecture, responsible for translating HiveQL queries into a series of MapReduce, Tez, or Spark jobs that can be executed on the Hadoop cluster. Here’s an in-depth look at how the Hive compiler works and its role in the overall Hive system:

Components of Hive Compiler

Parser:

Function: The parser takes the HiveQL query and converts it into an Abstract Syntax Tree (AST).

Process: The AST represents the structure of the query in a tree format, identifying different clauses (SELECT, FROM, WHERE, etc.) and their relationships.

Semantic Analyzer:

Function: The semantic analyzer checks the AST for semantic correctness. It ensures that the query makes sense within the context of the database schema.

Process: This step involves verifying table and column names, checking for type consistency, and validating expressions.

Logical Plan Generator:

Function: Converts the validated AST into a logical plan.

Process: The logical plan represents the sequence of operations (like joins, filters, aggregations) required to execute the query. It is independent of the execution engine.

Optimizer:

Function: Optimizes the logical plan for efficiency.

Process: The optimizer performs various transformations and optimizations, such as predicate pushdown, join reordering, and eliminating redundant operations, to improve query performance.

Physical Plan Generator:

Function: Converts the optimized logical plan into a physical plan.

Process: The physical plan details the specific operations to be performed by the execution engine (MapReduce, Tez, or Spark).

Execution Engine:

Function: Executes the physical plan.

Process: The execution engine (MapReduce, Tez, or Spark) takes the physical plan and executes it on the Hadoop cluster, processing data as per the plan.

Hive Compilation and Execution Process

Here’s a step-by-step explanation of how a HiveQL query is compiled and executed:

Query Submission:

The user submits a HiveQL query via the Hive CLI, Beeline, or any other Hive client.

Parsing:

The query is parsed to generate an Abstract Syntax Tree (AST).

Semantic Analysis:

The AST is analyzed to ensure it is semantically correct. This involves checking the database schema, table names, column names, and data types.

Logical Plan Generation:

An initial logical plan is created from the validated AST. This plan outlines the sequence of operations required to fulfill the query.

Optimization:

The logical plan is optimized to improve efficiency. Optimizations include predicate pushdown, join reordering, and cost-based optimization.

Physical Plan Generation:

The optimized logical plan is converted into a physical plan that specifies the execution steps for the selected execution engine (MapReduce, Tez, or Spark).

Query Execution:

The physical plan is executed by the execution engine on the Hadoop cluster. The engine runs the necessary jobs to read, process, and write data as per the query.

Result Retrieval:

The results of the query execution are collected and returned to the user.

Example: Hive Query Compilation

Consider the following HiveQL query:

SELECT emp_name, emp_salary
FROM employees
WHERE emp_salary > 70000;

Here’s how this query would be processed:

Parsing:

The query is parsed into an AST:

SELECT clause: emp_name, emp_salary

FROM clause: employees

WHERE clause: emp_salary > 70000

Semantic Analysis:

Verify employees table exists.

Check columns emp_name and emp_salary are valid.

Validate data types and consistency.

Logical Plan Generation:

Initial logical plan:

Scan employees table.

Filter rows where emp_salary > 70000.

Project columns emp_name and emp_salary.

Optimization:

Apply filter pushdown to optimize data scanning.

Optimize projections and eliminate unnecessary columns.

Physical Plan Generation:

Convert the optimized logical plan into a physical plan for execution:

MapReduce jobs to scan, filter, and project data.

Query Execution:

The execution engine (e.g., Tez) runs the physical plan:

Map tasks scan and filter employees table.

Reduce tasks collect and project the results.

Result Retrieval:

The results are gathered and returned to the user.

The Hive compiler plays a critical role in transforming HiveQL queries into executable jobs on a Hadoop cluster. By leveraging various stages such as parsing, semantic analysis, optimization, and physical plan generation, the Hive compiler ensures that queries are executed efficiently and correctly. Understanding this process helps users optimize their queries and better utilize Hive’s capabilities for big data processing.

5. Optimizer

The Optimizer enhances the logical plan generated by the compiler by applying various transformation and optimization techniques. These techniques include predicate pushdown, partition pruning, and join optimization to improve the efficiency of the execution plan.

6. Execution Engine

The Execution Engine takes the optimized query plan and executes it as a series of MapReduce, Tez, or Spark jobs on the Hadoop cluster. It is responsible for task execution, resource management, and job monitoring.

7. Hive Server

The Hive Server provides a Thrift interface for clients to execute Hive queries. There are two main versions:

HiveServer1: Older version, supports limited concurrency.

HiveServer2: Improved version with enhanced concurrency and security features.

The command line client currently only supports an embedded server. The JDBC and Thrift-Java clients support both embedded and standalone servers. Clients in other languages only support standalone servers.

HiveServer is an optional service that allows a remote client to submit requests to Hive, using a variety of programming languages, and retrieve results. HiveServer is built on Apache Thrift, therefore it is sometimes called the Thrift server although this can lead to confusion because a newer service named HiveServer2 is also built on Thrift. Since the introduction of HiveServer2, HiveServer has also been called HiveServer1.

8. Hive Client

Hive clients interact with the Hive Server to submit queries and retrieve results. There are various interfaces for clients:

Command Line Interface (CLI): Text-based interface for executing HiveQL commands.

Web Interface (Hive Web UI): Web-based interface for managing and querying Hive.

JDBC/ODBC: Standard database connectivity interfaces for integrating Hive with other applications.

9. HDFS

HDFS (Hadoop Distributed File System) is the storage layer where Hive tables and data files are stored. Hive leverages HDFS for scalable and fault-tolerant storage.

Detailed Step-by-Step Workflow

Step 1: Query Submission

User Interaction: The user submits a HiveQL query through one of the Hive clients, such as the CLI, Web UI, or JDBC/ODBC interface.

Step 2: Session Creation

Hive Server: The query is received by the Hive Server, which manages the session and query execution context.

Step 3: Query Parsing

Driver: The Driver component of Hive initiates the query execution process. It creates a session for the query and manages its lifecycle.

Compiler (Parsing): The query is sent to the Compiler, where it is parsed to check for syntax errors. The parsed query is converted into an Abstract Syntax Tree (AST).

Step 4: Logical Plan Creation

Compiler (Logical Plan): The AST is transformed into a logical plan. This plan represents the sequence of operations that need to be performed to execute the query, but in an abstract manner without specifying the execution engine.

Step 5: Query Optimization

Optimizer: The logical plan is passed to the Optimizer, which applies various optimization techniques such as:

Predicate pushdown: Filtering data as early as possible.

Partition pruning: Only reading relevant partitions.

Join optimization: Reordering joins to reduce data shuffling.

Aggregation pushdown: Performing aggregations early.

Step 6: Physical Plan Creation

Compiler (Physical Plan): The optimized logical plan is converted into a physical plan, which includes the detailed operations and their execution order. This plan is tailored to run on the chosen execution engine (MapReduce, Tez, or Spark).

Step 7: Plan Execution

Execution Engine: The physical plan is submitted to the Execution Engine. Depending on the execution engine used, the following happens:

MapReduce: The plan is divided into a series of Map and Reduce tasks that are executed across the Hadoop cluster.

Tez: Tasks are executed as a directed acyclic graph (DAG) of processing stages, which is more efficient than MapReduce for many operations.

Spark: Tasks are executed using Spark's in-memory processing capabilities, which can be faster than both MapReduce and Tez for certain workloads.

Step 8: Data Retrieval and Storage

HDFS Interaction: During execution, the Execution Engine interacts with HDFS to read input data and write output data. Hive tables are essentially directories in HDFS, and each table or partition is stored as a collection of files in these directories.

Step 9: Result Compilation and Return

Driver: The Driver collects the results from the Execution Engine. If the query produces an output, the results are returned to the user through the Hive client.

Step 10: Session Termination

Driver: The session is terminated, and any resources allocated for the query are released.

Example Workflow

Let’s go through a simple example to see how these steps play out in a practical scenario.

Example Query

SELECT department, AVG(salary)
FROM employees
GROUP BY department;

Query Submission: The user submits the query through the CLI.

Session Creation: Hive Server receives the query and creates a session.

Query Parsing: The Driver sends the query to the Compiler, which parses it and generates an AST.

Logical Plan Creation: The AST is converted into a logical plan that includes reading from the employees table, grouping by department, and calculating the average salary.

Query Optimization: The Optimizer applies optimizations like predicate pushdown (if there are filters), and partition pruning (if applicable).

Physical Plan Creation: The optimized logical plan is converted into a physical plan, specifying tasks for MapReduce.

Plan Execution: The physical plan is executed on the Hadoop cluster using MapReduce, reading data from HDFS, processing it, and writing intermediate and final results back to HDFS.

Data Retrieval and Storage: The Execution Engine interacts with HDFS to perform read/write operations.

Result Compilation and Return: The Driver collects the results and returns them to the CLI.

Session Termination: The session is terminated, and resources are cleaned up.

HintsToday

recent posts

about