Top 20 Apache Hive Interview Questions (With Detailed Answers)

Top 20 Apache Hive Interview Questions

Curated Q&A with clear explanations, tables, and examples—perfect for last‑minute prep or a technical blog post.

HiveQL Hadoop Partitioning ORC/Parquet Tez/Spark

This guide covers the most common Apache Hive interview questions with crisp, practical answers. It includes differences (Hive vs. RDBMS, ORDER BY vs. SORT/DISTRIBUTE/CLUSTER), file formats, execution engines, ACID, UDFs, and performance tips. Copy, tweak, and publish as a standalone blog post.

What is Apache Hive?
Key features of Hive
Hive vs. RDBMS
Internal vs. External tables
Partitions and Buckets
Hive Metastore
SerDes
Query execution flow
File formats
Hive modes

Indexes
Schema‑on‑read
ORDER BY vs SORT/DISTRIBUTE/CLUSTER
Hive vs. Pig
NULL handling
Static vs Dynamic partitions
UDF/UDAF/UDTF
Execution engines
ACID transactions
Performance optimization

1) What is Apache Hive?

Hive is a data warehouse system on Hadoop that lets you query large datasets in distributed storage using SQL‑like HiveQL. Hive translates queries into MapReduce, Tez, or Spark jobs and returns results to the client.

2) What are the key features of Hive?

SQL‑like querying with HiveQL
Schema‑on‑read for flexible ingestion
Partitioning and bucketing for pruning and joins
Extensibility via UDFs and SerDes
Works over HDFS, S3, and compatible stores

3) Hive vs. RDBMS

Feature	Hive	RDBMS
Schema	Schema‑on‑read	Schema‑on‑write
Workloads	Analytics, batch	OLTP, transactions
Transactions	Limited (ACID tables)	Full ACID
Engine	MR/Tez/Spark	RDBMS engine
Storage	HDFS/S3	Local/Network FS

4) Internal vs. External tables

Internal (managed) tables are fully owned by Hive—dropping the table deletes both metadata and data. External tables store data outside Hive’s warehouse; dropping only removes metadata, preserving files.

-- External table example
CREATE EXTERNAL TABLE sales_ext (
  id BIGINT, amount DOUBLE
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3://bucket/sales/';

5) What are partitions and buckets?

Partitioning splits data into directories by key (e.g., country, dt) so queries can skip irrelevant files. Bucketing hashes rows into a fixed number of files per partition—useful for map‑side joins and sampling.

CREATE TABLE events (
  user_id BIGINT, action STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

6) Hive Metastore

Stores table schemas, locations, partitions, and stats. Deployment modes:

Embedded (Derby): single‑user dev.
Local (MySQL/Postgres): prod on one HiveServer2.
Remote: shared across multiple HiveServer2 instances.

7) SerDes

Serializer/Deserializer modules read/write custom formats. Built‑ins include OpenCSVSerde, JsonSerDe, RegexSerDe; you can implement your own for bespoke logs.

8) How a Hive query executes

Parse & analyze HiveQL
Optimize and create a logical/physical plan
Translate to MR/Tez/Spark DAG
Submit to YARN; read data from HDFS/S3
Return results through HiveServer2/CLI

9) Supported file formats

ORC: best with Hive; compression, indexes, vectorization
Parquet: columnar, cross‑engine
Avro: row‑based, schema evolution
Text/CSV, JSON, SequenceFile

CREATE TABLE users_orc (...) STORED AS ORC;
SET hive.exec.orc.default.stripe.size = 268435456; -- 256 MB

10) Hive modes

Local: single JVM, quick tests
Distributed: MapReduce/Tez/Spark on a cluster

11) Indexes in Hive

Indexes (compact/bitmap) can prune data scans, but are less common with modern columnar formats and partition pruning. Prefer ORC/Parquet with stats.

12) Schema‑on‑read

Hive applies the schema during read time, allowing ingestion of semi‑structured data without strict loading rules. This trades write simplicity for read‑time validation.

13) ORDER BY vs SORT BY vs DISTRIBUTE BY vs CLUSTER BY

ORDER BY: total order; single reducer → slow on big data.
SORT BY: sorts within each reducer; partial ordering.
DISTRIBUTE BY: controls reducer partitioning (hash).
CLUSTER BY: DISTRIBUTE BY + SORT BY on the same key.

SELECT * FROM t DISTRIBUTE BY country SORT BY country, dt;

14) Hive vs. Pig

Hive targets SQL users and BI; Pig targets developers with data‑flow scripts. Hive is the de‑facto standard today for SQL‑style analytics on Hadoop.

15) NULL handling

Use IS NULL/IS NOT NULL for predicates
Aggregations typically ignore NULLs (e.g., AVG)
Sorting places NULLs first by default

16) Static vs. Dynamic partitioning

Static: partition values provided in the DML. Dynamic: Hive infers partition keys from data—enable carefully to avoid tiny files.

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE sales PARTITION (dt)
SELECT user_id, amount, dt FROM staging_sales;

17) UDF, UDAF, UDTF

UDF: per‑row transformation
UDAF: aggregates multiple rows → one value
UDTF: one row → multiple rows (e.g., explode)

SELECT user_id, item FROM clicks LATERAL VIEW explode(items) e AS item;

18) Execution engines

Hive can run on MapReduce (stable, slower), Tez (DAG‑based, faster), and Spark (in‑memory, fastest for iterative/interactive workloads).

19) ACID transactions

Enable row‑level INSERT/UPDATE/DELETE on transactional ORC tables with snapshot isolation. Requires compaction and proper configs.

SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=2;

CREATE TABLE acct (
  id BIGINT, balance DECIMAL(18,2)
) CLUSTERED BY (id) INTO 16 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional'='true');

20) Performance optimization tips

Prefer ORC/Parquet with compression; enable vectorization
Partition on common filters; bucket on common join keys
Use ANALYZE TABLE ... COMPUTE STATISTICS
Switch engine to Tez/Spark
Avoid small files; use compaction and sized output

SET hive.vectorized.execution.enabled=true;
SET hive.execution.engine=tez;
ANALYZE TABLE sales PARTITION (dt) COMPUTE STATISTICS;
SET mapreduce.job.reduces=64; -- example sizing

Pro tip: Pair Hive with ORC + Tez/Spark and good partition design. That combo answers 80% of performance questions in interviews.

S D L

Hive Short Tutorial

Top 20 Apache Hive Interview Questions

Table of Contents

1) What is Apache Hive?

2) What are the key features of Hive?

3) Hive vs. RDBMS

4) Internal vs. External tables

5) What are partitions and buckets?

6) Hive Metastore

7) SerDes

8) How a Hive query executes

9) Supported file formats

10) Hive modes

11) Indexes in Hive

12) Schema‑on‑read

13) ORDER BY vs SORT BY vs DISTRIBUTE BY vs CLUSTER BY

14) Hive vs. Pig

15) NULL handling

16) Static vs. Dynamic partitioning

17) UDF, UDAF, UDTF

18) Execution engines

19) ACID transactions

20) Performance optimization tips

Post a Comment

Popular Libraries

Popular Posts

हिंदुस्तान कॉपर लिमिटेड में भर्ती : 103 पदों के लिए सुनहरा अवसर

भारतीय वायु सेना ग्रुप 'C' सिविलियन भर्ती 2025

महिला पर्यवेक्षक एवं आंगनवाड़ी परीक्षा 2025: संपूर्ण जानकारी

संघर्ष और सपनों की कहानी: रिंकू सिंह और प्रिया सरोज की प्रेरणादायक जोड़ी

Categories

Social Plugin

Author Profile

Ads

ADDRESS:

Company

Footer Copyright

Contact form

S D L

Hive Short Tutorial

Table of Contents

1) What is Apache Hive?

2) What are the key features of Hive?

3) Hive vs. RDBMS

4) Internal vs. External tables

5) What are partitions and buckets?

6) Hive Metastore

7) SerDes

8) How a Hive query executes

9) Supported file formats

10) Hive modes

11) Indexes in Hive

12) Schema‑on‑read

13) ORDER BY vs SORT BY vs DISTRIBUTE BY vs CLUSTER BY

14) Hive vs. Pig

15) NULL handling

16) Static vs. Dynamic partitioning

17) UDF, UDAF, UDTF

18) Execution engines

19) ACID transactions

20) Performance optimization tips

You may like these posts

Post a Comment

Footer Copyright

Contact form