Top 20 Apache Hive Interview Questions
Curated Q&A with clear explanations, tables, and examples—perfect for last‑minute prep or a technical blog post.
This guide covers the most common Apache Hive interview questions with crisp, practical answers. It includes differences (Hive vs. RDBMS, ORDER BY vs. SORT/DISTRIBUTE/CLUSTER), file formats, execution engines, ACID, UDFs, and performance tips. Copy, tweak, and publish as a standalone blog post.
Table of Contents
1) What is Apache Hive?
Hive is a data warehouse system on Hadoop that lets you query large datasets in distributed storage using SQL‑like HiveQL. Hive translates queries into MapReduce, Tez, or Spark jobs and returns results to the client.
2) What are the key features of Hive?
- SQL‑like querying with HiveQL
- Schema‑on‑read for flexible ingestion
- Partitioning and bucketing for pruning and joins
- Extensibility via UDFs and SerDes
- Works over HDFS, S3, and compatible stores
3) Hive vs. RDBMS
Feature | Hive | RDBMS |
---|---|---|
Schema | Schema‑on‑read | Schema‑on‑write |
Workloads | Analytics, batch | OLTP, transactions |
Transactions | Limited (ACID tables) | Full ACID |
Engine | MR/Tez/Spark | RDBMS engine |
Storage | HDFS/S3 | Local/Network FS |
4) Internal vs. External tables
Internal (managed) tables are fully owned by Hive—dropping the table deletes both metadata and data. External tables store data outside Hive’s warehouse; dropping only removes metadata, preserving files.
-- External table example
CREATE EXTERNAL TABLE sales_ext (
id BIGINT, amount DOUBLE
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3://bucket/sales/';
5) What are partitions and buckets?
Partitioning splits data into directories by key (e.g., country
, dt
) so queries can skip irrelevant files. Bucketing hashes rows into a fixed number of files per partition—useful for map‑side joins and sampling.
CREATE TABLE events (
user_id BIGINT, action STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;
6) Hive Metastore
Stores table schemas, locations, partitions, and stats. Deployment modes:
- Embedded (Derby): single‑user dev.
- Local (MySQL/Postgres): prod on one HiveServer2.
- Remote: shared across multiple HiveServer2 instances.
7) SerDes
Serializer/Deserializer modules read/write custom formats. Built‑ins include OpenCSVSerde
, JsonSerDe
, RegexSerDe
; you can implement your own for bespoke logs.
8) How a Hive query executes
- Parse & analyze HiveQL
- Optimize and create a logical/physical plan
- Translate to MR/Tez/Spark DAG
- Submit to YARN; read data from HDFS/S3
- Return results through HiveServer2/CLI
9) Supported file formats
- ORC: best with Hive; compression, indexes, vectorization
- Parquet: columnar, cross‑engine
- Avro: row‑based, schema evolution
- Text/CSV, JSON, SequenceFile
CREATE TABLE users_orc (...) STORED AS ORC;
SET hive.exec.orc.default.stripe.size = 268435456; -- 256 MB
10) Hive modes
- Local: single JVM, quick tests
- Distributed: MapReduce/Tez/Spark on a cluster
11) Indexes in Hive
Indexes (compact/bitmap) can prune data scans, but are less common with modern columnar formats and partition pruning. Prefer ORC/Parquet with stats.
12) Schema‑on‑read
Hive applies the schema during read time, allowing ingestion of semi‑structured data without strict loading rules. This trades write simplicity for read‑time validation.
13) ORDER BY vs SORT BY vs DISTRIBUTE BY vs CLUSTER BY
- ORDER BY: total order; single reducer → slow on big data.
- SORT BY: sorts within each reducer; partial ordering.
- DISTRIBUTE BY: controls reducer partitioning (hash).
- CLUSTER BY: DISTRIBUTE BY + SORT BY on the same key.
SELECT * FROM t DISTRIBUTE BY country SORT BY country, dt;
14) Hive vs. Pig
Hive targets SQL users and BI; Pig targets developers with data‑flow scripts. Hive is the de‑facto standard today for SQL‑style analytics on Hadoop.
15) NULL handling
- Use
IS NULL
/IS NOT NULL
for predicates - Aggregations typically ignore NULLs (e.g.,
AVG
) - Sorting places NULLs first by default
16) Static vs. Dynamic partitioning
Static: partition values provided in the DML. Dynamic: Hive infers partition keys from data—enable carefully to avoid tiny files.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE sales PARTITION (dt)
SELECT user_id, amount, dt FROM staging_sales;
17) UDF, UDAF, UDTF
- UDF: per‑row transformation
- UDAF: aggregates multiple rows → one value
- UDTF: one row → multiple rows (e.g.,
explode
)
SELECT user_id, item FROM clicks LATERAL VIEW explode(items) e AS item;
18) Execution engines
Hive can run on MapReduce (stable, slower), Tez (DAG‑based, faster), and Spark (in‑memory, fastest for iterative/interactive workloads).
19) ACID transactions
Enable row‑level INSERT/UPDATE/DELETE
on transactional ORC tables with snapshot isolation. Requires compaction and proper configs.
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=2;
CREATE TABLE acct (
id BIGINT, balance DECIMAL(18,2)
) CLUSTERED BY (id) INTO 16 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional'='true');
20) Performance optimization tips
- Prefer ORC/Parquet with compression; enable vectorization
- Partition on common filters; bucket on common join keys
- Use
ANALYZE TABLE ... COMPUTE STATISTICS
- Switch engine to Tez/Spark
- Avoid small files; use compaction and sized output
SET hive.vectorized.execution.enabled=true;
SET hive.execution.engine=tez;
ANALYZE TABLE sales PARTITION (dt) COMPUTE STATISTICS;
SET mapreduce.job.reduces=64; -- example sizing