Hive Short Tutorial

Admin, Student's Library
0
Top 20 Apache Hive Interview Questions (With Detailed Answers)

Top 20 Apache Hive Interview Questions

Curated Q&A with clear explanations, tables, and examples—perfect for last‑minute prep or a technical blog post.

HiveQL Hadoop Partitioning ORC/Parquet Tez/Spark

This guide covers the most common Apache Hive interview questions with crisp, practical answers. It includes differences (Hive vs. RDBMS, ORDER BY vs. SORT/DISTRIBUTE/CLUSTER), file formats, execution engines, ACID, UDFs, and performance tips. Copy, tweak, and publish as a standalone blog post.

1) What is Apache Hive?

Hive is a data warehouse system on Hadoop that lets you query large datasets in distributed storage using SQL‑like HiveQL. Hive translates queries into MapReduce, Tez, or Spark jobs and returns results to the client.

2) What are the key features of Hive?

  • SQL‑like querying with HiveQL
  • Schema‑on‑read for flexible ingestion
  • Partitioning and bucketing for pruning and joins
  • Extensibility via UDFs and SerDes
  • Works over HDFS, S3, and compatible stores

3) Hive vs. RDBMS

FeatureHiveRDBMS
SchemaSchema‑on‑readSchema‑on‑write
WorkloadsAnalytics, batchOLTP, transactions
TransactionsLimited (ACID tables)Full ACID
EngineMR/Tez/SparkRDBMS engine
StorageHDFS/S3Local/Network FS

4) Internal vs. External tables

Internal (managed) tables are fully owned by Hive—dropping the table deletes both metadata and data. External tables store data outside Hive’s warehouse; dropping only removes metadata, preserving files.

-- External table example
CREATE EXTERNAL TABLE sales_ext (
  id BIGINT, amount DOUBLE
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3://bucket/sales/';

5) What are partitions and buckets?

Partitioning splits data into directories by key (e.g., country, dt) so queries can skip irrelevant files. Bucketing hashes rows into a fixed number of files per partition—useful for map‑side joins and sampling.

CREATE TABLE events (
  user_id BIGINT, action STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

6) Hive Metastore

Stores table schemas, locations, partitions, and stats. Deployment modes:

  • Embedded (Derby): single‑user dev.
  • Local (MySQL/Postgres): prod on one HiveServer2.
  • Remote: shared across multiple HiveServer2 instances.

7) SerDes

Serializer/Deserializer modules read/write custom formats. Built‑ins include OpenCSVSerde, JsonSerDe, RegexSerDe; you can implement your own for bespoke logs.

8) How a Hive query executes

  1. Parse & analyze HiveQL
  2. Optimize and create a logical/physical plan
  3. Translate to MR/Tez/Spark DAG
  4. Submit to YARN; read data from HDFS/S3
  5. Return results through HiveServer2/CLI

9) Supported file formats

  • ORC: best with Hive; compression, indexes, vectorization
  • Parquet: columnar, cross‑engine
  • Avro: row‑based, schema evolution
  • Text/CSV, JSON, SequenceFile
CREATE TABLE users_orc (...) STORED AS ORC;
SET hive.exec.orc.default.stripe.size = 268435456; -- 256 MB

10) Hive modes

  • Local: single JVM, quick tests
  • Distributed: MapReduce/Tez/Spark on a cluster

11) Indexes in Hive

Indexes (compact/bitmap) can prune data scans, but are less common with modern columnar formats and partition pruning. Prefer ORC/Parquet with stats.

12) Schema‑on‑read

Hive applies the schema during read time, allowing ingestion of semi‑structured data without strict loading rules. This trades write simplicity for read‑time validation.

13) ORDER BY vs SORT BY vs DISTRIBUTE BY vs CLUSTER BY

  • ORDER BY: total order; single reducer → slow on big data.
  • SORT BY: sorts within each reducer; partial ordering.
  • DISTRIBUTE BY: controls reducer partitioning (hash).
  • CLUSTER BY: DISTRIBUTE BY + SORT BY on the same key.
SELECT * FROM t DISTRIBUTE BY country SORT BY country, dt;

14) Hive vs. Pig

Hive targets SQL users and BI; Pig targets developers with data‑flow scripts. Hive is the de‑facto standard today for SQL‑style analytics on Hadoop.

15) NULL handling

  • Use IS NULL/IS NOT NULL for predicates
  • Aggregations typically ignore NULLs (e.g., AVG)
  • Sorting places NULLs first by default

16) Static vs. Dynamic partitioning

Static: partition values provided in the DML. Dynamic: Hive infers partition keys from data—enable carefully to avoid tiny files.

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE sales PARTITION (dt)
SELECT user_id, amount, dt FROM staging_sales;

17) UDF, UDAF, UDTF

  • UDF: per‑row transformation
  • UDAF: aggregates multiple rows → one value
  • UDTF: one row → multiple rows (e.g., explode)
SELECT user_id, item FROM clicks LATERAL VIEW explode(items) e AS item;

18) Execution engines

Hive can run on MapReduce (stable, slower), Tez (DAG‑based, faster), and Spark (in‑memory, fastest for iterative/interactive workloads).

19) ACID transactions

Enable row‑level INSERT/UPDATE/DELETE on transactional ORC tables with snapshot isolation. Requires compaction and proper configs.

SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=2;

CREATE TABLE acct (
  id BIGINT, balance DECIMAL(18,2)
) CLUSTERED BY (id) INTO 16 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional'='true');

20) Performance optimization tips

  • Prefer ORC/Parquet with compression; enable vectorization
  • Partition on common filters; bucket on common join keys
  • Use ANALYZE TABLE ... COMPUTE STATISTICS
  • Switch engine to Tez/Spark
  • Avoid small files; use compaction and sized output
SET hive.vectorized.execution.enabled=true;
SET hive.execution.engine=tez;
ANALYZE TABLE sales PARTITION (dt) COMPUTE STATISTICS;
SET mapreduce.job.reduces=64; -- example sizing
Pro tip: Pair Hive with ORC + Tez/Spark and good partition design. That combo answers 80% of performance questions in interviews.
Back to top
© Your Brand. All rights reserved.

Post a Comment

0 Comments
* Please Don't Spam Here. All the Comments are Reviewed by Admin.
Post a Comment (0)
Our website uses cookies to enhance your experience. Learn More
Accept !