Hive as the SQL engine
SQL is a widely adopted standard for ad-hoc querying, and it’s a great way to make it easy for users to interact with Datasets. While there are a number of existing SQL engines in the Hadoop ecosystem, we decided to go with Apache Hive because one of its core components, Hive Metastore (which stores metadata about Hive tables), is also used by other existing SQL engines like Presto and Impala. This means that later integrations of those engines will be easier.
Datasets are non-native, external tables from Hive’s point of view. Hive has a well-defined interface for interacting with non-native tables via Storage Handlers. To integrate Hive with Datasets, we implemented a custom Dataset Storage Handler to manage the interactions. The Dataset Storage Handler is composed of:
- a DatasetInputFormat that knows how to retrieve Dataset Records from Datasets, and
- a DatasetSerDe that converts the records into objects that Hive can work with.
The data flow between Hive and Datasets is depicted below:
Hive as a Service in Reactor
Typically, a Hive installation consists of a Hive Metastore server that contains the metadata and a Hive Server that runs the user queries. In our case, we can connect to an existing Hive Metastore for metadata. However, connecting to an existing Hive server has the following limitations:
- Hive Server does not have interfaces that allow access to the complete query life cycle. This limits transaction management while dealing with Datasets.
- Hive Server runs user code in server space, which makes the service vulnerable to bad code.
- Hive Server’s HTTP interface does not support security and multi-tenancy.
To overcome these limitations, we created a new HTTP-based Hive Server that addresses the above issues by wrapping Hive’s CLIService and integrating with Reactor’s Datasets and transactions engine Tephra. We chose to run the Hive Server in YARN containers for proper user code and resource isolation, as well as scalability.
Hive-Reactor Integration challenges
Since Hive jars bundle disparate libraries like Procotol Buffers and Guava, we came across a recurrent classloading issue. This is because Reactor uses different versions of these libraries. We had to make sure that the classpaths of the different components that run Hive in Reactor have their jars in a particular order. We fixed the classpath order in various places by doing the following:
- In Reactor Master: since we only launch Hive service from Reactor Master, it was sufficient to isolate Hive jars in a separate classloader to avoid any library version conflicts.
- In the YARN container running Hive service: we set the classpath of the container in the right order during its initialization since we have control over it.
- In the mapreduce jobs launched by Hive in the YARN cluster: we set the Hadoop configuration setting mapreduce.job.user.classpath.first to true so that Reactor jars contained in the hive.aux.jars configuration would come earlier than Hive jars in the classpath of those mapreduce jobs.
- In the local mapreduce jobs launched in the same YARN container as the Hive service: we changed the way Hadoop classpath was set by modifying the HADOOP_CLASSPATH environment variable. We also set the environment variable HADOOP_USER_CLASSPATH_FIRST to true so that the HADOOP_CLASSPATH content would come earlier than Hive jars in the classpath of the local mapreduce jobs.
We are working hard to enable a more fun and productive user experience for data exploration in Hadoop. One of the features that we plan on introducing in the future is a JDBC driver to connect third-party BI tools to Reactor Datasets. This will make our platform even more accessible to all users who want to work with data in Hadoop.
And if this work sounds exciting to you, check out our careers page and send us your resume!