Data lakes comprise unstructured, semi-structured, and structured data. A common practice at many companies is to bring in data from relational data stores so that various consumers can use it. In this blog post, we will describe how to quickly and easily build a data pipeline with Cask Hydrator: we will get data from a database source, store it in time-partitioned files, and then enable SQL access on top of the data.
Cask Hydrator, a 100% open source extension of the Cask Data Application Platform (CDAP), allows building data pipelines through a simple drag-and-drop user interface. Here is an earlier blog post introducing Hydrator.
Let’s consider a scenario where an e-commerce company is storing registration information in a MySQL database. The company wishes to take a daily dump of registration data and store it in the Hadoop file system in a way that can easily be addressed by time and explored in an ad-hoc manner using SQL queries.
Let’s see how to build a data pipeline to ingest this data. The Hydrator Studio view is a visual editor for creating a logical data pipeline.
The first step in building the data pipeline is configuring the source. Hydrator supports Database sources out-of-the-box using the Database plugin.
Loading JDBC drivers
The Database Source plugin uses JDBC drivers to talk to the database and bring the data in. For this purpose, you need to deploy the appropriate JDBC driver into Hydrator. This particular use case deals with loading the data from a MySQL database, and the steps below highlight loading the database driver.
Step 1: Download the appropriate version of the JDBC driver JAR. We will use MySQL driver v5.1.38.
Step 2: Load the JDBC driver using CDAP REST APIs
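As a sketch of Step 2, the driver JAR can be deployed as a CDAP artifact whose plugin entry makes the driver class available to Hydrator under a chosen name. Host, port, and artifact name below are placeholders for illustration; check the CDAP REST API reference for your version:

```python
import json

# Hypothetical CDAP standalone endpoint and artifact name -- adjust for your setup.
CDAP_HOST = "http://localhost:10000"
ARTIFACT = "mysql-connector-java"

# Expose the driver class to Hydrator as a plugin of type "jdbc" named
# "mysql"; this name is referenced later when configuring the source.
plugins_header = json.dumps([{
    "name": "mysql",
    "type": "jdbc",
    "className": "com.mysql.jdbc.Driver",
}])

url = f"{CDAP_HOST}/v3/namespaces/default/artifacts/{ARTIFACT}"
headers = {"Artifact-Plugins": plugins_header}

# The JAR bytes go in the request body, e.g. with the `requests` library:
#   requests.post(url, headers=headers,
#                 data=open("mysql-connector-java-5.1.38.jar", "rb"))
```

Once the POST succeeds, the “mysql” plugin name can be used in any Database source or sink configuration.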
Configuring the Source
To configure the source, choose the Database source and provide the required parameters:
- The user name and password to connect to the database.
- The JDBC plugin name to be used for the database connection; that is, the name we used in Step 2, namely “mysql”.
- The JDBC connection string.
- The SELECT query used for importing the data.
You can optionally set the number of splits to 1 to fetch all the data in a single map task.
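Put together, the source configuration might look like the sketch below. All values (reference name, host, credentials, query, and field names) are illustrative, not prescribed by Hydrator:

```python
# Sketch of Database source plugin properties as they might appear in a
# pipeline's JSON configuration. All values here are illustrative.
source_properties = {
    "referenceName": "registrations",        # hypothetical name for lineage
    "jdbcPluginName": "mysql",               # the plugin name chosen in Step 2
    "connectionString": "jdbc:mysql://db.example.com:3306/shop",
    "user": "ecommerce",
    "password": "******",
    "importQuery": "SELECT id, name, email, signup_ts FROM registrations",
    "numSplits": "1",                        # one map task fetches all rows
}
```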
Configuring the Destination
The next step in building this pipeline is to configure the sink or destination for the imported data. Hydrator provides capabilities to store data in time-partitioned directories via a built-in CDAP Dataset called Time-partitioned File Set. Once the data is stored in the fileset, CDAP automatically adds a partition which can be queried using Hive.
In this use case we will configure a Time-partitioned File Set that stores data in Avro format by using TPFSAvro as the sink.
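A TPFSAvro sink needs the fileset name and the Avro schema of the records being written. The sketch below uses an illustrative schema for registration data; the dataset name and field names are assumptions, not fixed by the plugin:

```python
import json

# Illustrative Avro schema for the registration records being ingested.
schema = {
    "type": "record",
    "name": "registration",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string"},
        {"name": "signup_ts", "type": "long"},
    ],
}

# Sketch of TPFSAvro sink properties; the dataset name is a placeholder.
sink_properties = {
    "name": "registrations",       # Time-partitioned File Set to write to
    "schema": json.dumps(schema),  # records are stored in Avro format
}
```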
Connecting the Source and the Destination
Now let’s connect the source and the sink. The schema configured in the source is automatically copied to the sink and used there.
And finally, running the data pipeline
We have now built the logical pipeline for bringing data in from the database. The next step is to publish the pipeline, which converts the logical pipeline into a physical execution plan using workflows of either MapReduce or Spark jobs.
Simply click Run to execute the pipeline once. A pipeline run launches a workflow that runs a MapReduce program, which reads the data from the database source and writes it to the Time-partitioned File Set sink. The pipeline can also be scheduled to run periodically.
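For the daily dump in our scenario, a periodic run could be expressed with a cron-style schedule attached to the pipeline. The field names below are a sketch, and the exact configuration shape depends on your CDAP version:

```python
# Hypothetical schedule entry: run the pipeline every day at 1 AM.
schedule = {
    "name": "dailyRegistrationDump",  # illustrative schedule name
    "cronExpression": "0 1 * * *",    # minute hour day-of-month month day-of-week
}
```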
Exploring the data in Hadoop
The fetched data can be explored using Hive queries from the Explore page of CDAP, where users can issue any Hive QL query.
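Explorable CDAP datasets surface as Hive tables whose names are prefixed with “dataset_”. As a sketch, a query against our fileset could also be submitted programmatically through CDAP’s Explore REST endpoint; the dataset name and query below are placeholders:

```python
import json

# Hypothetical Hive QL query over the Time-partitioned File Set; CDAP
# exposes explorable datasets as Hive tables prefixed with "dataset_".
query = "SELECT email, signup_ts FROM dataset_registrations LIMIT 10"

# JSON payload for CDAP's Explore REST API (POST .../data/explore/queries).
payload = json.dumps({"query": query})
```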
We demonstrated how Hydrator can be used to easily build data ingestion pipelines with a few clicks. Learn more by downloading CDAP, which includes Hydrator, and try out your own use cases. If you have any questions, please reach out to the CDAP user community group. And if you are in San Jose on June 28-30 for the Hadoop Summit, please swing by the Cask booth and ask me or my team for a demo!