One of Cask’s core goals is making a reasonably-experienced Java developer’s life much easier when building Hadoop applications. My summer project was aligned with the company’s effort to take this to the next level by lowering the barrier to entry for using Hadoop even further — Java proficiency not required.
I spent my summer writing an extensible, configurable CDAP application that helps its users aggregate data-quality metrics for their Datasets over periods of time. The Datasets can be backed by HBase, HDFS, or other data storage systems. The high-level objectives for building this application were:
- Data scientists and developers can assess the quality of any Dataset before experimenting on it or building an application using it,
- Users can specify the types of metrics they want to gather (e.g. histogram, unique values, mean, standard deviation) for each Dataset field,
- Users can specify and assess the quality of their data at regular intervals, and
- Advanced users can build custom metrics — to be recorded for the fields of a given Dataset.
Let’s take a concrete example to describe how this application can be used. Assume we have a popular website that serves web pages to its users through a web server. User activity on the website is logged (e.g. in the Combined Log Format, or CLF) and transported to a Hadoop cluster using technologies such as Kafka or Flume. Once the data has landed in Hadoop, one can perform analytics, build a 360-degree view of users, and so on. The website activity log, at a minimum, includes activity time, status code, pages accessed, response size, and referrer. Once the activity data is loaded into Hadoop, multiple teams of developers may want to build different (ETL) pipelines, perform exploratory analysis, or build models. Before building an application or running experiments, it’s important to know that the data is of the right quality — and that quality information should be easily accessible.
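For reference, a combined-format (CLF-style) log entry looks like this — one line per request, with the timestamp, request, status code, response size, and referrer all in fixed positions:

```
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
```

Fields like `200` (status code) and `2326` (response size in bytes) are exactly the kind of columns one would want quality metrics on.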
The data quality application gives users an extensible framework for determining the quality of their data. Users can rely on out-of-the-box aggregation functions to get started. The resulting quality-of-data dataset can then be easily queried using a RESTful API.
Getting started with the application is very simple. Users can deploy it using either the CDAP RESTful API or the CLI, along with these parameters specified as JSON:
- Name and format of the Dataset,
- Frequency at which the quality of data should be measured, and
- Configuration specifying the quality measure to be used for the fields in the dataset. One or more quality measures can be associated with a field.
To set this up, no code is required.
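As an illustration, a configuration covering those three parameters for the web-log example might look something like this (the property names and structure here are hypothetical, chosen for readability — they are not the application’s exact schema):

```json
{
  "datasetName": "accessLogs",
  "format": "clf",
  "frequency": "1h",
  "fields": [
    { "name": "status_code",   "aggregations": ["DiscreteValuesHistogram"] },
    { "name": "response_size", "aggregations": ["Mean", "StandardDeviation"] }
  ]
}
```

Note that one field can carry several quality measures at once, as `response_size` does here.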
Once the application is created and started, the Workflow in the Application is scheduled to run at the frequency specified by the configuration. The Workflow triggers a MapReduce (and, in the future, a Spark) program that fetches data from the source, applies the pre-specified aggregations to their respective fields, and writes the aggregated data to a Timeseries Table. The aggregated data can then be queried using a Service included in the Application.
The application comes with a library of aggregation functions for users to use right off the bat: histogram on a set of discrete values, histogram with bucketing, average, standard deviation, and unique values. Aggregation functions can easily be extended using the plugin APIs exposed by the data quality application.
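To give a feel for these functions, a histogram with bucketing can be computed in a single pass with a fixed amount of state. A minimal sketch — my own illustration, not the shipped implementation:

```java
// Bucketed histogram: counts values into fixed-width buckets in one pass.
// Memory is O(number of buckets), independent of the number of values.
class BucketedHistogram {
    private final double min;
    private final double width;
    private final long[] counts;

    BucketedHistogram(double min, double max, int buckets) {
        this.min = min;
        this.width = (max - min) / buckets;
        this.counts = new long[buckets];
    }

    void add(double x) {
        int i = (int) ((x - min) / width);
        if (i < 0) i = 0;                          // clamp values below range
        if (i >= counts.length) i = counts.length - 1;  // clamp values above range
        counts[i]++;
    }

    long[] counts() {
        return counts;
    }
}
```

For example, with buckets [0, 5) and [5, 10] over response sizes, the histogram only ever stores two counters no matter how many log lines flow through.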
One thing I was mindful of throughout the course of the project was that this application needed to support extremely large amounts of data. As such, when I was building the aggregation function APIs and the library of aggregation functions, I had to keep two priorities in mind: store as little data as possible in memory and make as few passes through the data as possible. I started to favor streaming algorithms (I used Welford’s streaming algorithm for computing standard deviation, for example) when writing the aggregation functions.
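Welford’s algorithm is a good example of those two priorities in action: it keeps only a count, a running mean, and a running sum of squared deviations, so standard deviation over arbitrarily many values needs one pass and O(1) memory. A minimal Java sketch (class and method names are my own, not the application’s):

```java
// Welford's streaming algorithm for standard deviation:
// one pass over the data, constant memory.
class WelfordStdDev {
    private long n;       // number of values seen so far
    private double mean;  // running mean
    private double m2;    // running sum of squared deviations from the mean

    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);  // note: uses the *updated* mean
    }

    double mean() {
        return mean;
    }

    // Population standard deviation; use m2 / (n - 1) for the sample version.
    double stdDev() {
        return n > 0 ? Math.sqrt(m2 / n) : 0.0;
    }
}
```

Beyond its memory profile, this formulation is also numerically more stable than the naive sum-of-squares approach, which matters when aggregating billions of values.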
I also had to keep the design details of Hadoop in mind when designing the aggregation function interfaces. For instance, the reducer receives a field and an iterable of values, and the iterable can only be traversed once. This means we ideally want to design an API that only looks at one value at a time and keeps a running aggregation. This also means that aggregations that need multi-pass algorithms would require storing all the data in memory — which is best to avoid.
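That single-traversal constraint can be captured in a small contract: an aggregation sees each value exactly once and keeps a running state. A sketch of the idea, with hypothetical names rather than the actual CDAP plugin API:

```java
// Single-pass aggregation contract: the reducer feeds each value once,
// and the implementation keeps only a small running state.
interface StreamingAggregation<V, R> {
    void add(V value);  // incorporate one value from the reducer's iterable
    R aggregate();      // produce the final result after the single pass
}

// Running mean as an instance of the contract: two fields of state,
// no matter how many values flow through.
class RunningMean implements StreamingAggregation<Double, Double> {
    private long count;
    private double sum;

    @Override
    public void add(Double value) {
        count++;
        sum += value;
    }

    @Override
    public Double aggregate() {
        return count == 0 ? 0.0 : sum / count;
    }
}
```

An aggregation that cannot be expressed this way (say, an exact median) would have to buffer the whole iterable in memory, which is exactly the failure mode the design tries to avoid.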
Overall, I had a great time creating and playing around with this application. I got to experience CDAP from a number of different perspectives throughout the course of building this application. I found the whole concept of lowering the barrier to entry for using Hadoop extremely exciting, and I’m looking forward to seeing what Cask has in store within the realm of higher-level abstractions.