This summer, we joined Cask as interns to work on Cask Data Application Platform (CDAP). Our project, internally codenamed Caskalytics, was creating an internal Operational Data Lake. In reality, a data lake is just a concept focused on storing data from disparate sources (real-time or batch, structured, unstructured or semi-structured) in a single big data store, with the intent of allowing increased information generation and analysis.
Cask generates a variety of data from many sources, in various formats and with different velocities. Examples include datasets of page views and clicks from Cask websites (e.g. cask.co, docs.cask.co), community project pages (cdap.io and the like), product download statistics and Github activity for all Cask supported projects. The objectives of this project were to:
- Aggregate data from various sources into a central Data Lake
- Generate analytics on the data collected, and
- Dogfood CDAP: use CDAP in the same capacity as our customers would for creating a data lake, to test it before it is made available to our customers and users.
The creation of a central lake and the ability to generate analytics provided Cask with greater insights into how the users consume our website and use our products and services.
This blog describes the approach we took building the version 0.1 of a Cask Data Lake using CDAP. There is a lot more to be done, but we are very excited as we were able to learn CDAP and implement this within a few weeks — while having fun building it.
Data Flow Architecture
Data is aggregated from various sources in real-time and batch. Sources include Amazon Simple Service Queue (SQS), S3 buckets (Cloud Front Logs – CFL) and REST APIs for Github. Data is aggregated from sources into Time-Partitioned Fileset (TPFS). Datasets are then periodically analyzed, and statistics are generated and stored into Cube Datasets. Data from Cube Datasets are then served viaREST APIs using services and visualized using a custom UI built on NodeJS.
The real-time pipeline sources are web beacons, which are scripts on web pages. Every time a user visits any Cask page,these scripts push events containing access information into Amazon’s Simple Queue Service. The events are then Extracted, Transformed, and Loaded (ETL) into the Data Lake. The SQS plugin is used to ingest the events in real-time into a Stream (Streams are CDAP abstractions over HDFS with an HTTP endpoint; events in a Stream are time ordered and logically time partitioned). Events from the Stream are then parsed and processed into an OLAP Cube Dataset in real-time using a Tigon flow. The same events are also stored in the Data Lake using a separate batch ETL job for future analysis.
The Data Lake also includes batch pipelines that read events of various formats from Amazon S3 buckets. The data in these buckets are page views and clicks generated from CloudFront Logs. An Amazon S3 plugin is used in the batch ETL job to periodically aggregate data from S3. The ETL job keeps track of the data downloaded, optimizing the S3 download and further preventing counting duplicate downloads. This ETL job is a scheduled MapReduce job.
Another batch job periodically fetches repository statistics through the Github API. We use a custom action in a CDAP workflow for a lightweight method of executing specific code on a cron schedule. Similarly, this custom action loads events into the Lake whenever it fetches the statistics.
Exposing the Data
Through either a Tigon flowlet or an ETL batch job, the processing layer parses each log record and extracts desired information, including the URL, timestamp, IP, device, and browser. Once these details are written to a Cube Dataset, all aggregations of each data field across various time intervals are automatically created. From here, the user interface can query and display the information using the REST APIs exposed by a CDAP service.
Besides developing Caskalytics, we had many opportunities to also contribute back to CDAP. We developed new ETL plugins to read data from Amazon S3 and SQS. Furthermore, to simplify our data fetching from GitHub, we implemented a completely new CDAP feature: accessing datasets from custom actions in a workflow.
CDAP is a powerful platform, and using it, we were able to take an idea from inception to production in just 12 weeks! We had to deal with many challenges ranging from architecting to actually implementing and packaging what was necessary to complete the project. We also went through extensive reviews with senior engineering folks at each step during the project. Using CDAP helped us deal with Hadoop in a simplified way so that we could focus on other more pressing aspects of our project.
As interns, it was incredibly rewarding to be able to dive into the backend components of CDAP and implement new features. And towards the end of our 12-week tenure with Cask, we are proud to share that Caskalytics is now running in production delivering real-time statistics to the company!
If you are interested in solving such challenging problems and would like an amazing internship experience, apply to be an intern.