I am very happy to announce the general availability of our flagship product, the Cask Data Application Platform (CDAP), version 3.4. This release introduces a fresh new look for Cask Hydrator, along with improvements that extend it beyond data ingestion use cases, such as building aggregations and performing data science on the ingested data. The release adds Cask Tracker, a new CDAP extension that automatically captures rich metadata and provides users with visibility into how data flows into, out of, and within a Data Lake. This release also adds significant improvements to Spark and Spark Streaming support in CDAP, with simple, easy-to-use APIs as well as a new runtime that provides access to core CDAP features like Datasets and Transactions in Spark programs. We have also made a number of platform improvements to the metrics, logging and workflow infrastructure to improve usability for CDAP users.
Cask Hydrator is a CDAP extension that was first introduced a few months ago to address the common problem of continuous data ingestion for batch and streaming use cases. From the earlier release we learned that our customers often need to solve bigger problems that go beyond data ingestion. With CDAP v3.4, Hydrator provides the following new capabilities:
- Perform aggregations to collect insights at a feed level; examples include groupBy operations, computing distinct values, deduplicating data and the like
- Include data science operations as part of a data pipeline to build machine learning models and score them
- Provide post-action hooks that run after the pipeline completes, allowing users to send an email, perform a database operation or call an HTTP endpoint at the end of the pipeline run
- Run the data pipeline in Spark or MapReduce
All of the above features are implemented as plugins, a powerful Hydrator feature that allows users to extend Hydrator with their own aggregation, model, compute and post-run actions.
We have also refreshed the Hydrator UI in v3.4 to provide a cleaner, sleeker interface and added the ability to upload user plugin artifacts. We have included many usability enhancements to improve the user experience of creating data pipelines with Hydrator.
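To make the aggregation operations mentioned above more concrete, here is a minimal sketch in plain Java of what groupBy, distinct and dedupe mean for a small feed. It does not use the Hydrator plugin API; the `Event` type, its fields and the method names are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only: plain Java collections, not the Hydrator plugin API.
class AggregationSketch {

    /** Hypothetical record standing in for one row of an ingested feed. */
    static final class Event {
        final String id;
        final String user;
        final long bytes;

        Event(String id, String user, long bytes) {
            this.id = id;
            this.user = user;
            this.bytes = bytes;
        }
    }

    /** groupBy: sum the bytes of all events that share the same user. */
    static Map<String, Long> bytesByUser(List<Event> feed) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (Event e : feed) {
            totals.merge(e.user, e.bytes, Long::sum);
        }
        return totals;
    }

    /** distinct: the set of users that appear in the feed. */
    static Set<String> distinctUsers(List<Event> feed) {
        Set<String> users = new LinkedHashSet<>();
        for (Event e : feed) {
            users.add(e.user);
        }
        return users;
    }

    /** dedupe: keep only the first event seen for each id. */
    static List<Event> dedupeById(List<Event> feed) {
        Map<String, Event> firstById = new LinkedHashMap<>();
        for (Event e : feed) {
            firstById.putIfAbsent(e.id, e);
        }
        return new ArrayList<>(firstById.values());
    }

    public static void main(String[] args) {
        List<Event> feed = Arrays.asList(
            new Event("e1", "alice", 10L),
            new Event("e2", "bob", 5L),
            new Event("e1", "alice", 10L),  // duplicate delivery of e1
            new Event("e3", "alice", 20L));

        List<Event> unique = dedupeById(feed);
        System.out.println("events after dedupe: " + unique.size());
        System.out.println("bytes by user: " + bytesByUser(unique));
        System.out.println("distinct users: " + distinctUsers(feed));
    }
}
```

Deduplicating before grouping matters here: a duplicated event would otherwise be counted twice in the per-user total.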
While Hydrator enables organizations to build frictionless, easy-to-use data pipelines, it is equally important to provide users with visibility into the data flowing into and out of Data Lakes created with Hydrator. Cask Tracker enables data discovery and provenance, and automatically maintains an audit trail of any data flowing through Hydrator, so that users can answer questions such as:
- What datasets are available in the system that match specific criteria?
- How is this data being generated?
- Which programs are accessing my data?
Tracker has been built as a CDAP extension with a backend app that provides the necessary indexing, computes the audit trail and surfaces other relevant information such as schema and system/user properties. Tracker offers a clean UI that presents data engineers and data scientists with pertinent information about the provenance, audit history and metadata of their data. It also integrates with Cloudera Navigator out of the box.
Enhanced Spark support
In this release, we have provided a number of enhancements to support running Spark for both streaming and non-streaming use cases. We have integrated Spark with core CDAP components such as datasets and transactions, which allows users to read and write data transactionally. This feature provides platform support for developing Spark Streaming applications with exactly-once processing guarantees. We have enhanced the support for Spark programs by integrating them with the metrics and logging subsystems, and by making it easy for application developers to operate applications involving Spark programs. We have also added support for forking and running multiple Spark programs in parallel in CDAP workflows.
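The link between transactional writes and exactly-once processing can be illustrated with a small, self-contained Java sketch. It is not CDAP or Spark code: `TransactionalStore` is a hypothetical in-memory stand-in for a transactional dataset, and the batch ids are assumed to be assigned in increasing order by the streaming runtime. The idea it shows is that when a batch's results and its "processed" marker are committed in one atomic step, a retried batch has no additional effect.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: an in-memory stand-in, not the CDAP dataset or transaction API.
class ExactlyOnceSketch {

    /** Hypothetical store that commits results and progress together. */
    static final class TransactionalStore {
        private Map<String, Long> counts = new HashMap<>();
        private long lastCommittedBatch = -1L;

        /**
         * Applies the deltas and records the batch id as one atomic step.
         * Returns false, and changes nothing, if the batch was already committed.
         */
        synchronized boolean commitBatch(long batchId, Map<String, Long> deltas) {
            if (batchId <= lastCommittedBatch) {
                return false;  // replay of an already committed batch
            }
            // Stage changes on a copy so a failure here leaves the store untouched.
            Map<String, Long> staged = new HashMap<>(counts);
            deltas.forEach((key, delta) -> staged.merge(key, delta, Long::sum));
            counts = staged;
            lastCommittedBatch = batchId;
            return true;
        }

        synchronized long count(String key) {
            return counts.getOrDefault(key, 0L);
        }
    }

    /** Computes per-word counts for one micro-batch. */
    static Map<String, Long> countWords(List<String> words) {
        Map<String, Long> deltas = new HashMap<>();
        for (String word : words) {
            deltas.merge(word, 1L, Long::sum);
        }
        return deltas;
    }

    public static void main(String[] args) {
        TransactionalStore store = new TransactionalStore();
        Map<String, Long> batch0 = countWords(Arrays.asList("spark", "cdap", "spark"));

        store.commitBatch(0L, batch0);
        // Simulate the runtime retrying batch 0 after a failure.
        boolean appliedAgain = store.commitBatch(0L, batch0);

        System.out.println("retry applied: " + appliedAgain);
        System.out.println("count for spark: " + store.count("spark"));
    }
}
```

Without the atomic commit, a failure between writing the counts and recording the batch id would cause the retry to count the same records twice.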
Additional platform improvements
We have made a number of other platform improvements to the metrics and logging systems, with the goal of providing reliable system and application metrics and logs to our users. We enhanced workflow capabilities by adding lifecycle methods that help users perform pre- and post-workflow actions. Furthermore, we increased the user’s visibility into workflow operations by providing the status of each workflow node. Last but not least, CDAP v3.4 includes ODBC drivers that allow Windows applications to access CDAP datasets.