We are excited to announce the Cask Data Application Platform (CDAP) 3.2 release.
This release enhances many existing CDAP features and lays the foundation for upcoming advanced features, all designed to further simplify data application development.
CDAP 3.2 introduces Cask Hydrator—a highly functional framework and UI to support self-service batch and real-time data ingestion, and ETL. Hydrator provides CDAP users a code-free way to configure, deploy, and operationalize ingestion pipelines from different types of data sources.
This release also adds multiple new features to the existing ETL framework in CDAP, including writing to multiple sinks and an extensible data validation layer. Building on enhancements to the existing ETL template, CDAP 3.2 extends the concept of Application Templates (first introduced in CDAP 3.0) to all applications. To achieve this, we’ve introduced Artifacts: application code that can be extended with configuration and plugins. For example, an artifact can consist of a flow writing to a dataset, and multiple applications, each configured to write to a different dataset, can be created from that artifact. Without artifacts, users would have needed to write redundant application code with the dataset name hardcoded rather than configured at application creation time.
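The artifact idea can be illustrated with a small, self-contained Python sketch. This is a conceptual model only, not the CDAP API: the `Artifact` and `Application` classes, the `create_application` method, and the `"dataset"` configuration key are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Artifact:
    """Reusable application code (hypothetical stand-in, not the CDAP API)."""
    name: str
    version: str

    def create_application(self, app_name, config):
        # Each application pairs the shared artifact with its own configuration.
        return Application(app_name, self, dict(config))


@dataclass
class Application:
    name: str
    artifact: Artifact
    config: dict
    # Stand-in for real storage: dataset name -> list of written events.
    written: dict = field(default_factory=dict)

    def process(self, event):
        # The same flow logic runs in every application; only the target
        # dataset, taken from the configuration, differs.
        target = self.config["dataset"]
        self.written.setdefault(target, []).append(event)


# One artifact, two applications configured at creation time.
artifact = Artifact("event-writer", "1.0.0")
purchases = artifact.create_application("PurchaseWriter", {"dataset": "purchases"})
clicks = artifact.create_application("ClickWriter", {"dataset": "clicks"})

purchases.process("order-1")
clicks.process("click-1")
```

Both applications share the same code, and neither hardcodes a dataset name; the name arrives through configuration when the application is created.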
We have also added a Data Quality application as the third built-in CDAP application, after ETL Batch and ETL Real-time. The Data Quality application helps users monitor the quality of data consumed or produced by other applications.
Metadata, Data Discovery, Auditing and Lineage
CDAP 3.2 includes features for capturing and curating metadata. With this release, users can annotate CDAP entities (applications, programs, datasets, and streams) with both tags (labels) and properties (key-value pairs).
Data discovery allows users to search for CDAP entities based on metadata. For this, we’ve added prefix-based search capabilities in CDAP for the first time. This paves the way for extended search capabilities, including integration with full-text search engines in upcoming releases.
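To make tags, properties, and prefix-based search concrete, here is a minimal in-memory sketch in Python. It does not use the CDAP metadata API; the `MetadataStore` class, its method names, and the entity identifiers such as `"dataset:purchases"` are assumptions made up for this example, and the matching rule (case-insensitive prefix match on tags and property values) is a simplification.

```python
from collections import defaultdict


class MetadataStore:
    """Toy metadata store: tags are labels, properties are key-value pairs."""

    def __init__(self):
        self._tags = defaultdict(set)
        self._properties = defaultdict(dict)

    def add_tags(self, entity, *tags):
        self._tags[entity].update(tags)

    def add_properties(self, entity, **properties):
        self._properties[entity].update(properties)

    def search(self, prefix):
        """Return entities with a tag or property value starting with `prefix`."""
        prefix = prefix.lower()
        matches = set()
        for entity, tags in self._tags.items():
            if any(tag.lower().startswith(prefix) for tag in tags):
                matches.add(entity)
        for entity, properties in self._properties.items():
            if any(str(v).lower().startswith(prefix) for v in properties.values()):
                matches.add(entity)
        return sorted(matches)


store = MetadataStore()
store.add_tags("dataset:purchases", "pii", "finance")
store.add_properties("stream:clicks", owner="finops")
store.add_tags("application:Reporting", "weekly")

results = store.search("fin")  # matches the "finance" tag and the "finops" owner
```

A production search index would avoid the linear scan, but the sketch shows the user-facing behavior: a short prefix finds entities through either kind of annotation.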
CDAP 3.2 automatically captures an audit trail of program and dataset interactions. CDAP can generate lineage that shows how datasets and programs are related to each other. This allows users to understand precisely how a dataset was modified or read in a given time interval, by whom, and with what parameters. This information can be extremely useful for tracking trusted or sensitive datasets, identifying which processes or data may have been impacted by an anomalous dataset, and so on.
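The following Python sketch shows, under simplifying assumptions, how lineage questions can be answered from a log of recorded accesses. The `Access` record, its fields, and both helper functions are hypothetical and do not reflect how CDAP stores or exposes lineage; `downstream_datasets` also follows only one hop, whereas a real impact analysis would traverse further.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    program: str
    dataset: str
    mode: str        # "read" or "write"
    timestamp: int   # seconds; any monotonically increasing clock works here
    user: str


def lineage(accesses, dataset, start, end):
    """Accesses to `dataset` with start <= timestamp < end, oldest first."""
    hits = (a for a in accesses
            if a.dataset == dataset and start <= a.timestamp < end)
    return sorted(hits, key=lambda a: a.timestamp)


def downstream_datasets(accesses, dataset):
    """Datasets written by any program that read `dataset` (one hop only)."""
    readers = {a.program for a in accesses
               if a.dataset == dataset and a.mode == "read"}
    written = {a.dataset for a in accesses
               if a.program in readers and a.mode == "write"}
    return sorted(written - {dataset})


log = [
    Access("Ingest", "raw_events", "write", 100, "alice"),
    Access("Cleaner", "raw_events", "read", 200, "bob"),
    Access("Cleaner", "clean_events", "write", 210, "bob"),
    Access("Report", "clean_events", "read", 300, "carol"),
]

history = lineage(log, "raw_events", 0, 250)        # who touched raw_events, and how
impacted = downstream_datasets(log, "raw_events")   # what an anomaly could reach
```

If `raw_events` turned out to be anomalous, `impacted` names the datasets worth inspecting next, and `history` names the programs and users involved.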
CDAP 3.2 introduces Views. Designed to simplify schema-on-read, Views provide a read-only view of a stream with a specific read format. A read format consists of a schema and a format (such as CSV, TSV, or Avro). In future releases of CDAP, this capability will be extended to allow users to create their own custom parsers.
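Schema-on-read can be sketched in a few lines of Python: the raw events stay untouched, and each view applies its own format and schema only when the data is read. The `View` class below is a hypothetical illustration, not the CDAP Views API, and it handles only delimited text (CSV and TSV) using the standard `csv` module.

```python
import csv
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    name: str
    fmt: str        # "csv" or "tsv" in this sketch
    schema: tuple   # ordered (field name, type converter) pairs

    def read(self, raw_events):
        """Parse raw events lazily; the underlying events are never modified."""
        delimiter = {"csv": ",", "tsv": "\t"}[self.fmt]
        for values in csv.reader(raw_events, delimiter=delimiter):
            # zip stops at the shorter sequence, so a narrower schema simply
            # ignores trailing columns.
            yield {name: convert(value)
                   for (name, convert), value in zip(self.schema, values)}


# The same raw stream, read through two different views.
raw_stream = ["alice,3,9.99", "bob,1,4.50"]

orders = View("orders", "csv",
              (("customer", str), ("quantity", int), ("price", float)))
customers = View("customers", "csv", (("customer", str),))

order_rows = list(orders.read(raw_stream))
customer_rows = list(customers.read(raw_stream))
```

Because parsing happens at read time, adding a new view never requires rewriting or re-ingesting the stream.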
When CDAP Explore is enabled, a Hive table is automatically created for you to perform ad-hoc queries on a view. In 3.2, Views are only available on Streams. However, later versions of CDAP will add support for Datasets.
Datasets and MapReduce
This release features a major improvement to Datasets and MapReduce programs: users can now write to multiple datasets, as well as to multiple partitions of a dataset, from a single MapReduce program. This capability enables writing to multiple sinks in the ETL framework (mentioned earlier in this post), and it also serves general-purpose use cases that require writing to different datasets or partitions based on the data being written.
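As a rough illustration of routing output based on the data being written, the Python sketch below sends each record to a dataset and partition chosen from the record itself. It is plain Python rather than a MapReduce program, and the `MultiOutput` class, the `map_record` function, the dataset names, and the record fields are all assumptions invented for this example.

```python
from collections import defaultdict


class MultiOutput:
    """Toy output context: dataset name -> partition key -> list of records."""

    def __init__(self):
        self.outputs = defaultdict(lambda: defaultdict(list))

    def write(self, dataset, partition, record):
        self.outputs[dataset][partition].append(record)


def map_record(record, out):
    # Choose the destination from the record's content.
    if record.get("amount") is None:
        # Records missing an amount go to a separate dataset for inspection.
        out.write("invalid_records", "all", record)
    else:
        # Valid records are partitioned by their date.
        out.write("purchases", record["date"], record)


records = [
    {"date": "2015-09-01", "amount": 12.5},
    {"date": "2015-09-01", "amount": 3.0},
    {"date": "2015-09-02", "amount": 7.25},
    {"date": "2015-09-02", "amount": None},
]

out = MultiOutput()
for record in records:
    map_record(record, out)
```

A single pass over the input fills two datasets and, within `purchases`, two partitions, which is the pattern that multiple ETL sinks rely on.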
Hadoop Distribution Support
Lastly, CDAP 3.2 supports the latest versions of Apache HBase™ (1.1) and Hortonworks Data Platform (2.3). It also integrates with Apache Ambari, enabling Hortonworks customers to use Ambari to install CDAP.
Download CDAP 3.2 today and take it for a spin! Also consider helping us develop the platform by reaching out to the community with any comments, feedback, suggestions, or improvements, or by creating and following JIRA issues and submitting pull requests.