A recent report suggests that 2.5 quintillion bytes of data are created every day, and that 90% of the world’s data has been generated in the past two years. Data is fast becoming the cornerstone of a variety of businesses worldwide. Given this explosion of data, maintaining a catalog of that data is of utmost importance.
It is often observed that terabytes or even petabytes of data are available in data lakes or landing zones, but there is very little metadata about each dataset they contain. Metadata can include taxonomy, schema, quality and other such characteristics. In the absence of such metadata, data scientists and business analysts can end up spending most of their time discovering what data is available, or searching for a dataset with a given set of characteristics. Their time would be much better spent tuning their algorithms to derive value from the data.
Data Discovery is the process of finding data with a given set of characteristics. A recent Gartner report emphasizes the importance of data discovery. Typically, data discovery is enabled using a search mechanism to identify datasets. This mechanism can help users look for data with particular business annotations, fields, field types and so on. To enable discovery, datasets in your data lake must first be annotated with metadata. Some of this metadata can be machine-generated, while some can be provided by business users. Examples of machine-generated metadata include the data size, storage format and schema, whereas user-generated metadata is generally business-driven: for example, it identifies the business use case that the dataset serves.
Ultimately, annotating data with such metadata makes it possible to add advanced capabilities such as search and discovery, as well as lineage and provenance, to data applications. As described earlier, search and discovery help users find datasets with certain characteristics. They can answer queries such as: find a dataset that has a field called employee_id of type long; or find a dataset that has been annotated with the term media. Lineage and provenance, on the other hand, tell users how datasets and processes relate to one another, thereby helping maintain an audit trail of data. They can answer questions such as: which other datasets were used to generate a given dataset in a specified time period; or which processes affected a given dataset in a specified time period. This also makes data compliance and governance possible, for instance by providing an audit trail of sensitive data.
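To make the two example search queries concrete, here is a minimal sketch in Python of a toy in-memory catalog. It is not CDAP code: the DatasetRecord class, the find_by_field and find_by_tag helpers, and the sample dataset names are all hypothetical, chosen only to illustrate searching by schema field and by business annotation.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry for illustration; not a CDAP class."""
    name: str
    schema: dict                              # field name -> field type (machine-generated)
    tags: set = field(default_factory=set)    # business annotations (user-provided)


def find_by_field(catalog, field_name, field_type):
    """Return names of datasets whose schema has the given field with the given type."""
    return [d.name for d in catalog if d.schema.get(field_name) == field_type]


def find_by_tag(catalog, tag):
    """Return names of datasets annotated with the given tag (case-insensitive)."""
    wanted = tag.lower()
    return [d.name for d in catalog if wanted in {t.lower() for t in d.tags}]


# Sample data, invented for this sketch.
catalog = [
    DatasetRecord("employees", {"employee_id": "long", "name": "string"}, {"hr"}),
    DatasetRecord("video_views", {"video_id": "string", "viewer_id": "long"}, {"media"}),
    DatasetRecord("badge_swipes", {"employee_id": "long", "ts": "long"}, {"security"}),
]

print(find_by_field(catalog, "employee_id", "long"))  # ['employees', 'badge_swipes']
print(find_by_tag(catalog, "media"))                  # ['video_views']
```

A real catalog would index this metadata rather than scan every record, but the shape of the questions stays the same.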
Implementing these advanced capabilities, however, involves great complexity. As a result, it is not desirable to add such capabilities to every single application or data pipeline independently. Another option is to add them to platforms like File Systems, Databases, Columnar Storage Systems, etc. However, these platforms are not built with every single application’s business use case in mind. Hence, it makes sense to build these capabilities into a middleware layer that can interact with low-level storage and processing platforms, but also has an application-centric view. With that in mind, metadata, data discovery and lineage tracking features were added to the Cask Data Application Platform (CDAP) in release 3.2.0. These capabilities were further enhanced in the 3.3.0 release with the addition of machine-generated metadata, schema as metadata and enhanced search capabilities.
In CDAP, the concept of metadata and discovery has been extended beyond data, to include processes that read from and write to datasets. Artifacts, applications, programs, streams, datasets and stream views can be annotated with metadata. As mentioned before, CDAP automatically annotates these entities with machine-generated metadata (referred to as system metadata) and also allows users to annotate them with business metadata (referred to as user metadata).
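One way to picture the split between system metadata and user metadata is as two separate scopes attached to each entity. The sketch below is a hypothetical Python model, not the CDAP API: the Entity class, the matches helper, and the property keys ("format", "owner") are assumptions made for illustration; only the list of entity types comes from the text above.

```python
from dataclasses import dataclass, field

# Entity types named in this post.
ENTITY_TYPES = {"artifact", "application", "program", "stream", "dataset", "stream_view"}


@dataclass
class Entity:
    """Hypothetical model of an annotated entity; not a CDAP class."""
    entity_type: str
    name: str
    system: dict = field(default_factory=dict)  # machine-generated, written by the platform
    user: dict = field(default_factory=dict)    # business annotations, written by users

    def __post_init__(self):
        if self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.entity_type}")


def matches(entity, key, value):
    """True if either metadata scope holds `key` with exactly `value`."""
    return any(key in scope and scope[key] == value
               for scope in (entity.system, entity.user))


purchases = Entity("dataset", "purchases")
purchases.system["format"] = "avro"     # what the platform might record automatically
purchases.user["owner"] = "finance"     # what a business user might add

print(matches(purchases, "format", "avro"))    # True
print(matches(purchases, "owner", "finance"))  # True
print(matches(purchases, "owner", "avro"))     # False
```

Keeping the scopes separate means a user annotation can never overwrite a machine-generated value, while a search can still consult both.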
Users can then search these CDAP entities using the metadata associated with them. This can help users answer a variety of questions to discover entities. Some examples are:
Along with discovery, metadata also enables lineage tracking in CDAP, which helps users find the relationship between CDAP programs, datasets and streams in the specified time window. It can answer questions like ‘What datasets were generated using the specified dataset, in a given period of time?’; or ‘What programs consumed a dataset during a given time interval?’.
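The two lineage questions above can be sketched over a simple access log. This is an illustrative Python model under stated assumptions, not how CDAP stores or computes lineage: the Access record, the function names, and the sample log are hypothetical, and the derivation shown follows only one hop (programs that read the dataset, then the datasets those programs wrote).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """One recorded access of a dataset by a program (hypothetical model)."""
    program: str
    dataset: str
    mode: str        # "read" or "write"
    timestamp: int   # seconds since some epoch


def programs_that_read(log, dataset, start, end):
    """Programs that read `dataset` in the half-open window [start, end)."""
    return sorted({a.program for a in log
                   if a.dataset == dataset and a.mode == "read"
                   and start <= a.timestamp < end})


def datasets_derived_from(log, dataset, start, end):
    """Datasets written in the window by programs that read `dataset` in the window."""
    readers = set(programs_that_read(log, dataset, start, end))
    return sorted({a.dataset for a in log
                   if a.program in readers and a.mode == "write"
                   and start <= a.timestamp < end})


# Sample log, invented for this sketch.
log = [
    Access("parse_logs", "raw_logs", "read", 100),
    Access("parse_logs", "clean_logs", "write", 110),
    Access("build_report", "clean_logs", "read", 200),
    Access("build_report", "daily_report", "write", 210),
    Access("archive", "raw_logs", "read", 900),
]

print(programs_that_read(log, "raw_logs", 0, 500))     # ['parse_logs']
print(datasets_derived_from(log, "raw_logs", 0, 500))  # ['clean_logs']
```

Following the result of datasets_derived_from repeatedly would extend the one-hop answer into a full downstream lineage graph.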
CDAP can also optionally publish metadata changes to entities for consumption by external tools. This is designed to enable integration with external data governance tools like Cloudera Navigator.
We hope you enjoyed this introduction to metadata, discovery and lineage in CDAP. We will publish follow-up blog posts that describe in detail the design and implementation of various aspects of this feature, such as search, lineage and integration with external data governance tools. In the meantime, please be sure to download CDAP 3.3.0, try out these features and let us know your feedback by emailing our user group.