Apache Oozie is a workflow scheduler system to manage Apache Hadoop™ jobs. It is one of the most popular open-source workflow scheduler systems for Hadoop. Cask Data Application Platform (CDAP) is an open-source platform to build and deploy data applications on Hadoop. CDAP provides abstractions on top of Hadoop that enable developers to rapidly build, deploy, and manage real-time as well as batch data applications. CDAP includes a built-in workflow and scheduler system. This blog post introduces CDAP workflows by way of a functional comparison with Oozie.
Architecture and Design
Both Oozie and CDAP workflows are essentially implemented as Directed Acyclic Graphs (DAGs) of actions.
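To make the DAG idea concrete, here is a minimal sketch in Python. It is not how either system is implemented; the node names and the `execution_order` helper are hypothetical. Each node lists the nodes it depends on, and a topological sort yields an order in which every action runs only after its predecessors.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each key is a node, each value is the set of
# nodes that must finish before that node can start.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
}


def execution_order(dag):
    """Return one valid execution order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)

# A graph with a cycle is not a DAG and cannot be ordered.
try:
    execution_order({"a": {"b"}, "b": {"a"}})
    cycle_rejected = False
except CycleError:
    cycle_rejected = True
```

The acyclic requirement is what guarantees that a workflow run terminates: a cycle would leave some node waiting on itself.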
The Oozie server is a Java web-application that runs in a Java servlet container such as Apache Tomcat, running on a client/gateway node for a Hadoop cluster.
Packaging, Distribution, and Installation
Since Oozie is a distinct Java web-application, it needs to be installed separately from other components in a Hadoop cluster, following a detailed, often cumbersome installation procedure. The CDAP workflow system, however, is simply a sub-component of CDAP; if you have installed CDAP on your cluster, you do not need to install another component for running workflows. Similarly, setting up Oozie in a secure environment requires you to secure an extra service (the Oozie server), whereas in CDAP, you do not need to secure an extra service just for running workflows.
Backing Metadata Store
Oozie uses a relational database as its backing metadata store. It currently supports HSQLDB, Apache Derby, MySQL, Oracle, and PostgreSQL for this purpose. CDAP workflows, on the other hand, use the CDAP Metastore as their backing metadata store. The CDAP Metastore uses Apache HBase in distributed mode and LevelDB in standalone mode. The program archives for CDAP workflows are stored on the file system, for which CDAP uses HDFS (or a similar distributed file system like MapRFS) in distributed mode or the local file system in standalone mode.
Oozie uses a map-only MapReduce job with a single mapper as a launcher job that is spawned for each action node. CDAP workflows use a YARN application developed using Apache Twill as their driver program.
Oozie workflow and coordinator definitions are written in hPDL (an XML Process Definition Language similar to the JBoss jBPM jPDL). Oozie also allows users familiar with Java to configure actions using Java code by implementing the OozieActionConfigurator interface. However, the Java code still must be referenced from within the workflow XML definition. Like all other CDAP abstractions, CDAP provides a Java programming API for defining the workflows.
To deploy Oozie applications, users first have to define their workflow and coordinator XML files along with a job.properties file (containing parameters, the location of the workflow/coordinator XMLs, etc.). These files have to be packaged with the libraries (.jar files) that the application requires. Oozie provides a Java Client API and a Command Line Interface (the Oozie CLI) to submit these applications to the Oozie server.
CDAP workflows, on the other hand, are Java classes that are built into a CDAP application (also Java code). The CDAP application is then submitted to the CDAP service as a single jar file. CDAP provides REST APIs, Java Clients, the CDAP CLI, or the CDAP UI to deploy applications. Once an application is deployed, users can start, stop, and manage workflows in the application like any other program in CDAP, using any of the interfaces listed previously.
Workflow Functional Specification
Action nodes are nodes in a workflow that run computation or processing tasks.
Oozie supports these action nodes:
- MapReduce action: Runs a MapReduce program.
- Pig action: Runs an Apache Pig script.
- Hive/Hive2 action: Runs an Apache Hive job.
- Fs (HDFS) action: Manipulates files and directories on HDFS from a workflow application.
- Sub-workflow action: Runs a child workflow job, which can be in either the same Oozie system or another Oozie system.
- Java action: Runs a Java program inside a workflow action.
- Spark action: Though the 4.2.0 documentation has no mention of it, OOZIE-1983 added support for spark-action in Oozie.
- Sqoop action: Runs an Apache Sqoop job.
- SSH action: Starts a shell command on a remote machine as a remote secure shell and waits for it to complete.
- Email action: Sends an email from a workflow.
- Distcp action: Uses Hadoop distributed copy to copy files from one cluster to another or within the same cluster.
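CDAP workflows support these action nodes:
- MapReduce: Runs a MapReduce program.
- Spark: Runs a Spark program.
- Custom action: Runs a user-defined action.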
Control nodes define the beginning and end of a workflow. They also provide a mechanism to control the workflow execution path.
Oozie supports these control nodes:
- Start control node: Defines the beginning of the workflow.
- End control node: Defines the end of a workflow, indicating that the workflow has completed successfully.
- Kill control node: Defines the end of a workflow, indicating that the workflow failed.
- Decision control node: Enables the workflow to make a selection on which path to follow. It can be seen as a “switch-case” statement. Predicates are defined in the JSP Expression Language (EL).
- Fork and Join control nodes: A fork node splits one path of execution into multiple concurrent paths of execution. A join node waits until every concurrent execution path of a previous fork node arrives at it.
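The fork and join behavior described above can be sketched in a few lines of Python. This is only an illustration of the semantics using a thread pool; `run_fork_join` and the branch names are hypothetical, and neither Oozie nor CDAP runs its branches this way.

```python
from concurrent.futures import ThreadPoolExecutor


def run_fork_join(branches):
    """Run every branch concurrently (fork) and block until all finish (join)."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(fn) for name, fn in branches.items()}
        # result() blocks until that branch is done, so this loop is the join.
        return {name: future.result() for name, future in futures.items()}


results = run_fork_join({
    "count_words": lambda: len("a b c".split()),
    "count_chars": lambda: len("a b c"),
})
```

Execution continues past the join only once every branch has returned, which is why a slow branch delays all nodes downstream of the join.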
CDAP workflows support these control nodes:
- Start node: Signals the start of a workflow.
- End node: Signals the end of a workflow. CDAP workflows do not distinguish between failure nodes and success nodes. The state of the workflow after it has completed execution indicates whether the workflow succeeded or failed.
- Condition node: Allows conditional execution of nodes in a workflow. A condition is defined as a named Java class that implements Predicate, whose success or failure determines the path to follow. As opposed to the “switch-case” model of condition nodes in Oozie, CDAP workflows follow an “if-else” model.
- Fork and Join nodes: Similar to Oozie, a fork node splits one path of execution into multiple concurrent paths of execution, while a join node waits until all concurrent paths from the corresponding fork node have completed execution.
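The difference between the two condition models can be illustrated with a small Python sketch. Both functions are hypothetical stand-ins: the first mimics a switch-case decision in which the first matching case wins and a default applies otherwise, the second mimics an if-else condition driven by a single predicate. Real Oozie predicates are written in EL and real CDAP predicates are Java classes.

```python
def switch_case_decision(cases, default, context):
    """Switch-case model: the first predicate that holds selects the path."""
    for predicate, target in cases:
        if predicate(context):
            return target
    return default


def if_else_condition(predicate, if_branch, else_branch, context):
    """If-else model: one predicate selects between exactly two branches."""
    return if_branch if predicate(context) else else_branch


context = {"record_count": 1200}

path_switch = switch_case_decision(
    cases=[
        (lambda c: c["record_count"] == 0, "skip"),
        (lambda c: c["record_count"] > 1000, "bulk_load"),
    ],
    default="small_load",
    context=context,
)

path_if_else = if_else_condition(
    lambda c: c["record_count"] > 0, "process", "skip", context
)
```

A multi-way choice in the if-else model can still be expressed by nesting one condition inside the else branch of another.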
Both Oozie and CDAP workflows can be parameterized. Oozie allows this using the JSP Expression Language (EL) with some built-in constants and functions, while CDAP workflows allow parameterization using Preferences and Runtime Arguments.
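As a rough illustration of parameterization through layered settings, the Python sketch below resolves a parameter from runtime arguments first and falls back to preferences. The precedence order (runtime arguments over preferences) is an assumption made for this example, and the keys and paths are hypothetical.

```python
from collections import ChainMap

# Hypothetical stored preferences and per-run arguments.
preferences = {"input.path": "/data/in", "output.path": "/data/out"}
runtime_args = {"input.path": "/data/in/2015-08-01"}

# ChainMap searches its maps left to right, so a runtime argument
# shadows a preference with the same key (assumed precedence).
resolved = ChainMap(runtime_args, preferences)

input_path = resolved["input.path"]
output_path = resolved["output.path"]
```

With this layering, a workflow keeps sensible defaults in preferences while a single run can override just the values that differ.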
Workflow Jobs Recovery (or Re-Run)
Oozie provides a mechanism to restart a workflow from the node which failed in a previous run. This is especially useful when a workflow has nodes that were successful in a previous run but are too expensive to be re-run. CDAP workflows currently do not provide this feature, but it is a critical feature on the CDAP roadmap.
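The idea behind re-running from the failed node can be sketched as follows. This Python example is a simplified model for a linear workflow, not Oozie's implementation; `run_workflow`, the node names, and the status strings are hypothetical.

```python
def run_workflow(nodes, actions, previous_status=None):
    """Run nodes in order, skipping any that succeeded in a previous run."""
    previous_status = previous_status or {}
    status = {}
    failed = False
    for node in nodes:
        if previous_status.get(node) == "SUCCEEDED":
            status[node] = "SUCCEEDED"  # carried over, not executed again
        elif failed:
            status[node] = "NOT_RUN"
        else:
            try:
                actions[node]()
                status[node] = "SUCCEEDED"
            except Exception:
                status[node] = "FAILED"
                failed = True
    return status


calls = []
attempts = {"load": 0}


def extract():
    calls.append("extract")


def load():
    attempts["load"] += 1
    calls.append("load")
    if attempts["load"] == 1:
        raise RuntimeError("transient failure")


def publish():
    calls.append("publish")


nodes = ["extract", "load", "publish"]
actions = {"extract": extract, "load": load, "publish": publish}

first_run = run_workflow(nodes, actions)
second_run = run_workflow(nodes, actions, previous_status=first_run)
```

In the second run the expensive `extract` step is not executed again; only `load` and the nodes after it run.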
Passing Data between Nodes
Oozie nodes have a <capture-output> element that captures the output of a command. You can then pass the captured output as an argument to a subsequent node in your workflow XML. As of version 3.1.0, CDAP Workflows include a comprehensive Workflow Token, which is available to all nodes in the workflow and which they can read from and write to as needed. Users can:
- pass custom data between nodes;
- use the workflow token in Condition Predicates to determine the path of execution in a workflow;
- use the workflow token in Action nodes to determine the configuration of an Action;
- query the workflow token for a given scope, node, or for a specified key.
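The capabilities listed above can be modeled with a small Python class. This is a conceptual sketch only: `WorkflowTokenSketch`, its method names, and the scope labels are hypothetical and do not reproduce the CDAP Java API.

```python
from collections import defaultdict


class WorkflowTokenSketch:
    """Toy token: remembers which node wrote which value under which key."""

    def __init__(self):
        # (scope, key) -> list of (node, value) in write order
        self._entries = defaultdict(list)

    def put(self, node, key, value, scope="USER"):
        self._entries[(scope, key)].append((node, value))

    def get(self, key, node=None, scope="USER"):
        """Return the latest value for key, optionally restricted to one node."""
        for writer, value in reversed(self._entries.get((scope, key), [])):
            if node is None or writer == node:
                return value
        return None

    def get_all(self, scope="USER"):
        """Return every key in a scope with its full write history."""
        return {
            key: list(values)
            for (entry_scope, key), values in self._entries.items()
            if entry_scope == scope
        }


token = WorkflowTokenSketch()
token.put("cleaner", "records.rejected", 17)
token.put("loader", "records.rejected", 3)

latest = token.get("records.rejected")
from_cleaner = token.get("records.rejected", node="cleaner")

# A condition predicate could branch on a value written by an earlier node.
proceed = token.get("records.rejected") < 10
```

Keeping the writing node alongside each value is what makes per-node queries possible when several nodes write the same key.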
Scheduler Functional Specification
Both Oozie and CDAP have a built-in scheduler system that schedules workflows based on time intervals as well as data availability. Oozie provides a Coordinator for scheduling purposes. The time-based scheduler in the coordinator uses JSP Expression Language (EL) syntax, which can also be used to parameterize scheduler expressions using built-in constants and functions. The time scheduler in CDAP is a cron-expression based scheduler: users specify a cron expression as part of the schedule definition.
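For readers unfamiliar with cron expressions, the Python sketch below checks whether a timestamp matches a five-field expression (minute, hour, day of month, month, day of week). It is a simplified matcher written for illustration, not the scheduler CDAP uses: it supports only `*`, numbers, ranges, lists, and `/step`, assumes Sunday is day 0, and requires all five fields to match.

```python
from datetime import datetime

# Allowed (min, max) per field: minute, hour, day of month, month, weekday.
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def _field_matches(field, value, lo_bound, hi_bound):
    """Check one cron field such as '*', '5', '1-5', '*/15', or '0,30'."""
    for part in field.split(","):
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if base == "*":
            lo, hi = lo_bound, hi_bound
        elif "-" in base:
            lo, hi = (int(x) for x in base.split("-"))
        else:
            lo = hi = int(base)
        if lo <= value <= hi and (value - lo) % step == 0:
            return True
    return False


def cron_matches(expression, when):
    """Return True if the datetime matches the five-field expression."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected five fields: minute hour day month weekday")
    # isoweekday() is 1..7 with Sunday = 7; modulo 7 maps Sunday to 0.
    values = [when.minute, when.hour, when.day, when.month, when.isoweekday() % 7]
    return all(
        _field_matches(field, value, lo, hi)
        for field, value, (lo, hi) in zip(fields, values, FIELD_RANGES)
    )


daily_4am = cron_matches("0 4 * * *", datetime(2015, 8, 1, 4, 0))
every_15_min = cron_matches("*/15 * * * *", datetime(2015, 8, 1, 9, 30))
off_schedule = cron_matches("*/15 * * * *", datetime(2015, 8, 1, 9, 31))
```

A scheduler would evaluate such a check once per minute and start a workflow run whenever the expression matches.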
For data-availability based schedules, the Oozie coordinator allows users to define datasets, and then schedule workflows based on their availability. CDAP provides a Stream Size Scheduler that can trigger the execution of a workflow based on availability of data in a Stream.
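A size-based trigger of this kind can be pictured with the following Python sketch. The class, its parameter, and the reset-after-trigger behavior are assumptions made for illustration; the actual Stream Size Scheduler may account for data differently.

```python
class StreamSizeTriggerSketch:
    """Toy trigger: fires once enough new data has arrived since the last run."""

    def __init__(self, trigger_mb):
        self.trigger_bytes = trigger_mb * 1024 * 1024
        self._since_last_run = 0

    def on_data(self, num_bytes):
        """Record newly ingested bytes; return True when a run should start."""
        self._since_last_run += num_bytes
        if self._since_last_run >= self.trigger_bytes:
            # Assumed behavior: the counter restarts after each trigger.
            self._since_last_run = 0
            return True
        return False


trigger = StreamSizeTriggerSketch(trigger_mb=1)
fired = [
    trigger.on_data(600 * 1024),  # 600 KB so far, below the 1 MB threshold
    trigger.on_data(600 * 1024),  # 1200 KB, threshold crossed
    trigger.on_data(100),         # counter was reset, far below threshold
]
```

Compared with a time-based schedule, this style of trigger avoids launching a workflow when little or no new data has arrived.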
Concurrent Runs of a Workflow
Both the Oozie coordinator and CDAP allow multiple concurrent runs of a workflow. The coordinator allows applications to specify the concurrency level in the <concurrency> tag of the coordinator XML. CDAP also allows concurrent runs of a workflow, with the maximum number of concurrent runs governed by the scheduler.max.thread.pool.size parameter in cdap-site.xml.
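What a concurrency limit means in practice can be sketched as below. This Python model is hypothetical: it assumes that runs beyond the limit wait in first-in, first-out order, which may not match how either system treats excess runs.

```python
from collections import deque


class ConcurrencyLimiterSketch:
    """Toy limiter: at most max_concurrent runs are active at once."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.active = set()
        self.waiting = deque()

    def request_start(self, run_id):
        """Start the run if a slot is free; otherwise queue it."""
        if len(self.active) < self.max_concurrent:
            self.active.add(run_id)
            return True
        self.waiting.append(run_id)
        return False

    def finish(self, run_id):
        """Free a slot and promote the oldest waiting run, if any."""
        self.active.discard(run_id)
        if self.waiting:
            next_run = self.waiting.popleft()
            self.active.add(next_run)
            return next_run
        return None


limiter = ConcurrencyLimiterSketch(max_concurrent=2)
started = [limiter.request_start(r) for r in ("run-1", "run-2", "run-3")]
promoted = limiter.finish("run-1")
```

The limit protects cluster resources when a schedule fires faster than individual runs complete.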
Oozie comes with a read-only UI by default. The UI also requires a library that is not Apache-licensed, which makes it difficult to install and use. An alternative UI, Hue, is catching up with more advanced features for managing workflows. However, it is a separate project that is not included with Oozie.
CDAP workflows are integrated into the CDAP UI, which exposes capabilities to deploy applications containing workflows; set preferences and runtime arguments for workflows; start, stop, suspend, resume, and view workflow runs; as well as view workflow metrics and logs. The screenshot below shows the workflow page for a run of the WikipediaPipeline example application that is bundled with CDAP:
The Oozie server is instrumented and can publish notifications to JMS Providers. Notifications can then be consumed by any JMS consumer. CDAP workflows use the CDAP monitoring system ‘Watchdog’ that publishes metrics and logs (for both applications as well as the CDAP system) via Apache Kafka. The CDAP monitoring system is well-integrated with the CDAP UI.
Both Oozie and CDAP are open-source Apache (V2) licensed software. Oozie is governed by the Apache Software Foundation, while CDAP is governed by the CDAP Community.
The following table summarizes the functional capabilities of Oozie and CDAP Workflows.
|Function||Apache Oozie||CDAP Workflow|
|Packaging/Deployment||Separate service that needs to be installed and managed independently||Integrated within CDAP|
|Backing Metadata Store||Relational DBs||Apache HBase|
|Application Packaging and Deployment||XML, properties files, libraries as a bundle||Named Java classes in the containing application archive|
|UI||Built-in read-only UI||Comprehensive UI including management and monitoring|
|Action Nodes||MapReduce, Pig, Hive, Spark, Sqoop, Fs, Java Actions, SSH, Email, Distcp||MapReduce, Spark, Custom Actions|
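|Control Nodes: Condition||Decision node (switch-case model, JSP EL predicates)||Condition node (if-else model, Java Predicate classes)|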
|Control Nodes: Fork and Join||Yes||Yes|
|Workflow Parameterization||JSP EL Functions-based||Preferences/Runtime arguments-based|
|Workflow Recovery (re-run)||Yes||Not yet (on the roadmap)|
|Passing data between nodes||Via <capture-output> in workflow XML||Via Workflow Token|
|Scheduler||Time and Data-availability based||Time and Data-availability based|
|Concurrent instances of the same workflow||Supported via <concurrency> tag||Supported|
|Monitoring||Oozie Server publishes notifications to JMS providers||Integrated with CDAP metrics and logging system|
|License||Apache V2||Apache V2|
In summary, this blog post introduces CDAP workflows and compares them functionally with Oozie. Stay tuned for future posts that will discuss some of the capabilities of CDAP workflows in greater detail. In the meantime, feel free to try out our examples, reach out to us with any questions, and consider helping us develop the platform.