Code Walk-Through – A Simple Python Plugin
The diagram above shows how the CDAP GUI represents a pipeline that uses the Python Evaluator plugin. In this pipeline, we read HDFS files produced by a prior Spark job, process them with Python, and then write the results back to HDFS. Other than the Python code that I typed in, the pipeline was created entirely by drag-and-drop.
This is the code I typed into the GUI panel for the Python Evaluator plugin.
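The listing below is a minimal sketch of that code, built around the plugin's standard transform(record, emitter, context) contract; the numbered comments correspond to the notes that follow, and the exact field names and debug messages are illustrative.

```python
# A minimal sketch of the evaluator code; the numbered comments
# correspond to the notes below, and the messages are illustrative.
def transform(record, emitter, context):       # (1) the standard plugin contract
    # (3) the Jython standard library is available for any per-record work
    # (4) anything printed to stdout ends up in the CDAP log files
    # (5) 'offset' is the default LongWritable key; the XXX marker makes
    #     these lines easy to find in the logs with an editor search
    print('XXX offset = %d' % record['offset'])
    print('XXX body = %s' % record['body'])
    emitter.emit(record)                        # (2)(6) same schema out as in
```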
(1) CDAP defines a standard contract for all of your Python plugins. You receive, in sequence, each record being processed by the mapper; your Python code is co-located with the mapper and wired inside it as it processes each key/value pair. The record is presented to you as a Python dictionary whose structure corresponds to your configurable data schema. By default, a mapper's record has offset and body fields, corresponding to the LongWritable and Text of a Hadoop job's TextInputFormat. The contract is the same if you use this code in a Spark job: for Spark, we use your Python code as a closure operating on a Spark RDD. For either Spark or Hadoop, the structure of your data is configurable through a simple GUI configuration pane.
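With that default schema, each incoming record looks something like this (the values are illustrative):

```python
# An illustrative record under the default schema: 'offset' is the
# LongWritable byte offset of the line, 'body' is its Text contents.
record = {
    'offset': 0,
    'body': 'first line of the input file'
}
```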
(2) You also receive an emitter object; the records you pass to it can use a different schema than the incoming ones. In this sample code, we use the same schema for incoming and outgoing records. The context object you receive gives you access to existing framework data structures.
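If you did configure a different output schema, you would simply emit dictionaries that match it. For example, with a hypothetical output schema declaring line_offset and length fields:

```python
def transform(record, emitter, context):
    # Emit records in a different, hypothetical output schema; the
    # fields here would have to match the schema configured in the GUI.
    emitter.emit({
        'line_offset': record['offset'],
        'length': len(record['body'])
    })
```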
(3) We ship Jython to the YARN cluster, and your code can use the standard library that ships with Jython. Right now, we don't support dynamically shipping other Python libraries, but we plan to add that in the future.
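Anything in Jython's standard library can be imported directly. As an illustrative example (not from the pipeline above), a transform could use the bundled re module to normalize whitespace in each record:

```python
import re  # part of the standard library that ships with Jython

def transform(record, emitter, context):
    # Illustrative standard-library use: collapse runs of whitespace.
    record['body'] = re.sub(r'\s+', ' ', record['body'])
    emitter.emit(record)
```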
(4) The CDAP Development Framework redirects standard output to log files on an edge node. These can be viewed from the GUI, and many people examine the cdap.log file directly when using the SDK, since it is easy to find your log lines with an editor while you are developing.
(5) This shows the offset (the default Hadoop LongWritable key field) being read from the record's Python dictionary. The three X's in the debug output are there simply to make those lines easy to find with an editor search. (The logs can grow quite large, especially on a production cluster.)
(6) For a MapReduce job, the emitter hands control back to the mapper. For a Spark job, the records you emit become the return value of the closure that CDAP creates from your Python code. In either case, you call the emitter to return your processed data.
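Conceptually, the Spark path looks something like the sketch below. This is a simplification for illustration, not CDAP's actual internals: everything you pass to the emitter for a record becomes that record's output in a flatMap over the RDD.

```python
# Conceptual sketch only; not CDAP's actual internals. It shows why the
# emitter acts as the "return value" of the closure CDAP builds around
# your transform when running on Spark.
def run_on_spark(rdd, transform, context):
    def closure(record):
        emitted = []
        class ListEmitter(object):
            def emit(self, rec):
                emitted.append(rec)
        transform(record, ListEmitter(), context)
        return emitted                 # emitted records become the output
    return rdd.flatMap(closure)
```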
Deep Dive into the Pipeline – Apache Hadoop Execution
The pipeline executing this code launches several MapReduce jobs on your YARN cluster, assuming you selected the MapReduce engine rather than the Spark engine to run your Python pipeline. CDAP schedules a MapReduce job for each plugin you've dragged onto the Studio GUI, and the set of MapReduce jobs associated with each plugin is individually scalable from the GUI, making your application adaptable as data volume changes over time.
Deep Dive into the Pipeline – Apache Spark Execution
When you deploy your pipeline, you are given a choice: run it with MapReduce or with Spark. The same code works with either engine. Notice that you wrote the Python code against a specific contract, and CDAP keeps that contract whether you are running with MapReduce, Spark RDDs, or Spark DStreams. CDAP is the first and only framework to provide this kind of flexibility.
Your Jython code has effectively become a Spark microservice: an independent service that can be scaled, rebuilt, redeployed, and re-architected as needed. Developers can build a new service and hand it off to Operations, who can simply drag and drop it into the GUI. Here at Cask, we like to think that we are creating the microservice architecture for the future of Big Data, seamlessly integrating Apache Spark, Apache Hadoop, Apache Hive, Apache HBase, YARN, and other Big Data technologies, so that developers can focus on solving the business problem while CDAP handles the integration and deployment issues for them.
I encourage you to download the latest CDAP and give it a spin. We actively welcome questions, comments and suggestions. Our user group is a great place to engage with the Cask team and the entire CDAP community.