I interned at Cask during the summer of 2015 (yes, the summer of ‘15 – you see what I did there?) and it was during this summer that I learnt a whole lot – in terms of both life and technical skills. I had a great time working in the heart of Silicon Valley, where work did not seem like work; rather, it was fun.
My project at Cask was to create a way of providing statistics for Workflows, an integral part of CDAP, so that a user can gain insight into how the runs of a Workflow have performed, how particular programs in a Workflow (MapReduce, Spark, etc.) have performed, and see their detailed statistics over time. I was given complete ownership of this feature and was tasked with coming up with a design detailing how it would provide a better user experience, along with the implementation details to see it through to fruition.
We wanted users to be able to compute statistics on a Workflow across a time range. To support this, we arrived at the following design conclusions:
- The user should have control over the scope of the statistics being viewed. So, we decided to accept user-supplied start and end times. This helps the user narrow down outlier runs in the specified time interval – for example, retrieving the runs of a workflow that took longer than the 99th percentile. The user can then dig deeper into why that particular run fared poorly.
- The user should have the ability to specify the percentiles for the information they would like to view. So, we decided to return lists of run-IDs where the time taken by the runs was above the specified percentile values.
- The user should also have the ability to compare two runs of a Workflow side-by-side. We decided this is useful when the user would like to see how an outlier run compares to a normal run. It provides program-level details on metrics, so the user can figure out why certain programs took longer, based on their input and output.
- Another feature was the ability to compare one run against the previous/next workflow runs. This is useful for determining whether a run was an outlier in a time range, or whether all the runs around that time exhibited the same odd behavior.
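To make the percentile idea above concrete, here is a rough sketch in Python of how runs above a requested percentile could be picked out. The function and field names are illustrative only – the real feature is exposed through CDAP's REST APIs, not this code:

```python
def runs_above_percentile(runs, percentile):
    """Return run-IDs whose duration exceeds the given percentile.

    `runs` is a list of (run_id, duration_seconds) tuples for one
    workflow, already filtered to the requested start/end time range.
    """
    if not runs:
        return []
    durations = sorted(d for _, d in runs)
    # Nearest-rank percentile: the smallest duration with at least
    # `percentile` percent of the runs at or below it.
    rank = max(0, int(len(durations) * percentile / 100.0) - 1)
    cutoff = durations[rank]
    return [run_id for run_id, d in runs if d > cutoff]

# "r4" takes far longer than its peers, so it shows up as the outlier.
runs = [("r1", 40), ("r2", 42), ("r3", 41), ("r4", 300)]
print(runs_above_percentile(runs, 75))  # -> ['r4']
```

The returned run-IDs are exactly what the user needs in order to drill into a suspicious run afterwards.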
Additional details on the design and implementation of this tool have been documented here.
I decided to implement this idea by simply writing to a new dataset that stores data only for completed workflow runs. When a run of a workflow completes, we add the runtime statistics for the workflow and for the programs that ran as part of it. Each record in the dataset has the run-id of the workflow, the time taken by that particular run, and all the programs present in that run of the workflow. Each program run holds the name of the program, its type, its run-id, and the time it took to complete. For the workflow dataset, the row keys are of the format –
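As a rough sketch of the record structure just described – illustrative Python classes only, not CDAP's actual schema, and the row-key format itself is not shown here:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProgramRun:
    """One program that ran as part of a workflow run."""
    name: str               # name of the program
    program_type: str       # e.g. "MapReduce" or "Spark"
    run_id: str             # run-id of this program run
    duration_seconds: int   # time taken for it to complete

@dataclass
class WorkflowRunRecord:
    """One record in the workflow statistics dataset."""
    run_id: str                  # run-id of the workflow run
    duration_seconds: int        # time taken by the whole run
    program_runs: List[ProgramRun] = field(default_factory=list)
```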
When a user requests statistics for a workflow over a time range:
- First, row keys are constructed based on the requested workflow and time range.
- We then scan the WorkflowStatsTable to obtain the workflow runs in that time range.
- Finally, with the obtained run information, we calculate the requested percentiles and return the results.
- Since the runs of a workflow are grouped together, this locality lets us calculate the percentiles in a single scan without hurting performance.
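The steps above can be sketched end-to-end in Python. The in-memory `table` dict stands in for the WorkflowStatsTable, and the `(workflow, start_time)` key layout is illustrative, not CDAP's actual row-key format:

```python
import bisect

def stats_for_time_range(table, workflow, start, end, percentiles):
    """Scan runs of `workflow` whose start time falls in [start, end)
    and compute the requested duration percentiles (nearest-rank)."""
    # Step 1: construct scan bounds from the workflow name and time
    # range, mirroring how row keys would be built for the table.
    keys = sorted(table)
    lo = bisect.bisect_left(keys, (workflow, start))
    hi = bisect.bisect_left(keys, (workflow, end))
    # Step 2: "scan" the contiguous key range -- runs of the same
    # workflow sort together, so this is one sequential pass.
    durations = sorted(table[k] for k in keys[lo:hi])
    if not durations:
        return {}
    # Step 3: compute each requested percentile over the scanned runs.
    result = {}
    for p in percentiles:
        rank = max(0, int(len(durations) * p / 100.0) - 1)
        result[p] = durations[rank]
    return result

table = {
    ("PurchaseFlow", 100): 40,   # (workflow, start_ts) -> duration
    ("PurchaseFlow", 200): 42,
    ("PurchaseFlow", 300): 300,
    ("OtherFlow", 150): 7,
}
print(stats_for_time_range(table, "PurchaseFlow", 0, 1000, [50, 100]))
# -> {50: 40, 100: 300}
```

Note how the scan never touches "OtherFlow" rows: the grouping of a workflow's runs under adjacent keys is what keeps the computation cheap.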
If we want to compare an outlier with an average workflow run, we call the compare endpoint. This retrieves the program-run-level information for the two runs from the workflow dataset, along with more detailed information about those runs from the CDAP Metrics System and the Hadoop Job History Server. The obtained information is aggregated and returned to the user for easy side-by-side comparison, making it easier to pinpoint probable issues.
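The core of the comparison is just lining up the two runs' program-level stats by program name. A minimal sketch, where the dicts stand in for what the compare endpoint assembles from the workflow dataset and the metrics sources (field names are illustrative):

```python
def compare_runs(run_a, run_b):
    """Return {program_name: (stats_in_a, stats_in_b)}; a program
    missing from either run appears as None on that side."""
    names = set(run_a) | set(run_b)
    return {n: (run_a.get(n), run_b.get(n)) for n in sorted(names)}

# Hypothetical program-level stats for an outlier run vs. a normal one.
outlier = {"PurchaseMR": {"duration": 280, "input_records": 9_000_000}}
normal  = {"PurchaseMR": {"duration": 35,  "input_records": 10_000}}
print(compare_runs(outlier, normal))
```

Seen side-by-side like this, a blown-up input size immediately suggests why the outlier's MapReduce stage took so much longer.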
The most exciting phase of this project was coming up with the design. Anything I came up with had to be thought through with scalability in mind. I got a lot of help from my mentor, Shankar Selvam, and my design went through a rigorous review process with the senior engineers at Cask. Given the time it took from start to finish, I was astonished by the result: what began as a crude, amateur version of a design document was transformed into a concrete plan with the outlined design, steps for implementation, and user-facing REST APIs. Getting complete ownership of a feature of CDAP was a unique experience, and I am grateful for the opportunity. Code reviews at Cask are very streamlined and make for an amazing learning process for interns like me. I got a lot of constructive feedback, which I will take with me as I move on; it has helped me become a better programmer.
So, if you are indeed looking to learn, to work with awesome, talented, and fun-to-be-around folks, and to build a product that is going to make a difference in how people interact with Big Data, it would be in your best interest to steer your career toward Cask. Because before you know it, you’ll start loving work more than your significant other 😉