The Need for Abstraction in Hadoop

Nitin Motgi is Founder and CTO of Cask, where he is responsible for developing the company's long-term technology and driving engineering initiatives and collaboration.

Prior to Cask, Nitin was at Yahoo! working on a large-scale content optimization system known externally as C.O.R.E.

Hadoop is a collection of 47+ components. Recently, Andreas Neumann and Merv Adrian, in their respective blogs, discussed what makes a Hadoop technology Hadoop. They both did a great job of asking the right questions and presenting the facts about what makes up Hadoop today. While Andreas focused on picking the right technologies for the use case, and Merv took an approach based on the support provided by the distros, in the end neither could clearly define what delineates a Hadoop technology.

To me, Hadoop at its core is still just HDFS (the storage layer) and YARN (the compute layer). But with distributions calling everything they sell Hadoop, a lot of other things have been tagged onto that name. In the end it’s simply a collection of loosely federated technologies that might work together if you know how to make them work. Each component within so-called Hadoop is a technology built and designed by a group of geniuses in the open source community. It’s a revolution that is currently driving innovation in this industry, but it’s also making it extremely hard for end users to fully understand the landscape and figure out how the technology can be used to derive business value. Most of these components are designed mainly to solve one problem, and this is where the dilemma begins.

While each of these components gets packaged into a distribution or a cloud service, the organization is left to figure out how best to integrate them to solve the use case at hand. As new uses for Hadoop are found or attempted, new integrations are needed and new code is written. When projects are small in scope and have the right team, this is not much of an issue. But as projects grow in an organization, multiple teams set up different environments, move projects into production, and need to operationalize and maintain those systems. The complexity of data, security, and management quickly becomes overwhelming.

In my experience, when enterprises adopt a technology as complex and loosely coupled as Hadoop, they attempt to deal with this complexity by insulating programmers from the underlying technology behind a lightweight abstraction layer, sometimes called a shim layer. The enterprise creates a team whose primary responsibility is building abstractions for other developers. Their goal is to provide a non-leaky abstraction so that application developers can focus on writing great code and putting what they build into the hands of the customers who benefit from it.

The creation of an abstraction layer is common among complex technologies, and its advantages and drawbacks are generally well known. Advantages of having an abstraction layer include:

  • Decreases the complexity of, and the coupling between, technologies, which makes maintaining the code more efficient.
  • Eases the transition to new technologies; for example, moving from HBase to Cassandra, or from HDFS to WASB.
  • Forces developers to create reusable components instead of monolithic applications. Reusable code increases organizational efficiency with every new project, and the explicit separation between layers prevents developers from taking shortcuts and mixing code across them.
  • Allows for independent testing. The separation of logical layers provides a high degree of decoupling, so each layer can be tested independently or reused by other applications.
  • Opens the door to dynamic loading of technologies, enabling your application to integrate with new technologies as they become available with few or no changes.
  • Maintains integrity and metadata across multiple systems.
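To make the idea of a shim layer concrete, here is a minimal sketch in Python. All of the names here (`KeyValueStore`, `InMemoryStore`, `lookup_customer`) are hypothetical illustrations, not part of any Hadoop or CDAP API. The point is that application code depends only on the interface, so the backing store (HBase, Cassandra, or a local map for tests) can be swapped without touching the business logic.

```python
# Minimal sketch of a storage abstraction ("shim") layer. Names are
# hypothetical, not taken from any real Hadoop API.
from abc import ABC, abstractmethod
from typing import Dict, Optional


class KeyValueStore(ABC):
    """The interface the application codes against, hiding the backing store."""

    @abstractmethod
    def put(self, key: str, value: str) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[str]: ...


class InMemoryStore(KeyValueStore):
    """Stand-in backend; an HBase or Cassandra client would slot in the same way."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


def lookup_customer(store: KeyValueStore, customer_id: str) -> str:
    # Business logic depends only on the interface, so swapping the
    # backend requires no changes here.
    return store.get(customer_id) or "unknown"
```

Because `lookup_customer` never mentions a concrete store, a migration from one backend to another becomes a one-line change where the store is constructed, and tests can run against the in-memory implementation.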

Beyond these advantages, the layer also provides extended benefits. It can:

  • Prevent mixing and matching code from different technologies, or code from one layer bleeding into another.
  • Fill in gaps in the underlying technology.
  • Expose domain-specific or business-oriented APIs.
  • Provide a rendezvous point for integrating new technology alongside your existing technology.
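The rendezvous-point and dynamic-loading ideas above can be sketched as a small backend registry. Again, every name here (`create_store`, `HdfsStore`, `WasbStore`) is a hypothetical illustration: the application asks for a store by the name found in configuration, so a new technology only needs to be registered, not wired into application code.

```python
# Hypothetical sketch of configuration-driven backend selection.
from typing import Callable, Dict


class HdfsStore:
    name = "hdfs"


class WasbStore:
    name = "wasb"


# The registry is the single rendezvous point: adding a backend means
# adding one entry here, with no change to the code that uses stores.
_FACTORIES: Dict[str, Callable[[], object]] = {
    "hdfs": HdfsStore,
    "wasb": WasbStore,
}


def create_store(backend: str):
    """Instantiate the backend named in configuration."""
    try:
        return _FACTORIES[backend]()
    except KeyError:
        raise ValueError(f"unknown backend: {backend}") from None
```

In a real system the registry entries might be discovered from plugin jars or entry points rather than hard-coded, but the shape of the abstraction is the same.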

Despite such great benefits, the abstraction layer also presents net new challenges if not built correctly:

  • It can introduce new security holes.
  • It can reduce reliability, as it is another system to be monitored and tracked.
  • It can degrade performance if built inefficiently.
  • It must be continually tested against the different components it integrates, and against different versions of those components.

Lastly, this layer adds a small performance penalty to the applications built on it, but in the end the benefits of such a layer nearly always outweigh the shortcomings in the long run.

Building this abstraction layer can work well in large, technologically advanced enterprises. But as the need for the layer grows, the team assigned to it grows as well. As it becomes core to the business, it goes from being a lightweight abstraction layer to full middleware. For companies where technology is not core to the business, and in many cases even for ones where it is, developing and managing this layer distracts from delivering products to the market. I have personally experienced this at companies like Yahoo! and FedEx.

Companies aware of the pros and cons of building middleware know that it takes a great deal of care and effort to maintain this layer for distributed systems. But even at the best companies, the team supporting the middleware eventually grows larger than the teams building the apps and solutions that drive core business value. Smart organizations solve this through an application development platform, which provides all of the abstraction needed so that resources stay focused on developing business apps while costs are kept to a minimum.

CDAP is the middleware for Hadoop you would have built yourself, if you wanted to invest four years and 50 engineers to do it. At Cask we developed middleware that integrates seamlessly with big data infrastructure while exposing simpler, performant APIs on top for developers. This layer has been hardened over the years by the collective knowledge gained from solving many different big data use cases and patterns for clients across many verticals.

You can use CDAP to build:

  • ETL ingestion
  • Network analytics
  • Fraud analytics
  • Customer 360
  • Analytics as a service
  • Data as a service
  • Data science as a service
  • and any other typical use case that big data can solve.

This middleware includes many integrations with the underlying infrastructure, enabling businesses to develop higher-level functions at the top of the stack. These higher-level functions let developers focus on business logic rather than infrastructure integration. CDAP allows you to concentrate on your core business solutions and leaves the integration and complexity to experts in the Hadoop industry.

Mike Olson, in his blog, alludes to the fact that the elephant needs to be hidden. It needs to be hidden for Hadoop to become a mature technology that is widely adopted by businesses, and you cannot hide it while everything within it remains exposed. The big data industry needs middleware to hide the elephant, and CDAP is that layer.
