As a data processing platform, Hadoop's popularity today is often attributed to its cost-effectiveness, derived from both the use of commodity hardware and the ability to co-locate work on shared compute and storage resources. Sharing resources allows organizations to maximize the throughput and utilization of a small number of large clusters instead of managing a large number of private, smaller, under-utilized clusters. However, sharing these resources is not an easy task, given the diverse ecosystem of technologies that makes up Hadoop.
Organizations want to share a Hadoop cluster for
- Supporting workloads of multiple customers;
- Supporting different units in a single organization; and
- Supporting multiple LOBs (lines of business) on a single data lake.
On the other hand, developers want to share Hadoop clusters for
- Testing multiple versions of an application;
- Testing the same application on different data sets without reconfiguration; and
- Testing applications from multiple developers at the same time.
While sharing has numerous advantages, it comes with its own set of challenges, such as resource contention and a lack of sufficient isolation and security. Multitenancy is the preferred way of addressing these challenges. In a shared environment, a typical multitenant system provides the following:
- Resource guarantees
- Quota management
- Quality of Service
- Additional (tenant-level) security
Most modern-day distributed systems implement varying degrees of multitenancy. At a bare minimum, multitenancy involves namespacing; namespaces can be considered the foundation of multitenancy. We’ll now look at what individual Hadoop technologies provide in terms of namespacing and what it would take to make them work together coherently.
Namespacing data access
In the Hadoop ecosystem, namespaces have been around since the beginning. In HDFS, the namespace is the hierarchy of files and directories. The HDFS NameNode maintains the directory tree, which effectively defines the data namespace. HDFS 2.x took this a step further with the introduction of Federation, although its main purpose was to solve scaling issues in the NameNode.
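To make this concrete, a dedicated directory subtree can serve as a tenant's slice of the HDFS namespace. This is a minimal sketch that assumes a running HDFS cluster; the path `/tenants/raw-pipeline` and the `pipeline` user and group are hypothetical examples:

```shell
# Create a dedicated directory subtree for the tenant (path is illustrative)
hdfs dfs -mkdir -p /tenants/raw-pipeline/data /tenants/raw-pipeline/tmp

# Restrict the subtree to the tenant's user and group so other
# tenants on the shared cluster cannot read or modify its data
hdfs dfs -chown -R pipeline:pipeline /tenants/raw-pipeline
hdfs dfs -chmod -R 770 /tenants/raw-pipeline
```

Directory permissions are the coarsest isolation mechanism HDFS offers; finer-grained controls (such as HDFS ACLs) can be layered on top where needed.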
Hive has the concept of databases. The entities in a database—such as tables, views, and indexes—are independent of entities in another database.
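A tenant-specific Hive database can be created with a single DDL statement. This sketch assumes the Hive CLI is available; the database name and storage location are illustrative:

```shell
# Create a Hive database scoped to the tenant, with its data stored
# under the tenant's own HDFS location (name and path are illustrative)
hive -e "CREATE DATABASE IF NOT EXISTS raw_pipeline
         LOCATION '/tenants/raw-pipeline/warehouse';"
```

Tables defined in `raw_pipeline` are then addressed as `raw_pipeline.table_name` and do not collide with identically named tables in other databases.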
Namespaces have been supported as first-class citizens in HBase since version 0.96. Before that, tables could only be namespaced by naming convention, using a common prefix for table names.
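With HBase 0.96 or later, a namespace and the tables inside it can be created from the HBase shell. This sketch assumes a running HBase cluster; the namespace, table, and column family names are illustrative:

```shell
# Create an HBase namespace for the tenant, then a table inside it;
# the table is addressed as 'namespace:table'
hbase shell <<'EOF'
create_namespace 'raw_pipeline'
create 'raw_pipeline:events', 'd'
EOF
```

Before 0.96, the equivalent by convention would have been a prefixed table name such as `raw_pipeline_events`, with no enforcement by HBase itself.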
MapReduce—and now YARN—allows administrators to isolate groups of users and jobs from each other using job queues. YARN extends this with the Capacity Scheduler, which can be configured to let multiple tenants share a Hadoop cluster effectively and securely.
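As a sketch, a dedicated queue for a tenant can be carved out in `capacity-scheduler.xml`; the queue name, capacity value, and ACL user below are illustrative:

```xml
<!-- capacity-scheduler.xml: define a dedicated tenant queue
     alongside the default queue (values are illustrative) -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,raw_pipeline</value>
</property>

<!-- Guarantee the tenant queue 20% of the cluster's resources -->
<property>
  <name>yarn.scheduler.capacity.root.raw_pipeline.capacity</name>
  <value>20</value>
</property>

<!-- Only the tenant's user may submit applications to this queue -->
<property>
  <name>yarn.scheduler.capacity.root.raw_pipeline.acl_submit_applications</name>
  <value>pipeline</value>
</property>
```

Jobs submitted to the `raw_pipeline` queue then receive its guaranteed share of cluster resources, independent of load on other queues.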
The typical data applications being built and deployed today involve a multitude of the systems mentioned above, both as processing engines and as data stores. These applications may contain ingestion layers, data processing layers, storage layers, reporting layers, or query layers. A variety of systems fit into each of these layers, resulting in complex combinations. Each of these systems has its own implementation (and interpretation!) of namespaces or multitenancy, as we have just described.
However, developers and operations staff do not get a consistent view of namespaces across all these layers. For example, suppose an organization wants to restrict access to its raw data pipeline by isolating it in its own namespace. To accomplish this, the organization would have to create a custom solution by using the namespace implementations in each of these systems and cobbling them together with a metadata and management layer on top. This solution could potentially involve
- Using a dedicated HDFS directory to store its data processing results and any temporary data;
- Using a dedicated Hive database as a separate view on the data sets when using a schema-on-read approach;
- Using a separate HBase namespace for the data sets it is accessing; and
- Using a dedicated YARN queue to isolate computation.
Such a namespacing solution would need to maintain the metadata mapping between the above entities and make it available during the runtime lifecycle of the application. The organization would also need to apply security policies, maintain integrity, and control access over data in the namespace.
As we can see, namespacing is a difficult problem to solve and is a heavy load for developers and operations staff to lift—especially if addressed on a case-by-case basis. The alternative to building a custom solution is to use a generic data application platform that provides the needed functionality out of the box, such as the open-source Cask Data Application Platform (CDAP).
Namespaces in Hadoop with CDAP
As a first step towards multitenancy, the latest version of CDAP (v2.8.0) provides namespaces as another abstraction for data applications on Hadoop. This abstraction deals with many of the issues around isolating data and applications and greatly simplifies sharing a Hadoop cluster in different use cases. All Applications, Datasets and Streams in CDAP now belong to a namespace. As a way to partition an instance of CDAP, namespacing allows an application, its data, and metadata to be stored independently of another instance of the same application in a different namespace.
In a follow-up post, we will get into the details of how we implemented namespaces and the techniques we used. In the meantime, download our latest CDAP v2.8, which includes namespacing, and explore it using the documentation. Feel free to ask questions or consider helping us develop the platform.