Weblog Analytics on Apache Hadoop™

Hadoop provides specialized tools and technologies that can be used for transporting and processing huge amounts of weblog data. In this blog post, we'll explore the end-to-end process of aggregating logs, processing them, and generating analytics on Hadoop to gain insights about how users interact with your website. With the digitization of the world, generating knowledge

Multitenancy for Hadoop: Namespaces


As a data processing platform, Hadoop's popularity today is often attributed to its cost-effectiveness, derived equally from the use of commodity hardware and from the ability to co-locate work on shared compute and storage resources. Sharing resources allows organizations to maximize the throughput and utilization of a small number of large clusters instead of managing a large

Data-driven job scheduling in Hadoop

Julien Guery

Triggering the processing of data in Hadoop—as soon as enough new data is available—helps optimize many incremental data processing use-cases, but is not trivial to implement. The ability to schedule a job (such as MapReduce or Spark) to run as soon as there's a certain amount of unprocessed data available—for instance, in a set of

CDAP v2.8.0 is out in the wild

I am very happy to announce that the latest release of our flagship product – the Cask Data Application Platform (CDAP) – v2.8.0 is now available for everyone to download. This release has a bunch of cool features that our customers, partners and the community want: Namespaces (provides application and data isolation that enables multi-tenancy)

How we built it: designing a globally consistent transaction engine


At Cask, we are committed to contributing back to the open source community. One of our latest open-sourced projects is Tephra, a system that adds complete transaction support to Apache HBase™. As an XA-style transaction system, Tephra is designed to be agnostic to the underlying data stores, so its usage is not limited to HBase.

The Shift to Realtime Processing: An Easier Way to Build Hadoop Apps

I was inspired by the recent Google I/O talk on Cloud Dataflow, a data processing service used internally at Google, which evolved from a model based on MapReduce and successor stream processing technologies such as MillWheel and FlumeJava. Based on the premise of focusing on your application logic rather than the underlying infrastructure, I set

How we built it: Making Hadoop data exploration easier with Ad-hoc SQL Queries

Please note: Continuuity is now known as Cask, and Continuuity Reactor is now known as the Cask Data Application Platform (CDAP). We are excited to introduce a new feature added in the latest 2.3 release of Continuuity Reactor – ad-hoc querying of Datasets. Datasets are high-level abstractions over common data patterns. Reactor Datasets provide a

Meet Tephra, An Open Source Transaction Engine

Please note: Continuuity is now known as Cask, and Continuuity Reactor is now known as the Cask Data Application Platform (CDAP). Our platform, Continuuity Reactor, uses several open source technologies in the Apache Hadoop™ ecosystem to enable any developer to build data applications. One of the major components of our platform is Apache HBase, a