Weblog Analytics on Apache Hadoop™

Nitin Motgi is Founder and CTO of Cask, where he is responsible for developing the company's long-term technology and driving engineering initiatives and collaboration.

Prior to Cask, Nitin was at Yahoo! working on a large-scale content optimization system externally known as C.O.R.E.

Hadoop provides specialized tools and technologies that can be used for transporting and processing huge amounts of weblog data. In this blog, we’ll explore the end-to-end process of aggregating logs, processing them and generating analytics on Hadoop to gain insights about how users interact with your website.

With the digitization of the world, generating knowledge from the raw data aggregated by web servers has become increasingly important, particularly in a commercial context. Use cases range from optimizing the user experience and understanding user behavior to delivering personalized experiences.

The first step in gaining insight into user behavior on a site is instrumenting the web pages. Pages can be instrumented using page tagging (a snippet of JavaScript code that web applications add to every page) to capture user interactions on the site – and the deeper the instrumentation, the more information is captured, producing larger amounts of data. This enormous amount of weblog data presents challenges, such as transport across multiple datacenters and processing at scale to extract insights. The web analytics process is complex, involving analyzing weblogs for details such as URLs accessed, cookies, demographics, locations and date/time. This information is used to analyze website visitors, their usage, and their browsing patterns and behavior.

To understand the complete picture of what is happening on the website, the system should support gathering and processing information in real-time. With a second-by-second view of visitor engagement, one can react instantly to visitor trends. However, not every insight can be computed in real-time – metrics like bounce rate or unique users, for example – so the system also needs to support generating insights in batch. Furthermore, as log data is collected and made available for analysis, it is also necessary to support ad-hoc queries via Apache Hive™, Impala and the like, for exploratory analysis to answer questions that one might not know beforehand.
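To make the batch side concrete, here is a minimal sketch in plain Python (not tied to any of the frameworks discussed here) of how a bounce rate could be computed once weblog events have been grouped into sessions; the session identifiers and event tuples are illustrative assumptions:

```python
from collections import Counter

def bounce_rate(page_views):
    """page_views: list of (session_id, url) tuples from parsed weblogs.

    A session that viewed only one page counts as a bounce;
    bounce rate = bounced sessions / total sessions.
    """
    views_per_session = Counter(session for session, _ in page_views)
    if not views_per_session:
        return 0.0
    bounces = sum(1 for count in views_per_session.values() if count == 1)
    return bounces / len(views_per_session)

# Hypothetical events extracted from the weblogs:
events = [
    ("s1", "/index.html"),
    ("s1", "/products.html"),
    ("s2", "/index.html"),   # s2 viewed only one page: a bounce
    ("s3", "/about.html"),   # s3 viewed only one page: a bounce
]
print(bounce_rate(events))  # 2 of the 3 sessions bounced
```

A batch job (MapReduce, Pig, or Spark) would apply the same logic at scale, with sessionization itself being the harder problem.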

Following is a high-level view of one way to architect a system that generates real-time and batch insights while also supporting ad-hoc analysis.




  1. The website or web application is instrumented to capture different user interactions on the page. These interactions are then recorded in the web server logs.
  2. Transport systems like Apache Flume™ or Apache Kafka™ are configured to extract entries from the log files in real-time and transport them to a centralized Hadoop cluster.
  3. To support batch processing for generating insights, the data is batched and written to HDFS.
  4. For real-time processing, the data is fed directly into a processing system such as Apache Storm™ or Apache Spark™.
  5. Workflows using MapReduce, Apache Pig™ or Apache Spark™ are created to cleanse the log data and periodically generate insights. These scheduled jobs analyze the logs along various dimensions and extract results, which are written back to HDFS by default; storage implementations for other repositories such as Apache HBase™ or MongoDB can also be used.
  6. Real-time aggregates are stored in HBase. The insights in HBase are also available for creating real-time dashboards.
  7. Hive is set up to expose the raw weblogs and the output of the data analysis for access using SQL. Schemas for the weblogs and insights have to be modeled and maintained.
  8. Reporting tools can then access the results or perform exploratory analysis on the weblog data using widely available tools.
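As an illustration of step 2 above, a Flume agent that follows the access log and lands batches on HDFS could be configured roughly as follows. This is a sketch: the agent name, file paths, and HDFS URL are placeholder assumptions, not values from this article.

```properties
# Hypothetical Flume agent: tail the web server log and land batches on HDFS.
weblog-agent.sources = access-log
weblog-agent.channels = mem-channel
weblog-agent.sinks = hdfs-sink

# Source: follow the access log as new lines are appended.
weblog-agent.sources.access-log.type = exec
weblog-agent.sources.access-log.command = tail -F /var/log/httpd/access_log
weblog-agent.sources.access-log.channels = mem-channel

# Buffer events in memory between source and sink.
weblog-agent.channels.mem-channel.type = memory
weblog-agent.channels.mem-channel.capacity = 10000

# Sink: write batched events to a date-partitioned HDFS directory.
weblog-agent.sinks.hdfs-sink.type = hdfs
weblog-agent.sinks.hdfs-sink.channel = mem-channel
weblog-agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/weblogs/%Y-%m-%d
weblog-agent.sinks.hdfs-sink.hdfs.fileType = DataStream
weblog-agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
```

The date escape sequences in `hdfs.path` give the batch jobs in step 5 natural daily partitions to operate on.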

The architecture referenced above requires integrating multiple different frameworks and technologies. It also requires specialists in different technologies to set it up and operate it on an ongoing basis. To mitigate the complexity and reduce the cost of doing so, you want to use tools that provide an integrated platform experience, such as the open-source Cask Data Application Platform (CDAP).


Weblog Analytics on Hadoop with CDAP 

CDAP has everything needed to support the described architecture: ways to ingest data, process it in real-time or batch, run ad-hoc SQL queries, and serve the results natively on Hadoop.

You can use the following tutorial to try it out in just ten minutes.

You can start by setting up Standalone CDAP on your laptop or Distributed CDAP on a Hadoop cluster. After installation, use the CLI provided with CDAP to configure and manage ingestion and the CDAP application.

  • Create a Stream to collect weblogs on HDFS in real-time or batch.
> create stream logEventStream
  • The created Stream exposes an HTTP endpoint that is ready to accept weblog data. For real-time ingestion, Flume can easily be configured to push data to the Stream directly. For information on how to configure Flume, click here.
> load stream logEventStream <path-to-apache-log>
  • Every Stream is registered with a default schema in Hive, so you can immediately investigate the ingested data by running SQL.
> execute 'select * from stream_logEventStream limit 2'
  • Once you know the type of data in the Stream – in this case, Apache log format – you can set the schema for the Stream body. In this example we will set the format to Combined Log Format (CLF). You can also set a TTL (Time To Live) for the data in the Stream.
> set stream format logEventStream clf
> set stream ttl logEventStream 86400
  • After you have set the schema for the Stream body to CLF, you can see schema-on-read in action: run the SQL query once more to see the results parsed based on the specified schema.
> execute 'select * from stream_logEventStream limit 2'
  • Now that weblog data is flowing into the Stream and you can do ad-hoc analysis, we will move on to deploying a CDAP application that performs some basic web analytics in both real-time and batch. The source and documentation on how the application was built are available here. Deploy the application (JAR) into CDAP.
> deploy app <path-to-wise-app-jar>
> describe app Wise
  • Start the real-time and batch processing components of the application, along with the service component that exposes the insights being aggregated in real-time through REST APIs.
> start flow Wise.WiseFlow
> start workflow Wise.WiseWorkflow
> start service Wise.WiseService
  • The deployed application creates instances of the BounceCountStore and PageViewStore Datasets. These instances can be listed and queried using SQL.
> list dataset instances
> execute 'select * from dataset_pageviewstore'
> execute 'select * from dataset_bouncecountstore'
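Setting the Stream format to `clf` means each event body is parsed into Combined Log Format fields at read time rather than at ingest time. A rough Python sketch of what that parsing involves (the regular expression and field names here are illustrative, not CDAP's actual implementation):

```python
import re

# Combined Log Format: host, identity, user, time, request, status,
# bytes, referrer, and user agent.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_clf(line):
    """Parse one Combined Log Format line into a dict, or None if malformed."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None

line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/4.08"')
record = parse_clf(line)
print(record["host"], record["status"])  # 127.0.0.1 200
```

With the schema in place, the SQL queries above return these named columns instead of a single opaque log line.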


That’s it! You now have an end-to-end system for processing weblogs, supported by the ability to also perform ad-hoc analysis on the data anytime. Have fun extending the application to incorporate more analytics!

Analyzing weblogs is very important to many organizations, but it is a challenging task that requires specialized tools due to the large volumes of data and the complexity of the analysis. To benefit most from such a solution, you need to be able to process data in real-time and in batch, execute ad-hoc queries, and more. A typical weblog analysis solution on Hadoop may be quite complex and require making different technologies work together, while an integrated platform such as CDAP can help with the most difficult parts.

