The amount of data in our daily lives has been exploding, and analyzing large data sets, fondly termed Big Data, will be a pivotal point of differentiation between competitors. As the size of data expands, so does the demand to process and analyze it in real time or near real time. In areas like online gaming, digital marketing, and mobile advertising, companies want to deliver targeted marketing to customers in real time or near real time. There are many use cases for leveraging Big Data insights to drive customer behavior and engagement. For example, if you have a social media feed, you might need to analyze it in real time to trigger an action based on immediate sentiment and behavioral analysis; on the other end of the spectrum, you might want to track all of a user’s activities over a period of time to build predictive models that help you understand their lifetime value.

But real-time Big Data processing and analysis tools have limited penetration today. Why? Real-time processing is difficult to achieve because several processes need to work together quickly and efficiently to push the data out in an actionable way:

  1. The application that collects data from the different sources needs to be continually available. A good example of such a tool is Apache Storm, which provides real-time processing capabilities for disparate sources. Companies often feed Storm from Apache Kafka, which provides a unified, high-throughput, low-latency platform for handling real-time data feeds.
  2. Some on-the-fly data cleanup is required. For example, you can use Splunk to provide the indexing and analytics layer for the streaming data.
  3. The data needs to be pushed out quickly (without network latency) into a dashboard tool or a database for BI reporting and analytics. Today, this is typically achieved with a batch-processing BI tool that must be connected to other real-time processing tools, often introducing network latency.
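The three steps above can be sketched as a toy in-memory pipeline: merge events from several sources, clean them on the fly, and push them to a sink. This is purely illustrative; the function names, sources, and sink are invented and do not correspond to any of the tools mentioned above:

```python
import json

def collect(sources):
    """Step 1: merge events from several sources into one stream."""
    for source in sources:
        for event in source:
            yield event

def clean(events):
    """Step 2: on-the-fly cleanup -- drop malformed records and
    normalize field names to lowercase."""
    for raw in events:
        try:
            event = json.loads(raw)
        except ValueError:
            continue  # skip records that are not valid JSON
        yield {key.lower(): value for key, value in event.items()}

def push(events, sink):
    """Step 3: deliver cleaned events to a dashboard/database sink."""
    for event in events:
        sink.append(event)
```

In a real deployment each stage would be a separate, continuously available service (Kafka for collection, a cleanup layer, a BI sink), which is exactly where the connector latency described below comes from.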

The above-mentioned services, along with a few others like Apache Flume and SQLStream, attempt to process and provide the analytics layer for the data in real time. The major drawback of these services is that they are stand-alone tools that require users to manage operational overhead and to set up connectors to transfer data around their BI architecture. This creates three major problems for companies:

  1. Too much time and effort is required to set up and maintain the environment. For example, Kafka and Storm are two platforms that work with real-time data. However, because they are not fully managed services, they require almost twice the time to maintain compared with a managed service, where the overhead of managing the infrastructure is practically zero.
  2. Multiple tools require connectors, which introduce network latency and slow down the source-to-analysis time.
  3. These tools require more than twice as much memory as the actual data being produced and collected. This is because performing the necessary ETL manipulations on the raw data often means creating multiple staging tables, each of which requires additional memory.

As we’ll see, Amazon Web Services’ Kinesis solves many of these issues. Officially announced at the November 2013 re:Invent conference, Amazon Kinesis is a fully managed service for real-time processing of streaming Big Data on a massive scale. Kinesis helps you collect data from hundreds of thousands of sources into one location, where you can filter, group, aggregate, and perform other simple manipulations on the data as it moves from the source to its end location.
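Under the hood, a Kinesis stream is made up of shards, and each incoming record is routed by MD5-hashing its partition key and finding the shard whose hash-key range contains the result. A minimal sketch of that routing, assuming shards cover equal, contiguous ranges (the helper name is ours, not an AWS API):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index the way Kinesis routes
    records: MD5-hash the key, read the digest as a 128-bit integer,
    and pick the shard whose hash-key range contains it
    (equal-sized ranges assumed here)."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(h // range_size, num_shards - 1)
```

Because the hash is deterministic, the same partition key always lands on the same shard, which is how Kinesis preserves per-key ordering while spreading load across shards.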

What are the benefits of using Amazon Kinesis vs. the other competitors in the market?

  1. It’s a fully managed service, so while your data is in the Kinesis stream, you don’t have to worry about maintenance, storage, load balancing, or the networking and processing needed to handle the streaming data at the required throughput. This enables you to:
    1. Be up and running in a couple of minutes with no operational overhead
    2. Scale your processing power up or down within seconds while keeping the data stream flowing
    3. Automatically replicate the streaming data across multiple availability zones in a region and retain it for 24 hours, providing availability and durability as well as backup
  2. You can integrate this data with other AWS services without using connectors, reducing latency. For example, you can easily integrate Kinesis with S3, Glacier, and Redshift for long-term storage, or with EC2 instances and DynamoDB for further data processing.
  3. You have practically unlimited data storage capabilities by leveraging services like Redshift & DynamoDB, even with hundreds of large staging tables.
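The seconds-level scaling in point 1.2 works by splitting (or merging) shards: a busy shard's hash-key range is divided between two child shards, doubling the throughput available to that range. A hedged sketch of the splitting arithmetic; the helper below is hypothetical and not part of the Kinesis API:

```python
def split_shard(start: int, end: int):
    """Split a shard's inclusive [start, end] hash-key range into two
    child ranges at the midpoint, so each child serves half the keys
    (and the stream gains one shard's worth of extra throughput)."""
    mid = (start + end) // 2
    return (start, mid), (mid + 1, end)
```

A full 128-bit hash-key space, for instance, splits cleanly into two halves, and the routing described earlier then sends each record to whichever child range contains its hashed key.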

Despite these benefits, which help solve some of the pressing issues in the current real-time processing landscape, Kinesis comes with a few limitations as well:

  1. Any data that is older than 24 hours is automatically deleted; therefore, you need to ensure that your throughput is provisioned correctly so you can process the data within that timeframe. If you want to work with the data over a longer period of time, you can send it to S3/Glacier for storage or to EC2 or DynamoDB for further processing.
  2. Another drawback is that each Kinesis application performs just a single processing step, so you can’t use Kinesis for complex stream processing unless you chain multiple applications together. If heavy-duty processing is required, you will need to move the data into other AWS services before performing it.
  3. Kinesis supports a maximum size of 50 KB for each data blob submitted in a single PutRecord operation.
  4. Finally, the Amazon Kinesis Client Library, which handles much of the heavy lifting of distributed stream processing, is available only in Java today, limiting the tools users can leverage to build on Kinesis.
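Given the per-record cap in point 3, a producer has to split larger payloads into record-sized pieces before calling PutRecord. A minimal sketch; the chunking helper is our own illustration, not part of any AWS SDK:

```python
MAX_RECORD_BYTES = 50 * 1024  # the 50 KB per-record PutRecord limit noted above

def chunk_payload(data: bytes, max_bytes: int = MAX_RECORD_BYTES):
    """Split an oversized payload into chunks that each fit within a
    single PutRecord call; the consumer reassembles them in order."""
    return [data[i:i + max_bytes] for i in range(0, len(data), max_bytes)]
```

A payload just over two records' worth of bytes, for example, becomes three chunks, and concatenating the chunks recovers the original payload exactly.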

As discussed, there is a broad spectrum of use cases for Big Data applications, spanning from real-time applications to storage and analysis of data over time. As a standalone solution, AWS Kinesis will be most impactful in the former case – for companies focused on taking action based on real-time data insights. Nonetheless, one of the biggest advantages of the AWS platform is that it offers a specialized service for each use case along this spectrum. As a result, even companies more focused on extensive data analysis can combine Kinesis with other AWS services to create a complete, centralized data transfer and processing architecture. This will be most attractive for companies that are already leveraging other AWS services (or intend to do so in the future); otherwise, implementing Kinesis as a standalone AWS service means moving large quantities of data between various locations for additional processing (see our previous blog for a detailed discussion of the challenges this creates).

AWS Kinesis is thus not a one-size-fits-all Big Data solution. Depending on your use case, you will see different returns from the tool. Nonetheless, it offers many advantages over existing tools that make it a viable and affordable option for companies seeking to make sense of the explosion of Big Data generated by modern life.