The debate over cloud-based versus on-premises solutions continues to center on cost and speed versus control and security. That debate is all but over for many software applications, as most lines of business from web analytics to CRM now run on SaaS. It seems only natural for BI to follow the same path, but what will drive the change? We argue that companies will build their next-generation BI systems to incorporate big data sources like web-log and social data, and will put those systems in the cloud where that data is generated or captured. In practice, this means most companies will set aside their worries and learn to love the cloud for big BI.

Why?

First, companies will integrate more data into their BI systems as a result of new database technologies and the increasing availability of third-party data. High-throughput databases continue to decline dramatically in cost even as their capabilities increase. For example, Redshift, the new multi-node, column-oriented database from AWS, costs approximately $1,000 per terabyte per year. That is a fraction of the fully loaded cost of on-premises solutions. In addition, you only pay for what you need, with no up-front cost for licenses and servers. As a result, using large volumes of machine, social, and web-log data has not only become cost effective within a BI context, it has become accessible to small and medium-sized companies that previously had neither the scale nor the skills to implement the previous generation of solutions. Moreover, powerful, low-cost BI tools like Birst that target business users (rather than developers) can take advantage of these technologies today, with analyst-friendly interfaces that shorten the path from implementation to business value.
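
As a quick back-of-envelope illustration of that pricing, here is a sketch using only the approximate $1,000 per terabyte per year figure cited above; actual Redshift pricing varies by node type, region, and reservation terms:

```python
# Rough annual Redshift cost at the ~$1,000 per terabyte per year figure
# cited above. Actual pricing varies by node type, region, and whether
# you reserve capacity; there is no up-front license or hardware spend.

PRICE_PER_TB_YEAR = 1_000  # approximate figure from the text

for data_tb in (1, 5, 10, 50):
    annual = data_tb * PRICE_PER_TB_YEAR
    print(f"{data_tb:>3} TB -> ~${annual:>7,.0f}/year (~${annual / 12:,.0f}/month)")
```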

Second, the availability of these new database and big-BI-friendly technologies has opened up a vast array of use cases built on these large data sets. Many of these use cases already exist, but they gain effectiveness and power by incorporating large data sets. These changes will likely drive massive disruption, especially in AdTech, consumer marketing, gaming, and social. For example, why not simply download all the web-log data from the company website and link it directly to the CRM data? This potentially cuts out the analytic and reporting layers of various tech vendors and creates a direct path from analysis in the BI system to action (especially when combined with cutting-edge tools like Tealium’s AudienceStream). Alternately, why buy an application to analyze your web data when you can simply load the data into your BI system using Redshift and have a highly flexible, low-cost, highly scalable environment that supports analysis and modeling?
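
To make that direct path concrete, here is a minimal sketch of the workflow, assuming the web-log events have already been exported to S3. The table names, bucket, join key, and IAM role are placeholders; psycopg2 works here because Redshift speaks the PostgreSQL wire protocol.

```python
import psycopg2

# Connect to a Redshift cluster (endpoint and credentials are placeholders).
conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="bi_user",
    password="...",
)
cur = conn.cursor()

# Bulk-load raw web-log events exported to S3 (bucket, path, and IAM role
# are hypothetical).
cur.execute("""
    COPY weblog_events (visitor_id, page_url, referrer, event_ts)
    FROM 's3://my-bucket/weblogs/2014/01/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    CSV GZIP TIMEFORMAT 'auto';
""")

# Link web traffic directly to CRM accounts: page views per account per day.
cur.execute("""
    SELECT c.account_id,
           DATE_TRUNC('day', w.event_ts) AS visit_day,
           COUNT(*)                      AS page_views
    FROM weblog_events w
    JOIN crm_contacts  c ON c.visitor_id = w.visitor_id
    GROUP BY 1, 2
    ORDER BY 1, 2;
""")
for account_id, visit_day, page_views in cur.fetchall():
    print(account_id, visit_day, page_views)

conn.commit()
conn.close()
```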

Third, these use cases put demands on data transfer that were unimaginable just a few years ago. Machine data and social data, both of which have interesting uses when combined with more traditional BI information, can quickly accumulate to very large and unmanageable sizes. For example, an electronic billboard company that presented at the TDWI conference in Scottsdale last November was generating tens of terabytes of data per month. That scale of data generation is not at all uncommon, and it is happening across many industries.

Not surprisingly, most of these big data sources are now pushed into the cloud as they are collected. Machine data often lives in distributed locations, and the cloud’s omnipresence makes it a convenient caching mechanism. Tag management vendors, which capture web-log data in standardized formats and then send it along to other vendors or back to the company itself as required, often store their data in the cloud for distribution. And most tech vendors operate cloud-based SaaS models; SaaS continues to grow in popularity across all software verticals.

Finally, these use cases often depend on near real-time or real-time execution, which has made latency a major issue.  Lots of web-generated data can provide significant value if acted upon quickly, but its value declines rapidly with the passage of time.  Take the case of travel and entertainment retargeting facilitated by companies like Sojern and Adaramedia.  They pick up travel intent information as prospects browse airline and other travel sites.  Immediate retargeting based on this data can drive good lift, but after a day many of these prospects will have completed booking their trip and the window of opportunity will have closed.

In this ‘big data’ world, setting up the data pipeline requires carefully thinking through latency requirements and data size. I saw a terrific presentation by Jason Grendus of Viki, Inc. a few months back. Viki distributes video content in emerging markets around the world and generates large amounts of web-log data. When Jason arrived at Viki, the data resided in two separate Hadoop clusters in different parts of the world. Viki ended up restructuring its data pipeline to reduce the amount of data transfer required and to dramatically reduce latency.
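
The general pattern behind that kind of restructuring is to summarize raw events close to where they are generated and ship only the much smaller rollups across regions. The sketch below illustrates the pattern only; it is not Viki's actual implementation, and the field names are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Illustrative pattern: roll up raw web-log events near the source, then
# ship the (much smaller) hourly summary instead of the raw logs.

def hourly_rollup(events):
    """events: iterable of (timestamp_iso, country, video_id) tuples."""
    counts = Counter()
    for ts, country, video_id in events:
        hour = datetime.fromisoformat(ts).strftime("%Y-%m-%d %H:00")
        counts[(hour, country, video_id)] += 1
    # Each summary row replaces potentially thousands of raw events.
    return [
        {"hour": h, "country": c, "video_id": v, "views": n}
        for (h, c, v), n in sorted(counts.items())
    ]

raw = [
    ("2014-03-01T07:12:44", "ID", "ep-101"),
    ("2014-03-01T07:48:03", "ID", "ep-101"),
    ("2014-03-01T08:05:10", "PH", "ep-204"),
]
print(hourly_rollup(raw))
```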

Clearly, data now has its own gravity. Moving it around costs money and takes time. If you want to act quickly and engage your customers and prospects as they act, you have to build around where the data lives. For most big BI use cases, that will mean the cloud. As the saying goes, if the mountain will not come to you, you must go to the mountain. That means thinking through a few key elements before you make your next BI selection:

  • Understand the location and size of your current and potential data sources
  • Have a strategy for managing your data pipeline before you start, including
    • Defining the acceptable level of latency
    • Setting a target cost for data transfer (see the back-of-envelope sketch below)
    • Establishing the tools required to process the various types of data you plan to use
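
A back-of-envelope calculator can make the latency and transfer-cost targets concrete. The 1 Gbps link speed and $0.09/GB price below are illustrative assumptions to replace with your own numbers, not quoted rates.

```python
# Rough data-transfer estimate. The link speed and per-GB price are
# illustrative assumptions, not quoted rates.

def transfer_estimate(data_tb, link_gbps=1.0, price_per_gb=0.09):
    data_gb = data_tb * 1_000                     # decimal TB -> GB
    hours = (data_gb * 8) / (link_gbps * 3600)    # gigabits / (Gbps * sec/hour)
    cost = data_gb * price_per_gb
    return hours, cost

for tb in (1, 10, 50):
    hours, cost = transfer_estimate(tb)
    print(f"{tb:>3} TB over a 1 Gbps link: ~{hours:6.1f} hours, ~${cost:,.0f} to move")
```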

So what are the early adopters doing with all these powerful new tools?  Stay tuned for future posts on some ideas to get around this challenge.