Data Is the Biggest Challenge for Predictive Analytics

Traditionally, analytic applications have been largely focused on descriptive analytics (what has happened). More recently we've seen intense interest and big strides in predictive analytics (what will happen) and prescriptive analytics (what should we do about it). This has been spurred on by the tsunami of data available from IoT, social media, digital enterprises, search data, digitized knowledge, and digital media, combined with the tremendous compute power unleashed by grid/cloud computing and advances in machine learning and cognitive computing. The use cases are limited only by the imagination, in areas as diverse as predictive maintenance for equipment, outcomes predictions in medicine, behavior-based customer profiling, fraud detection and prevention, and much, much more. Estimates for the predictive analytics solutions market are generally in the $3bn to $4bn range, with high growth rates.

Within this context, machine learning is growing in importance. Machine learning reduces the need to have highly trained statisticians and data scientists using the analytic tools. Instead, the hard work is done behind the scenes. The system learns and figures out the best algorithms and data to use, rather than a person having to sort through and figure out which data is important. Thus, machine learning thrives on data … lots of data. Generally speaking, the more data points you can feed it, the better the chance that some of them will be relevant. And the more fine-grained, complete, and consistent the data is, the better job machine learning can do. Of course, feed in bad data and you get bad results.

In our research, practitioners said the number one challenge, and by far the biggest amount of work, in making prescriptive analytics work was data hygiene and data preparation. Data from diverse sources is often incomplete, incorrect, inconsistent, and/or out-of-date (not received in time to be useful). Data quality problems are particularly intractable when the person or organization creating the data doesn’t experience (or understand) the consequences of the bad, incomplete, or late data they are creating. Finding, acquiring, cleaning, and normalizing data, filling in missing values, and related activities can consume up to 80 percent of analytic project resources. The time and resources spent cleaning the data often come at the expense of time and resources for analyzing and getting value from the data.
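To make the preparation work concrete, here is a minimal sketch of the kind of cleaning pass described above: dropping stale records, normalizing identifiers, rejecting implausible values, and filling gaps. The sensor records, field names (`device`, `temp_c`, `read_at`), thresholds, and the median-fill rule are all hypothetical choices made for illustration, not a recommended method.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical equipment readings gathered from several sources.
RAW = [
    {"device": " Pump-01 ", "temp_c": "71.5", "read_at": "2024-05-01T10:00:00"},
    {"device": "pump-01", "temp_c": "", "read_at": "2024-05-01T10:05:00"},
    {"device": "PUMP-02", "temp_c": "9999", "read_at": "2024-05-01T10:05:00"},
    {"device": "pump-02", "temp_c": "68.0", "read_at": "2024-04-01T09:00:00"},
]


def clean(records, now, max_age=timedelta(days=1), valid_range=(-50.0, 200.0)):
    """Normalize, validate, and fill gaps; return (clean_rows, issue_counts)."""
    issues = {"missing": 0, "out_of_range": 0, "stale": 0}
    rows = []
    for rec in records:
        read_at = datetime.fromisoformat(rec["read_at"])
        if now - read_at > max_age:
            issues["stale"] += 1  # out-of-date: arrived too late to be useful
            continue
        device = rec["device"].strip().lower()  # inconsistent naming
        raw = rec["temp_c"].strip()
        temp = float(raw) if raw else None
        if temp is None:
            issues["missing"] += 1  # incomplete
        elif not valid_range[0] <= temp <= valid_range[1]:
            issues["out_of_range"] += 1  # incorrect: treat as missing
            temp = None
        rows.append({"device": device, "temp_c": temp, "read_at": read_at})

    # Assumed fill rule: replace missing values with the median of valid ones.
    known = [r["temp_c"] for r in rows if r["temp_c"] is not None]
    fill = median(known) if known else None
    for r in rows:
        if r["temp_c"] is None:
            r["temp_c"] = fill
    return rows, issues


rows, issues = clean(RAW, now=datetime(2024, 5, 1, 12, 0, 0))
print(issues)
```

Even in this toy case, three of the four records need some intervention before they can be used, which hints at why preparation absorbs so much project effort.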

A multibillion-dollar industry has grown around data hygiene (which is not a new problem). Companies will typically focus first on cleaning up their own internal data, since that is what they control, and most have enormous work to do there to start. But companies are increasingly relying on external data for predictive analytics, including data from trading partners, third-party providers (e.g., weather data, financial data, industry-specific data), social media, and public sources, especially for problems that are inherently multi-party and external-facing (like logistics). So, besides their internal cleanup efforts, companies are working to get their trading partners’ data in line, as well as evaluating data quality from third-party sources before and after committing to use them for critical predictive analytics programs.
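One way to evaluate an external source before committing to it is to score a sample of its data on a few quality dimensions. The sketch below does this for completeness and timeliness only. The function name `score_feed`, the shipment feed, its field names, and the 90 percent acceptance threshold are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def score_feed(records, required, now, max_lag):
    """Return completeness and timeliness as fractions between 0 and 1."""
    if not records:
        return {"completeness": 0.0, "timeliness": 0.0}
    # A record is complete when every required field has a non-empty value.
    complete = sum(
        all(rec.get(field) not in (None, "") for field in required)
        for rec in records
    )
    # A record is timely when it arrived within the allowed lag.
    timely = sum(now - rec["received_at"] <= max_lag for rec in records)
    n = len(records)
    return {"completeness": complete / n, "timeliness": timely / n}


# Hypothetical shipment-status feed from a logistics partner.
now = datetime(2024, 5, 1, 12, 0)
feed = [
    {"shipment": "A1", "eta": "2024-05-03", "received_at": now - timedelta(minutes=10)},
    {"shipment": "A2", "eta": "", "received_at": now - timedelta(minutes=20)},
    {"shipment": "A3", "eta": "2024-05-04", "received_at": now - timedelta(hours=30)},
    {"shipment": "A4", "eta": "2024-05-05", "received_at": now - timedelta(hours=1)},
]

scores = score_feed(
    feed, required=("shipment", "eta"), now=now, max_lag=timedelta(hours=24)
)
# Assumed acceptance rule: both scores must reach 90 percent.
accept = scores["completeness"] >= 0.9 and scores["timeliness"] >= 0.9
print(scores, accept)
```

Running the same scorecard periodically after adoption covers the "after committing" half of the evaluation, since a source's quality can drift over time.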

The Outlook

The explosion of new data has created big challenges in managing it all and ensuring cleanliness, completeness, accuracy, and timeliness. Organizations that put a lot of resources into all that “grunt work” to clean up the mess will reap the biggest rewards with successful predictive analytic programs. Third parties that can help with these challenges will be valued: both data providers who gain a reputation for high-quality data, and tools and service providers that can help with the enormous cleanup task. Automation of data hygiene tasks will become increasingly important as the amount of data continues to increase exponentially.