Don't Do List

When not to use machine learning & artificial intelligence

Too often I hear a CIO say things like “We’re considering moving to data lakes” and “We need a big data strategy for pricing.”
The harsh reality is that most of these ill-informed executives will spend millions on big data and artificial intelligence initiatives before realizing they don’t have enough data, nor “clean” enough data to achieve anything with AI/ML.
By the time they realize this, they will have spent $x00k+ on new data scientist hires, locked themselves into a multi-year, six-figure contract with Azure and/or Snowflake, and wasted countless hours.
Before embarking upon such endeavors, organizations should first consider whether they have adequate data to successfully implement artificial intelligence and machine learning.

Case study: the negative impacts of junk data

I was consulting on an AI project at an internal startup within a large, public company. Without giving away anything confidential: at a high level, the project scope was to develop an ML-based price estimation tool for an industry driven primarily by custom "quotes," with the model trained on a dataset of several million historical prices.
However, after much struggle, it ultimately became evident that upstream issues in how prices were recorded had rendered much of the data inaccurate & unusable. We had to identify alternative (albeit smaller) data sources, both internal and external, to develop a model accurate enough to begin beta testing in the field.
Even large companies struggle with "big data": an uncomfortable truth (albeit inevitable in the early days of any technology) is the incredibly high failure rate of data & AI/ML projects, particularly at the enterprise level.
Nearly 8 out of 10 organizations say their big data projects in the last 12 – 24 months have stalled.

Why the majority of big data projects fail

Why is this happening? A recent survey found that 96% of companies are failing due to issues with their input data. The single point of failure for AI/ML projects is data quality. The high-level challenge is always to extract & transform high-quality training data for a given model, after querying various sources that often include junk data.
Data issues are often unknown to even the owners; the first step to success for any AI project is identifying & solving for “junk” data, whether that’s for business users (e.g. how pricing discounts are recorded in an ERP) or data scientists (e.g. how samples are chosen from a time series).
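As a concrete illustration of what "identifying junk data" can look like in practice, here is a minimal profiling sketch. It assumes a hypothetical pricing table with quote_price and discount_pct columns; the column names and rules are illustrative assumptions, not the actual project's data model.

```python
# Junk-data profiling sketch (hypothetical schema; column names and rules
# are illustrative assumptions, not an actual project's data model).
import pandas as pd

def profile_junk(df: pd.DataFrame) -> dict:
    """Return the share of rows falling into common 'junk' categories."""
    return {
        "missing_price": df["quote_price"].isna().mean(),       # nulls / blanks
        "non_positive_price": (df["quote_price"] <= 0).mean(),  # $0 or negative placeholders
        "duplicate_rows": df.duplicated().mean(),                # redundant entries
        "impossible_discount": (~df["discount_pct"].between(0, 100)).mean(),
    }

if __name__ == "__main__":
    sample = pd.DataFrame({
        "quote_price": [1200.0, 0.0, None, 1200.0],
        "discount_pct": [10, 150, 5, 10],
    })
    print(profile_junk(sample))
```

Even a crude report like this gives you concrete numbers to take back to the business users and data owners, which is usually where the real fix begins.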
"Clean" data can be simply described by the principles of Big Data: volume, velocity, variety, and veracity
tl;dr: Artificial intelligence is best utilized by organizations that have truly big data (e.g. 2.5M historical price points = great candidate for price prediction via AI/ML; 50k orders of cost history = not so much). At these organizations, the biggest challenge (and critical point of failure) for AI projects is "clean" data.
There are several approaches to avoid and remove junk data:
  • Add validation to the data entry UI / software that prevents junk data (see the sketch after this list)
  • Integrate the data entered into daily operations to force users to only enter "real" numbers (don't let your software live in a silo)
  • Remove redundancies in data collection and entry
  • Source data directly from business operations (i.e. use the ERP, not Excel, for real-world data)
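To make the first bullet concrete, below is a hedged sketch of what entry-time validation might look like. The field names and bounds are assumptions for illustration; real rules would come from the business.

```python
# Entry-time validation sketch (field names and bounds are illustrative
# assumptions, not an actual ERP schema).
from dataclasses import dataclass

@dataclass
class QuoteEntry:
    price: float
    discount_pct: float
    quantity: int

def validate_quote(entry: QuoteEntry) -> list[str]:
    """Return the reasons an entry should be rejected before it is saved."""
    errors = []
    if entry.price <= 0:
        errors.append("price must be positive (no $0 placeholders)")
    if not 0 <= entry.discount_pct < 100:
        errors.append("discount must be in [0, 100)")
    if entry.quantity <= 0:
        errors.append("quantity must be a positive whole number")
    return errors

# Example: a junk record is rejected at the point of entry,
# before it can ever pollute a training dataset.
print(validate_quote(QuoteEntry(price=0, discount_pct=110, quantity=1)))
```

Rejecting bad records at the point of entry is far cheaper than trying to clean them out of millions of historical rows later.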
The key to successful AI/ML projects begins with eliminating junk data. Organizations must first analyze the quality and veracity of their data before exploring “big data” undertakings.