Commentary

Trained Data: The New Oil

Data is often referred to as the “oil” of AI.  Used correctly, it can help deliver the most relevant consumer experience in a split second, and the seconds saved can often result in significant revenue gained.

The algorithms that deliver these experiences require not just data, but trained data, and refining that data takes time and investment.  As an investor in AI solutions, I typically start with two questions: “Where do you get your data?” and “How do you train your data?”

It may come as a surprise to some marketers that when they decide to use AI-based technology, they must first be sure their data is trained to deliver those experiences. This can involve some heavy lifting, which often includes manual tagging.  It also requires access to large data pools to continually refine the accuracy of predicted outcomes.

It’s no coincidence that the companies with some of the most mature and successful AI implementations are those with huge data sets: Google, Facebook, Amazon, Netflix and the like. Those data sets give them a substantial advantage when it comes to training, because these companies can access and leverage massive user activity to continually improve the accuracy of their algorithms.

This kind of data training is referred to as having a “human in the loop.” Think about Facebook’s facial recognition technology that identifies friends’ faces for tagging.  The algorithm has learned to recognize faces accurately from the billions of photos tagged manually by users. The algorithms of other large players are trained similarly, with user behaviors refining the accuracy of all kinds of predicted outcomes.
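To make the “human in the loop” idea concrete, here is a minimal sketch of the pattern: a model labels what it is confident about, routes uncertain items to human taggers, and folds their answers back into the next round of training. The data, the confidence threshold and the ask_human_to_tag step are illustrative assumptions, not any particular company’s pipeline.

```python
# Minimal human-in-the-loop sketch (illustrative, not any vendor's pipeline).
# The model keeps what it is confident about; uncertain items go to human
# taggers, and their answers are folded back into the training set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def ask_human_to_tag(item):
    """Stand-in for a manual tagging step (e.g., a user tagging a photo)."""
    return int(item.sum() > 0)          # toy 'ground truth' for the demo

# A small seed of manually tagged examples plus a pool of untagged ones.
X_labeled = rng.normal(size=(20, 5))
y_labeled = np.array([ask_human_to_tag(x) for x in X_labeled])
X_pool = rng.normal(size=(200, 5))

CONFIDENCE_THRESHOLD = 0.8              # assumption: below this, ask a human

for round_ in range(3):
    if len(X_pool) == 0:
        break
    model = LogisticRegression().fit(X_labeled, y_labeled)
    confidence = model.predict_proba(X_pool).max(axis=1)
    uncertain = confidence < CONFIDENCE_THRESHOLD

    # Humans tag only the items the model is unsure about.
    new_labels = np.array([ask_human_to_tag(x) for x in X_pool[uncertain]])
    X_labeled = np.vstack([X_labeled, X_pool[uncertain]])
    y_labeled = np.concatenate([y_labeled, new_labels])
    X_pool = X_pool[~uncertain]

    print(f"round {round_}: {uncertain.sum()} items sent to human taggers, "
          f"{len(X_pool)} still untagged")
```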

We are entering an era of data democratization. Those with less access to big data pools have some options: They can buy access to trained data, or walk the longer road of training their own data sets.

A typical approach is to employ teams (sometimes hundreds of people) to manually check and tag huge volumes of images.

Services are now sprouting up to assist with data training.  One such company is Austin-based Alegion, which provides “supervisory” training sets: data sets pre-trained for specific categories, so an algorithm gets a head start on recognizing the most common data before moving on to a client’s proprietary data, which will likely still need to be trained.

Alegion also provides “supervised data training,” drawing on virtual teams of data category specialists, mostly living in countries with more affordable rates, such as India or Ukraine -- sort of like a Mechanical Turk for tagging data.
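The “head start” that pre-trained sets provide is close in spirit to transfer learning: start from a model already trained on broad, common imagery, then fine-tune only its final layer on the client’s smaller proprietary set. Below is a minimal PyTorch sketch of that general pattern (assuming torchvision 0.13+); the ImageNet backbone, category count and client_loader are illustrative assumptions rather than Alegion’s actual workflow.

```python
# Minimal transfer-learning sketch of the "head start" idea (illustrative only):
# reuse a model pre-trained on common imagery, fine-tune just the final layer
# on a client's smaller, proprietary labeled set.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLIENT_CATEGORIES = 12                       # assumption: client's own categories

# Backbone already trained on a broad, common data set (ImageNet here).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in backbone.parameters():
    param.requires_grad = False                  # keep the pre-trained "head start" frozen

# Replace the final layer so it predicts the client's categories instead.
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_CLIENT_CATEGORIES)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def fine_tune(client_loader, epochs=3):
    """client_loader is assumed to yield (image_batch, label_batch) drawn from
    the client's proprietary, manually tagged data."""
    backbone.train()
    for _ in range(epochs):
        for images, labels in client_loader:
            optimizer.zero_grad()
            loss = loss_fn(backbone(images), labels)
            loss.backward()
            optimizer.step()
```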

For some categories like fashion or furniture, where much imagery is similar across providers, there are pre-trained algorithms e-commerce companies can use for visual targeting.  For example, there’s GrokStyle, which serves the furniture retail space.  When a shopper clicks on a white couch, for instance, GrokStyle tees up the retailer’s inventory of other white couches to check out, which increases the chance of a purchase.

While a retailer may need some customized data training to make sure images are paired correctly, most standard images fit the trained data provided, so a furniture retailer can deploy the technology and start reaping the rewards in short order.

Sean Bell, co-founder and CEO of GrokStyle, notes the success of this business concept: “The more broad the training set the better, and the furniture category is broad enough to provide millions of images across the web.  We’ve been training our model since 2015.”
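Under the hood, this kind of “show me more white couches” feature typically comes down to nearest-neighbor search over image embeddings: each catalog image is converted to a vector, and the items whose vectors sit closest to the clicked image are surfaced. The sketch below is a generic illustration of that pattern, not GrokStyle’s actual system; the embedding function and catalog are stand-in assumptions.

```python
# Generic visual-similarity sketch (not GrokStyle's actual system):
# embed catalog images as vectors, then return the nearest neighbors
# of whatever the shopper clicks on.
import numpy as np

def embed_image(image) -> np.ndarray:
    """Stand-in for a real image-embedding model (e.g., a CNN trained on
    furniture imagery). Here it just returns a deterministic random vector."""
    rng = np.random.default_rng(abs(hash(image)) % (2**32))
    return rng.normal(size=128)

# Assumed catalog: product IDs mapped to embedding vectors, computed offline.
catalog = {f"couch_{i}": embed_image(f"couch_{i}") for i in range(1000)}
ids = list(catalog)
matrix = np.stack([catalog[i] for i in ids])
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)   # normalize for cosine similarity

def similar_products(clicked_product_id: str, k: int = 5) -> list[str]:
    """Return the k catalog items whose embeddings are closest to the clicked one."""
    query = catalog[clicked_product_id]
    query = query / np.linalg.norm(query)
    scores = matrix @ query                                # cosine similarity
    best = np.argsort(scores)[::-1]
    return [ids[i] for i in best if ids[i] != clicked_product_id][:k]

print(similar_products("couch_42"))
```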

If you can’t find big data sets that mirror your needs, you might just have to create them. Companies like Neuromation (“where androids dream of electric sheep”) are doing just that by building “synthetic learning environments”: computer-generated simulations used to train algorithms.

As an example, if you want to train an algorithm to recognize a video of a girl doing a cartwheel, you likely wouldn’t have access to the volume of imagery needed for the task.  But Neuromation can generate thousands of simulations of girls cartwheeling, using creative techniques similar to those used in online games.  

This is a big step toward putting smaller players on a more even footing with video data king Google, which has access to an unending supply of video footage.
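At toy scale, the synthetic-data approach looks like the sketch below: render labeled examples programmatically instead of collecting and hand-tagging real imagery, then vary placement, size and other parameters to get the volume and variety a model needs. This is a generic illustration of the technique, not Neuromation’s actual tooling.

```python
# Toy synthetic-data sketch (illustrative of the general technique, not any
# vendor's tooling): render simple labeled scenes instead of collecting and
# hand-tagging real imagery.
import random
from PIL import Image, ImageDraw

SHAPES = ["circle", "square"]        # stand-ins for the classes you want to recognize

def render_example(shape: str) -> Image.Image:
    """Render one synthetic training image for the given label."""
    img = Image.new("RGB", (64, 64), "white")
    draw = ImageDraw.Draw(img)
    x, y = random.randint(8, 32), random.randint(8, 32)   # random placement adds variety
    size = random.randint(12, 24)
    if shape == "circle":
        draw.ellipse([x, y, x + size, y + size], fill="black")
    else:
        draw.rectangle([x, y, x + size, y + size], fill="black")
    return img

# Generate as many labeled examples as the training job needs.
dataset = []
for _ in range(10_000):
    label = random.choice(SHAPES)
    dataset.append((render_example(label), label))

print(f"generated {len(dataset)} labeled images without any manual tagging")
```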

These are just a few examples of how marketers can better prepare data for AI algorithms -- and access the new oil of trained data.

3 comments about "Trained Data: The New Oil".
  1. Mark Milstein from VRmeta, August 31, 2018 at 3:49 a.m.

    Thank you for this really informative and sobering look at all of the hard work and effort it takes to construct truly accurate datasets to train AI. I couldn't have said it better myself, and it is this narrative that lies at the heart of why I created VRmeta: the world's first and only SaaS solution to create those training datasets from video inputs.

  2. Sarah Fay from Glasswing Ventures, August 31, 2018 at 7:50 a.m.

    I really appreciate the feedback, Mark, and I'm glad to have sparked a conversation.  It sounds like you are helping to solve the problem of generating more trained data, which is a good thing. 

  3. Peter Duffy from Mercanto, September 9, 2018 at 7:12 a.m.

    Hi Sarah,

    Thanks for this really interesting article.  Fab to see ML getting more practical/realistic coverage!

    This is a terrific article, but I'd question the point regarding the value of external trained data vs in-house data. 

    Every e-commerce/retail brand has its own datasets, and this is usually the best place to start for training ML models. The average retailer can implement a 'minimum viable dataset' in under a month, depending on consumer engagement and historic data. It can be shorter if consumers are engaged with the brand. 

    First-party data is (of course) the most reliable, and (provided an e-commerce/retail brand is prepared to wait about a month for trained data) I'd suggest that's the best way to go :)

    But I suppose it is not an either/or situation - brands can use their in-house data, and augment it with external data if that proves valuable. 

    V best,

    Peter
    CEO, Mercanto
