Commentary

Data Scientists Swim, Surf, Pick And Juggle

As mobile advertising specialists, we depend on the work of our data scientists. They’re the ones who figure out exactly what our data can do for us, by applying algorithms that unlock immense value for us and for our clients.

But what exactly do data scientists do? Does their day consist of diving into deep blue oceans of data, swimming among shoals of algorithms and algebraic coral? Do they pick away at a rock face of solid data, to find diamond insights compressed over millions of years? Do they surf data? Do they juggle it? Tame it? Ingest it?

Given that I have a team of data scientists, I thought it would be interesting to give you a taste of what they do on a daily basis. What are their challenges, and how do they overcome them?

Picking at practicalities

Talking about practicality in data science might seem a contradiction in terms. We think of data as something ethereal and abstract, and science as concrete and unequivocal.

But on a daily basis, our data scientists are confronted with problems that just don’t exist in textbooks. Almost invariably, data folks need to take the best models and use their ingenuity – and statistics, programming and machine learning – to tailor them so that they wrap neatly around the data. What they come out with is in a very real sense new knowledge.

There is also a practicality in that they’re dealing with not just data, but relationships, and their algorithms often have to be fair as well as profitable. To take an extreme example, imagine an algorithm that simply took the most profitable campaign and fed that to publishers constantly. This wouldn’t be fair on the advertisers whose campaigns are never shown, or the publishers whose relationships with said advertisers would be damaged.
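To make the extreme example concrete, here is a minimal sketch in Python. It compares a purely greedy policy with one that guarantees every campaign a floor of impressions before handing the remainder to the most profitable one. The campaign names, profit figures and the `floor_per_campaign` parameter are all invented for illustration; this is one simple way to trade profit for fairness, not a description of any production system.

```python
from collections import Counter


def greedy_allocation(profit_per_impression, impressions):
    """Serve only the single most profitable campaign."""
    best = max(profit_per_impression, key=profit_per_impression.get)
    return Counter({best: impressions})


def fair_allocation(profit_per_impression, impressions, floor_per_campaign):
    """Give every campaign a guaranteed floor, then the rest to the best one."""
    campaigns = list(profit_per_impression)
    reserved = floor_per_campaign * len(campaigns)
    if reserved > impressions:
        raise ValueError("floors exceed the available impressions")
    counts = Counter({name: floor_per_campaign for name in campaigns})
    best = max(profit_per_impression, key=profit_per_impression.get)
    counts[best] += impressions - reserved
    return counts


def total_profit(counts, profit_per_impression):
    """Total profit for a given allocation of impressions."""
    return sum(counts[name] * profit_per_impression[name] for name in counts)


# Hypothetical campaigns and profit per impression (illustrative numbers only).
profits = {"campaign_a": 0.05, "campaign_b": 0.03, "campaign_c": 0.01}

greedy = greedy_allocation(profits, 1000)
fair = fair_allocation(profits, 1000, floor_per_campaign=100)

print("greedy:", dict(greedy), total_profit(greedy, profits))
print("fair:  ", dict(fair), total_profit(fair, profits))
```

The greedy policy earns more in this toy case, but two of the three campaigns are never shown. The fair policy gives up some profit so that every advertiser gets exposure, which is the kind of trade-off described above.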

So as well as the quantitative, really smart algorithms must take into account the qualitative. As well as making people feel great about making money, they need to make advertisers and publishers feel great about each other, too.

Juggling variables

For every platform there are many manufacturers (unless you’re Apple). For every manufacturer there are many devices. For every device there are many models. And the world is made up of many countries, with many ways of characterising their populations, from the simple male/female divide through to age, income and so on.

Statisticians call this a multivariate problem: with so many variables, which of them actually influence outcomes, and to what extent?

Imagine an algorithm that tries to predict what the click-through rate (CTR) will be for a given campaign. What’s more important? The device? The user? The context? The campaign? All of them, some of them, one of them – or none of them? Instructing the algorithm on how to determine the significance of these factors is what the data scientist needs to figure out. Typically they will spend 10% of their time putting together a model and the other 90% preparing the data and deciding which variables to put into it.

Another issue here is that we as human beings are designed to search for patterns. So data scientists also need to be aware that they may be seeing patterns that don’t exist, or that their models are picking up correlation rather than causation.

This is known as the “nuisance correlation” problem, also commonly called spurious correlation. The way to solve it is to develop statistical methods of variable selection, assigning probabilities that a given input is responsible for a proportion of an outcome. It is statistics red in tooth and claw.
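As a rough illustration of screening variables, the sketch below uses a permutation test. This is a deliberately simpler stand-in for the variable-selection methods mentioned above, not the method itself. It shuffles the outcome many times and asks how often a shuffled outcome correlates with the input at least as strongly as the real one does. The data, the variable names and the `n_permutations` and `seed` parameters are all made up for the example, and a low p-value still only indicates association, not causation.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def permutation_p_value(xs, ys, n_permutations=999, seed=0):
    """Share of shuffles whose |correlation| matches or beats the observed one."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # The +1 terms count the observed arrangement as one of the permutations.
    return (hits + 1) / (n_permutations + 1)


# Invented data: one input that tracks the outcome, one that just alternates.
outcome = [float(i) for i in range(20)]
informative = [2 * y + 1 for y in outcome]
noise = [1.0 if i % 2 == 0 else -1.0 for i in range(20)]

p_informative = permutation_p_value(informative, outcome)
p_noise = permutation_p_value(noise, outcome)
print("informative:", p_informative)
print("noise:", p_noise)
```

An input with a very small p-value is worth keeping as a candidate; one whose correlation is easily matched by random shuffles is probably a pattern that isn't really there.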

Swimming with the real-time tide

Finally, real time is another situation in which the “optimal” solution needs to balance sometimes conflicting extremes. In the case of real-time bidding (RTB), the challenge is to make the best decision about whether to bid, and how much to bid, within the time limits that the exchanges impose.

While this presents obvious computational challenges, it also puts algorithms under stress. An algorithm may be able to give a 100% accurate prediction about winning a bid if it has one second to think about it – but one second is far too long. For RTB, we have to work to develop the algorithms that will give the highest possible accuracy given the time constraints.
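One common way to think about accuracy under a deadline is the "anytime" pattern: keep refining an estimate and return the best one available when time runs out. The sketch below is a toy version under stated assumptions. The function name, the `budget_seconds` parameter and the sample data are hypothetical, and a real bidder would be far more involved. The clock is passed in as a parameter so the behaviour can be demonstrated deterministically with a fake clock.

```python
import time


def estimate_within_budget(samples, budget_seconds, clock=time.monotonic):
    """Average as many samples as the time budget allows.

    Returns (estimate, samples_used). At least one sample is always used
    when any are available, so there is always some answer to bid with.
    """
    deadline = clock() + budget_seconds
    total = 0.0
    used = 0
    for value in samples:
        # Check the deadline before each extra sample after the first.
        if used > 0 and clock() >= deadline:
            break
        total += value
        used += 1
    estimate = total / used if used else None
    return estimate, used


def make_ticking_clock():
    """Fake clock for the demo: each call advances time by one unit."""
    state = {"now": -1}

    def clock():
        state["now"] += 1
        return state["now"]

    return clock


# Invented history of past auctions: 1.0 means the bid won, 0.0 means it lost.
history = [1.0, 0.0, 1.0, 1.0, 0.0]

quick = estimate_within_budget(history, budget_seconds=3, clock=make_ticking_clock())
thorough = estimate_within_budget(history, budget_seconds=100, clock=make_ticking_clock())
print("tight budget:", quick)
print("loose budget:", thorough)
```

With the tight budget the function stops after three samples; with the loose one it uses all five. The estimate from fewer samples is rougher, which mirrors the trade-off between accuracy and response time described above.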

This is where you really do need smart data scientists who can get around the practical problems, develop new ways of working, see real patterns in data, and do all of this in real time. So when we wonder whether they juggle the practicalities, pick away at problems, or swim as fast as the currents allow, the answer is: all of the above.

2 comments about "Data Scientists Swim, Surf, Pick And Juggle".
  1. Bruce May from Bizperity, March 18, 2013 at 12:53 p.m.

    Brilliantly explained… I was just struggling with a colleague about how to explain big data to clients. Now I will just forward this blog. Thanks, for the insight.

  2. Brendan Cooper from Adfonic, March 19, 2013 at 8:25 a.m.

    Glad you liked it Bruce! It's worth mentioning that data scientists also need to be able to explain things clearly to non-data scientists, which is how this piece came about....!
