What do we mean when we talk about data-driven products?

Jun 4, 2016

I’ve recently had a number of conversations about what it means to design and build data-driven products. They all started on a similar premise: data-driven is the ‘right’ way of doing things, it’s the future, etc., etc.; so let’s assume we’re all on the same page of the hype cycle and, instead, go a little bit deeper: “what do we mean by data-driven?”

Is a network router a data-driven product? Is a social networking app data-driven? Is an app that finds the nearest taxi for you data-driven? There are probably ways that you could envisage all of these as data-driven products; all of them certainly work with data. However, in many cases, I am not convinced that popular examples of these things actually are data-driven. So what is it about a thing that makes it data-driven?

From the various conversations I had, it turns out that we are all going around with very different understandings of ‘data-driven.’ I think I’ve found three different meanings to date:

Experimentation. This group of people are usually talking about designing and building products via large-scale online tests. The stereotypical example here is picking a green button for your website because conversion metrics are significantly improved over the purple button. Data-driven means making design decisions based on behavioral evidence from users.
Machine Learning. This group of people are talking about building systems that learn from data in order to provide interesting features to users: recommendation, personalized ranking, people-you-may-know, products-you-may-like, etc. While this group is very likely to be conducting online experiments (as above) as well, the ‘data-driven’ part usually refers to the fact that the systems are learning from behavioral data generated by users, whether that data is explicit (ratings, reviews), implicit (clicks, views), or somewhere in between (purchases).
Databases or APIs. The final group (and also, unfortunately, the most common) are building systems that only use data; for example, they may query a database or use public APIs, such as those provided by social networking websites. To them, their system is data-driven because it uses data.

I disagree with the third group. The view that just using data equals ‘data-driven’ makes the whole concept of data-driven completely superfluous and useless — you may as well call it ‘computational software’ or ‘CPU-driven’ and it would probably have the same depth. The first two groups, however, share a critical common theme: they are about behavioral data that was generated as people use the service.

Of course, there is a gray zone between all three of the categories above, and ways that the same product could be built to encompass all three of the groups.
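To make the experimentation meaning a little more concrete, here is a minimal sketch of how the green-versus-purple button decision might be checked. It assumes a simple two-proportion z-test on conversion counts; the function name and the visitor and conversion numbers are hypothetical, made up purely for illustration, and real experimentation platforms typically do considerably more than this.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and visitors for variant A (e.g. purple button)
    conv_b / n_b: conversions and visitors for variant B (e.g. green button)
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of "no difference".
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, not real data.
z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A positive z with a small p-value would suggest that the green button converts better than the purple one; the design decision then rests on observed user behavior rather than on opinion.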

Exploratory Examples

Consider a mobile app that gives you information about public transport around you. A data-based version of this app would pull data from transport operator APIs to give you the latest status updates; perhaps it then filters them to give you location-based results. Nothing really data-driven about that. A data-driven version could, instead, learn what part of the transport network is relevant to you and learn to predict when it is that you travel. It could learn when cycling and walking results are relevant, and when they aren’t. It could detect that you are traveling, and preempt your queries for information about waiting times at your next interchange. Perhaps it could even do away with the need for status updates from transport APIs: it could detect that there are delays automagically by sourcing data from its users’ phones. And how would it know that these data-driven features are working? Well, by conducting data-driven experimentation.
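As a rough illustration of what "learning when you travel" could look like in its simplest form, the sketch below counts a user's past queries per (weekday, hour) slot and treats frequently used slots as likely travel times. Everything here is an assumption for the sake of the example: the function names, the `min_count` threshold, and the sample timestamps are hypothetical, and a real system would likely use a far richer model.

```python
from collections import Counter
from datetime import datetime


def learn_travel_slots(query_times, min_count=3):
    """Return the (weekday, hour) slots with at least min_count past queries."""
    counts = Counter((t.weekday(), t.hour) for t in query_times)
    return {slot for slot, count in counts.items() if count >= min_count}


def should_preempt(now, travel_slots):
    """True if 'now' falls in a slot where the user has habitually travelled."""
    return (now.weekday(), now.hour) in travel_slots


# Hypothetical query history: three Monday mornings and one Saturday afternoon.
history = [
    datetime(2016, 5, 2, 8, 15),
    datetime(2016, 5, 9, 8, 40),
    datetime(2016, 5, 16, 8, 5),
    datetime(2016, 5, 7, 14, 30),
]
slots = learn_travel_slots(history)

print(should_preempt(datetime(2016, 5, 23, 8, 10), slots))  # Monday, 8am slot
print(should_preempt(datetime(2016, 5, 23, 14, 0), slots))  # Monday, 2pm slot
```

The point is not the counting itself but where the signal comes from: the behavior of the person using the app, rather than the operator's API.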

We could repeat this thought exercise for a number of data-rich services.

(Update: another example): Consider a website that provides blogging services to users. A data-based version of this site lets users write posts, subscribe to other users’ posts, and get a feed of (perhaps temporally sorted) recent blog posts by people that have been followed. An initial data-driven version could, instead, recommend who to follow, based on the content of your own posts and the posts that you have liked. But who cares about manually following people on a blogging platform? (What if we had to ‘follow’ directors on Netflix to get their movies?) A more advanced data-driven service could auto-tag your content to allow people to quickly find it, create a relevance-sorted feed of posts, and create digests of not-to-be-missed content, regardless of whether you follow someone or not.

What about a meta-search engine that helps you find and compare flights? The data-based version of this product is an exercise in sourcing flight data from providers, getting you to fill out your trip’s origin, destination, and dates, and giving you a list of flights that you could take. What would the data-driven parts of this system do? Well, that’s a question I’ll be exploring at Skyscanner.