“Data is the new oil” is a phrase I’ve heard a few times recently, notably from Gerd Leonhard.
And it has the potential to be true. Data can be extremely powerful and extremely valuable (though I’m a researcher, so I would say that).
Yet few companies seem to be exploiting this to its full potential. The research industry is particularly poor at making the most of available data.
Google is an obvious example of a company on the right track, and there are isolated examples of companies doing interesting things, such as OkCupid’s blog or advanced sabermetrics, which looks to account for every possible variable that can influence a play.
We should seek to capture every possible piece of information around a decision, an action or an opinion. Only then can the context be understood.
There might be issues with analysis paralysis. But Moore’s Law is consistently providing the processing power that minimises this problem.
If data is the new oil, we need a bigger drill
It is a horrible cliché.
But clichés tend to propagate because there is some truth to their message.
And whether icebergs or oil fields, the analogy holds for data.
Most information we hold is taken at face value – what people say.
But formal research with questions and answers is only part of the solution. We don’t live in a vacuum where an individual reply is sacrosanct. There is our social, herd mentality. There is our willingness to be nudged. Randomness. Our subconscious. Our poor recall. Our aversion to admitting certain things. And, of course, our inability to answer certain questions.
I can’t remember who tweeted it but I really liked the comment:
“The problem with the general public is that they have an average IQ of 100”.
Pithy, and possibly a little patronising, but true. We need to account for the biases in our ability to answer questions – whether intentional or not. And we do that by capturing the context. Moving away from just what people say.
This is the Mehrabian principle (the 7-38-55 rule). Where people are conflicted, we should pay the least attention to what they say. Instead, we should put more weight on the non-verbal communication.
What are people doing?
How are they doing it?
Where are they doing it?
When are they doing it?
And possibly even modelling why they do it.
These are the hidden depths. We need to move from the stated to the observed, and to augment the recorded with the inferred.
With research methodologies, there are three main areas that need to develop. “Formal” research companies are already losing ground to analytics in the online area, and this problem will only get worse unless action is taken.
(Non-researchers: Thank you for persevering with this post up until this point. I hope my general point has resonated with you. If not, I’d love to hear why. However, the rest of the blog post concentrates on the operations side of research, and so you may want to call it quits here.)
1. Centralised, historic panel information
Surveys are horribly inefficient, since the same questions are often asked each time. Online panel providers seem reluctant to assume or re-use information from their profiling studies. Reasons I have heard – and my responses – include:
- Information may be out of date – some things will be, but some demographic details will remain constant (like date of birth and, barring the exception to the rule, gender). I’d be fine taking some mesofacts for granted if, for example, the data held had been submitted within the previous 12 months
- Not all panel members will have completed all the profiling questions – surely some information is better than none. Additionally, the questions could be asked only of those for whom the information is missing
- Information needs to be collected within the survey in order to factor into quotas or screeners – this is an artificial technological limitation, and there is no good reason why it can’t be overcome
Once these limitations are overcome, there is scope to do some quite powerful things. A central databank could build powerful profiles of (claimed) consumer behaviour across multiple dimensions. Individual surveys could be coordinated to add to this dataset. For instance, I might have a question on my survey on the brand of TV the respondent owns. If I were incentivised by the panel provider in some way, I would consent to framing this question in a standardised format, so that the data collected can be stored centrally and contribute to further profile information to be reused on additional surveys. Incrementally, extremely rich datasets on panel members can be built up, so that fewer questions need to be asked directly.
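To make the idea concrete, here is a rough sketch of how a survey might decide which profiling questions still need asking. It is only an illustration under my own assumptions: the field names, the `questions_to_ask` function and the profile layout are all hypothetical, and the 12-month window is simply the threshold I suggested above.

```python
from datetime import date, timedelta

# Hypothetical field names; details assumed stable are never re-asked.
STABLE_FIELDS = {"date_of_birth", "gender"}
MAX_AGE = timedelta(days=365)  # the 12-month freshness window suggested above


def questions_to_ask(required, profile, today):
    """Return the required fields that still need asking in the survey.

    `profile` maps a field name to a tuple of (answer, date_collected).
    """
    to_ask = []
    for field in required:
        if field not in profile:
            to_ask.append(field)  # never answered: ask it
            continue
        _, collected = profile[field]
        if field not in STABLE_FIELDS and today - collected > MAX_AGE:
            to_ask.append(field)  # answered, but too long ago: re-ask
    return to_ask


# Made-up example data for one panel member.
profile = {
    "date_of_birth": ("1980-05-01", date(2005, 1, 10)),
    "tv_brand": ("Sony", date(2009, 3, 1)),
    "household_size": (3, date(2010, 6, 15)),
}
required = ["date_of_birth", "tv_brand", "household_size", "income_band"]
print(questions_to_ask(required, profile, today=date(2010, 9, 1)))
# -> ['tv_brand', 'income_band']
```

In this made-up case only two of the four questions are put to the respondent: the TV brand answer is more than a year old, and income band was never collected. The same check could in principle feed quotas and screeners before the survey starts.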
This idea of a central databank also has powerful implications for qualitative recruitment. Information on people tends to be collected informally by local recruiters – a centralised, formal databank comprising answers to all previous screener questionnaires would be a much more efficient – and accurate – form of recruitment.
Clearly, there are data protection issues with this. If consent were sought, and benefits made clear (shorter, more relevant and more varied questionnaires), then it could gain traction. And while it runs contrary to current guidelines, additional consent could result in highly targeted marketing messages – mimicking for instance the advance product testing P&G use their Tremor panel for.
2. Metadata
In online surveys, some metadata is collected but its uses tend to be related to data quality. Examples include:
- Average length of questionnaire
- Ensuring there is a certain gap between survey invitations
- Measuring drop-out rates
- Removing respondents that continually “straight-line” or answer in obvious patterns
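As an illustration of the last of these checks, a straight-lining flag might look something like the sketch below. The function names and the 90% threshold are my own assumptions, not an industry standard, and a real check would also need to catch other obvious patterns.

```python
def is_straight_lined(grid_answers, threshold=0.9):
    """Flag a rating grid where a single answer covers nearly every item."""
    if len(grid_answers) < 2:
        return False  # too short to judge
    most_common = max(grid_answers.count(a) for a in set(grid_answers))
    return most_common / len(grid_answers) >= threshold


def straight_line_rate(grids):
    """Share of a respondent's grids that were straight-lined."""
    if not grids:
        return 0.0
    return sum(is_straight_lined(g) for g in grids) / len(grids)


# Three grids from one hypothetical respondent: two flat, one varied.
grids = [[3, 3, 3, 3, 3], [4, 4, 4, 4, 4], [1, 2, 3, 4, 5]]
print(straight_line_rate(grids))
```

Looking at the rate across several surveys, rather than a single grid, is what separates the respondent who "continually" straight-lines from one who genuinely felt the same about every item once.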
Yet there are many ways in which metadata can be used for analysis – either within the survey itself or for general knowledge on segmenting the types of panel users. Examples include:
- Deriving geodemographic data from IP addresses
- Measuring word count and time spent answering open-ended questions
- Using time spent making decisions (such as in conjoint surveys) to calculate “velocity of opinion”*
- Segmenting those that tend to complete surveys shortly after the initial invitation into a “fast response” panel
- Calculating preferences from the order in which modular surveys are completed (though most surveys remain linear)
*For velocity of opinion to be an effective measure, panel members would need to be segmented based on quality of response (including average time spent answering questions) so that normalised scores can be created, and deltas measured.
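One simple way to produce normalised scores of this kind would be to express each decision time relative to the respondent's own typical pace. The sketch below uses a z-score for that. It is my own assumption about how the normalisation could work rather than an established method, and `normalised_times` and the timings are made up for illustration.

```python
from statistics import mean, stdev


def normalised_times(times):
    """Convert one respondent's decision times (in seconds) to z-scores.

    Needs at least two timings, since stdev is undefined for one value.
    """
    mu, sigma = mean(times), stdev(times)
    if sigma == 0:
        return [0.0 for _ in times]  # every decision took the same time
    return [(t - mu) / sigma for t in times]


# Two hypothetical respondents answering the same four conjoint tasks.
# The second is four times slower throughout, but hesitates on the same task.
fast_reader = [2.0, 3.0, 2.0, 9.0]
slow_reader = [8.0, 12.0, 8.0, 36.0]

print([round(z, 2) for z in normalised_times(fast_reader)])
print([round(z, 2) for z in normalised_times(slow_reader)])
```

Both respondents end up with essentially the same scores, so the final task stands out as the slow decision for each of them regardless of their overall speed. Deltas could then be measured between tasks, or between survey waves, on this common scale.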
This has potential cross-over with cookie-tracking web use panels. Metadata can also be a valuable form of meta-analysis in online communities, used both to encourage participation and to understand individual responses and the community culture as a whole. Someone, probably Tom, once wrote about the types of data that can be collected in a community to help encourage response (primarily through gaming responses), and it really resonated with me.
3. Integrated analysis tools
This is more of a personal request to make my life easier, rather than something that is currently absent or overlooked, but surely multivariate analysis tools could be built into survey scripts and automated. The current system – sending completed datasets to statisticians to run through their proprietary tools – seems incredibly inefficient. If questions are asked in a consistent way, then the analysis method would be consistent and thus could be automated.
This post seems to have turned into a general moan about a lack of innovation in data collection, but my underlying point is that only a small fraction of the data collected is currently used, and used effectively. This seems a shame, but it is also an opportunity since there is room for significant improvement in the way we currently operate.
Image credit: http://www.flickr.com/photos/21734563@N04/2385731147/