• Follow Curiously Persistent on WordPress.com
  • About the blog

    This is the personal blog of Simon Kendrick and covers my interests in media, technology and popular culture. All opinions expressed are my own and may not be representative of past or present employers
  • Subscribe

  • Meta

Data should be used as evidence and not illustration

I read the Guardian article on journalist’s struggles with “data literacy” with interest. The piece concentrates on inaccurate reporting through a lack of understanding of numbers, and the context around them. “Honest mistakes”, of a sort.

Taken more cynically, it is an example of a fallacy that I see regularly in many different  disciplines (I’m loath to call it a trend as, for all I know, this could be a long-standing problem) – fitting data around a pre-constructed narrative, rather than deducing the main story from the available information.

This is dangerous. It reduces data to be nothing more than anecdotal support for our subjective viewpoints. While Steve Jobs may have had a skill for telling people what they really wanted, he is an exception rather than the rule. We as human beings are flawed, biased and incapable of objectivity.

Given the complexity of our surroundings, we will (probably) never fully understand how everything fits together – this article from Jonah Lehrer on the problems with the reductionist scientific method is fascinating. However, many of us can certainly act with more critical acumen that we currently do.

This is as incumbent on the audience as it is the communicator – as MG Siegler recently wrote in relation to his field of technology journalism, “most of what is written… is bullshit”, and readers should utilise more caution when taking news as given.

Whether it is due to time pressures, lack of skills, laziness, pressure to delivery a specific outcome of otherwise, we need to avoid this trap and – to the best of our abilities – let our conclusions or recommendations emerge from the available data, rather than simply use it to illustrate our subjective biases.

While I am a (now no more than an occasional) blogger, I am not a journalist and so I’ll limit my potential criticisms of that field. However, I am a researcher that has at various points worked closely with many other disciplines (some data-orientated, some editorial, some creative), and I see this fundamental problem reoccurring in a variety of contexts.

When collating evidence, the best means to ensure its veracity is to collect it yourself – in my situation, that would be to conduct primary research and to meet the various quality standards that would ensure a reliable methodology, and coherent conclusions

Primary research isn’t realistic in many cases, due to limited levels of time, money and skills. As such, we rely on collating existing data sources. This interpretation of secondary research is where I believe the problem of illustration above evidence is most likely to occur.

There are two stages that can help overcome this – critical evaluation of sources, and counterfactual hypotheses.

To critically evaluate data sources, I’ve created a CRAP sheet mnemonic that can help filter the unusable data from the trustworthy:

  • Communication – does the interpretation support the actual data upon scrutiny? For instance, people have been quick to cite Pinterest’s UK skew to male users as a real difference in culture between the UK and US, rather than entertain the notion that UK use is still constrained to the early adopting tech community, whereas US use is – marginally – more mature and has diffused outwards
  • Recency – when was the data created (and not when was it communicated)? For instance, I’d try to avoid quoting 2010 research into iPads since tablets are a nascent and fast-moving industry. Data into underlying human motivations is likely to have a longer shelf-life. This is why that despite the accolades and endorsements, I’m loath to cite this online word of mouth article because it is from 2004 – before both Twitter and Facebook
  • Audience – who is the data among? Would data among US C-suite executives be analogous to UK business owners? Also, some companies specialising in PR research have been notoriously bad at claiming a representative adult audience, when in reality they are usually a self-selecting sub-sample
  • Provenance – where did the data originally come from? In the same way as students are discouraged from citing Wikipedia, we should go to the original source of the data to discover where the data came from, and for what purpose. For instance, data from a lobby group re-affirming their position is unlikely to be the most reliable. It also helps us escape from the echo chamber, where myth can quickly become fact.

Counterfactual hypotheses are the equivalent of control experiments – could arguments or conclusions still be true with the absence of key variables? We should look for conflicting conclusions within our evidence, to see if they can be justified with the same level of certainty.  This method is fairly limited – since we are ultimately constrained by our own viewpoints. Nevertheless, it offers at least some challenge to our pre-existing notions of what is and what isn’t correct.

Data literacy is an important skill to have – not least because, as Neil Perkin has previously written about, it is only the first step on the DIKW hierarchy towards wisdom. While Sturgeon’s Law might apply to existing data, we need to be more robust in our methods, and critical in our judgements.  (I appreciate the irony of citing an anecdotal phenomenon)

It is a planner trope that presentations should contain selective quotes to inspire or frame an argument, and I’ve written in the past about how easily these can contradict one another. A framing device is one thing; a tenet of an argument is another. As such, it is imperative that we use data as evidence and not as illustration.

sk

Image credit: http://www.flickr.com/photos/etringita/854298772/