Making decisions: metrics and judgement

The conversation around impact factors and the assessment of research outputs, amplified by the recent ‘splash’ boycott by Randy Schekman, is turning my mind to a different aspect of science – and indeed society – and that is the use of metrics.

We are becoming better and better at producing metrics: more of the things we do are digitised, and by coordinating what we do more carefully we can ‘instrument’ our lives better. Familiar examples might be monitoring household electricity meters to improve energy consumption, analysing traffic patterns to control traffic flow, or even tracking the movement of people in stores to improve sales.

At the workplace it’s more about how many citations we have, how much grant funding we obtain, how many conferences we participate in, how much disk space we use… even how often we tweet. All these things usually have fairly ‘low friction’ instrumentation (with notable exceptions).

This means there is a lot more quantitative data about us as scientists out there than ever before, particularly about our ‘outputs’ and their related citations, mostly with an emphasis on the traditional (often maligned) Impact Factor of journals and increasingly on “altmetrics”. This is only going to intensify in the future.

Data driven… to a point

At one level this is great. I’m a big believer in data-driven decisions in science, and logically this should be extended to other arenas. But on another level, metrics can be dangerous.

Four dangers of metrics

  1. Metrics are low-dimensional rankings of high-dimensional spaces;
  2. Metrics are horribly confounded and correlated;
  3. A few metrics are more easily ‘gamed’ than a broad array of metrics;
  4. There is a shift towards arguments that are supported by available metrics.

The tangle of multidimensional metrics

A metric, by definition, provides a single dimension on which to place people or things (in this case scientists). The big downside is that we know science can be judged as “good” only after evaluating it on many levels; it can’t be judged usefully along any single, linear metric. On a big-picture, strategic level, one has to consider things within the context of different disciplines. Then there is the aspect of ‘science community’ – successful science needs both people who are excellent mentors and community drivers, and the ‘lone cats’ who tend to keep to themselves. Even at the smallest level, you have to have a diversity of thinking patterns (even within the same discipline, even with the same modus operandi) for science to be really good. It would be a disaster if scientists were too homogeneous. Metrics implicitly assume low dimensionality (in the most extreme case, a single dimension) and so, by their very definition, cannot capture this multi-dimensional space.
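To make the dimensionality point concrete, here is a toy sketch in Python. All names, axes, numbers and weights are invented purely for illustration: two very different research profiles collapse onto exactly the same single-number score once a weighted sum is used as ‘the’ metric.

```python
# Toy illustration only: hypothetical profiles scored on four invented axes.
profiles = {
    "scientist_A": {"citations": 90, "mentorship": 10, "community": 10, "originality": 70},
    "scientist_B": {"citations": 40, "mentorship": 90, "community": 90, "originality": 10},
}

# Arbitrary weights standing in for whatever a single composite metric might use.
weights = {"citations": 4, "mentorship": 2, "community": 2, "originality": 2}

def single_metric(profile):
    """Project a multi-dimensional profile onto a single number (a weighted sum)."""
    return sum(weights[axis] * score for axis, score in profile.items())

for name, profile in profiles.items():
    print(name, single_metric(profile))

# Both profiles come out at 540, yet they could hardly be more different;
# the one-dimensional projection has simply thrown that information away.
```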

Clearly, there are going to be a lot of factors blending into metrics, and a lot of those will be unwanted confounders and/or correlation structures that confuse the picture. Some of this is well known: for example, different subfields have very different citation rates; parents who take career breaks to raise children (the majority being women) will often have a different readout of their career through this period. Perhaps less widely considered is that institutions in less well-resourced countries have poorer access to the ‘hidden’ channels of science – its meetings and workshops.

Some of the correlations are hard to untangle. Currently, many good scientists like to publish in Science, Nature and Cell, and so … judging people by their Science, Nature and Cell papers is (again, currently) an ‘informative proxy’. But this confounding goes way deeper than one or two factors; rather, it is a really crazy series of things: a ‘fashion’ in a particular discipline, a ‘momentum’ effect in a particular field, attendance at certain conferences, the tweeting and blogging of papers…

Because of the complex correlation between these factors, people can use a whole series of implicit or explicit proxies for success to get a reasonable estimation of where someone might be placed in this broad correlation structure. The harder question is: why is this scientist – or this project proposed by this scientist – in this position in the correlation structure? What happens next if we fund this project/scientist/scheme?

Gaming the system

I’ve observed that developing metrics, even when one is transparent about their use, encourages more narrow thinking and opens up the ability to game systems more. This gaming is done by people, communities and institutions alike, often in quite an unconscious way. So… when journal impact factors become the metric, there is a bias – across the board – to shift towards fitting the science to the journal. When paper citation numbers (or h-indexes) become the measure by which one’s work is judged, communities that are generous in their authorship benefit relative to others. When ‘excellent individuals’ are the commodity by which departments are assessed, complex cross-holdings of individuals between institutions begin to emerge. And so on.

In some sense there is a desire to keep metrics more closed (consider NICE, who have a methodology but are deliberately fuzzy about the precise details, making it hard to game the system). But this is completely at odds with transparency and the notion of providing a level playing field. I think transparency trumps any efficiency here, and so the push has to be towards a broader array of metrics.

Making the judgement call

One unconscious aspect of using metrics is the way it affects the whole judgement process. I’ve seen committees – and myself sometimes when I catch myself at it – shift towards making arguments based on available metrics, rather than stepping back and saying, “These metrics are one of a number of inputs, including my own judgement of their work”.

One needs to almost read past the numbers – even if they are poor – and ask, “Is the science worth it?” In the worst case, the person or committee making that judgement call will be asked to justify the decision based entirely on metrics, in order to present a sort of watertight argument. But there are real dangers in believing – against all evidence – that metrics are adequate measures. That said, this is the counter-argument to ‘using objective evidence’ and ‘removing establishment bias’ – establishment bias being the very thing that using metrics helps to counter. There has to be a balance.

So what is to be done here? I don’t believe there is an easy solution. Getting rid of metrics exposes us to the risk of sticking with the people we already know and other equally bad processes.

I would argue that:
  • We need more, not fewer, metrics, and to have a diversity of metrics presented to us when we make judgements. This might make interpretation seem more complicated, and therefore harder to judge. And that is, in many cases, correct – it is more complicated and it is hard to judge these things.
  • We need good research on metrics and confounders. At the very least this will help expose their strengths and weaknesses; even better, it will potentially make it possible to adjust for (perhaps unexpected) major influencing factors – a toy sketch of such an adjustment follows this list.
  • We should collectively accept that, even with a large number of somewhat un-confounded metrics, there will still be confounders we have not thought about. And even if there were perfect, unconfounded metrics, we would still have to decide which aspects of this high-dimensional space we want to select; after all, selecting just one area of ‘science’ is, well, not going to be good.
  • We should trust the judgement of committees, in particular when they ‘re-rank’ against metrics. Indeed, if there is a committee whose results can be accurately predicted by its input metrics, what’s the point of that grouping?
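As flagged in the second bullet above, here is a minimal sketch of what ‘adjusting for a confounder’ could look like in the simplest possible case: raw citation counts divided by a subfield baseline, so that papers are compared against their own discipline’s citation habits rather than across disciplines. All numbers are invented, and real bibliometric normalisation is considerably more involved; this only shows the idea.

```python
# Hypothetical subfield baselines: average citations per paper in each field.
field_average_citations = {
    "genomics": 40.0,
    "taxonomy": 5.0,
}

# Invented papers with raw citation counts.
papers = [
    {"title": "paper_1", "field": "genomics", "citations": 60},
    {"title": "paper_2", "field": "taxonomy", "citations": 15},
]

for paper in papers:
    # Field-normalised count: how the paper fares against its own field's baseline.
    normalised = paper["citations"] / field_average_citations[paper["field"]]
    print(paper["title"], round(normalised, 2))

# paper_1 -> 1.5, paper_2 -> 3.0: the taxonomy paper, with a quarter of the raw
# citations, looks far stronger once the field baseline is taken into account.
```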

Acknowledgements

My thinking on this subject has been influenced by two great books. One is Daniel Kahneman’s “Thinking, Fast and Slow”, which I’ve blogged about previously. The other is Nate Silver’s excellent “The Signal and the Noise”. Both are seriously worth reading, for any scientist.

4 Replies to “Making decisions: metrics and judgement”

  1. Good post. You should come to an altmetrics workshop someday; it's a community that is very strongly in tune with you on these points. There are also an increasing number of contributions from bibliometricians looking at exactly these questions of confounders and diversity in the metrics that we currently have access to. They find that some metrics correlate well with others (Mendeley readers correlate well with citations, tweets not at all), that some tell us very different pieces of information, and overall they have been very balanced in not overreaching with claims; it's all very early days at the moment.

    A call for more metrics, used in a balanced and rational way, should also, I feel, be balanced with a remark on the problems with the existing metrics, and the one metric out there that tends to get the most focus in that conversation is the impact factor. Björn Brembs and coauthors have a new contribution on that topic – http://arxiv.org/pdf/1301.3748v3.pdf – which seems to indicate that impact factor is a pretty poor indicator of the value of a research output.

    I think we know as a community that there are many people and institutions out there that are not badly influenced by the impact factor, but there remain a very large number of players who still fixate on this. We need to give people tools to help them break away from this number, as it's just pretty unscientific as an indicator. I don't mind if people pick journal prestige on its own, but please let's just get past the goddamned impact factor.

  2. I think you make a lot of good points, but then draw the wrong conclusions.

    You say

    "When paper citation numbers (or h-indexes) become the measure by which one's work is judged, communities that are generous in their authorship benefit relative to others".

    What I think you should have said is that judging by citations encourages dishonest guest authorship, a problem that has become only too common. Why mince your words? It's a form of cheating.

    The conclusion that we need more metrics, not fewer, seems illogical given your perspicacious comments about the tangle of intercorrelated variables.

    We need NO metrics. We need to read the papers. Metrics inevitably lead to corruption. They are just a lazy shortcut. They are promoted to make money for the vast number of people who hang around the edges of real science, and who do nothing but harm.

  3. The key is to separate out different uses. Use bibliometrics (and altmetrics) at an aggregate level as flawed, proxy indicators (e.g. in science policy research trying to understand the growth and development of disciplines, etc) or even in management of research performing institutions, trying to understand the overall organisational profile.

    But never use them as a measure, even a partial one, of an individual's performance, or to drive individual incentives.

  4. Thanks for citing your sources. I'd heard of "Thinking, Fast and Slow" before but did not think of buying it.

    Luckily it is now available as an audiobook on Audible and I have just bought it. I am still intrigued as to how this book will help me understand your views on metrics. It nevertheless should help my daily commute be more inspiring…
