Raw Data Provenance

In my never-ending search for a good way to view data distributions, one task is to take apart box plots seen in the wild and compare them with different visualizations. A recent such subject was this chart from the paper “The hidden costs of energy and mobility: A global meta-analysis and research synthesis of electricity and transport externalities.”

The paper summarizes previous papers that study the external costs of electricity production for common fuel types. I was pleasantly surprised to see the raw data behind the chart available for download (given that this is a summary paper using data from other papers), even if it was a table within a Word document.

The data looks quite complete, even including the DOI/URL of the original source of each number.

With only minor data clean-up, I created the following remake chart:

This includes the main idea I had after seeing the original. One downside of box plots is that they can hide the vast differences in counts among different boxes. For instance, there are 67 estimates for coal but only 5 for waste incineration. My treatment here sizes the box width in proportion to the data count to make the disparity more apparent. (Btw, having only 5 data values is why my box doesn’t agree with the original there: different tie-break rules for computing quantiles come into play with such small data sizes.)
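To illustrate the tie-break point, here is a minimal sketch using Python's standard library. The five values are made up for illustration (they are not the waste incineration numbers from the paper), and the two methods shown are just the two conventions that `statistics.quantiles` happens to offer; I don't know which rule the original chart used.

```python
import statistics

# Hypothetical values, NOT taken from the paper's table: five made-up estimates.
estimates = [2.0, 5.0, 9.0, 14.0, 30.0]

# Two common quartile conventions available in the standard library.
exclusive = statistics.quantiles(estimates, n=4, method="exclusive")
inclusive = statistics.quantiles(estimates, n=4, method="inclusive")

print("exclusive quartiles:", exclusive)  # [3.5, 9.0, 22.0]
print("inclusive quartiles:", inclusive)  # [5.0, 9.0, 14.0]
```

The median agrees either way, but the box edges (first and third quartiles) move quite a bit, which is enough to make two box plots of the same five numbers look different.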

However, there is a bigger issue. My version shows a much wider range of estimates than the original (150 vs 25 ¢/kWh). Eventually I noticed the small print at the bottom of the original chart: “Excludes outside values”. What?! Ok, maybe that lets us focus on the median, but it needs better supporting notation. For instance, the X axis title seems wrong to say “Range of Estimates” when it’s not the full range of estimates. And it seems wrong that the figure caption says the box whiskers show the minimum and maximum.

Beyond the charting oddity, I was still puzzled about the wide range of estimates, so I decided to check a few of the original papers, at least for the extreme values. I didn’t get too far because most of the papers are paywalled and too old to have preprint versions available. I put out a query on Bluesky and user Michiel Duvekot came through with the cost table from the paper with the most extreme values:

Picture of a table from the paper 10.1016/j.enpol.2010.06.006

The “ThCo” values are for different types of thermal coal generation; 0.2440 is very different from the summary paper’s value of 157.885, even adjusting for inflation since 2010 when the original paper came out.

The paper with the next most extreme values was from Germany in 1993, and I did find an online version with the table:

The cost values are in pfennigs from 1988, where a pfennig is 1/100th of a Deutsche Mark. Starting from the midrange of the 0.44 to 2.35 range for coal, which is 1.395, converting from 1988 pfennigs to 1988 USD (×1.82) gives $0.0254, and adjusting for inflation to 2018 (×2.124) gives $0.054. Once again, that is very far from the given extreme value of $1.48. Interestingly, that paper is the source of four other extreme values.
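For anyone who wants to retrace the arithmetic, here it is as a short Python sketch. Every number comes from the paragraph above; the variable names are mine, and the two multipliers are simply the ones I used, not values taken from the summary paper.

```python
# All numbers below are the ones quoted in the paragraph above.
low_pf, high_pf = 0.44, 2.35           # coal cost range, in 1988 pfennigs
midrange_pf = (low_pf + high_pf) / 2   # 1.395

PFENNIG_PER_MARK = 100
FACTOR_1988_DM_TO_USD = 1.82           # currency factor as used in this post
FACTOR_1988_TO_2018 = 2.124            # inflation multiplier as used in this post

usd_1988 = midrange_pf / PFENNIG_PER_MARK * FACTOR_1988_DM_TO_USD
usd_2018 = usd_1988 * FACTOR_1988_TO_2018

reported_extreme = 1.48                # value listed in the summary paper's data
ratio = reported_extreme / usd_2018

print(f"1988 USD: {usd_1988:.4f}")
print(f"2018 USD: {usd_2018:.3f}")
print(f"reported value is about {ratio:.0f}x larger")
```

Even if one of my factors is off, it would have to be off by a large multiple to close a gap of that size.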

Admittedly, I’m an amateur here, but the fact that these values are such extreme outliers makes it worth double-checking. So I added a comment on PubPeer, hoping the authors or someone more knowledgeable can chime in (I had to use an anonymous account since I don’t qualify for a verified account).

I don’t want to gloss over the above step of taking the midrange of the estimate range. It wouldn’t surprise me if those estimates have a non-symmetrical distribution and the midrange is not a good representative estimate. And it also speaks to the usefulness of more detail in the shared data. That is, the shared data could also include the source values in their original form, plus any central-estimate calculations, currency conversions, and inflation adjustments.
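As a rough sense of how much the choice of central estimate could matter, here is a small sketch comparing the midrange with the geometric mean of the same 0.44 to 2.35 range. The geometric mean is purely my assumption of one plausible alternative (it would be the natural center if the range were symmetric on a log scale); nothing in either paper says it was used.

```python
import math

low, high = 0.44, 2.35                 # coal range from the 1993 paper, 1988 pfennigs

midrange = (low + high) / 2            # center on a linear scale
geometric = math.sqrt(low * high)      # center on a log scale (assumed alternative)

print(f"midrange:       {midrange:.3f}")
print(f"geometric mean: {geometric:.3f}")
print(f"midrange is {midrange / geometric:.2f}x the geometric mean")
```

The two differ by well under a factor of two, so the choice of central estimate alone doesn't come close to explaining the gap with the reported extreme value.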

With the raw data trail becoming more and more murky, I didn’t pursue other views of this data. However, I’ll add one variation on the original box plot: include a mean indicator (orange diamond).

The mean diamonds provide some visual indication that something is odd with the distributions. Having some of the means appear beyond the whiskers tells you something is left out.
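Here is a minimal sketch of how that can happen, using made-up numbers rather than the paper's data. It assumes the common Tukey convention, where whiskers stop at the last value within 1.5 times the interquartile range of the box; I'm guessing that is what “outside values” means in the original chart, but the paper doesn't say.

```python
import statistics

# Hypothetical skewed data, NOT from the paper: mostly small values, two extremes.
values = [1, 2, 2, 3, 3, 4, 5, 6, 40, 150]

q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr           # assumed Tukey rule for "outside" values

# The upper whisker ends at the largest value that is not "outside".
upper_whisker = max(v for v in values if v <= upper_fence)
outside = [v for v in values if v > upper_fence]
mean = statistics.mean(values)

print("upper whisker:", upper_whisker)   # 6
print("outside values:", outside)        # [40, 150]
print("mean:", mean)                     # 21.6
```

The mean lands far beyond the whisker because it still counts the two hidden values, which is exactly the visual hint the diamonds give.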

