Data Strips: Quintiles vs. Box Plots

After seeing a couple of very skewed box plots in a PLOS Biology paper, I got to thinking again about box plot alternatives. Here are four box plots from Figure 1G (rotated to fit better).

Figure excerpt showing four box plots: green, orange, blue, and magenta.

Each box plot represents about 1300 data values, and even at that moderate size, the “outliers” are too numerous to distinguish. Besides that, the extremely short whiskers are hard to parse naturally and should trigger Tukey’s advice: “finding the whiskers so short, … become suspicious [and] see more detail”. So I made this detailed view of the raw data.

Four dot plots of 1300 tiny dots showing skewed distributions.

Each data value is a tiny dot placed on a hexagonal grid to avoid overstriking. A less raw version confines the dots to a violin plot (kernel density estimate) envelope with some overstriking in the densest parts.

Four dot plots of 1300 tiny dots showing skewed distributions.

While the original box plots are working just as designed, I wonder if authors should recognize the situation and choose a view that lets the suspicious reader see more detail. That got me wondering whether a box-plot-like summary could preserve simplicity while representing the skewed data more directly.

Quintile Area strip plots

Naturally I wanted to try all the views in my Data Strips web app, and I came up with another view that might maintain the simplicity of a box plot: a quintile area box plot. That is, each quintile is a box whose height is proportional to its data density. As a result, each box has the same area (subject to lower and upper limits on the height). That would address one of the primary shortcomings of the box plot raised by Nick Desbarats: a thin line can represent as much data as a thick box.
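To make the equal-area idea concrete, here is a minimal Python sketch. It is not the code behind the Data Strips app; the function name `quantile_area_boxes`, the `min_height` and `max_height` parameters, and the choice of the "inclusive" quantile convention are all assumptions made for illustration.

```python
from statistics import quantiles


def quantile_area_boxes(values, n=5, min_height=None, max_height=None):
    """Return (left, right, height) for each of n equal-count boxes.

    Hypothetical helper, not the Data Strips implementation.
    Each box holds 1/n of the data, so height = (1/n) / width gives
    every box the same area unless a height limit clips it.
    """
    data = sorted(values)
    # n - 1 interior cut points; "inclusive" treats the data as the full range
    cuts = quantiles(data, n=n, method="inclusive")
    edges = [data[0], *cuts, data[-1]]
    boxes = []
    for left, right in zip(edges, edges[1:]):
        width = right - left
        # A zero-width box (tied values) would have unbounded density
        height = float("inf") if width == 0 else (1 / n) / width
        if max_height is not None:
            height = min(height, max_height)
        if min_height is not None:
            height = max(height, min_height)
        boxes.append((left, right, height))
    return boxes


# Right-skewed toy data: box heights fall off as the values spread out
skewed = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
for left, right, height in quantile_area_boxes(skewed):
    print(f"{left:7.1f} to {right:7.1f}  height {height:.5f}")
```

Passing `n=4` gives the quartile version. When a height limit applies, the clipped box no longer has the same area as the others, which is the trade-off the limits introduce.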

I updated the app to include Quintile Area and Quartile Area strips; the latter should closely parallel a box plot without outliers. Here are all 16 views of the first box plot (green) above.

16 different data strip views of the same skewed distribution.

The new area plots (near the bottom just above the box plots) do indeed give a stronger indication of the skewness. The rightmost quintile box is height-constrained in this case, which violates the equal-area goal, but could be accounted for by a careful reader.

Regarding the other views: while I’m starting to see the folly of my inventions based on shortest-half metrics (generally too sensitive), they do a decent job here and even capture the smaller concentration of values at the low end. And I still like my “HDR 5/50/90” in general, and it does well here.

Here are all the views for the second (orange) original box plot.

16 different data strip views of the same skewed distribution.

The smattering of data values below the apparent low value is a complication, but the area charts do decently. The Shortest variants are best at dealing with that complication. Also, the outlier rule for “Grubbs Box” eliminated the outliers from the 1.5 IQR rule of “Tukey Box”.
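For reference, the 1.5 IQR rule mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name `tukey_outliers` is hypothetical, quartile conventions differ between tools (the "inclusive" method is assumed here), and the rule used by “Grubbs Box” is not reproduced because its details are not given in this post.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return values outside the fences Q1 - k*IQR and Q3 + k*IQR.

    Hypothetical helper; the quartile convention is an assumption.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# One far-away value among nine closely spaced ones
print(tukey_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

With strongly skewed data, the fences sit close to the box, so a large share of the long tail ends up flagged, which is consistent with the crowded “outliers” in the original figure.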

Finally, here is the third (blue) original box plot. (The fourth is too similar to call out.)

16 different data strip views of the same skewed distribution.

The alternatives also do well for this moderately skewed case, and the outlier rule for “Grubbs Box” helps again.

Normal data

Just as the area box strips were looking promising as a general-purpose distribution view, testing them on random Gaussian data revealed a sobering result. Here are the Quintile Area, Quartile Area and Grubbs Box strips for several differently seeded random Gaussian draws.

Three data strip plots of the same 100 data values: Quintile Area, Quartile Area and Box Plot
Three data strip plots of the same 100 data values: Quintile Area, Quartile Area and Box Plot
Three data strip plots of the same 100 data values: Quintile Area, Quartile Area and Box Plot
Three data strip plots of the same 100 data values: Quintile Area, Quartile Area and Box Plot

“Less is more” strikes again! And also “more is less”, because the greater detail of the area plots gives less of an expectation that the underlying distribution is Gaussian. The heights, being essentially aligned positions on a common baseline, are too strong an encoding channel to ignore, so any variation looks significant. And even when the heights are similar, it’s hard not to rank them.

Comparing the quartile area plot to the box plot in the last group highlights the issue. The fact that the Q2 box has a higher data density is encoded in both versions, by height + inverse width in the area plot and by inverse width in the box plot, but the box plot version somehow looks more Gaussian-compatible.

And it gets worse with small data sets. Here are 13 values from a Gaussian:

Three data strip plots of the same 13 data values: Quintile Area, Quartile Area and Box Plot
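The sampling variation behind these pictures can be checked numerically. The sketch below is an assumption-laden illustration rather than anything from the app: the helper names are hypothetical, quartiles use the "inclusive" convention, and the comparison looks only at the two middle quartile boxes, whose heights would be equal for an exact Gaussian.

```python
import random
from math import log
from statistics import quantiles, stdev


def middle_box_height_ratio(sample):
    """Height of the box below the median divided by the box above it.

    Each box holds a quarter of the data, so height is 0.25 / width and
    the ratio reduces to (q3 - q2) / (q2 - q1). Hypothetical helper.
    """
    q1, q2, q3 = quantiles(sample, n=4, method="inclusive")
    return (q3 - q2) / (q2 - q1)


def ratio_spread(size, trials=200, seed=0):
    """Standard deviation of the log height ratio over many Gaussian draws."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    logs = []
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(size)]
        logs.append(log(middle_box_height_ratio(sample)))
    return stdev(logs)


for size in (13, 100, 1000):
    print(size, round(ratio_spread(size), 3))
```

A spread near zero would mean the two middle boxes almost always look the same height. The smaller the sample, the larger the spread, so unequal heights are the norm for small Gaussian samples rather than a sign of skew.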

These feel like deal-breakers for general usage, but I’m leaving them in the Data Strips app, at least as a cautionary example. I’ve also added the three “bacteria” data sets to the app.

Box Plots redeemed

Though shortcomings still exist, this exercise has given me greater insight into the strengths of box plots. A lower-precision visual encoding (end-to-end lengths instead of side-by-side heights) is a strength instead of a weakness. And in a way, the forced focus on the center via Q2+median+Q3 gives a view of an aspirational Gaussian data distribution. Or another interpretation: the box plot has a high tolerance for “normal” variation and still gives some signal in case of extreme variation.
