After seeing a couple of very skewed box plots in a PLOS Biology paper, I got to thinking again about box plot alternatives. Here are four box plots from Figure 1G (rotated to fit better).

Each box plot represents about 1300 data values, and even at that moderate size, the “outliers” are too numerous to distinguish. Besides that, the extremely short whiskers are hard to parse and should trigger Tukey’s advice: “finding the whiskers so short, … become suspicious [and] see more detail”. I made this detailed view of the raw data.

Each data value is a tiny dot placed on a hexagonal grid to avoid overstriking. A less raw version confines the dots to a violin plot (kernel density estimate) envelope with some overstriking in the densest parts.

While the original box plots are working just as designed, I wonder if authors should recognize the situation and choose a view that lets the suspicious reader see more detail. That got me wondering whether a box-plot-like summary could preserve simplicity while representing the skewed data more directly.
Quintile Area strip plots
Naturally I wanted to try all the views in my Data Strips web app, and I came up with another view that might maintain the simplicity of a box plot: a quintile area box plot. That is, each quintile is a box whose height is proportional to its data density. As a result, each box has the same area (subject to lower and upper height limits). That would address one of the primary shortcomings Nick Desbarats sees in the box plot: a thin line can represent as much data as a thick box.
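To make the equal-area idea concrete, here is a minimal sketch of how such boxes could be computed. It is not the Data Strips implementation: the function name `quantile_area_boxes` and the `min_height`/`max_height` parameters are hypothetical, and the quantile method is NumPy's default linear interpolation, which is an assumption.

```python
import numpy as np

def quantile_area_boxes(values, k=5, min_height=None, max_height=None):
    """Return a list of (left, right, height) for k equal-count boxes.

    Height is the data density of each box: (fraction of data) / width,
    so every box that is not clamped has area 1/k.
    Hypothetical helper for illustration only.
    """
    values = np.asarray(values, dtype=float)
    edges = np.quantile(values, np.linspace(0.0, 1.0, k + 1))
    widths = np.diff(edges)
    # Zero-width boxes (heavy ties) have unbounded density.
    with np.errstate(divide="ignore"):
        heights = np.where(widths > 0, (1.0 / k) / widths, np.inf)
    # Optional height limits; a clamped box no longer has area 1/k.
    if min_height is not None:
        heights = np.maximum(heights, min_height)
    if max_height is not None:
        heights = np.minimum(heights, max_height)
    return list(zip(edges[:-1], edges[1:], heights))

# k=5 gives quintile boxes, k=4 gives quartile boxes.
for left, right, height in quantile_area_boxes([0, 1, 2, 3, 4, 100], k=5):
    print(f"[{left:6.1f}, {right:6.1f}]  height={height:.4f}")
```

The clamping step is where the equal-area property can break, which is the situation a height-constrained box runs into.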
I updated the app to include Quintile Area and Quartile Area strips; the latter should closely parallel a box plot without outliers. Here are all 16 views of the first box plot (green) above.

The new area plots (near the bottom just above the box plots) do indeed give a stronger indication of the skewness. The rightmost quintile box is height-constrained in this case, which violates the equal-area goal, but could be accounted for by a careful reader.
Regarding the other views: while I’m starting to see the folly of my inventions based on shortest-half metrics (generally too sensitive), they do a decent job here and even capture the smaller concentration of values at the low end. And I still like my “HDR 5/50/90” in general, and it does well here.
Here are all the views for the second (orange) original box plot.

The smattering of data values below the apparent low value is a complication, but the area charts do decently. The Shortest variants are best at dealing with that complication. Also, the outlier rule for “Grubbs Box” eliminated the outliers from the 1.5 IQR rule of “Tukey Box”.
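For reference, the 1.5 IQR rule behind “Tukey Box” can be sketched as follows. The helper name `tukey_outliers` is hypothetical, and the lognormal sample is synthetic stand-in data chosen only because it is skewed; it is not the data from the paper. The rule used by “Grubbs Box” is not described here, so it is not sketched.

```python
import numpy as np

def tukey_outliers(values, k=1.5):
    """Return (outliers, (low_fence, high_fence)) under the k * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    mask = (values < low) | (values > high)
    return values[mask], (low, high)

# Synthetic skewed sample (assumption), sized like the ~1300-value plots.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=1300)
outliers, fences = tukey_outliers(sample)
print(len(outliers), "of", sample.size, "values fall outside the fences")
```

On skewed data of this size the rule tends to flag many points on the long-tail side, which is why the individual “outliers” become hard to distinguish.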
Finally, here is the third (blue) original box plot. (The fourth is too similar to call out.)

The alternatives also do well for this moderately skewed case, and the outlier rule for “Grubbs Box” helps again.
Normal data
Just as the area box strips were looking promising as a general-purpose distribution view, testing them on random Gaussian data revealed a sobering result. Here are the Quintile Area, Quartile Area and Grubbs Box strips for several differently seeded random Gaussian draws.




“Less is more” strikes again! And also “more is less”, because the greater detail of the area plots gives less of an expectation that the underlying distribution is Gaussian. The heights, being essentially aligned positions on a common baseline, are too strong an encoding channel to ignore, so any variation looks significant. And even when the heights are similar, it’s hard not to rank them.
Comparing the quartile area plot to the box plot in the last group highlights the issue. The fact that the Q2 box has a higher data density is encoded in both versions, by height + inverse width in the area plot and by inverse width in the box plot, but the box plot version somehow looks more Gaussian-compatible.
And it gets worse with small data sets. Here are 13 values from a Gaussian:

These feel like deal-breakers for general usage, but I’m leaving them in the Data Strips app, at least as a cautionary example. I’ve also added the three “bacteria” data sets to the app.
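The seed-to-seed variation described above can also be checked numerically. This is a rough sketch, not the app's code: `quartile_densities` is a hypothetical helper, and the sample size of 1300 for the larger draws is an assumption borrowed from the bacteria data (the 13 comes from the small example above). For an ideal Gaussian the two middle quartile boxes would have equal heights, so their ratio would be 1.

```python
import numpy as np

def quartile_densities(values):
    """Heights of the four quartile-area boxes: 0.25 / box width."""
    edges = np.quantile(values, [0.0, 0.25, 0.5, 0.75, 1.0])
    return 0.25 / np.diff(edges)

for n in (1300, 13):
    for seed in range(4):
        rng = np.random.default_rng(seed)
        draw = rng.normal(size=n)
        h = quartile_densities(draw)
        # Ratio of the two middle box heights; 1.0 would be ideal.
        print(f"n={n:4d} seed={seed}: middle-box height ratio = {h[1] / h[2]:.2f}")
```

How far the ratio strays from 1, especially for the small samples, gives a sense of how much apparent structure the height channel can manufacture from pure noise.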
Box Plots redeemed
Though shortcomings still exist, this exercise has given me greater insight into the strengths of box plots. Their lower-precision visual encoding (end-to-end lengths instead of side-by-side heights) is a strength instead of a weakness. And in a way, the forced focus on the center via Q2+median+Q3 gives a view of an aspirational Gaussian data distribution. Or another interpretation: the box plot has a high tolerance for “normal” variation and still gives some signal in case of extreme variation.
