Constrained jitter

In data visualization, jittering is a technique to avoid or reduce overstriking by offsetting points that would otherwise be drawn at the same location and obscure one another. When the offsets eliminate overlap entirely, the term dodging is sometimes used. Jittering is most often applied to one-dimensional data, where the offset can be applied in the dimension orthogonal to the data. Sometimes the offset is also applied in the data dimension, as in the classic statistical dot plot codified by Lee Wilkinson and the cutting-edge Dot Density plot from Dan Zvinca. Here are a few examples of jittering using a small data set (penguin weights):

No jitter. Dot transparency provides a hint at density and overlap.
Random jitter. Easy but still susceptible to overstriking, some density indication.
Packed/dodged/beeswarm jitter. Decent density indication.
Wilkinson-style jitter. Some data-dimension offsetting for stack alignment.

These examples show two-sided jitter centered on some horizontal line, but all of the techniques can also apply to one-sided jitter.
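
To make the difference between random and packed jitter concrete, here is a minimal Python sketch. It is not the code behind the example charts: the function names, the greedy placement rule, and the quarter-diameter search step are all assumptions chosen for illustration, and the values are assumed to be already scaled so that one dot diameter means the same distance in both dimensions.

```python
import random

def random_jitter(values, half_width=0.4, seed=0):
    """Two-sided random jitter: one uniform offset per point."""
    rng = random.Random(seed)
    return [rng.uniform(-half_width, half_width) for _ in values]

def beeswarm_offsets(values, diameter=1.0):
    """Greedy two-sided packing (illustrative, not a library algorithm).

    Points are visited in value order and each takes the smallest offset
    (0, +step, -step, +2*step, ...) that keeps its center at least one
    diameter away from every point already placed.
    """
    step = diameter / 4  # search resolution, an arbitrary choice
    placed = []          # (value, offset) of points positioned so far
    offsets = [0.0] * len(values)
    for i in sorted(range(len(values)), key=lambda j: values[j]):
        k = 0
        while True:
            side = 1 if k % 2 else -1
            off = side * ((k + 1) // 2) * step
            if all((values[i] - v) ** 2 + (off - o) ** 2 >= diameter ** 2
                   for v, o in placed):
                break
            k += 1
        placed.append((values[i], off))
        offsets[i] = off
    return offsets
```

Random jitter never looks at neighboring points, which is why it can still overstrike. The packed version guarantees no overlap, but its total width grows with the local density, which is what makes width constraints tempting.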

Constrained jitter example

The above discussion provides minimal context for the topic of dot jitter before looking deeper at this chart that appeared in a recent Nature paper. It’s Figure 4c from “Magnitude of venous or capillary blood-derived SARS-CoV-2-specific T cell response determines COVID-19 immunity”.

The chart is using beeswarm jitter except that the offsets are constrained to the width of the box plot in an irregular way that both obscures the box plot and undermines the task of understanding the density distribution of the data. “Irregular” may be a relative term. The final width is certainly regular. By “irregular,” I mean that each chain of beeswarm jittered points (forming Vs in the example) is compressed separately, so their true relative lengths are no longer seen.

Fortunately, the article includes all of the data to recreate the charts, so we can explore. Without any width constraints, the chart becomes very wide, but you can see the distribution better.

Here’s what I mean by a “regular” constraint on the width, applying the same compression across the entire chart (even on the right-hand group).
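
The two kinds of constraint can be sketched in a few lines of Python. This is only an illustration under assumptions: I don't know how the software behind the original figure compresses its points, and representing the chart as a list of "chains" (each a list of jitter offsets) is a hypothetical data structure, as are the function names.

```python
def compress_per_chain(chains, half_width):
    """'Irregular' constraint: every chain wider than half_width is
    squeezed to fit on its own, so relative chain lengths are lost."""
    out = []
    for chain in chains:
        extent = max((abs(o) for o in chain), default=0.0)
        scale = min(1.0, half_width / extent) if extent > 0 else 1.0
        out.append([o * scale for o in chain])
    return out

def compress_uniform(chains, half_width):
    """'Regular' constraint: one scale factor, taken from the widest
    chain, is applied to every chain, so relative lengths survive."""
    extent = max((abs(o) for chain in chains for o in chain), default=0.0)
    scale = min(1.0, half_width / extent) if extent > 0 else 1.0
    return [[o * scale for o in chain] for chain in chains]
```

With a chain spanning ±4 and another spanning ±2, both constrained to ±1, the per-chain version draws them at the same width, while the uniform version keeps the second at half the width of the first. Either way the dots overlap once the scale factor drops below 1.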

Having given up on avoiding overlapping dots (the primary feature of packed or beeswarm jitter), we can try other jitter techniques such as this “density random” jitter where the offset is random within the bounds of a kernel density function. Here the focus is on the density shape rather than local spikes.
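
One way to read "random within the bounds of a kernel density function" is sketched below in Python. The Gaussian kernel, the fixed bandwidth, and the function names are assumptions for illustration rather than the method used for the chart; for data shown on a log scale, the values passed in would presumably be the logged values.

```python
import math
import random

def density_envelope(values, bandwidth, half_width=0.4):
    """Maximum allowed offset for each point: half_width scaled by the
    Gaussian kernel density at that point relative to the peak density."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)
    dens = [
        sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm
        for x in values
    ]
    peak = max(dens)
    return [half_width * d / peak for d in dens]

def density_random_jitter(values, bandwidth, half_width=0.4, seed=0):
    """Uniform random offset inside the density envelope, so the cloud
    of dots traces a violin-like outline."""
    rng = random.Random(seed)
    bounds = density_envelope(values, bandwidth, half_width)
    return [rng.uniform(-b, b) for b in bounds]
```

Because every group is capped at the same half_width, the total width is constrained by construction. The price is that dots may overlap, and the outline depends on the bandwidth chosen.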

Digging deeper

While my explorations were mostly focused on the jitter pattern, they revealed an interesting feature of the data: a dense horizontal band of dots near the top. It’s present but not as apparent in the original view.

Looking at the data, we can see they’re all the same value, 21880, which is conspicuously an exact integer in a sea of precise measurements.
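
A pile-up like this is easy to check for once the values are in a list. The Python sketch below is a generic check, not how I found it; the function name is hypothetical, and apart from 21880 the sample numbers are made up for illustration.

```python
from collections import Counter

def pileup_at_max(values):
    """Return the largest value and how many times it occurs."""
    top = max(values)
    return top, Counter(values)[top]

# Made-up sample: only the ceiling value 21880 comes from the data.
sample = [312.4, 21880, 1507.9, 21880, 88.1, 21880]
top, count = pileup_at_max(sample)
print(top, count)
```

In continuous measurements the maximum would normally occur once, so a count well above 1 suggests a ceiling in the measurement process.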

That’s a weird curiosity, at least. I wasn’t too worried about it being consequential to the paper’s findings, since this variable was one they found not to be significant anyway. If the values are an error, then at best correcting them might make the variable significant and add a new finding to the paper, but at worst they might indicate some broader data error.

I contacted the authors just in case it was relevant, and I got a quick and thorough response. 21880 was the maximum result from the measurement process, and several samples hit that limit. Once they discovered the issue, they adjusted their process with a different dilution to allow higher results. There is mention of it in the paper’s Methods section:

Samples that recorded values above the limit of quantification were re-run at 1:1000 dilution.

Scurr, M.J., Lippiatt, G., Capitani, L. et al. Magnitude of venous or capillary blood-derived SARS-CoV-2-specific T cell response determines COVID-19 immunity. Nat Commun 13, 5422 (2022).

So basically, some of the values were censored.
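
As a small illustration of what right-censoring does to a summary statistic, here is a Python sketch. The function name and the "true" values are made up; only the limit of 21880 comes from the data.

```python
from statistics import mean

def censor_right(values, limit):
    """Clip values at an upper limit and flag the ones at or above it."""
    observed = [min(v, limit) for v in values]
    censored = [v >= limit for v in values]
    return observed, censored

# Hypothetical true values, two of which exceed the limit.
true_values = [100.0, 5000.0, 30000.0, 45000.0]
observed, censored = censor_right(true_values, 21880.0)
print(observed, censored)
print(mean(observed), mean(true_values))
```

The mean of the recorded values can only be at or below the mean of the true values, so a censored group's mean is a lower bound rather than an estimate.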

Stepping back

From a data visualization point of view, it’s interesting to wonder about other views. I think the irregular jitter compression and the obscured box plots are inadvertent visual flaws, but I don’t know enough about the domain to know about the other decisions.

Why a log scale? Presumably, the process of T-cell growth is multiplicative. Then again, the analysis appears to have been done on raw values rather than logged values.

Why show dots when there is that much overlap? I wondered if it was just a standard practice not intended for this many points, but at least it shows the cardinality difference between the two groups.

Though their R code is shared, I’m not familiar enough with R to match which part of the analysis goes with this figure, so I’m not sure if the box plots relate to the analysis or not. (My excerpt above doesn’t show it, but there are also statistical significance indicators with the charts.)

Just for curiosity, here’s what the data looks like without a log scale and with an ANOVA-style means comparison.
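
For readers who want the arithmetic behind a means comparison, the textbook one-way ANOVA F statistic can be computed as in the Python sketch below. This is a generic formula with a hypothetical function name and made-up numbers. It is not the paper's analysis, and it may not match exactly what the chart computes.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    means = [mean(g) for g in groups]
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up groups, for illustration only.
print(one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]))
```

Running the same function on raw and on logged values is one way to see how much the choice of scale matters for the comparison.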
