Beeswarm attack

Here’s another case in the wild of beeswarm jitter using clamped bounds that hide the distribution of the data (earlier Constrained Jitter post). This one has a twist in that a large proportion of the data values are zero.

This is Figure 1b from “Profiling of HIV-1 elite neutralizer cohort reveals a CD4bs bnAb for HIV-1 prevention and therapy,” published in Nature Immunology. The subject matter is foreign to me, but I think the chart shows the performance of 800+ antibodies against six representative HIV-1 pseudoviruses (on the X axis), so there are about 5,000 dots in total.

ChatGPT tells me this chart was likely made with GraphPad Prism, but the user manual is not online, so I can’t verify that. Regardless, I think this width constraint falls into a category of harmful features. Software developers (myself included) will sometimes add features that are more likely to be used in the wrong situation than in the right situation. In this case, though, I’m having a hard time thinking of the right situation for such clamping. If squeezing/overlap is to be applied, I think it should be applied uniformly, so the distribution is not distorted.
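To make the difference between clamping and uniform squeezing concrete, here is a minimal Python sketch. It assumes some layout step has already produced horizontal offsets for each row of dots; the offsets, the `half_width` limit, and all function names are made up for illustration and are not taken from Prism or any other package.

```python
def clamp_offsets(rows, half_width):
    """Clamp every offset to [-half_width, half_width], row by row."""
    return [[max(-half_width, min(half_width, x)) for x in row] for row in rows]


def squeeze_offsets(rows, half_width):
    """Shrink all offsets by one shared factor so the widest row just fits."""
    widest = max((abs(x) for row in rows for x in row), default=0.0)
    if widest <= half_width:
        return [list(row) for row in rows]
    factor = half_width / widest
    return [[x * factor for x in row] for row in rows]


def row_widths(rows):
    """Horizontal extent of each row of dots."""
    return [max(row) - min(row) for row in rows]


# Hypothetical offsets: a crowded row (such as the zero stack) and a sparse row.
rows = [[-8.0, -4.0, 0.0, 4.0, 8.0], [-2.0, 0.0, 2.0]]

clamped_widths = row_widths(clamp_offsets(rows, 3.0))
squeezed_widths = row_widths(squeeze_offsets(rows, 3.0))
print(clamped_widths)   # [6.0, 4.0]
print(squeezed_widths)  # [6.0, 1.5]
```

With clamping, the crowded row ends up looking only 1.5 times as wide as the sparse one. With the shared factor, it keeps its true 4-to-1 ratio, at the cost of more overlap everywhere.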

Aside: Wilkinson beeswarm comment

Even without the overlap, the classic “smiley” beeswarm is not so great for representing distributions. In case the original post ever disappears, I’m including here a blog comment Lee Wilkinson made about beeswarms:

People interested in these plots need to read my Dot Plots [PDF] paper before programming dot plots. [The “beeswarm”] is misleading and should be avoided…. In the vertical version of the “beeswarm” plot, the Y values are placed at their proper locations but the X values are arbitrarily ordered by the Y values. This creates a visual artifact of U-shaped dot stacks that misrepresent the structure of the data. There are also other examples in the “beeswarm” R program that allow the dots to be asymmetric around a vertical center line. This, too, induces a visual artifact. Dot plots need to be a faithful representation of a density (this is a well-defined statistical concept) and need to converge to a population density as sample size increases.

Alternatives

The excess of zeros makes any representation of the distributions challenging. Even so, a box plot is not terrible. You can overlay mean values as in the original if they are important, but the medians communicate the same relative performance if that’s all the means are doing. And you can tell which pseudoviruses have responses that are at least half zeros (the median sits at zero) and that the last pseudovirus is the only one with less than a quarter zeros, since the lower-quartile side of its box is above the origin.
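The way a box plot reveals the share of zeros can be checked with a few lines of Python. This is only a sketch on made-up values, not the paper's data, and the `inclusive` quartile method is my assumption; charting packages differ in how they interpolate quartiles, so the quarter threshold is approximate at the boundary.

```python
from statistics import median, quantiles


def zero_summary(values):
    """Share of zeros plus the two box-plot landmarks that hint at it."""
    zero_share = sum(1 for v in values if v == 0) / len(values)
    q1, _, _ = quantiles(values, n=4, method="inclusive")
    return {"zero_share": zero_share, "q1": q1, "median": median(values)}


# Hypothetical neutralization percentages for two pseudoviruses.
mostly_zero = zero_summary([0, 0, 0, 0, 0, 50, 80, 90])
few_zeros = zero_summary([0, 10, 20, 30, 40, 50, 60, 70])

print(mostly_zero)  # median at zero: at least half the values are zero
print(few_zeros)    # lower quartile above zero: under a quarter are zero
```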

Just to show how problematic it is to show all the dots as a natural dot plot, here’s a (smoothed) dot plot for just one pseudovirus.

Note the dots are overlapping, but the same overlap effect is applied to the entire range.

Using a smooth hexagonal-grid dot placement allows for the zero stack to overflow into three stacks, but it still leaves the dot plot too wide for a panel of six of them.

Here’s a more elaborate idea that treats the zeros as a categorically different response and puts their dots in a separate space below 0%.

Stepping away from the dot plot requirement, here is a stacked bar chart of the counts in each 20% neutralization span, plus a separate group for zero.

In the less-is-more camp, this view provides much less detail than the original but in some ways more information. In particular, we can see

  • the same relative ranking that the original mean lines provided
  • how dominant the zeros are
  • the relative number of zeros for each pseudovirus
  • how the top 20% group is always bigger than the next 20% group
  • more difference between the last pseudovirus and the others
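The counts behind a stacked bar chart like this one can be produced with a small binning step. The sketch below is in Python and uses made-up values; the span edges (lower edge inclusive, 100 folded into the top span, anything at or below zero treated as the zero group) are my assumptions for illustration, as are the label strings.

```python
from collections import Counter

SPANS = ["zero", "0-20%", "20-40%", "40-60%", "60-80%", "80-100%"]


def span_label(value):
    """Assign a neutralization percentage to the zero group or a 20% span."""
    if value <= 0:
        return "zero"
    # Lower edge inclusive (an assumption); 100 is folded into the top span.
    index = min(int(value // 20), 4)
    return SPANS[index + 1]


def span_counts(values):
    """Counts per group, in the stacking order given by SPANS."""
    counts = Counter(span_label(v) for v in values)
    return [counts[label] for label in SPANS]


# Hypothetical values for one pseudovirus.
example_values = [0, 0, 0, 5, 20, 39.9, 85, 100]
example_counts = span_counts(example_values)
print(dict(zip(SPANS, example_counts)))
```

Keeping zero as its own group, rather than letting it fall into the first span, is what lets the chart show how dominant the zeros are.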

Now that I think about it, in the original, almost all of the information being conveyed comes from the mean lines.

At the risk of further exposing my ignorance of the subject matter, I thought it would be interesting to treat the antibodies as the factor (on the X axis).

Here are all 831 antibodies on the X axis ordered by their average neutralization value (blue line) over the six pseudoviruses. The light blue region shows the range of all six neutralization values for each antibody. At least we can see that a good portion of the antibodies show no neutralization at all, which might merit exclusion or other special treatment in subsequent analysis.
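For anyone who wants to rebuild this view, the summary per antibody is just a mean, a minimum, and a maximum, sorted by the mean. Here is a minimal Python sketch; the antibody names and values are invented, and the dictionary layout is my assumption rather than the format of the extracted data.

```python
from statistics import mean


def antibody_profile(values_by_antibody):
    """One row per antibody (mean, low, high), ordered by mean."""
    rows = [
        {"antibody": name, "mean": mean(vals), "low": min(vals), "high": max(vals)}
        for name, vals in values_by_antibody.items()
    ]
    return sorted(rows, key=lambda row: row["mean"])


def no_neutralization(values_by_antibody):
    """Antibodies whose values are zero (or below) for every pseudovirus."""
    return [name for name, vals in values_by_antibody.items() if max(vals) <= 0]


# Hypothetical antibodies, six pseudovirus values each.
example = {
    "ab_1": [90, 80, 0, 70, 95, 100],
    "ab_2": [0, 0, 0, 0, 0, 0],
    "ab_3": [0, 10, 0, 0, 20, 60],
}

profile = antibody_profile(example)
print([row["antibody"] for row in profile])  # ['ab_2', 'ab_3', 'ab_1']
print(no_neutralization(example))            # ['ab_2']
```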

Raw Data

The paper has an elaborate Data Availability statement, mostly regarding the gene data. I think the data for this chart is “available upon request,” which in my experience means “not available.” However, the point coordinates are all in the original chart, so I extracted them for this post and have made the data available on GitHub in the hope that others can explore interesting alternative views.
