All but significant trends

This was the publicity image for the paper “Analysis of 567,758 randomized controlled trials published over 30 years reveals trends in phrases used to discuss results that do not reach statistical significance” by Otte et al. in PLOS Biology. What stands out?

A 5 x 2 grid of scatterplots with fitted lines. One plot per phrase. The Y axis is prevalence. The X axis is years (1990 - 2020). There is one dot every three years.

The chart has a lot going for it, such as multiple clear trends backed by lots of data, and it even has the meta notion of a positive trend for the phrase “a positive trend.” However, I did have a few questions.

  1. Why these 10 phrases?
  2. Why so few dots (not one per year)?
  3. Why are the Y axis scales so different?
  4. Why are all the slopes the same (except for sign)?
  5. Why no curved trends, such as phrases with a peak or a valley?
  6. What does “all but significant” mean?

I could answer most of them by reading the paper.

  1. 505 suspicious phrases were examined, and only those with the most prominent time trends (defined by linear regressions with the highest Bayes factor relative to the null model) were plotted. These were the top 10 by that measure, shown in alphabetical order (see the sketch after this list for how such a ranking might be computed).
  2. The authors binned values in 3-year intervals “to increase the temporal robustness of individual phrase prevalence estimations ….” I’m not sure exactly what that means. I thought a weighted linear regression would handle a few outliers okay, but perhaps the Bayes factor is sensitive to them. I suppose binning the count data is one way to make it more Normal and less Poisson.
  3. This follows mostly from the answer to the first question: the phrases were chosen by the strength of their trends, with no regard for popularity, so the prevalence values vary widely.
  4. The slopes look about the same because each Y axis scale is adjusted so that the data and line bounds fill the graph (with some padding). So, oddly, the only information the lines convey graphically is the direction of each trend, up or down; beyond that, the slopes provide no differentiation.
  5. The “why” is answered by the selection mechanism of the first question, but we can still wonder whether any curved trends exist. We’ll have to look at the data to answer that.
  6. That one requires even more sleuthing, and has a surprising answer ….
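
For the curious, here is a minimal sketch of how that selection might work, assuming the shared data has been loaded into a table with columns phrase, year, count, and n_papers (my names, not necessarily theirs) and substituting a BIC-based approximation for the Bayes factor. The authors’ R code is the authority on the exact computation.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def trend_bayes_factor(sub: pd.DataFrame) -> float:
    """Approximate Bayes factor for a linear trend vs. a flat (null) model."""
    # Bin into 3-year intervals, as the authors did, before fitting.
    binned = sub.groupby(sub["year"] // 3 * 3).sum(numeric_only=True)
    prevalence = binned["count"] / binned["n_papers"]
    x = sm.add_constant(binned.index.to_numpy(dtype=float))
    linear = sm.OLS(prevalence, x).fit()                        # trend model
    null = sm.OLS(prevalence, np.ones(len(prevalence))).fit()   # intercept only
    # BIC approximation to the Bayes factor: BF10 ~= exp((BIC_null - BIC_linear) / 2)
    return float(np.exp((null.bic - linear.bic) / 2))

# df is the hypothetical per-phrase, per-year table described above.
bayes_factors = df.groupby("phrase").apply(trend_bayes_factor)
top10 = bayes_factors.nlargest(10).sort_index()   # strongest trends, alphabetized
```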

Remaking the trends panel

For a data sharing grade, I would give this paper an A+ even though there were a few challenges. I think the authors shared all the data they could and provided nicely organized R code to resolve any ambiguities. The data they presumably couldn’t share was the text of the papers they analyzed. There were a few proprietary-format files in the data set, apparently from the Apple Numbers app, but I think I pieced together what I needed to answer my questions.

I first set out to reproduce the original chart, or something close to it. Here’s my version.

A 5 x 2 grid of scatterplots with fitted lines. One plot per phrase. The Y axis is relative prevalence. The X axis is years (1990 - 2020).

Instead of separate axes, I added the mean prevalence as an annotation in each panel. I kept each panel’s Y axis anchored at zero, so the slopes are more comparable. I used a weighted linear regression (with each year weighted by the number of papers analyzed that year) and didn’t bin the years first. Without the binning, there are more outliers, but I don’t see much harm. The ranking of the best 10 changes somewhat, but these are all still in the top 15 for linear fits.
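
For reference, the unbinned weighted fit for a single phrase looks roughly like this (same hypothetical columns as above; this is a sketch of the idea, not the exact code behind my chart).

```python
import statsmodels.api as sm

# one = the per-year rows for a single phrase, with columns year, count, n_papers.
prevalence = one["count"] / one["n_papers"]           # relative prevalence each year
X = sm.add_constant(one["year"])
fit = sm.WLS(prevalence, X, weights=one["n_papers"]).fit()   # weight by papers scanned
print(fit.params)    # intercept and slope of the fitted trend line
```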

One blemish is the way the regression line goes below zero at the start of “nominally significant.” That’s partly because of the weighting—there were fewer papers analyzed from those early years, so the later increasing prevalence dominates the slope. I suppose I should have used a generalized regression with a different link function to account for the non-Gaussian count data. But why not use my favorite graphical method instead? Here is the same chart but with a spline smoother instead of a regression line. No more negative prevalence!

A 5 x 2 grid of scatterplots with fitted curves. One plot per phrase. The Y axis is relative prevalence. The X axis is years (1990 - 2020).
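
For the record, here are rough sketches of both alternatives: the generalized regression I mentioned (a Poisson model with a log link, using each year’s paper count as the exposure) and a smoothing spline standing in for whatever spline smoother a charting tool provides. Same hypothetical per-year table as before.

```python
import numpy as np
import statsmodels.api as sm
from scipy.interpolate import UnivariateSpline

one = one.sort_values("year")                         # spline fitting needs ordered x
counts, years, papers = one["count"], one["year"], one["n_papers"]

# Option 1: Poisson GLM with a log link and papers-per-year as the exposure,
# which keeps the fitted prevalence positive by construction.
X = sm.add_constant(years - years.mean())             # center years for stability
glm = sm.GLM(counts, X, family=sm.families.Poisson(),
             offset=np.log(papers)).fit()

# Option 2: a weighted smoothing spline on the prevalence values.
prevalence = counts / papers
spline = UnivariateSpline(years, prevalence,
                          w=papers / papers.mean(),   # normalized weights
                          k=3)                        # cubic; smoothing s is tunable
grid = np.linspace(years.min(), years.max(), 200)
smoothed = spline(grid)                               # values to draw as the curve
```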

Embracing curvature

The previous panel already suggests that straight lines aren’t so great for some phrase trends. What if we took that into account when choosing the phrases to show trends for? Here are the top 15 phrase trends ranked by how well they fit a cubic polynomial, graphed with a spline smoother.

A 5 x 3 grid of scatterplots with fitted curves. One plot per phrase. The Y axis is relative prevalence. The X axis is years (1990 - 2020).
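
For concreteness, a cubic-fit ranking along these lines can be sketched as follows, scoring each phrase by the weighted R-squared of a cubic polynomial (the exact fit measure is a judgment call) and using the same hypothetical table as before.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def cubic_fit_score(sub: pd.DataFrame) -> float:
    prevalence = sub["count"] / sub["n_papers"]
    years = sub["year"].to_numpy(dtype=float)
    X = np.vander(years - years.mean(), 4)            # columns: x^3, x^2, x, 1 (centered)
    fit = sm.WLS(prevalence, X, weights=sub["n_papers"]).fit()
    return float(fit.rsquared)                        # how well the cubic explains the trend

top15 = df.groupby("phrase").apply(cubic_fit_score).nlargest(15)
```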

Curiously, “an increasing trend” and “a decreasing trend” are both in the top three and look similar. Several phrases peak in the 2000-2010 range, particularly “a nonsignificant trend” and “a strong trend”.

It’s worth keeping in mind that the original and my remakes are looking at relative changes, and the graphs are not using the same prevalence scales. If we sort by prevalence (occurrences per 100 papers), we see that the phrases are rare overall and only two break the 1 in 100 threshold on average.

A 5 x 1 grid of scatterplots with fitted curves. One plot per phrase. The Y axis is relative prevalence. The X axis is years (1990 - 2020).
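
That overall ranking is just a matter of totaling occurrences against papers scanned; a quick sketch with the same hypothetical table:

```python
# df as in the earlier sketches: columns phrase, year, count, n_papers.
# Mean prevalence as occurrences per 100 papers, one value per phrase.
per_100 = (df.groupby("phrase")
             .apply(lambda s: 100 * s["count"].sum() / s["n_papers"].sum())
             .sort_values(ascending=False))
print(per_100.head())            # the handful of most common phrases
print((per_100 > 1).sum())       # how many average more than 1 per 100 papers
```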

We have to adjust the Y axis a bit to see the curvature in the next 10:

A 5 x 2 grid of scatterplots with fitted curves. One plot per phrase. The Y axis is relative prevalence. The X axis is years (1990 - 2020).

Oddly, “a clear trend” has a rather unclear trend with a strange drop in the late 90s.

Bring on the p-values

While the trend lines make for a nice display, the really interesting thing about this data set is that the authors managed to associate p-values with many of the phrase occurrences. Now we can test whether the 505 suspicious phrases they studied really are associated with suspect p-values. Here are the 15 most commonly occurring phrases with p-value associations.

15 histograms of associated p-values, one per phrase. A red reference line marks the 0.05 p-value.

Overall, there is indeed a tendency for the p-values for these phrases to cluster just above the critical value of 0.05. (Technical note: for these histograms, the bin edge values are included in the lower bin.) However, there are some exceptions. “A significant trend” looks to be not so suspicious after all, with generally low p-values.
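
To make that technical note concrete, here is one way to get bins that are closed on the right, so a p-value of exactly 0.05 lands in the bin below the reference line (the bin width and range are placeholders, and the p-values here are random stand-ins, not the real data).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
pvals = pd.Series(rng.uniform(0.0, 0.15, 500))       # stand-in p-values

edges = np.arange(0.0, 0.1501, 0.005)                # 0.005-wide bins up to 0.15
# pd.cut defaults to right-closed (a, b] intervals, so a value exactly on an
# edge, such as 0.05, is counted in the lower bin.
counts = pd.cut(pvals, bins=edges, include_lowest=True).value_counts().sort_index()
# By contrast, numpy.histogram uses half-open [a, b) bins, which would push
# an exact 0.05 into the bin above the reference line.
```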

The most obvious exception, though, is “all but significant,” which relates to my initial question about what the phrase means. It’s strange that such a vague phrase should be the second most popular phrase in this collection of papers. My first theory was that the matches spanned two clauses, with a comma after the “all,” so I set out to spot-check a few of the papers. Fortunately, the shared data includes PubMed ID numbers for all the papers analyzed.

Mystery solved

While not all of the papers are publicly available, the first four for which I found public text all exhibited the same underlying phrase: “small but significant.” Though the original scan apparently went to some lengths to avoid the situation I first guessed, and probably others like negative prefix words, this one seems to have slipped through: the scan for “all” matched the “all” inside “small.”
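
A toy illustration of the failure mode (not the authors’ actual scanning code): a search for “all but significant” that doesn’t insist on a word boundary also matches inside “small but significant.”

```python
import re

text = "We observed a small but significant increase in response time."
print(re.search(r"all but significant", text) is not None)       # True
print(re.search(r"\ball but significant\b", text) is not None)   # False
# The word boundary (\b) keeps the "all" inside "small" from matching.
```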

I posted a comment to the article’s site (a nice feature of the PLOS journals) and got a response within a few days confirming my findings. It’s unfortunate the phrase is so conspicuous in the paper, appearing in the abstract and graphs, but as the authors note, this error doesn’t detract from the overall message about the preponderance of suspicious phrases. It may even help the thesis since it explains this outlier phrase which is otherwise a counter-example.

It’s also unfortunate that there’s no good way to make corrections to published papers. There’s no 1.1 or 2.0 in the academic publishing world.

