Pseudo-log axis in the wildlife

I’ve been experimenting with pseudo-log axis scales recently and found this apparent pseudo-log axis in a Science paper, Wildlife trade drives animal-to-human pathogen transmission over 40 years (now paywalled). This chart also has jitter and some non-linear fit – even more interesting!

Original chart from the cited paper, with number of pathogens on the y axis and time in trade on the x axis. There's one dot for each of 583 species and a model curve that slopes upward.

By pseudo-log, I mean a transform that behaves like a log transform for large values but still handles zero and maybe even negative numbers, neither of which has a real logarithm. A common method is to use the inverse hyperbolic sine (asinh) function, which is nearly linear near zero and is an odd function, so positive and negative values are treated symmetrically.
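To make that behavior concrete, here is a minimal Python sketch of an asinh-based pseudo-log transform. It is only an illustration, not code from the paper; the function names and the `scale` parameter (which sets the width of the near-linear region) are my own assumptions.

```python
import math

def pseudo_log(x, scale=1.0):
    """asinh-based pseudo-log: roughly linear near zero, roughly log for large |x|.

    `scale` is a hypothetical tuning parameter for this sketch.
    """
    return math.asinh(x / scale)

def pseudo_log_inverse(y, scale=1.0):
    """Map a transformed value back to the original units."""
    return scale * math.sinh(y)

# Zero and negative inputs are fine, unlike with a plain log.
for x in [-100, -1, 0, 0.01, 1, 100, 10000]:
    print(f"x = {x:>8}  pseudo_log = {pseudo_log(x):.4f}")

# For large x, asinh(x) approaches ln(2x), so it tracks a log curve
# up to a constant offset of ln(2).
big = 1e6
print(pseudo_log(big) - math.log(2 * big))
```

The inverse function is what an axis would use to turn evenly spaced positions back into readable tick labels.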

Though the article is now paywalled, the data is still public at Zenodo. The data is in an R-native RDS file, which I can easily convert to CSV thanks to my comma, comma webapp. There’s also a preprint, but the analysis and graphs are quite different there.

It took a while to get the right filters to go from 6596 species in the data file to 583 species in the graph, but the R code helped. (So nice when papers share their code!) Here’s a first view with a pseudo-log axis and a spline smoother.

Adding jitter and a negative binomial model fit (using JMP Pro’s Generalized Regression) gets closer to the original, but the curve is still a bit steeper.

That’s for a simple pathogens vs. time model, but the model in the paper is more complex, taking more variables into account, and the paper’s chart curve is just a slice of that multi-dimensional model. At least, that’s the best explanation I can come up with for the discrepancy.

The pseudo-log transform in the paper is simpler than I expected. They use an offset log transform (that is, log(x+1)) and then relabel the axis ticks with the original integers.
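As a rough sketch of that idea in Python (not the paper's R code), the positions come from log(x+1) while the labels stay as the original counts. The tick values below are hypothetical, chosen only for illustration.

```python
import math

def offset_log(count):
    """Offset log transform, log(x + 1), so a count of zero maps to zero."""
    return math.log1p(count)

def tick_positions(labels):
    """Pair each original-unit label with its position on the transformed axis."""
    return [(label, offset_log(label)) for label in labels]

# Hypothetical tick labels in original pathogen-count units.
ticks = tick_positions([0, 1, 3, 10, 30, 100])
for label, position in ticks:
    print(f"label {label:>3} drawn at position {position:.3f}")
```

The reader sees familiar integers on the axis, while the spacing between them shrinks as the counts grow.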

Censored data

Though I don’t read R code or understand wildlife records well enough to follow all of the modeling steps in the paper, I didn’t see any special mention of the values with time = 40. There are a lot of those, and since the records only go back 40 years, those time = 40 observations really represent species with 40 or more years of human trade. Given the positive trend, the “or more” species would have higher pathogen counts, so the data set’s time = 40 group sits higher than it would if it held only species traded for exactly 40 years. The effect is most obvious in the sharply increasing slope of my original spline fit. If I redo that graph without the time = 40 group, the fit is more linear.

That endpoint of the curve is closer to the paper’s curve, so maybe they do account for the censoring somehow.
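The censoring argument can be checked with a toy example in Python. Everything here is made up for illustration: the species durations and the linear trend are assumptions, not values from the paper's data.

```python
from statistics import mean

CENSOR_LIMIT = 40  # the records only go back 40 years

def expected_pathogens(true_years):
    """Hypothetical upward trend: one more pathogen per 10 years in trade."""
    return true_years / 10

# Hypothetical true durations; the last three exceed the 40-year record window.
true_years = [10, 20, 30, 40, 50, 60, 70]

# What the data set records: time is capped at 40, but the pathogen count
# still reflects the true, longer duration.
recorded = [(min(t, CENSOR_LIMIT), expected_pathogens(t)) for t in true_years]

group_40 = [count for years, count in recorded if years == CENSOR_LIMIT]
print("mean count in recorded time = 40 group:", mean(group_40))
print("count expected at exactly 40 years:   ", expected_pathogens(40))
```

In this toy setup the recorded time = 40 group averages higher than the trend predicts at exactly 40 years, which is the same upward pull I suspect at the right edge of the spline fit.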

So, while I didn’t capture the details of the model, I’ll still count this as a successful chart reproduction, thanks to the shared data and code.

