I’ve been experimenting with pseudo-log axis scales recently and found this apparent pseudo-log axis in a Science paper, Wildlife trade drives animal-to-human pathogen transmission over 40 years (now paywalled). This chart also has jitter and some non-linear fit – even more interesting!

By pseudo-log, I mean a transform that behaves like a log transform for large values but still handles zero and maybe even negative numbers, which don’t have logs. A common method is to use the inverse hyperbolic sine (asinh, sometimes called the hyperbolic arcsine), which is nearly linear near zero, logarithmic for large magnitudes, and mirrors negative values onto positive ones with the sign flipped.
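To make that behavior concrete, here is a minimal Python sketch of an asinh-based pseudo-log. The function name pseudo_log and its scale parameter are my own illustrative choices, not anything from the paper or from JMP.

```python
import math

def pseudo_log(x, scale=1.0):
    """Hypothetical asinh-based pseudo-log transform.

    Near zero it is roughly x / scale; for large |x| it approaches
    sign(x) * log(2 * |x| / scale). It is defined for zero and negatives.
    """
    return math.asinh(x / scale)

# Compare the transform with a plain log where the log exists.
for x in (-1000, -1, 0, 0.01, 1, 1000):
    plain = math.log(2 * x) if x > 0 else float("nan")
    print(x, pseudo_log(x), plain)
```

The scale parameter sets where the transform switches from roughly linear to roughly logarithmic, so it is worth tuning to the range of the data.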
Though the article is now paywalled, the data is still public at Zenodo. It comes as an R-native RDS file, which I can easily convert to CSV thanks to my comma, comma webapp. There’s also a preprint, but the analysis and graphs are quite different there.
It took a while to get the right filters to go from 6596 species in the data file to 583 species in the graph, but the R code helped. (So nice when papers share their code!) Here’s a first view with a pseudo-log axis and a spline smoother.

Adding jitter and a negative binomial model fit (using JMP Pro’s Generalized Regression) gets closer to the original, but the curve is still a bit steeper.
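For anyone unfamiliar with jitter, the idea is just to add a small random offset so that overlapping integer counts spread out enough to see. This is a rough Python sketch with made-up counts and a hypothetical jitter helper. It is not how JMP implements jitter, only the general idea.

```python
import random

def jitter(values, width=0.2, seed=0):
    """Add a uniform random offset in [-width, width] to each value.

    A fixed seed keeps the scatter reproducible between runs.
    """
    rng = random.Random(seed)
    return [v + rng.uniform(-width, width) for v in values]

# Toy counts, invented for illustration (not from the paper's data).
counts = [0, 0, 0, 1, 1, 2]
jittered = jitter(counts)
print(jittered)
```

The offsets only affect where the points are drawn. Any model fit should still use the original counts.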

That’s for a simple pathogens vs. time model, but the model in the paper is more complex, taking more variables into account, and the paper’s chart curve is just a slice of that multi-dimensional model. At least, that’s the best explanation I can come up with for the discrepancy.
The pseudo-log transform in the paper is simpler than I expected. They use an offset log transform (that is, log(x+1)) and then relabel the axis ticks with the original integers.
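Here is a small Python sketch of that offset-log idea: transform the counts with log(x+1), then place axis ticks at the transformed positions while labeling them with the original integers. The tick values are ones I picked for illustration, and the helper names are hypothetical, so don't read this as the paper's code.

```python
import math

def offset_log(count):
    """log(x + 1): maps 0 to 0 and behaves like log(x) for large x."""
    return math.log1p(count)

def tick_positions(labels):
    """Pair each original-scale label with its position on the transformed axis."""
    return [(label, offset_log(label)) for label in labels]

# Illustrative tick labels on the original count scale (an assumption).
ticks = tick_positions([0, 1, 3, 10, 30, 100])
for label, position in ticks:
    print(label, position)

# math.expm1 inverts the transform, recovering the original value.
print(math.expm1(offset_log(10)))
```

Unlike asinh, this transform only works for values greater than -1, which is fine for counts.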
Censored data
Though I don’t read R code or understand wildlife records well enough to follow all of the modeling steps in the paper, I didn’t see any special mention of the values with time = 40. There are a lot of those, and since the records only go back 40 years, those time = 40 observations really represent species with 40 or more years of human trade. Given the positive trend, the “or more” species would have higher pathogen counts, so the data set’s time = 40 group sits higher than a true 40-year group would. The effect is most obvious in the sharply increasing slope of my original spline fit. If I redo that graph without the time = 40 group, the fit is more linear.
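Separating out the censored group is a simple filter. This Python sketch uses toy (time, pathogen count) pairs that I made up, with a hypothetical split_censored helper. The only thing taken from the data is the 40-year limit on the records.

```python
MAX_RECORD_YEARS = 40  # the records only go back 40 years

def split_censored(records, limit=MAX_RECORD_YEARS):
    """Split (time, count) pairs into fully observed and right-censored groups.

    A time at the limit really means "limit or more", so it is censored.
    """
    observed = [r for r in records if r[0] < limit]
    censored = [r for r in records if r[0] >= limit]
    return observed, censored

# Toy records, invented for illustration (not from the paper's data).
toy = [(5, 0), (12, 1), (25, 2), (40, 6), (40, 9)]
observed, censored = split_censored(toy)
print(observed)
print(censored)
```

Dropping the censored group is the crude option I used for the graph. A model that treats those times as lower bounds would keep the information without distorting the endpoint.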

That endpoint of the curve is closer to the paper’s curve, so maybe they do account for the censoring somehow.
So, while I didn’t capture the details of the model, I’ll still count this as a successful chart reproduction, thanks to the shared data and code.
