These regression charts from the working paper, The New Era of Unconditional Convergence, caught my eye as having a few suspicious traits and warranting further investigation. This post expands on my initial investigation in a Twitter thread.
What’s suspicious about them?
- Each point in the graph is a country, which opens up the question of weighting given the disparate country sizes and possibly the amalgamation paradox.
- The x-axis uses a log scale.
- The y-axis is growth, with no sense of base size.
- The two time intervals are different lengths, 40 years on the left panel and 19 years on the right.
- The confidence intervals don’t look very confident. That is, it looks like the null model, a horizontal line showing no change, might fit within the confidence regions.
- The panels appear to have a different number of points.
- The points don’t appear to follow the lines very well. It looks like a common situation where the right end of the line is anchored to a cluster of points, the middle is unconstrained due to wide variation and the left end is at the whim of relatively few observations.
- For growth, why not show time on the x-axis?
Baseline: reproduce the original
A common first step in my graphical explorations is to reproduce the original graph. That allows me to confirm that I have the right data and understand how the graph is made. (And to exercise my own software, JMP.) Sometimes getting the data is a challenge, but in this case the authors provided most of the data and code. My first attempts didn’t quite match up, and I had to dive into their Stata code to see how the graphs were constructed. It turns out there are a few details not apparent in the original:
- Oil-producing countries were excluded. Presumably they have categorically different growth models.
- Small economies, less than 1M GDP, were excluded. That’s one reason the second panel has more points.
- The second panel actually runs from 1995 to 2019, not 2000 to 2019 as the paper states. I haven’t heard anything from the authors after tweeting that mistake.
Taking those into account, I can make a faithful reproduction of the original:
I don’t know anything about the subject of the paper, economic convergence, but I think it’s trying to show that poorer countries now (second panel) have higher growth rates than wealthier economies, and income levels will eventually converge. However, the paper admits that such convergence will take hundreds of years at its current rate, which is both of little practical consequence and a long time to extrapolate from a 25-year sample.
A mathematical aside: the paper uses log(end/start)/n for growth. I used (end/start)^(1/n)-1. They’re practically the same since log(x) ≈ x-1 for x near 1. I guess the former became an economics convention when logs were easier to compute than fractional powers.
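To see how close the two growth definitions are, here is a minimal Python sketch. The function names and the income figures are my own illustrative assumptions, not values from the paper. The two forms are linked exactly by compound = exp(log growth) - 1, so the compound rate is always slightly larger for positive growth.

```python
import math


def log_growth(start, end, years):
    """Average annual log growth: log(end/start)/n, the form the paper uses."""
    return math.log(end / start) / years


def compound_growth(start, end, years):
    """Compound annual growth rate: (end/start)^(1/n) - 1, the form used here."""
    return (end / start) ** (1 / years) - 1


# Hypothetical income levels and span, chosen only for illustration.
start, end, years = 1000.0, 2000.0, 24

g_log = log_growth(start, end, years)
g_compound = compound_growth(start, end, years)

print(f"log growth:      {g_log:.4%}")
print(f"compound growth: {g_compound:.4%}")
print(f"difference:      {g_compound - g_log:.4%}")
```

For annual rates of a few percent the gap between the two is a small fraction of a percentage point, which is why the choice barely changes the graphs.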
Before exploring other questions, I always like to try replacing the regression fit with a smoother. Of course, you can think of a regression line as an extremely stiff smoother so it’s already a smoother in that sense, but often the stiffness is unwarranted.
From these exploratory fits, I might pursue a different hypothesis based on poor, middle and wealthy segments but the general trend of poorer countries growing faster is the same.
Sticking with the growth-by-income framing but removing the interval differences and log scale, we get:
I don’t know enough about the subject to know if the log scale is appropriate. Combining a log scale on the x axis and a regression line says that a change from 1k to 2k affects growth rates as much as a change from 10k to 20k. Another domain question I decided to stay away from is weighting. In these fits, each country above $1M GDP has equal weighting and all others have zero weight. Seems harsh, especially since it could become sensitive to political changes. For instance, that would make the former Sudan economy count twice as much after South Sudan split off into a separate economy. On the other hand, weighting by economy size or population is very skewed and the fits become heavily influenced by just a few countries.
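The point about the log scale can be checked with a small Python sketch. The intercept and slope below are made-up numbers, not coefficients from the paper; the only thing that matters is that a straight line on a log income axis predicts the same change in growth for every doubling of income.

```python
import math


def predicted_growth(income, intercept, slope):
    """Straight-line fit on a log income axis: growth = intercept + slope * log(income)."""
    return intercept + slope * math.log(income)


# Hypothetical coefficients, chosen only for illustration.
intercept, slope = 0.10, -0.01

# Change in predicted growth when income doubles at a low and at a high level.
change_low = predicted_growth(2_000, intercept, slope) - predicted_growth(1_000, intercept, slope)
change_high = predicted_growth(20_000, intercept, slope) - predicted_growth(10_000, intercept, slope)

print(f"1k -> 2k:   {change_low:+.5f}")
print(f"10k -> 20k: {change_high:+.5f}")
```

Both changes equal slope * log(2), whatever the starting income. Whether that assumption suits country incomes is the domain question I can't answer.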
Time on the x-axis
Getting back to one of my original questions, why use income as the x variable instead of time? Here’s what 20-year growth looks like over time after splitting the countries into terciles by income. I haven’t mentioned it, but I’m using the paper’s treatment of “income” as effective GDP per capita. I don’t know if that’s a standard country-level definition or not.
That’s a lot easier for me to see the change in growth patterns. In the above graph, the countries are placed in income terciles based on their average income levels over the entire time span. That is, the countries in the low group are the same over time. What if we made the grouping dynamic? Here’s what it looks like if we reassign the groups each year based on the current income levels.
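The difference between the two groupings can be sketched in a few lines of Python. This is not the JMP workflow behind the graphs; the country names, years and incomes are invented, and `terciles` is a hypothetical helper that splits by rank.

```python
def terciles(values):
    """Label each key 'low', 'middle' or 'high' by the rank of its value."""
    ranked = sorted(values, key=values.get)
    n = len(ranked)
    names = ("low", "middle", "high")
    return {key: names[min(rank * 3 // n, 2)] for rank, key in enumerate(ranked)}


# Hypothetical incomes per country and year, for illustration only.
income = {
    "A": {2000: 1000, 2010: 6000},
    "B": {2000: 2000, 2010: 2500},
    "C": {2000: 9000, 2010: 10000},
}

# Fixed grouping: one label per country, from its average income over the whole span.
average = {c: sum(by_year.values()) / len(by_year) for c, by_year in income.items()}
fixed = terciles(average)

# Dynamic grouping: labels are reassigned every year from that year's incomes.
years = sorted({y for by_year in income.values() for y in by_year})
dynamic = {y: terciles({c: income[c][y] for c in income}) for y in years}

print("fixed:  ", fixed)
for y in years:
    print(f"dynamic {y}:", dynamic[y])
```

In this toy data, country A is in the low group in 2000 under the dynamic scheme but moves to the middle group by 2010, while the fixed scheme keeps it in the middle group throughout.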
That’s an interesting difference, and it might better indicate how a country’s current income level affects its growth rate. However, sticking with the fixed categories allows us to look at a country’s growth within its group over time, such as in the following spaghetti plot where we look at income levels over time.
Each thin curve represents a country and the thick curves are the group averages. By using a log y axis, we’re still comparing growth rates, which are now represented by the slopes of the curves, and you can see how the low income group has been growing faster in the last 20 years. But we have a little context now since we can get a sense of the actual income levels. To really get the context, we need to lose the log axis, though.
I feel like with each step in this journey, we’ve gotten closer and closer to the raw data, and only now can we start to graphically understand why the “convergence” will take hundreds of years. Countries growing at 3% will eventually catch countries growing at 2% but it’s a slow process when the starting points are orders of magnitude apart.
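The arithmetic behind that last point can be sketched in Python. If the poorer country grows at rate g_fast and the richer one at g_slow, an income ratio R closes after log(R) / log((1 + g_fast) / (1 + g_slow)) years. The 3% and 2% rates are the ones mentioned above; the income ratios of 10 and 100 are my own illustrative assumptions, and constant growth rates are of course a simplification.

```python
import math


def years_to_converge(income_ratio, fast_growth, slow_growth):
    """Years until the faster-growing (poorer) economy catches the slower one.

    income_ratio is richer income divided by poorer income at the start.
    Assumes both growth rates stay constant, which is a strong simplification.
    """
    if fast_growth <= slow_growth:
        raise ValueError("the poorer economy must grow faster to ever catch up")
    return math.log(income_ratio) / math.log((1 + fast_growth) / (1 + slow_growth))


# Hypothetical starting gaps: one and two orders of magnitude.
for ratio in (10, 100):
    years = years_to_converge(ratio, fast_growth=0.03, slow_growth=0.02)
    print(f"income ratio {ratio:>3}: about {years:.0f} years")
```

With a one-point growth advantage, even a tenfold gap takes well over two centuries to close, and each additional order of magnitude adds the same amount again.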