This chart from Nathan Yau’s Flowing Data caught my attention and left me wondering about the confidence interval around the ups and downs of the smooth trend line. The overall percentage also seemed high, partly because it took me a while to figure out what “percentage divorced” referred to.
The data is sourced as “2019 American Community Survey,” which is a product of the US Census Bureau. I quickly ended up at the data.census.gov main portal.
I usually struggle with these web query interfaces when I really just want the raw data to slice on my own locally. Fortunately, after trying ACS in the search fields, the results included an interesting button off to the side called Public Use Microdata that leads to a “beta” site for building custom queries. It still took me a while to figure things out, but the site was very responsive and had nice variable name filtering. In this screen capture, I’m typing “mar” into the variable name field to look for marital info and then checking the variables I want.
I also found income fields, but those presented a challenge. Income is a continuous field and needs to be binned before it can be used to generate counts. The default binning was basically one bin for positive values and one bin for negative values. There is an interface for custom binning, but it seems you have to enter the details for each bin individually. Hopefully I’m missing something there, but I did stumble on a work-around. I noticed that when you make a table from those bins, all the bin details are in the URL. It was a minor mess to deal with the URL escaping, but I was able to write a script to generate the long URL for all 100 bins used in the original chart.
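For anyone who wants to try the same trick, here is a minimal Python sketch of the idea rather than my actual script. The $10k bin width, the `lower:upper` bin syntax, the `INCOME` variable name and the `example.invalid` base URL are all placeholders for illustration; the real query format has to be copied from a URL the site generates for you.

```python
from urllib.parse import quote

# Assumptions for illustration only: $10k-wide bins and a made-up
# "lower:upper" bin syntax. The real Census URL format is not shown here.
BIN_WIDTH = 10_000
N_BINS = 100

def income_bins(width=BIN_WIDTH, count=N_BINS):
    """Return (lower, upper) pairs covering 0 .. width * count."""
    return [(i * width, (i + 1) * width) for i in range(count)]

def build_url(base, variable, bins):
    """Join the bins into one query value and percent-encode it."""
    spec = ",".join(f"{lo}:{hi}" for lo, hi in bins)
    # safe="" forces ":" and "," to be escaped as %3A and %2C
    return f"{base}?{quote(variable)}={quote(spec, safe='')}"

# Hypothetical base URL and variable name
url = build_url("https://example.invalid/table", "INCOME", income_bins())
print(len(income_bins()), "bins,", len(url), "characters")
```

Generating the bins in a loop and escaping them in one pass is what makes 100 bins practical; doing the same by hand in the binning dialog would mean 100 separate entries.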
Most published charts, including the original here, only provide a general indication of the data source, so it’s nice to be able to share the link for my exact data. It did take a few tries to figure out which variables to use. My first attempt used marital status alone, but percent divorced was in the 10%-20% range instead of 30%-40%. Nathan Yau responded on Twitter that he was also counting previous divorces, but I couldn’t find a field in the ACS data set for that, only a field counting marriages. Eventually it dawned on me that having more than one marriage is a good proxy for having been previously divorced. The Widowed category is not trivial, so there is some inaccuracy from that.
With the data in hand, I was almost ready to make a graph. I just had to combine the columns: the number ever divorced is those currently divorced or with more than one marriage, and the number ever married is those with one or more marriages. “Separated” was also a marital status choice, and I counted 75% of those in the ever divorced group, based on a quick internet search.
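The column arithmetic for a single income bin can be sketched in Python as follows. The counts are made up for illustration and the status labels are placeholders, not the ACS codes. One reading of the rule is assumed here: a separated person with more than one marriage counts fully as ever divorced, and the 75% factor applies only to separated people on their first marriage.

```python
# Hypothetical counts keyed by (marital status, number of marriages).
# The numbers are invented for illustration, not taken from the ACS.
counts = {
    ("married", 1): 500, ("married", 2): 120,
    ("divorced", 1): 90, ("divorced", 2): 40,
    ("widowed", 1): 50, ("widowed", 2): 20,
    ("separated", 1): 20, ("separated", 2): 8,
    ("never married", 0): 300,
}
SEPARATED_SHARE = 0.75  # share of separated people counted as ever divorced

def percent_ever_divorced(counts, separated_share=SEPARATED_SHARE):
    """Percent of ever-married people who have ever been divorced."""
    ever_married = 0.0
    ever_divorced = 0.0
    for (status, marriages), n in counts.items():
        if marriages < 1:
            continue  # never married: not part of the denominator
        ever_married += n
        if status == "divorced" or marriages > 1:
            ever_divorced += n  # currently divorced, or remarried (proxy)
        elif status == "separated":
            ever_divorced += separated_share * n
    return 100.0 * ever_divorced / ever_married

print(round(percent_ever_divorced(counts), 1))
```

Note that widowed people with more than one marriage land in the ever divorced group under this proxy, which is the inaccuracy mentioned earlier.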
My version mostly agrees with the Flowing Data original. I weighted the smoother by the count of each bin and added a bootstrap confidence interval to emphasize the greater uncertainty for the high income groups. My first bin is different because the original chart omitted the 0-$10k bin. It’s hard to say if the dip around $600k is real. Here’s a version with a stiffer smoother.
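For readers who want to experiment with a count-weighted smoother and a bootstrap band, here is a rough Python sketch. It is a stand-in, not the method behind my charts: it uses a Gaussian kernel smoother, resamples whole bins with replacement, and runs on synthetic data. The bandwidth plays the role of stiffness, so a larger value gives a stiffer curve.

```python
import math
import random

def weighted_smooth(xs, ys, ws, x0, bandwidth):
    """Gaussian-kernel average of ys at x0, with each bin weighted by its count."""
    num = den = 0.0
    for x, y, w in zip(xs, ys, ws):
        k = w * math.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
        num += k * y
        den += k
    return num / den

def bootstrap_band(xs, ys, ws, grid, bandwidth, reps=200, level=0.95, seed=0):
    """Percentile band from refitting on bins resampled with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    curves = []
    for _ in range(reps):
        idx = rng.choices(range(n), k=n)
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        bw = [ws[i] for i in idx]
        curves.append([weighted_smooth(bx, by, bw, g, bandwidth) for g in grid])
    lo_i = int((1 - level) / 2 * (reps - 1))
    hi_i = int((1 + level) / 2 * (reps - 1))
    lower, upper = [], []
    for j in range(len(grid)):
        col = sorted(c[j] for c in curves)
        lower.append(col[lo_i])
        upper.append(col[hi_i])
    return lower, upper

# Synthetic data only: bin midpoints (assuming $10k bins), invented
# percentages, and invented counts that shrink at higher incomes.
xs = [5_000 + 10_000 * i for i in range(100)]
ys = [35.0 + 3.0 * math.sin(i / 10) for i in range(100)]
ws = [1000.0 / (1 + i) for i in range(100)]
grid = xs[::10]
fit = [weighted_smooth(xs, ys, ws, g, 50_000) for g in grid]
lower, upper = bootstrap_band(xs, ys, ws, grid, 50_000)
```

Resampling bins is only one way to bootstrap here; resampling the underlying counts within each bin is another, and the two can give noticeably different bands where bins are sparse.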
One thing you notice when you actually look at the data is the obvious fact that most people are in the lower income bins, mainly under $100k. Someone suggested sizing the dots by count, and here’s what that looks like:
That makes the upper income peaks and valleys seem even less relevant. Perhaps looking at income quantiles on the x axis would be more useful. Then again, I don’t think the original chart was trying to be too analytical. It came out shortly after the news of the divorce of Bill and Melinda Gates, and was likely playing up the idea of divorces among the wealthy.