I’ve subscribed to Christopher Ingraham’s new Substack blog of graphical data dives called The Why Axis, and the first week hasn’t disappointed. I’m sure to learn a few things and get some ideas to make my own work better. The first data post, Getting our drink on, is about alcohol sales in the US and includes this choropleth map.
The two things that first stood out to me were the irregular color scale and the artificial dot for DC. Non-linear color scales are not that uncommon, but it’s still something to be aware of since it can shift the data focus. I hadn’t seen the DC dot before, but it makes perfect sense since DC would otherwise be too small to see in this map.
I downloaded the data from the NIH site and started exploring. I quickly recalled all the choices that needed to be made in making such a seemingly straightforward map. The choices might be grouped into three main areas: data transformation, map projection and color scale.
Data transformation choices
The first issue I ran into after getting the data was that the map shows “per-capita annual glasses of wine consumed” but the data has gallons of ethanol sold by beverage type. Sales are not quite the same as consumption, especially with people crossing state lines to buy, but it’s close enough under Rumsfeld’s rule of data science:
“You analyze the data that you have, not the data that you want or wish to have at a later time.” (Rumsfeld’s rule of data science)
In some ways, that’s the great strength and the great weakness of much data science.
Ingraham mentions New Hampshire sales being artificially high because of lower alcohol taxes than neighboring Massachusetts, and I imagine DC is artificially high because of its commuter city nature. Can’t explain Vermont, though.
There’s still the issue of converting gallons of ethanol to glasses of wine. I found that standard wine glasses vary from 5 to 6 ounces and ethanol per wine volume varies from 5.5 to 16%. The source data does have pre-calculated gallons of beverage, assuming 12% alcohol for wine, and I used 5 ounces for the wine glass size. The meaning of per-capita is also a choice; the data provided populations for 14-and-over and 21-and-over groups. The latter seems more relevant, but for comparison I followed the map’s use of the former group. Still, my numbers were a little less than the map shows (201 for DC instead of 215), so I made a scaling correction for better comparison of other choices.
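A minimal sketch of that conversion, assuming the figures mentioned above (12% alcohol by volume, a 5-ounce glass) and 128 US fluid ounces per gallon. The function name and the example inputs are hypothetical and are not taken from the NIH data; only the 201 and 215 values for DC come from the comparison above.

```python
OUNCES_PER_GALLON = 128  # US fluid ounces in one US gallon


def glasses_per_capita(ethanol_gallons, population, abv=0.12, glass_oz=5.0):
    """Convert gallons of ethanol sold as wine into glasses of wine per person."""
    wine_gallons = ethanol_gallons / abv                      # ethanol -> beverage volume
    glasses = wine_gallons * OUNCES_PER_GALLON / glass_oz     # beverage volume -> glasses
    return glasses / population


# Hypothetical inputs for illustration only (not real state figures).
example = glasses_per_capita(ethanol_gallons=1_200_000, population=5_000_000)
print(round(example, 1))

# Scaling correction: ratio of the map's DC value (215) to the computed one (201).
scale = 215 / 201
print(round(scale, 3))
```

Multiplying every state's computed value by the same `scale` factor keeps the ranking and relative differences intact while lining the numbers up with the original map.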
Finally, the data set includes 50 years of data. Reasonable choices for a map display are a grand average, a recent average or the most recent data, which is what is shown here.
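The three summary options can be compared side by side with a few lines of Python. This is only a sketch: the `summarize` helper, the length of the "recent" window, and the sample numbers are all hypothetical.

```python
from statistics import mean


def summarize(series, recent_years=5):
    """Summarize a {year: value} series three ways."""
    years = sorted(series)
    return {
        "grand_average": mean(series[y] for y in years),
        "recent_average": mean(series[y] for y in years[-recent_years:]),
        "most_recent": series[years[-1]],
    }


# Made-up values for one state, for illustration only.
sample = {2016: 10, 2017: 20, 2018: 30, 2019: 40}
summary = summarize(sample, recent_years=2)
print(summary)
```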
Map projection choices
The first choice for the map projection is geographical or not. Geographical is the most familiar and is useful for making geographical observations, such as Ingraham’s observation of low wine consumption in the middle of the country. However, it’s famously not so good for other insights because of skewed state sizes and populations. I’m counting the original map as geographical, even though it distorts Hawaii, Alaska and DC. The rest of it looks like a localized Albers projection, which is my usual preference since the areas are true.
For non-geographical maps, yet more choices need to be made: which shapes? equal or proportional area? Here’s an equal-area cartogram based on four-hexagon state shapes, designed by J. Emory Parker.
Color scale choices
Most of the choices for a choropleth lie in setting up the color scale. These include:
- Which color(s)? Greens in the original, perhaps alluding to grapes.
- Sequential or diverging colors? Sequential here; diverging would emphasize variation from the average.
- Continuous gradient or discrete colors? Discrete colors in the original; continuous gradient in my hex map.
- Linear, quantile, or other value mapping? The original uses a five-level quantile mapping, aka quintiles.
Regarding the last choice, I made this strip plot to understand the color mapping of the original.
Each dot is the wine consumption of a single state, and the background shading attempts to match the original map’s coloring. When I first saw the irregular intervals on the map, I thought it was using something like Jenks natural breaks optimization to color clusters, but this scheme actually breaks up the clusters of values around 50 and 70. Now I realize it’s using quintiles, which means each color has about the same number of states assigned to it. That makes the last interval quite wide, which is probably a good thing since we suspect those high values for DC and NH are artificially inflated.
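To see why quintiles produce irregular intervals and a wide top bin, here is a small sketch using Python’s standard library. The ten values are invented to mimic a skewed distribution with one high outlier; they are not the state data, and the choice of the `inclusive` quantile method is an assumption, since the original map’s exact method isn’t stated.

```python
from bisect import bisect_right
from statistics import quantiles

# Invented, skewed values (one high outlier), for illustration only.
values = [20, 30, 40, 50, 60, 70, 80, 90, 100, 200]

# Four cut points split the data into five groups of roughly equal size.
cuts = quantiles(values, n=5, method="inclusive")

# Assign each value to a bin index 0..4 based on the cut points.
bins = [bisect_right(cuts, v) for v in values]
counts = [bins.count(i) for i in range(5)]

# Width of each interval, using the data minimum and maximum as outer edges.
edges = [min(values)] + cuts + [max(values)]
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]

print(cuts)
print(counts)
print(widths)
```

Every bin ends up with the same number of values, but the last interval is several times wider than the others because it has to stretch to reach the outlier.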
I tried to address that skewness a different way by using a continuous scale that’s linear except the darkest greens are stretched out in the color gradient.
I usually prefer a continuous gradient to avoid having small differences look big. For instance, New Mexico and Texas are both around 70 glasses per year, but they fall on different sides of the color break at 70 in the original. Similarly, North Carolina and Virginia are on different sides of the break at 97. The low consumption values for the central states are still apparent, though not as prominent as when they all had the same color.
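One way to build a continuous scale that is linear except for a stretched dark end is a piecewise-linear mapping from value to gradient position. The sketch below is an assumption about how such a scale could work, not the settings used for the hex map: the knee at 100, the 0.8 split, the maximum of 215, and the two RGB endpoints are all hypothetical.

```python
def color_position(value, vmin=0.0, knee=100.0, vmax=215.0, knee_pos=0.8):
    """Map a value to a position in [0, 1] along the gradient.

    Values up to `knee` use the first `knee_pos` of the gradient; the
    remaining darkest colors are stretched over the range knee..vmax.
    """
    value = min(max(value, vmin), vmax)  # clamp to the scale's range
    if value <= knee:
        return knee_pos * (value - vmin) / (knee - vmin)
    return knee_pos + (1 - knee_pos) * (value - knee) / (vmax - knee)


def lerp_color(t, light=(237, 248, 233), dark=(0, 68, 27)):
    """Linearly interpolate between a light and a dark green (RGB)."""
    return tuple(round(a + (b - a) * t) for a, b in zip(light, dark))


# Two nearby values get nearly identical colors instead of jumping a class.
for v in (69, 71, 215):
    print(v, lerp_color(color_position(v)))
```

With a scale like this, two states on either side of 70 differ by only a sliver of the gradient, while the outliers at the top no longer compress everything else into the light end.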
It took some effort to achieve the DC dot effect. I suppose I could have overlaid a one-point scatter plot with DC’s longitude and latitude over a map, but I ended up making a custom shapefile with a circle-shaped polygon for DC. One feature of that approach which is more of a bug is that the dot is preserved on zooming in — reverting to the true shape would probably be better.
I had to adjust the circle to counteract the effects of the projection, and the projected version is still only a rough approximation of a circle. I like the circle, instead of, say, an enlarged DC shape, since the circle is obviously artificial. Definitely something to add to my toolbox.
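For anyone wanting to try the same trick, here is a rough sketch of generating a circle-like polygon in longitude and latitude. It assumes the main distortion to counteract is the shrinking of a degree of longitude with latitude, so it widens the longitude radius by 1/cos(latitude); that is an approximation, not the exact adjustment described above. The function name, radius, and point count are hypothetical, and the coordinates are approximate values for DC.

```python
import math


def circle_polygon(lon, lat, radius_deg=0.5, n_points=36):
    """Return a closed ring of (lon, lat) points approximating a circle.

    The longitude radius is widened by 1/cos(lat) so the shape looks
    roughly round after projection at that latitude.
    """
    lon_radius = radius_deg / math.cos(math.radians(lat))
    ring = []
    for i in range(n_points):
        angle = 2 * math.pi * i / n_points
        ring.append((lon + lon_radius * math.cos(angle),
                     lat + radius_deg * math.sin(angle)))
    ring.append(ring[0])  # repeat the first point to close the ring
    return ring


# Approximate center of Washington, DC.
dc_ring = circle_polygon(-77.04, 38.91)
print(len(dc_ring), dc_ring[0])
```

The resulting ring could then be written out as the DC polygon in a custom shapefile or GeoJSON file in place of the true boundary.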