A recent paper, Having Too Little or Too Much Time Is Linked to Lower Subjective Well-Being, received a bit of harsh criticism on Twitter and yet also enjoyed much press attention. I decided to look more closely at the data behind the main shared image.

The main criticisms I saw online were:
- Very low R2 value, 0.003, from Nassim Nicholas Taleb
- Leverage from extreme points, from Levi Bowles
- Data not public, from a source I have since lost (I blame poor Twitter search)
TLDR: Surprisingly, my amateur analysis finds those criticisms not so well-founded, but there is a worse issue: extrapolation beyond the data domain.
What is the graph showing?
The first task is to understand the variables and units. The data for this chart come from the National Study of the Changing Workforce, a survey of working people. The well-being rating on the y-axis represents the four responses: “very dissatisfied,” “somewhat dissatisfied,” “somewhat satisfied,” and “very satisfied.” The rest of the variation in the y position is from display jitter. The discretionary hours values come from the question, “On average, on days when you’re working, about how many hours [minutes] do you spend on your own free-time activities?” It looks like most answers are half-hour multiples with jitter applied.
I’m immediately suspicious of the free time values over 7 or 8 hours since the question is about workdays. Hard to imagine someone having 20 hours of free time on average, or even once, per workday. Higher values are likely miscodings or people who misunderstood the question or maybe people with extremely low work obligations. Regardless of the reasons, those data values aren’t informative for understanding the time/well-being relationship in general.
The good news is that those high values aren’t to blame for the criticized results. The bad news is they still get plotted and cause the x-axis to extend enough to draw attention to the extrapolated curve. Essentially a quadratic curve is being fit to the valid 0-6 hours data domain and then extrapolated into the nonsense domain all the way out to 20 hours of free time per day. Extrapolation bad. Quadratic extrapolation really bad.
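To see why quadratic extrapolation is risky, here is a minimal sketch in Python. It does not use the paper's data: the leveling-off curve `true_rating` and its constants are hypothetical, chosen only to mimic a rating that rises and then flattens over 0-6 hours. The small `polyfit` helper is also just for illustration. The quadratic is fit on the 0-6 range and then evaluated out to 20 hours.

```python
import math

def true_rating(hours):
    # Hypothetical relationship: rises from 3.0 and levels off near 3.3.
    return 3.0 + 0.3 * (1.0 - math.exp(-hours / 2.0))

def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients [c0, c1, ...] via normal equations."""
    n = degree + 1
    # Augmented normal-equation matrix [X'X | X'y].
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]

def predict(coefs, x):
    return sum(c * x ** i for i, c in enumerate(coefs))

hours = [h / 2 for h in range(13)]           # 0, 0.5, ..., 6
ratings = [true_rating(h) for h in hours]    # noise-free, for clarity
coefs = polyfit(hours, ratings, 2)

for h in (0, 3, 6, 10, 20):
    print(f"hours={h:>2}  true={true_rating(h):.2f}  "
          f"quadratic={predict(coefs, h):.2f}")
```

In this sketch the fitted curve tracks the true values inside 0-6 hours but bends downward beyond that range, even though the underlying relationship never decreases.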
Getting the data
The paper ominously states in a footnote that “Data are not publicly available,” which may be a restriction from the data source since I don’t see any data download options at the National Study of the Changing Workforce site. So maybe their hands were tied there, but they could have tried publishing the summary info needed to make the graph. That is, just the count at each of the rating/time combinations.
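The kind of summary I have in mind is small. As a sketch, with made-up `(rating, hours)` records standing in for the non-public survey responses, the counts per combination could be produced like this:

```python
from collections import Counter

# Hypothetical (rating, hours) records; the real responses are not public.
responses = [(3, 2.0), (4, 2.0), (3, 2.0), (2, 0.5), (4, 3.5), (3, 0.5)]

# One count per rating/time combination is all the graph needs.
counts = Counter(responses)
for (rating, hours), n in sorted(counts.items()):
    print(f"rating={rating}  hours={hours}  count={n}")
```

A table like that reveals no individual records beyond what the scatterplot already shows, yet it is enough to redraw the plot and refit the curve.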
But I don’t give up so easily! Sometimes, a reasonable approximation of the data can be gleaned from the image. The image in this PDF looks like vector quality, so I looked inside the PDF to see if it was indeed drawing all the points with PostScript operations. That idea looked promising at first. The sparse areas are drawn that way, but the denser blobs are not. Here’s what I got out from that extraction.

Still not giving up, I moved on to plan C: take a high-resolution screen capture of the graph and have a script measure the darkness of each half-hour cell, adjusting for the line of fit itself. The combined darkness is not exactly proportional to the quantity of points in that cell since the overlaid transparency maxes out after some point and probably doesn’t mix linearly anyway. However, knowing the total count and the fitted curve equation allowed me to create a reasonable approximation. Here’s my simulated data and quadratic fit.

Notice that the curve and points look shifted vertically from the original. The curve is in the same place, but the point jitter in my version is centered around the rating value rather than above it.
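For the curious, the core of the darkness-to-counts step can be sketched as below. This is a simplification, not my actual script: it assumes darkness is proportional to point count, and it skips both the saturation of overlapping transparency and the adjustment for the fit line. The `darkness` grid and `total_points` are hypothetical values.

```python
def estimate_counts(darkness, total):
    """Scale per-cell darkness so the estimated counts sum to `total`."""
    dark_sum = sum(sum(row) for row in darkness)
    return [[total * d / dark_sum for d in row] for row in darkness]

# Hypothetical mean darkness (0 = white, 1 = black) per cell:
# rows are the four ratings, columns are half-hour bins.
darkness = [
    [0.05, 0.10, 0.08],
    [0.10, 0.30, 0.20],
    [0.30, 0.90, 0.60],
    [0.20, 0.70, 0.50],
]
total_points = 1000  # hypothetical; the real total would come from the paper

estimated = estimate_counts(darkness, total_points)
for row in estimated:
    print([round(c) for c in row])
```

Anchoring to the known total at least keeps the overall scale right, even if the densest cells are underestimated.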
Fitting a smoother to the data is always a good starting point for me.

This spline smoother suggests that if you were to trust the extreme values (which you shouldn’t) then there is an upward trend at the right edge, so I don’t think the leverage from those values is the source of the downward curvature as one tweeter suggested.
If we ignore the extreme values by limiting hours to 6, here is a zoomed-in view of the linear and quadratic fits (on our simulated data, of course).

Though there isn’t much difference between the curves, I can understand the appeal of the quadratic fit because it suggests at least a leveling off of the ratings, which supports a concept the authors lead with about the diminishing effect of free time on well-being. That seems reasonable, especially since there is a hard cap on the rating, but I think they took it too far by extrapolating into the negative slope region. There are other curves that more explicitly model the leveling-off constraint without the negative slope region, such as this logistic regression fit. It’s generally flat with a small step between two and three hours.

Those tiny R2 values
The R2 values in the previous fits seem hardly better than the original paper’s. All are in the 0.005 to 0.007 range. By the R2 definition, that means the model explains less than 1% of the data variation, which was the subject of the main criticism of the paper’s findings. However, given that the response is integer data with a few possible values, the variance of the residuals can only go so low. To get a sense, I tried simulating responses that were equal to the quadratic equation plus random noise and rounded to an integer. Depending on the injected randomness, the R2 of the quadratic regression was in the range of 0.001 to 0.010. So maybe 0.003 or 0.007 is not so bad for this kind of data.
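A rough version of that simulation can be sketched in Python. Everything numeric here is an assumption for illustration: the quadratic coefficients, the sample size, the noise levels, and the clamping of responses to the 1-4 scale are hypothetical, not the paper's values or my exact setup.

```python
import random

def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients [c0, c1, ...] via normal equations."""
    n = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]

def r_squared(xs, ys, coefs):
    """R2 = 1 - SS_res / SS_tot for a fitted polynomial."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - sum(c * x ** i for i, c in enumerate(coefs))) ** 2
                 for x, y in zip(xs, ys))
    return 1.0 - ss_res / ss_tot

def simulate_r2(noise_sd, n=5000, seed=1):
    rng = random.Random(seed)
    # Hours as half-hour multiples between 0 and 6.
    xs = [rng.randrange(13) / 2 for _ in range(n)]
    ys = []
    for x in xs:
        mean = 3.0 + 0.1 * x - 0.01 * x * x      # hypothetical quadratic
        y = round(mean + rng.gauss(0, noise_sd))  # integer response
        ys.append(min(4, max(1, y)))              # clamp to the 1-4 scale
    return r_squared(xs, ys, polyfit(xs, ys, 2))

for sd in (0.5, 1.0, 1.5):
    print(f"noise sd={sd}: R2={simulate_r2(sd):.4f}")
```

The point of the exercise is that the responses here are generated from the quadratic itself, so any shortfall in R2 comes from the noise and the coarse integer scale rather than from a wrong model.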
A side note on R2. Adding a term to a linear regression model can never lower the R2 value, just as a fact of the math: at worst, the added term’s coefficient will be 0 and R2 will stay the same. So it’s odd that the paper states, “We also examined whether the quadratic term explained more variance in the model than did the significant linear term alone.” An improved R2 by itself is not sufficient to justify adding the term, though they also add that the improvement was “significant” by some definition.
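The reason follows from the least-squares definition. Here it is sketched for the linear and quadratic models, with generic coefficients b_0, b_1, b_2 that are my notation rather than the paper's:

```latex
\begin{align*}
R^2 &= 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}},
\qquad SS_{\mathrm{tot}} = \sum_i (y_i - \bar{y})^2 \\
SS_{\mathrm{res}}^{\mathrm{quad}}
  &= \min_{b_0, b_1, b_2} \sum_i \left(y_i - b_0 - b_1 x_i - b_2 x_i^2\right)^2 \\
  &\le \min_{b_0, b_1} \sum_i \left(y_i - b_0 - b_1 x_i\right)^2
   = SS_{\mathrm{res}}^{\mathrm{lin}}
\end{align*}
```

Setting b_2 = 0 in the quadratic model recovers the linear one, so the quadratic minimum can be no larger. Since SS_tot is the same for both models, the quadratic R2 is at least the linear R2.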
Dropping the assumption of continuousness
Is it even OK to model those integer responses as continuous? Seems weird, but I suspect the answer is “yes.” I’ve asked survey statisticians about similar cases in the past, and they tell me it’s generally sound to treat Likert data as continuous. It’s not perfect, but close enough.
Nonetheless, I thought I would see what JMP produced when told the response was ordinal instead of continuous. With that, I got a multi-level logistic regression with this image.

The blue lines separate the response regions. For example, the top region gets taller as hours increase, so the top response (labeled on the right axis) is getting more likely. With higher ratings becoming more common, the average rating, not shown, is becoming higher. The curves are so shallow it’s hard to sense any curvature, but I saved the probabilities and computed the predicted average rating, and it’s very similar to the quadratic curve for that region.
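Computing the predicted average rating from the saved probabilities is just a weighted sum: each rating value times its probability. As a sketch, assuming a cumulative-logit (proportional odds) form with hypothetical `thresholds` and `slope` (not the values JMP estimated, and JMP's parameterization may differ):

```python
import math

def rating_probabilities(hours, thresholds, slope):
    """P(rating = 1..4) under a cumulative-logit (proportional odds) model."""
    # P(rating <= k) for k = 1, 2, 3; a positive slope shifts mass upward.
    cumulative = [1.0 / (1.0 + math.exp(-(t - slope * hours)))
                  for t in thresholds]
    cumulative.append(1.0)  # P(rating <= 4) is always 1
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs

def expected_rating(hours, thresholds, slope):
    """Predicted average rating: sum of k * P(rating = k)."""
    probs = rating_probabilities(hours, thresholds, slope)
    return sum(k * p for k, p in enumerate(probs, start=1))

# Hypothetical parameters for illustration only.
thresholds = [-3.0, -1.5, 0.5]
slope = 0.05

for h in (0, 2, 4, 6):
    print(f"hours={h}: expected rating="
          f"{expected_rating(h, thresholds, slope):.3f}")
```

With a small positive slope, the expected rating creeps upward across the 0-6 hour range, which is the same qualitative picture as the shallow regions in the plot.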

Media coverage
While I think the extrapolation is a serious problem, the diminishing returns finding in the 0-6 hours range may have real academic value. Unfortunately, the press latched onto the downward-slope part with the story that too much free time is bad. And I imagine there are many more significant factors affecting well-being. One media article in USNews at least interviewed an outside researcher who offered this too-telling closing comment.
“But,” Maddux said, “if a study gets people to stop and consider what they do with their time, and why they do it, then it’s done its job.”
I really hope the job of a research paper is more than that.
Addendum
The paper discussed four different studies, and this commentary only covers the first one. The second is similar to the first, and the others sound more like group thought experiments that I can’t quite get my head around.