From radar charts to curve fitting and back

What started as an investigation into a few radar area charts and whether they were doing any real analytical work turned into a data adventure with a few new insights for me. The original view is from a Nature article, Designing allosteric modulators to change GPCR G protein subtype selectivity. I have no understanding of the content, but I wanted to try to determine the utility of these radar area charts from a data visualization perspective. At first scan, all I could see was Pac-Man, Germany, and Excalibur.

Radar charts and heatmaps summarizing relative efficacy and log EC₅₀ across sensors for three compounds.
—
The figure is arranged in three groups corresponding to compounds NT, PD149163, and SBI-553. For each compound, a filled radar chart displays relative efficacy across sensors arranged radially, with a scale from 0 to 1.0. To the right of each radar chart, a vertical heatmap shows log EC₅₀ values in molar units for the same sensors, with color intensity indicating magnitude; some entries are blank or marked as missing. NT displays a nearly circular radar shape with values close to 1.0 across most sensors, while PD149163 and SBI-553 show increasingly irregular shapes with more zero or near-zero values. Relative efficacy is encoded only in the radar charts, while log EC₅₀ is encoded only in the heatmaps.

Getting the data

This article comes with several data files, and it even has a Source Data link paired with this figure, which is usually the best possible scenario. However, I looked through that file and many of the supplementary data files and couldn’t find anything that reasonably matched the quantities represented in the radar charts. Finally, upon reading the figure caption a little closer, I realized the linked data was for Figure 1b, and that the radar charts in 1c used data derived from the fitted curves in 1b.

Small-multiple dose–response plots showing transducer activation versus compound concentration across sensors.
—
The figure consists of a grid of small-multiple panels, one for each sensor (including Gq, G11, G15, Gi subtypes, Go subtypes, Gz, Gg, G12, G13, GsS, β-arrestin 1, and β-arrestin 2). Each panel plots transducer activation (y-axis, labeled “± Δ Net BRET,” ranging from 0 to about 0.6) against log compound concentration in molar units (x-axis, from approximately −12 to −4). Four compounds are shown using distinct colors and markers: NT (blue), PD149163 (green), SBI-553 (magenta), and SR142948A (orange). Points represent measured responses at each concentration, with smooth sigmoidal curves indicating fitted dose–response relationships. NT generally shows earlier and stronger activation across many sensors, PD149163 activates at higher concentrations, SBI-553 shows weaker or delayed responses in several panels, and SR142948A remains near zero across most sensors.

So now I have a subgoal: recreate Figure 1b from the data. The provided data was in an Excel file containing 15 sheets, and each sheet had four subtables arranged in a 2-by-2 layout.

Screenshot of a semi-structured Excel worksheet containing multiple dose–response subtables.
—
The image shows a spreadsheet worksheet containing several rectangular subtables arranged in a grid. Each subtable is labeled by compound and sensor, with headers such as “Log [NT], M” or “Log [PD149163], M.” Within each subtable, the first column lists compound concentrations on a logarithmic scale, and subsequent columns contain numeric response values labeled by date. Dates repeat across columns, indicating measurements taken on different experimental days. The subtables vary in size and are separated by blank rows and columns, forming a regular but non-tidy layout that requires custom parsing to extract into a structured dataset.

Since the subtables were not the same size from sheet to sheet, I couldn’t think of a way to extract the data into a tidy representation using standard data import/reshape steps. Still, the structure is fairly regular, so I started to write a script to identify the subtables and extract the data.

But before I got too far, I had a better idea: see if ChatGPT could help me extract the data. And it could! After I described the structure I had already deduced, it almost succeeded on the first try. It just missed the 2×2 layout, but it handled that after a reminder. I then had it make a couple more refinements to normalize the date fields. Nicely, it provided the original Excel row/column coordinates, so I could trace each value if needed. I only spot-checked a few values since the graphs would be my real quality check.
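
Out of curiosity, here’s roughly the shape that extraction logic takes. This is my own Python reconstruction, not ChatGPT’s actual script; the anchor-cell convention and all names are assumptions based on the worksheet structure described above.

```python
import pandas as pd

# Hypothetical reconstruction: scan every sheet for subtable anchor cells of
# the form "Log [<compound>], M", then walk down and right from each anchor
# to collect one tidy row per (concentration, day) cell.
def extract_subtables(xlsx_path):
    rows = []
    for sheet, df in pd.read_excel(xlsx_path, sheet_name=None, header=None).items():
        anchors = [(r, c)
                   for r in range(df.shape[0]) for c in range(df.shape[1])
                   if isinstance(df.iat[r, c], str) and df.iat[r, c].startswith("Log [")]
        for r0, c0 in anchors:
            compound = df.iat[r0, c0].split("[")[1].split("]")[0]
            r = r0 + 1
            # Walk down the concentration column until a blank cell ends the subtable.
            while r < df.shape[0] and pd.notna(df.iat[r, c0]):
                c = c0 + 1
                # Walk right across the dated replicate columns.
                while c < df.shape[1] and pd.notna(df.iat[r0, c]):
                    rows.append(dict(sheet=sheet, compound=compound,
                                     log_conc=df.iat[r, c0], day=df.iat[r0, c],
                                     response=df.iat[r, c],
                                     # keep Excel coordinates for traceability
                                     excel_row=r + 1, excel_col=c + 1))
                    c += 1
                r += 1
    return pd.DataFrame(rows)
```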

Making graphs

Here’s a quick mock-up of Figure 1b using a constrained p-spline in lieu of a fitted logistic curve, a visual stand-in before worrying about the exact functional form. The other noticeable difference is that my version shows all the data values while the original shows only one point per X per overlay (the mean).

Recreated dose–response panels using constrained spline fits and showing all individual data points.
—
The figure reproduces the small-multiple layout of the original dose–response plots, with one panel per sensor plotting activation versus compound concentration on a logarithmic x-axis. All individual measurements are shown as points for four compounds—NT (blue), PD149163 (green), SBI-553 (magenta), and SR142948A (orange)—rather than summary points. Smooth curves are constrained spline fits that are forced to be non-descending with a minimum of zero. Overall response patterns resemble those in the original figure, while the inclusion of all points reveals variability and clustering not visible when only means are shown.
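
The mock-up itself is JMP output, but the constrained-fit idea is easy to approximate elsewhere. Here’s a minimal Python sketch under the same constraints, swapping in isotonic regression for the p-spline: it enforces a non-descending fit with a floor of zero, and a shape-preserving interpolator smooths the result (all names here are mine).

```python
import numpy as np
from scipy.interpolate import PchipInterpolator
from sklearn.isotonic import IsotonicRegression

def monotone_curve(log_conc, response):
    """Non-descending, non-negative stand-in curve for dose-response data."""
    order = np.argsort(log_conc)
    x = np.asarray(log_conc, dtype=float)[order]
    y = np.asarray(response, dtype=float)[order]
    # Isotonic regression: monotone non-decreasing fit with a floor of 0.
    y_iso = IsotonicRegression(y_min=0.0, increasing=True).fit_transform(x, y)
    # One fitted value per distinct concentration; PCHIP interpolation is
    # shape-preserving, so the smoothed curve stays monotone too.
    xs, idx = np.unique(x, return_index=True)
    return PchipInterpolator(xs, y_iso[idx])
```

Evaluating the returned interpolator on a fine grid of log concentrations gives the smooth overlay curve.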

Showing all points makes within-X variability visible and reveals other patterns in the data. For example, I noticed the blue dots in the center panel (sensor protein GoA) look separated into two clumps for each X, with some relatively far above the fitted curve. Zooming into that panel and selecting those points shows they’re all tagged with the same day value.

Selected points in the GoA dose–response plot linked to their corresponding rows in the data table.
—
The image shows the GoA sensor panel with activation plotted against compound concentration and a smooth fitted curve. Several points at higher concentrations lie above the curve and are selected, appearing darker than surrounding points. To the right, a tabular view of the underlying data shows corresponding rows highlighted. The highlighted rows share the same compound and repeated day values, indicating that the clustered points originate from measurements taken on the same experimental day rather than independent replicates.

Hmmm, I don’t know if that’s anything to worry about, but it does make me realize the responses have some degree of serial correlation. Checking the paper’s Methods section confirms that each sequence of concentration readings was taken from the same sample at different points in time as the concentration increased. That might suggest some sort of fixed effects model, and maybe that’s part of the lengthy explanation in the supplementary material.

Logistic fits

The paper’s main text mentions a three-parameter sigmoidal curve fit with unconstrained lower and upper asymptotes, which I didn’t understand at first because I was thinking in terms of logistic curves, which would need four parameters in that situation. Later I noticed the details in the supplementary materials and realized they used a different form (the Hill sigmoidal function) with many additional knobs beyond that. Furthermore, it sounds like they did fit four parameters for the basic Hill formulation, but one was shared by all curves. They also made a few special modifications for cases where the fit was poor, reverting to linear models or clipping parameters.

Before knowing all that, I proceeded with three-parameter logistic fits (fixing the lower asymptote to 0), which come close enough to the original chart for my purposes.
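
My fits were done in JMP, but the model is simple enough to sketch with scipy as an equivalent (the x and y arrays and starting values here are assumptions):

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic3(x, top, inflection, growth):
    """Three-parameter logistic with the lower asymptote fixed at 0.

    x is log10 molar concentration, so `inflection` is directly log(EC50).
    """
    return top / (1.0 + np.exp(-growth * (x - inflection)))

# x, y: numpy arrays of log10 concentrations and responses for one
# compound/sensor pair; p0 gives the optimizer a plausible starting point.
params, cov = curve_fit(logistic3, x, y, p0=[y.max(), np.median(x), 1.0])
top, log_ec50, growth = params
```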

Dose–response panels refit with three-parameter logistic curves and mean ± standard error.
—
The figure shows a grid of dose–response panels with activation plotted against compound concentration on a logarithmic scale. Points represent mean activation at each concentration with vertical error bars indicating one standard error of the mean. Smooth curves show fitted three-parameter logistic models with the lower asymptote fixed at zero. In several panels, particularly for SBI-553, the fitted curves rise very steeply or extrapolate beyond the observed data range, indicating unstable fits when the data do not clearly define a full sigmoidal shape.

My curves extend beyond the data values, which does help show which curves never flatten out, and the poor fits like GsS are too close to the origin to discern much anyway.

I don’t know enough about the science to understand why they wouldn’t fix the lower asymptote at zero. Perhaps they wanted to minimize assumptions that would look like data tinkering.

I did have a hiccup with one of my logistic fits. My initial three-parameter logistic fit for the pink curve in G13 looked like this:

Three-parameter logistic fit with an implausible flat plateau far from the data range.
—
The plot shows activation versus log₁₀ concentration for a single compound and sensor. Observed data points cluster at higher concentrations, while the fitted three-parameter logistic curve rises sharply at extremely low concentrations and then remains nearly flat across the region where data are observed. The curve’s inflection and plateau occur far outside the data-supported range, producing a fit that does not track the observed response pattern and illustrating a structural failure of the model under these constraints.

I thought I had found a bug in JMP (the tool I help develop), but ChatGPT assures me that this is a well-known failure that can happen when the data never reaches a clear inflection point (appears exponential), calling it a “structural identifiability / leverage problem”. My simple fix was to pre-summarize the data before fitting, which stabilizes the fit but obviously discards some information. A better fix would be to use the Hill sigmoidal function (which JMP calls Logistic 4P Hill), as the paper does, since it isn’t prone to this failure. However, there’s no convenient three-parameter version, and what I have seems good enough to move on to the radar chart step of the journey.
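
For reference, here are the two fixes as Python sketches. The pre-summarizing is a one-liner (column names assumed), and the Hill-style form is shown in its standard four-parameter version; the paper layers further constraints on top of it that I’m not reproducing.

```python
# Fix 1 (what I did): fit to per-concentration means instead of raw replicates.
means = df.groupby("log_conc", as_index=False)["response"].mean()

# Fix 2 (what the paper did): the Hill sigmoidal function, which JMP exposes
# as Logistic 4P Hill. conc is linear molar concentration, not log10.
def hill4(conc, bottom, top, ec50, hill):
    return bottom + (top - bottom) / (1.0 + (ec50 / conc) ** hill)
```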

Radar data

Getting back to the radar charts, here is a bigger view of the original.

Enlarged excerpt of the original radar and heatmap summary for NT and PD149163.
—
The figure shows a subset of the original radar-and-heatmap summary, displaying only NT and PD149163 at a larger scale. For each compound, a filled radar chart shows relative efficacy across sensors arranged radially, and a vertical heatmap shows log EC₅₀ values for the same sensors. NT exhibits a nearly uniform radar shape with values close to 1.0, while PD149163 shows a more uneven profile. The encodings match the original figure but are enlarged to aid visual inspection.

Oddly, for each compound (NT and PD149163) there are two variables shown, relative efficacy and log(EC50), but one is shown with a radar chart and the other with a heatmap. I’m not sure why they couldn’t use a heatmap for both (even with common color scales). I guess they can’t use radars for both because of the negative values.

Efficacy seems to correspond to the top of each curve, with some clipping for those curves that don’t level off. Log(EC50) is the X value of the inflection point of the fitted curve. Relative efficacy is the efficacy divided by the corresponding efficacy for compound NT. Which means the NT relative efficacy values are all 1.0(!), except when 0, making the radar chart pretty vacuous.
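
Once the fitted tops are in a table, that normalization is a one-liner. A sketch, assuming a hypothetical fits table with sensor, compound, and efficacy columns:

```python
# Pivot fitted efficacy (curve tops) to sensors x compounds, then divide
# every column by the NT column. NT itself becomes all 1.0 by construction
# (or NaN where 0/0 occurs for empty sensors).
tops = fits.pivot(index="sensor", columns="compound", values="efficacy")
rel_eff = tops.div(tops["NT"], axis=0)
```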

JMP is intentionally not so great at round charts in general and doesn’t have a radar area chart built in. The closest I could easily get is these coxcomb charts.

Coxcomb (polar bar) charts showing relative efficacy by sensor for each compound.
—
The figure contains four coxcomb charts, one per compound, with sensors arranged radially. Each sensor is represented by an independent wedge of equal angular width, and wedge length encodes relative efficacy. Unlike filled radar charts, wedges do not connect to neighboring sensors. NT forms an almost complete circle of similar-length wedges, PD149163 shows greater variation, SBI-553 displays a sparse pattern with several isolated wedges, and SR142948A shows only two non-zero wedges, making sparsity explicit.

I do prefer them to the filled radar charts since each wedge stands on its own and doesn’t rely on a connection with its neighbor. Pure speculation, but I wonder if that’s why they didn’t show the radar chart for the last compound in the paper. With no non-zero neighboring values, the two non-zero sensors would have no area to fill.
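
My coxcombs came from JMP, but since a coxcomb is just a bar chart on polar axes, a matplotlib version takes only a few lines (a sketch; all names are mine):

```python
import numpy as np
import matplotlib.pyplot as plt

def coxcomb(sensors, values):
    """Equal-width wedges on polar axes; wedge length encodes the value."""
    ax = plt.subplot(projection="polar")
    theta = np.linspace(0.0, 2.0 * np.pi, len(sensors), endpoint=False)
    ax.bar(theta, values, width=2.0 * np.pi / len(sensors),
           alpha=0.6, edgecolor="white")
    ax.set_xticks(theta)
    ax.set_xticklabels(sensors)
    ax.set_ylim(0.0, 1.0)
    return ax
```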

Here’s what a heatmap looks like for relative efficacy. There is still the pointless all-1.0 column for NT (I removed the empty GsL).

Heatmap showing relative efficacy by sensor and compound.
—
The heatmap displays sensors as rows and compounds as columns, with cell color encoding relative efficacy on a light-to-dark blue scale. The NT column is uniformly dark due to normalization, indicating values near 1.0 across sensors. PD149163 shows generally high but variable values, SBI-553 shows lower values for many sensors, and SR142948A is near zero across most sensors. The heatmap allows direct comparison across compounds using a common scale and makes zero values explicit.

Keeping with the filled area theme of the original, here’s a non-circular version of that.

Paired linear area charts comparing absolute and relative efficacy across sensors.
—
The figure shows two side-by-side horizontal area charts with sensors listed vertically. The left panel displays absolute efficacy, and the right panel displays relative efficacy. Four semi-transparent filled shapes represent the compounds NT, PD149163, SBI-553, and SR142948A. In the absolute-efficacy panel, NT and PD149163 show larger areas across many sensors, while SR142948A remains near zero. In the relative-efficacy panel, NT forms a near-vertical boundary at approximately 1.0 due to normalization, while the other compounds vary more substantially. The linear layout avoids radial geometry while preserving an area-based visual metaphor.

Two versions, actually: one without the normalization and one normalized relative to NT.

Learnings

I don’t have enough domain context to assess whether these visualizations communicate an appropriate message, but the exploration process still yielded useful insights.

  • ChatGPT can extract tidy data from semi-structured Excel spreadsheets.
  • Certain constrained nonlinear models can exhibit structural identifiability problems. For the three-parameter logistic, the most common problem can be avoided by using Hill versions of the formula.
  • Radar area charts will fail if a non-zero value is surrounded by zeros. (I imagine even near-zeros would be problematic.)
  • (not mentioned above) My trick of using Unicode superscript digits for scientific notation doesn’t work so well for fonts that have different vertical positions for those characters. (See the sketch below the image.)
Close-up of logarithmic x-axis labels showing misaligned superscript exponents.
—
The image shows a close-up of a logarithmic x-axis labeled with powers of ten using Unicode superscript notation (for example 10⁻¹², 10⁻¹⁰, 10⁻⁸, 10⁻⁶, and 10⁻⁴). The superscript exponent characters are vertically misaligned across labels, with some exponents sitting higher or lower relative to the baseline. This makes the labels appear uneven despite correct numeric values, illustrating a typographic limitation of Unicode superscripts in certain fonts.
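
The trick itself is trivial to implement; the misalignment only shows up at render time in certain fonts. A minimal sketch:

```python
# Map exponent characters to their Unicode superscript counterparts.
SUPERSCRIPTS = str.maketrans("0123456789-", "⁰¹²³⁴⁵⁶⁷⁸⁹⁻")

def pow10_label(exponent):
    return "10" + str(exponent).translate(SUPERSCRIPTS)

print([pow10_label(e) for e in range(-12, -3, 2)])
# ['10⁻¹²', '10⁻¹⁰', '10⁻⁸', '10⁻⁶', '10⁻⁴']
```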
