Data extraction challenge

Throughout my quests for raw data, I’ve learned a few techniques for finding data lurking behind the charts. Though many journals now require a “data availability” statement to be present, the statement is not required to be useful, or complete, or even true. Before I digress into that rant, let’s walk through a data extraction exercise.

I’m interested in getting the data behind these box plots in the paper, Association of viral RNAs in the choroid plexus with bipolar disorder and schizophrenia and evidence for the hepatitis C virus involvement in neuropathology, in Nature’s Translational Psychiatry journal.

Why? I’m not sure. Partly because they look odd (are those really negative values on a log scale?) and might make good fodder for my data strips experiments.

First, we check the Data Availability statement. It says, “The sequencing data is available upon request from the Stanley Medical Research Institute.” Not helpful. How does the requester know they’re getting the original data, or whether two requesters are getting the same data? I’m not even sure the Figure 1 data would be considered “sequencing data.”

Given the small number of data points, I could jump straight to my fallback option, which is to use WebPlotDigitizer and click on each point to get its coordinates. However, besides being tedious (I can rarely get the automatic options to work), it risks missing points hidden behind other points.

Inspect the HTML

Sometimes, a web version of an article will contain resolution-independent charts, either using SVG or JavaScript. In those cases, data values can sometimes be extracted. But this time, the images are fixed-resolution image files, as can be verified by zooming in and seeing things get fuzzy.

Check the PDF

Journal papers often have a downloadable PDF version, which might have resolution-independent charts. Sure enough, zoomed-in images in the PDF remain crisp, so there is hope.

I renamed the PDF file to have a “.txt” extension and then opened it in my favorite text editor, BBEdit (not that I’ve tried many other editors). Now we can see the PDF source, which is built on PostScript-style drawing operators, but unfortunately the bulk of the content is compressed.

<</Type/XObject/Length 1716/Filter[/FlateDecode]/ColorSpace/DeviceGray/Height 90/Subtype/Image/Width 1100/BitsPerComponent 1>>stream
x⁄ÌŸÕn€F...QXÄπΩÂ`tÛAˆ˙:9yA=pü†yî“P˜Ê7h((ÄØ=Öª˝œÏí¢®JV`®H^}p˘#πúùŸx?íGÈ/íã‰"πH.íã‰"πHF+YÀùüØﬂÑáÍ˛˛·˚‚L…-û.Ûy˚&<ÍIÍ≥kø˜›O⁄çÎîw°cc@bâB.¡Kïykö7Ò]

PDF to SVG

Rather than trying to find a decoder, I used an online PDF-to-SVG converter, and it worked well, producing one SVG file per page of the PDF.
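For anyone who would rather peek inside the compressed streams directly, here is a minimal Python sketch. It assumes the streams use the FlateDecode filter (which is zlib compression) with no extra predictor or encryption, and the function name inflate_streams is my own invention. It is not a real PDF parser, just a quick way to see whether any stream contains readable drawing commands.

```python
import re
import zlib

# PDF stream data sits between the keywords "stream" and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

def inflate_streams(pdf_bytes: bytes):
    """Yield the decompressed bytes of every stream that zlib can inflate."""
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj() tolerates the end-of-line bytes that
            # precede "endstream" by treating them as unused data.
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            # Not Flate data (or it uses another filter); skip it.
            continue
```

The file has to be read in binary mode, for example with open(path, "rb"), since the compressed bytes do not survive being treated as text. Image streams like the one excerpted above will inflate to raw pixel data rather than anything readable.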

SVG to CSV

SVG is an XML-based representation of Canvas-like drawing commands. PostScript and SVG are a close match, but PostScript doesn’t have a circle drawing primitive operation. Instead, it has paths made of arcs or Bézier curves that can be stroked or filled. I scanned the file for path commands that had curves and a fill color, and found some promising matches. Excerpt:

<path transform="matrix(2.7777777,0,0,-2.7777777,378.90556,394.22795)" d="M 0 0 C 0 .932 .767 1.699 1.699 1.699 C 2.64 1.699 3.406 .932 3.406 0 C 3.406 -.941 2.64 -1.708 1.699 -1.708 C .767 -1.708 0 -.941 0 0 " fill="#326895"/>
<path transform="matrix(2.7777777,0,0,-2.7777777,392.28889,287.21128)" d="M 0 0 C 0 .941 .767 1.708 1.708 1.708 C 2.64 1.708 3.406 .941 3.406 0 C 3.406 -.932 2.64 -1.699 1.708 -1.699 C .767 -1.699 0 -.932 0 0 " fill="#3d7fb5"/>

Conveniently, they all have the same path curve coordinates and use the transformation matrix to specify the location and size. So I only need to extract the offset values of the transformation matrix (and I might as well get the color, too). For that, I used BBEdit’s handy Extract option in Find/Replace with these regular expressions:

Find: path transform="matrix\(-?[\d.]+,0,0,-?[\d.]+,(-?[\d.]+),(-?[\d.]+)\)" d="[^"]+" fill="#(......)"
Replace: \1,\2,\3

That creates a file where all SVG lines matching the Find regex will generate a new line based on the Replace expression, which perfectly fits a CSV file with columns “x,y,color”.

378.90556,394.22795,326895
392.28889,287.21128,3d7fb5
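The same extraction can be sketched in Python for those without BBEdit. This is an approximation of my Find pattern rather than an exact copy: it assumes the path elements look like the excerpt above (scale factors first, then the two offsets, then a six-digit hex fill), and the names PATTERN and extract_points are made up for this example.

```python
import re

NUM = r"-?[\d.]+"

# Skip the two scale factors; capture the x offset, the y offset,
# and the six-digit fill color.
PATTERN = re.compile(
    rf'path transform="matrix\({NUM},0,0,{NUM},({NUM}),({NUM})\)" '
    r'd="[^"]+" fill="#([0-9a-fA-F]{6})"'
)

def extract_points(svg_text: str):
    """Return an (x, y, color) tuple for every matching path element."""
    return [
        (float(x), float(y), color.lower())
        for x, y, color in PATTERN.findall(svg_text)
    ]
```

Like the BBEdit version, this will also pick up any other filled curved paths that happen to share the pattern, so the result still needs a visual check.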

Pixel coordinates to data values

Importing that CSV file into JMP and plotting the x and y columns produces:

I’m reminded that SVG vertical coordinates go from 0 at the top, so if we flip the y-axis and ignore the dots in the upper-right (presumably some false matches for the regex) it starts to look a lot like the original eight pairs of box plot data.

Now I just need to separate out each box plot’s points and scale the y coordinates. I would often make a formula for the former, but this time I just selected each group of points in the graph and used JMP’s Name Selection in Column feature to create the labels. For the y scaling, I assumed the lowest value in each group was -1, and I eyeballed the upper value and scaled the other values in between those with a formula. I also wrote a JSL script to convert the HTML colors into JMP colors.
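I did the scaling with a JMP formula, but the idea is plain linear interpolation and can be sketched in Python. The function names are hypothetical, the upper value is whatever you eyeball from the axis, and the numbers in the usage note are invented for illustration, not taken from the paper. The color helper only splits a hex string into red, green, and blue components; it is not a translation of my JSL script.

```python
def pixel_to_value(y_px, y_px_low, y_px_high, value_low, value_high):
    """Linearly map an SVG y coordinate onto the chart's axis scale."""
    fraction = (y_px - y_px_low) / (y_px_high - y_px_low)
    return value_low + fraction * (value_high - value_low)

def scale_group(ys_px, value_high, value_low=-1.0):
    """Scale one box plot's points, assuming its extremes are known.

    SVG y grows downward, so the lowest data value has the largest
    pixel y and the highest data value has the smallest.
    """
    y_px_low = max(ys_px)
    y_px_high = min(ys_px)
    return [
        pixel_to_value(y, y_px_low, y_px_high, value_low, value_high)
        for y in ys_px
    ]

def hex_to_rgb(hex_color: str):
    """Split an HTML color like '326895' into (red, green, blue), 0-255."""
    hex_color = hex_color.lstrip("#")
    return tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
```

For example, scale_group([400.0, 300.0, 200.0], value_high=3.0) maps the bottom point to -1, the top point to 3, and the middle point halfway between. Any error in the eyeballed upper value carries through to every point in the group.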

Recreation

My recreation looks close to the original. I didn’t preserve the jitter in the x direction or the way each frame had its own y axis range (the ranges are so close, why not use the same scale?). However, some of the box plots are noticeably different.

Box plots from different software will often vary slightly because of the different rules for estimating quartiles, and that’s especially prominent with small sample sizes. With 7 or 8 data points, the ideal upper quartile would leave a quarter of the points above it, which is 1.75 or 2 data points, but the original shows 3 values above the upper quartile. And the upper whisker is not even close to 1.5 times the interquartile range.
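To see how much the quartile rule matters at this sample size, here is a small Python sketch using the two interpolation methods in the standard library’s statistics.quantiles. The data are made-up stand-ins (the integers 1 through 8), and I am not claiming either rule is the one used by JMP or by the authors’ software; upper_box_stats is a name I invented for the example.

```python
from statistics import quantiles

def upper_box_stats(values, method):
    """Return (upper quartile, upper fence, count of points above Q3)."""
    q1, _median, q3 = quantiles(values, n=4, method=method)
    fence = q3 + 1.5 * (q3 - q1)  # the usual limit for the upper whisker
    above_q3 = sum(v > q3 for v in values)
    return q3, fence, above_q3

made_up = list(range(1, 9))  # eight hypothetical values, 1 through 8

print(upper_box_stats(made_up, "exclusive"))  # (6.75, 13.5, 2)
print(upper_box_stats(made_up, "inclusive"))  # (6.25, 11.5, 2)
```

The two rules disagree on the quartile and on the whisker limit, but both leave two of the eight points above the upper quartile, which is why three points above it in the original looks odd.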

I say “7 or 8” data points because there’s something odd about the count, too. The caption and text of the paper say there were 7 HCV individuals, but the chart has 8 dots (only for this gene and one other). You have to look closely at the bottom of the original box plot to see a fifth bottom dot to go with the three upper dots to make eight.

Analysis

After all that, I fear any further exploration will be anticlimactic, but I at least want to see the non-log view. Nowhere does the paper explain the log y axis, so I don’t know if it was done just for the view or if the data is modeled better in log terms. Whatever the reason, it’s still weird to see negative values on the scale. It’s not too uncommon to see log(x+1) as a transformation when data can contain zeros, but the authors seem to have used a different method: log₂ for positive counts and -1 for zero. That works because the original count data are presumably all integers, so the smallest positive count is 1, whose log is 0, and no real value has a negative log.
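My guess at that transformation, and its inverse, can be written as a short Python sketch. The rule itself is an assumption on my part since the paper does not state it, and the -0.5 cutoff for deciding that a noisy digitized value was meant to be -1 is an arbitrary choice for this example.

```python
import math

def to_log_scale(count: int) -> float:
    """Guessed rule: log2 for positive counts, -1 as a stand-in for zero."""
    return math.log2(count) if count > 0 else -1.0

def from_log_scale(value: float, zero_cutoff: float = -0.5) -> float:
    """Undo the guessed rule; anything below the cutoff is treated as zero."""
    if value < zero_cutoff:
        return 0.0
    return 2.0 ** value
```

If the pixel scaling were exact, from_log_scale would return whole numbers for every point, so the distance of each result from the nearest integer is a handy check on the scaling.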

I tried reversing that transformation. My results were not close to integers as I expected, but I remembered I had eyeballed the upper dot locations during the initial conversion of SVG coordinates to axis coordinates (plus, there appears to be some vertical jitter in the dot locations—most noticeable in the bottom values of the upper left original box plot). If it were important to extract the original counts, I could adjust my original pixel scaling parameters until the un-logged values were integers. Skipping that step, here are the “raw” data values.

The last panel still stands out, though I’m not even sure it’s supposed to. There is little mention of the figure in the paper, just “Despite having a small sample size, we discovered robust and highly consistent changes in the expression of 14 genes (Fig. 1) ….” But the figure shows 8 genes instead of 14, and the last one does not show a big difference between the CTL and HCV groups, and it’s not even clear how they measured “highly consistent changes.”

If the log transform is purely for graphical display, a square root transform is another option which supports zero values and has less distortion. For better or worse, these box plot quartiles are based on the untransformed counts, which affects the whisker lengths.

While I’m still not sure if the difference seen for the last gene is important, I did make this view of median counts across groups, with a line for each gene, which highlights the difference in that gene.

All of the “oddities” I discovered are likely explained by my ignorance of translational psychiatry, but they also underscore the need to share source data. For instance, where is the data for the other six genes?