Convert to CSV webapp

Part of my hobby of digging into data shared by research articles is converting the data into a usable format. The most common formats I see are text files such as CSV, and Excel files. I normally use my own tool, JMP, which can already import many formats, including those two.

However, sometimes I come across files using native binary formats of different stat packages: R (RDATA/RDS), Stata (DTA), or SPSS (SAV). For the R files, I can create an R session to convert them to CSV, but it’s infrequent enough that it usually means installing lots of R updates and searching for my old scripts.

I happened to see a TypeScript library for reading R files (rds-js) and decided to incorporate it into a webapp (with Claude Code doing most of the work). It’s called “Comma,Comma” (pronounced “comma comma comma”) and now handles several more formats. I created it for my own use, but I published it via GitHub at https://xangregg.github.io/commacomma/ in case it’s useful to anyone else.

RDS, RData

The rds-js library is written in TypeScript, which is basically an annotated version of JavaScript. I don’t have any direct experience with it, and the language seems simple enough that browsers might support it directly, but they don’t, so a build step is needed. I’m pretty sure GitHub Actions could run that build and let me use the TypeScript directly, but I haven’t learned them. For this app, Claude converted the library to a single JavaScript file that I could include in the webapp from GitHub with no further build steps. (I will occasionally need to manually rebuild that copy of the code.)

The interface of the app is just Open, Preview, Download. Here’s an RDS file from a recent study in the preview state.

CSV, TSV, SSV, WSV

The app can also import text data files with fields separated by commas, tabs, semicolons, or runs of whitespace characters. Even for those simple formats, the app can help by skipping comment lines or just normalizing the CSV. Sometimes a CSV file uses semicolons instead of commas as the separator, especially when it comes from European sources where the comma is commonly used as the decimal mark. Here’s such a file, Gutenberg-sample2b_metadata.csv, from a recent research paper. Initially, the app trusts the file extension.

But clicking the SSV segment triggers a re-parse as semicolon-separated to get the correct fields for download as a “real” CSV file.
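The app leaves the choice of separator to a click, but the same decision could be guessed from the file contents. Here is a minimal sketch of that idea, not the app's actual code: the names `countFields` and `guessDelimiter` are hypothetical, and the heuristic (pick the candidate that gives every sampled line the same field count, preferring more fields) is just one simple assumption.

```typescript
// Hypothetical helpers for illustration; not taken from the Comma,Comma source.
const CANDIDATES: string[] = [",", ";", "\t"];

// Count the fields on one line, ignoring delimiters inside double quotes.
function countFields(line: string, delim: string): number {
  let fields = 1;
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (ch === '"') inQuotes = !inQuotes;
    else if (ch === delim && !inQuotes) fields++;
  }
  return fields;
}

// Choose the delimiter that splits the first few non-blank lines consistently.
function guessDelimiter(text: string, maxLines = 20): string {
  const lines = text
    .split(/\r?\n/)
    .filter((l) => l.trim() !== "")
    .slice(0, maxLines);
  let best = ","; // fall back to comma when nothing scores
  let bestScore = 0;
  for (const delim of CANDIDATES) {
    const counts = lines.map((l) => countFields(l, delim));
    const first = counts.length > 0 ? counts[0] : 1;
    const consistent = counts.every((c) => c === first);
    // Only a consistent split into more than one field earns a score.
    const score = consistent && first > 1 ? first : 0;
    if (score > bestScore) {
      bestScore = score;
      best = delim;
    }
  }
  return best;
}
```

With a file whose rows look like `1;Moby Dick;9,99`, the decimal commas make the comma counts inconsistent with the header row, so the semicolon wins. A guess like this can still be wrong, which is a good reason to keep the manual override.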

Fixed width text fields

Some text files rely on fixed field widths instead of delimiters. Getting the right widths could make for an elaborate UI, but I went for the simplest thing possible: an editable text box of field widths. Fortunately, the app makes a decent initial guess. This import required one width correction and the comment line adjustment.
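Once the widths are known, the splitting itself is simple. The following is a rough sketch under my own assumptions (the function names are hypothetical, and the real app may treat trimming and short lines differently): the text box content is parsed into a list of widths, and each line is sliced accordingly.

```typescript
// Hypothetical helpers for illustration; not taken from the Comma,Comma source.

// Turn the text box content, e.g. "5, 10 3", into a list of positive integer widths.
function parseWidths(input: string): number[] {
  return input
    .split(/[\s,]+/)
    .filter((s) => s !== "")
    .map((s) => Number(s))
    .filter((n) => Number.isInteger(n) && n > 0);
}

// Slice one line into fields of the given widths, trimming the padding.
function splitFixedWidth(line: string, widths: number[]): string[] {
  const fields: string[] = [];
  let start = 0;
  for (const w of widths) {
    // slice() tolerates short lines: missing fields come back as "".
    fields.push(line.slice(start, start + w).trim());
    start += w;
  }
  return fields;
}
```

For example, `splitFixedWidth("AB   Jane Doe  42 ", parseWidths("5, 10 3"))` yields the three fields `AB`, `Jane Doe`, and `42`.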

DTA

Stata files were the trickiest. The specification is public, but there are several versions of it. I was initially able to get Claude to write a parser for the most recent (cleanest) spec, but then discovered that many published data files use older versions. Those older specs allow for tricks to simulate things like long text fields. With some effort, the app now does a decent job with old and recent DTA files, but the quality is at the “entertainment only” level.

SAV

SPSS files also required a custom parser, but the task was more straightforward thanks to the documentation for GNU PSPP, an existing open source alternative to SPSS.

Parquet

I don’t come across Parquet files often, but there is an existing JavaScript library, hyparquet, which Claude was able to encapsulate into the app.

JSON

JSON is a wide-open format and not usually tabular. This is the least refined import option, but I did try to make it support obviously tabular JSON files and explicitly tabular variants like NDJSON (Newline Delimited JSON).
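NDJSON is the easy case because each line is already one record. As a minimal sketch (hypothetical function names, and assuming every line is a flat JSON object), the conversion amounts to collecting the union of keys as the header and quoting cells where CSV requires it:

```typescript
// Hypothetical helpers for illustration; not taken from the Comma,Comma source.
type Row = Record<string, unknown>;

// Parse one JSON object per non-blank line.
function parseNdjson(text: string): Row[] {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as Row);
}

// Column order is first-seen order across all rows.
function collectColumns(rows: Row[]): string[] {
  const columns: string[] = [];
  for (const row of rows) {
    for (const key of Object.keys(row)) {
      if (columns.indexOf(key) === -1) columns.push(key);
    }
  }
  return columns;
}

// Quote a cell only when it contains a comma, quote, or line break.
function csvCell(value: unknown): string {
  if (value === null || value === undefined) return "";
  const s = typeof value === "object" ? JSON.stringify(value) : String(value);
  return /[",\r\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
}

function rowsToCsv(rows: Row[]): string {
  const columns = collectColumns(rows);
  const lines = [columns.map((c) => csvCell(c)).join(",")];
  for (const row of rows) {
    lines.push(columns.map((c) => csvCell(row[c])).join(","));
  }
  return lines.join("\n");
}
```

Nested objects are simply written as JSON text in the cell here, which is one of several reasonable choices for data that is not really tabular.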

Metadata

A separate angle I wanted to explore with this app is the pairing of CSV files with metadata, which seems to be the biggest weakness of CSV compared to proprietary formats. By metadata, I mean things like column data types and other properties that might go in a codebook. Guessing data types usually works but not always. US ZIP code is the famous example of something that looks numeric but isn’t (the leading zeroes need to be kept).
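To make the ZIP code problem concrete, here is a small sketch of a type guesser that treats a leading zero as a sign of an identifier. This is my own illustrative heuristic with hypothetical names, not the rule the app uses.

```typescript
// Hypothetical helper for illustration; not taken from the Comma,Comma source.
type ColumnType = "integer" | "number" | "string";

function guessColumnType(values: string[]): ColumnType {
  const present = values.map((v) => v.trim()).filter((v) => v !== "");
  if (present.length === 0) return "string";
  // A zero followed by another digit (e.g. "02134") suggests an identifier, not a number.
  if (present.some((v) => /^-?0\d/.test(v))) return "string";
  if (present.every((v) => /^-?\d+$/.test(v))) return "integer";
  if (present.every((v) => /^-?(\d+\.?\d*|\.\d+)([eE][-+]?\d+)?$/.test(v))) return "number";
  return "string";
}
```

A column containing `02134` is kept as text, but a column holding only ZIP codes like `90210` and `10001` would still be guessed as integer. That is exactly the kind of case where explicit metadata beats guessing.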

There’s a standard called CSVW (CSV on the Web), and I’ve added a Download Metadata button to the app which creates a CSVW file. For the proprietary formats, the metadata at least includes column data types. Sometimes there is a description for each column. SPSS files sometimes have value labels, which the app provides a few ways of exporting (in the data and/or in the metadata).
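For a sense of what such a metadata file contains, here is a minimal sketch that builds a CSVW description from a list of columns. The `ColumnInfo` shape and the `buildCsvwMetadata` name are hypothetical, and the output covers only a small part of the standard; the file the app actually produces may include more.

```typescript
// Hypothetical types and helper for illustration; not taken from the Comma,Comma source.
interface ColumnInfo {
  name: string;
  datatype: "string" | "integer" | "number" | "boolean" | "date";
  description?: string;
}

function buildCsvwMetadata(csvUrl: string, columns: ColumnInfo[]): string {
  const metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    url: csvUrl,
    tableSchema: {
      columns: columns.map((col) => {
        const out: Record<string, string> = {
          name: col.name,
          titles: col.name,
          datatype: col.datatype,
        };
        // Descriptions are optional; CSVW allows Dublin Core terms for them.
        if (col.description) out["dc:description"] = col.description;
        return out;
      }),
    },
  };
  return JSON.stringify(metadata, null, 2);
}
```

By the standard's convention, the metadata for `data.csv` is saved alongside it as `data.csv-metadata.json`, so a tool that understands CSVW can find it and read a ZIP code column as `string` rather than guessing.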

Here is the Combine option for a SAV file where the data values, such as “Y” and “N”, are combined in each cell with their value labels, such as “Yes” and “No”. Ideally statistical tools would read those labels from the metadata, but until that’s common, this provides a way to access it.
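The combining step can be sketched in a few lines. The function name and the ` = ` separator below are my assumptions for illustration; the app's actual cell format may differ.

```typescript
// Hypothetical helper for illustration; the separator is an assumed format.
type ValueLabels = Record<string, string>;

// Append the value label to the raw value when one exists; otherwise keep the value as is.
function combineWithLabel(value: string, labels: ValueLabels, separator = " = "): string {
  const hasLabel = Object.prototype.hasOwnProperty.call(labels, value);
  return hasLabel ? value + separator + labels[value] : value;
}
```

With labels `{ Y: "Yes", N: "No" }`, a cell containing `Y` becomes `Y = Yes`, while an unlabeled value passes through unchanged.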