Visualization study web app

In academia, data visualization has been tightly coupled to computer science because creating new visualization types has generally required custom programming. Put another way, programming ability has acted as a gatekeeper for data visualization research. I wonder if AI coding agents will open up the field to non-programmers; however, having data visualization already embedded in computer science departments may still be a hurdle. Nonetheless, we non-academics can benefit.

While I’ve been a programmer my entire career, I’ve focused on C++ desktop software and have little experience in the web app universe, which is what an online data visualization study needs. Enter Claude Code, and I can build a web app with a questionnaire, training slides, visual perception trials, and a database for storing results. At the current stage, I still have to keep an eye on each code change, and I found a few errors to correct and inefficiencies to avoid, so maybe my initial musing about non-programmers isn’t literally feasible, yet.

Screenshot of first page of the web app showing 4 pairs of univariate charts.

The project is a paired visualization study web app on GitHub. It shows pairs of univariate charts and asks participants to rate how surprising it would be to get these charts from two samples from the same underlying distribution.

Origins

The theme is to assess how effective different chart types are for exploratory data analysis, where a common question is “is this anything?” The easier it is to detect real differences, the more effective the analyst will be. My original incarnation of the app used lineup displays, introduced in Graphical Inference for Infovis, which use a grid of 20 graphs to conceal one real (non-random) graph. However, early (self) testing showed it was taxing for a participant to scan a large number of images over many trials. Even after whittling it down, a grid of six pairs was taxing, especially for hard-to-find differences.

A panel of six pairs of box plots. One pair (to be identified by the user) is mismatched.

Pairs

That led me to just one pair, but then the framing had to change. Deciding on the question may have been the hardest part, especially given a target of non-technical participants.

An example trial pair of the study showing two density band charts and four response buttons: Not, Slightly, Quite, and Very Surprising.

I started with a “how much evidence” framing but settled on the current surprise framing as less technical. The right framing may be an eternally open question, but I do want to avoid visual-only framing along the lines of assessing chart differences or suggesting a kind of difference to look for.

Chart types

The app has four basic chart types: violin, bands, box plot, and dots, as shown in the top image, with several variations of each, such as with or without overlaid dots. Each chart type has a “How to read” training page.

A training slide labeled "How to read: Box plot" and an image of a pair of box plots and some descriptive text.

It’s not much, but it’s something, and I don’t expect participants to be willing to do a lot of training. Chart variations can be mixed for the same participants as long as the variants share the same training descriptions. The bands chart variants in particular don’t mix because the band cut-offs differ between them.

Treatment effects

For each pair, one sample has one of four treatment effects applied at varying magnitudes: location, spread, skew, and bimodal separation. Some pairs have no difference other than the natural randomness of sampling (that is, the effect magnitude is zero).
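To make the setup concrete, here is a minimal sketch of how such treated samples could be drawn. The parameterizations (standard-normal baseline, exponential mixing for skew, symmetric mode splitting for bimodality) are my own illustrative guesses, not the app’s actual settings.

```python
import random

def treated_sample(effect, magnitude, n=50, rng=None):
    """Draw one sample with a treatment effect of the given magnitude.
    A magnitude of 0 reduces every effect to the plain baseline.
    These parameterizations are illustrative, not the app's actual ones."""
    rng = rng or random.Random()
    if effect == "location":   # shift the mean
        return [rng.gauss(magnitude, 1) for _ in range(n)]
    if effect == "spread":     # widen the standard deviation
        return [rng.gauss(0, 1 + magnitude) for _ in range(n)]
    if effect == "skew":       # mix in an exponential tail
        return [rng.gauss(0, 1) + magnitude * rng.expovariate(1)
                for _ in range(n)]
    if effect == "bimodal":    # split into two modes 2*magnitude apart
        return [rng.gauss(rng.choice([-magnitude, magnitude]), 1)
                for _ in range(n)]
    raise ValueError(f"unknown effect: {effect}")

# One trial pair: an untreated baseline vs. a location-shifted sample.
baseline = treated_sample("location", 0.0, rng=random.Random(0))
shifted = treated_sample("location", 1.5, rng=random.Random(1))
```

With this structure, a zero-magnitude draw of any effect is statistically identical to the baseline, which matches the “no difference other than sampling randomness” trials.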

Feedback

The last page of the survey has a comment field, and about 20% of respondents left some sort of comment. Many were compliments describing the study as “nice, cool, excellent, great, interesting, …”, which helps keep me going. It was a feedback comment that suggested adding a score at the end. Though there really isn’t a truly correct response, I was able to make an “alignment score” reflecting how closely the responses aligned with the treatment effect magnitudes.

Online participants

After initial test runs from social media volunteers, I decided to go ahead and spring for a round of paid testers on Prolific. I got 100 participants for $325, and it took about an hour. Not bad, but I still need to assess the quality.

Early results

The work so far can be considered at best a pilot study, without a clear analysis plan. One idea for analysis is to use some statistical measure of difference and compare participant ratings against it. So far I’ve been computing a Kolmogorov-Smirnov distance and a p-value-based score. Here’s rating vs. K-S distance, with one line per chart type/variant, by participant group.

Panel of smooth trend lines for two groups of participants: friends and Prolific users. Each line is a different chart type (unlabeled).

The chart types aren’t labeled since there’s little distinction between them (for this sample size). The “friends” group performs well. For the sample sizes in the survey (n=100 per pair), a K-S distance of 0.27 is similar to a p-value of 0.05, so the curve should shoot up pretty fast. It’s odd how the Prolific group levels out. If we break it down by the treatment type, it seems the Prolific group was worse at detecting location shifts (in red below), and there were more of those with extreme K-S values.
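The 0.27 figure can be checked with the standard asymptotic two-sample K-S critical value, assuming “n=100 per pair” means 50 observations in each sample:

```python
import math

def ks_critical(n, m, c_alpha=1.358):
    """Asymptotic two-sample K-S critical distance.
    c_alpha = 1.358 corresponds to a two-sided alpha of 0.05."""
    return c_alpha * math.sqrt((n + m) / (n * m))

# Two samples of 50 each (100 observations per pair).
print(round(ks_critical(50, 50), 3))  # → 0.272
```

So distances above roughly 0.27 correspond to p < 0.05, consistent with the expectation that ratings should climb quickly past that point.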

Panel of smooth trend lines for two groups of participants: friends and Prolific users. Each line corresponds to a different effect (bimodal, location, skew, spread).

Both breakdowns show that the Prolific group tends to rate the low K-S distance pairs with higher surprise.

Future

While it’s easy to focus on the disappointing results from the Prolific group, there is enough progress to have hope for improvement. I’m thinking of several main directions:

  • Drop it — leave the research to the experts
  • More data, better participant filtering — to overcome noise/quality issues
  • Simplify — such as fewer chart types or just one type of treatment effect
  • Lineups — maybe easier to analyze; maybe ok with few trials
  • Gamify — to increase engagement
