I built a “Data Strips” app to experiment with new ways of graphically summarizing the distribution of a single variable of a data set. You can try it at the above link and access the code on GitHub. This post will introduce the app and summarize the views. The app consists of:
- a control panel for configuring which data set to view, plus a couple of algorithm-tuning parameters
- 14 views of the same data set, 6 existing and 8 new. The new ones are drawn in green.
The screenshot below shows the control panel for a data set with 100 random Normal values and 2 additional values (simulating outliers). I’ll sprinkle views of other data sets throughout to reduce the need for scrolling as I discuss the details.

Why?
Every view has strengths and weaknesses, and my main goal is to explore alternatives that may address some weaknesses in common existing views. An ideal data strip view should be one dimensional and show:
- a central estimate (as a value or narrow region)
- the region(s) where the data is most concentrated
- an indication of shape: skewness, symmetry, normality, …
- potential outliers
- the range of the data
Additionally, especially for exploratory purposes, the view should prefer data values to calculated values and prefer robust calculated values to outlier-sensitive calculated values.
New ideas
Scalable outlier threshold
In searching for a scalable outlier threshold, I found Grubbs’s outlier test, which I’m applying to all the new views. The sensitivity can be tuned with the “outlier” parameter in the control panel, which is the alpha parameter explained nicely by Donald Wheeler. Grubbs’s test requires knowing the mean and standard deviation, which I’m trying to avoid depending on; instead, I infer the standard deviation that would correspond to the same spread region if the data were Gaussian. That may seem like a big assumption, but it just means that values beyond the outlier threshold are either outlier candidates or evidence that the data distribution is not Gaussian, which is all we can hope for. You might call them “Gaussian outliers” in that light. The box plot’s 1.5 IQR outlier threshold has the same property but doesn’t scale with the data size (Fig. 8).
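To make the threshold concrete, here is a minimal sketch (in Python with SciPy; the function names are mine, not the app’s) of the one-sided Grubbs critical value and the Gaussian-equivalent standard deviation idea:

```python
import math

from scipy import stats


def grubbs_threshold(n, alpha=0.05):
    """One-sided Grubbs critical value: how many standard deviations
    from the center a point may sit before it becomes an outlier
    candidate. It grows slowly with n, so the fence scales with size."""
    t = stats.t.ppf(1 - alpha / n, n - 2)  # t quantile at alpha/n, n-2 df
    return (n - 1) / math.sqrt(n) * math.sqrt(t * t / (n - 2 + t * t))


def gaussian_equivalent_sd(half_width_50pct):
    """If half the data lies within +/- half_width_50pct of the center,
    a Gaussian with the same 50% interval has sd = half_width / 0.6745."""
    return half_width_50pct / 0.6745
```

For n = 10 and alpha = 0.05 this gives a threshold near 2.2 standard deviations, rising toward roughly 3.9 by n = 1000, which is exactly the scaling behavior the fixed 1.5 IQR fence lacks.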
Asymmetric outlier thresholds
Unlike the usual IQR outlier rule, these Grubbs tests are applied asymmetrically: the Gaussian-equivalent standard deviation is computed separately for the lower and upper intervals. Partly to avoid edge cases where the central estimate is very close to one side of the spread interval, and partly because Tukey mentioned it as an implicit box plot central estimate, I used the trimean as the central estimate. Technically, the shortest-interval-based plots use an adapted trimean computed from the spread limits instead of the quartiles, with the midrange of the central interval in place of the median. (Yes, the box plot’s IQR-based whiskers look asymmetric because they’re clipped to the data range, but the calculation is symmetric.)
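For reference, the standard trimean (the adapted version described above differs) is just a quartile-weighted average; a sketch:

```python
import statistics


def trimean(data):
    """Tukey's trimean: weights the median twice as heavily as the
    two outer quartiles, giving a robust central estimate."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (q1 + 2 * median + q3) / 4
```

Because the quartiles move only slightly when a single wild value is added, the trimean stays put.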
Shortest half
I wanted to explore ideas around the Shortest Half, aka Shorth, concept for showing the densest areas. The shortest half is the shortest (max – min) span of data that contains at least half of the values. For Gaussian-distributed data, the shortest half will be the same as the interquartile range. I generalize the shortest half in two ways: to find the shortest intervals for any percentage (not just 50%) and to allow up to one split in the region. The former is trivial, but the latter comes with a big performance hit, going from O(n) to O(n^2). Still, computers are fast, and I wanted to see how it helps with bimodal data (Fig. 5).

Strip details
I’ll review the most common data strips first in a convenient order for exposition and then go into the new strips in the order they appear.
Rug
The simplest strip view is the Rug plot. It shows a short line for each data value. It’s not really a summary view, and it’s only here to help understand the other views. Besides lacking summary information, it becomes muddled for thousands of data values, and duplicate values overstrike each other. On the other hand, that overstriking makes it useful for highlighting discreteness in the data.
KDE (Kernel Density Estimate)
The Kernel Density Estimate is another non-summary strip view, provided for comparison because some of the summary strips are derived from the KDE. You may notice there are three overlaid KDE areas, which highlights one drawback of the KDE: you must choose a smoothing bandwidth (and a kernel shape). Another drawback is that the smoothed area extends beyond the data, sometimes to nonsensical values (such as below zero for a count variable). Violin plots are just mirrored KDE areas, though more often truncated by some rule.
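To see the overshoot concretely, here is a minimal Gaussian-kernel KDE sketch (not the app’s implementation); note that it assigns positive density well outside the data range:

```python
import math


def gaussian_kde(data, bandwidth, grid):
    """Kernel density estimate: each data value contributes a small
    bell curve of width `bandwidth`; the estimate is their average."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in data) / norm
        for g in grid
    ]
```

Evaluated below the smallest data value, the estimate is still positive, which is exactly the below-zero-counts problem mentioned above.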

Density Strip
The Density Strip is the same as the KDE area but with the density encoded as color intensity instead of by height of the curve. It seems worse for understanding the shape, but it’s more compact, needing only a few pixels in the narrow dimension.
HDR (Highest Density Regions)
Highest Density Regions plots can be understood as coarse representations of the KDE/Density Strip. Instead of showing a continuum of colors like the Density Strip, they use a few discrete colors for ranges of densities. Ironically, by showing less detail, they tend to show more useful information since they draw your attention to the densest areas. They also show the single densest position (the density mode) as a dark line, which serves as a central estimate. Values beyond the widest density threshold (usually 99%) are shown as outliers. Being based on the KDE, the dense regions are not required to be contiguous, which can be useful for bimodal data (example below). HDR inherits the KDE drawbacks of having to choose parameters and overshooting the data values.

Tukey Box
This is the standard box plot invented by John Tukey, drawn in gray in the app. I’m only using the “Tukey” label to distinguish it from my new variation. The box plot may be the reigning champion of robust strip views. Using quartiles, it provides a robust summary view of the data, and the 1.5 IQR (interquartile range) rule for outliers works well for small data sizes, especially with the whisker ends being truncated to actual data values.
Nick Desbarats has written about box plot drawbacks (follow-up article): in particular, the non-intuitive meaning of the box and whisker shape to the uninitiated and the ability to hide bimodal distributions (Fig. 3). In Tukey’s book Exploratory Data Analysis, he acknowledged the latter drawback but added “the experienced viewer—finding the whiskers so short, in comparison with box length—is likely to become suspicious that he should see more detail.”
After using thousands of box plots myself, I’ve experienced a different set of drawbacks.
- As data sets get larger, the 1.5 IQR rule starts to break down, which makes it hard to use a box plot as a quick summary of very large data sets. A box plot of 1 million perfectly Gaussian data values will still show roughly 7000 outliers, overwhelming the rest of the view. Even for a moderate size of 5000 values as in Fig. 2, you can see a distracting number of outliers. Letter-value plots are one attempt to address that (maybe I should add them to the app, though they are more 2D than other “strip” plots).
- The median is not always what I think of as the central estimate. At least for skewed distributions, it often seems that the mode is more representative of the population. It is the most common value (or range of values), after all. See the exponential distribution of Fig. 6.
- (Minor) For small data sets, the median and other quartiles don’t always fall on data values, and sometimes the two nearest values can be quite far apart. For the Q1 and Q3 computations, there are something like ten different formulas in use across software packages.
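The first drawback is easy to quantify: for a Gaussian, the 1.5 IQR fences sit about 2.7 standard deviations out, so a fixed fraction of the data is always flagged no matter how large n gets. A quick check:

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# For a Gaussian, Q1/Q3 sit at -/+ 0.6745 sd, so IQR = 1.349 sd and the
# 1.5*IQR fences land about 0.6745 + 1.5 * 1.349 = 2.698 sd from center.
fence_z = 0.6745 + 1.5 * (2 * 0.6745)
flagged_fraction = 2 * (1 - normal_cdf(fence_z))  # roughly 0.7%
```

Multiplied by n, that fixed fraction produces the thousands of flagged points mentioned above.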
I’m not as concerned about the bimodal hiding. Besides Tukey’s insight, I find that pure bimodal data is quite rare. More often multi-modal data has overlapping modes that merge. In those cases, like the classic Old Faithful data of Fig. 5, the box plot still gives an indication that things are “not normal.”

Heatmap
Finally, I included what Nick Desbarats calls a distribution heatmap. Just as a Density Strip re-encodes the KDE’s curve height as color intensity, a distribution heatmap re-encodes a histogram’s bar height as color. That is, fixed-size intervals are colored based on the count of data values within them. As an outlier treatment, intervals with only one or two values also show those values as dots. Optionally, you could overlay a line at the median or mean. In my limited testing, it holds up well, though I have some concerns for general use:
- You must choose an interval width (and offset), which is akin to the forever problem of choosing the “best” histogram bin width. In the worst case, the bin size and data discreteness can interfere to produce odd patterns (Binomial data in Fig. 7).
- You must choose a color mapping, which can be prone to visual-perception issues with too many levels and less informative with too few.
- If multiple heatmap strips are being shown together, they need to have their color scale (and binning) aligned, which means that unlike the other strip views, the view will change depending on its context.
- By design, the bin intervals are not aligned to data values, not even in the minimum and maximum extents.
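The binning underneath a distribution heatmap is just fixed-width counting; a sketch (with the width and offset made explicit, since those are the choices flagged above):

```python
import math
from collections import Counter


def heatmap_counts(data, width, offset=0.0):
    """Count values into fixed-width intervals keyed by their left
    edge; a heatmap maps these counts to colors."""
    bins = Counter(math.floor((x - offset) / width) for x in data)
    return {offset + b * width: c for b, c in sorted(bins.items())}
```

Shifting the offset regroups the same data into different bins, which is the interference effect visible with discrete data (Fig. 7).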
Nonetheless, if I ever get around to making a study for comparing data strip performance on real data, I’ll want to include distribution heatmaps.
New Views
Now on to the new views. Though only some of the views have “Grubbs” in the name, all of them except 20/50/80 Rug use the Grubbs test for determining the outermost region. And the views using shortest intervals use the control panel’s Split Penalty in deciding whether to split a shortest interval into two parts. I’m undecided about whether that’s a good idea, but I’ve left the penalty at a modest value to see how the split intervals do.
HDR Grubbs
These two views follow the same HDR rules for the given percentiles (50/90 or 33/67) and then add one more region based on the threshold from the Grubbs’s test, clipped to actual data values. Compared to the classic HDR, they reduce the number of outliers for large data sets (Fig. 2) and avoid the overflow beyond the data even for internal bounds (Fig. 3 and 4). Only the 50/90 version shows the density mode line. All the HDR plots give direct indication of the bimodal nature of the Old Faithful data (Fig. 5).

Shortest Thirds
Shows the shortest third of the data and the shortest two thirds, in addition to the Grubbs-based widest interval. As in the 33/67 HDR, the idea is to dispense with the ideal of a single central estimate value and show a tight range instead.
Shortest Halves
Shows the shortest half applied iteratively. The final iteration with only 2 or 3 values is known as the half-sample mode. I hoped it might serve as a good data-driven and robust central estimate, but I haven’t been encouraged by my tests so far; it seems jumpy, especially with small sample sizes.
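The iteration is short to state in code; a sketch of the half-sample mode, assuming a simple mean of the final 2–3 survivors (the app may resolve ties differently):

```python
def half_sample_mode(data):
    """Half-sample mode: repeatedly keep the shortest half of the
    remaining data until only 2-3 values survive, then average them."""
    xs = sorted(data)
    while len(xs) > 3:
        n = len(xs)
        k = (n + 1) // 2  # at least half of the remaining values
        i = min(range(n - k + 1), key=lambda j: xs[j + k - 1] - xs[j])
        xs = xs[i:i + k]
    return sum(xs) / len(xs)
```

The jumpiness comes from the min: with small samples, tiny perturbations can flip which window is shortest at some iteration, moving the final survivors a long way.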
Equal Gaussian
Shows the shortest intervals based on Gaussian distribution percentiles. If the data distribution is Gaussian, the inner regions should have about the same length. Specifically, if the data were Gaussian, the inner region would be one standard deviation wide (from mean − 0.5 sd to mean + 0.5 sd), and the next ones would be 3 and 5 standard deviations wide. Besides having those lengths serve as a Normality indicator, using shortest intervals helps with heavily skewed distributions (Fig. 2).
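The coverage fractions for those widths follow directly from the normal CDF; a quick check (intervals 1, 3, and 5 sd wide cover about 38%, 87%, and 99% of a Gaussian):

```python
import math


def gaussian_coverage(width_in_sd):
    """Probability mass of a standard normal inside a mean-centered
    interval of the given total width (in standard deviations)."""
    return math.erf((width_in_sd / 2) / math.sqrt(2))
```

Presumably those are the percentiles behind this view’s shortest intervals; under Gaussian data the resulting bands come out with equal incremental lengths, which is the Normality check.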
20/50/80 Rug
Instead of trying to be a better box plot or better HDR plot, this one aims to be a better rug plot. It shows the specified shortest intervals to provide centrality and shape indicators, and it shows other values as short rug lines, except it combines adjacent or overstriking lines so that they don’t look denser than the identified dense areas. Having no explicit outlier-rule threshold, this view may be the most purely data driven. The rug technique combined with the low outer percentile (80%) seems to help reveal discreteness in the data (Fig. 5 and 7).
Shortest 0/50/95
This strip view is intended to be most like the box plot but using shortest intervals instead of quartiles. The “0” indicates the use of the half-sample mode as the central estimate. Like all the strip views based on shortest intervals, it does better than the box plot for highly skewed data (Fig. 6). However, the previously mentioned jumpiness of the half-sample mode may be a drawback.

Grubbs Box
This strip view is the same as a box plot, except with a different outlier rule. It uses the 1.5 IQR value or the threshold from the Grubbs test, whichever is greater. When the Grubbs value is used, the whisker end cap is drawn with a curve (suggesting it’s been pushed out beyond its default value). For small data sets there’s usually no difference, but there’s a big difference for large data sizes (Fig. 3 and 8).
Another possible refinement I haven’t tried would be to clip the box edges to actual data values.
Large Data
If a strip view is to serve as a general-purpose exploratory data analysis tool, it needs to handle all data sizes well. Figure 8 shows how all the strips handle a large (n=100,000) Gaussian data set. The rigid outlier threshold problem is evident for the common HDR and box plots. Heatmap and Density Strip suffer from color mapping issues where the low-density regions are practically invisible. The rug plot is overwhelmed and has slow drawing speed. The new views do well, but the split shortest interval algorithm had to be adjusted to sacrifice accuracy for speed (not an issue here but could be).

Small Data
The data sets I see most often seem to be small and skewed, like the 25-sample LogNormal of Figure 9. All the views look okay, in that if asked “is this data closer to normal or lognormal?” most readers would answer lognormal. I still find the box plot appealing, but familiarity is obviously a factor. These are the kinds of views I alluded to earlier regarding the jumpy half-sample mode. Just change the random seed in the control panel and watch the mode jump from one side of the 50% interval to the other.

Conclusions
The forever-conclusion still applies: “more research is needed,” but I would like to whittle down the options before going to the next level, which is probably a similar app that shows the strips in a practical situation. For instance, the same method showing three different groups of results from an experiment. And repeated for a few methods.
I could try to quantify the “jumpiness” of all the methods with some sort of bootstrapped confidence intervals of the range bounds.
Of my three “new ideas” above, using Grubbs’s test for outlier bands seems most promising, and it would be a good option for box plots and HDR plots. However, I’m not sure I’m seeing much benefit from the asymmetric outlier calculations; they’re probably not worth the complexity and depend too much on having a good central estimate. The shortest interval strips shine for really skewed data but seem too sensitive for small data. Maybe Shortest Thirds is the most viable, since its wider intervals are naturally less sensitive.
I invite you to explore the app, experiment with different parameters, and even fork the project on GitHub to make better variations.
