Smoothing Star Trek gender ratios

Our household recently got Paramount+ streaming, and we’ve been catching up on some old and new Star Trek series. The new ones are Discovery and Picard. The more balanced gender ratios are noticeable in the new shows, so I was especially interested to see Birko-Katarina Ruzicka’s excellent analysis of episode line counts by gender. Here is one of her many graphs, looking at the trend over time for all the series.

This chart uses the smoothest of smoothers, the straight line, but of course I wanted to try more variations. Most commendably, she shared the data and code on GitHub, including the cleaned-up script content and gender assignments for almost 2000 characters. I got tired just trying to spot-check the data, so I know it was a lot of work to assemble.

The shared code is in Python, but I wanted to try some interactive exploration in JMP. The first step was importing the giant 20MB JSON file of script data. Unfortunately, the organization wasn’t suited for JMP’s JSON importer. JSON data is largely a nested key-value store, and JMP expects the data to be in the values, but much of this data was embedded in the JSON keys. Fortunately, JMP’s scripting language, JSL, can parse and iterate through the JSON. After creating an empty data table, here’s the parsing code:

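// Parse the whole file into nested associative arrays.
json = Parse JSON( Load Text File( "StarTrekDialogue.json" ) );
// Much of the data lives in the keys, so walk all three key levels.
For Each( {k1}, json,
  For Each( {k2}, json[k1],
    For Each( {k3}, json[k1][k2],
      lines = json[k1][k2][k3];
      nl = N Items( lines ); // number of dialogue lines under this key
      // Append one row per key combination; dt is the empty table created earlier.
      dt << Add Rows( 1 );
      dt[N Rows( dt ), 0] = Eval List( {k1, k2, k3, k3, nl} );
    )
  )
);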
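For readers following along without JMP, the same flattening can be sketched in Python, the language of the shared analysis code. This is only an illustration under assumptions: it takes the three key levels implied by the JSL loop above with a list of lines at the bottom, and the tiny sample input is made up rather than taken from the real file.

```python
import json

def flatten_dialogue(nested):
    """Turn {k1: {k2: {k3: [lines]}}} into one row per (k1, k2, k3)."""
    rows = []
    for k1, level2 in nested.items():
        for k2, level3 in level2.items():
            for k3, lines in level3.items():
                # The keys carry the data; the value only contributes a line count.
                rows.append({"k1": k1, "k2": k2, "k3": k3, "n_lines": len(lines)})
    return rows

# Hypothetical miniature input, not the actual file contents.
sample = json.loads(
    '{"TNG": {"episode 1": {"PICARD": ["Engage.", "Make it so."],'
    ' "RIKER": ["Aye, sir."]}}}'
)
rows = flatten_dialogue(sample)
```

Each resulting dictionary corresponds to one row added to the data table in the JSL version.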

Next, I had to translate the Python clean-up code into JSL, and along the way I found a lot of other cleaning issues with the character names. Some were typos like “PIKER” for “RIKER”, and others were just variations to standardize, such as “PICARD JR” for, I assume, a parallel universe Picard. And sometimes stage directions were mixed in with the names, such as “KIRK [OC]” for “off camera”.
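The kinds of fixes described above can be sketched as follows. This is not the original clean-up code: the correction table is a hypothetical one-entry stand-in for the real, much longer lists, and the rule for bracketed stage directions is an assumption based on the “KIRK [OC]” example.

```python
import re

# Hypothetical correction table; the real clean-up lists are far longer.
NAME_FIXES = {"PIKER": "RIKER"}

def clean_name(raw):
    """Standardize a speaker name taken from a script."""
    # Drop bracketed stage directions such as "[OC]".
    name = re.sub(r"\s*\[[^\]]*\]", "", raw)
    # Collapse stray whitespace and normalize to upper case.
    name = " ".join(name.split()).upper()
    # Apply known typo corrections, leaving other names unchanged.
    return NAME_FIXES.get(name, name)
```

With these rules, "KIRK [OC]" and "KIRK" end up as the same character, so their line counts can be summed.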

I reported some additional items I found as a GitHub issue, and they’ve now been addressed by the author. I’m sure more cleaning can be done. Using JMP Recode’s Group Similar Values command provides a report of a few suspiciously similar character names, but most look too minor to affect the results even if they are miscodings.
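A similar check can be approximated outside JMP. The sketch below uses Python’s standard `difflib` to flag pairs of names that are nearly identical; it is a swapped-in technique, not the algorithm behind Group Similar Values, and the 0.75 threshold and the short name list are arbitrary assumptions for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar_pairs(names, threshold=0.75):
    """List pairs of distinct names whose similarity ratio meets the threshold."""
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 2)))
    return pairs

# Hypothetical sample of character names.
names = ["RIKER", "PIKER", "PICARD", "KIRK", "UHURA"]
suspects = similar_pairs(names)
```

Flagged pairs still need a human look, since two genuinely different characters can have similar names.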

My first step in understanding a previous analysis is usually to recreate an existing chart, partly to check that I imported the data correctly and partly to understand the choices made. Here’s a version of the original scatterplot with a fitted spline smoother per series instead of one common fitted line.

Each dot is an episode and should be about the same as the red (female) dots in the original chart. My smoothers are still rather stiff to show the main central trends. A few observations:

Voyager is distinctly higher than the other middle-era series. I guess having a female captain helps.
All of the middle series have a lot of variation, presumably because some episodes focus on just a subset of the large cast.
Though better, TNG, DS9 and ENT are not much more balanced than TOS: roughly 25% vs. 15%.
DIS and PIC finally get to gender dialogue parity.
That high dot in TAS is The Lorelei Signal, which I just happened to watch after looking at this data (we’re in the process of watching The Animated Series for the first time). The episode features Uhura taking command of the Enterprise and is written by Margaret Armen. Makes me wonder if there’s a general correlation between writer gender and dialogue balance. Need more data!

While a cubic spline is my go-to smoother, I was surprised to notice another smoother type do much better for detecting shifts over time in this noisy data. Here’s a view of just the middle four series (since they have the longest runs), separated each into its own panel. Instead of date on the X axis, I’m using episode numbers interpolated so that each season starts at a multiple of 100.
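One plausible way to build such an axis is shown below. The exact interpolation used for the chart is not stated, so the formula here is an assumption: season s starts at 100·s and its episodes are spread evenly across the next 100 units. The function name and arguments are hypothetical.

```python
def season_scaled_position(season, episode, episodes_in_season):
    """Map (season, 1-based episode) to an x value where season s starts at 100*s."""
    if not 1 <= episode <= episodes_in_season:
        raise ValueError("episode out of range for this season")
    # Spread the season's episodes evenly over a span of 100 units.
    return 100 * season + 100 * (episode - 1) / episodes_in_season
```

For a 26-episode season 3, the first episode lands at 300 and the fourteenth at 350, so seasons of different lengths still line up at the same tick marks.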

The mystery smoother in use here is a modernization of what John Tukey called a “wandering schematic plot” — essentially a continuous version of a box plot. JMP calls it a “moving box” smoother. The line is a moving median and the shaded regions correspond to the box and whisker extents of a box plot. I don’t remember enough of these shows to even guess at explanations for some of the features (the steady climb in seasons 3 and 4 of TNG and the steady decline in DS9 season 3). However, the lower new normal of DS9 starting in season 4 coincides with male Worf joining the cast, and the bump starting in season 4 of VOY corresponds to female Seven of Nine joining, so I feel the moving box is providing some value. There’s also something weird going on toward the end of TNG, where it’s more male-centric overall but skewed with a few female-centric episodes.
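The idea behind the moving median and box can be sketched in a few lines of Python. This is a simplified illustration, not JMP’s implementation: it uses a fixed-width centered window (an assumption), computes only the quartiles and median, and leaves out the whisker extents.

```python
from statistics import median, quantiles

def moving_box(values, window=5):
    """Return (q1, median, q3) for a centered window around each point."""
    half = window // 2
    out = []
    for i in range(len(values)):
        # The window is truncated at both ends of the series.
        chunk = values[max(0, i - half): i + half + 1]
        if len(chunk) >= 2:
            q1, _, q3 = quantiles(chunk, n=4, method="inclusive")
        else:
            q1 = q3 = chunk[0]
        out.append((q1, median(chunk), q3))
    return out
```

Plotting the middle value as a line and shading between the first and third values gives a rough version of the box portion of the chart.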

The above analyses counted lines of script dialogue. How about words of script dialogue? The time series charts of word percentages don’t look that different from the line percentages, so let’s compare them on a per-series basis.

Avoiding the stereotypical pink and blue, I’ve used nature’s choice of male and female colors, as sampled from pictures of male and female cardinals. Overall, the male characters have more words per line of dialogue. TNG is the most significant exception, while the means in DIS and PIC are too close to distinguish.
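For reference, a words-per-line figure can be computed as below. How words were counted for the charts is not stated, so splitting on whitespace is an assumption, and the sample lines are made up.

```python
def words_per_line(lines):
    """Mean number of whitespace-separated words per dialogue line."""
    if not lines:
        return 0.0
    return sum(len(line.split()) for line in lines) / len(lines)

# Hypothetical sample lines, not taken from the data set.
sample_lines = ["Make it so.", "Engage."]
average = words_per_line(sample_lines)
```

Applying the function to all lines grouped by series and gender would give the means being compared in the chart.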

Looking at a per-character breakdown of words per line doesn’t show any strong gender patterns.

It’s interesting that the top two characters are from the same alien race, Cardassians. Also, the only humans in the top ten are from Discovery. Ironically, the communications specialists Uhura and Hoshi are near the bottom. Non-binary Adira had 11.4 words per line but didn’t make the chart with only 168 lines in total.

Finally, here’s a packed bars view of total dialogue word counts. With twenty rows, we can infer that Picard has almost one twentieth of all Star Trek dialogue.

Raw Data Studies
