Back in 2018 I shared a few graphs about a song’s title appearing within its lyrics, and a few recent reminders have prompted me to revisit that study.
- Dominikus Baur shared a project he co-created about movie titles appearing in the dialog (using the term “title drops”)
- Amy Shafer’s first-listen video for The End by The Doors noted how unusual it is for rock songs to include the title of the song in the first line.
- The paper Soundscapes of morality: Linking music preferences and moral values through lyrics and audio by Preniqi, V., Kalimeri, K., & Saitis, C. (2023, March 23) includes lyrics for songs it analyzes as part of its downloadable supplementary materials.
Looking back at my tweets (search for “song” in my tweet archive), I focused on (1) how often the title appeared:

In this dataset, the maximum was over 140 occurrences of “never” in Flo Rida’s song of that name.
And (2) at what relative position in the song the title appeared:

Lyrics data
The first challenge for analyzing lyrics is finding them in downloadable form. Many lyrics websites exist, but I don’t know of any proper lyrics datasets. The charts above used a dataset of 57,000 songs someone shared on Kaggle. It’s no longer at the URL I tweeted, but it looks like the same dataset is now on Kaggle as Spotify Million Song Dataset. Despite the name it’s still about 57,000 songs, and it has the same error the original dataset had, where the ABBA song titled “She’s My Kind of Girl” has the “S” replaced with “A”.
For this study, I’ll be mainly using the song data from the Soundscapes of Morality paper, both to get a different sample and because it’s more recent and can include newer songs. The paper limited itself to the five most played songs of each of 5,459 artists, totaling over 27,000 songs.
Cleaning and matching
No matter what the data source, much cleaning is necessary to have a decent title search result. In these datasets, song titles often have some extra info included such as “live” or “remix” or even naming a guest artist. Most of those are in parentheses, and I chose to ignore anything in parentheses for the purpose of title inclusion, even though sometimes the parenthetical bit was a substantive part of the title, such as “I Wanna Dance with Somebody (Who Loves Me)” by Whitney Houston. I also removed punctuation, except replacing “+” and “&” with the word “and”.
For the matching, I built a custom regular expression for each title that added alternative versions of some terms. Small numerals also match their spelled out forms. Other alternatives I found while exploring the songs and titles:
- in/ing/ing/ing
- r/are
- u/you
- ur/your
- you/ya
- o/oh/ooh/oooh
- n/and
- duz/does
- luv/love
- thru/through
- wanna/want to
- wanna/want to
- gotta/got to
- gonna/going to
- z/s
I didn’t do any smart stemming to match different word forms, except to allow a suffix on the final word of a title. That is, the title match doesn’t have to end at whitespace. That allows mostly plurals, but another example I saw was a song called “Stop” which only appeared as “stopped” in the lyrics, and now that counts as a match.
Updated graphs
Here’s an update to the “when” histogram with the new data and no longer having a separate bar for a title at the very start of the lyrics.

This dataset has more songs whose lyrics don’t include the title (22% versus 14%). I used the same matching logic, but it could be the newer dataset needed more cleaning to get more matches. Or maybe it’s related to the different song selection criteria or genre, as Amy Shafer suggested.
The occurrences bar chart has a similar shape as the original, except for the already seen increase in zero occurrences.

It even has the conspicuous dips at five and seven. I’ll speculate the preference for even numbers is driven by songs with particular repetitive pattens in the chorus, such as in the Stevie Wonder song, “I just called to say I love you”
I just called to say I love you
I just called to say how much I care
I just called to say I love you
And I mean it from the bottom of my heart
Genre graphs
Amy Shafer mentioned a genre correlation, but unfortunately this dataset doesn’t include genre. I found a few other datasets with genre for different artists, and managed to get some genre data for about 20,000 of the songs. I still had to do a long of hand-recoding to get 100+ genres down to a dozen or so broad genres, and that’s assuming all songs by the same artist fall in the same genre.
But following Rumsfeld’s Law of Data Science (“you analyze the data that you have, not the data that you might want or wish to have at a later time”), I made a few charts based on my shaky genre data.

I’d never heard of adult standards as a music genre—apparently it includes classic pop artists like Frank Sinatra and Stevie Wonder, but it’s the genre most likely to include the title very near the start of the lyrics. Rock is somewhere near the middle.
Instead of summing up an entire genre with one number, we can look the a distribution of the title drop positions, this time as a scatter plot using density random jitter.

There’s a bimodal nature to the position in most genres. With each dot drawn with some transparency, the dark stripes show overly dense parts of the distributions. It seems metal songs are more likely to save the title drop until the end.
Data surprises
I didn’t scan the data carefully, but I happened to notice a couple surprises. The song “Oh Pretty Woman” by Roy Orbison (and later Van Halen) is one where the title first occurs at the end of the lyrics. The song starts with and often includes “pretty woman”, just not with the “oh” until the end. A different matcher have have ignored “oh” and given a different result.
The 10,000 Maniacs song “Like the Weather” never mentions those words in the title. In my mind the lyrics went “I’m feeling like the weather” but in reality it’s only “I’m thinking about the weather.”
