Who are songs written about? Answer: Jesus, John, Joe, Dan, Johnny, Billy and only then Mary
[Feb. 22nd, 2008|02:22 pm]
[ music | 50 years of rock'n'roll ]
Inspired by some graphs dr_tectonic produced which showed graphical representations of individual songs, I got distracted this morning and started thinking about who songs are written about.
As it happens, I have a (gendered) list of names on my computer from previous research, and I also have a dump of OLGA (the OnLine Guitar Archive, now defunct due to pressure from the MPA and NMPA) for my own personal, educational and research use. That's about 10,000 songs from about 1,100 artists; it's worth noting the list is biased towards English-language popular music with guitars in it that can be represented with chords or tab (i.e. rock/blues/pop) from the last fifty or so years.
There are a few things that make this problem difficult. First, identifying names is hard. I've assumed that they're capitalized words in songs that appear on the name list. That does cause some problems: there are a lot of names that are common words ('Will', 'Hope', 'Van', etc). Second, there's no XML here: it's all just flat text files, in directories by first eight characters of the band name. Third, identifying the gender of names is a whole problem unto itself. Fourth, I'm assuming that anonymous transcribers of songs are scrupulous about capitalization -- no "layla! i get down on my knees". Fifth, I don't want this to be more biased than necessary by the names of the artists: I want to know who they're writing about, not who's doing the writing. And sixth, I'm counting each name each time it shows up, not once-per-song, so the name 'Layla' gets nine hits, despite the fact that it's probably one song. Makes you wonder why I even bother trying. So the code has a lot of hedging to try and get around those problems[1], *and* I have to manually go into the result and decide what I think are and are not names.
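To make the counting rule concrete, here is a minimal Python sketch of the idea described above. It is not the original script: the tiny `NAMES` set, the `CAPITALIZED` pattern, and both function names are assumptions made up for illustration. It counts capitalized words that appear on a name list, and shows how counting every mention differs from counting once per song.

```python
import re
from collections import Counter

# Hypothetical stand-in for the gendered name list; the real one is far larger.
NAMES = {"layla", "mary", "will", "john"}

# Assumed definition of "capitalized word": one uppercase letter, then lowercase.
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")


def count_names(lyrics, names=NAMES):
    """Count every capitalized word in `lyrics` that appears in `names`."""
    counts = Counter()
    for word in CAPITALIZED.findall(lyrics):
        key = word.lower()
        if key in names:
            counts[key] += 1
    return counts


def count_names_per_song(songs, names=NAMES):
    """Count each name at most once per song rather than once per mention."""
    counts = Counter()
    for lyrics in songs:
        # Iterating a Counter yields its keys, so each name adds 1 per song.
        counts.update(set(count_names(lyrics, names)))
    return counts


# Invented example text, not a real transcription.
song = "Layla, please\nlayla! i get down on my knees\nWill you wait, Layla"
print(count_names(song))
print(count_names_per_song([song, "Oh Mary, Mary"]))
```

Note that the lowercase "layla" in the second line is missed, which is the capitalization problem, and that "Will" is counted as a name even though it is a verb here, which is the common-word problem.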
But, in conclusion, out of 1,255,417 lines in 10,296 songs, the following names are mentioned more than 50 times:
will 464 #likely not a name most of the time
jesus 199
john 163
joe 144
america 147 #weird name list
dan 108
johnny 108
billy 108
mary 102 #first female name
paul 78
van 72 #mainly due to a lot of non-english songs, not mr. morrison
james 69
peter 69
jack 66
tom 66
sally 63
jimmy 62
santa 59
ray 59
polly 55
willie 51
Aren't you glad you didn't ask? :)
[1]: I skip the first five lines of the file; I skip any lines with 'by' or 'artist' in them; I skip lines which are All In Title Case; I skip lines that have the name of the directory in them (i.e. some approximation of the artist); and I skip the words ["you", "love", "come", "song", "into", "set", "straight", "christmas", "lady", "round", "york", "melody", "young"], which are in the names list but seem unlikely to be informative in this context.
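The skipping rules in footnote [1] can be sketched roughly like this in Python. Again, this is a guess at the shape of the logic, not the actual code: the function names are invented, 'by' and 'artist' are matched as whole words (an assumption, since a plain substring test would also throw out every line containing "baby"), and "title case" is assumed to mean a line of at least two words that all start with a capital letter.

```python
import re

# Words from footnote [1] that are on the name list but skipped anyway.
SKIP_WORDS = {"you", "love", "come", "song", "into", "set", "straight",
              "christmas", "lady", "round", "york", "melody", "young"}
HEADER_LINES = 5  # the first five lines of each file are skipped

# Assumption: 'by' and 'artist' are matched as whole words, case-insensitively.
CREDIT = re.compile(r"\b(by|artist)\b", re.IGNORECASE)


def is_title_case(line):
    """Assumed test: two or more words, each starting with an uppercase letter."""
    words = [w for w in line.split() if w[0].isalpha()]
    return len(words) > 1 and all(w[0].isupper() for w in words)


def candidate_lines(text, artist_dir):
    """Yield the lines of one song file that survive the footnote's filters."""
    for number, line in enumerate(text.splitlines()):
        if number < HEADER_LINES:
            continue  # header block
        if CREDIT.search(line):
            continue  # credit lines such as "Transcribed by ..."
        if is_title_case(line):
            continue  # probably a song title or section heading
        if artist_dir and artist_dir.lower() in line.lower():
            continue  # mentions (an approximation of) the artist
        yield line


def keep_word(word):
    """Drop words that are on the name list but unlikely to be names here."""
    return word.lower() not in SKIP_WORDS


# Invented file contents; "someband" is a hypothetical directory name.
text = "\n".join([
    "h1", "h2", "h3", "h4", "h5",
    "Transcribed by someone",
    "Sweet Home Somewhere",
    "the someband fan club",
    "Oh Mary, won't you come home",
])
print(list(candidate_lines(text, "someband")))
```

Only the last line of the invented file gets through; the surviving lines would then be fed to whatever does the name matching, with `keep_word` applied to each candidate.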
For more detail, you might want to check out the complete output of names and frequencies here.
You might also, though it's unlikely, want to check out the code here or the [gendered] namelist here.