New: crunching IMDB data

IMDB, the Internet Movie DataBase, has a lot of information about movies, TV shows, actors, directors, gaffers, animators, etc. I just crunched some exported IMDB data to build a crossword dictionary so that the excellent Crossword Compiler could automatically create a "vocabulary puzzle" grid full of Hollywood magic:

vocabulary-puzzle grid whose entries are names of movies, TV shows, and performers

Creating vocabulary puzzles isn't super-useful, but there have been a few times when I wanted lists of famous movies, TV shows, and actors, and I couldn't figure out how to get them. So now I want to jot down some notes about what I did for next time.

A couple of days ago, the excellent Sandor Weisz posted on the Puzzle World Discord:

Sandy ↔: Does anyone know of a word list that's just movie titles, with rankings (of their popularity, not the movie quality)? ideally the list includes spaces between words
…and I tsked, sure that he was hurtling into an abyss of disappointment. Years before, I'd searched the internet for [imdb api] and the top result had been a fairly-useless site. Therefore, I knew that Sandy sought the impossible. But then I kept reading:
Sandy ↔: Update: found this, checking out now. https://datasets.imdbws.com/

So…instead of searching for an IMDB API, I should have searched for exported IMDB data. IMDB posts files with a lot of their most-useful data. They don't post all of their data. But what they do post is pretty cool. (Thank you excellent IMDB people for posting it; thank you Sandy for the pointer.)

name.basics.tsv.gz: This is a list of people IMDB knows about. The other exported data files don't refer to people by name. Instead, they use a name ID like nm1400577. This is useful because there are seven people named Amber Benson who have worked in show biz, but they have different name IDs, so you can keep them straight. IMDB knows that names can contain funny foreign letters as in Skarsgård, Chloë, or Penélope; my crossword-making program is not super-happy with those. IMDB does the right thing here, but I still want to remember to keep an eye on it if I tinker with this stuff again. Here are a few lines from the file:

nm0000229	Steven Spielberg	1946	\N	producer,writer,director	tt0082971,tt0120815,tt0083866,tt0108052
nm0072435	Amber Benson	1977	\N	actress,writer,director	tt0106627,tt0282410,tt0345551,tt0118276
nm1400577	Lawrence Hosken	\N	\N	miscellaneous	tt0328181
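A minimal sketch of reading rows like these in Python, assuming the column order shown in the sample above and that \N marks a missing value. The column names and both helper functions are made up for illustration (the real file has its own header row), and the accent-stripping is just one possible way to keep a crossword tool happy:

```python
import unicodedata

# Hypothetical column names, in the order the sample rows use.
NAME_COLUMNS = ["nconst", "name", "birth_year", "death_year", "professions", "known_for"]

def parse_name_row(line):
    """Split one tab-separated row into a dict; the missing-value marker becomes None."""
    fields = line.rstrip("\n").split("\t")
    row = {col: (None if val == r"\N" else val) for col, val in zip(NAME_COLUMNS, fields)}
    # Comma-separated fields become lists (empty when the value is missing).
    for col in ("professions", "known_for"):
        row[col] = row[col].split(",") if row[col] else []
    return row

def ascii_fold(name):
    """Drop combining accents, e.g. for a grid that only accepts A-Z.

    Lossy, and letters with no decomposition (such as 'ø') pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

sample = ("nm0072435\tAmber Benson\t1977\t\\N\tactress,writer,director\t"
          "tt0106627,tt0282410,tt0345551,tt0118276")
row = parse_name_row(sample)
print(row["nconst"], row["name"], row["death_year"], row["known_for"])
print(ascii_fold("Skarsgård"), ascii_fold("Chloë"), ascii_fold("Penélope"))
```

Keying everything on the name ID rather than the name is what keeps the seven Amber Bensons apart.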

Those tt numbers are title IDs. They're like name IDs, but for titles of movies, TV shows, etc. The name.basics file doesn't list all titles for a person. E.g., Steven Spielberg has directed more than four movies, but the file lists at most four: IMDB's notion of the four titles most associated with this person. There's some other filter at work: IMDB knows about two works I'm credited on, but this exported data file only lists one of them. (It's fine with me that it doesn't list my other credit, where I donated money to help an animator make a movie and that animator turned out to be a TERF. Ugh. I'm not complaining about the system; I'm just pointing out something that surprised me.) To find out the titles associated with those tt IDs, check another file:

title.basics.tsv.gz: Lists basic information about "titles": movies, TV shows, etc. This lets you know the actual title of the movie IDed in other files by a tt#. Here are a few lines:

tt0082971	movie	Indiana Jones and the Raiders of the Lost Ark	Raiders of the Lost Ark	0	1981	\N	115	Action,Adventure
tt0282410	movie	Chance	Chance	0	2002	\N	75	Comedy,Drama
tt0328181	videoGame	New Legends	New Legends	0	2002	\N	\N	\N

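There's a tt#, a category (movie, TV series, TV episode, video game…), a title, and an original title (maybe different). After those come an adult-content flag, a start year, an end year, a runtime in minutes, and a list of genres.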
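Here is a rough sketch of turning rows like those into a tt# → title lookup. It assumes the column order from the sample rows; load_titles and its wanted_types parameter are hypothetical names. Turning off the csv module's quote handling is a precaution, since titles can contain quotation marks:

```python
import csv
import io

def load_titles(lines, wanted_types=("movie",)):
    """Map tt ID -> title (third column) for rows whose category is wanted."""
    # QUOTE_NONE: treat quote characters in titles as ordinary text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    titles = {}
    for fields in reader:
        tconst, title_type, title = fields[0], fields[1], fields[2]
        # A header row is skipped naturally: its category is not a wanted type.
        if title_type in wanted_types:
            titles[tconst] = title
    return titles

# The three sample rows from above, standing in for the real file.
sample = io.StringIO(
    "tt0082971\tmovie\tIndiana Jones and the Raiders of the Lost Ark\t"
    "Raiders of the Lost Ark\t0\t1981\t\\N\t115\tAction,Adventure\n"
    "tt0282410\tmovie\tChance\tChance\t0\t2002\t\\N\t75\tComedy,Drama\n"
    "tt0328181\tvideoGame\tNew Legends\tNew Legends\t0\t2002\t\\N\t\\N\t\\N\n"
)
titles = load_titles(sample)
print(titles)

# For the real download, something like this should work (file path assumed):
#   import gzip
#   with gzip.open("title.basics.tsv.gz", "rt", encoding="utf-8", newline="") as f:
#       titles = load_titles(f)
```

Filtering on the category column early keeps the dictionary small, which matters because the full file also covers every TV episode.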

title.ratings.tsv.gz: a quality rating and number of reviews for each title. This is darned useful if you're a puzzle designer who wants a list of well-known movies: look for titles that have at least hundreds of thousands of reviews:

tt0082971	8.4	954526
tt0282410	6.7	1221
tt0328181	5.1	16

Nearly a million IMDB users reviewed Raiders of the Lost Ark; it's famous. About a thousand IMDB users reviewed the movie Chance; it's pretty obscure. Only 16 IMDB users reviewed the video game New Legends; it's even more obscure. But this measure isn't perfect: about 500 IMDB users reviewed the pretty-famous TV game show Wheel of Fortune, fewer than half as many as reviewed the delightful-but-obscure movie Chance. So… this measure is a good way to find famous works, but not a great way to find all the famous works.
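As a sketch of that filter: keep only the titles whose count in the third column clears a threshold, then sort by that count to get a popularity ranking. The function name and the 100,000 default are illustrative choices, not anything IMDB prescribes:

```python
def famous_ids(rating_lines, min_votes=100_000):
    """Return {tt ID: count} for rows whose third column is at least min_votes."""
    famous = {}
    for line in rating_lines:
        tconst, _rating, votes = line.rstrip("\n").split("\t")
        if not votes.isdigit():
            continue  # skips a header row, whose third field is not a number
        if int(votes) >= min_votes:
            famous[tconst] = int(votes)
    return famous

# The three sample rows from above.
ratings = [
    "tt0082971\t8.4\t954526",
    "tt0282410\t6.7\t1221",
    "tt0328181\t5.1\t16",
]
famous = famous_ids(ratings)

# Most-reviewed first: a popularity ranking rather than a quality ranking.
ranked = sorted(famous, key=famous.get, reverse=True)
print(ranked)
```

Given the Wheel of Fortune caveat, a lower threshold for TV shows than for movies might be worth trying; that is a guess, not something tested here.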

title.principals.tsv.gz does a better job of listing people associated with each title than you can get from name.basics.tsv.gz, but it's still missing a lot. Here is all the info it has about people who worked on Raiders of the Lost Ark:

tt0082971	1	nm0000148	actor	\N	["Indy"]
tt0082971	2	nm0000261	actress	\N	["Marion"]
tt0082971	3	nm0293550	actor	\N	["Belloq"]
tt0082971	4	nm0722636	actor	\N	["Sallah"]
tt0082971	5	nm0000229	director	\N	\N
tt0082971	6	nm0001410	writer	screenplay by	\N
tt0082971	7	nm0000184	writer	story by	\N
tt0082971	8	nm0442241	writer	story by	\N
tt0082971	9	nm0550881	producer	producer	\N
tt0082971	10	nm0002354	composer	\N	\N

That's just the top ten people in the credits. Many, many more people worked on the movie, of course. But the file will show you at most ten of them. There are other limits, too. IMDB knows that about 20 people worked on Red's Dream, the early Pixar short. But the title.principals file only lists two of them. I dunno what system decided that computer nerd Eben Ostby, technical director, didn't warrant a place in title.principals. If I'm building a crossword-dictionary of famous Hollywood types, it's OK if that dictionary doesn't list Eben Ostby; outside the halls of Pixar and the heads of gray-haired computer nerds, he's not-so-famous. But if I'm assuming that the title.principals file always has ten names per film with more than ten contributors, I'm in for a rude shock.
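One way to avoid that rude shock is to group the rows by title and count them rather than assume a fixed number. A small sketch, using the Raiders rows above trimmed to their first four columns; principals_by_title is a hypothetical helper name:

```python
from collections import defaultdict

def principals_by_title(lines):
    """Group (ordering, name ID, category) tuples under each tt ID."""
    by_title = defaultdict(list)
    for line in lines:
        tconst, ordering, nconst, category = line.rstrip("\n").split("\t")[:4]
        if not ordering.isdigit():
            continue  # skips a header row
        by_title[tconst].append((int(ordering), nconst, category))
    return by_title

# First four columns of the Raiders of the Lost Ark rows shown above.
rows = [
    "tt0082971\t1\tnm0000148\tactor",
    "tt0082971\t2\tnm0000261\tactress",
    "tt0082971\t3\tnm0293550\tactor",
    "tt0082971\t4\tnm0722636\tactor",
    "tt0082971\t5\tnm0000229\tdirector",
    "tt0082971\t6\tnm0001410\twriter",
    "tt0082971\t7\tnm0000184\twriter",
    "tt0082971\t8\tnm0442241\twriter",
    "tt0082971\t9\tnm0550881\tproducer",
    "tt0082971\t10\tnm0002354\tcomposer",
]
by_title = principals_by_title(rows)

# Count what is actually there instead of assuming ten per title.
for tconst, people in by_title.items():
    print(tconst, len(people), "principals listed")

# Performers only, in billing order, for a dictionary of famous actors.
performers = [nconst for _, nconst, category in sorted(by_title["tt0082971"])
              if category in ("actor", "actress")]
print(performers)
```

Run over the whole file, the per-title counts would show how often a title has fewer principals than expected.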

In the data files I looked at, I didn't find anything with the complete map of which people worked on which movies. For each person, you can get a list of about-four things they're most-known for… where the "about-four" can be kind of mysterious. For each title, you can get a list of about-ten principal contributors… again, where the "about-ten" can be kind of mysterious. There's a title.crew file that lists more directors and writers for a project, even if there are more than ten. I was hoping for a file like that for performers, but there's none such.

Anyhow, writing this down in case I want to do something like this again and need to de-rust my memories in a hurry. Oh and I put my code online.

Tags: puzzle scene programming entertainment industry