phraser

Phraser generates a list of English phrases roughly ordered by suitability for word puzzles. It crunches through Wikipedia, Google Ngram data, and text files to find popular phrases.

A What the What List?

Its output is a text file where each line is of the form score TAB phrase; the file is sorted such that high scores come first. "Score" roughly corresponds to how common a phrase is; there's some tweaking to boost clue-able things above grammatical glue (but there's still plenty of grammatical glue). Some excerpts from an output file:

142036	the
96614	of
92286	in
88057	to
…
40228	at
40053	are
38387	of the
38310	this
36233	have
…
3711	distinction
3711	evaluation
3709	1969
3709	counter
3709	de la
…
36	october or early
36	octothorpe
36	octree
36	ocular dominance columns
36	odal
…
7	화
7	후
7	fi
7	fire
7	ſt
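
The format is simple enough that a few lines of Python can read it back. A minimal sketch (the filename `phrases_5M.txt` is the one used elsewhere in this document; the parsing mirrors the score-TAB-phrase layout shown above):

```python
# Read a phraser output file: one "score TAB phrase" entry per line,
# sorted with the highest scores first.

def read_phrases(path, limit=None):
    """Yield (score, phrase) tuples; stop after `limit` lines if given."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            score_s, phrase = line.rstrip("\n").split("\t", 1)
            yield int(score_s), phrase
```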

You might wonder how such a list is useful for designing puzzles. Consider the following puzzle:

Musical Numbers
Get a secret message out of this:

Travis Tritt: ♫ Doesn't the Good ________ ___ ___? ♫ _ _ _ _ _ _ _ _   _ _ _   _ _ _?
He defined the dogma of the Immaculate Conception _ _ _ _   _ _
Minnesota governor most likely to win a cage match _ _ _ _ _   _ _ _ _ _ _ _
 
♫ __ _ _______ in the slipstream/Between the viaducts of your dream ♫   _ _   _   _ _ _ _ _ _ _ _
Prime Directive novelist _ _ _ _ _ _   _ _ _ _ _ _-_ _ _ _ _ _ _
Songs from Lonely Avenue band _ _ _ _ _   _ _ _ _ _ _   _ _ _ _ _ _ _ _ _
a tyro _ _   _ _ _ _ _ _ _ _ _ _ _ _ _   _ _ _ _ _ _

_ _ _   _ _ _ _

Here (SPOILER WARNING), the gimmick is that each partial answer contains the name of a digit. E.g., that Travis Tritt lyric ends "…outweigh the bad". The secret message is a seven-digit phone number, so the puzzle designer needed seven of these digit-containing phrases. Those aren't so easy to think of with your brain; but if you've compiled a big list of phrases, it's pretty easy to search. And if you've (roughly) sorted your list by puzzle-suitable-ness, then you shouldn't have to wade through too much trash to find the good stuff. You might slap together a quick-and-dirty little Python script…

DIGITS = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
found = {d: [] for d in DIGITS}

for line in open("phrases_5M.txt", encoding="utf-8"):
    score_s, phrase = line.strip().split("\t")
    spaceless = phrase.replace(" ", "")
    for d in DIGITS:
        if d not in spaceless: continue  # find jesSE VENtura
        if d in phrase: continue         # ...but not SEVENties
        found[d] = [f for f in found[d] if f not in phrase]  # found "pope piuS IX"? discard "piuS IX"
        found[d].append(phrase)

for d in DIGITS:
    print(d, found[d][:10])  # for each digit, show up to 10 "best" phrases
…give it a whirl…
$ python3 sample_digits.py
zero ['fertilizer on', 'organizer of the', 'buzzer on', 'brian setzer orchestra', 'popularizer of', 'appetizer or', 'eliezer of damascus', 'gaze roamed']
one ['back to new york', 'carbon emissions', 'what to do next', 'to new york in', 'also never', 'peace on earth', 'to newport', 'on either side of the', 'restriction enzymes', 'conservation efforts']
two ['that would have been', 'it would seem that', 'it would appear that', 'that would not be', 'gunshot wounds', 'it would be more', 'in the ancient world', 'and it would be', 'as it would be', 'do not worry about']
three ['with reeds', 'with reed', 'judith reeves']
four ['days of our lives', 'of our god', 'of our times', 'the history of our', 'the nature of our', 'the time of our', 'the results of our', 'of our world', 'of our existence', 'the beginning of our']
five ['if ive', 'if i venture', 'fi ve']
six ['pope pius ix', 'louis ix of france', 'charles ix of france', 'king louis ix', 'acknowledgments ix introduction', 'of pius ix', 'charles ix of sweden', 'king charles ix', 'ansi x3', 'contents acknowledgments ix']
seven ['it was eventually', 'sports events', 'and perhaps even', 'and sometimes even', 'and was eventually', 'jesse ventura', 'towards evening', 'this event was', 'it is even more', 'news events']
eight ['weigh them', 'to weigh the', 'outweigh the benefits', 'outweigh the costs', 'outweigh the risks', 'weigh the pros', 'outweigh their', 'weigh the benefits', 'weigh the risks', 'outweigh the disadvantages']
nine ['an inexperienced', 'an inexplicable', 'an ineffective', 'an inexorable', 'than in england', 'an inexpressible', 'an ineffectual', 'an inestimable', 'education in england', 'for security and co operation in europe']
$

…and now you've got some choices to look through. They're not all great; even some of the OK results need some massaging. (E.g., the word list has "judith reeves" but her full name is "Judith Reeves-Stevens.")

There are some other excellent wordlists out there.

How to Use It

Get some text fodder. This program can read mediawiki dumps (e.g. wikipedia snapshots), Google Ngrams files, and plain text files.

Mediawiki

Wikipedia: Learn about Wikipedia downloads. You want a file with a name something like enwiki-20160204-pages-articles.xml; you'll download a compressed snapshot and uncompress it.

Wiktionary and wikiquote also have downloadable snapshots. Writing a bunch of puzzles on some theme nerds like? See if there's a wikia on the topic and if that wikia has a downloadable snapshot (which will be linked from http://whatever.wikia.com/wiki/Special:Statistics; you want "Current Pages".)

Put your wiki files in a directory.
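
If your snapshot arrived bzip2-compressed (Wikipedia dumps usually do), it can be stream-decompressed without holding the whole thing in memory. A sketch, assuming the example filename from the text; adjust paths to whatever you actually downloaded:

```python
# Stream-decompress a .bz2 wiki dump into the directory phraser will read.
# The filenames here are illustrative, not required by phraser.
import bz2
import shutil

def uncompress_dump(src, dst):
    """Decompress a bzip2 file at `src` to a plain file at `dst`."""
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)

# e.g. uncompress_dump("enwiki-20160204-pages-articles.xml.bz2",
#                      "dumpz/mediawiki/enwiki-20160204-pages-articles.xml")
```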

Google NGrams

Download many data files to a directory. This is a big data set; it might take several days to download. It might fill up your hard drive.

Run the ngram_winnow.py script in your download directory. If your hard drive is filling up, you might want to discard the ngram data files as this script generates "winnowed" versions of them.
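
For a sense of why winnowing shrinks the files so much: the Google Ngram v2 data files are TSV with one row per (ngram, year), of the form ngram TAB year TAB match_count TAB volume_count. This sketch is a hypothetical illustration of that kind of reduction, not the project's actual ngram_winnow.py; it collapses the per-year rows into one total per ngram:

```python
# Hypothetical illustration only -- NOT the project's ngram_winnow.py.
# Collapses Google Ngram v2 rows ("ngram\tyear\tmatch_count\tvolume_count")
# into a single total count per ngram.
import collections

def winnow(lines):
    """Sum match_count across years; return {ngram: total_count}."""
    totals = collections.Counter()
    for line in lines:
        ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
        totals[ngram] += int(match_count)
    return totals
```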

Text files

Find cool text files. Put them in a directory.

Build, Run

Have the Go build tools installed and a development environment set up.

Have the phraser source code.

Build it: go install github.com/lahosken/misc/phraser

Run it, telling it where to find those directories full of data files you downloaded:

phraser -wikipath dumpz/mediawiki/ -txtpath dumpz/textfiles/ -ngrampath dumpz/ngram/

Noisy logs are noisy, sorry.

In the first stage, it gathers files with "scores" for parts of the data into a tmp directory. In the last stage, it reads those intermediate files back in to generate one big count; during this stage, it uses a lot of memory.
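
Conceptually, the last stage is a merge. This sketch is an assumption about what such a merge looks like, not phraser's actual Go code; it supposes the intermediate files share the score-TAB-phrase format of the final output:

```python
# Conceptual sketch of a final merge stage (an assumption, not phraser's
# real implementation): sum per-shard scores for each phrase, then emit
# lines sorted with the highest combined score first.
import collections

def merge_scores(shards):
    """shards: iterable of iterables of 'score\\tphrase' lines."""
    totals = collections.Counter()
    for lines in shards:
        for line in lines:
            score_s, phrase = line.rstrip("\n").split("\t", 1)
            totals[phrase] += int(score_s)
    return ["%d\t%s" % (score, phrase)
            for phrase, score in totals.most_common()]
```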

When done, it writes a big output file in your home directory named Phrases_something_something.txt. You might use this to generate smaller files for day-to-day use, perhaps something like

$ cd
$ head -5000000 Phrases_20160625_082207.txt > phrases_5M.txt
$ grep -v ' ' Phrases_20160625_082207.txt | head -500000 > words_500K.txt
$

Notes on Data Sets

Wikipedia is pretty sweet at covering many topics. It's not a great source for "playful" language. Any wikipedia article saying "The emperor was as nervous as a long-tailed cat in a roomful of rocking chairs" would be swiftly edited by folks who (sensibly) didn't want readers to have to struggle with such a tangled idiom. And yet this language is good for puzzles.

Google Ngrams have a wide variety of language, probably wider than Wikipedia. Still, there are some quirks. One was obvious enough even for me to notice: it doesn't use contractions. Instead of "doesn't", it has "does not". The data ?seems? skewed towards what you might expect from many small books: more copyright notices and mentions of publishing companies than you might want.

I found text files with "playful" language. E.g., data files for "fortune cookie" programs. I'd like more. I'd sure like a collection of song lyrics and lines of poetry "weighted" by what percentage of the population knows/recognizes them. And/or a list of movie titles scored by how many folks know those titles. I'd like more idioms. Still, the stuff I've collected feels pretty good so far.
