phraser

Phraser generates a list of English phrases roughly ordered by suitability for word puzzles. It crunches through Wikipedia, Google Ngram data, and text files to find popular phrases.

A What the What List?

Its output is a text file where each line is of the form score TAB phrase; the file is sorted such that high scores come first. "Score" roughly corresponds to how common a phrase is; there's some tweaking to boost clue-able things above grammatical glue (but there's still plenty of grammatical glue). Some excerpts from an output file:

142036	the
96614	of
92286	in
88057	to
…
40228	at
40053	are
38387	of the
38310	this
36233	have
…
3711	distinction
3711	evaluation
3709	1969
3709	counter
3709	de la
…
36	october or early
36	octothorpe
36	octree
36	ocular dominance columns
36	odal
…
7	화
7	후
7	fi
7	fire
7	ſt
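
The format is simple enough that a few lines of Python can read it back. A minimal sketch (the filename `phrases_5M.txt` is the one used elsewhere in this document; the parsing mirrors the score-TAB-phrase layout shown above):

```python
# Read a phraser output file: one "score TAB phrase" entry per line,
# sorted with the highest scores first.

def read_phrases(path, limit=None):
    """Yield (score, phrase) tuples; stop after `limit` lines if given."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            score_s, phrase = line.rstrip("\n").split("\t", 1)
            yield int(score_s), phrase
```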

You might wonder how such a list is useful for designing puzzles. Consider the following puzzle:

Musical Numbers
Get a secret message out of this:

Travis Tritt: ♫ Doesn't the Good ________ ___ ___? ♫ _ _ _ _ _ _ _ _   _ _ _   _ _ _?
He defined the dogma of the Immaculate Conception _ _ _ _   _ _
Minnesota governor most likely to win a cage match _ _ _ _ _   _ _ _ _ _ _ _
 
♫ __ _ _______ in the slipstream/Between the viaducts of your dream ♫   _ _   _   _ _ _ _ _ _ _ _
Prime Directive novelist _ _ _ _ _ _   _ _ _ _ _ _-_ _ _ _ _ _ _
Songs from Lonely Avenue band _ _ _ _ _   _ _ _ _ _ _   _ _ _ _ _ _ _ _ _
a tyro _ _   _ _ _ _ _ _ _ _ _ _ _ _ _   _ _ _ _ _ _

_ _ _   _ _ _ _

Here (SPOILER WARNING), the gimmick is that each partial answer contains the name of a digit. E.g., that Travis Tritt lyric ends "…outweigh the bad". The secret message is a seven-digit phone number, so the puzzle designer needed seven of these digit-containing phrases. Those aren't so easy to think of with your brain; but if you've compiled a big list of phrases, it's pretty easy to search. And if you've (roughly) sorted your list by puzzle-suitable-ness, then you shouldn't have to wade through too much trash to find the good stuff. You might slap together a quick-and-dirty little Python script…

DIGITS = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
found = {d: [] for d in DIGITS}

for line in open("phrases_5M.txt", encoding="utf-8"):
    score_s, phrase = line.strip().split("\t")
    spaceless = phrase.replace(" ", "")
    for d in DIGITS:
        if d not in spaceless: continue  # find jesSE VENtura
        if d in phrase: continue         # ...but not SEVENties
        found[d] = [f for f in found[d] if f not in phrase]  # found "pope piuS IX"? discard "piuS IX"
        found[d].append(phrase)

for d in DIGITS:
    print(d, found[d][:10])  # for each digit, show up to 10 "best" phrases
…give it a whirl…
$ python3 sample_digits.py
zero ['fertilizer on', 'organizer of the', 'buzzer on', 'brian setzer orchestra', 'popularizer of', 'appetizer or', 'eliezer of damascus', 'gaze roamed']
one ['back to new york', 'carbon emissions', 'what to do next', 'to new york in', 'also never', 'peace on earth', 'to newport', 'on either side of the', 'restriction enzymes', 'conservation efforts']
two ['that would have been', 'it would seem that', 'it would appear that', 'that would not be', 'gunshot wounds', 'it would be more', 'in the ancient world', 'and it would be', 'as it would be', 'do not worry about']
three ['with reeds', 'with reed', 'judith reeves']
four ['days of our lives', 'of our god', 'of our times', 'the history of our', 'the nature of our', 'the time of our', 'the results of our', 'of our world', 'of our existence', 'the beginning of our']
five ['if ive', 'if i venture', 'fi ve']
six ['pope pius ix', 'louis ix of france', 'charles ix of france', 'king louis ix', 'acknowledgments ix introduction', 'of pius ix', 'charles ix of sweden', 'king charles ix', 'ansi x3', 'contents acknowledgments ix']
seven ['it was eventually', 'sports events', 'and perhaps even', 'and sometimes even', 'and was eventually', 'jesse ventura', 'towards evening', 'this event was', 'it is even more', 'news events']
eight ['weigh them', 'to weigh the', 'outweigh the benefits', 'outweigh the costs', 'outweigh the risks', 'weigh the pros', 'outweigh their', 'weigh the benefits', 'weigh the risks', 'outweigh the disadvantages']
nine ['an inexperienced', 'an inexplicable', 'an ineffective', 'an inexorable', 'than in england', 'an inexpressible', 'an ineffectual', 'an inestimable', 'education in england', 'for security and co operation in europe']
$

…and now you've got some choices to look through. They're not all great; even some of the OK results need some massaging. (E.g., the word list has "judith reeves" but her full name is "Judith Reeves-Stevens.")

There are some other excellent wordlists out there.

How to Use It

Get some text fodder. This program can read mediawiki dumps (e.g. wikipedia snapshots), Google Ngrams files, and plain text files.

Mediawiki

Wikipedia: Learn about Wikipedia downloads. You want a file with a name something like enwiki-20160204-pages-articles.xml; you'll download a compressed snapshot and uncompress it.

Wiktionary and wikiquote also have downloadable snapshots. Writing a bunch of puzzles on some theme nerds like? See if there's a wikia on the topic and if that wikia has a downloadable snapshot (which will be linked from http://whatever.wikia.com/wiki/Special:Statistics; you want "Current Pages".)

Put your wiki files in a directory.
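
If your snapshot arrived bzip2-compressed (Wikipedia dumps usually do), it can be stream-decompressed without holding the whole thing in memory. A sketch, assuming the example filename from the text; adjust paths to whatever you actually downloaded:

```python
# Stream-decompress a .bz2 wiki dump into the directory phraser will read.
# The filenames here are illustrative, not required by phraser.
import bz2
import shutil

def uncompress_dump(src, dst):
    """Decompress a bzip2 file at `src` to a plain file at `dst`."""
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)

# e.g. uncompress_dump("enwiki-20160204-pages-articles.xml.bz2",
#                      "dumpz/mediawiki/enwiki-20160204-pages-articles.xml")
```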

Google NGrams

Download many data files to a directory. This is a big data set; it might take several days to download. It might fill up your hard drive.

Run the ngram_winnow.py script in your download directory. If your hard drive is filling up, you might want to discard the ngram data files as this script generates "winnowed" versions of them.
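
For a sense of why winnowing shrinks the files so much: the Google Ngram v2 data files are TSV with one row per (ngram, year), of the form ngram TAB year TAB match_count TAB volume_count. This sketch is a hypothetical illustration of that kind of reduction, not the project's actual ngram_winnow.py; it collapses the per-year rows into one total per ngram:

```python
# Hypothetical illustration only -- NOT the project's ngram_winnow.py.
# Collapses Google Ngram v2 rows ("ngram\tyear\tmatch_count\tvolume_count")
# into a single total count per ngram.
import collections

def winnow(lines):
    """Sum match_count across years; return {ngram: total_count}."""
    totals = collections.Counter()
    for line in lines:
        ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
        totals[ngram] += int(match_count)
    return totals
```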

Text files

Find cool text files. Put them in a directory.

Build, Run

Have the Go build tools installed and a development environment set up.

Have the phraser source code.

Build it: go install github.com/lahosken/misc/phraser

Run it, telling it where to find those directories full of data files you downloaded:

phraser -wikipath dumpz/mediawiki/ -txtpath dumpz/textfiles/ -ngrampath dumpz/ngram/

Noisy logs are noisy, sorry.

In the first stage, it gathers files with "scores" for parts of the data into a tmp directory. In the last stage, it reads those intermediate files back in to generate one big count; during this stage, it uses a lot of memory.
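
Conceptually, the last stage is a merge. This sketch is an assumption about what such a merge looks like, not phraser's actual Go code; it supposes the intermediate files share the score-TAB-phrase format of the final output:

```python
# Conceptual sketch of a final merge stage (an assumption, not phraser's
# real implementation): sum per-shard scores for each phrase, then emit
# lines sorted with the highest combined score first.
import collections

def merge_scores(shards):
    """shards: iterable of iterables of 'score\\tphrase' lines."""
    totals = collections.Counter()
    for lines in shards:
        for line in lines:
            score_s, phrase = line.rstrip("\n").split("\t", 1)
            totals[phrase] += int(score_s)
    return ["%d\t%s" % (score, phrase)
            for phrase, score in totals.most_common()]
```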

When done, it writes a big output file in your home directory named Phrases_something_something.txt. You might use this to generate smaller files for day-to-day use, perhaps something like

$ cd
$ head -5000000 Phrases_20160625_082207.txt > phrases_5M.txt
$ grep -v ' ' Phrases_20160625_082207.txt | head -500000 > words_500K.txt
$

Notes on Data Sets

Wikipedia is pretty sweet at covering many topics. It's not a great source for "playful" language. Any wikipedia article saying "The emperor was as nervous as a long-tailed cat in a roomful of rocking chairs" would be swiftly edited by folks who (sensibly) didn't want readers to have to struggle with such a tangled idiom. And yet this language is good for puzzles.

Google Ngrams have a wide variety of language, probably wider than Wikipedia. Still, there are some quirks. One was obvious enough even for me to notice: it doesn't use contractions. Instead of "doesn't", it has "does not". The data ?seems? skewed towards what you might expect from many small books: more copyright notices and mentions of publishing companies than you might want.

I found text files with "playful" language. E.g., data files for "fortune cookie" programs. I'd like more. I'd sure like a collection of song lyrics and lines of poetry "weighted" by what percentage of the population knows/recognizes them. And/or a list of movie titles scored by how many folks know those titles. I'd like more idioms. Still, the stuff I've collected feels pretty good so far.
