Phraser generates a list of English phrases roughly ordered by suitability for word puzzles. It crunches through Wikipedia, Google Ngram data, and text files to find popular phrases.
Its output is a text file where each line is of the form score TAB phrase; the file is sorted such that high scores come first. "Score" roughly corresponds to how common a phrase is; there's some tweaking to boost clue-able things above grammatical glue (but there's still plenty of grammatical glue). Some excerpts from an output file:
142036 the
96614 of
92286 in
88057 to
…
40228 at
40053 are
38387 of the
38310 this
36233 have
…
3711 distinction
3711 evaluation
3709 1969
3709 counter
3709 de la
…
36 october or early
36 octothorpe
36 octree
36 ocular dominance columns
36 odal
…
7 화
7 후
7 ﬁ
7 ﬁre
7 ﬅ
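If you want to poke at the list from a script, the score TAB phrase format is easy to read. Here's a minimal sketch; the function name `read_phrases` is made up for illustration and isn't part of phraser.

```python
import io


def read_phrases(lines):
    """Yield (score, phrase) pairs from lines of the form score TAB phrase."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines, just in case
        score_text, _, phrase = line.partition("\t")
        yield int(score_text), phrase


# A few lines from the excerpt, standing in for a real output file.
sample = io.StringIO("142036\tthe\n38387\tof the\n36\tocular dominance columns\n")
entries = list(read_phrases(sample))
print(entries)
```

For a real file, you'd pass an open file object instead of the `io.StringIO` stand-in, e.g. `open(path, encoding="utf-8")`.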
You might wonder how such a list is useful for designing puzzles. Consider the following puzzle:
Here (SPOILER WARNING), the gimmick is that each partial answer contains the name of a digit. E.g., that Travis Tritt lyric ends "…outweigh the bad". The secret message is a seven-digit phone number, so the puzzle designer needed seven of these digit-containing phrases. Those aren't so easy to think of with your brain; but if you've compiled a big list of phrases, it's pretty easy to search. And if you've (roughly) sorted your list by puzzle-suitable-ness, then you shouldn't have to wade through too much trash to find the good stuff. You might slap together a quick-and-dirty little Python script……give it a whirl…
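A sketch of what such a script might look like (an illustration, not necessarily the script used for this puzzle). It assumes the score TAB phrase format described above. The names `hidden_digits` and `find_candidates` are made up, the sample scores are invented, and skipping phrases where the digit is already a whole word is my own guess at what a puzzle designer would want.

```python
import re

DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def hidden_digits(phrase):
    """Return digit names hidden in phrase, letting them span word breaks."""
    lowered = phrase.lower()
    words = re.findall(r"[a-z]+", lowered)
    squished = "".join(words)  # drop spaces, punctuation, non-a-z characters
    # Crude filter: ignore a digit name that already appears as a whole word.
    return [name for name in DIGIT_NAMES
            if name in squished and name not in words]


def find_candidates(lines, limit=10):
    """Collect the first `limit` phrases that hide a digit name.

    The list is sorted best-first, so stopping early keeps the good stuff.
    """
    found = []
    for line in lines:
        _score, _, phrase = line.rstrip("\n").partition("\t")
        digits = hidden_digits(phrase)
        if digits:
            found.append((phrase, digits))
            if len(found) >= limit:
                break
    return found


# Tiny stand-in for a real phrase file; these scores are invented.
sample = ["96614\tof\n", "3711\tdistinction\n",
          "50\toutweigh the bad\n", "36\tjudith reeves\n"]
candidates = find_candidates(sample)
for phrase, digits in candidates:
    print(phrase, digits)
```

To search a real list, pass an open file instead of `sample`, and raise `limit` to get more choices to look through.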
…and now you've got some choices to look through. They're not all great; even some of the OK results need some massaging. (E.g., the word list has "judith reeves" but her full name is "Judith Reeves-Stevens.")
There are some other excellent wordlists out there.
Get some text fodder. This program can read mediawiki dumps (e.g. wikipedia snapshots), Google Ngrams files, and plain text files.
Wikipedia: Learn about Wikipedia downloads. You want a file with a name something like enwiki-20160204-pages-articles.xml; you'll download something and uncompress it.
Wiktionary and wikiquote also have downloadable snapshots. Writing a bunch of puzzles on some theme nerds like? See if there's a wikia on the topic and if that wikia has a downloadable snapshot (which will be linked from http://whatever.wikia.com/wiki/Special:Statistics; you want "Current Pages".)
Put your wiki files in a directory.
Google Ngrams: Download the many data files to a directory. This is a big data set; it might take several days to download. It might fill up your hard drive.
Run the ngram_winnow.py script in your download directory. If your hard drive is filling up, you might want to discard the ngram data files as this script generates "winnowed" versions of them.
Find cool text files. Put them in a directory.
Have the Go build tools installed and a development environment set up.
Have the phraser source code.
Build it: go install github.com/lahosken/misc/phraser
Run it, telling it where to find those directories full of data files you downloaded:
phraser -wikipath dumpz/mediawiki/ -txtpath dumpz/textfiles/ -ngrampath dumpz/ngram/
Noisy logs are noisy, sorry.
In the first stage, it gathers files in a tmp directory with "scores" for parts of the data. In the last stage, it reads in those intermediate files to generate one big count; during this stage, it uses a lot of memory.
When done, it writes a big output file in your home directory, named Phrases_something_something.txt. You might use this to generate smaller files for day-to-day use, perhaps something like
$ cd
$ head -5000000 Phrases_20160625_082207.txt > phrases_5M.txt
$ grep -v ' ' Phrases_20160625_082207.txt | head -500000 > words_500K.txt
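If you don't have head and grep handy, roughly the same trimming can be sketched in Python. The function names here are made up for illustration. Like the grep command, `top_words` relies on score and phrase being separated by a TAB, so a line with no space character holds a single word.

```python
from itertools import islice


def top_phrases(lines, count):
    """Return the first `count` lines, like `head -count`."""
    return list(islice(lines, count))


def top_words(lines, count):
    """Return the first `count` lines containing no space character,
    like `grep -v ' ' | head -count`."""
    return list(islice((line for line in lines if " " not in line), count))


# Tiny stand-in for a real output file; these scores are invented.
sample = ["10\tof the\n", "9\tcat\n", "8\tin a\n", "7\tdog\n", "6\tbird\n"]
print(top_phrases(sample, 2))
print(top_words(sample, 2))
```

With a real file, you'd open it with `encoding="utf-8"`, pass the file object as `lines`, and write the returned lines out with `writelines`.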
Wikipedia is pretty sweet at covering many topics. It's not a great source for "playful" language. Any Wikipedia article saying "The emperor was as nervous as a long-tailed cat in a roomful of rocking chairs" would be swiftly edited by folks who (sensibly) didn't want readers to have to struggle with such a tangled idiom. And yet this language is good for puzzles.
Google Ngrams have a wide variety of language, probably wider than Wikipedia. Still, there are some quirks. One was obvious enough even for me to notice: it doesn't use contractions. Instead of "doesn't", it has "does not". The data seems (I'm not sure) skewed towards what you might expect from many small books: more copyright notices and mentions of publishing companies than you might want.
I found text files with "playful" language. E.g., data files for "fortune cookie" programs. I'd like more. I'd sure like a collection of song lyrics and lines of poetry "weighted" by what percentage of the population knows/recognizes them. And/or a list of movie titles scored by how many folks know those titles. I'd like more idioms. Still, the stuff I've collected feels pretty good so far.