Downloads:
Phraser generates a long list of English phrases and words, roughly ordered by suitability for word puzzles. Thus, if you have tools handy for searching text files, you can use them to find likely puzzle answers by searching this phrase list. Phraser generates its list by analyzing big piles of text.
Ways to Use It
A What the What List?
Phraser's output is a text file where each line is of the form score TAB phrase; the file is sorted such that high scores come first. "Score" roughly corresponds to how common a phrase is; there's some tweaking to boost clue-able things. Some excerpts from an output file:
142036	the
96614	of
92286	in
88057	to
…
40228	at
40053	are
38387	of the
38310	this
36233	have
…
3711	distinction
3711	evaluation
3709	1969
3709	counter
3709	de la
…
36	october or early
36	octothorpe
36	octree
36	ocular dominance columns
36	odal
…
7	화
7	후
7	fi
7	fire
7	ſt
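Since each line is just a score, a tab, and a phrase, with the best lines first, reading the file back into a program takes only a few lines. A minimal sketch (the function name and file handling are illustrative, not part of phraser):

```python
# Minimal sketch: read a phraser output file into (score, phrase) pairs.
# The file is already sorted best-first, so the list comes back in that order.
def load_phrases(path, limit=None):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            score_s, phrase = line.rstrip("\n").split("\t", 1)
            pairs.append((int(score_s), phrase))
            if limit is not None and len(pairs) >= limit:
                break
    return pairs
```

E.g., `load_phrases("phrases_5M.txt", limit=1000)` would grab the thousand "best" entries.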
You might wonder how such a list is useful for designing puzzles. Consider the following puzzle:
Musical Numbers
Get a secret message out of this:
Travis Tritt: ♫ Doesn't the Good ________ ___ ___? ♫ | _ _ _ _ _ _ _ _ _ _ _ _ _ _? |
He defined the dogma of the Immaculate Conception | _ _ _ _ _ _ |
Minnesota governor most likely to win a cage match | _ _ _ _ _ _ _ _ _ _ _ _ |
♫ __ _ _______ in the slipstream/Between the viaducts of your dream ♫ | _ _ _ _ _ _ _ _ _ _ _ |
Prime Directive novelist | _ _ _ _ _ _ _ _ _ _ _ _-_ _ _ _ _ _ _ |
Songs from Lonely Avenue band | _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
a tyro | _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
Here (SPOILER WARNING), the gimmick is that each partial answer contains the name of a digit. E.g., that Travis Tritt lyric ends "…outweigh the bad". The secret message is a seven-digit phone number, so the puzzle designer needed seven of these digit-containing phrases. Those aren't so easy to think of with your brain; but if you've compiled a big list of phrases, it's pretty easy to search. And if you've (roughly) sorted your list by puzzle-suitable-ness, then you shouldn't have to wade through too much trash to find the good stuff. You might slap together a quick-and-dirty little Python script…
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

found = {}
for d in DIGITS:
    found[d] = []

for line in open("phrases_5M.txt"):
    score_s, phrase = line.strip().split("\t")
    spaceless = phrase.replace(" ", "")
    for d in DIGITS:
        if d not in spaceless:
            continue  # find jeSSE VENtura...
        if d in phrase:
            continue  # ...but not SEVENties
        # found "pope piuS IX"? discard "piuS IX"
        found[d] = [f for f in found[d] if f not in phrase]
        found[d].append(phrase)

for d in DIGITS:
    print(d, found[d][:10])  # for each digit, show up to 10 "best" phrases
$ python sample_digits.py
zero ['fertilizer on', 'organizer of the', 'buzzer on', 'brian setzer orchestra', 'popularizer of', 'appetizer or', 'eliezer of damascus', 'gaze roamed']
one ['back to new york', 'carbon emissions', 'what to do next', 'to new york in', 'also never', 'peace on earth', 'to newport', 'on either side of the', 'restriction enzymes', 'conservation efforts']
two ['that would have been', 'it would seem that', 'it would appear that', 'that would not be', 'gunshot wounds', 'it would be more', 'in the ancient world', 'and it would be', 'as it would be', 'do not worry about']
three ['with reeds', 'with reed', 'judith reeves']
four ['days of our lives', 'of our god', 'of our times', 'the history of our', 'the nature of our', 'the time of our', 'the results of our', 'of our world', 'of our existence', 'the beginning of our']
five ['if ive', 'if i venture', 'fi ve']
six ['pope pius ix', 'louis ix of france', 'charles ix of france', 'king louis ix', 'acknowledgments ix introduction', 'of pius ix', 'charles ix of sweden', 'king charles ix', 'ansi x3', 'contents acknowledgments ix']
seven ['it was eventually', 'sports events', 'and perhaps even', 'and sometimes even', 'and was eventually', 'jesse ventura', 'towards evening', 'this event was', 'it is even more', 'news events']
eight ['weigh them', 'to weigh the', 'outweigh the benefits', 'outweigh the costs', 'outweigh the risks', 'weigh the pros', 'outweigh their', 'weigh the benefits', 'weigh the risks', 'outweigh the disadvantages']
nine ['an inexperienced', 'an inexplicable', 'an ineffective', 'an inexorable', 'than in england', 'an inexpressible', 'an ineffectual', 'an inestimable', 'education in england', 'for security and co operation in europe']
$
…and now you've got some choices to look through. They're not all great; even some of the OK results need some massaging. (E.g., the word list has "judith reeves" but her full name is "Judith Reeves-Stevens.")
There are some other excellent wordlists out there.
Get some text fodder. This program can read mediawiki dumps (e.g. wikipedia snapshots), Google Ngrams files, plain text files, and Crossword Compiler word-list files.
You can use phrase lists output by previous runs of phraser as input to a new run of phraser.
Wikipedia:
Learn about Wikipedia downloads. You want a file with a name something like enwiki-20160204-pages-articles.xml.bz2; you'll download and uncompress it.
Wiktionary, Wikiquote, and Wikisource also have downloadable snapshots. Writing a bunch of puzzles on some theme nerds like? See if there's a fandom.com wiki on the topic and if it has a downloadable snapshot (which will be linked from http://whatever.fandom.com/wiki/Special:Statistics; you want "Current Pages").
Put your wiki files in a directory.
Low-effort way to use the 1- and 2- grams: Grab the Ngrams item from the "Prebaked" section of the Downloads list at the top of this page.
High-effort way to do something else:
Download many data files to a directory. This is a big data set; it might take several days to download. It might fill up your hard drive.
Run the ngram_winnow.py script in your download directory. If your hard drive is filling up, you might want to discard the ngram data files as this script generates "winnowed" versions of them.
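For reference, the Google Ngrams data files are tab-separated, with one line per ngram per year (the four-column layout below matches the published v2 format). A winnowing pass presumably collapses those per-year lines into one total per ngram; a rough sketch of that idea (function name invented; this is not ngram_winnow.py's actual code):

```python
# Sketch: collapse per-year Google Ngrams lines into one total per ngram.
# Assumed layout (v2 data files): ngram \t year \t match_count \t volume_count
from collections import defaultdict

def collapse_years(lines):
    totals = defaultdict(int)
    for line in lines:
        ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
        totals[ngram] += int(match_count)
    return dict(totals)

sample = [
    "puzzle\t1990\t100\t40",
    "puzzle\t1991\t150\t50",
    "phrase\t1990\t75\t30",
]
print(collapse_years(sample))  # {'puzzle': 250, 'phrase': 75}
```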
Find cool text files. Put them in a directory.
Aside: If you're thinking "I'd like a lot of text files," you might think "I'll grab Project Gutenberg's tens of thousands of books." Instead of putting all those into your text-files directory, look up in the "Downloads" list at the top of this page, grab the "Project Gutenberg/100" file, and put it in a directory of "prebaked" files.
Have Go build tools installed and development environment set up.
Have the phraser source code.
Build it: go install github.com/lahosken/misc/phraser
Run it, telling it where to find those directories full of data files you downloaded:
phraser -wikipath dumpz/mediawiki/ -txtpath dumpz/textfiles/ -ngrampath dumpz/ngram/ -xwdpath dumpz/xwd_lists/
Noisy logs are noisy, sorry.
In the first stage, it gathers files in a tmp directory with "scores" for parts of the data. In the last stage, it reads in those intermediate files to generate one big count; during this stage, it uses a lot of memory.
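The shape of that two-stage pipeline, sketched in Python (this illustrates the gather-then-merge idea only; phraser's actual Go code surely differs):

```python
# Illustration of the two-stage shape: stage one scores chunks of data into
# small intermediate files; stage two merges them into one big count.
from collections import Counter
import json, os

def stage_one(chunks, tmpdir):
    """Write per-chunk phrase counts to tmpdir; return the file paths."""
    paths = []
    for i, chunk in enumerate(chunks):
        path = os.path.join(tmpdir, "part%d.json" % i)
        with open(path, "w") as f:
            json.dump(Counter(chunk), f)  # phrase -> count for this chunk
        paths.append(path)
    return paths

def stage_two(paths):
    """Merge the intermediate counts; this is where memory use peaks."""
    total = Counter()
    for path in paths:
        with open(path) as f:
            total.update(json.load(f))  # add this chunk's counts to the total
    return total
```

The intermediate files mean stage one never needs the whole count in memory at once; only the final merge does.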
When done, it generates a big output file in your home directory, named something like Phrases_something_something.txt.
You might use this to generate smaller files for day-to-day use,
perhaps something like
$ cd
$ head -5000000 Phrases_20160625_082207.txt > phrases_5M.txt
$ grep -v ' ' Phrases_20160625_082207.txt | head -500000 > words_500K.txt
$
Wikipedia is pretty sweet at covering many topics. It's not a great source for "playful" language. Any wikipedia article saying "The emperor was as nervous as a long-tailed cat in a roomful of rocking chairs" would be swiftly edited by folks who (sensibly) didn't want readers to have to struggle with such a tangled idiom. Playful language is good for puzzles, though.
Google Ngrams have a wide variety of language, probably wider than Wikipedia's. But the data has some not-so-great quirks. It seems skewed towards what you might expect from many small books: more copyright notices and mentions of publishing companies than you might want. Scanning mistakes abound; you'll find plenty of "words" like doesn'tmatter.
Project Gutenberg has the text of tens of thousands of books; wow, a great source of text. But 99+% of these books are more than a hundred years old, so they're not useful for suggesting puzzle answers that are, e.g., Pokemon names.
Crossword word-lists don't use spaces consistently. E.g., one public list has entries for ANCIENTALIENS and Big Bang Theory. That ANCIENTALIENS should be ANCIENT ALIENS… but of course, for a crossword constructor, it's fine the way it is. phraser uses a haphazard workaround: when processing a crossword word-list, if it sees a "word" with more than five letters and no spaces, it doesn't "boost" that maybe-word very much.
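That workaround might look something like this in Python (the more-than-five-letters threshold is from the description above; the boost values themselves are invented for illustration):

```python
# Illustration of the heuristic above: a long entry with no spaces might be a
# smashed-together phrase (ANCIENTALIENS), so it gets only a small boost.
FULL_BOOST = 10   # invented value, for illustration
SMALL_BOOST = 1   # invented value, for illustration

def boost_for(entry):
    if len(entry) > 5 and " " not in entry:
        return SMALL_BOOST   # maybe-word: could be a spaceless phrase
    return FULL_BOOST        # short words and spaced phrases get full credit

print(boost_for("ANCIENTALIENS"))    # 1
print(boost_for("Big Bang Theory"))  # 10
print(boost_for("EAGLE"))            # 10 (exactly five letters, so trusted)
```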
I found text files with "playful" language. E.g., data files for "fortune cookie" programs. I'd like more. I'd sure like a collection of song lyrics and lines of poetry "weighted" by what percentage of the population knows/recognizes them. And/or a list of movie titles scored by how many folks know those titles. I'd like more idioms. Still, the stuff I've collected feels pretty good so far.