phraser improvements

Phraser, the tool for generating word+phrase lists useful for solving+designing puzzles, is now smarter when reading crossword constructor dictionaries. Thus, hundreds of thousands of words+phrases got a big boost in the list based on their interesting-ness instead of their common-ness.

Also, you can now feed a Phraser-generated phrase list back into Phraser (presumably together with some other inputs).

Crossword Lists

Crossword-making dictionaries are awesome: they contain hundreds of thousands of words+phrases ranked based on their coolness as puzzle answers. Crossword-making dictionaries are aggravating: they leave out the spaces between words. Is that POWERSOFTEN "powers of ten", "powers often", or "power soften"? You probably know it's "powers of ten", but how do you explain that to a computer program?

Before, Phraser would give crossword dictionary entries a teeny tiny boost in its rankings. E.g., it'd say maybe there's a word "powersoften", let's give it an entry with a score of 3. And then when I ran the Phraser tool, "powersoften" was waaaaaay down in the list, lumped in with typos and misprints. And that was kind of appropriate, since "powersoften" isn't really a word.

Now, Phraser uses words+phrases it learns about elsewhere to figure out the spaces. Phraser doesn't just read crossword dictionaries. It reads Wikipedia, text files, Google Ngrams, …. So it already knows that "powers of ten" is a thing but "power soften" isn't, and now it uses that information. It reads wikis, text files, etc. first, saving the crossword dictionaries for last. When it sees POWERSOFTEN, it remembers that it saw "Powers of Ten" in Wikipedia, and knows what to do.
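Here's a rough sketch of that lookup idea in Python. It is not Phraser's actual code; the function names (squash, build_index, respace) and the toy scores are made up for illustration. The assumption is that every already-known phrase can be filed under its squashed-together, no-spaces spelling, so a crossword entry only has to look itself up.

```python
def squash(phrase):
    """Lowercase a phrase and keep only its letters, crossword-style."""
    return "".join(ch for ch in phrase.lower() if ch.isalpha())


def build_index(known_phrases):
    """Map each squashed spelling to the best-scoring phrase that produces it.

    known_phrases is a hypothetical {phrase: score} dict built from the
    sources read earlier (wikis, text files, and so on).
    """
    index = {}
    for phrase, score in known_phrases.items():
        key = squash(phrase)
        best = index.get(key)
        if best is None or score > best[1]:
            index[key] = (phrase, score)
    return index


def respace(entry, index):
    """Return the spaced phrase for a crossword entry, or None if unknown."""
    hit = index.get(squash(entry))
    return hit[0] if hit else None


# Toy data: the scores are invented, not real Phraser numbers.
known = {"powers of ten": 500, "power soften": 1}
index = build_index(known)
print(respace("POWERSOFTEN", index))
print(respace("YESSIREEBOB", index))
```

With the toy data above, POWERSOFTEN resolves to "powers of ten" because that phrase outscores "power soften", while YESSIREEBOB comes back empty-handed since nothing in the index squashes down to it.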

This doesn't catch everything. It can boost an uncommon-but-cool phrase up from the depths of the list to the top; but it can't use crossword dictionaries to "discover" a phrase that isn't in other sources. E.g., I'm looking at a crossword dictionary entry YESSIREEBOB. Alas, Phraser doesn't know the phrase "yessiree Bob" at all; so it doesn't boost that phrase's score even though a crossword puzzle writer really likes the phrase.

The new behavior is better than the old. There's still room for improvement though.

Feed Output back in as Input, "prebaked" files

I like to run Phraser to refresh the phrase and word lists at least once a year. After all, the world keeps changing. I download fresh copies of Wikipedia and other things.

I don't download a fresh copy of Project Gutenberg, the online collection of tens of thousands of books available as text files. I got ahold of a copy of Project Gutenberg's books from Gutenberg Dammit, a project that cleaned the texts for use with computers learning natural language. Gutenberg Dammit is a one-time-effort thing (I guess); there's no fresh version coming along next year. It was built on top of another one-time-effort thing; there's no fresh version of that coming along next year either. It takes more than twelve hours for my doughty desktop machine to count the words+phrases in Project Gutenberg's tens of thousands of books. So…once a year, my desktop machine is really busy for more than half a day carefully re-counting numbers that haven't changed since last year.

Google goes several years between updates to their Google Ngrams data. So… once a year my desktop machine is really busy for another couple of hours carefully re-counting more numbers that haven't changed since last year.

I tweaked Phraser so I can feed previously generated phrase lists in as input for generating a new phrase list. I ran Phraser once on Project Gutenberg's tens of thousands of books to generate a phrase list. I ran Phraser once on Google Ngram files to generate another phrase list. In future years, I can re-use those "prebaked" phrase lists and save my desktop machine hours of effort.
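To make the "prebaked" idea concrete, here's a small Python sketch. Everything specific in it is an assumption: I'm pretending a phrase list is a text file with one "phrase<TAB>score" pair per line, and that combining sources means adding up scores. Phraser's real file format and scoring may well differ. The helper names read_phrase_list and merge are made up.

```python
from collections import Counter
import io


def read_phrase_list(lines):
    """Read 'phrase<TAB>score' lines (an assumed format) into a Counter."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        phrase, _, score = line.rpartition("\t")
        counts[phrase] += int(score)
    return counts


def merge(*sources):
    """Add up scores from any number of phrase-count sources."""
    total = Counter()
    for source in sources:
        total.update(source)  # Counter.update adds counts
    return total


# Stand-in for a prebaked file saved by an earlier run (invented numbers).
prebaked = io.StringIO("powers of ten\t12\nyes sir\t30\n")
# Stand-in for counts freshly gathered this year (invented numbers).
fresh = Counter({"powers of ten": 5})

merged = merge(read_phrase_list(prebaked), fresh)
print(merged.most_common())
```

The point of the sketch: reading a prebaked list is just parsing a smallish file, which is far cheaper than re-counting every word in the original corpus.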

Crosswords + Prebaked

I mentioned YESSIREEBOB, a phrase beloved by crossword constructors but which Phraser doesn't know. There are a lot of these phrases that a human could look at and say "well, the space goes between the EE and the BOB". To help Phraser out, I figured out the spaces in thousands of crossword-beloved phrases. Then I made a "prebaked" file with those space-ified phrases so that Phraser will "know" those phrases in future years.

I figured out where to put the spaces in the phrases with a mixture of computing and human-ing. As a first pass, I wrote a computer program to make a good guess. Then I, a human, looked over the output of that program to catch the mistakes. The computer program was pretty good, but it made mistakes. E.g., it looked at BOXINGROUND and came up with "box in ground", but I'm pretty sure the crossword people meant "boxing round."
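For the curious, here's one common way a first-pass guesser like that can work, sketched in Python. This is an illustration and not the program I actually used: it gives each known word a cost (rarer words cost more) and uses dynamic programming to find the cheapest way to chop the letters into known words. The function name best_split and the toy costs are invented, and the costs are rigged on purpose to reproduce the BOXINGROUND kind of mistake.

```python
def best_split(letters, cost):
    """Return the cheapest split of letters into known words, or None.

    cost is a hypothetical {word: cost} dict; lower cost means more common.
    """
    letters = letters.lower()
    n = len(letters)
    # best[i] holds (total cost, words) for the cheapest split of letters[:i]
    best = [None] * (n + 1)
    best[0] = (0, [])
    for end in range(1, n + 1):
        for start in range(end):
            word = letters[start:end]
            if best[start] is None or word not in cost:
                continue
            candidate = (best[start][0] + cost[word], best[start][1] + [word])
            if best[end] is None or candidate[0] < best[end][0]:
                best[end] = candidate
    return " ".join(best[n][1]) if best[n] else None


# Toy costs, invented for this example: three cheap common words
# beat two pricier ones.
toy_cost = {"box": 3, "in": 1, "ground": 3, "boxing": 6, "round": 4}
print(best_split("BOXINGROUND", toy_cost))
```

With these toy costs, "box in ground" totals 7 and "boxing round" totals 10, so the program confidently picks the wrong one. That's exactly why a human pass over the output is still needed.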

Maybe some day, I'll come back to this and figure out the spaces in more phrases that crossword constructors like. After looking at thousands of phrases this go-around, my brain was kind of fried.

Use less memory

Historically, I was pretty blasé about how much memory Phraser used. It used a lot of memory. I'd set it going and let it run overnight while I slept. Phraser would use all my computer's memory, so my computer wasn't usable; but that was OK, I don't use my computer when I'm sleeping.

In recent months, the Linux OS has been more aggressive about halting programs that are using too much memory. So I'd start Phraser running, go to bed… and when I woke up the next morning instead of a phrase list, I'd just see an error message.

Thus, I could no longer be blasé about using all the memory. I buckled down, figured out what was using so much memory and made some changes. Now I can run Phraser overnight and when I wake up, there's a happy phrase list instead of a sad error message. (And maybe I run Phraser during the day now; Phraser only uses about ⅓ of my machine's memory now, so I can keep using that machine to futz around with other stuff.)

Words like Blasé

Some puzzles pretend that Pokémon is spelled Pokemon; others don't. This can cause problems, as when crossword writers tell you that the Spanish for "year" is ANO but that's really AÑO; meanwhile, ANO is pretty rude. But if you insist on accent marks in your puzzle answers, you're in for disappointment, so Phraser now compensates. Now when Phraser sees a word with accents like Pokémon, it adds both "pokémon" and "pokemon" to its list.
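A minimal Python sketch of the accent trick, assuming the standard-library unicodedata module does the heavy lifting. The function names strip_accents and spellings are made up, and this is not Phraser's own code. The idea: decompose each accented letter into a base letter plus a combining mark, then throw away the marks.

```python
import unicodedata


def strip_accents(word):
    """Remove combining accent marks, e.g. turn 'é' into 'e'."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def spellings(word):
    """Return the accented spelling plus the plain one, if they differ."""
    word = unicodedata.normalize("NFC", word.lower())
    plain = strip_accents(word)
    return [word] if plain == word else [word, plain]


print(spellings("Pokémon"))  # ['pokémon', 'pokemon']
print(spellings("cat"))      # ['cat']
```

One caveat: this only handles letters that Unicode decomposes into a base letter plus a mark. Letters like "ø" or "ß" don't decompose that way, so they would pass through unchanged.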