New:

I have updated the Phraser word and phrase lists. Those of you who find these lists handy for solving/designing word puzzles, rejoice!

This update incorporates an epiphany! (It also has updated content from Wikipedia, etc, but you already expected that.) tl;dr I fixed many many mistakes, and I like the quality improvement. If you want the details, read on…

You may recall a quandary: Crossword constructors have hand-crafted lists of cool phrases, idioms, and such. I can download a few of these lists, and Phraser can see that SHORTANDSTOUT is a nifty phrase. Crossword constructors don't care about spaces in phrases; but I'd like to know where the spaces go.

So Phraser first tries to figure out a list of phrases that have appeared in text. It reads lots of text sources: Wikipedia, text files from Project Gutenberg, etc etc. But it doesn't realize that "short and stout" is a more interesting phrase than "copyright 1995", which appears more often.

So it then goes through the crossword lists, notices that crossword constructors think SHORTANDSTOUT is cool, notices that "short and stout" is SHORTANDSTOUT with spaces, and boosts the score of "short and stout".
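Here's roughly what that boosting step looks like, as a minimal sketch and not Phraser's actual code. The names despace and boost_scores, the made-up scores, and the multiply-by-2 boost are all assumptions for illustration.

```python
def despace(phrase):
    """Turn "short and stout" into "SHORTANDSTOUT", the crossword-list form."""
    return "".join(ch for ch in phrase if ch.isalnum()).upper()


def boost_scores(phrase_scores, crossword_entries, boost=2.0):
    """Return a copy of phrase_scores with crossword-approved phrases boosted.

    phrase_scores maps a spaced phrase to a score (hypothetical numbers here).
    crossword_entries is an iterable of despaced entries like "SHORTANDSTOUT".
    """
    entries = set(crossword_entries)
    return {
        phrase: (score * boost if despace(phrase) in entries else score)
        for phrase, score in phrase_scores.items()
    }


# Made-up scores: the boring phrase shows up more often in text.
scores = {"short and stout": 40, "copyright 1995": 900}
boosted = boost_scores(scores, ["SHORTANDSTOUT"])
print(boosted)
```

Only phrases whose despaced form matches a crossword entry get the boost; everything else keeps its text-frequency score.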

But what if Phraser never figured out that "short and stout" is a thing? Maybe it figured out that the phrase "short and" and the word "stout" are things, but never saw that teapot song and thus never realized that "short and stout" goes together. In that case, when it sees SHORTANDSTOUT in a crossword list, it just kinda shrugs, thinks "I don't know what to do with that" and moves on. What a waste.

One time, I wrote a program that looked over my crossword lists for SHORTANDSTOUTs for which Phraser couldn't figure out where to put spaces. For each, it would look for a pair of phrases that could be combined. So if Phraser had spotted "short and" and the word "stout", this other program would spot that those phrases could be combined to make "short and stout". I kept the output from that program, and fed it to Phraser on subsequent runs. Thus, it would know "short and stout" was a thing the next time it ran; and when it saw SHORTANDSTOUT in a crossword list, it would know to boost the score of "short and stout". My phrase-combiner program didn't get it right every time. Like, if a crossword constructor likes the very-obscure word BLUNGE, my phrase-combiner program would guess that must be "B LUNGE". But it was right most of the time, and the results were good enough that I kept using it.
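The pair-combining idea can be sketched like so. This is a simplified guess at the approach, not the real program: combine_guess and the tiny known table are hypothetical, and it just takes the first two-piece split it finds.

```python
def combine_guess(entry, known):
    """Guess spacing for a despaced entry by splitting it into two known pieces.

    known maps a despaced form to its spaced phrase,
    e.g. "SHORTAND" -> "short and". Returns the first split found, or None.
    """
    for i in range(1, len(entry)):
        left, right = entry[:i], entry[i:]
        if left in known and right in known:
            return known[left] + " " + known[right]
    return None


# Hypothetical stand-in for the phrases Phraser already knows about.
known = {"SHORTAND": "short and", "STOUT": "stout", "B": "b", "LUNGE": "lunge"}

good_guess = combine_guess("SHORTANDSTOUT", known)
bad_guess = combine_guess("BLUNGE", known)
print(good_guess)
print(bad_guess)
```

The second call shows the failure mode: nothing stops two perfectly good pieces from combining into a phrase nobody has ever said.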

A few days ago, I was looking at one of the wrong phrases that my phrase-combiner program had come up with:
diuretic ally
That's not a thing. "Diuretically" is a word, sort of. You can look at it and figure out it's an adverb to describe something acting in the manner of a diuretic, I guess. It's reeeeeeeally rare, though. If you look up diuretical and diuretically on the Google ngram viewer, you can see that they show up not-quite-never in books. And diuretically appears so very rarely in texts that Phraser figured "aw, that's probably just a typo" and forgot about it. But crossword lists agree that "DIURETICALLY" is a kinda-important thing. So my phrase-combiner, trying its best, had come up with diuretic ally. And a couple of days ago, I was staring at that and wondering: OK, why do all these crossword lists think that "diuretically" is a good thing to put into a crossword, given that nobody uses this word in real life? That's when I had the epiphany.

The epiphany: DIURETICALLY is a valid Scrabble word. I looked through my crossword word lists and saw a fair number of words that only Scrabble players use. As near as I can tell, crossword constructors are pretty forgiving about Scrabble words that nobody uses but are figure-out-able. (They're not so forgiving about obscure scientific terms; there's no obvious way for a solver to figure out the name of a rare sheep disease by applying grammar-suffixes to a common word, I guess.)

So I hauled out a SOWPODS list (list of Scrabble words), looked through the list of best-guess-phrases from my phrase-combiner tool, and thus found many more of my mistakes like diuretic ally. And I purged them. I'm now much more confident in the surviving best-guess-phrases; so I increased their "boost" so that they're more likely to appear in the 5-million-phrases file.
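The purge boils down to something like this sketch. The function name purge_scrabble_words is made up, and the two-word list here is a tiny stand-in for a real Scrabble word list, which I'm assuming you'd load from a file with one word per line.

```python
def purge_scrabble_words(guesses, scrabble_words):
    """Drop any guessed phrase that, with its spaces removed, is a Scrabble word.

    If "diuretic ally" despaces to a valid single word, the two-piece
    split was almost certainly a wrong guess.
    """
    valid = {word.strip().upper() for word in scrabble_words}
    return [g for g in guesses if g.replace(" ", "").upper() not in valid]


# Stand-in data; a real run would use the full word list and combiner output.
guesses = ["diuretic ally", "short and stout"]
kept = purge_scrabble_words(guesses, ["DIURETICALLY", "STOUT"])
print(kept)
```

Note that "short and stout" survives even though STOUT is on the word list, because the check is against the whole despaced phrase, not its pieces.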

Tags: words puzzle scene

lahosken@gmail.com
