I saw mention of another movie database. I already knew about IMDb, a pretty-good example acquired by Amazon some years back. New to me: TMDb, The Movie Database. I previously figured out how to use IMDb's data to improve phraser's phrase list with lots of movie titles and movie-people names. Now I figured out how to do it again with TMDb, hopefully thus getting a broader list. (Maybe more international?) And maybe if Amazon ever goes on a cost-cutting rampage and discards IMDb, I might have a fall-back plan. Anyhow, this data should trickle in next time I update the phraser lists; I'm trying the new data at home now to make sure it doesn't ruin everything.
updated phraser again
While investigating the question "Why doesn't phraser know about [[Redacted MIT Mystery Hunt puzzle solution Redacted]]?" I found a bug: When reading Wikipedia, if there was an absurdly long paragraph, phraser thought it had reached the end of Wikipedia. Back in 2016 when I was writing phraser and looking carefully for problems, Wikipedia didn't have any absurdly long paragraphs, so everything seemed to be working fine. In the intervening years, alas, that changed. I was no longer looking carefully for problems and, alas, didn't notice.
I fixed that bug, yay. (Less noticeably, I fixed another bug, and thus did something to help phraser find [[Redacted MIT Mystery Hunt puzzle solution Redacted]]. When considering Wikipedia cross-references to, say, Critique of Pure Reason, I was counting "critique of", "of pure", and "pure reason" more than I meant to.)
Anyhow, you might want to download the latest phrase and word lists from the appropriate page. If you run phraser yourself, this would be a good time to refresh and pick up the latest code.
I have updated the Phraser word and phrase lists. Those of you who find these lists handy for solving/designing word puzzles, rejoice!
This update incorporates an epiphany! (It also has updated content from Wikipedia, etc, but you already expected that.) tl;dr I fixed many many mistakes, and I like the quality improvement. If you want the details, read on…
You may recall a quandary: Crossword constructors have hand-crafted lists of cool phrases, idioms, and such. I can download a few of these lists, and Phraser can see that SHORTANDSTOUT is a nifty phrase. Crossword constructors don't care about spaces in phrases; but I'd like to know where the spaces go.
So Phraser first tries to figure out a list of phrases that have appeared in text. It reads lots of text-sources: Wikipedia, text files from Project Gutenberg, etc etc. But it doesn't realize that "short and stout" is a more-interesting phrase than "copyright 1995", which appears more often.
So it then goes through the crossword lists, notices that crossword constructors think SHORTANDSTOUT is cool, notices that "short and stout" is SHORTANDSTOUT with spaces, and boosts the score of "short and stout".
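Here's a rough sketch of that boosting step, in Python. It isn't Phraser's actual code; the function name, the scores, and the boost factor are all made up for illustration. The idea is just: strip the spaces out of each known phrase, and if the result matches a crossword entry, bump that phrase's score.

```python
from collections import defaultdict


def boost_from_crossword_list(phrase_scores, crossword_entries, boost=2.0):
    """Return (boosted_scores, unmatched_entries).

    phrase_scores: dict mapping a spaced phrase to a score (hypothetical format).
    crossword_entries: iterable of despaced entries such as "SHORTANDSTOUT".
    boost: made-up multiplier applied once per matching crossword entry.
    """
    # Index every known phrase by its despaced, uppercased form.
    by_key = defaultdict(list)
    for phrase in phrase_scores:
        by_key[phrase.replace(" ", "").upper()].append(phrase)

    boosted = dict(phrase_scores)
    unmatched = []
    for entry in crossword_entries:
        matches = by_key.get(entry.upper())
        if not matches:
            # No known phrase despaces to this entry.
            unmatched.append(entry)
            continue
        for phrase in matches:
            boosted[phrase] *= boost
    return boosted, unmatched


# Made-up scores: the boring phrase starts out ahead because it's more frequent.
scores = {"short and stout": 3.0, "copyright 1995": 40.0, "short and": 9.0}
boosted, unmatched = boost_from_crossword_list(scores, ["SHORTANDSTOUT", "BLUNGE"])
```

With those made-up numbers, "short and stout" goes from 3.0 to 6.0, "copyright 1995" stays put, and BLUNGE ends up in the unmatched pile.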
But what if Phraser never figured out that "short and stout" is a thing? Maybe it figured out the phrase "short and" and the word "stout" are things, but never sees that teapot song and thus never realizes that "short and stout" goes together. In that case, when it sees SHORTANDSTOUT in a crossword list, it just kinda shrugs, thinks "I don't know what to do with that" and moves on. What a waste.
One time, I wrote a program that looked over my crossword lists for SHORTANDSTOUTs for which Phraser couldn't figure out where to put spaces. For each, it would look for a pair of phrases that could be combined. So if Phraser had spotted "short and" and the word "stout", this other program would spot that those phrases could be combined to make "short and stout". I kept the output from that program, and fed it to Phraser on subsequent runs. Thus, it would know "short and stout" was a thing the next time it ran; and when it saw SHORTANDSTOUT in a crossword list, it would know to boost the score of "short and stout". My phrase-combiner program didn't get it right every time. Like, if a crossword constructor likes the very-obscure word BLUNGE, my phrase-combiner program would guess that must be "B LUNGE". But it was right most of the time, and the results were good enough that I kept using it.
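A minimal sketch of the phrase-combining idea, assuming the known phrases fit in memory as a plain list. The function name and the tiny sample list are invented here, and the real program surely differs in its details. It tries every split point in the despaced entry and keeps the splits where both halves are already-known phrases.

```python
def combine_candidates(entry, known_phrases):
    """Return every way to write entry as two known phrases joined by a space."""
    # Index known phrases by their despaced, lowercased form.
    by_key = {}
    for phrase in known_phrases:
        by_key.setdefault(phrase.replace(" ", "").lower(), []).append(phrase)

    entry = entry.lower()
    results = []
    # Try each split point; both halves must be known phrases.
    for i in range(1, len(entry)):
        for left in by_key.get(entry[:i], []):
            for right in by_key.get(entry[i:], []):
                results.append(f"{left} {right}")
    return results


# Invented sample of phrases and words that might already be known.
known = ["short and", "stout", "diuretic", "ally", "b", "lunge"]
good_guess = combine_candidates("SHORTANDSTOUT", known)
bad_guess = combine_candidates("BLUNGE", known)
```

Here good_guess comes out as ['short and stout'] and bad_guess as ['b lunge']. The sketch has no way to tell the good guess from the bad one; it only knows that both halves exist.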
A few days ago, I was looking at one of the wrong phrases that my phrase-combiner program had come up with:
diuretic ally
That's not a thing. "Diuretically" is a word, sort of. You can look at it and figure out it's an adverb to describe something acting in the manner of a diuretic, I guess. It's reeeeeeeally rare, though. If you look up diuretical and diuretically on the Google Ngram Viewer, you can see that they show up not-quite-never in books. And diuretically appears so very rarely in texts that Phraser figured "aw, that's probably just a typo" and forgot about it.
But crossword lists agree that "DIURETICALLY" is a kinda-important thing. So my phrase-combiner, trying its best, had come up with diuretic ally. And a couple of days ago, I was staring at that and wondering: OK, why do all these crossword lists think that "diuretically" is a good thing to put into a crossword, given that nobody uses this word in real life? That's when I had the epiphany.
The epiphany: DIURETICALLY is a valid Scrabble word. I looked through my crossword-word-lists and saw a fair number of words-only-Scrabble-players use. As near as I can tell, crossword constructors are pretty forgiving about Scrabble words that nobody uses but are figure-out-able. (They're not so forgiving about obscure scientific terms; there's no obvious way for a solver to figure out the name of a rare sheep disease by applying grammar-suffixes to a common word, I guess.)
So I hauled out a SOWPODS list (list of Scrabble words), looked through the list of best-guess-phrases from my phrase-combiner tool, and thus found many more of my mistakes like diuretic ally. And I purged them. I'm now much more confident in the surviving best-guess-phrases; so I increased their "boost" so that they're more likely to appear in the 5-million-phrases file.
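The purge itself can be sketched in a few lines of Python. This is an illustration under assumptions, not the code I ran: the function name is invented, and the three-word list below is a tiny stand-in for a real SOWPODS file. A guessed phrase gets thrown out if, with its spaces removed, it's a single valid Scrabble word.

```python
def purge_scrabble_words(guessed_phrases, scrabble_words):
    """Split guesses into (kept, purged) by checking their despaced form."""
    valid = {word.strip().lower() for word in scrabble_words}
    kept, purged = [], []
    for phrase in guessed_phrases:
        if phrase.replace(" ", "").lower() in valid:
            # The "phrase" is really one Scrabble word with a stray space.
            purged.append(phrase)
        else:
            kept.append(phrase)
    return kept, purged


# Stand-in data; a real run would read the word list from a file.
guesses = ["diuretic ally", "short and stout", "b lunge"]
word_list = ["DIURETICALLY", "BLUNGE", "STOUT"]
kept, purged = purge_scrabble_words(guesses, word_list)
```

With the stand-in data, "short and stout" survives, while "diuretic ally" and "b lunge" get purged.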