Larry Hosken: New: Tag: words

rephrased Phraser word+phrase lists

I updated the scored word and phrase lists over at the phraser page, using data from recent copies of Wikipedia and other wikis.

Soon after I updated them, I saw that my over-enthusiastic tool that guesses at the spacing of crossword puzzle dictionaries had messed up, thinking that "pre examines" and "sniffing ly" are each two-word phrases. That'll be fixed… next year, along with several more pre- and -ly words.

Uhm, yeah, I guess there's not a big difference between these lists and those I got when I was tinkering with this stuff back in September. But I'm in the habit of refreshing these in the weeks before Mystery Hunt, and without our habits what are we but beasts of chaos?

[Geordi La Forge meme. Geordi rejects: Unfashionably obsolete September phrases. Geordi respects: Fresh lovely December phrases]

Permalink
& Comments

Surfwords is an intense word game. I'm enjoying it so far… in short doses, because it's intense.

Permalink
& Comments

phraser improvements

Phraser, the tool for generating word+phrase lists useful for solving+designing puzzles, is now smarter when reading crossword constructor dictionaries. Thus, hundreds of thousands of words+phrases got a big boost in the list based on their interesting-ness instead of their common-ness.

Also, you can now feed a Phraser-generated phrase list back into Phraser (presumably together with some other inputs).

Crossword Lists

Crossword-making dictionaries are awesome: they contain hundreds of thousands of words+phrases ranked based on their coolness as puzzle answers. Crossword-making dictionaries are aggravating: they leave out the spaces between words. Is that POWERSOFTEN "powers of ten", "powers often", or "power soften"? You probably know it's "powers of ten", but how do you explain that to a computer program?

Before, Phraser would give crossword dictionary entries a teeny tiny boost in its rankings. E.g., it'd say maybe there's a word "powersoften", let's give it an entry with a score of 3. And then when I ran the Phraser tool, "powersoften" was waaaaaay down in the list, lumped in with typos and misprints. And that was kind of appropriate, since "powersoften" isn't really a word.

Now, Phraser uses words+phrases it learns about elsewhere to figure out the spaces. Phraser doesn't just read crossword dictionaries. It reads Wikipedia, text files, Google Ngrams, … So it already knows that "powers of ten" is a thing but "power soften" isn't. Now Phraser uses that information. It reads wikis, text files, etc. first, saving the crossword dictionaries for last. When it sees POWERSOFTEN, it remembers that it saw "Powers of Ten" in Wikipedia, and knows what to do.

This doesn't catch everything. It can boost an uncommon-but-cool phrase up from the depths of the list to the top; but it can't use crossword dictionaries to "discover" a phrase that isn't in other sources. E.g., I'm looking at a crossword dictionary entry YESSIREEBOB. Alas, Phraser doesn't know the phrase "yessiree Bob" at all; so it doesn't boost that phrase's score even though a crossword puzzle writer really likes the phrase.
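
Here's a rough sketch of that lookup idea in Python. It is not Phraser's actual code: the function names, the toy scores, and the "pick the highest-scoring candidate" rule are all assumptions made up for illustration. The idea is just to index every already-known phrase by its squished-together spelling, then look crossword entries up in that index.

```python
def build_unspaced_index(known_phrases):
    """Map each phrase's no-spaces, uppercase spelling to the spaced phrases that produce it."""
    index = {}
    for phrase, score in known_phrases.items():
        key = "".join(ch for ch in phrase.upper() if ch.isalnum())
        index.setdefault(key, []).append((score, phrase))
    return index


def respace(entry, index):
    """Return the best-scoring known phrase for a crossword entry, or None if unknown."""
    candidates = index.get(entry.upper())
    if not candidates:
        return None
    return max(candidates)[1]  # highest (score, phrase) pair wins


# Toy scores, invented for this example.
known = {"powers of ten": 900, "power soften": 2, "boxing round": 40}
index = build_unspaced_index(known)
print(respace("POWERSOFTEN", index))  # powers of ten
print(respace("YESSIREEBOB", index))  # None
```

The second lookup shows the limitation: if no other source ever mentioned "yessiree Bob", there's nothing in the index to find.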

The new behavior is better than the old. There's still room for improvement though.

Feed Output back in as Input, "prebaked" files

I like to run Phraser to re-fresh the phrase and word lists at least once a year. After all, the world keeps changing. I download fresh copies of Wikipedia and other things.

I don't download a fresh copy of Project Gutenberg, the online collection of tens of thousands of books available as text files. I got ahold of a copy of Project Gutenberg's books from Gutenberg Dammit, a project that cleaned the texts for use with computers learning natural language. Gutenberg Dammit is a one-time-effort thing (I guess); there's no fresh version coming along next year. It was built on top of another one-time-effort thing; there's no fresh version of that coming along next year either. It takes more than twelve hours for my doughty desktop machine to count the words+phrases in Project Gutenberg's tens of thousands of books. So…once a year, my desktop machine is really busy for more than half a day carefully re-counting numbers that haven't changed since last year.

Google goes several years in between updating their Google Ngrams data. So… once a year my desktop machine is really busy for another couple of hours carefully re-counting more numbers that haven't changed since last year.

I tweaked Phraser so I can feed previously-generated phrase lists as input for generating a new phrase list. I ran Phraser once on Project Gutenberg's tens of thousands of books to generate a phrase list. I ran Phraser once on Google Ngram files to generate another phrase list. In future years, I can re-use those "prebaked" phrase lists and save my desktop machine hours of effort.
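
To make the "prebaked" idea concrete, here's a minimal Python sketch. Everything specific in it is an assumption: the one-phrase-per-line, tab-then-number file layout is hypothetical, and so is the choice to simply add counts together (Phraser may well weight its sources differently). The point is only that a saved list can be read back and combined with fresh counts instead of re-counting the original texts.

```python
from collections import Counter


def parse_phrase_list(text):
    """Read 'phrase<TAB>count' lines (a hypothetical format) into a Counter."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        phrase, _, count = line.rpartition("\t")
        counts[phrase] += int(count)
    return counts


def combine(*sources):
    """Add up counts from several already-counted sources."""
    total = Counter()
    for source in sources:
        total.update(source)
    return total


# Toy data: one prebaked list from last year, one freshly counted list.
prebaked_gutenberg = parse_phrase_list("of the\t500\nwhite whale\t7\n")
fresh_wikipedia = parse_phrase_list("of the\t300\npokemon\t12\n")
merged = combine(prebaked_gutenberg, fresh_wikipedia)
print(merged.most_common(3))
```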

Crosswords + Prebaked

I mentioned YESSIREEBOB, a phrase beloved by crossword constructors but which Phraser doesn't know. There are a lot of these phrases that a human could look at and say "well, the space goes between the EE and the BOB". To help Phraser out, I figured out the spaces in thousands of crossword-beloved phrases. Then I made a "prebaked" file with those space-ified phrases so that Phraser will "know" those phrases in future years.

I figured out where to put the spaces in the phrases with a mixture of computing and human-ing. As a first pass, I wrote a computer program to make a good guess. Then I, a human, looked over the output of that program to catch the mistakes. The computer program was pretty good, but it made mistakes. E.g., it looked at BOXINGROUND and came up with "box in ground", but I'm pretty sure the crossword people meant "boxing round."
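
For the curious, here's one common way a "make a good guess" program can work, sketched in Python. This is an assumption about the general technique (dynamic programming over word frequencies), not the program I actually ran, and the toy word counts are invented specifically to reproduce the BOXINGROUND kind of mistake.

```python
import math


def segment(text, word_counts):
    """Split text into known words, maximizing the sum of log word frequencies."""
    text = text.lower()
    total = sum(word_counts.values())
    max_len = max(len(word) for word in word_counts)
    # best[i] holds (log-probability, words) for the best split of text[:i],
    # or None if text[:i] can't be split into known words.
    best = [(0.0, [])] + [None] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is None or piece not in word_counts:
                continue
            score = best[start][0] + math.log(word_counts[piece] / total)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [piece])
    if best[-1] is None:
        return None
    return " ".join(best[-1][1])


# Invented counts: three very common little words outscore two rarer ones.
toy_counts = {"box": 50, "in": 1000, "ground": 80, "boxing": 5, "round": 60}
print(segment("BOXINGROUND", toy_counts))  # box in ground
```

With these counts, "box", "in", and "ground" are each common enough that the three-word split beats "boxing round". That's why a human pass over the output was still worth the fried brain.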

Maybe some day, I'll come back to this and figure out the spaces in more phrases that crossword constructors like. After looking at thousands of phrases this go-around, my brain was kind of fried.

Use less memory

Historically, I was pretty blasé about how much memory Phraser used. It used a lot of memory. I'd set it going, let it run overnight while I slept. Phraser would use all my computer's memory, so my computer wasn't usable; but that was OK, I don't use my computer when I'm sleeping.

In recent months, the Linux OS has been more aggressive about halting programs that are using too much memory. So I'd start Phraser running, go to bed… and when I woke up the next morning instead of a phrase list, I'd just see an error message.

Thus, I could no longer be blasé about using all the memory. I buckled down, figured out what was using so much memory and made some changes. Now I can run Phraser overnight and when I wake up, there's a happy phrase list instead of a sad error message. (And maybe I run Phraser during the day now; Phraser only uses about ⅓ of my machine's memory now, so I can keep using that machine to futz around with other stuff.)

Words like Blasé

Some puzzles pretend that Pokémon is spelled Pokemon; others don't. This can cause problems, as when crossword writers tell you that the Spanish for "year" is ANO but that's really AÑO; meanwhile, ANO is pretty rude. But if you insist on accent marks in your puzzle answers, you're in for disappointment, so Phraser now compensates. Now when Phraser sees a word with accents like Pokémon, it adds both "pokémon" and "pokemon" to its list.
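
A minimal Python sketch of the accent trick, assuming the standard-library unicodedata approach (decompose each letter, then drop the combining marks). The function names are made up for this example, and Phraser's own handling may differ.

```python
import unicodedata


def strip_accents(word):
    """Remove combining accent marks: 'pokémon' becomes 'pokemon'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def spellings(word):
    """Return the lowercase word, plus an accent-free variant if that differs."""
    word = unicodedata.normalize("NFC", word.lower())
    plain = strip_accents(word)
    return [word] if plain == word else [word, plain]


print(spellings("Pokémon"))
print(spellings("blasé"))
print(spellings("blog"))
```

This only handles letters that Unicode can split into a base letter plus a mark; letters like "ø" or "ß" don't decompose that way and would need their own rules.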

Permalink
& Comments

I've had a good time playing the word puzzle game Cell Tower at https://www.andrewt.net/puzzles/cell-tower/
[screen shot of a grid of letters. A continuous set of letters that spells GUESSED is highlighted]

Permalink
& Comments

Thank you Google Books for clearing up the burning questions on common English usage, e.g. is there a space in "backasswards"? (Answer: sometimes, but mostly no.)
[line graph showing relative usage of "backasswards", "bass ackwards", and "back asswards" over time]

I usually say "bass ackwards" but now I see that is falling out of favor. I should get with the times.

Permalink
& Comments

I updated the big ol' list of words and the big ol' list of phrases on the Phraser page.

Feel free to download fresh files if you're into that sort of thing.

Permalink
& Comments

I updated that Bewordled game, the one where you swap tiles to make words kinda like Bejeweled but with words. Now it looks prettier with firecracker emojis and clouds. After I updated it, it occurred to me to start tracking my revisions. By Murphy's law, we now know that I will never change that game again.

Permalink
& Comments

The Collaborative Word List Project is a darned useful resource for word puzzle constructors and now it's free.* This is a list of phrases and hand-tuned scores. Here are a few lines from the file:

…
BLOGPOST;29
BLOGING;16
BLOGS;45
BLOGGER;45
BLOGGING;50
BLOG;65
…

Here, you can see that BLOGGING (with 2 Gs) is a good-enough phrase with a score of 50 while BLOGING (with just 1 G) is a refuge of scoundrels with a score of 16. The BLOGPOST entry points out something else about this file: it normally leaves out spaces. That makes sense for crossword constructors; depending on what kind of puzzle you're making, it might or might not make sense for you.

I've happily used this list for many years. The hand-tuned scores are often darned handy. My phraser lists know a lot of phrases, but most of its judgment of phrase "quality" comes from how often those phrases appear in Wikipedia. So phraser thinks that "of the" is an amazingly high-quality top-20 phrase; but if you saw OFTHE in a crossword, you'd probably think it was so-so. The lovely maintainers of the collaborative word list know that OFTHE is a so-so phrase and give it a so-so score of 45. When I want to steer clear of so-so phrases, I filter my phraser lists using scores from the Collaborative Word List.
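
Here's roughly what that filtering looks like as a Python sketch. The ENTRY;score line format comes from the file excerpt above; the cutoff of 50, the function names, and the decision to drop phrases the list doesn't know are all assumptions for illustration.

```python
def parse_scored_list(text):
    """Read 'ENTRY;score' lines into a dict, skipping anything that doesn't fit."""
    scores = {}
    for line in text.splitlines():
        entry, sep, score = line.strip().partition(";")
        if sep and score.isdigit():
            scores[entry.upper()] = int(score)
    return scores


def filter_phrases(phrases, scores, minimum=50):
    """Keep phrases whose squished-together form scores at least `minimum`."""
    keep = []
    for phrase in phrases:
        key = "".join(ch for ch in phrase.upper() if ch.isalnum())
        if scores.get(key, 0) >= minimum:  # unknown phrases count as 0 here
            keep.append(phrase)
    return keep


sample = "BLOGPOST;29\nBLOGING;16\nBLOGS;45\nBLOGGER;45\nBLOGGING;50\nBLOG;65\nOFTHE;45\n"
scores = parse_scored_list(sample)
print(filter_phrases(["of the", "blog", "blogging", "blog post"], scores))
# ['blog', 'blogging']
```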

Back in the day, the Collaborative Word List was only for paid-up users of Crossword Constructor, but now it's free. If you make word puzzles, give it a look. If you know how to use the git tool (which I think many of this blog's readers do), you can even contribute to the List. (If you want to contribute but don't know git, there are instructions at the Collaborative Word List github page; you want to learn those instructions. Learning enough git to contribute to the List is cool and lovely. But if you try to learn all of git's many many features right off the bat, you'll probably get discouraged.)

*Apparently, it's been free for over a year and I didn't notice?

Permalink
& Comments

Daily 5-dle #0007 11 : 5&8&6&11&10 polydle.github.io/?classic/daily/5 ⬜⬜⬜⬜🟨 ⬜🟩⬜🟨⬜ ⬜⬜⬜⬜🟩 ⬜🟩🟩🟩🟩 🟩🟩🟩🟩🟩 ⬜🟨⬜⬜🟨 ⬜⬜⬜🟨⬜ ⬜🟨⬜🟨⬜ ⬜⬜⬜🟨🟨 ⬜⬜⬜🟨🟨 ⬜⬜🟨⬜🟨 ⬜🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟨🟨⬜🟨🟩 ⬜⬜⬜⬜⬜ 🟩⬜⬜⬜⬜ 🟨⬜⬜🟨⬜ ⬜⬜⬜🟨⬜ 🟩🟩🟩🟩🟩 🟨⬜⬜⬜⬜ 🟩⬜⬜⬜⬜ ⬜🟨⬜⬜⬜ ⬜⬜⬜⬜⬜ ⬜⬜⬜⬜⬜ ⬜⬜⬜🟨⬜ 🟨⬜⬜⬜🟩 ⬜⬜⬜⬜🟩 ⬜⬜⬜⬜⬜ ⬜⬜⬜⬜⬜ 🟩🟩🟩🟩🟩 ⬜⬜🟩🟨⬜ ⬜⬜⬜🟩⬜ 🟩⬜🟩🟩🟩 ⬜⬜⬜⬜🟩 ⬜⬜⬜⬜🟩 🟩⬜⬜⬜⬜ ⬜🟨⬜⬜⬜ ⬜🟨⬜⬜⬜ ⬜⬜🟩🟨🟨 🟩🟩🟩🟩🟩

Permalink
& Comments

Okay, now RAISE is my new Wordle starter word. As before, I am not the first to figure this out. Last night, I was measuring a starting word's quality based on how many green and yellow squares it yielded on average. That's a pretty good measurement, but not quite rigorous. E.g. you have to guess "how much more valuable is a green than a yellow?"

A more algorithm-ly rigorous measure is: on average, for a starter word when the game shows you the green and yellow squares, how many wrong-choices get eliminated? Sorry, that's kind of a mouthful. Maybe more clearly:

If you've wallowed in classic puzzles, you've probably seen plenty of coin-weighing problems a la you have 12 coins, one of which is counterfeit and a little light; you have a balance scale; can you find the counterfeit in just three weighings? In many of these coin-weighing problems, the key is to divvy the 12 coins into 3 groups of 4 instead of the obvious-but-wrong 2 groups of 6. This lets you rule out eight coins in the first round instead of six.

My new measurement looks at a potential starter word. Then it considers all the potential answer-words; for each, what green-and-yellow-squares does the game report? Put all answer-words that get the same green-and-yellow-squares into the same "bucket". You're hoping for many buckets, all about the same size. Because English isn't smooth, you won't get many-many-equally-sized buckets. But you can measure the buckets you do get to see how closely they approach the ideal.
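
Here's a minimal Python sketch of the bucket idea. The feedback rules follow how the game colors squares, including repeated letters. The scoring formula is an assumption: "average size of the bucket the real answer lands in" is one reasonable way to measure buckets, not necessarily the exact arithmetic I used. The four-word answer list is a toy.

```python
from collections import Counter


def feedback(guess, answer):
    """Return a pattern like 'G-Y--': G green, Y yellow, - gray."""
    marks = ["-"] * len(guess)
    unmatched = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            marks[i] = "G"
        else:
            unmatched[a] += 1  # answer letters still available for yellows
    for i, g in enumerate(guess):
        if marks[i] == "-" and unmatched[g] > 0:
            marks[i] = "Y"
            unmatched[g] -= 1
    return "".join(marks)


def buckets(guess, answers):
    """Count how many possible answers produce each feedback pattern."""
    return Counter(feedback(guess, answer) for answer in answers)


def expected_remaining(guess, answers):
    """Average number of candidates left after the first guess (lower is better)."""
    sizes = buckets(guess, answers).values()
    return sum(n * n for n in sizes) / len(answers)


toy_answers = ["slate", "boxed", "boxer", "moody"]
print(buckets("slate", toy_answers))
print(expected_remaining("slate", toy_answers))  # 1.5
```

Many small, even buckets push the average toward 1; one giant bucket pushes it toward the size of the whole list. Subtract from the list size and you get "how many wrong choices got eliminated".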

This is a subtle difference. Last night, my favorite word using the how-many-green-and-yellow-squares measure was SLATE. According to this new measure, SLATE is 99.97% as good as RAISE. I don't know the exact point of diminishing returns for thinking about this problem, but I'm 100% sure I'm way past it.

Permalink
& Comments

Update: This blog post, which superseded another blog post, has since then itself been superseded. Try to keep up. Also, my "only root words" explanation wasn't quite right. Apparently, non-root words with unusual grammatical endings are fine. E.g. BOXED will never be a Wordle final answer; but AROSE could be.

I changed my mind about my Wordle starter word. Now I like SLATE. (I'm not the first/only/whatever person to figure this out.)

Someone on the internet pointed out that Wordle's source code contains its word list. (Well, there are a couple of word lists in there. I don't know for certain, but one looks like a list of awesome words; and one looks like a list of ugh-acceptable-words.) When I picked my earlier starter word, I considered many many five-letter words. I worried that my word-list wasn't like the game's word-list. It turns out I was right to worry. The game favors, uhm, root words. E.g. BOXES is a lovely five-letter English word in my word list, but it is not in Wordle's word list—I bet that's because BOXES is a plural, not a root. When I search Wordle's list for other lovely plural words, I don't find 'em. My previous starter-word-picking system thought S was a very likely last letter. But if you cross out plural words, S isn't so common.

Anyhow, now I like SLATE. SAUCE and SLICE are nice too.

Permalink
& Comments

UPDATE: This post has been superseded.

I've been playing Wordle, the online game that's like a cross between Mastermind and guess-the-word. It occurred to me that the ideal "starting word" would have commonly-appearing letters. And it would be even better if those letters appeared in their most common positions. E.g., it would be good if my starting word contained the most common letter E, but EARLY might not be a great word because E is much more likely to appear at the end of words, not at the beginning. But I've got a list of words and can write a little program to find out which letters are most common at which positions in five-letter words et voila:

position
1 2 3 4 5
s a a e s
a o r a e
t e n i a
b i e o y

…so I eyeballed that data and chose my new starter word. (No, I didn't choose SAAES; I picked a real word out of that mess.)
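
The "little program" is about this big. Here's a sketch in Python, not the original code; the function name and the tiny word list are stand-ins, and a real run would read a full word list from a file.

```python
from collections import Counter


def letters_by_position(words, length=5, top=4):
    """For each letter position, list the most common letters, most common first."""
    columns = [Counter() for _ in range(length)]
    for word in words:
        word = word.lower()
        if len(word) == length and word.isalpha():
            for position, letter in enumerate(word):
                columns[position][letter] += 1
    return [[letter for letter, _ in column.most_common(top)] for column in columns]


# Stand-in word list; words of the wrong length are ignored.
toy_words = ["slate", "sauce", "slice", "raise", "boxed", "toolong"]
table = letters_by_position(toy_words)
for position, letters in enumerate(table, start=1):
    print(position, " ".join(letters))
```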

Permalink
& Comments

I got wind of a new-ish public word list for crossword constructors, the spread the word(list). So I grabbed a copy and tossed it into the big pile of data that feeds the "Phraser" phrase and word li...

Permalink & Comments

Further Bewordled I read Allison Parrish's article "Rewordable versus the alphabet fetish," in which she discusses the design of the card game Rewordable. Like Scrabble and Bananagrams, in Rewordable a player builds u...

Permalink & Comments

Updated "phraser" word list I updated that big ranked "phraser" word list (and also the even bigger ranked phrase list). It counts words (and phrases) from different sources than it did before. The Expanded Crossword Name Dat...

Permalink & Comments

Google Ngrams Download There's a new-to-me set of Google Ngrams (big files with frequency counts for common and not-so-common English words&phrases&word-strings): Google Ngrams Download. I mention this because when...

Permalink & Comments

I have a couple of iron-on patches but no iron. 👆 Still trying to figure out on how many levels that sentence is not ironic. ...

Permalink & Comments

"Septuple" has eight letters but "Octuple" has seven. The English language fights you at every turn. ...

Permalink & Comments

A few months back, I mentioned that I'd boosted phraser's word lists by using data from Project Gutenberg's huge stash of old books… and mentioned that I wished I'd thought to omit the non-Eng...

Permalink & Comments

big ol' text corpus Project Gutenberg is a collection of Important Works kept online. E.g., if you'd like to read Shakespeare's sonnets and don't want to schlep off to some library for a physical book (ugh), you can dow...

Permalink & Comments

The Dutch word "scheepvaart" means "maritime", not ovine flatulent whatever. ...

Permalink & Comments

Remember that list of phrases and/or that list of words in a text file handy for designing/solving word puzzles? I updated those lists again with some fresh content. While I'm here: Happy Thanksgivi...

Permalink & Comments

Huh. Neither of my senators' voicemail boxes were full this morning. Maybe I should start leaving longer messages. ...

Permalink & Comments

As previously threatened, I've updated the phrase and word lists linked from the phraser page with more modern language. E.g., podesta was the 82,350th most "common" word on the old list, but with ne...

Permalink & Comments

Remember phraser, that tool for generating puzzle-design-friendly word lists? I just updated it. I found OMDB, a big database of movie info with a public API. (Did I find it? Or did one of you tell m...

Permalink & Comments

phraser, a word list generator When you construct word puzzles, it's good to have a nice list of words to work with. Over the last several weeks, I've been tinkering on and off to build phraser, a tool that chugs through wiki data...

Permalink & Comments

Bird Names, part of the new gig It's an exaggeration to say that Twitter's moving from a Big-Ball-of-Mud monolithic RnR architecture to a loose confederacy of services, but after you tone down the hyperbole that's roughly what's ha...

Permalink & Comments

Book Report: Many Subtle Channels in praise of potential literature In honor of USA's Buy Nothing Day, a report on a book that I checked out of the library: Many Subtle Channels It's a book about the OuLiPo. You've probably heard of them: they're a literary cabal in...

Permalink & Comments

Speaking of "what's this kind of puzzle called?", what is "Put together the letter-triples ION ISS NSM TRA to form a word"? It's kind of an anagram, but easier since you've got three triples instead ...

Permalink & Comments

Link: Ranking Wikipedia Pages This puzzle nerd has ideas on how to rank Wikipedia pages for notable-ness. Similar goals to Nutrimatic, but taking advantage of more data. Some of you folks might have some good ideas on things he c...

Permalink & Comments

Crossword Compiler Noob Diary Unsurprisingly, creating mediocre crossword puzzles is easy but creating good crossword puzzles is hard. Mind you, I don't feel pressured to create great crossword puzzles. For puzzlehunts, I only ne...

Permalink & Comments

Cyber-F-22 Sometime the past few years, the prefix "cyber-" changed meaning. It used to mean "high-tech". But lately, it's meant "I am trying to sell some poorly-thought-out computer crap to the USA governmen...

Permalink & Comments

Michael Agger wants a word for someone who speechifies about the future. He coined "Keynotist" but I prefer TEDifice. ...

Permalink & Comments

The voice of Wikipedia. Each article written by people writing about what they care about most. The precise language of controversies tiptoed around. The earnestness. You might think you could rob...

Permalink & Comments

Google & OpenID: discovery URL A while back, I mentioned that Google supported OpenID. There's one important detail that I had a hard time finding amidst the mountains of documentation: If the user wants to use their Google acco...

Permalink & Comments

Book Report: Alphabet Juice This book is a sort of lexicon, except that instead of definitions there are riffs. These are some of the author's favorite words, or at least words that he wanted to write about. He likes to pron...

Permalink & Comments

Book Report: Letting Go of the Words I'm a professional technical writer and I recommend this book about writing: Letting Go of the Words. I theoretically train engineers so that they can write clearly. This book would help those peopl...

Permalink & Comments

Link: Warren Spector, Playing Word Games Warren Spector does not, as far as I know, play uppercase "T" The uppercase "G" Game. But he designs lowercase "g" games. He worked on some good stuff for the Paranoia pencil-and-paper RPG... uhm, ...

Permalink & Comments

Book Report: Ambient Findability This was not the right book for me. Rather, I was not the right person to read this book. Ambient Findability is a high-level overview, a survey of the surge of information that's coming at us, and...

Permalink & Comments

Book Report: Rainbows End It pays to increase your word power. I always thought that "hyperventilation" meant "breathing too fast", but really it means "breathing too fast and/or too deeply". I didn't know it was possible to...

Permalink & Comments

Book Report: Everything is Miscellaneous I am scheduled for HEAD & NECK SURGERY. It says so, in all-capital letters on the appointment form. Don't worry, mom, HEAD & NECK SURGERY is a scary-sounding category of things, but really s...

Permalink & Comments

Link: Travelers Storybook I have mentioned this before: When I was growing, I spent a fair amount of time with Bob & Kelly Wilhelm, friends of the family. Bob was and is a storyteller. I don't just mean that he can rela...

Permalink & Comments

Link: Webster's Online Dictionary Puzzle hunts were everywhere last weekend. Midnight Madness in Hot Springs. Some movie called BHAGAMBHAG set up a promo treasure hunt in Mumbai, sounds big-scale. I didn't do any of that. I have ...

Permalink & Comments

Puzzle Hunts are Everywhere, from Seattle to Siena Some awesome folks in Seattle are contributing to their local Game community by setting up a web site with announcements and forums and stuff. Check it out. I fed their RSS feed into my reader so I...

Permalink & Comments

Publishing News Tom Manshreck is in town. Tom was living in NYC, working in publishing. There's a lot of publishing around there. Tom was working on engineering textbooks, but he still cares about the literary st...

Permalink & Comments

Not Quite Letting Go of Spring Did I mention that White Mughals mentions a doctor treating a bladder infection? And the doctor is named George Ure. Ure should totally be the root of the word "urea", though it isn't, really. Tha...

Permalink & Comments
