: New:

A few months back, I mentioned that I'd boosted phraser's word lists with data from Project Gutenberg's huge stash of old books… and that I wished I'd thought to omit the non-English books. I finally got around to it. Thanks to the gutenberg-dammit project, it was pretty easy to do, once you realize it's worth doing: that project "tags" each book's language in a consistent way for easy computation. The word list and phrase list at the phraser page are now more English-y and less other-y. This makes a difference: before, "pour" was about the 1500th most popular phrase; now it's about the 7500th. "Pour" is valid English, but it's darned common in French. Project Gutenberg doesn't have a ton of French books, but it has many… enough to warp phraser's idea of what's a super-common word and what's a perfectly serviceable word.
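If you want to picture the filtering, here's a minimal sketch in Python. It is not phraser's actual code. The record layout (a `languages` list and a `text` string per book) and the `count_words` helper are hypothetical stand-ins I made up for illustration; the real gutenberg-dammit metadata may be shaped differently, so check its docs before borrowing this.

```python
import re
from collections import Counter

def count_words(books, language="English"):
    """Tally words, skipping books not tagged with the given language.

    Each book is assumed (hypothetically) to be a dict with a
    "languages" list and a "text" string.
    """
    counts = Counter()
    for book in books:
        if language not in book["languages"]:
            continue  # omit books without the wanted language tag
        # Crude tokenizer: lowercase runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", book["text"].lower()))
    return counts

# Tiny made-up sample standing in for real book data.
sample = [
    {"languages": ["English"], "text": "Pour the tea and pour the milk."},
    {"languages": ["French"], "text": "Pour moi, pour toi, pour nous."},
]

english_counts = count_words(sample)
```

With the toy sample, only the English book's two "pour"s get counted; the French book's three are skipped. One judgment call in this sketch: a book tagged with several languages still counts if English is among them, which may or may not be what you want.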

Tags: words
