New: big ol' text corpus

Project Gutenberg is a collection of Important Works kept online. E.g., if you'd like to read Shakespeare's sonnets and don't want to schlep off to some library for a physical book (ugh), you can download them from Project Gutenberg. What hasn't been so easy until now: downloading all of Project Gutenberg's text files for data-crunching. Different works are stored in different formats: this one is an ASCII text file; that one is a zip-compressed UTF-8 file… You could get them all, but you'd have to do a lot of thinking and data-converting and (ugh) effort. The excellent poetry-automation expert Allison Parrish has put together a cleaned-up, consistent-ified collection: the gutenberg-dammit project. It's pretty nifty.

I grabbed its text files and tossed them into the pile of text I use to generate the "phraser" lists of common words and phrases arguably useful for solving and creating word puzzles. Now if you download the phrase-list and/or word-list files from there, you'll know that they are informed by the world's great public-domain literature and stuff. (I'll let those with exquisite taste debate whether, say, Edgar Rice Burroughs' The Chessmen of Mars counts as "literature" or "stuff".) Anyhow, the lists are different now. Project Gutenberg is big! Generating these word lists took a day longer than it would have without all of these big files. (If you're wondering "Hey, Larry, could you describe some of the differences you notice? Is there some literary phrase that got 'boosted' in the list thanks to these sources?", alas, I can't really: along with pulling in Project Gutenberg, I updated some other things. I'm too lazy to crank out a Gutenberg-less list and compare. I like to use this machine for internet surfing, not just for crunching word lists…)

I could have been (and perhaps in the future will be) more selective about choosing works. The gutenberg-dammit files have metadata that points out that some files aren't in English; some of these text files are just placeholders saying, roughly, "Don't look at this text file; for this Edison Wax Cylinder recording, see the audio file at such-and-such place." I sloppily just dumped all of those into my text-file directory and crunched 'em all. If I'd realized that would lead to a whole extra day of computation, maybe I would have left out the German and Dutch and French, etc.
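For anyone who wants to be less sloppy than I was, here's a rough Python sketch of that kind of weeding-out. It is not what I actually ran. The metadata key names (`Language`, `gd-path`) and the 200-word cutoff for spotting placeholder files are assumptions made up for illustration; check them against the real gutenberg-dammit metadata before trusting any of this.

```python
def looks_like_placeholder(text, min_words=200):
    """Guess that a very short file is a 'go see the audio file' stub.

    The 200-word cutoff is an arbitrary assumption, not a documented rule.
    """
    return len(text.split()) < min_words


def select_english(records, language_key="Language", path_key="gd-path"):
    """Return the paths of records whose only listed language is English.

    `records` is a list of dicts, one per work. The key names are
    assumptions; adjust them to match the actual metadata file.
    """
    paths = []
    for record in records:
        languages = record.get(language_key) or []
        if isinstance(languages, str):
            # Tolerate a bare string as well as a list of strings.
            languages = [languages]
        path = record.get(path_key)
        if path and [lang.lower() for lang in languages] == ["english"]:
            paths.append(path)
    return paths


if __name__ == "__main__":
    # Hypothetical records, just to show the shape of the input.
    sample = [
        {"Language": ["English"], "gd-path": "000/00001.txt"},
        {"Language": ["German"], "gd-path": "000/00002.txt"},
    ]
    for path in select_english(sample):
        print(path)
```

You'd run `select_english` over the metadata first, then read each surviving file and skip it if `looks_like_placeholder` says it's a stub. Works tagged with more than one language get dropped too, which is a judgment call; loosen the comparison if you'd rather keep them.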

But anyhow: gutenberg-dammit is a darned useful resource for folks who want a big ol' pile of English text. Check it out.

Tags: words book
