Blogger.com manages this part of my site, the /new/ part. In the long-forgotten days of 2006, Blogger.com didn't support labels/tags/whatever. In those dark days, I hand-made some tags, tags which looked suspiciously like tarted-up links to technorati.com. Then Blogger.com did support labels/tags/whatever. But now I had all of these blog posts that didn't follow the new tagging scheme. What to do?
I've been noodling around on the computer a lot, and today I tackled this one. What did I want to do?
- Read my blog's data into a program
- Look for links to technorati.com/tag/foo (there's a little sketch of this step just after the list)
- For each of those links, create a label "foo"
- Upload the results to Blogger.com
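To make that link-hunting step concrete, here's roughly the shape of it: the same regex the script further down uses, run against a made-up scrap of old post HTML (the snippet is invented just for illustration):

import re

# Pull the tag name out of a technorati.com/tag/... link; the capture
# stops at the quote that closes the href attribute.
TAG_RE = re.compile('technorati.com/tag/([^\'\"]+)')

# An invented scrap of old post HTML with two hand-made technorati tags.
sample_html = ('Ahoy there. '
               '<a href="http://technorati.com/tag/nautical">nautical</a> '
               '<a href="http://technorati.com/tag/pi">pi</a>')

print(TAG_RE.findall(sample_html))  # prints ['nautical', 'pi']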
As I was putting a program together to do this, I discovered another requirement: don't convert all of the tags, and rename some of the ones I did convert. That sounds like a strange goal, doesn't it? Well, I had made technorati tags for nautical and maritime, with just one item tagged with each. Maybe I wanted to convert that "maritime" to "nautical" for consistency. Why leave out some tags? I'd technorati-tagged a Pi post "irrational" and never used that tag again. (You might doubt that, given the hysterical tone of this blog, but it's true.) I didn't especially want to clutter up my tag space with 100 obscure tags, each used only once.
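That's what the TAG_MAP dict in the script further down handles: keys are the old technorati tag names, values are the Blogger labels to apply, and anything not in the dict simply gets skipped. A tiny illustration of the idea, using just the examples above:

# Old technorati tag -> Blogger label. Anything not listed gets dropped.
TAG_MAP = {
  'nautical': 'nautical',  # keep as-is
  'maritime': 'nautical',  # fold into "nautical" for consistency
  # 'irrational' is deliberately absent: used once, not worth a label
}

for old_tag in ['nautical', 'maritime', 'irrational']:
  if old_tag in TAG_MAP:
    print(old_tag + ' -> ' + TAG_MAP[old_tag])
  else:
    print(old_tag + ' -> (dropped)')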
So I started reading up on the Blogger GData APIs. At first I thought I could install the Python client library with a simple sudo aptitude install python-gdata. That seemed to install an old version of the library that didn't work with my Python. (It insisted on importing ElemTree from some nonexistent place.) So I ended up downloading and installing the latest version of the library.
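I could have saved myself some head-scratching by asking Python which copy of gdata it was actually loading (not that I thought of it at the time):

import gdata
# Shows which installed copy of the library Python is importing
# (the stale packaged one vs. the freshly downloaded one).
print(gdata.__file__)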
Then I set about reading the docs, copying out bits of sample code, and twisting them towards my own purpose. Soon I had a messy piece of code that seemed to work:
from xml.etree import ElementTree
import gdata.service
import gdata
import atom
import getopt
import re

TAG_RE = re.compile('technorati.com/tag/([^\'\"]+)')

TAG_MAP = {}  # If a tag isn't in here, ignore it. If it is in here, see how to convert it.
TAG_MAP['book'] = 'book'
TAG_MAP['books'] = 'books'
TAG_MAP['puzzle%20hunts'] = 'puzzlehunts'
TAG_MAP['puzzle%20hunt'] = 'puzzlehunts'
# ...this TAG_MAP crapola went on for a while
TAG_MAP['pi'] = 'pi'
TAG_MAP['poesy'] = 'poesy'

# Get my blog's ID number.  I guess I didn't really need to
# run this each time.  I should just have made a note of the
# number.
def RetrieveBlogId(blogger_service):
  query = gdata.service.Query()
  query.feed = '/feeds/default/blogs'
  feed = blogger_service.Get(query.ToUri())
  return feed.entry[0].GetSelfLink().href.split("/")[-1]

# The program doesn't actually call this function!  But
# an earlier version of the program did, back when I was
# still trying to figure out how the API worked.  This fn is
# taken from the sample code, but with an important tweak.
# The sample didn't mention that by default, this would
# only retrieve 25 blog posts.  My blog had 427 items, so
# I appended a tactical "?max-results=500".
def PrintAllPosts(blogger_service, blog_id):
  feed = blogger_service.GetFeed('/feeds/' + blog_id + '/posts/default?max-results=500')
  print feed.title.text
  for entry in feed.entry:
    print "TITLE \t" + entry.title.text
    print "\t" + entry.content.text
    print "\t" + entry.updated.text
    print

# Create a client class which will make HTTP requests with Google Docs server.
blogger_service = gdata.service.GDataService("lahosken@gmail.com", "ilikeyou")
blogger_service.source = 'Technorati-to-Label-1.0'
blogger_service.service = 'blogger'
blogger_service.server = 'www.blogger.com'
blogger_service.ProgrammaticLogin()

blog_id = RetrieveBlogId(blogger_service)

# PrintAllPosts(blogger_service, blog_id)

feed = blogger_service.GetFeed('/feeds/' + blog_id + '/posts/default?max-results=500')
for e in feed.entry:
  dirty = False
  for tag in TAG_RE.findall(e.content.text):
    if tag in TAG_MAP:
      kitty = atom.Category(term=TAG_MAP[tag], scheme="http://www.blogger.com/atom/ns#")
      e.category.append(kitty)
      dirty = True
  if dirty:
    blogger_service.Put(e, e.GetEditLink().href)
Yes, that is some awful code. No, I don't think it was good style to name that variable "kitty". Give me a break, it was a one-off.
This code ran quickly, but I think Blogger.com is still handling the results. At first I didn't think it had worked, so I figured maybe I needed to force a republish of all my pages; I tweaked the blog's template and triggered one. But that probably wasn't the problem. My pages are changing; it's just taking a while. It's several hours later now, and I notice that my blog's pages keep getting re-uploaded. I think I changed tags on about 100 blog posts, and each of those changes seems to have triggered its own republish, arriving roughly 2-10 minutes apart. With a few hundred of those to process (plus that template change), I guess it makes sense that it would take a few hours.
I wonder when this post will appear.
Labels: programming, site, tagging