New: Site Update: Updated Tags for Old Blog Posts

Blogger.com manages this part of my site, the /new/ part. In the long-forgotten days of 2006, Blogger.com didn't support labels/tags/whatever. In those dark days, I hand-made some tags, tags which looked suspiciously like tarted-up links to technorati.com. Then Blogger.com did support labels/tags/whatever. But now I had all of these blog posts that didn't follow the new tagging scheme. What to do?

I've been noodling around on the computer a lot, and today I tackled this one. What did I want to do?

  1. Read my blog's data into a program
  2. Look for links to technorati.com/tag/foo
  3. For each of those links, create a label "foo"
  4. Upload the results to Blogger.com
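
Steps 2 and 3 don't need Blogger at all, so they can be tried on their own. Here is a minimal sketch, assuming a post body is just a string of HTML. The names find_labels and SAMPLE_POST are made up for illustration, and it is written for a current Python 3 rather than the Python I ran back then. It also URL-decodes the tag (so "puzzle%20hunts" becomes "puzzle hunts"), which is a choice of this sketch, not something my real program did.

```python
import re
from urllib.parse import unquote

# Capture the tag name from links such as http://technorati.com/tag/foo
TAG_RE = re.compile(r'technorati\.com/tag/([^\'"]+)')

def find_labels(html):
    """Return the tag names linked in html, URL-decoded, in order, without repeats."""
    labels = []
    for raw in TAG_RE.findall(html):
        label = unquote(raw)
        if label not in labels:
            labels.append(label)
    return labels

# Hypothetical post body with two hand-made technorati tags.
SAMPLE_POST = (
    '<p>Tags: <a href="http://technorati.com/tag/pi" rel="tag">pi</a> '
    '<a href="http://technorati.com/tag/puzzle%20hunts" rel="tag">puzzle hunts</a></p>'
)

print(find_labels(SAMPLE_POST))  # ['pi', 'puzzle hunts']
```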

As I was putting a program together to do this, I discovered another task: don't convert all of the tags; tweak some of them. That sounds like a strange goal, doesn't it? Well, I had made technorati tags for nautical and maritime, just one item tagged with each. Maybe I wanted to convert that "maritime" to "nautical" for consistency. Why leave out some tags? I'd technorati-tagged a Pi post "irrational". I'd never used that tag again. (You might doubt that, given the hysterical tone of this blog, but it's true.) I didn't especially want to clutter up my tag space with 100 obscure tags, each used only once.
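
To make the keep/rename/drop idea concrete, here is a small sketch. It assumes each post's old tags are already in a list. The helper names (tag_counts, convert) and the sample POSTS data are hypothetical. Counting how many posts use each tag is one way to spot the one-offs, and a dictionary lookup handles the rest: a tag that maps to itself is kept, a tag that maps to something else is renamed, and a tag that isn't in the dictionary is dropped.

```python
from collections import Counter

def tag_counts(posts_tags):
    """Count how many posts use each tag. posts_tags is a list of tag lists."""
    counts = Counter()
    for tags in posts_tags:
        counts.update(set(tags))  # set() so a tag counts once per post
    return counts

def convert(tags, tag_map):
    """Map old tags to labels. Tags missing from tag_map are dropped."""
    labels = []
    for tag in tags:
        label = tag_map.get(tag)
        if label is not None and label not in labels:
            labels.append(label)
    return labels

# Hypothetical data: four posts and the tags hand-made for each.
POSTS = [['nautical', 'book'], ['maritime'], ['pi', 'irrational'], ['book']]
TAG_MAP = {'nautical': 'nautical', 'maritime': 'nautical', 'book': 'book', 'pi': 'pi'}

counts = tag_counts(POSTS)
one_offs = sorted(tag for tag, n in counts.items() if n == 1)
print(one_offs)  # candidates to rename or leave out
print([convert(tags, TAG_MAP) for tags in POSTS])
```

Which one-offs to keep is still a judgment call (here "pi" stays, "irrational" goes), so the counts only suggest candidates.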

So I started reading up on the Blogger GData APIs. At first I thought I could install the Python client library with a simple sudo aptitude install python-gdata. That seemed to install an old version of the library that didn't work with my Python. (It insisted on importing ElementTree from some nonexistent place.) So I ended up downloading and installing the latest version of the library.

Then I set about reading the docs, copying out bits of sample code, and twisting them towards my own purpose. Soon I had a messy piece of code that seemed to work:

import gdata.service
import gdata
import atom
import re

# Capture the tag name from links like technorati.com/tag/foo.
TAG_RE = re.compile(r'technorati\.com/tag/([^\'"]+)')

TAG_MAP = {} # If a tag isn't in here, ignore it.  If it is in here, see how to convert it.
TAG_MAP['book'] = 'book'
TAG_MAP['books'] = 'books'
TAG_MAP['puzzle%20hunts'] = 'puzzlehunts'
TAG_MAP['puzzle%20hunt'] = 'puzzlehunts'
# ...this TAG_MAP crapola went on for a while
TAG_MAP['pi'] = 'pi'
TAG_MAP['poesy'] = 'poesy'

# Get my blog's ID number.  I guess I didn't really need to 
# run this each time.  I should just have made a note of the
# number.
def RetrieveBlogId(blogger_service):
  query = gdata.service.Query()
  query.feed = '/feeds/default/blogs'
  feed = blogger_service.Get(query.ToUri())
  return feed.entry[0].GetSelfLink().href.split("/")[-1]

# The program doesn't actually call this function!  But
# an earlier version of the program did, back when I was
# still trying to figure out how the API worked.  This fn is
# taken from the sample code, but with an important tweak.
# The sample didn't mention that by default, this would
# only retrieve 25 blog posts.  My blog had 427 items, so 
# I appended a tactical "?max-results=500".
def PrintAllPosts(blogger_service, blog_id):
    feed = blogger_service.GetFeed('/feeds/' + blog_id + '/posts/default?max-results=500')
    print feed.title.text
    for entry in feed.entry:
        print "TITLE \t" + entry.title.text
        print "\t" + entry.content.text
        print "\t" + entry.updated.text
    print

# Create a client object that will make HTTP requests to the Blogger server.
blogger_service = gdata.service.GDataService("lahosken@gmail.com", "ilikeyou")
blogger_service.source = 'Technorati-to-Label-1.0' 
blogger_service.service = 'blogger'
blogger_service.server = 'www.blogger.com'
blogger_service.ProgrammaticLogin()

blog_id = RetrieveBlogId(blogger_service)
# PrintAllPosts(blogger_service, blog_id)

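# Fetch every post (the default is 25; 500 covers all 427).
feed = blogger_service.GetFeed('/feeds/' + blog_id + '/posts/default?max-results=500')
for e in feed.entry:
    dirty = False  # Set once this post gains at least one label.
    # Each technorati tag that TAG_MAP knows about becomes a Blogger label.
    for tag in TAG_RE.findall(e.content.text):
        if tag in TAG_MAP:
            kitty = atom.Category(term=TAG_MAP[tag], scheme="http://www.blogger.com/atom/ns#")
            e.category.append(kitty)
            dirty = True
    # Only upload the posts that actually changed.
    if dirty:
        blogger_service.Put(e, e.GetEditLink().href)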

Yes, that is some awful code. No, I don't think it was good style to name that variable "kitty". Give me a break, it was a one-off.
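
One thing the loop above doesn't guard against: if a post links to both "puzzle%20hunt" and "puzzle%20hunts", or already carries the label, it appends the same label more than once. Here is a sketch of a dry run that works out which posts would change before anything is uploaded. It uses plain tuples as stand-ins for feed entries, so it touches no Blogger API at all. The names new_labels and POSTS, and the sample data, are made up for illustration.

```python
import re

# Capture the tag name from links such as http://technorati.com/tag/foo
TAG_RE = re.compile(r'technorati\.com/tag/([^\'"]+)')

def new_labels(content, existing, tag_map):
    """Labels the post's technorati links call for and that it doesn't have yet."""
    wanted = []
    for tag in TAG_RE.findall(content):
        label = tag_map.get(tag)
        if label and label not in existing and label not in wanted:
            wanted.append(label)
    return wanted

# Hypothetical stand-ins for feed entries: (title, HTML content, existing labels)
POSTS = [
    ('Hunt recap',
     '<a href="http://technorati.com/tag/puzzle%20hunt">x</a> '
     '<a href="http://technorati.com/tag/puzzle%20hunts">y</a>',
     []),
    ('Pi day', '<a href="http://technorati.com/tag/pi">pi</a>', ['pi']),
    ('No tags', '<p>hello</p>', []),
]
TAG_MAP = {'puzzle%20hunt': 'puzzlehunts', 'puzzle%20hunts': 'puzzlehunts', 'pi': 'pi'}

# Keep only the posts that would actually gain a label.
plan = [(title, new_labels(content, labels, TAG_MAP)) for title, content, labels in POSTS]
plan = [(title, labels) for title, labels in plan if labels]
print(plan)  # [('Hunt recap', ['puzzlehunts'])]
```

A check like this also makes the script safe to run twice, since already-labeled posts drop out of the plan.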

This code ran quickly, but I think that Blogger.com is still handling the results. At first, I didn't think that it had worked. I thought maybe I needed to force a republish of all my pages. So I tweaked the blog's template and triggered a republish. But I don't think that was the problem. Actually, my pages are changing. It's just taking a while. It's several hours later now, and I notice that my blog's pages keep getting re-uploaded. I think I changed tags on about 100 blog posts, and I think each of those changes triggered a republish. They seem to happen about 2-10 minutes apart. If there are a few hundred to process (plus that template change), I guess it makes sense that it would take a few hours.
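
A back-of-the-envelope check on that guess, using only the rough numbers above (about 100 changed posts, republishes 2 to 10 minutes apart). It assumes the republishes happen one at a time, which I haven't confirmed. The function name is made up.

```python
def republish_hours(count, min_gap_minutes, max_gap_minutes):
    """Return (low, high) total hours if republishes happen one after another."""
    return (count * min_gap_minutes / 60.0, count * max_gap_minutes / 60.0)

low, high = republish_hours(100, 2, 10)
print("%.1f to %.1f hours" % (low, high))  # 3.3 to 16.7 hours
```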

I wonder when this post will appear.

Posted 2007-12-02