Larry Hosken: New

Information Architecture: There Oughtta be a Law

I recently read an article by London Times writer Alan Brien in which he wrote

"I used to think that I was the first reader, enraged by the difficulty of tracking down a passage in a long work of reference without re-scanning every single page, who proposed that all non-fictional [sic] books without indexes should be denied copyright."
"Indexes–pleasures of; pitfalls in; regrettable absences of; penalty for failing to provide"
Feb 23 1968 London Times

I encountered this quote as I hunted for an article by Mr Brien--an article I might have found right away if only the publishers of the Times had seen fit to provide an index.

To be clear, though this passage pissed me off when I first encountered it, I don't think Brien was being hypocritical. I think he was being humble. I guess he didn't expect that anyone would go through the Arts sections of decades-old newspaper microfiches looking for his old articles. Manual indexing takes effort. It almost seems an act of hubris to create an index of one's own writing.

(It's times like this when I love my job.)

Labels: paper, research, tagging

Book Report: Ambient Findability

This was not the right book for me. Rather, I was not the right person to read this book. Ambient Findability is a high-level overview, a survey of the surge of information that's coming at us, and the methods we use to navigate it. It talks a little about many things. Geographical data, RFID, SEO, taxonomies, ... It doesn't go into depth on any topic in particular. If you keep up with slashdot, or if you think about how to help people to find things, then you've probably already run into this book's material. I'm guessing that this book is snippets of talks that Peter Morville, the author, gives to people who don't read slashdot and who don't think about how people find their stuff. He's a consultant, he probably has to explain this crap to customers all the time.

I mean, you ask people, uhm, say you're talking with a kitchen appliance manufacturer, and you ask "The people who should find your website, what do they think they're looking for?" And the answer comes back "Oh, they're looking for a variable-speed food processor." And this hypothetical manufacturer wants you to organize everything based on "variable-speed food processor". And when you say "Are you sure that they don't think that they're looking for... like, wouldn't they call it a 'blender'?" And then this hypothetical kitchen appliance manufacturer looks at you sharply and barks something about "precision of language" and wonders why no-one can find his variable-speed food processors. What do you do? You could punch the manufacturer bozo, but that would land you in jail. Or you could give him a copy of this book. I can't tell whether or not the book would help him figure this stuff out, but it might keep you out of jail and that's worth something.

I haven't been posting much recently. I've been on jury duty. Or rather, I've been living in the shadow of threatened jury duty, and have thus been working extra hours at my job to get stuff done in case I end up on the trial and most of my time disappears for the next month. So remember kids: give books instead of punches. It keeps you--and perhaps me--out of the courtroom.

Labels: book, tagging, words

Book Report: Everything is Miscellaneous

I am scheduled for HEAD & NECK SURGERY. It says so, in all-capital letters on the appointment form. Don't worry, mom, HEAD & NECK SURGERY is a scary-sounding category of things, but really someone is just going to cut this bump off of my lip. I guess to make it sound less scary, they could have given a sub-category: HEAD & NECK SURGERY / LIP FIXING. But figuring out categories is hard, figuring out subcategories is harder and it's silly to waste time figuring out if some procedure is LIP FIXING or CUPID'S BOW REARRANGEMENT when you could spend that time cutting bumps off of lips instead.

If you have a bump on your lip, searching for medical information on the internet is frustrating. If you search for [lip], these sites serve up results for herpes. If you look over the description of herpes and say, "nope, I just have this one bump", the sites don't know what to say. I think that's because these sites are organized by condition--and for a lip-bump the doctor doesn't diagnose the exact medical condition. "Is it one of these scary things? Nope. Then let's just cut that bump off, whatever it is." I've had medical self-help books that are organized by initial diagnosis, not by medical condition. You start with "bump on lip" and go through a flow chart. They capture the case of "we don't know exactly what it is, but it's probably not too serious" better.

Categorizing things can be tricky. Figuring out which things are the things you're talking about can be tricky.

The first time I ever heard of Peter Morville, information architecture pundit, was when he gave a talk at work a few months ago. I got the impression that he hated tagsonomies, hated users annotating web pages/photos/anything. I thought he only wanted librarians to have that Mysterious Classification Power. (Hey, bear with me, Peter Morville fans. I now realize I was wrong.) It upset me and made me think he was a jerk. So, what did he say?

This is the free-tagging and the folksonomies of flickr and del.icio.us, and there's almost a sort of religious revolutionary zeal that's wrapped up with this notion of free-tagging. Get rid of the librarians, the information architects, the taxonomies, the controlled vocabularies, and just let the users tag stuff with anything they want! And in that sort of spirit, David Weinberger, who's got a new book out called Everything is Miscellaneous, said "The old way creates a tree, the new rakes leaves together." So the old way was about taxonomies and tree structures, and the new is about these wonderful self-organizing clusters. When I saw that, I thought you know that's the perfect metaphor. Because we know what happens to those lovely piles of leaves we shuffle through each fall. They very quickly rot, and they return to the ground where they become food for trees, which come in many shapes and sizes and live a very long time.
I actually think that David's book is brilliant; I think that he's a really smart guy. I'm not sure he's totally fair to librarians but I'm of course a little biased. But I actually think that the answer lies in the genius of the "and", in figuring out how do we bring these traditional and novel organization approaches together...

I hadn't read Everything is Miscellaneous. More to the point, I hadn't seen some of the reactions it drew from a set of idiot blowhards. Thus, I didn't interpret Morville's words as "tagsonomies and professionally-put-together taxonomies help each other." I just heard "this new crap will rot and then my beloved librarians will have the power back neiner neiner." His talk's ending didn't help much.

He told the story of the three stonecutters. Here, I'll summarize the story: ask three stonecutters in a quarry what they're doing. First one answers "I'm making a living." Second one answers, "I am honing my craft." Third one gets starry-eyed and says, "I am building a cathedral." Third one has his eye on the big picture. So, what does this story illustrate...

I've always thought of libraries as more than just warehouses of information. I've thought of them, to some degree, as cathedrals of knowledge. Sort of lifting us up and inspiring us, making us aware of the human potential to create and share knowledge and work together. And my hope is that as we move further into the internet age that we take some of those values with us and that we also create and share with one another new sources of inspiration. That we are seeing the big picture in what we're doing, and that we're not just marching forward with this sense of some sort of pre-ordained techno-utopian future, but that we're actually taking the time to think about a future that we want to create, and that we end up working towards desirable futures.

When you're speaking to computer programmers, don't present the "cathedral" as a good thing. When you say "cathedral", programmers sweep the stonecutter story out of their brains, and instead remember the essay The Cathedral and the Bazaar. This essay is about the advantages of very-open-source programming--the model in which developers around the world can see a program's progress as that program is being written, can contribute to that program, all very open. That's the "bazaar". This essay's thesis is that the "bazaar" model works much better than the old model of programming: a team doesn't show their source code to the world as they work (or does so only rarely); the team doesn't accept fixes/code from the outside world; the resulting software is buggier because not as many people look at the source code. That's the "cathedral", the team that doesn't listen.

Don't call yourself a Cathedral when you're talking to programmers, not when you're trying to convince us that a small team of elites is doing something wonderful that a huge crowd of enthusiasts couldn't do. I'm sure Morville has read The Cathedral and the Bazaar, but does he understand how much more it resonates than that stonecutter story?

Oh, I'm getting worked up again. What was I talking about? Oh right--I didn't understand the context of Morville's statements. Morville talks to librarians. A lot. A lot of them had read this book Everything is Miscellaneous. And this book says--well, first: It doesn't say that librarians aren't talented; it doesn't say that librarians are stupid. It does say that as information moves off of paper and onto the web, many activities that librarians have historically spent a lot of time on--those activities aren't going to be so useful. But librarians still have useful skills. Many librarians understand this. Some don't; they say that this book is an attack on their profession. Some idiot blowhards, not librarians themselves, are tut-tutting the book, wrapping their toxicity in a cloak of "I'm just trashing this book because I love librarians."

That's a pretty slick maneuver. Everybody loves librarians, so folks might not figure out that you're just being an idiot blowhard. But that's not what Everything is Miscellaneous is about. Who are these idiot blowhards?

[In an earlier draft of this book report, here I quoted one of these fight-picking morons and then pointed out how they'd misinterpreted the book, why their misinterpretation wasn't even internally consistent... Ahem. But pointing at someone else and calling him a fight-picking moron... Well, that's not setting a great example of how not to be a fight-picking moron...]

It is good to remember that these idiot blowhards are out there. Some of them don't like it when you tell them that the Dewey Decimal System has not aged well. I bet they've whined at Peter Morville about it until he stopped wondering "Did I overlook something in that book?" and started wondering "Why is that David Weinberger being so mean to librarians?"

I finally read Everything is Miscellaneous. And reading it made me want to go back and watch the video of Morville's talk. EisM talks about subject classifications and tagsonomies. That reminded me that Morville had talked about librarians vs tagsonomies. I hadn't understood that Morville's "yay librarians/boo tagsonomy" remark was in the context of talking about Everything is Miscellaneous and the reaction. Now that I look at the talk again, I don't think he was really saying "boo tagsonomies". Morville likes them. He likes having someone with a taxonomic attitude make the first organization for information, but he doesn't mind users annotating.

So what is this book that baited so many idiot blowhards to bloviate? (Since I'm writing about it, I guess it's lured in one more...)

This book is about how we organize knowledge. It's about categories, about ontologies, about taxonomies, about indexes, about card catalogs, about the Dewey Decimal system, about books, about the web, about Web2.0, about tagsonomies, about user-supplied annotations, about user-supplied content, about how we perceive the world. And it's written very understandably. So you can see why some people would have opinions about it.

He talks about how we organize physical objects so that people can find them. You can categorize things--in an office supply store, you might put "printer stuff" in a section. You could try just storing everything in alphabetical order based upon some name provided by the manufacturer. But what if someone fails to find the Printer Cable they're looking for just because the manufacturer called it a "parallel cable"?

Categorization can also help you to identify things. Fun fact I learned from this book: Linnean classification predates acceptance of evolution. Similarly-classified things weren't necessarily supposed to spring from common ancestry. The classification was just a way of helping to identify the thingies you were talking about. Which flower? The one with the split stamen, five petals, etc etc. There wasn't a great reason to canonicalize your classification to be "first describe the stamen, then number of petals", but you did need some canonical order because you were writing all of these things down. And if you wrote down all possible orderings, then the resulting index would be larger than your botanical garden's library building.
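To put a rough number on that last point: a description built from n traits can be written in n! different orders. Here is a small Python sketch, assuming a made-up count of ten traits per plant (that number is for illustration only, it is not from the book):

```python
import math

def possible_orderings(num_traits):
    """Number of distinct orders in which num_traits traits can be listed."""
    return math.factorial(num_traits)

# Hypothetical trait counts, just to show how fast the number grows.
for n in (2, 5, 10):
    print(n, possible_orderings(n))
```

With ten traits, an index that listed every plant under every ordering would need millions of entries per plant, which is why settling on one canonical order is the practical choice.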

Card catalogs have subject cards. But there was pressure on librarians not to assign a book to too many subjects--there wouldn't be enough physical space in the card catalog to hold that many subject cards. If a book was mostly about the Crimean War and had some interesting things to say about caring for horses--you'd probably never know about the horses from the card catalog.

The Dewey Decimal system is an attempt to order books by category. It was very impressive. It is showing its age. Why is the "Religion" section so big and why is Christianity so large within that section? Well, back when the system was devised, that might have made sense. Why doesn't Chinese get more of a... Uhm, the Dewey Decimal system is showing its age. (Though Weinberger doesn't talk about it, the Library of Congress system seems like it's skewing out of balance too. I think that's what they use at the UC Berkeley library. I find myself in some "subjects" all the time and others not at all--with less balance than I'd expect even with my narrow interests. I bet the Cutter System has similar problems.)

The web is here and it's popular. Now it's easy for people to annotate things. They can point out links between things. They can comment on things. And they can categorize things. It's very exciting. It's not limited by physical space. So an electronic card catalog could have 2000 "subject cards" for each book.
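As a rough illustration of why the space limit goes away, here is a minimal Python sketch of an electronic "card catalog" that files each book under every one of its subjects. The book titles, the subject names, and the `build_subject_index` helper are all hypothetical:

```python
from collections import defaultdict

def build_subject_index(books):
    """Map each subject to the set of titles filed under it."""
    index = defaultdict(set)
    for title, subjects in books.items():
        for subject in subjects:
            index[subject].add(title)
    return index

# Hypothetical catalog: titles and subjects invented for the example.
books = {
    "A History of the Crimean War": ["Crimean War", "horse care"],
    "Stable Management": ["horse care"],
}
index = build_subject_index(books)
print(sorted(index["horse care"]))
```

Adding another subject to a book costs one more dictionary entry, not one more drawer of cards, so the Crimean War book can show up under "horse care" too.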

Our categories are fuzzy. Hamlet is a tragedy. What about Charlotte's Web? Well, it has sad parts. Maybe it's kinda tragic.

The chapter called Smart Leaves nudged me out into new-to-me mental territory: The things that we're trying to describe/categorize are themselves fuzzy. What is Hamlet? Is it the First Folio edition? There were a couple of editions before that. Is a book that combines all of those, showing the differences--is that Hamlet, too? How about Rosencrantz and Guildenstern Are Dead, is that Hamlet? (I'd run into this problem while using allconsuming, a web site that lets you comment on books. It pulls some basic information about books from Amazon. But Amazon doesn't list "Hamlet"; it lists every edition of Hamlet that it has for sale. So when I want to comment on a book on allconsuming, I have to choose which edition to comment on. I usually look for the one that the most other people chose. But that's hardly scientific.)

It talks about...

It talks about the issues that you hit every day on the internet. Yet, I learned from it. And even when I was reading things that I already knew, it was sufficiently well-written such that I didn't get bored. Check it out.

Labels: book, phylogeny, tagging, words

Site Update: Updated Tags for Old Blog Posts

Blogger.com manages this part of my site, the /new/ part. In the long-forgotten days of 2006, Blogger.com didn't support labels/tags/whatever. In those dark days, I hand-made some tags, tags which looked suspiciously like tarted-up links to technorati.com. Then Blogger.com did support labels/tags/whatever. But now I had all of these blog posts that didn't follow the new tagging scheme. What to do?

I've been noodling around on the computer a lot, and today I tackled this one. What did I want to do?

Read my blog's data into a program
Look for links to technorati.com/tag/foo
For each of those links, create a label "foo"
Upload the results to Blogger.com

As I was putting a program together to do this, I discovered another task: don't convert all of the tags; tweak some of them. That sounds like a strange goal, doesn't it? Well, I had made technorati tags for nautical and maritime, just one item tagged with each. Maybe I wanted to convert that "maritime" to "nautical" for consistency. Why leave out some tags? I'd technorati-tagged a Pi post "irrational". I'd never used that tag again. (You might doubt that, given the hysterical tone of this blog, but it's true.) I didn't especially want to clutter up my tag space with 100 obscure tags, each used only once.
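One way to do that kind of triage is to count how often each tag was used before deciding what goes in the map. The sketch below is only an illustration under stated assumptions, not the script from this post: it is written for Python 3, and the sample post bodies, the `count_tags` and `suggest_tag_map` helpers, the `renames` table, and the `min_uses` threshold are all hypothetical.

```python
import re
from collections import Counter

# Capture whatever follows technorati.com/tag/ up to the closing quote.
TAG_RE = re.compile(r'technorati\.com/tag/([^\'"]+)')

def count_tags(post_bodies):
    """Count how many posts mention each technorati tag."""
    counts = Counter()
    for body in post_bodies:
        # set() so a tag linked twice in one post is counted once
        counts.update(set(TAG_RE.findall(body)))
    return counts

def suggest_tag_map(counts, renames, min_uses=2):
    """Keep tags that are renamed by hand or used at least min_uses times."""
    tag_map = {}
    for tag, uses in counts.items():
        if tag in renames:
            tag_map[tag] = renames[tag]
        elif uses >= min_uses:
            tag_map[tag] = tag
    return tag_map

# Hypothetical post bodies, invented for the example.
posts = [
    '<a href="http://technorati.com/tag/book">book</a>',
    '<a href="http://technorati.com/tag/book">book</a> '
    '<a href="http://technorati.com/tag/maritime">maritime</a>',
    '<a href="http://technorati.com/tag/nautical">nautical</a>',
    '<a href="http://technorati.com/tag/irrational">irrational</a>',
]
# Hand-picked consolidations: fold "maritime" into "nautical".
renames = {'maritime': 'nautical', 'nautical': 'nautical'}
tag_map = suggest_tag_map(count_tags(posts), renames)
print(sorted(tag_map.items()))
```

In this toy data, "book" survives because it is used twice, "maritime" is folded into "nautical", and the single-use "irrational" is dropped.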

So I started reading up on the Blogger GData APIs. At first I thought I could install the Python client library with a simple sudo aptitude install python-gdata. That seemed to install an old version of the library that didn't work with my Python. (It insisted on importing ElemTree from some nonexistent place.) So I ended up downloading and installing the latest version of the library.

Then I set about reading the docs, copying out bits of sample code, and twisting them towards my own purpose. Soon I had a messy piece of code that seemed to work:

from xml.etree import ElementTree
import gdata.service
import gdata
import atom
import getopt
import re

TAG_RE = re.compile('technorati.com/tag/([^\'\"]+)')

TAG_MAP = {} # If a tag isn't in here, ignore it.  If it is in here, see how to convert it.
TAG_MAP['book'] = 'book'
TAG_MAP['books'] = 'books'
TAG_MAP['puzzle%20hunts'] = 'puzzlehunts'
TAG_MAP['puzzle%20hunt'] = 'puzzlehunts'
# ...this TAG_MAP crapola went on for a while
TAG_MAP['pi'] = 'pi'
TAG_MAP['poesy'] = 'poesy'

# Get my blog's ID number.  I guess I didn't really need to 
# run this each time.  I should just have made a note of the
# number.
def RetrieveBlogId(blogger_service):
  query = gdata.service.Query()
  query.feed = '/feeds/default/blogs'
  feed = blogger_service.Get(query.ToUri())
  return feed.entry[0].GetSelfLink().href.split("/")[-1]

# The program doesn't actually call this function!  But
# an earlier version of the program did, back when I was
# still trying to figure out how the API worked.  This fn is
# taken from the sample code, but with an important tweak.
# The sample didn't mention that by default, this would
# only retrieve 25 blog posts.  My blog had 427 items, so 
# I appended a tactical "?max-results=500".
def PrintAllPosts(blogger_service, blog_id):
    feed = blogger_service.GetFeed('/feeds/' + blog_id + '/posts/default?max-results=500')
    print feed.title.text
    for entry in feed.entry:
      print "TITLE \t" + entry.title.text
      print "\t" + entry.content.text
      print "\t" + entry.updated.text
    print

# Create a client object which will make HTTP requests to the Blogger server.
blogger_service = gdata.service.GDataService("lahosken@gmail.com", "ilikeyou")
blogger_service.source = 'Technorati-to-Label-1.0' 
blogger_service.service = 'blogger'
blogger_service.server = 'www.blogger.com'
blogger_service.ProgrammaticLogin()

blog_id = RetrieveBlogId(blogger_service)
# PrintAllPosts(blogger_service, blog_id)

feed = blogger_service.GetFeed('/feeds/' + blog_id + '/posts/default?max-results=500')
for e in feed.entry:
    dirty = False
    for tag in TAG_RE.findall(e.content.text):
        if tag in TAG_MAP:
            kitty = atom.Category(term=TAG_MAP[tag], scheme="http://www.blogger.com/atom/ns#")
            e.category.append(kitty)
            dirty = True
    if dirty:
        blogger_service.Put(e, e.GetEditLink().href)

Yes, that is some awful code. No, I don't think it was good style to name that variable "kitty". Give me a break, it was a one-off.

This code ran quickly, but I think that Blogger.com is still handling the results. At first, I didn't think that it had worked. I thought maybe I needed to force a republish of all my pages. So I tweaked the blog's template and triggered a republish. But I don't think that was the problem. Actually, my pages are changing. It's just taking a while. It's several hours later now, and I notice that my blog's pages keep getting re-uploaded. I think I changed tags on about 100 blog posts, and I think each of those changes triggered a republish. They seem to happen about 2-10 minutes apart. If there are a few hundred to process (plus that template change), I guess it makes sense that it would take a few hours.
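A back-of-the-envelope check on that guess, as a Python sketch. It assumes the republishes happen strictly one at a time, and it plugs in the rough figures above (about 100 changed posts, 2 to 10 minutes apart); the `republish_hours` helper is made up for the example.

```python
def republish_hours(num_posts, minutes_apart):
    """Total hours if each post triggers one republish, run back to back."""
    return num_posts * minutes_apart / 60.0

low = republish_hours(100, 2)    # fastest observed spacing
high = republish_hours(100, 10)  # slowest observed spacing
print("%.1f to %.1f hours" % (low, high))
```

So even at the fast end, "a few hours" is about the least it could take.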

I wonder when this post will appear.

Labels: programming, site, tagging

Book Report: Giant Robot #36

I am always glad to see an article by Claudine Ko. But I am not sufficiently secure in my whatever to start reading Jane Magazine, where she spends most of her efforts. So instead I read her interview with Brandon Lee, porn star. This interview appeared in Giant Robot. Maybe it's too racy for Jane? I don't know. Am I sufficiently secure in my whatever to read an interview with a gay porn star? I guess so. It was an okay interview.

My favorite article was about Xavier Cha, who came up with a new graffiti method: topiary tagging. She cut her name into people's hedges. She talks about getting caught.

...[the police officer] walked me up to the house and said, "I found this woman cutting up your hedges." The woman didn't seem to know which hedges he was talking about. She was entertaining guests and seemed annoyed by the cop disrupting her afternoon soirée. She didn't even bother going out to look, and said it was all right. She probably regretted her decision because a couple of days later, it was cut down. A lot of them get cut down right away.

Tags: zine | topiary |

Labels: tagging, zine