New: Link: AllMyData

I occasionally backed up my files. But it was always ad-hoc: zip up an archive of some files, upload it to my web server. Done by hand when I got around to it (not often).

Then there was the time when I upgraded my OS and it all went pear-shaped. I knew it was risky, so I zipped up an archive of my files, encrypted it (there was some private info in there), and uploaded it. Then the upgrade took a lot longer than I thought--partway through, it became apparent that I needed to send away for installation discs. And I lost the Post-It with the password I'd used to encrypt my files. And I couldn't remember the password. So that was a few months of data lost forever.

Gone, daddy, gone.

So it was time to re-think my backups.

I'm starting to buy music online. If I send too much data back and forth to my website, I get charged for it. And there's a disk space quota besides. So, as I accumulate more music, I can't keep using my website to hold my backup files.

So it was really time to re-think my backups.

Where could I keep these files? I could get an external hard drive. Of course, a lot of the problems that could wipe out my PC could also wipe out a hard drive sitting next to my PC. I have a teeny-tiny apartment. A small dog could destroy most of my electronics in about a minute. So, I was thinking about off-site backups. Backing up my data to "the cloud," as it were. Every year or so, there's a flurry of rumors that Google will launch a "G-drive". Two years of rumors and it ain't happened. I didn't want to wait for that. (Yes, you know where I work. No, I can't comment on unannounced thingies, nor do I even know half of the stuff that's brewing at work.)

I lurk on a mailing list about "capabilities" in computer security. This guy named "Zooko" keeps posting there about this online storage service he works on. Wow, an online storage system sounds like exactly what I need. Zooko seems reasonably sharp.

At some point, I followed up and learned more about this online file service, Allmydata.com. I found out that their CEO was Peter Secor. I'd worked with Peter back in the day at Geoworks. He was pretty competent back then; no reason to think he's less competent now.

So I signed up for an account at Allmydata.com: $100 a year for "unlimited" storage--where "unlimited" means "as much as I can upload/download".

If you just look at the Allmydata.com website, you'll think "Wait, this is for Windoze losers and Macintosh freaks. There's nothing here about Unix weenies." But my previous research had clued me in to the existence of Allmydata.org's Tahoe, an open source program which would, among other things, allow me to send my files to/from Allmydata.com's service.
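
Getting Tahoe talking to Allmydata.com's grid was a one-time setup, roughly like this (I'm sketching from memory; the exact steps vary with your Tahoe version, and the dircap is a placeholder for the one Allmydata hands you):

tahoe create-client                 # set up a client node under ~/.tahoe
# ...put the introducer FURL from Allmydata into the node's config...
tahoe start                         # connect to the grid
tahoe add-alias tahoe URI:DIR2:...  # point the "tahoe:" alias at your backup root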

What I didn't have was a program that would do something rsync-like every so often. So I threw together some quick-and-dirty Python that could run in a cron job.

#!/usr/bin/python

# Back up files to allmydata.com servers.
#
# Use a random order: I don't trust myself to write
# a program that doesn't fail partway through.  But
# if we run a few times, with a different order each
# time, we should back up most files OK.

import datetime
import simplejson as json
import os
import random
import subprocess
import tempfile

# The trees of files to upload.  Those with 'crypt': True get gpg-encrypted
# before upload.  (The rest are uploaded plain, unencrypted.)
LOCAL_ROOTS = [
    { "path": '/home/lahosken/keep/' },
    { "path": '/home/lahosken/seekrit/', 'crypt': True },
]

# Something like tahoe:2009/home/lahosken/keep/path/to/file.txt
REMOTE_ROOT = 'tahoe:' + str(datetime.date.today().year)

TMP_DIR = tempfile.mkdtemp()

def EnumerateLocalDirs(start_path):
    # Walk the tree under start_path; return just the directories
    # that directly contain at least one file.
    retval = []
    for (root, dirs, files) in os.walk(unicode(start_path)):
        if files: retval.append(root)
    return retval

def MakeRemoteDir(local_dir_name):
    # Create the remote directory, plus each ancestor along the way.
    # (Quick and dirty: we don't check any exit statuses.)
    print "MakeRemoteDir", local_dir_name
    p = subprocess.Popen("tahoe mkdir '%s'" % REMOTE_ROOT, shell=True)
    sts = os.waitpid(p.pid, 0)
    path = ""
    for pathlet in local_dir_name.split("/"):
        path += pathlet + "/"
        remote_dir_name = REMOTE_ROOT + path
        # Escape single quotes so the path survives the shell quoting below.
        remote_dir_name = remote_dir_name.replace("'", "'\\''")
        p = subprocess.Popen("tahoe mkdir '%s'" % remote_dir_name, shell=True)
        sts = os.waitpid(p.pid, 0)

def SyncFile(local_path, root):
    print "SyncFile", local_path
    # Escape single quotes so the path survives the shell quoting below.
    local_path = local_path.replace("'", "'\\''")
    if "crypt" in root:
        # Encrypt to a scratch file, upload the ciphertext, then clean up.
        # (The scratch name is just the last nine characters of the path,
        # slashes stripped -- quick and dirty.)
        encrypted_file_path = os.path.join(TMP_DIR,
                                           local_path.replace("/", "")[-9:] + ".gpg")
        p = subprocess.Popen("gpg -r mister@lahosken.san-francisco.ca.us -o '%s' --encrypt '%s'" %
                             (encrypted_file_path, local_path),
                             shell=True)
        sts = os.waitpid(p.pid, 0)
        p = subprocess.Popen("tahoe cp '%s' '%s'" %
                             (encrypted_file_path, REMOTE_ROOT + local_path),
                             shell=True)
        sts = os.waitpid(p.pid, 0)
        os.remove(encrypted_file_path)
    else:
        p = subprocess.Popen("tahoe cp '%s' '%s'" %
                             (local_path, REMOTE_ROOT + local_path),
                             shell=True)
        sts = os.waitpid(p.pid, 0)

def MaybeSync(local_dir_name, root):
    print "MaybeSync", local_dir_name
    remote_dir_name = REMOTE_ROOT + local_dir_name
    local_file_list = os.listdir(local_dir_name)
    random.shuffle(local_file_list)
    remote_dir_contents_json = subprocess.Popen(
        ["tahoe", "ls", "--json", remote_dir_name],
        stdout=subprocess.PIPE).communicate()[0]
    # Default: an empty listing, shaped like the "children" dict tahoe returns.
    remote_dir_contents_dict = { "children" : {} }
    if not remote_dir_contents_json:
        MakeRemoteDir(local_dir_name)
    else: 
        remote_dir_contents_dict = json.loads(remote_dir_contents_json)[1]
    for local_file_name in local_file_list:
        local_path = os.path.join(local_dir_name, local_file_name)
        if not os.path.isfile(local_path): continue
        if "crypt" in root and local_file_name.endswith(".rc4"): continue
        if not local_file_name in remote_dir_contents_dict["children"]:
            SyncFile(local_path, root)
        else:
            remote_mtime = remote_dir_contents_dict["children"][local_file_name][1]["metadata"]["mtime"]
            local_mtime = os.stat(local_path).st_mtime
            if local_mtime > remote_mtime:
                SyncFile(local_path, root)
            else:
                pass
                # print "Skipping", local_dir_name, local_file_name, "already there"
            

def main():
    p = subprocess.Popen("tahoe start", shell=True)
    sts = os.waitpid(p.pid, 0)
    random.shuffle(LOCAL_ROOTS)
    for root in LOCAL_ROOTS:
        local_dirs = EnumerateLocalDirs(root["path"])
        random.shuffle(local_dirs)
        for local_dir in local_dirs:
            MaybeSync(local_dir, root)

if __name__ == "__main__":
    main()
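
To make it run unattended, I point cron at it. A crontab line along these lines works (the path and schedule here are illustrative, not necessarily what I use):

0 3 * * * /home/lahosken/bin/backup_tahoe.py >> /tmp/backup_tahoe.log 2>&1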

Things can still go wrong with this system. I use gpg to encrypt my private stuff. My gpg key is on... my computer. It's also on a piece of removable media. Remember that hypothetical small dog I mentioned earlier, how easily it could mess up a hypothetical external hard drive at the same time it was destroying my computer? My removable media is not quite right next to my computer, but... things could still go wrong with it. Since my previous backup-recovery failure was due to my losing the encryption password, I'd like a stronger, foolproofier solution here.
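
In the meantime, the cheap insurance is squirreling away an extra copy of the secret key somewhere far from the apartment. Exporting it is a one-liner (the output filename is whatever you like):

gpg --armor --export-secret-keys mister@lahosken.san-francisco.ca.us > seekrit-key.asc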

Still, this is better than what I had before. I sleep easier now.


Posted 2009-02-17