Last time, I wrote about a brute force web crawler. This time, I'm writing about an elegant web crawler. As you would expect from elegant code, I didn't write it.
The Pirates BATH game had a pregame website. Teams could log in to the web site. There was a web form which I'd programmed before I'd dropped out of Game Control. This web form allowed teams to "search for treasure": enter a string of text. Game Control gave them some strings of text that they could enter: entering one of those into the web form yielded puzzles. When a team solved the puzzle, the answer was a phrase: entering the phrase into the web form yielded a hint which would be useful during the upcoming game.
If they entered text that wasn't a puzzle and wasn't an answer, they were told that they'd found nothing. And if they paid attention, they also noticed some black dots, some white dots, and some xs. These were a "Mastermind" puzzle. If they entered a nonsense phrase, a program figured out which "useful" word was closest; it would then display one white dot for each letter in the correct place; a black dot for each correct letter in the wrong place; an X for each incorrect letter. So if "BELOW" was a word and someone entered "BLOW", they'd see a white dot (for the B), three black dots (for L, O, and W), and an X (for the E).
This was the way to find one game hint: no puzzles solved to the correct word for this hint. But four puzzles gave words that didn't actually yield hints--but instead were just near to the word to enter for this special hint.
What if a team just tried to guess every possible text string? They could guess A B C ... Y Z AA AB AC ... ZY ZZ AAA AAB AAC ... Of course, that would take a long time. It would probably take less time to just solve the puzzles.
So I was kind of surprised when my pager started buzzing one day: BATH Game Control was sending me messages: Team Scoobies had set up a bot to crawl the server! The Scoobies had found puzzles that they couldn't have found!
I looked over the logs. There was a program crawling the system, but the Scoobies weren't running it. Team Blood was running it. The bot was not brute-forcibly checking every possible text string. It was playing Mastermind!
It would guess "A". If it got back a white dot, it knew that at least one word started with A. If it got back a white dot, it knew that at least one word started with A. (A white dot meant right letter in right place.) Next it would try try AA AB AC AD AE ... AZ. If AA returned just one white dot (not two), then the bot knew no words started with AA (e.g., no word was AARDVARK). So it never tried AAA AAB AAC... Thus, it didn't need to check so many things. Thus, elegance.
When I reported my findings to Game Control, they decided that this thing must be stopped. Though it was elegant, what if it allowed the team to bypass puzzles? Game Control figured that this would be unfair.
Hmm, how to stop the bot without disrupting other teams? How did the bot work? Team Blood was running it. Rich Bragg captained Team Blood. I worked at the same company as Rich. Maybe he'd written this program while at work? And maybe he'd left the program somewhere where I could find it? I thought about it: If I were Rich and I'd written this program at work, where would I have put the source code? I looked there: no program. Then I tried my second guess and saw a file: piratebath.py. Bingo. It was a web crawler, a very specialized web crawler.
#!/usr/bin/python2.4 import cookielib import re import time import urllib2 def Login(): print "Logging in..." cj = cookielib.CookieJar() opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)) opener.open("http://www.piratesbath.com/log_in.php", "access_login=blood&access_password=bythepint") return opener def NumMatches(html_data, substring): matches = re.findall(substring, html_data) if not matches: return 0 return len(matches) def NumLettersCorrect(html_data): return NumMatches(html_data, "dot_white.gif") def FoundTreasure(html_data): return NumMatches(html_data, "No treasure found.") == 0 def SearchOne(opener, results, query): data = opener.open("http://www.piratesbath.com/search.php?q=" + query).read() letters_correct = NumLettersCorrect(data) print "Query:", query, "had", letters_correct, "of", len(query), "letters" all_correct = letters_correct == len(query) if all_correct and FoundTreasure(data): print "Found:", query results.append(query) return all_correct def SearchAll(opener, results, query_prefix = ''): alphabet = list('abcdefghijklmnopqrstuvwxyz') for letter in alphabet: if SearchOne(opener, results, query_prefix + letter): SearchAll(opener, results, query_prefix + letter) def Run(query_prefix = ''): opener = Login() results =  SearchAll(opener, results, query_prefix) print "Results: ", len(results), "words starting with '%s'" % query_prefix for word in results: print word Run()
Aha, the code was checking for text in the page: dot_white.gif and No treasure found. If I just added some visible-to-bots-but-invisible-to-humans text like that, I could fool the bot into mis-counting white dots or what-have-you. So that's what I did. (Security-minded folks in the audience might say: uhm, but what about stopping the general case of bots? Yeah, I set up code for that too, but wanted to let Game Control configure it to say how much guessing was "too much", and that took a while. Fooling Rich's bot--that was a quick-n-dirty fix.)
(I notice that this code imports the "time" module, but doesn't use it. I wonder if an earlier version of code politely "slept" a little between queries--but maybe Rich figured out that the server was waiting a second between responding to a team's queries anyhow, and that the sleep was thus not so useful...)
Rich noticed when his bot started generating garbage results. He mailed Game Control to make sure there were no hard feelings. Game Control asked him to stop running it, and he did. He said that this script was basically another monitor: it alerted the team to the presence of new puzzles; thus no-one had to go re-check the piratesbath.com web site each day.
In hindsight, when I programmed that web form, we should have used it only for entering answers, not for getting puzzles. We should have used some other way to distribute puzzles. Thus, a team could monitor that to look for puzzles and Game Control wouldn't need to panic that someone was bypassing the puzzles to get the answers.
Labels: arrr, programming, puzzlehunts