Last time, I wrote about a brute force web crawler.
This time, I'm writing about an elegant web crawler. As you would expect
from elegant code, I didn't write it.
The Pirates BATH game had a pregame
website. Teams could log in to the web site. There was a
web form which
I'd programmed before I'd dropped out of Game Control. This web form
allowed teams to
"search for treasure": enter a string of text.
Game Control gave them some strings of text that they could enter: entering
one of those into the web form yielded puzzles. When a team solved the puzzle, the answer
was a phrase: entering the phrase into the web form yielded a hint which would be
useful during the upcoming game.
If they entered text that wasn't a puzzle and wasn't an answer, they were
told that they'd found nothing. And if they paid attention, they also
noticed some black dots, some white dots, and some xs. These
were a "Mastermind" puzzle. If they entered a nonsense phrase, a program
figured out which "useful" word was closest; it would then display one
white dot for each letter in the correct place; a black dot for each
correct letter in the wrong place; an X for each incorrect letter.
So if "BELOW" was a word and someone entered "BLOW", they'd see
a white dot (for the B), three black dots (for L, O, and W),
and an X (for the E).
This was the way to find one game hint: no puzzles solved to the correct
word for this hint. But four puzzles gave words that didn't actually
yield hints--but instead were just near to the word to enter for this
special hint.
What if a team just tried to guess every possible text string? They
could guess A B C ... Y Z AA AB AC ... ZY ZZ AAA AAB AAC ... Of course,
that would take a long time. It would probably take less time to just
solve the puzzles.
So I was kind of surprised when my pager started buzzing one day:
BATH Game Control was sending me messages: Team Scoobies had set up
a bot to crawl the server! The Scoobies had found puzzles that they
couldn't have found!
I looked over the logs. There was a program crawling the system,
but the Scoobies weren't running it.
Team Blood was running it. The
bot was not brute-forcibly checking every possible text string.
It was playing Mastermind!
It would guess "A". If it got back a white dot, it knew that at least
one word started with A. If it got back a white dot, it knew that at least
one word started with A. (A white dot meant right letter in right place.)
Next it would try
try AA AB AC AD AE ... AZ. If AA returned just one white dot (not two),
then the bot knew no words started with AA (e.g., no word was AARDVARK).
So it never tried AAA AAB AAC... Thus, it didn't need to check so many
things. Thus, elegance.
When I reported my findings to Game Control, they decided that this
thing must be stopped. Though it was elegant, what if it allowed the
team to bypass puzzles? Game Control figured that this would be unfair.
Hmm, how to stop the bot without disrupting other teams? How did the
bot work? Team Blood was running it. Rich Bragg captained Team
Blood. I worked at the same company as Rich. Maybe he'd written this
program while at work? And maybe he'd left the program somewhere where I could
find it? I thought about it: If I were Rich and I'd written
this program at work, where would I have put the source code? I looked there:
no program. Then I tried my second guess and saw a file: piratebath.py.
Bingo. It was a web crawler, a very specialized web crawler.
#!/usr/bin/python2.4
import cookielib
import re
import time
import urllib2
def Login():
print "Logging in..."
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.open("http://www.piratesbath.com/log_in.php",
"access_login=blood&access_password=bythepint")
return opener
def NumMatches(html_data, substring):
matches = re.findall(substring, html_data)
if not matches:
return 0
return len(matches)
def NumLettersCorrect(html_data):
return NumMatches(html_data, "dot_white.gif")
def FoundTreasure(html_data):
return NumMatches(html_data, "No treasure found.") == 0
def SearchOne(opener, results, query):
data = opener.open("http://www.piratesbath.com/search.php?q=" +
query).read()
letters_correct = NumLettersCorrect(data)
print "Query:", query, "had", letters_correct, "of", len(query), "letters"
all_correct = letters_correct == len(query)
if all_correct and FoundTreasure(data):
print "Found:", query
results.append(query)
return all_correct
def SearchAll(opener, results, query_prefix = ''):
alphabet = list('abcdefghijklmnopqrstuvwxyz')
for letter in alphabet:
if SearchOne(opener, results, query_prefix + letter):
SearchAll(opener, results, query_prefix + letter)
def Run(query_prefix = ''):
opener = Login()
results = []
SearchAll(opener, results, query_prefix)
print "Results: ", len(results), "words starting with '%s'" % query_prefix
for word in results:
print word
Run()
Aha, the code was checking for text in the page: dot_white.gif
and No treasure found. If I just added some
visible-to-bots-but-invisible-to-humans text like that, I could fool
the bot into mis-counting white dots or what-have-you. So that's what I did.
(Security-minded folks in the audience might say: uhm, but what about stopping the
general case of bots? Yeah, I set up code for that too, but wanted to let
Game Control configure it to say how much guessing was "too much", and that
took a while. Fooling Rich's bot--that was a quick-n-dirty fix.)
(I notice that this code imports the "time" module, but doesn't use it. I wonder
if an earlier version of code politely "slept" a little between queries--but
maybe Rich figured out that the server was waiting a second between
responding to a team's queries anyhow, and that the sleep was thus not so useful...)
Rich noticed when his bot started generating garbage results. He mailed
Game Control to make sure there were no hard feelings. Game Control asked
him to stop running it, and he did. He said that this script was basically
another monitor: it alerted the team to
the presence of new puzzles; thus no-one had to go re-check the
piratesbath.com web site each day.
In hindsight, when I programmed that web form, we should have used it only for entering
answers, not for getting puzzles. We should have used some other way to distribute
puzzles. Thus, a team could monitor that to look for puzzles and Game Control wouldn't
need to panic that someone was bypassing the puzzles to get the answers.
Labels: arrr, programming, puzzlehunts