New: Salvaged from Facebook: More on Observing Playtesters

A while back, I posted a plea for Game Control folks to offer up insights on how they Observe Playtesters. That got some interesting replies scattered around the internets. It also got some replies on Facebook, and those aren't so easy to find later. So I'm reposting them here, because there's some good stuff I wanna be able to find later:

Alexandra Dixon: Larry, your post is so overwhelming, rambling, and comprehensive that I don't know quite how to reply and actually answer your questions! So I won't try. But yes, it inspired me, since this is a subject I'm also very interested in, as a game producer and player in games, and also because I just participated in the Hallowe'en BANG on-the-ground playtest and afterwards solved a few puzzles that we didn't get to.

I write a lot of puzzles for DrClue.com, and in those games, which take place in remote cities (sometimes VERY remote, like Dubai), we have to have people playtest from a distance, where we can't observe them playtesting. In those cases, they're not playtesting the actual puzzles; they're testing the questions/answers in the environment. So we always use two people, have them go over the course independently, and then have them send us their feedback via email. We'll maybe give them 20-25 questions to answer, and plan to use 10-12 of the questions in the actual game.

These are usually people we hire on craigslist, so they're unknown quantities.

We know what answers we expect (because one of the staff scouted and wrote the questions). The gold standard is, we'd like to only use questions where both playtesters got the answer we expect.

However, it often happens that one playtester is lazy and/or clueless and consistently gets wrong answers, while the other one is clearly on the ball, really has his head in it, and gives us lots of good feedback such as "this exhibit is closing on October 15th, better not use it." In that case, we will go solely with the playtester who appears trustworthy - if he got the answers we were expecting, that's good enough, even if the other guy didn't.
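
A minimal Python sketch of that vetting rule (the function name and data shapes are invented for illustration - this isn't DrClue.com's actual tooling):

    def vet_questions(questions, answers_a, answers_b, trust_a, trust_b):
        """Return the ids of questions safe to use in the real game.

        questions:            dict of question id -> expected answer
        answers_a, answers_b: dict of question id -> that playtester's answer
        trust_a, trust_b:     GC's subjective read on each tester's reliability
        """
        usable = []
        for qid, expected in questions.items():
            a_ok = answers_a.get(qid) == expected
            b_ok = answers_b.get(qid) == expected
            if a_ok and b_ok:
                usable.append(qid)   # gold standard: both testers got it
            elif a_ok and trust_a and not trust_b:
                usable.append(qid)   # fall back on the trustworthy tester
            elif b_ok and trust_b and not trust_a:
                usable.append(qid)
        return usable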

That's a long-winded way of saying: I always weight feedback by my subjective perception of how reliable I think the playtesters are - and, in games where they're testing the puzzles, by how closely the playtesters represent the skill level of the actual participants in the real game. You don't really want outliers (except for timing and staffing clue sites) - you kinda want middle-of-the-pack playtesters.

I suppose I would weight my own observations more heavily than the playtesters' feedback, so I think actually watching people playtest (and NOT putting your thumb on the scale by giving unsolicited hints or feedback, or interacting with the playtesters) is far preferable to having them email their feedback on a clue they solved remotely.

Having said that, I did playtest three clues when I got home Sunday night - and what I tried to do was give a kind of real-time email update of the progress I was making on each clue as I was solving it. I didn't give any qualitative feedback; I just wrote down what I was thinking and what progress I made. I figured my thought process was the most important thing in giving feedback - and those wrong paths and parallel-universe mistakes are REALLY important to GC. You want to flush those out and either decide "Alexandra was clueless, the average playtester would not go up that particular wrong path" or "Oops, we need to make that more clear. Other people are jumping to the same (wrong) conclusion."

So, playtesting isn't even just a person-by-person experience - you're also doing it in aggregate: is there something that MOST playtesters get tripped up by? If so, you definitely want to fix that, even if it's not an outright bug. Or, if it's just one person - maybe you DON'T tweak it. Depends on the person and why they got tripped up.
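
A toy sketch of that aggregate tally (the stuck-point labels and the more-than-half threshold are made up for illustration):

    from collections import Counter

    stuck_reports = [
        ["clue3-anagram", "clue7-map"],   # where playtest team 1 got stuck
        ["clue3-anagram"],                # team 2
        ["clue5-meta"],                   # team 3
    ]
    tally = Counter(spot for report in stuck_reports for spot in report)
    for spot, count in tally.items():
        if count > len(stuck_reports) / 2:
            print(spot, "tripped up MOST teams - fix it, bug or not")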

After I had playtested everything, in person and at home, I wrote up my feedback notes and sent them in (just this afternoon, actually). Mostly I focused on suggestions for tweaks or changes - stuff that maybe I didn't say while GC was observing my team. And lavish praise, because I loved the game :-)

As for changing something just because a playtester suggested it? That's not automatic, in my book. Again, it would depend on what other playtesters thought about that clue, and what I think of that particular playtester's opinions.


PS: IMHO, the reason the outliers don't matter so much (except for timing) is that you can't write the game to really challenge the top teams, because then the rest of the pack would struggle and not have a good time. And you can't write the game for the slowest/noob teams, because the rest of the pack would be bored. You can take care of the slower/noob teams with hints anyway (though of course, everybody hates taking hints). So by targeting the middle of the pack, you can kind of get the best of both worlds. And if you make the puzzles intricate and elegant enough, with enough built-in "aha's," the top teams are going to have fun too.


One more thing ... when an individual is playtesting remotely - that's not as valuable as when a team is playtesting - remotely or in person - because the team dynamic will absolutely change how the puzzle is solved. Unless you get stuck in group-think - which can happen on a team - you're more likely to hit on the right aha's, and sooner, in a team environment than solving solo. So that's another factor for GC to consider when evaluating playtest feedback. Or, in a trivia-intensive clue, maybe Person A only knows 50% of the trivia - not critical mass for solving - but what are the odds that among four people, they won't have critical mass?
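
To put a rough number on that last question, here's a back-of-the-envelope calculation, assuming (purely for illustration) that each of four teammates independently knows any given trivia item with probability 0.5:

    p_solo = 0.5                   # chance one person knows a given item
    p_nobody = (1 - p_solo) ** 4   # chance nobody on a 4-person team knows it
    p_team = 1 - p_nobody          # chance at least one teammate knows it
    print(p_solo, p_team)          # 0.5 vs. 0.9375

Under that oversimplified independence model, a solo tester covers 50% of the trivia, but the four-person team covers about 94% of it - comfortably past critical mass.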


Melinda Owens: The worst playtesting comment we've ever gotten: "We're afraid this event is going to harm the community!" - said in an obviously pained and hysterical tone of voice. (They obviously meant the puzzle-hunting community - we weren't going to spread toxic sludge around or anything.) That comment became kind of a running joke for us for the rest of the game, although we ended up restructuring the game considerably because of their more specific comments.

We try to incorporate all the feedback we can, especially because with only 2-3 groups playtesting, we don't know how representative their feedback might be. We do like to hear about dead ends and false paths, because even if the playtest team gets through one barrier, a real team might not. It's really easy to say in your head, "This team was just having a bad day," instead of humbling yourself enough to go change your own puzzle. In the end, our egos are less important than the players' experience during the game.


Alexandra Dixon: One thing that I think makes for a good game is if it's written by people who play in a lot of games. You have to think like a player to write good clues.

What's great about this thread is that people in our community (Stanford/BANG) take it for granted that you playtest before you run a game. But that's actually not a given outside of our community.

Here's an example. This is a clue from the 2007 Chinese New Year's Treasure Hunt.

"To: Alice Cromm

I can see, after some puzzlement, that you're simply, as they say in England, "mixed up." I'll give just under 300 to find you (you can bank on it). I know there's a red CAT not far away, but I won't leave you, for you were a source of Family values on that April day."

So, we're going to the 200 block of Commercial, right? Well, no.

Commercial has an odd property: it's a virtual street for its first four blocks. The numbering starts at the Embarcadero, as if Commercial ran east-west through the three Embarcadero Centers, bisecting them. At Battery Street it becomes an actual physical street. That's the 400 block. But if you look at a map, that block looks like the unit block, because the physical street starts there.

Whoever wrote that clue was actually sending us to a location on the 600 block of Commercial, not the 200 block. But we had no way to know that. So we wandered around Embarcadero 2 (the "200 block") until it was obvious we were in the wrong place, and we gave up.

Now, the only way somebody could have made that mistake is if they looked at a paper map that didn't have street numbers on it (or ignored the street numbers if they were on the map), counted the blocks as if Battery->Sansome were the unit block, and then (a) never went to the actual block in person and (b) never playtested!
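
Here's a sketch of the off-by-four-blocks arithmetic (the street facts are from above; the function names and framing are mine, purely to illustrate the error):

    BATTERY_BLOCK = 400  # Commercial's numbering starts at the Embarcadero,
                         # so the first physical block, at Battery, is the 400 block

    def actual_block(blocks_west_of_battery):
        # Correct count: Battery->Sansome is already the 400 block.
        return BATTERY_BLOCK + 100 * blocks_west_of_battery

    def writers_block(blocks_west_of_battery):
        # The mistake: treating Battery->Sansome as the unit block.
        return 100 * blocks_west_of_battery

    # The intended location was two blocks past Battery:
    print(writers_block(2))  # 200: what the clue implies ("just under 300")
    print(actual_block(2))   # 600: where the answer actually was

Which is why every team that read the clue at face value ended up at Embarcadero 2, four blocks away from the answer.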

So on the night of the event, 1500 people had essentially no hope of solving this clue. Having said that, a few teams did solve it by accident, because they walked down Commercial from Chinatown and stumbled on the answer. But our team didn't take that route.

To me, that's just unforgivable. Your customers are not guinea pigs - they're CUSTOMERS. And even if it's a BANG and it's community-based and not-for-profit - they're giving you their time. They're trusting you to make them glad they did. So - you gotta playtest. Anyway, I'm a hypocrite because I still play in the CNYTH every year and I still enjoy it. But not so much for the clues as for the ambiance.

So there you go. When you're watching playtesters, at least be glad that you can watch them, and that they're not just some random person in Des Moines who responded to a craigslist ad and who you hope knows what they're doing. And it's worth getting rid of red herrings, even if your playtesters got past them - if that herring does block some other team later, you'd feel foolish saying "Oh yeah, the playtesters ran into that, but we didn't fix it because... uhm... dang."

Tags: puzzle hunts question
