The Science of Testing

Last week, Marek Jastrzebski¹ and I joined forces on a project at Blackbaud called 'Off the Grid'. One day a year anyone in the company can pick a project that interests them, work on it for 24 hours and give a 5 minute presentation on the experience. This was a truly exhilarating. I found it to be a great opportunity to digress from daily work and try something new. I wanted to do something where we used customer data to provide some kind of feedback loop with our application. My first thought was to apply A/B testing to some of our more lucrative user-facing web forms. That approach took a sharp turn when Marek showed me a blog post he'd read the day before ².

Epsilon-greedy and Epsilon-first Algorithms

Source: Wikipedia

Deciding to present a user one of a set of possible web forms is much like the Multi-armed Bandit³ problem. You have a number of slot machines to select from and you want to maximize your winnings. In the case of a web form, the reward is getting the user to follow the page flow to completion (which in many cases is completing a financial transaction). So given several potentially good web form variants, what's the right way to proceed?

Conventional wisdom says pick two of the best forms and do an A/B test. With the slot machines this is like picking two machines, pulling each arm 100 times, declaring a winner and sticking with that machine. This is a little troubling to me on several fronts. First you have to pick two to start with which seems highly subjective. You have some initial assumptions about what the user might do or find useful. Presumably there's something to like about each possibility you created and no way of knowing if the others might have panned out. The other problem is that there are things that can affect the outcome totally unrelated to the forms you pick. Is it the right time of day, a significant number of tests or the right audience at that time? I think it's really hard to know the data you collected during any given window is the 'right' data. It's not that A/B tests are bad. Surely, it's better to test your ideas and back them up with data - it's just that there has to be a better way.

Epsilon-greedy algorithms take a different approach. You never really stop exploring and you exploit the form that is giving you the best response at that time. There are many ways to tune the algorithm and a few variations on how to implement it, but the simplest case is where 90% of the time you select the choice that is giving you the biggest payout and the remaining 10% you pick a random choice.

It's the worst feeling when the slot machine next to you starts paying off and someone who is not you is jumping up and down excitedly while you're staring at a row of mismatched fruit. So in many ways this is also an exercise in minimizing regret. By continuing to explore the other alternatives you might find that one of those variants are starting to pay off in a way you didn't expect. And if it does, you exploit it.

Venturing Off the Grid

Marek and I started by talking about ideas and throwing a lot of stuff up on the whiteboard. We only had one day to work on this, so we wanted to brainstorm just enough to get started. We thought a lot about what the value would be for the customer and sketched out how to start. Later, we'd find some way to serve web pages, hook into our TeamRaiser² API and register users based on the form the algorithm chose.

One of the cool things about having a time constraint is that it forces you to focus and prioritize on creating the minimal level of quality required for a prototype. Several times during this exercise, we went down paths that could have been time consuming and distracting. Is it necessary for tomorrow's demo? If not we scuttled it. This happened several times on both sides and I think it really helped distill out the essential ideas.

The algorithm was written based on the high-level outline from the blog post. Basically each form records the number of tries, the number of completions and the reward (completions / tries) that each form is currently yielding. We stored this in a YAML file and would log when anything changed. This also allowed us to stop/start the server and continue from where we left off or reset everything to zero and start again.

Testing the Algorithm

Now that we had a web server serving pages using the epsilon-greedy algorithm, we needed a way to track the algorithm's progress and test its decision-making. For our purposes, we decided to use a made-up scenario where there's a new campaign to help provide for pets in a shelter. There are several pages to register through, each having a different animal prominent. We'd let the algorithm figure out which was the best and graph the reward over time.

Since the reward is completions/tries, then it basically represents the percentage of time that a user will go through the process to completion, which is what we want. In order to test that the algorithm was working, we gamed the system to provide a consistent payout for each of the forms. For example, dogs was set to 85%. 85% of the time, a user that saw the dogs page would complete the form. 15% of the time that user would drop out before the process was complete. Cats were set to 75%, Fish to 50% and the Zebra to 25%. (We'll get back to the Gazelle later.)

We ran the test and used Google Charts to graph the results. The test would record any change in reward % to a CSV for each of the forms. After a quick coffee we came back to see the first set of results.

This is interesting. The algorithm bounced around a bit but quickly noticed that dogs were paying out and the others, not so much. The other thing to note is that fish made a run for it and while it didn't get much traction, we realized it *could* have. The form could have gone viral and maybe what we thought was important at the start might have changed. Users are notably mercurial so this wouldn't be surprising. The other thing that stood out is the percentages were flattening out for dogs but still bouncing around for the other pets. This seemed like we just didn't have enough data for the others and that the trends would flatten (which they did).

Now that dogs ruled the roost, we wanted to see what happened when we introduced a new form. Enter the Gazelles. These quadrupeds are quite likable so we gamed them to peg a reward of 98% and watched to see what happened.

Sure enough the algorithm found that it was responding well and the reward continued to rise. We stopped the test to call it a night, but it certainly appears that it was only a matter of time before dogs would give in to the popularity of the gazelles.

Reviewing the Data

We were very pleased at our results and the epsilon-greedy algorithm weighted the forms very close to what we'd gamed them to respond. What was less than ideal was the speed at which the Gazelles responded. It seems like we either got really unlucky or there was a bug somewhere. When we introduced Gazelles it should have almost immediately jumped up to a reward of 100%. There should have been a miss only 2 out of 100 times so the ratio should have climbed much faster than we saw that night.

It was a bug. Turns out there was a problem in the way we were saving off the reward that didn't work consistently. After fixing it and re-running the test we got an immediate response when we introduced Gazelles into the mix.

That's more like it. We introduce a new form, the algorithm likes it, and does what it is designed to do - exploit the hell out of it.

Maximizing value for our clients

This would be a really great way to implement things. It allows our clients to develop several forms and instead of trying to pick the best one, just allow them all to be part of the epsilon-greedy selection. New forms can be dropped in at any time and if it's popular, it will succeed. No longer do you have to make a best guess or exclude other forms.

Most importantly, however, is that it allows people to innovate. You can take more risks because mistakes in this model are minimized. Sure you might occasionally serve a form that doesn't perform as well, but it's a tradeoff for detecting that next new form which might be exactly what your website needed.

See for yourself

I didn't go through all this work just to talk about it though. We have some prototype code and you can get it running quite literally, in minutes if you already have Ruby and Git installed.

https://github.com/convio/epsilon-greedy

git clone git://github.com/convio/epsilon-greedy.git
bundle install
cd into /server and run server.rb
go to the url: http://localhost:4567/test
in another browser tab go to http://localhost:4567/stats

That should start showing you the graphs and of course you can play around with percentages and other settings in the code itself. You might get slightly different results with each run but over time each form should start to fit the expected trend.

At some point I'd love to explore some of the other algorithms in this space. If you google around you can find references to quite a few other solutions to this problem and numerous white papers (for the mathematically inclined).

I think over the next few years we're going to see a lot more in this space. Utilizing customer data and creating a feedback loop to improve what we present to the user is such an awesome thing to be able to do. The more we do these innovations, the more our customers win.

^1.[@rubytester]↩
^2.[20 lines of code that will beat A/B testing every time]↩
^3.[Multi-armed Bandit]↩
^4.[Convio Luminate TeamRaiser for Special Events]↩

I'll be the first to admit it. I have a strong emotional attachment to Watir. I've used it almost every day for the last five years. It's had a profound affect on our engineering team and not just in finding bugs. It has made me a better tester. It has made our engineers respect and value what we do. It has helped me grow technically. It has introduced me to open source. It has a community like no other and has led me to meet and work with amazing people.

It's hard to let go, and yet I know that's the right thing to do.

Diminishing Returns

I joined Convio almost 6 years ago. We needed a tool to automate our UI tests and Watir was an excellent fit. It had a real programming language with Ruby, targeted the browser most used by our clients, and provided a simple yet powerful API. Windows was broadly established in households and IE was king. Over the last several years, however, that landscape has dramatically changed.

Since 2009 Internet Explorer (IE) has been losing market share at the rate of 10% per year and has no sign of slowing up. In March of 2012, IE accounted for roughly 35% of the market with Chrome close behind at 31% and Firefox an even 30%.

IE is no longer the 800 lb. (364 kilo.) gorilla. In fact, out of the 85% of people running some flavor of Windows, half of them opted for a different browser, ditching IE. I suspect that over time, we'll see IE sharing the same market percentage that Safari does now... a meager 4.5%¹

The Rise of Mobile

In the US alone, half of the mobile phones are now smart phones with a 38% increase from last year.²That's a lot of room for growth and these phones are having a profound effect on how the consumer accesses the internet.

That's not even the whole story. According to Opera³, out of the 7 billion people on this planet, 25% have access to a mobile phone and 1.3 billion people use the mobile internet.

Five years ago almost no-one had easy access through a mobile device and now all of my friends have one as do many of the generation before me. I can't imagine traveling without one and I almost feel a little naked without my phone. My 2-year-old daughter has no problem taking it and launching Netflix. These devices are now becoming an ubiquitous part of our daily life.

Changing of the Guard

When you get down to it, we're all working to get a paycheck and keep the lights on at home. For our personal marketability, it's important to know what skills are valuable to employers.

QTP and Mercury have been bleeding out over the last few years with Selenium taking the market by storm. In 2011, Selenium snuck by Mercury and recently eclipsed QTP. Watir has remained on the radar, but clearly never gained the traction that these other tools have managed.

My good friend and colleague, Bret Pettichord, thinks that once employers start realizing this trend, jobs requiring these proprietary tools as a core skill will drop like a rock. I'm inclined to agree with him.

The first message here is the power of open source tools. Being able to have a world-wide community contributing to a codebase is a competitive advantage. These tools are created by the users, for the users and as such are able to meet the needs much better than the shrink-wrap companies that lured in the masses with slick play-and-record tools (though the Selenium's IDE has a familiar smell⁴).

The second, speaks to the needs of the tester. Watir can only drive IE with Ruby on Windows. Selenium provides support for multiple languages on all major OS platforms driving browsers including IE, FireFox, Chrome and Opera. There's also work underway to support Safari. Furthermore, you can drive real mobile devices running iOS, Andoid and Opera browsers.

So Why are we Still Using Watir?

When I returned home from the Selenium conference in 2011, I was energized. WebDriver was just released and it promised to give me access to more browsers. More precisely, I was interested in Watir-WebDriver, the ruby implementation of WebDriver using the same Watir API (Jari Bakken is solely responsible for what has to be the prettiest Ruby I've ever seen). I got home and tried it and was disappointed. It was slow and didn't provide good solutions for a lot of the features that our legacy application relied on, including modal dialogs.

A lot of difference a year can make. A few months ago I tried again and much to my delight, WebDriver was able to handle most things I threw at it. In fact, it has actually surpassed Watir in functionality. Firefox is clearly the darling and gets the most attention with Chrome a close second. I found Firefox to be the most stable and when using WebDriver, so it has become my primary browser. IE is much slower than the other browsers and isn't as reliable but does work (Jim Evans has done an enormous amount of work to improve this driver over the last year).

So where I used to use Watir because Selenium wasn't mature enough, it's now becoming clear that it's no longer the case. Is Watir still better at driving IE? Yes, for sure. But is it better to stick to Watir and turn a blind eye to Selenium? Most definitely not.

A Call to Action

You see, what Selenium needs most is the Watir community. We're the expert users of IE. We can give the feedback they need and make their implementation of IE first-class. If we continue to fall back on the crutch of using Watir, then we'll languish in a browser that is quickly becoming obsolete. It's a mutually beneficial, symbiotic relationship. We can learn from each other and make WebDriver what it needs to be.

Now is the time to move. We need to redirect the Watir community to migrate from Watir to Watir-WebDriver. The API is very close and there's so much to gain. We have code to make the migration easier and if people switch to using Firefox instead of IE by default, it's an easy transition. IE on WebDriver will get better but it needs our help.

There are some who want to use Watir for IE and WebDriver for everything else. Honestly, the people who want to do that can do that today and don't need help from the Watir team. (I personally think trying to maintain two completely different codebases adhering to the same API is abject madness...). We shouldn't continue to support the Watir IE driver. We need to deprecate it and get the Selenium and Watir communities working together to make WebDriver better than anything else out there.

Rising From the Ashes

Driving browsers is no longer the long pole. There's more work to be done, that is certain, but WebDriver has evolved into an industry standard. Members of the Selenium team are getting traction on making WebDriver a W3C standard and pushing browser developers to provide the browser-specific implementations. Firefox and Chrome have already committed to this future and they're working on Apple and Microsoft to do the same.

I'm a core developer for Watir and a few weeks ago I renamed Watir to Watir-Classic. Watir is being reborn as a container gem. "By testers, for testers" is the new motto and it's going to start with moving to Watir-WebDriver in Watir 4.0. Down the road, the Watir gem will bundle what we think are the best set of tools to make testers successful. We're moving from driving a browser to developing standard ways to create and maintain a test framework. We're going to teach the Selenium team what we've learned about testing and share with them the community we've been able to develop and nurture.

It's exciting.

Update (2012/05/21)

Chrome just passed IE to become the most-used browser according to StatCounter: http://bit.ly/JKpAE5.

^{1. [w3schools.com]↩}
^{2. [blog.nielsen.com]↩}
^{3. [Andreas Tolfsen on Mobile Testing at SeConf 2012]↩}

^{4. [Bret Pettichord SeConf 2011 Keynote]↩}

Monday, June 4, 2012

Off the Grid and Epsilon-greedy Testing

Monday, May 21, 2012

Watir and the Pheonix

The Rise of Mobile

Changing of the Guard

When you get down to it, we're all working to get a paycheck and keep the lights on at home. For our personal marketability, it's important to know what skills are valuable to employers.

My good friend and colleague, Bret Pettichord, thinks that once employers start realizing this trend, jobs requiring these proprietary tools as a core skill will drop like a rock. I'm inclined to agree with him.

It's exciting.

About Me