Saturday, September 10, 2016

Wrap-up from this cycle of Outreachy

Now that all interns have completed their work, I wanted to share a few final thoughts about this cycle of Outreachy. Hopefully, this post will also help us in future usability testing.

This was my third time mentoring for Outreachy, but my first time with more than one intern at a time. As in previous cycles, I worked with GNOME to do usability testing. Allan Day and Jakub Steiner from the GNOME Design Team also pitched in with comments and advice to the interns when they were working on their tests and analysis.

As before, I structured the usability testing internship along similar lines to an online class. In fact, I borrowed from the usability testing internship schedule when I taught CSCI 4609 Processes, Programming, and Languages: Usability of Open Source Software, an online elective computer science program at the University of Minnesota Morris. We first learned about usability testing, then we practiced usability testing. In this way, the internship was broken up into two phases: a research phase and a project phase.


With three interns, we decided from the beginning that we should focus on three different areas for our usability tests. In previous cycles, with only one intern at a time, we conducted traditional usability tests that spanned several common design patterns used in GNOME. In this cycle, we wanted to continue with a traditional usability test, but we wanted to explore other areas of GNOME usability.

Based on the previous experience and interests of Renata, Ciarrai and Diana, we assigned them to do these three usability tests:

  1. Renata: A traditional usability test of ongoing development in GNOME.
  2. Ciarrai: A paper prototype test of the new GNOME Settings.
  3. Diana: A first-time GNOME User eXperience (UX) test.

A paper prototype test uses a slightly different test design from a traditional usability test, but the analysis is the same. In a paper prototype test, as in a traditional usability test, you ask a tester to respond to various scenario tasks. But in a paper prototype test, the tester is acting against a paper mock-up of the design. In this case, Ciarrai used screenshot mock-ups of the updated GNOME Settings.

Renata led a traditional usability test. In her test, we asked Renata to further examine areas that had improved based on the last cycle of GNOME usability testing. This allowed us to measure any usability gains in the design.

Diana performed a different kind of test. Technically, a user experience test is not a usability test; usability and user experience are different concepts. Usability is how easily users can use the software to accomplish real tasks in a reasonable amount of time. But user experience is more about the emotional engagement of the user when using the software. It's possible for a program to have good usability and poor UX, or for a program to have poor usability but good UX, but in most cases you will find a positive correlation between usability and user experience. Most of the time, if a program has good usability, it will likely have positive UX. And if a program has poor usability, it will likely have negative UX.


When planning any usability test, it's important to start with an understanding of who uses the software. Typically, a project should have these documented as personas, fake personalities that reflect real-world users. With personas, a project can discuss design changes in a more concrete way. Rather than a designer or developer planning a change "because it's cool" or "because I like it," they can plan changes that will help the "Sally" persona do her work, or will make the "Stephen" persona more productive, or will allow the "Chris" persona to more easily work with the software.

Once you understand the users, you need to define how the users will use the software. These are the use scenarios. Users might have different needs that carry different expectations. For a word processor program, one user might use the program to write a report at work, while another user might use the program to write a research paper (with footnotes!) for class, while another user might use the program to jot a quick letter to a friend.

From this understanding of use scenarios, you can then create a usability test. In most usability tests, the test is usually comprised of scenario tasks, which you give one at a time to each tester. The scenario task should set up a brief context for the task, then ask the tester to do something specific. The task needs to be general enough that the tester has latitude to complete the task the way they would normally do it, yet specific enough that both the test moderator and test volunteer would know when the task is completed.

Take notes while your testers use the software. This is usually made easier if you ask each tester to speak aloud what they are doing or what they are looking for (if they are looking for a Print button, they should say "I need to print, and I'm looking for a Print button"). Also note any comments they make while working on each task ("This seems difficult" or "I can't seem to find the Print button" or "That was easy").

And that's the process Diana, Renata and Ciarrai followed when they did their tests. Diana's UX test was a little different, because UX is not the same as usability so required a slightly different test design, but still she asked testers to explore GNOME via three broad scenario tasks.

After the test, you collate your notes and analyze your data. For a traditional usability test, I find it is easiest to start with a heat map. I discuss heat maps elsewhere on my blog. From the heat map, it's easy to look for "cool" and "hot" rows, which reflect scenario tasks that were easy for testers or more difficult. This is often my first step to examine themes.

Both Renata and Ciarrai used heat maps in their analysis. Ciarrai's heat map analysis method was a variant on the typical method, due to their test being a paper prototype test. The testers didn't interact with a live version of the software, but a paper mock-up using screenshots of the new design. So Ciarrai's analysis was made more difficult, but they did a great job.

For Diana's test, I asked her to try a different analysis method. After each tester explored GNOME, she asked them to describe their initial reaction (their first few minutes) to GNOME by pointing to an emoji. Diana had prepared a list of ten emojis that ranged from "ewww" to "happy." She also asked testers to describe their experience at the end of the test using the same emoji scale. Diana then followed up with a brief interview designed to explore how testers perceived GNOME.

Notes on the UX test for next time:

I have mentored several traditional usability test before, as part of Outreachy and as an instructor for CSCI 4609, so we had a lot of prior experience going into Renata's traditional usability test and Ciarrai's paper prototype test. That's probably why both went very well. But this was our first time doing a user experience test, and while I think we got some useful results from it, I think we can improve the user experience test the next time we do a UX test.

First, I think we need more testers in the user experience test. In most usability tests, you don't need very many testers to uncover most problems and get actionable results. This assumes that each tester will uncover many of the same problems found by the next tester. Some usability researchers claim you only need to test with five testers. I agree with that, but you need to remember the assumption: each tester uncovers many of the same problems the next tester will find. Nielsen finds a single tester will uncover about 31% of the problems in a single usability test.

But in UX testing, you aren't uncovering usability problems. Instead, you are examining how real people respond to the software. You are looking for something akin to emotional attachment, not ease of use. So five testers is not enough. Diana initially included five testers for her UX test, and later expanded to seven. Even that isn't enough. I think the next time we do a GNOME user experience test, we will want around twenty testers.

We should also closely examine the number of emoji that testers respond to. I worked with Diana to select ten emoji that span a range of emotions. I based this test design on a common contemporary UX test design, including that used in an article in the Washington Post where the newspaper asked readers to respond to political events (specifically, the Republican National Convention) by selecting from a list of emoji. Over a thousand readers responded in that case. With such great numbers, the results quickly narrowed down to show the top one or two emoji responses for each political moment. For example, when responding to "I declare Trump and Pence to be nominees for President and Vice President," the responses tailed off quickly: 553, 334, 90, 79, 70, … That makes it easy to identify the one or two dominant reactions.

The next time we do a user experience test, increasing the testers will help us to consistently identify emotional response via emoji. We might also reduce the number of emoji, so fewer testers have fewer options to choose from. At the same time, we need to be careful not to restrict the responses too much or we risk artificial results.

To offset this, and in addition to emoji, we might also ask the testers to pick three or five adjectives from a list that describe their reaction to the test. We can use synonyms of various emotions, so testers have a wide range of words with which to describe their reactions, yet we can combine results more easily. For example, synonyms of "confused" include baffled, bewildered, befuddled and dazed. If a tester felt confused at the beginning of the UX test, they might pick three adjectives "baffled, bewildered, confused" that would give a score of +3 to "confused." Similarly, synonyms of "happy" include cheerful, contented, delighted and ecstatic.

A few thoughts for future mentors:

Mentoring in Outreachy is a very rewarding experience, but it's also a lot of work.

When mentoring three interns in this cycle in Outreachy, I assumed the effort would be about the same as my online class with ten students. The research phase was certainly straightforward; we were researching the same topics. But once we entered the project phase, and Diana, Renata and Ciarrai each started work on their own projects, I was constantly switching gears as I reviewed each intern's work and provided direction and guidance. I think I did a good job here, but it came with a ton of effort on my end.

As weeks progressed in the project phase, I began to rely more heavily on Jakub and Allan to help me out. I seemed to constantly send them reminders to review that week's work and respond to the interns. I hope I haven't made a nuisance of myself or pushed too hard.

Looking back, mentoring three interns on three different usability tests was more work than my online class with ten students because the three interns were working on completely different projects, while students in my CSCI 4609 class worked on similar projects. In my online class, students had different starting points in their Personas and Use Scenarios, but they worked together on the same program for a "dry run" usability test. For the final project, students could pick any open source program that appealed to them, but in the "dry run" I picked the program for them to examine (based on common interests we explored in the first few weeks of class). This meant most of my energies as professor were pointed in the same direction. But in mentoring three projects for Outreachy, I was always getting pulled in a different direction.

I'm not sure how my experience compares to a project that focuses on coding. At a guess, I think it's not very comparable. Your mileage may vary.

My guess is the effort to mentor several interns in a coding project is probably the same as being the project maintainer for any active open source software project. I've been involved in open source software since 1993, and an active developer since 1994, so I know the effort isn't too great if you are reviewing code patches for a codebase that you are very familiar with. As maintainer of the program, you know where the puzzle pieces go, and you should have a good idea what the side effects may be for a new patch.

If you volunteer as a mentor to a future cycle in Outreachy, make sure to plan your time very carefully. Be aware of time commitments and how your work in Outreachy will intersect with other things you want to do. I over-committed myself, for example. I can't do that again.

A few thoughts for future interns:

One of the steps in your application to Outreachy is the first contribution. For the usability testing internship, I asked applicants to pick a few scenario tasks from previous usability tests, and do a usability test with a few test volunteers, then write a short article about your first contribution. I intentionally asked applicants to test with only a few users, usually less than five, because I wanted them to see what usability testing was like before they committed to the internship. Three isn't enough testers to get actionable results, so I think I will increase this to five testers next time.

I find that this first contribution really does provide an indicator to later performance. Applicants who did well in their initial contribution did well in their internship. We were able to predict performance (to some degree) based on their initial contribution. Diana, Ciarrai and Renata were accepted into the program because their initial contributions represented good work. Other applicants to the program who weren't selected may have done a good job but were not selected due to space, while others didn't seem to take the first contribution very seriously (they thought it was "busywork") or they waited until the last moment to do their initial contribution, leaving no time to evaluate their work before we had to make a decision.

If you plan to apply for a future cycle in Outreachy, don't forget the initial contribution. Take it seriously. Reach out to the project's mentors and discuss the initial contribution and how to approach it. I know I took into consideration each applicant's engagement and relative success in the initial contribution, and that mattered when selecting interns for the program.
image: Outreachy

No comments:

Post a Comment