Monday, April 7, 2014

Comparing my predictions

Before starting the usability test, I made some predictions for how things would go, what themes I would uncover, and how users would respond to the tasks. Here is a quick comparison of my predictions to what actually happened:

The 4 (now 5) themes will continue
I definitely saw a continuation of the 5 themes: Familiarity, Consistency, Menus, Obviousness, and Unexpected Behavior. The "hot corner" remained an issue for several testers, although it wasn't as prevalent as in my previous usability test. Fortunately, when participants hit the hot corner, most were able to get back to the program quickly, before I stepped in to do it for them. But in general, this unexpected behavior was confusing and startling. 
Consistency was an important theme. Testers frequently said that because all programs looked more or less the same, they were able to apply what they learned in one program to the other programs. But a challenge was Familiarity; GNOME just isn't very much like Mac or Windows, so everyone had to re-learn how to use GNOME. 
Obviousness was also important, and in fact some testers thought that they hadn't completed an action because GNOME didn't give a very obvious indication that their action had an effect. A specific example is find and replace in Gedit. Menus were also an issue, but I'll talk more about that next.
Users will be confused about menus
While testers were confused about having menus split across two different "areas" (the Gear menu and the Application menu), they weren't as confused as I thought they would be. Several participants commented that the Gear menu (which they called "Options" or some similar name) seemed to hold "program actions" while the Application menu had more "top level" actions. But overall, I'd still say that users were confused by the split menus.
Users will think the test didn't go well, even if they were able to complete all the tasks
I thought this would be the case, and I made a big deal out of it in my predictions, citing rhetoric principles. But it turns out that users thought the overall test went well if the last program went well. (This is an example of primacy-recency, a different principle of rhetoric.)
Specific problems with applications and tasks
I said this would be exhibited by specific problems with applications and tasks, such as: changing the default font in Gedit and Notes, creating a shortcut to a folder in Nautilus, searching for a file that's not under the current folder tree, bookmarking a website in Web, increasing the font size in Web, and deleting all notes in Notes. And for the most part, I was right. Look at the heat map to see how testers fared in these tasks. 
"Hot" areas include: changing the default font in Gedit and Notes (as predicted) and creating a shortcut to a folder in Nautilus (as predicted). Other less critical (but still important) usability issues include: find and replace in Gedit, bookmarking a website in Web (as predicted), increasing the font size in Web, and installing programs in Software. 
While I thought users would have a problem with Selection Mode in Notes, users didn't experience problems here. But importantly, neither did they use Selection Mode. When asked to delete all notes, testers invariably right-clicked a note, then clicked the Delete button - then repeated the process for the second note. Testers didn't seem to see a need to use Selection Mode to delete multiple notes.
Problems caused by possible bugs
Fortunately, I didn't experience any unexpected bugs. One problem I knew about beforehand is that if you install a program (Robots) using Software, then immediately try to remove the same program, you get an error. You seem to have to exit the Software program after installing a program, then re-launch Software before you can remove it. That seems like a program bug. As a workaround, I asked testers to exit the Software program before they tried to delete the Robots game, so we avoided that bug in Software.

Outline of my paper

This has been a very busy week, doing the usability tests and the analysis to identify trends. I'm using a heat map like last time, which I think shows the trends more clearly. This heat map also shows the same tests from the previous usability test, so I can start a comparative analysis.

I've also shared this with the folks from GNOME. We're going to try for a Hangout in the next week or so to go over the results in more detail, discuss themes, and talk generally about how things went.

From here, my next step is to finish the next draft of my paper. I think the next draft will be much closer to the final form (which is good, based on where we are in the semester). The previous draft reused some text from my blog, so it had a very informal tone. The next draft will step that up.

Since this is for my M.S. capstone, this iteration will be a much longer and more detailed document than the paper I wrote for a previous class on usability (later turned into an article for Linux Journal). My outline is currently shaping up this way:

  1. Introduction
  2. What is usability
  3. Usability needs to be an integral part of the development process
    1. Context
    2. Development methods in closed software vs. open source software
  4. Why is usability overlooked in open source software
  5. My usability test
  6. Usability test results
  7. Themes and conclusions

Venues?

After I have something more concrete to share, I'd love to get my results out there somehow. Academic journals are a good place to share. And I'm already talking to Linux Journal about a follow-up to my article from the December 2013 issue. I think I'll also submit it to Slashdot and Reddit. I wonder if the FSF would be interested in something specific to GNOME, for their newsletter? Maybe if I write a version that is more general. I'd also love to write a version of my article for OpenHatch and other organizations that are interested in open source software.

I'm also starting to think about conferences where I could talk more about this. I missed LibrePlanet, but maybe next year. Other conferences to consider are Penguicon and GUADEC.

What other venues would you suggest?

Friday, April 4, 2014

The usability of GNOME 3.10 & 3.12

This is only a preview of my results, just to share some immediate analysis. I'll provide more details later.

I scheduled 15 people for the usability test of GNOME 3.10 & 3.12. Three people were unable to make it, so I had 12 test participants. From Nielsen, we know you only need about five testers to get useful results; after about five testers, you have enough information to see the themes and trends. I definitely saw that here: after the first five or six testers, the trends were clear. With 12 participants, that's more than twice the minimum needed to expect good results.
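If you're curious why five testers is usually enough, Nielsen and Landauer model problem discovery with a simple formula: with n testers, you find roughly 1 - (1 - L)^n of the usability problems, where L is the chance that any one tester runs into a given problem (their studies put L at about 31%). Here's a quick back-of-the-envelope sketch of that math; the tester counts are just illustrative:

    # Nielsen and Landauer's problem-discovery model: with n testers, the share
    # of usability problems found is roughly 1 - (1 - L)^n, where L is the
    # chance that one tester hits a given problem (about 31% in their studies).
    L = 0.31

    for n in (1, 3, 5, 12):
        found = 1 - (1 - L) ** n
        print(f"{n:2d} testers -> about {found:.0%} of problems found")

Run that and you'll see about 84% of problems surface by five testers, and close to 99% by twelve, which is why the returns diminish so quickly.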

So, what were the trends in this test? As before, I prefer to use a "heat map" to display the results. During the usability test, I typed up notes on a laptop while I observed the test participants doing the scenario tasks. I captured quotes as the test participants used the "speak aloud" method to describe their process and thoughts, but more generally I noted what paths the testers took as they attempted the tasks. What menus did they click into? What buttons caught their attention? At the end of each task, I indicated whether the tester had completed the task, and how difficult it was for them to do.

In the heat map below, a green box indicates a tester completed a task with little or no difficulty, while a yellow box represents a task that had moderate difficulty. A red box indicates a task that was extremely difficult to do, or was completed incorrectly or by accident. Where the tester was unable to complete the task, or chose to abort the task (usually believing it to be impossible to complete), I use a dark red (maroon) box. A white box indicates a task that was skipped.

This is just a quick screenshot from a spreadsheet:


You may remember that I performed a similar usability test of an earlier release of GNOME, about a year and a half ago. That test involved Gedit, Firefox, and Nautilus - and included seven test participants. I've represented those results to the right, matched against this test of Gedit, Web, and Nautilus. This shows where programs fared better, worse, or about the same between the earlier version (on the right) and versions 3.10 & 3.12 (on the left).
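If you want to build a similar heat map for your own usability test, here is a minimal sketch of how the color-coding could work. This isn't the spreadsheet I used; it's a small matplotlib example, and the outcome codes, task names, and data are all hypothetical:

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import ListedColormap

    # Outcome codes matching the color scheme described above:
    # 0 = little or no difficulty, 1 = moderate difficulty,
    # 2 = extreme difficulty or completed incorrectly,
    # 3 = unable to complete or aborted, 4 = skipped
    cmap = ListedColormap(["green", "gold", "red", "maroon", "white"])

    # Hypothetical results: one row per scenario task, one column per tester
    results = np.array([
        [0, 0, 1, 0, 2],   # e.g. "Save a file in Gedit"
        [1, 3, 0, 4, 0],   # e.g. "Change the default font"
        [2, 0, 0, 1, 3],   # e.g. "Create a bookmark in Nautilus"
    ])

    plt.imshow(results, cmap=cmap, vmin=0, vmax=4)
    plt.xticks(range(results.shape[1]),
               [f"Tester {i + 1}" for i in range(results.shape[1])])
    plt.yticks(range(results.shape[0]), ["Task 1", "Task 2", "Task 3"])
    plt.title("Usability test heat map (hypothetical data)")
    plt.savefig("heatmap.png")

The spreadsheet approach works just as well, of course; the point is simply that each cell is one tester doing one task, colored by how the attempt went.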

I'll provide an analysis in a future post. Comparing these results to my notes, there's lots to be said about how testers actually use GNOME, and how they expect to use GNOME. But I'll have to save that for a future post.

Sunday, March 30, 2014

Predictions for the usability test

An important aspect of experimentation is to test a theory. And while the purpose of a usability test is to examine how real users interact with a system, this is my second usability test and I would like to make my predictions for how users will respond to GNOME 3.12. To be clear, this is not meant as a negative criticism of GNOME, but simply my own heuristic review of how users will respond to the GNOME design patterns.

Let's lay some groundwork. From my previous usability test, I found four themes of successful usability in GNOME:

1. Familiarity
Testers commented that the programs seemed to operate more or less like their counterparts in Windows or Mac. For example, Gedit isn't very different from Windows Notepad, or even Microsoft Word. Firefox looks like other web browsers. Nautilus is quite similar to Windows Explorer or Mac Finder. To some extent, these testers had been “trained” under Windows or Mac, so having functionality (and paths to that functionality) that was approximately equivalent to the Windows or Mac experience was an important part of their success.
2. Consistency
User interface consistency between the programs worked strongly in favor of the testers, and was a recurring theme for good usability. Right-click worked in all programs to bring up a context-sensitive menu. Programs looked and acted the same, so testers didn't have to “re-learn” how to use the next program. While the toolbars differed, all programs shared a familiar menu system that featured File, Edit, View, and Help.
3. Menus
Testers preferred to access the programs’ functionality from the menus rather than via “hotkeys” or icons on the toolbar. For example, the only toolbar icon that testers used in the Gedit scenarios was the Save button. For other scenarios, testers used the drop-down menus such as File, Edit, View, and Tools.
4. Obviousness
When an action produced an obvious result, or clearly indicated success—such as saving a file in the editor, creating a folder in the file manager, opening a new tab in the web browser—testers were able to quickly move through the scenarios. Where an action did not produce obvious feedback, the testers tended to become confused. The contrast was evident when trying to create a bookmark or shortcut in the Nautilus file manager. In this case, Nautilus did not provide feedback, failing to indicate whether or not the bookmark had been created, so testers were unsure if they had successfully completed the activity.
And while it wasn't part of the test, users experienced problems with the "Activities" hot corner. I can generalize this into a fifth theme:

5. Unexpected behavior
Those who experienced the "hot corner problem" typically did so right away, in the first exercise, when attempting to use the program menus. While testers were able to recover from the hot corner, it definitely caused disruption several times throughout the test. None of the testers had used GNOME before, so they had no prior experience with it. The programs they were evaluating in the usability test were running "maximized," so when they needed to use the program menus (say, the "File" menu) they would naturally move the mouse toward the upper left corner of the screen, to the "File" menu - and then "overshoot" the menu, activating the hot corner. Suddenly, the screen changed completely: all the program windows from the test were represented as smaller "tiles" on the screen. Testers exhibited clear confusion, irritation, and surprise at this unexpected behavior, even when they experienced the "hot corner problem" multiple times throughout the test.
My question in this usability study is: How well do users navigate or understand the new design patterns in GNOME? The differences in GNOME between version 3.4 (my previous usability test, using Fedora 17) and GNOME 3.10 (Fedora 20) & GNOME 3.12 (latest release) appear to be largely "under the hood." With the exception of Gedit, the user interface differences appear minimal. Perhaps the largest change is the loss of "title bars."

My predictions:

The 4 (now 5) themes will continue
Despite UI differences from GNOME to Mac or Windows, I believe testers will comment that the programs are somewhat familiar. Familiarity will likely be the strongest theme: Gedit is not too different from Notepad, Web is similar to Firefox or Internet Explorer, Nautilus is not unlike Windows Explorer or Mac Finder, etc. Notes may be a new experience, however. But the fact that all programs act similarly, and share the same design patterns, will make the programs easier to learn. Once you figure out one program, you can apply that new knowledge to the other programs. The other themes will similarly continue. Obviousness will get positive comments due to improved messages and feedback in GNOME 3.10 & 3.12. But I expect usability issues with the loss of obvious menus, and more "hot corner" problems.
Users will be confused about menus
In my previous usability test, menus were an important part of usability. Menus were also a theme of good usability, because GNOME 3.4 still used menus for most actions. But in GNOME 3.10 & 3.12, actions have been moved into a "gear" menu. Some "top level" or "global" actions may only be accessed from the application menu. The loss of these obvious menus will likely cause problems during the test. I expect users will discover some functionality by clicking on the "gear" menu. Having found that menu, I don't know that users will experiment with the application menu. So any functionality that must be accessed through the application menu may be rendered unusable.
Users will think the test didn't go well, even if they were able to complete all the tasks
This prediction comes from visual rhetoric, and it's probably the most important one. In rhetorical discourse, if you are performing a rhetorical ritual and the audience feels uncomfortable, it's because you left out an important part of the ritual. 
An easy example is a wedding ceremony. There's nothing that requires the use of "I take thee" vows. A wedding is a form of a legal contract, and you need to have some language in there that defines the wedding, but "I take thee" statements are not part of that contractual ceremony. But they are part of the traditional ceremony, the ritual. If you leave out the "I take thee" vows, it's pretty much guaranteed that family and friends in the audience will feel uncomfortable, as though something was missing. They may not be able to figure out what was missing, but they will know something got left out. And that's when some family and friends may begin to wonder "did they really get married?" 
I think it's the same with this usability test. The loss of certain visual cues, the missing direction provided by visually rhetorical devices (such as "title bars" and other common desktop metaphors) will make the testers uncomfortable. At the end of the test, when I ask how things went, I predict a common theme where users initially say they aren't sure, then shift their answer to "it didn't go well." And I think that will happen even if the testers were able to complete all of the scenario tasks. 
This isn't to say that user interfaces can't ever change, that they need to remain static. But it does suggest that interface designers need to pay close attention to these visual cues. Changes in these areas must be done with care. Sometimes you can get away with total change by shifting the context (such as the new interface on tablets, versus the traditional "menus and windows" interface on desktops and laptops - most people weren't surprised when this totally different computer had a totally different interface). But in general, if it's a visual cue, a rhetorical part of the desktop interaction, you can only change it in stages.
Specific problems with applications and tasks
I predict that "training" in other desktop environments (Windows or Mac) will translate into expectations about how things should behave. In addition, while I attempted to reflect common user actions in the scenario tasks, I feel some of these actions happen so rarely in other desktop environments that the testers will be unfamiliar with how they might do them under a different environment. (How often do you change the default font in a text editor, or a sticky-notes program? Maybe once, when you first start using it, but never after that.) This will be exhibited by specific problems with applications and tasks, such as: changing the default font in Gedit and Notes, creating a shortcut to a folder in Nautilus, searching for a file that's not under the current folder tree, bookmarking a website in Web, increasing the font size in Web, and deleting all notes in Notes.
Problems caused by possible bugs
It's possible that we'll uncover a few bugs along the way, and users may experience usability issues because of programmatic errors. One problem I've found is that if you install a program (Robots) using Software, then try to remove the same program, you get an error. You seem to have to exit the Software program after installing a program, then re-launch Software to remove it. That seems like a program bug. As a workaround, I may ask testers to exit the Software program before they try to remove the Robots game, so we don't trigger the bug in Software.

Saturday, March 29, 2014

Institutional review

When you do your own usability tests, it's a good idea to gather some information about who is doing the test. Are your testers mostly men, mostly women, or a mix? What are the ages of your testers? (Sometimes, programs that are easy for one age group to use may be difficult for another age group.) How much previous knowledge do your testers have about the programs they will be using during the test?

And when you ask these questions in a higher education research setting, as mine is, you need to go through what's called an Institutional Review Board. As the name implies, the IRB reviews your research project proposal to make sure it doesn't harm the participants, and that any information that you gather doesn't put the participants at risk. My IRB review was pretty straightforward: a usability test doesn't harm the test participants or put them at risk, and I'm only asking a few informational questions that don't identify the person participating in the test.

Here are the questions I'm asking each tester to answer, before we do the test:
1. Your age: (please mark one)
  • 18 - 24
  • 25 - 34
  • 35 - 44
  • 45 - 54
  • 55 - 64
  • 65 - 74
2. Your gender: (fill in the blank)
3. Please indicate (circle) your level of computer expertise:
  1. I know very little about computers
  2. I know some things, but not a lot
  3. I am pretty average
  4. I am better than most
  5. I am a computer expert

My wrap-up script

Just as important as the "welcome" script, I need to have a "wrap-up" script. This makes sure I don't forget anything, and provides a good opportunity for gathering final reflections and thoughts.

Here is my wrap-up script:
Thank you for your time.

Let’s talk about the tasks you did today. Generally, did the tasks seem to get easier, more difficult, or about the same as you went through them?

  • When did things start to seem easier/harder?
  • What got easier/harder, and why did it suddenly seem so?

What are some themes or common areas where you thought things went really well, or were more difficult to do?
Give parting gift. (As a "thank you," I am giving out $5 gift cards to our local coffee shop, Higbies.)