Conducting a usability study with eye-tracking to determine how users make decisions on Yelp

The problem
It’s unclear which features of a business’ Yelp page influence consumer decisions most.
Our results
Photos, average rating, and number of ratings are the most influential features.
brief
Context
Yelp is a local-search service powered by crowd-sourced reviews.
This was a project for CSC 486 Human-Computer Interaction Theory and Design.
Team
Takahiro Shimokobe
Marianne Miranda
Paula Zitnick
Role
My major contributions were in the design of the main study experiment, administration of the experiment, and analysis of main study data.
Tools
SMI RED250 eye tracking hardware & accompanying software
Deliverable
Report
Time
January – April 2019
preliminary study
Objective
To determine what features to study with eye-tracking, we ran a preliminary study to surface features that were statistically significant.
Method
We created a Google Form survey asking respondents to measure feature importance with semantic differential questions.

Survey measuring importance of features.
Results
We found that the most important features are:
average rating (number of stars)
available full menu
number of reviews
photos
133 respondents
90% of respondents used Yelp at least once a month
process
Method
6 participants were presented with 3 sets of 2 businesses' Yelp pages and asked to choose one to patronize.
We tracked their gaze while they compared the pages.
After they made a choice, we asked them which features had informed it.
Design considerations
Bias
Businesses chosen for comparison were far away from the testing location to minimize the likelihood that participants had existing opinions about them.
Confounding variables
Each pair of businesses had similar values for each feature, except for the feature being tested.
Experiment design
Test cases
We chose the top 3 most important features discovered in our preliminary study to test.
Each pair of businesses had similar values for every feature except the one being tested.
Case 1. Number of reviews
Option 1. 120 reviews; 4.5 stars; menu available
Option 2. 47 reviews; 4.5 stars; menu available
Case 2. Average rating (number of stars)
Option 1. 5 stars; 209 reviews; menu available
Option 2. 4 stars; 209 reviews; menu available
Case 3. Menu availability
Option 1. menu available; 4 stars; 137 reviews
Option 2. menu not available; 4 stars; 137 reviews
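The single-variable design of the three cases can be checked mechanically. The sketch below is illustrative only: it transcribes the review counts, star ratings, and menu availability listed above into a hypothetical `CASES` structure and confirms that each pair differs only in the feature under test. The real Yelp pages also had features not listed here (photos, opening hours, and so on), which this check cannot cover.

```python
# Test cases transcribed from the study design above.
# The dictionary layout and key names are assumptions made for illustration.
CASES = {
    "number of reviews": (
        {"reviews": 120, "stars": 4.5, "menu": True},
        {"reviews": 47, "stars": 4.5, "menu": True},
    ),
    "average rating": (
        {"reviews": 209, "stars": 5.0, "menu": True},
        {"reviews": 209, "stars": 4.0, "menu": True},
    ),
    "menu availability": (
        {"reviews": 137, "stars": 4.0, "menu": True},
        {"reviews": 137, "stars": 4.0, "menu": False},
    ),
}


def differing_features(option_1, option_2):
    """Return the sorted feature keys whose values differ between two options."""
    return sorted(key for key in option_1 if option_1[key] != option_2[key])


for tested, (option_1, option_2) in CASES.items():
    # Each case should report exactly one differing feature.
    print(tested, "->", differing_features(option_1, option_2))
```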
Hypothesis
We predict that Option 1 will be chosen in each case, since it has the more favorable value within each pair.
That is, businesses with more reviews, a higher average rating, and an available menu will be chosen.
Analysis
We compared each participant's choices with (1) the features they reported as influential and (2) their eye-tracking data. We interpreted these results for each of the 3 cases.
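One way to organize this three-way comparison is to record, for each participant and case, whether they looked at the tested feature, whether they chose the predicted option, and whether they reported the feature as influential, then count the combinations. The sketch below assumes a hypothetical `Observation` record. Its example rows are placeholders, not the study's data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One participant's outcome for one test case (hypothetical structure)."""
    participant: str
    case: str
    looked_at_feature: bool   # from the eye-tracking data
    chose_predicted: bool     # from the participant's choice
    reported_feature: bool    # from the follow-up question


def tally(observations):
    """Count observations per (case, looked, chose, reported) combination."""
    return Counter(
        (o.case, o.looked_at_feature, o.chose_predicted, o.reported_feature)
        for o in observations
    )


# Placeholder records for illustration only; not the study's data.
example = [
    Observation("P1", "case A", True, True, True),
    Observation("P2", "case A", True, True, True),
    Observation("P3", "case A", True, False, False),
]

counts = tally(example)
for combination, n in sorted(counts.items()):
    print(combination, n)
```

Combinations where the three signals disagree, such as looking at the feature and choosing the predicted option without reporting the feature, are the ones worth following up with the participant.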
Faulty approach
Initially, we had hoped to simply measure which features participants looked at (to gauge each feature's influence). This was ultimately ineffective: no significant differences were found, since most participants looked at most features.
Silver linings
This made our collection of participants' self-reported influences useful in identifying correlations between their eye-tracking data and choices.
Alternate approach exploration
We also explored using the proportion of time each participant looked at a feature, relative to the rest of the page. However, this was also not a good measure due to variation in detail (e.g. it takes longer to read a review than to read the number of stars).
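The proportion-of-time measure can be sketched as follows. This is a minimal illustration, not the SMI software's export format or API: it assumes fixations have already been mapped to named areas of interest and are available as hypothetical `(area, duration_ms)` pairs. The example values are made up.

```python
from collections import defaultdict


def dwell_proportions(fixations):
    """Return each area of interest's share of total fixation time.

    `fixations` is an iterable of (area_of_interest, duration_ms) pairs.
    """
    totals = defaultdict(float)
    for area, duration_ms in fixations:
        totals[area] += duration_ms

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # no gaze time recorded, so no proportions to report
    return {area: total / grand_total for area, total in totals.items()}


# Made-up fixation durations for one participant on one page.
example_fixations = [
    ("reviews", 400),
    ("stars", 100),
    ("photos", 300),
    ("reviews", 200),
]

proportions = dwell_proportions(example_fixations)
for area, share in sorted(proportions.items()):
    print(f"{area}: {share:.0%}")
```

In this made-up example the review text takes the largest share simply because it takes longest to read, which is the weakness described above.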
Results
Case 1. Number of reviews
All participants looked at both review counts.
5 of 6 participants chose the predicted option.
The other participant chose the non-predicted option because that business had more authentic-looking food in its photos.
Case 2. Average rating (number of stars)
All participants looked at both average ratings.
4 of 6 participants chose the predicted option and reported average rating as an influential factor.
The other 2 participants chose the non-predicted option because of opening hours and the quality of the business interior.
Case 3. Menu availability
3 of 5 participants looked at menu availability on both pages and did not choose the predicted option.
1 of 5 participants chose the predicted option but did not look at menu availability. They cited wanting the food.
0 participants reported menu availability as an influential factor.
findings
Discoveries
We discovered that users reported looking at pictures (27%), ratings (19%), reviews (19%), and number of reviews (15%).
Users spent the most time looking at reviews (27%) and photos (22%).
There was only one case where the participant looked at the feature being tested, chose the predicted option, and then did not report the feature as influential. They cited the quality of the business interior instead.
Iterating on the preliminary study
While photos were ranked fourth most important in the preliminary study, we found that users spent a large share of their time looking at them (22%, second only to reviews).
The average rating and number of reviews features, found to be important in the preliminary study, were confirmed by the main study.
The importance of menu availability was not confirmed, possibly because of ambiguous wording. Survey respondents may have misunderstood the menu feature as photos of the menu.
Conclusions
We found that there was no correlation between duration of gaze and the importance of the feature, due to variation in detail (e.g. it takes longer to read a review than to read the number of stars).
Future work might incorporate more qualitative methods of assessment during eye-tracking experiments to better understand users' thought processes while making decisions in the Yelp interface.