On the Need for Socio-Technical Testing
By: Emilee Rader
I’ve been working for the past couple of years to understand the consequences of feedback loops between people and algorithms in socio-technical systems. In systems like these, people, algorithms, and content are three components that interact to produce what we see every time we log in. For example, as users create new Facebook posts, there is new content available for the News Feed algorithm to select and show to other users. When users see posts in their News Feeds and receive feedback on their own posts through the system, they learn about the kinds of things one should or shouldn’t post to Facebook. The recent NY Times article about how unhappy marriages are a taboo topic for Facebook posts is a good example of this.
Through experience with Facebook over time, users figure out how to create posts that are more likely to be rewarded with more “engagement” from other users in the form of likes, shares and comments (among other things). “Engagement” is defined not just by the technical parts of the system (e.g., how many likes does it take for a post to be considered worthy of display by the algorithm), but by users’ social interaction as well (whether 10 likes is a lot or a little depends on how many likes other posts are getting). This means that the algorithm, which chooses what posts to show at the top of one’s News Feed, and users’ perceptions of what kinds of posts receive more attention than others, work together to produce the set of posts we see in our News Feeds each time we visit. And we are more likely to like, comment or share the posts we see than the posts we don’t see.
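To make the point about “10 likes” concrete, here is a minimal sketch in Python. The function name and the like counts are invented for illustration and say nothing about how Facebook actually scores posts; the sketch only shows that the same raw count can mean very different things depending on what other posts are getting.

```python
from typing import Sequence


def share_of_posts_outperformed(likes: int, other_posts_likes: Sequence[int]) -> float:
    """Return the fraction of other posts that received fewer likes than this one."""
    if not other_posts_likes:
        return 0.0
    beaten = sum(1 for other in other_posts_likes if other < likes)
    return beaten / len(other_posts_likes)


# Hypothetical like counts for the other posts in two different friend networks.
quiet_network = [0, 1, 2, 3, 5]
busy_network = [40, 75, 120, 300, 12]

# The same 10 likes, judged against each network.
quiet = share_of_posts_outperformed(10, quiet_network)
busy = share_of_posts_outperformed(10, busy_network)

print(f"quiet network: beats {quiet:.0%} of other posts")
print(f"busy network:  beats {busy:.0%} of other posts")
```

In the quiet network a post with 10 likes outperforms everything else; in the busy one it outperforms nothing. Any fixed technical threshold would treat the two cases identically.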
This feedback loop is highly relevant for understanding how something like including painful events in a user’s Facebook “year in review” could happen, and why it is actually harder than it seems in hindsight to anticipate and prevent. I’ve seen statements online in the past few days that amount to, “How could Facebook’s designers be so stupid? Why didn’t they just exclude posts that include negative words that might indicate something painful happened, like ‘death’? Or allow users to opt in instead of showing ‘year in review’ by default?”
There are a couple of obvious reasons why, as a purely technical strategy, word-based filtering might not work. Human use of language is fluid and nuanced and full of non-literal meaning. What if I had created a post with a photo I took at a “death metal” concert that subsequently went viral? Should that be excluded from my “year in review”? Defining broadly applicable vocabulary-based rules for an algorithm to follow is difficult, especially for a brand new feature like the “year in review” for which there are no known failures to learn from yet. Also, when users’ posts are short, they don’t contain much other text that an algorithm could use to figure out whether the word “death” means someone died, or is an adjective describing a style of heavy metal music. There may be an infrastructure reason as well. In a system where posts are only relevant if they’re timely, designing the system to be able to retrieve and process old posts quickly and efficiently is not a priority. My guess is that an assumption about post timeliness could be encoded in various ways in Facebook’s infrastructure, which could mean that past posts are simply harder, computationally, to use to infer whether a sad thing happened during someone’s year.
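The “death metal” problem can be shown with a minimal sketch of a word-based filter in Python. The word list, the function name, and the sample posts are all hypothetical; this is the naive strategy suggested by the online commenters, not anything Facebook is known to use.

```python
import re

# Hypothetical vocabulary of words that "might indicate something painful happened".
NEGATIVE_WORDS = {"death", "died", "funeral"}


def looks_painful(post_text: str) -> bool:
    """Return True if the post contains any word from NEGATIVE_WORDS."""
    words = re.findall(r"[a-z']+", post_text.lower())
    return any(word in NEGATIVE_WORDS for word in words)


# Invented sample posts, short like many real ones.
posts = [
    "Front row at the death metal show last night!",
    "We lost Grandpa this spring. Miss him every day.",
    "My dad died in March.",
]

flags = [looks_painful(post) for post in posts]
for post, flagged in zip(posts, flags):
    print(f"{'EXCLUDE' if flagged else 'keep   '}  {post}")
```

The filter excludes the concert photo (a false positive) and keeps the post about a lost grandparent (a false negative), because neither short post gives the rule enough context to work with. Only the third post is handled as intended.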
However, on top of any potential technical issues, using algorithmic curation to create a painful “year in review” is a great example of what I call a socio-technical bug — an unintended consequence of the interaction between people and algorithms. In systems like Facebook, this interaction can produce edge cases that are hard to identify before they happen.
The News Feed algorithm uses Facebook’s “engagement” metric to great effect. I have no doubt that people are more satisfied with their News Feeds when they include more posts that others have liked, commented on, and shared. It is very easy to imagine that this metric could also work really well for “year in review”, and for many users, I’m sure it worked just fine; I’ve seen some great examples from my own Facebook Friends. At the same time, it would probably be hard to argue for the development and testing of a new metric optimized specifically for “year in review” when engagement works so well for day-to-day post ranking and display. But the fact that the “year in review” failed dramatically for some users illustrates that engagement is a noisy metric that has been optimized for transient display, and that it performs inconsistently when it is used for something different, like highlighting events from the past.
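Here is a minimal sketch, in Python, of how reusing engagement to pick yearly highlights could go wrong. Everything in it is an assumption made for illustration: the real metric is not public, so `engagement` is a plain sum of likes, comments, and shares, and the posts and counts are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Post:
    text: str
    likes: int
    comments: int
    shares: int


def engagement(post: Post) -> int:
    """Assumed scoring: a plain sum of interactions (the real metric is not public)."""
    return post.likes + post.comments + post.shares


def pick_highlights(posts: List[Post], k: int) -> List[Post]:
    """Return the k posts with the highest engagement, highest first."""
    return sorted(posts, key=engagement, reverse=True)[:k]


# Invented posts from one hypothetical user's year.
year = [
    Post("New job announcement", likes=80, comments=25, shares=2),
    Post("Beach vacation photos", likes=60, comments=10, shares=1),
    Post("Our house burned down last night", likes=15, comments=140, shares=30),
]

highlights = pick_highlights(year, 2)
for post in highlights:
    print(engagement(post), post.text)
```

The painful post comes out on top because messages of support count as engagement just as congratulations do. The metric measures how much people interacted with a post, not whether its author would want to see it again.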
Perhaps taking a socio-technical metric like “engagement” that’s optimized for one purpose and using it for something different is the equivalent of a regular software bug like an off-by-one error: a totally obvious problem, but (sometimes embarrassingly) easy to overlook during development. We just don’t know yet how to systematically and efficiently identify failure modes for a socio-technical system, feature or metric. The meaning of “engagement” is the product of a feedback loop between algorithms and people, which makes the edge cases hard to anticipate. It’s definitely harder than it seems in hindsight.