Self-fulfilling prophecies and Big Data

A look at inequality as it’s driven by algorithmic thinking. Some advertising data investigation. A discussion of the book “Weapons of Math Destruction” by Cathy O’Neil.

James Carzon https://jamescarzon.github.io/ (Carnegie Mellon University)http://www.stat.cmu.edu/
06-21-2021

A distinction can be made between the rigorous nature of mathematics and the fuzziness of its applications. Mathematical claims are proved logically and totally. Everything else, one might believe, is only proved to one’s satisfaction or not, but never irrevocably. In the process of translating between Science and Math, something perfect is lost.

This distinction is a philosophical preface to a long debate on addressing “racist algorithms,” “sexist science,” and so forth. There is a sense in which Big Data are understood to excite inequalities such as wage gaps between racial and gender groups or gaps in job application successes and graduation rates for socioeconomic groups. The gap between science and math constitutes the most immediate criticism of such claims to systemic or social inequality (whatever that means generally) caused (in entirety or in part) by “Big Data.” You know–numbers and stuff. Math. Perfect truth, and the facts that we find when we study the numbers and stuff that are generated and documented by modern technology at an incredible volume.

It certainly seems to me that these data, like any other data, have no idea what they are describing and that they themselves have no intent to cause anything. They simply arise in the world naturally. It is not the case that “the number of people in academia who consistently vote for Republicans” might be one of perhaps several existing entities conspiring to steepen political bias in higher education. It’s just (possibly) a fact which we (might) observe. So to claim that Big Data are a reason for some sort of inequality of this sort is, obviously, a misunderstanding of what they are.

However, there are fairly clear and uncontroversial examples of negative feedback loops which are uniquely enabled by our uses of Big Data which can serve as a clue to the nature of more impactful claims. That is, there are ways in which data and their trained algorithms help to widen measurable inequalities. I have written this post in an attempt to clarify this point of confusion from my personal understanding of the phenomenon of data-driven feedback loops. The idea came to me when I read Cathy O’Neil’s book Weapons of Math Destruction, whose subject matter is the exposure of examples of precisely these sorts of cases. I consider this subject to be one which is too easily dismissed on the basis that it participates in this Math v. Science fallacy, which I hold it does not.

Stereotypes arising from data

I am a cautious data scientist, say. I would like to be able to chase a claim from some data set to its source and totally and completely understand the story of that data set’s generation. That would be the ideal way to “do statistics about the world.” This is difficult to do.

For example, maybe I want to know how my online advertisements are determined. I observe the advertisements which are fed to me and find myself surprised on occasion by which ones I see. Is my phone really spying on me when I’m eating the last of my baby carrots only to inform Google to bump a Domino’s Pizza spot on my Jordan Peterson lecture on YouTube? As a casual experiment, I jumped onto The Grad Cafe, a forum for graduate students from across all of academia, and surveyed the ads I received after a few minutes of browsing: AllState, Hewlett-Packard, Go Daddy, Xfinity and Spectrum, T-Mobile, UWMCareers.com, and the University of Connecticut Pharmacy School. Why do you think I am getting these ads in particular?

On the one hand, my personal browsing history would suggest that I have spent time recently thinking about insurance and the internet and building websites, et cetera. These are topics that I’ve undoubtedly binge-Binged before. It is plausible that I am shown ads which, among those paid for by companies across the internet, were determined to be most interesting to me by some algorithm.

On the other hand, I am a user on The Grad Cafe after all and, as such, fit into the demographic of people interested in graduate school. Like many of my peers on the site, I’m in my early ’20s and I have a college degree. I have the time and resources to browse the internet freely and at length. People like me tend to be interested in things like academic careers, pharmacy schools, and cloud computing. It is plausible that I am shown ads which, among those paid for by companies across the internet, were determined to be most interesting to us users of The Grad Cafe.

Since both of these explanations are plausible ones for some or all of the ads I see, I can’t simply x-ray through the results and understand how these choices are being made. I could be judged according to my personal history or I could be judged according to the history of my cohort. In a complex way, I may be judged according to a combination of these.

You should know that you are welcome to investigate the character of Google’s advertising profile of you. An ad that is provided by Ads by Google should show a triangle button in the top right corner which, when clicked, sends you to an ads settings page. From there you can find a collection of interests for which you’ve been historically pegged.

Today I found out that I am a grains & pasta guy. I did not know this about myself, and had I been adequately tuned in to my body’s cravings I would have hesitated before committing to a keto cleanse that threw me out of lawn maintenance commission for a week and a half of yoga in bed with the lights off. Now I’ll think twice before cutting bread out of my diet, then three times, then four times before caving to a bowl of Frosted Mini Wheats that I can’t finish but do anyways.

There are several reasons why I have never lost sleep over the threat of an unplanned pregnancy upsetting my assumed way of life, but none of them are that I fail to sense the coming urge of regular bodily functions which, if nightly active, would surely cause me to lose some sleep. Have I forgotten some vast library of videos of clumsy toddlers that my grandma asked me to back up for her in case the family inheritance is pinched by a sleeper cell of the Irish mafia? Maybe that’s an anonymous Reddit account that I’ve forgotten about.

If you know me, then you know that I need my daily fix of humidifier / dehumidifier mod and app developer news sent to my inbox to start the morning off on a good note. I’m not ashamed to admit that all of my Twitch subscriptions are to water filter and purifier unboxing channels.

The stereotype that Google has of me is not totally correct, but I saw a bunch of interests which were accurate. This discovery was insightful for me because it inforces a general rule about stereotypes: the farther away from the individual that the data comes, the less accurate its conclusions can dependably be for the individual.

The negative feedback loop

Over time, Google has inferred these interests of mine from my activity indirectly. As long as the algorithm satisfies me, I reinforce the stereotype. In this example of online advertising, my stereotype is largely benign, but hopefully it is clear that the concept at work can be translated to different contexts. Stereotypes of all kinds are like algorithms. They efficiently approximate the data that we collect, but with that efficiency comes a compromise in accuracy. When we make efficient approximations of someone’s behavior, for example, based on the groups to which they belong, we lose information about their individual peculiarities. This claim, I think, is clear enough.

What is the value of a stereotype? If this is a purely sociological question, then it’s difficult to answer it as a statistician. The threats of “Big Data” are well noted, though. Cathy O’Neil’s book Weapons of Math Destruction furnishes a collection of more serious examples of algorithms which are the products of an attitude that squeezing meaning out of Big Data is the way of the future for sociological problems.

For example, one might look at university rankings to decide to which colleges they should apply, but there is nothing “true” about the University of Virginia being a better place to study than the University of Florida. Maybe if you asked a billion people which one is better, then you can bet that they will favor Virginia in a mass, but that’s not corresponding to any facts of the universe! The US News staff doesn’t count the goodness particles floating in the Gainesville sky and weigh its ranking on that metric. Yet it’s easier for us to steamroll ahead with decisions informed by numbers than to perform a cost-benefit analysis with even less complete data like impressions that we glean from a website and maybe a conversation or two; it’s also easier for university administrators to invest in those features of their university which are relevant to their ranking than to make personal judgment calls on what most needs fixing. Of course, I am now indulging in an oversimplification of both of these processes.

The assumption underlying these applications of Big Data is that people behave in accordance with categories. Political groups cohere in their decisions, this assumption further suggests, as do ethnic and religious and socioeconomic ones, and so they will be treated according to their group. The problem with this assumption is two-fold. Firstly, it can sharpen ideological boundaries by refining echo chambers and quelling cross-pollination. For example, if politically radical individuals are only recommended politically radical YouTube videos, and if those same videos are rarely recommended to politically opposed individuals, then the consumers of that content may falsely believe their opinions to be widely accepted and not reasonably opposed.

Secondly, it allows for centralized authority. For example, if schoolteachers’ efficacy can be judged by exam scores, then one person with a computer can decide who in a district should be fired by crunching the exam data set and ignoring unmeasurable facts of the exam and the settings in which it was administered. To let all decisions like these be made algorithmically is to decide from the outset that the truth of a judgment is less important than the specificity of it. When authority is centralized, these facts may be lost. So, enabling (and ennobling) algorithms with negative feedback loops of this kind for the purpose of increasing efficiency invites division and ignores local and individual economic peculiarities.

These two issues seem to invite plenty of extrapolation into political realms which I will not try to do here because I have not even argued for an ethical conclusion to the phenomenon of data-driven feedback loops. I did use the adjective “negative,” which is suggestive, but it’s not an argument. The problem of algorithmic thinking is not “my problem with algorithms.” I am only clarifying an idea which sees itself in these relevant debates.