
America Already Has an AI Underclass

The Atlantic

On weekdays, between homeschooling her two children, Michelle Curtis logs on to her computer to squeeze in a few hours of work. Her screen flashes with Google Search results, the writings of a Google chatbot, and the outputs of other algorithms, and she has a few minutes to respond to each—judging the usefulness of the blue links she’s been provided, checking the accuracy of an AI’s description of a praying mantis, or deciding which of two chatbot-written birthday poems is better. She never knows in advance what she will have to assess, and for the AI-related tasks, which have formed the bulk of her work since February, she says she has little guidance and not enough time to do a thorough job.

Curtis is an AI rater. She works for the data company Appen, which is subcontracted by Google to evaluate the outputs of the tech giant’s AI products and search algorithm. Countless people do similar work around the world for Google; the ChatGPT-maker, OpenAI; and other tech firms. Their human feedback plays a crucial role in developing chatbots, search engines, social-media feeds, and targeted-advertising systems—the most important parts of the digital economy.

Curtis told me that the job is grueling, underpaid, and poorly defined. Whereas Google has a 176-page guide for search evaluations, the instructions for AI tasks are relatively sparse, she said. For every task that involves rating AI outputs, she is given a few sentences or paragraphs of vague, even convoluted instructions and as little as a few minutes to absorb them before the time allotted to complete the task runs out. Unlike a page of Google results, chatbots promise authoritative answers—offering the final, rather than first, step of inquiry, which Curtis said makes her feel a heightened moral responsibility to assess AI responses as accurately as possible. She dreads these timed tasks for the very same reason: “It’s just not humanly possible to do in the amount of time that we’re given.” On Sundays, she works a full eight hours. “Those long days can really wear on you,” she said.

Armughan Ahmad, Appen’s CEO, told me through a spokesperson that the company “complies with minimum wages” and is investing in improved training and benefits for its workers; a Google spokesperson said Appen is solely responsible for raters’ working conditions and job training. For Google to mention these people at all is notable. Despite their importance to the generative-AI boom and tech economy more generally, these workers are almost never referenced in tech companies’ prophecies about the ascendance of intelligent machines. AI moguls describe their products as forces akin to electricity or nuclear fission, like facts of nature waiting to be discovered, and speak of “maximally curious” machines that learn and grow on their own, like children. The human side of sculpting algorithms tends to be relegated to opaque descriptions of “human annotations” and “quality tests,” evacuated of the time and energy powering those annotations.

[Read: Google’s new search tool could eat the internet alive]

The tech industry has a history of veiling the difficult, exploitative, and sometimes dangerous work needed to clean up its platforms and programs. But as AI rapidly infiltrates our daily lives, tensions between tech companies framing their software as self-propelling and the AI raters and other people actually pushing those products along have started to surface. In 2021, Appen raters began organizing with the Alphabet Workers Union-Communications Workers of America to push for greater recognition and compensation; Curtis joined its ranks last year. At the center of the fight is a big question: In the coming era of AI, can the people doing the tech industry’s grunt work ever be seen and treated not as tireless machines but simply as what they are—human?  

The technical name for the use of such ratings to improve AI models is reinforcement learning with human feedback, or RLHF. OpenAI, Google, Anthropic, and other companies all use the technique. After a chatbot has processed massive amounts of text, human feedback helps fine-tune it. ChatGPT is impressive because using it feels like chatting with a human, but that pastiche does not naturally arise through ingesting data from something like the entire internet, an amalgam of recipes and patents and blogs and novels. Although AI programs are set up to be effective at pattern detection, they “don’t have any sense of contextual understanding, no ability to parse whether AI-generated text looks more or less like what a human would have written,” Sarah Myers West, the managing director of the AI Now Institute, an independent research organization, told me. Only an actual person can make that call.

The program might write multiple recipes for chocolate cake, which a rater ranks and edits. Those evaluations and examples will inform the chatbot’s statistical model of language and next-word predictions, which should make the program better at writing recipes in the style of a human, for chocolate cake and beyond. A person might check a chatbot’s response for factual accuracy, rate how well it fits the prompt, or flag toxic outputs; subject experts can be particularly helpful, and they tend to be paid more.
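To make that process concrete, here is a deliberately simplified sketch of the pairwise-preference step at the heart of RLHF: a small “reward model” learns to score the answer a rater preferred above the one the rater rejected, and that score later steers the chatbot’s fine-tuning. The data, the toy linear scorer, and the numbers below are illustrative assumptions, not any company’s actual pipeline.

```python
# A toy illustration of RLHF's pairwise-preference step, not any company's
# actual pipeline. All data and dimensions here are made up.
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for two chatbot answers to the same prompt; in a real system
# these features would come from the language model itself.
chosen = rng.normal(size=8)    # the answer the human rater preferred
rejected = rng.normal(size=8)  # the answer the rater ranked lower

w = np.zeros(8)  # a tiny linear "reward model"; real ones are neural networks

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Bradley-Terry-style objective: push the preferred answer's score above
# the rejected answer's score, one gradient step at a time.
for _ in range(200):
    margin = w @ chosen - w @ rejected
    grad = -(1.0 - sigmoid(margin)) * (chosen - rejected)  # gradient of -log(sigmoid(margin))
    w -= 0.1 * grad

print("rater-preferred answer now scores higher:", w @ chosen > w @ rejected)
```

In production systems the scorer is a full neural network and the preference pairs number in the hundreds of thousands, which is part of why the raters’ time pressure bears so directly on data quality.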

Using human evaluations to improve algorithmic products is a fairly old practice at this point: Google and Facebook have been using them for almost a decade, if not more, to develop search engines, targeted ads, and other products, Sasha Luccioni, an AI researcher at the machine-learning company Hugging Face, told me. The extent to which human ratings have shaped today’s algorithms depends on who you ask, however. Major tech companies that design and profit from search engines, chatbots, and other algorithmic products tend to characterize the raters’ work as only one among many important aspects of building cutting-edge AI products. Courtenay Mencini, a Google spokesperson, told me that “ratings do not directly impact or solely train our algorithms. Rather, they’re one data point … taken in aggregate with extensive internal development and testing.” OpenAI has emphasized that training on huge amounts of text, rather than RLHF, accounts for most of GPT-4’s capabilities.

[From the September 2023 issue: Does Sam Altman know what he’s creating?]

AI experts I spoke with outside these companies took a different stance. Targeted human feedback has been “the single most impactful change that made [current] AI models as good as they are,” allowing the leap from GPT-2’s half-baked emails to GPT-4’s convincing essays, Luccioni said. She and others argue that tech companies intentionally downplay the importance of human feedback. Such obfuscation “sockets away some of the most unseemly elements of these technologies,” such as hateful content and misinformation that humans have to identify, Myers West told me—not to mention the conditions the people work under. Even setting aside those elements, describing the extent of human intervention would risk dispelling the magical and marketable illusion of intelligent machines—a “Wizard of Oz effect,” Luccioni said.

Despite tech companies’ stated positions, digging into their own press statements and research papers about AI reveals that they frequently do acknowledge the value of this human labor, if in broad terms. A Google blog post promoting a new chatbot last year, for instance, said that “to create safer dialogue agents, we need to be able to learn from human feedback.” Google has similarly described human evaluations as necessary to its search engine. The company touts RLHF as “particularly useful” for applying its AI services to industries such as health care and finance. Two lead researchers at OpenAI similarly described human evaluations as vital to training ChatGPT in an interview with MIT Technology Review. The company stated elsewhere that GPT-4 exhibited “large improvements” in accuracy after RLHF training and that human feedback was crucial to fine-tuning it. Meta’s most recent language model, released this week, relies on “over 1 million new human annotations,” according to the company.

To some extent, the significance of humans’ AI ratings is evident in the money pouring into them. One company that hires people to do RLHF and data annotation was valued at more than $7 billion in 2021, and its CEO recently predicted that AI companies will soon spend billions of dollars on RLHF, similar to their investment in computing power. The global market for labeling data used to train these models (such as tagging an image of a cat with the label “cat”), another part of the “ghost work” powering AI, could reach nearly $14 billion by 2030, according to an estimate from April 2022, months before the ChatGPT gold rush began.

All of that money, however, rarely seems to be reaching the actual people doing the ghostly labor. The contours of the work are starting to materialize, and the few public investigations into it are alarming: Workers in Africa are paid as little as $1.50 an hour to check outputs for disturbing content that has reportedly left some of them with PTSD. Some contractors in the U.S. can earn only a couple of dollars above the minimum wage for repetitive, exhausting, and rudderless work. The pattern is similar to that of social-media content moderators, who can be paid a tenth as much as software engineers to scan traumatic content for hours every day. “The poor working conditions directly impact data quality,” Krystal Kauffman, a fellow at the Distributed AI Research Institute and an organizer of raters and data labelers on Amazon Mechanical Turk, a crowdsourcing platform, told me.

Stress, low pay, minimal instructions, inconsistent tasks, and tight deadlines—the sheer volume of data needed to train AI models almost necessitates a rush job—are a recipe for human error, according to Appen raters affiliated with the Alphabet Workers Union-Communications Workers of America and multiple independent experts. Documents obtained by Bloomberg, for instance, show that AI raters at Google have as little as three minutes to complete some tasks, and that they evaluate high-stakes responses, such as how to safely dose medication. Even OpenAI has written, in the technical report accompanying GPT-4, that “undesired behaviors [in AI systems] can arise when instructions to labelers were underspecified” during RLHF.

Tech companies have at times responded to these issues by stating that ratings are not the only way they check accuracy, that humans doing those ratings are paid adequately based on their location and afforded proper training, and that viewing traumatic materials is not a typical experience. Mencini, the Google spokesperson, told me that Google’s wages and benefits standards for contractors do not apply to raters, because they “work part-time from home, can be assigned to multiple companies’ accounts at a time, and do not have access to Google’s systems or campuses.” In response to allegations of raters seeing offensive materials, she said that workers “select to opt into reviewing sensitive content, and can opt out freely at any time.” The companies also tend to shift blame to their vendors—Mencini, for instance, told me that “Google is simply not the employer of any Appen workers.”  

[Read: The coming humanist renaissance]

Appen’s raters told me that their working conditions do not align with various tech companies’ assurances—and that they hold Appen and Google responsible, because both profit from their work. Over the past year, Michelle Curtis and other raters have demanded more time to complete AI evaluations, better compensation, benefits, and the right to organize. The job’s flexibility does have advantages, they told me. Curtis has been able to navigate her children’s medical issues; another Appen rater I spoke with, Ed Stackhouse, said the adjustable hours afford him time to deal with a heart condition. But flexibility does not justify low pay and a lack of benefits, Shannon Wait, an organizer with the AWU-CWA, told me; there’s nothing flexible about precarity.

The group made headway at the start of the year, when Curtis and her fellow raters received their first-ever raise. She now makes $14.50 an hour, up from $12.75—still below the minimum of $15 an hour that Google has promised to its vendors, temporary staff, and contractors. The union continued raising concerns about working conditions; Stackhouse wrote a letter to Congress about these issues in May. Then, just over two weeks later, Curtis, Stackhouse, and several other raters received an email from Appen stating, “Your employment is being terminated due to business conditions.”

The AWU-CWA suspected that Appen and Google were punishing the raters for speaking out.  “The raters that were let go all had one thing in common, which was that they were vocal about working conditions or involved in organizing,” Stackhouse told me. Although Appen did suffer a drop in revenue during the broader tech downturn last year, the company also had, and has, open job postings. Four weeks before the termination, Appen had sent an email offering cash incentives to work more hours and meet “a significant spike in jobs available since the beginning of year,” when the generative-AI boom was in full swing; just six days before the layoffs, Appen sent another email lauding “record-high production levels” and re-upping the bonus-pay offer. On June 14, the union filed a complaint with the National Labor Relations Board alleging that Appen and Google had retaliated against raters “by terminating six employees who were engaged in protected [labor] activity.”

Less than two weeks after the complaint was filed, Appen reversed its decision to fire Curtis, Stackhouse, and the others; their positions were reinstated with back pay. Ahmad, Appen’s CEO, told me in an email that his company bases “employment decisions on business requirements” and is “happy that our business needs changed and we were able to hire back the laid off contributors.” He added, “Our policy is not to discriminate against employees due to any protected labor activities,” and that “we’ve been actively investing in workplace enhancements like smarter training, and improved benefits.”

Mencini, the Google spokesperson, told me that “only Appen, as the employer, determines their employees’ working conditions,” and that “Appen provides job training for their employees.” As with compensation and training, Mencini deflected responsibility for the treatment of organizing workers as well: “We, of course, respect the labor rights of Appen employees to join a union, but it’s a matter between them and their employer, Appen.”

That AI purveyors would obscure the human labor undergirding their products is predictable. Much of the data that trains AI models is labeled by people making poverty wages, many of them located in the global South. Amazon deliveries are cheap in part because working conditions in the company’s warehouses subsidize them. Social media is usable and desirable because of armies of content moderators, also largely in the global South. “Cloud” computing, a cornerstone of Amazon’s and Microsoft’s businesses, takes place in giant data centers.

AI raters might be understood as an extension of that cloud, treated not as laborers with human needs so much as productive units, carbon transistors on a series of fleshly microchips—objects, not people. Yet even microchips take up space; they require not just electricity but also ventilation to keep from overheating. The Appen raters’ termination and reinstatement is part of “a more generalized pattern within the tech industry of engaging in very swift retaliation against workers” when they organize for better pay or against ethical concerns about the products they work on, Myers West, of the AI Now Institute, told me.

Ironically, one crucial bit of human labor that AI programs have proved unable to automate is their own training. Human subjectivity and prejudice have long made their way into algorithms, and those flaws mean machines may not be able to perfect themselves. Various attempts to train AI models with other AI models have bred further bias and worsened performance, though a few have shown limited success. “I can’t imagine that we will be able to replicate [human intervention] with current AI approaches,” Hugging Face’s Luccioni told me in an email; Ahmad said that “using AI to train AI can have dire consequences as it pertains to the viability and credibility of this technology.” The tech industry has so far failed to purge the ghosts haunting its many other machines and services—the people organizing on warehouse floors, walking out of corporate headquarters, unionizing overseas, and leaking classified documents. Appen’s raters are proving that, even amid the generative-AI boom, humanity may not be so easily exorcized.