The human cost of AI: Is data labelling creating digital sweatshops?

Introduction

Artificial Intelligence (AI) has witnessed incredible growth over the past year. Platforms like ChatGPT and Midjourney are extending the reach of automated technologies into the deepest areas of human creativity, with continuous learning poised to make these systems even more effective. The global AI market is currently estimated to be worth USD 197bn.[1] According to current projections, that figure is set to reach USD 1.8tn by 2030.[2]

The discourse on AI has largely focused on job losses and reduced labour demand. However, these systems can’t operate without manual data labelling, where humans help to train machine learning models by identifying and tagging the key sections of a text or image. These individuals have been described as the “construction workers of the digital age”,[3] as it is only through this process that AI models achieve the accuracy and predictability that allow them to operate without external input.

Driving better data

Anyone who has used the Internet recently will have already performed data labelling, often without realising it. Remember the last time you tried to access a website but first had to submit a ReCAPTCHA test asking you to identify all the squares with traffic lights or road signs to “prove you are a human”? ReCAPTCHA is owned by Google, which uses these tests to help teach its Waymo autonomous cars. Those few seconds you spend squinting at a blurry photo of a bicycle are fed back into the system to train the model.

Unfortunately for Big Tech firms, the occasional ReCAPTCHA test is not enough to develop an AI to the required standard. In fact, data preparation represents around 80% of the time consumed in machine learning projects.[4] To stick with the example of autonomous vehicles, one hour of video data in this sector is estimated to require up to 800 hours of data labelling.[5]
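To give a sense of scale, the ratio cited above can be turned into a quick back-of-the-envelope calculation. The sketch below is illustrative only: the function names and the eight-hour working day are assumptions made for this example, and the 800-hour figure is the upper bound quoted in the text rather than a typical value.

```python
# Rough labelling-effort estimate based on the ratio cited in the text.
# The 800:1 ratio is the upper bound quoted above ("up to 800 hours").
HOURS_OF_LABELLING_PER_VIDEO_HOUR = 800
# Assumption for illustration only; the article does not state a workday length.
WORKDAY_HOURS = 8


def labelling_hours(video_hours: float,
                    ratio: float = HOURS_OF_LABELLING_PER_VIDEO_HOUR) -> float:
    """Return the upper-bound labelling hours for a given amount of video."""
    if video_hours < 0:
        raise ValueError("video_hours must be non-negative")
    return video_hours * ratio


def person_days(hours: float, workday_hours: float = WORKDAY_HOURS) -> float:
    """Convert labelling hours into person-days of work."""
    if workday_hours <= 0:
        raise ValueError("workday_hours must be positive")
    return hours / workday_hours


if __name__ == "__main__":
    hours = labelling_hours(1)
    print(f"1 hour of video -> up to {hours:.0f} labelling hours "
          f"(about {person_days(hours):.0f} person-days at {WORKDAY_HOURS} h/day)")
```

Under these assumptions, a single hour of footage could occupy one annotator for up to around 100 working days, which helps explain why occasional crowd-sourced tests cannot meet demand.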

Cheap labour?

In response to this demand, the tech sector is increasingly turning to workers in emerging markets. Africa, Latin America, and Southeast Asia are all attracting strong interest from companies in developed countries, which can pay a wage that counts as a living wage locally but remains well below the equivalent rate in their home markets. In countries like Kenya, workers who average around USD 2 per day in the “informal economy” can earn more than four times that figure to sit in an office and tag images on a computer screen.[6]

Some have compared these services to “digital sweatshops”,[7] noting similarities to the well-established concerns raised over the fashion industry in terms of long hours, demanding targets, and a lack of job security.[8] As with many of the world’s leading clothing brands, there’s little evidence to suggest that the use of such practices will reduce the popularity of AI systems.

Future trends

As demand for AI grows, so will the human-powered structures that underpin it. The data labelling market alone is projected to be worth USD 3.5 billion by 2024[9] and USD 8.2 billion by 2028,[10] with demand currently outpacing supply. As AI continues to expand into new sectors, more specialised applications of the technology are set to increase workloads even further.
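The two market projections above imply a steep growth rate. As a rough sketch, the compound annual growth rate between them can be computed as follows. Note that the two figures come from different sources ([9] and [10]), so treating them as points on a single trend is an assumption, and the function name is hypothetical.

```python
def implied_annual_growth(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by two values `years` apart."""
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("values and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


if __name__ == "__main__":
    # USD 3.5 billion (2024) and USD 8.2 billion (2028), as cited in the text.
    rate = implied_annual_growth(3.5, 8.2, 2028 - 2024)
    print(f"Implied growth: {rate:.1%} per year")
```

On these numbers, the market would need to grow by roughly 24% a year, a pace consistent with the claim that demand is outpacing supply.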

In the long term, the main threat to the data labelling industry is the very thing that it exists to support: automation. This process is already underway, with AI-driven data annotation being offered by a range of companies. However, several factors mean that this process is unlikely to cause dramatic shifts for many years.

First, increased specialisation requires more complicated data and a nuanced understanding of how to interpret it. This is where automated labelling processes currently fall short. Furthermore, even where automated labelling is already in use, human oversight is still needed to ensure mistake-free annotation, without which the data is useless. As such, automation is more likely to support existing data labelling efforts than replace them for the foreseeable future.[11]

The digital divide

The shorter-term threat posed by AI is its capacity to widen the digital divide. Research by the International Monetary Fund has found that as machine learning develops, the countries that stand to gain the most are those where these systems are operational, such as in the West and China.[12] While poorer countries may currently benefit from the outsourcing of data labelling, these are the crumbs of AI’s economic potential.

Once automated systems across different sectors demonstrate the ability to boost productivity and reduce labour demands, investment will flow into the countries where this is already happening; i.e., developed economies. Emerging markets, by contrast, will suffer from their overwhelming reliance on humans. In short, the abundance of, and reliance on, cheap labour that currently makes these markets attractive will soon become the very factor that prevents them from reaping the wider benefits of these technologies.

Lost in translation

The existing reasons for the digital divide in emerging markets are well documented: limited infrastructure, reduced access to the Internet, and poor education opportunities. Each of these results in a lack of digital skills among the general population.

Language barriers demonstrate how this divide is widening. Strong AI systems need to be trained on large local data sets to ensure accuracy and adaptability, yet over half of the world’s 7,000-plus languages have no digital footprint at all.[13] As AI becomes more ubiquitous, it has never been more important to ensure that these communities remain part of the conversation.

Urgent intervention required: Expanding digital skills

Raising productivity and expanding digital skills in emerging markets hold the key to reversing this trend. Without such efforts, global leaders in AI are projected to capture an additional 20-25% in net economic benefits compared to today, whereas developing countries will be limited to only 5-15%.[14] Policymakers need to act now to prevent this outcome by promoting digital learning, building the relevant infrastructure, and positioning their countries as magnets for investment beyond basic services. Data labelling should be used as a stepping stone to long-term digital growth and more skilled jobs, not as an end in itself. This must be a global effort, however, with existing AI leaders sharing knowledge and resources.

Access Partnership’s Tech Policy Exchange 2023 recently examined these challenges in greater detail by putting developing countries in the spotlight. Built around the theme of ‘Expanding Technological Frontiers in Digital Markets’, our day-long, in-person event brought together stakeholders from across industry and policymaking to discuss how to encourage innovation, promote investment, and, ultimately, close the digital divide.

Catch up on the four sessions here:

Session 1: Artificial Intelligence, Immersive Technology, and Diversity – Co-creating an Inclusive Digital Future

Session 2: Mind the Connectivity Gap – The Role of 5G, Wi-Fi, and Satellite-based Communication Technologies in Emerging Markets

Session 3: Content Moderation in Geographically, Culturally, and Religiously Diverse Environments – Balancing Free Speech and User Protection

Session 4: Digital Trade – Leveraging Technology for Economic Recovery and Resilience

[1] https://www.grandviewresearch.com/industry-analysis/artificial-intelligence-ai-market

[2] Ibid.

[3] https://www.forbes.com/sites/korihale/2019/05/28/google-microsoft-banking-on-africas-ai-labeling-workforce/

[4] https://www.techtarget.com/searchenterpriseai/feature/Data-preparation-for-machine-learning-still-requires-humans

[5] https://factordaily.com/indian-data-labellers-powering-the-global-ai-race/

[6] https://www.bbc.co.uk/news/technology-46055595

[7] https://www.vice.com/en/article/qkjk35/as-more-work-moves-online-the-threat-of-digital-sweatshops-looms

[8] https://www.aljazeera.com/opinions/2023/1/23/sweatshops-are-making-our-digital-age-work

[9] https://medium.com/cognilytica/data-preparation-labeling-for-ai-2020-b512a5ed777c

[10] https://www.grandviewresearch.com/industry-analysis/data-collection-labeling-market

[11] https://www.datasciencecentral.com/the-impact-of-data-labeling-2023-current-trends-future-demands/

[12] https://www.imf.org/en/Blogs/Articles/2020/12/02/blog-how-artificial-intelligence-could-widen-the-gap-between-rich-and-poor-nations

[13] https://www.raconteur.net/digital/tech-bridge-global-digital-language-divide/

[14] https://www.mckinsey.com/featured-insights/artificial-intelligence/notes-from-the-ai-frontier-modeling-the-impact-of-ai-on-the-world-economy