Access Alert: AI Data Harvesting – Ethical? Monopolistic?

Large language models and their dependence on data harvesting

Many creatives and authors believe that their online stories and posts have been used to train applications like ChatGPT. The process of collecting these stories, images, and posts is called data harvesting or web scraping, and the harvested data is then used to train large language models (LLMs). LLMs enable applications like ChatGPT to “teach themselves” how to analyse and generate text. LLMs can be open-source (Meta announced that Llama 2 would be open-source, though developers contest this claim because its licence is not approved by the Open Source Initiative) or closed-source (Google’s Bard and OpenAI’s ChatGPT).

Copyright infringement: Screenplays and books used without consent

Screenwriters and authors believe that their works, available on shadow library platforms such as Bibliotik, Library Genesis, and Z-Library, have been used to develop the LLMs deployed by Meta, Google, and OpenAI. They consider this copyright infringement, as they never consented to their work being used in this manner. As a result, screenwriters, authors, and other creators are seeking fair compensation for the use of their intellectual property.

Copyright infringement – legal precedent by Google

From a public policy perspective, there are two areas of concern. The first is copyright: 4.9 billion people use the Internet every day, most of them through websites like Google and Facebook. All the data they generate is captured and can be used to create LLMs.

Under US intellectual property law, the fair use doctrine allows copyrighted work to be used without explicit consent for news reporting, criticism, teaching, and research, and for transformative use (use for a purpose different from the one originally intended). In 2015, Google prevailed in a lawsuit brought by the Authors Guild, successfully arguing that transformative use allowed it to harvest text from books to build its book search service. In addition, the costs that companies like OpenAI incur to create these models (computing resources, energy, data storage, and management) need to be considered when issues of fair compensation arise. OpenAI, for example, used 10,000 Nvidia graphics cards to train ChatGPT, and training GPT-3 cost it USD 5 million.

Are closed-source LLMs anti-competitive but safer?

The second issue concerns open-source and closed-source LLMs. The widespread and free availability of models like Meta’s Llama poses a significant challenge to the early dominance of companies like OpenAI, which is backed by Microsoft. These established players already offer their models to business customers through Azure, but Llama could disrupt their market position. As Mark Zuckerberg wrote, “open-source drives innovation because it enables many more developers to build with new technology. I believe it would unlock more progress if the ecosystem were more open”.

Nevertheless, the case against open-source LLMs centres on potential liability for how these models are used and for what purposes. Microsoft is at the forefront of advocating for the licensing of foundation models to safeguard copyrights and implement necessary restrictions. However, Microsoft also contends that academic institutions and non-profit organisations should have access to AI resources despite these licensing measures. On the open-source side, Mozilla announced a USD 30 million investment in an open-source AI ecosystem earlier this year.

If you would like to hear more about these topics, please subscribe to our AI newsletter and keep an eye out for our upcoming webinar, where we will discuss how to enable a competitive AI landscape. For more information regarding AI developments or engagements, please contact the Head of AI Policy Lab, Melissa Govender, at [email protected].
