Access Alert: AI Data Harvesting – Ethical? Monopolistic?

Large language models and their dependence on data harvesting

Many creatives and authors believe that their online stories and posts have been used to train applications like ChatGPT. The process of collecting these stories, images, and posts is called data harvesting or web scraping, and the harvested data is then used to train large language models (LLMs). LLMs enable applications like ChatGPT to “teach themselves” how to analyse and generate text. LLMs can be open-source (Meta announced that Llama 2 will be open-source, though developers contest this claim because it does not use an Open Source Initiative-approved license) or closed-source (Google’s Bard and ChatGPT).

Copyright infringement: Screenplays and books used without consent

Screenwriters and authors believe that their works, available on shadow library platforms such as Bibliotik, Library Genesis, and Z-Library, have been used to develop the LLMs used by Meta, Google, and OpenAI. They consider this copyright infringement, as they never consented to their work being used in this manner. As a result, screenwriters, authors, and other creators are seeking fair compensation for the use of their intellectual property.

Copyright infringement – legal precedent by Google

From a public policy perspective, there are two areas of concern. Firstly, there is the issue of copyright. 4.9 billion people use the Internet every day, most of them making use of websites like Google and Facebook. The data these users generate is captured and can be used to create LLMs.

Under US intellectual property law, the fair use doctrine allows copyrighted work to be used without explicit consent for purposes such as news reporting, criticism, teaching, and research, and for transformative use (use that serves a new purpose, different from the one for which the work was originally intended). In 2015, in a case brought by the Authors Guild, Google successfully argued that transformative use allowed it to harvest text from books to create its search engine. In addition, the costs that companies like OpenAI incur to create these models (computing resources, energy, data storage, and management) need to be considered when issues of fair compensation arise. OpenAI, for example, used 10,000 Nvidia graphics cards to train ChatGPT and reportedly spent USD 5 million on training GPT-3.

Are closed-source LLMs anti-competitive but safer?

The second issue concerns open-source and closed-source LLMs. The widespread and free availability of models like Meta’s Llama poses a significant challenge to the early dominance of companies like OpenAI, which is supported by Microsoft. These established players already offer their models to business customers through Azure, but the introduction of Llama could disrupt their current position in the market. As Mark Zuckerberg wrote, “open-source drives innovation because it enables many more developers to build with new technology. I believe it would unlock more progress if the ecosystem were more open”.

Nevertheless, opponents of open-source LLMs raise concerns about potential liability for how these models are used and for what purposes. Microsoft is at the forefront of advocating for the licensing of foundational models to safeguard copyrights and implement necessary restrictions. However, Microsoft also contends that academic institutions and non-profit organisations should have access to AI resources despite these licensing measures. On the open-source side, Mozilla announced a USD 30 million investment in an open-source AI ecosystem earlier this year.

If you would like to hear more about these topics, please subscribe to our AI newsletter and keep an eye out for our upcoming webinar, where we will discuss how to enable a competitive AI landscape. For more information regarding AI developments or engagements, please contact the Head of AI Policy Lab, Melissa Govender, at [email protected].
