Access Alert: AI Data Harvesting – Ethical? Monopolistic?

Large language models and their dependence on data harvesting

Many creatives and authors believe that their online stories and posts have been used to train applications like ChatGPT. The process of collecting these stories, images, and posts is called data harvesting or web scraping, and the harvested data is used to train large language models (LLMs). LLMs enable applications like ChatGPT to “teach themselves” how to analyse and generate text. LLMs can be open-sourced (Meta announced that Llama 2 will be open-sourced, though developers contest this claim because it does not use an Open Source Initiative-approved license) or closed-sourced (Google’s Bard and ChatGPT).

Copyright infringement: Screenplays and books used without consent

Screenwriters and authors believe that their works, available on shadow library platforms such as Bibliotik, Library Genesis, and Z-Library, have been used to develop the LLMs deployed by Meta, Google, and OpenAI. They consider this a copyright infringement, as they never consented to their work being used in this way. As a result, screenwriters, authors, and other creators are seeking fair compensation for the use of their intellectual property.

Copyright infringement – legal precedent by Google

From a public policy perspective, there are two areas of concern. The first is copyright: 4.9 billion people use the Internet every day, most of them through websites like Google and Facebook. All this data is captured and can be used to create LLMs.

Under US intellectual property law, the fair use doctrine allows copyrighted work to be used without explicit consent if it is used for news reporting, criticism, teaching, or research, or if the use is transformative (the work is used in a manner for which it was not originally intended). In 2015, in the case brought against it by the Authors Guild, Google successfully argued that transformative use allows for data harvesting of text from books to create its search engine. In addition, the costs that companies like OpenAI incur to create these models (computing resources, energy, data storage, and management) need to be considered when issues of fair compensation arise. OpenAI used 10,000 Nvidia graphics cards to train ChatGPT, and USD 5 million was reportedly spent on the LLM behind ChatGPT3.

Are closed-sourced LLMs anti-competitive but safer?

The second issue concerns open-sourced and closed-sourced LLMs. The widespread and free availability of models like Meta’s Llama poses a significant challenge to the early dominance of companies like OpenAI, supported by Microsoft. These established players already offer their models to business customers through Azure, but the introduction of Llama could disrupt their current position in the market. As Mark Zuckerberg wrote, “open-source drives innovation because it enables many more developers to build with new technology. I believe it would unlock more progress if the ecosystem were more open”.

Nevertheless, opponents of open-source LLMs raise concerns about the potential liabilities arising from how these models are used and for what purposes. Microsoft is at the forefront of advocating for the licensing of foundational models to safeguard copyrights and implement necessary restrictions. However, Microsoft also contends that academic institutions and non-profit organisations should have access to AI resources despite these licensing measures. Earlier this year, Mozilla announced a USD 30 million investment in an open-source AI ecosystem.

If you would like to hear more about these topics, please subscribe to our AI newsletter and keep an eye out for our upcoming webinar, where we will discuss how to enable a competitive AI landscape. For more information regarding AI developments or engagements, please contact the Head of AI Policy Lab, Melissa Govender, at [email protected].
