Access Alert: AI Data Harvesting – Ethical? Monopolistic?

Large language models and their dependence on data harvesting

Many creatives and authors believe that their online stories and posts have been used to train applications like ChatGPT. The process of collecting these stories, images, and posts is called data harvesting or web scraping, and the harvested data is then used to train large language models (LLMs). LLMs enable applications like ChatGPT to “teach themselves” how to analyse and generate text. LLMs can be open-source (Meta announced that Llama 2 will be open-source, though developers contest this claim because it does not use an Open Source Initiative-approved license) or closed-source (Google’s Bard and ChatGPT).

Copyright infringement: Screenplays and books used without consent

Screenwriters and authors believe that their works, available on shadow library platforms such as Bibliotik, Library Genesis, and Z-Library, have been used to develop the LLMs deployed by Meta, Google, and OpenAI. They consider this copyright infringement, as they never consented to their work being used in this manner. As a result, screenwriters, authors, and other creators are seeking fair compensation for the use of their intellectual property.

Copyright infringement – legal precedent by Google

From a public policy perspective, there are two areas of concern. The first is copyright. Some 4.9 billion people use the Internet every day, most of them making use of websites like Google and Facebook. The data they generate is captured and can be used to create LLMs.

Under US intellectual property law, the fair use doctrine allows copyrighted work to be used without explicit consent if it is used for news reporting, criticism, teaching, or research, or if the use is transformative (the work is used in a manner or for a purpose different from the one originally intended). In 2015, when Google was sued by the Authors Guild, it successfully argued that transformative use allows for data harvesting of text from books to create its search engine. In addition, the costs that companies like OpenAI incur to create these models need to be considered when issues of fair compensation arise. These costs include computing resources (OpenAI used 10,000 Nvidia graphics cards to train ChatGPT), energy, data storage, and management. Notably, OpenAI reportedly spent USD 5 million on training GPT-3.

Are closed-source LLMs anti-competitive but safer?

The second issue concerns open-source and closed-source LLMs. The widespread and free availability of models like Meta’s Llama poses a significant challenge to the early dominance of companies like OpenAI, which is supported by Microsoft. These established players already offer their models to business customers through Azure, but the introduction of Llama could disrupt their current position in the market. As Mark Zuckerberg wrote, “open-source drives innovation because it enables many more developers to build with new technology. I believe it would unlock more progress if the ecosystem were more open”.

Nevertheless, the case against open-source LLMs centres on the potential liabilities arising from how these models are used and the purposes to which they are put. Microsoft is at the forefront of advocating for the licensing of foundational models to safeguard copyrights and implement necessary restrictions. However, Microsoft also contends that academic institutions and non-profit organisations should have access to AI resources despite these licensing measures. Earlier this year, Mozilla announced a USD 30 million investment in an open-source AI ecosystem.

If you would like to hear more about these topics, please subscribe to our AI newsletter and keep an eye out for our upcoming webinar, where we will discuss how to enable a competitive AI landscape. For more information regarding AI developments or engagements, please contact Head of AI Policy Lab, Melissa Govender, at [email protected].
