Access Alert: AI Data Harvesting – Ethical? Monopolistic?

Access Alert: AI Data Harvesting – Ethical? Monopolistic?

Large language models and their dependence on data harvesting

Many creatives and authors believe that their online stories and posts have been used to train applications like ChatGPT. The process of collecting these stories, images, and posts is called data harvesting or web scraping, and once this data is harvested it is used to create large language models (LLMs). LLMs enable applications like ChatGPT to “teach themselves” how to analyse and generate text data. LLMs can be open-sourced (Meta announced that Llama2 will be open-sourced, though developers are contesting this claim since it doesn’t use an Open-Source Initiative approved license) or closed-source (Google’s Bard and ChatGPT).

Copyright infringement: Screenplays and books used without consent

Screenwriters and authors hold the belief that their works, available on shadow library platforms such as Bibliotik, Library Genesis, and Z-Library, have been utilised to develop LLMs utilised by Meta, Google, and OpenAI. They consider this usage a copyright infringement, as they never provided consent for their work to be employed in this manner. As a result, screenwriters, authors, and other creators seek fair compensation for the use of their intellectual property.

Copyright infringement – legal precedent by Google

From a public policy perspective, there are two areas of concern. Firstly, there is the issue of copyright; 4.9 billion people use the Internet every day, most of them making use of websites like Google and Facebook. All this data is captured and can be used to create LLMs.

Based on US intellectual property law, the fair use doctrine does allow copyrighted work to be used without explicit consent, if it is used for news reporting, criticism, teaching, research, and for transformative use (used in a manner in which it was not intended). In 2015, when Google was sued by the Authors Guild, it successfully argued that transformative use allows for data harvesting of text from books to create its search engine. In addition, the costs (computer resources, OpenAI used 10,000 Nvidia graphic cards to train ChatGPT, energy, data storage, and management) that companies like OpenAI incur to create these models need to be considered when issues of fair compensation arise. Notably, ChatGPT3 spent USD 5 million on its LLM.

Are close-sourced LLMs anti-competitive but safer?

The second issue concerns open-sourced and closed-sourced LLMs. The widespread and free availability of models like Meta’s Llama poses a significant challenge to the early dominance of companies like OpenAI, supported by Microsoft. These established players already offer their models to business customers through Azure, but the introduction of Llama could disrupt their current position in the market. As Mark Zuckerberg wrote, “open-source drives innovation because it enables many more developers to build with new technology. I believe it would unlock more progress if the ecosystem were more open”.

Nevertheless, the opposing viewpoint against open-source LLMs revolves around concerns about potential liabilities associated with how these LLMs are used and the specific purposes for which they are utilised. Microsoft is at the forefront of advocating for the licensing of foundational models to safeguard copyrights and implement necessary restrictions. However, Microsoft also contends that academic institutions and non-profit organisations should have access to AI resources despite these licensing measures. Earlier this year, Mozilla announced a USD 30 million investment in an open-source AI ecosystem.

If you would like to hear more about these topics, please subscribe to our AI newsletter and keep an eye out for our upcoming webinar where will discuss how to enable a competitive AI landscape. For more information regarding AI developments or engagements, please contact Head of AI Policy Lab, Melissa Govender, at [email protected].

Related Articles

Access Alert: Saudi Arabia – Pioneering Sustainable Space Endeavours

Access Alert: Saudi Arabia – Pioneering Sustainable Space Endeavours

In an effort to showcase its focus on sustainable space exploration and utilisation, Saudi Arabia hosted the Space Debris Conference...

1 Mar 2024 Opinion
FinTech’s Role in SMEs and Inclusive Growth

FinTech’s Role in SMEs and Inclusive Growth

SMEs account for 90% of businesses and 50% of employment globally. However, limited financial access and regulatory complexities prevent these...

29 Feb 2024 Opinion
Tech Policy Trends 2024: Consumer protection – The next battleground for digital pioneers

Tech Policy Trends 2024: Consumer protection – The next battleground for digital pioneers

With technologies emerging and advancing at a rapid pace, tech players are under unprecedented scrutiny for how it impacts consumers’...

28 Feb 2024 Opinion
Access Alert: Advancing pan-African mobile roaming to unlock seamless connectivity and growth opportunities

Access Alert: Advancing pan-African mobile roaming to unlock seamless connectivity and growth opportunities

Digital transformation is accelerating in Africa, driven by the growing adoption of mobile services. According to the World Bank, sub-Saharan...

23 Feb 2024 Opinion