A key challenge in training artificial intelligence (AI) models is the ethical use of training datasets. For example, a number of technology companies investing in AI have faced allegations of improper use of personal data, such as data scraping, in their quest to develop advanced models.[1] Others have reportedly trained their models on datasets containing pirated copyrighted content.[2] These practices continue to be contested under various jurisdictions’ data protection and privacy laws, as well as intellectual property laws, though the associated compliance and reporting requirements could slow the pace of AI development.
One unusual challenge for AI model trainers is the rapidly diminishing supply of human-generated content for models to learn from. Some projections of large language model (LLM) training appetite suggest that LLM training will deplete the stock of human-generated text between 2026 and 2032.[3] The “Dead Internet Theory”,[4] for example, proposes that AI- and bot-generated content now exceeds human-generated content on the internet. If the internet – and the content AI models are trained on – is no longer populated by real human beings, then the insights and models trained on it are no longer representative of a “human” response.
In this context, the concept of synthetic data has emerged as a viable solution for AI model trainers to balance data protection and intellectual property rights with the needs of AI innovation.
What is synthetic data?
- Synthetic data is artificially generated data that replicates the statistical characteristics and patterns of real-world data while anonymising private information and removing traceability to real individuals.[5] It is produced using privacy-enhancing technologies (PETs) such as AI-driven models (e.g., generative adversarial networks and transformers) that obfuscate the original data. Synthetic data is already applied across industries: Tesla uses it to improve its self-driving technology,[6] Microsoft creates bias-reduced synthetic faces for facial recognition,[7] and IBM’s Watson generates fake cyberattacks to enhance malware detection.[8]
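To make the definition concrete, here is a minimal, hypothetical sketch in Python of the core intuition: fit only aggregate statistics of a sensitive table, then sample brand-new records from those statistics. Production generators (GANs, transformer-based models) are far more sophisticated and add formal privacy safeguards; the columns and values below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a small table of real, sensitive records
# (columns: age, income). Values are invented for illustration.
real = np.column_stack([
    rng.normal(45, 12, size=500),        # age
    rng.lognormal(10.5, 0.4, size=500),  # income
])

# Fit only aggregate statistics: column means and covariance.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample brand-new records from the fitted distribution: they share
# the table's joint statistics but are not copies of real rows.
synthetic = rng.multivariate_normal(mean, cov, size=500)

print("real means:     ", real.mean(axis=0).round(1))
print("synthetic means:", synthetic.mean(axis=0).round(1))
```

The printed statistics match closely even though every synthetic row is newly sampled rather than copied from the original table.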
Why is synthetic data popular in AI training?
(1) First, using synthetic data improves data protection and privacy compliance. Most data protection laws and regulations protect privacy through mandatory purpose limitation and user consent requirements. For example, Article 5(1)(b) of the EU GDPR restricts the use of personal data to specified, explicit, and legitimate purposes, while Article 6(1)(a) makes user consent one of the lawful bases for processing.
> However, because properly generated synthetic data contains no identifiable information, it does not require user consent and can help prevent unauthorised data use. This supports AI innovation: businesses can use synthetic data for model training and share information safely within their organisations without exposing sensitive data. It also helps companies comply with data protection and privacy regulations by ring-fencing the original dataset and minimising its distribution, which reduces the risk of breaches, leaks, and cyberattacks. In short, synthetic data enables business insights and supports AI innovation while maintaining compliance and trust.
(2) Second, using synthetic data strengthens data quality and reduces data bias. The quality, quantity, and diversity of training data directly affect how well an AI model performs. Using the EU AI Act as a benchmark, Article 10 requires AI systems, particularly high-risk ones, to be trained on high-quality, representative, and unbiased data to ensure fairness and prevent discrimination.
> Synthetic data generation can reduce the need to cleanse, integrate, and maintain real data while giving trainers control over sample composition, such as the representation of specific genders or ethnicities. This helps ensure AI models work effectively across all groups, reducing the algorithmic bias caused by data imbalances and strengthening the equity and quality of the data used for AI model training.
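As a simple illustration of that control over sample composition, the hypothetical Python sketch below rebalances an imbalanced table so every group is equally represented. A real synthetic data generator would be asked to generate the additional records for under-represented groups; here resampling stands in for generation, and all names and figures are invented.

```python
import pandas as pd

# Invented, imbalanced dataset: group "B" is under-represented.
real = pd.DataFrame({
    "group": ["A"] * 900 + ["B"] * 100,
    "score": range(1000),
})

# Request the same number of records per group. With a synthetic data
# generator this would be "generate 1,000 rows per group"; here we
# approximate it by resampling with replacement.
balanced = real.groupby("group").sample(n=1000, replace=True, random_state=0)

print(balanced["group"].value_counts())  # A: 1000, B: 1000
```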
(3) Third, synthetic data allows for virtually unlimited data scalability. From a relatively small original dataset, synthetic data methods can generate arbitrarily large, high-quality datasets to fit different training and testing needs.
> Synthetic data scaling expands the pool of training data available to AI models, which is crucial for enhancing model accuracy, reducing bias, and enabling learning from complex patterns. For example, IBM’s Watsonx.ai platform defaults to generating 100,000 synthetic rows from an original dataset containing as few as 1,000 rows.[9]
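IBM has not published how Watsonx.ai generates these rows, so the Python sketch below shows one common technique that achieves this kind of scaling: a Gaussian copula fitted to a small table’s empirical marginals and rank correlations, then sampled 100,000 times. All inputs are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Invented "real" table: 1,000 rows, 3 numeric columns.
real = rng.multivariate_normal(
    [0, 0, 0],
    [[1.0, 0.6, 0.2], [0.6, 1.0, 0.4], [0.2, 0.4, 1.0]],
    size=1_000,
)
real[:, 2] = np.exp(real[:, 2])  # make one column skewed

n, d = real.shape

# 1. Map each column to normal scores via its empirical ranks.
ranks = np.argsort(np.argsort(real, axis=0), axis=0) + 1
z = stats.norm.ppf(ranks / (n + 1))

# 2. Estimate the correlation of the normal scores (the copula).
corr = np.corrcoef(z, rowvar=False)

# 3. Sample 100,000 correlated normal vectors...
z_new = rng.multivariate_normal(np.zeros(d), corr, size=100_000)

# 4. ...and push them back through each column's empirical quantiles.
u = stats.norm.cdf(z_new)
synthetic = np.column_stack([np.quantile(real[:, j], u[:, j]) for j in range(d)])

print(synthetic.shape)  # (100000, 3)
```

The synthetic table preserves each column’s distribution and the cross-column correlations, but its size is limited only by compute, not by the original sample.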
(4) Fourth, synthetic data is a cost-effective method for AI training and innovation that also reduces copyright-violation risks and costs. Gathering high-quality human-generated data is costly and time-consuming, and further costs can arise where the content requires licence payments, e.g., for copyrighted materials.
> Using synthetic data reduces the need for real data collection and its associated privacy compliance overhead, making it especially beneficial for SMEs aiming to compete in data-driven solutions on a limited budget with a limited dataset. For example, one synthetic data expert suggested that labelling a single image by human labour could cost USD 6, while generating it artificially costs around 6 cents.[10]
Additionally, synthetic data that is entirely AI-generated from a non-copyrighted dataset is typically itself copyright-free. This reduces both the risk of copyright infringement and the cost of copyright licensing.
Regulatory updates on synthetic data
According to research firm Gartner, by 2026, 75% of businesses will use generative AI to create synthetic customer data.[11] While comprehensive regulation of synthetic data is still lacking, a number of governments have issued initial guidelines for its use, and these are expected to mature as adoption expands.
- UK: The UK Statistics Authority’s 2022 guidelines cover the ethical use of synthetic data, while the Office for National Statistics allows synthetic data to be used for research and testing. In 2025, the UK Government Digital Service released guidance on synthetic data modelling.[12]
- Singapore: In 2024, Singapore’s Personal Data Protection Commission published the Proposed Guide to Synthetic Data Generation, detailing best practices, risk management, and governance controls for responsible use.[13]
Synthetic data – not a cure-all for data protection obligations
While synthetic data offers a practical way for companies to comply with data laws while leveraging AI and data-driven solutions, it is not a cure-all for a company’s data obligations and ambitions:
(1) The use of synthetic data does not replace data protection rules – user consent and data minimisation considerations still apply, including to the original dataset from which the synthetic data is generated.
(2) From a data perspective, synthetic data may be adjusted to prevent overfitting and privacy risks, but other data constraints or biases may be introduced in the process (e.g., over-adjusting data to suit certain parameters); a simple screening check for memorisation is sketched after this list.
(3) Future regulations may require consent and restrict synthetic data generation and sharing.
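On point (2), one screening check practitioners sometimes run is a distance-to-closest-record (DCR) test: if synthetic rows sit closer to individual real records than real records sit to each other, the generator may have memorised its inputs. The Python sketch below is illustrative only; the data is invented and the 1st-percentile threshold is an assumption, not a standard.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
real = rng.normal(size=(1_000, 4))       # stand-in for real records
synthetic = rng.normal(size=(5_000, 4))  # stand-in for generated records

# Distance from each synthetic record to its nearest real record.
dcr = cKDTree(real).query(synthetic, k=1)[0]

# Baseline: how close real records are to each other
# (k=2 so a record is not matched with itself).
baseline = cKDTree(real).query(real, k=2)[0][:, 1]

# Flag synthetic rows closer to a real record than the 1st percentile
# of real-to-real distances (the threshold is an assumption).
threshold = np.quantile(baseline, 0.01)
print(f"{(dcr < threshold).mean():.1%} of synthetic rows look memorised")
```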
AI’s future depends on trust, built through responsible data practices. Synthetic data supports data protection compliance, scalability, and secure data sharing, and helps companies balance innovation with legal and ethical safeguards. However, organisations must continue to adhere to best practices to ensure their data governance mechanisms achieve compliance and adapt to evolving legal frameworks.
Reach out for more information
At Access Partnership, we help clients navigate technology regulatory developments and AI market opportunities, offering strategic advice on how to seize new technologies like synthetic data to tackle data privacy challenges. For more information, contact Xiaomeng Wu at [email protected], or Lim May-Ann at [email protected].