Synthetic Data, Data Protection and Copyright in an Era of Generative AI

Authors

  • Kalpana Tyagi

Abstract

Data protection, privacy and copyright may be closely aligned, yet each responds distinctly to a common element: data, which comprises personal as well as non-personal elements. Data can be of many different types, and when data are extracted from human-authored works, the expressive form of those works is subject to copyright protection. When personal data are included in a given dataset, their inclusion may trigger the application of the EU General Data Protection Regulation. Together, these different sources make up training data, a key input for the training of generative AI models. These models have devoured substantial amounts of data to reach their current level of sophistication and capabilities. However, generative AI models are advancing at such a rapid pace that they are no longer mere consumers of data; they are also key producers of new data that mimic the original data. Such data are known as ‘synthetic data’. Once the currently available models advance beyond their present level of development, follow-on synthetic data may look like independent works, with only a remote resemblance, if any, to the original data. While on the one hand this may hold great promise for compliance with the 2016 EU General Data Protection Regulation, on the other hand it heralds notable challenges for the current IPR (particularly copyright and database rights) framework and the accompanying balancing of authors’ and users’ rights. This interplay, given its inter- and intra-disciplinary complexity, remains under-explored in the literature. This contribution accordingly explores the interaction between copyright (and other IPRs), database rights, and data protection and privacy in the context of synthetic data and generative AI.

Published

2025-09-04