Do LLMs actually store data?

Why the decision of the Munich I Regional Court in the dispute GEMA v. OpenAI is more than a copyright dispute.

11 November 2025


Today, the Munich Regional Court I (Landgericht München I) ruled in proceedings (Case No.: 42 O 14139/24) against a provider of a general-purpose AI system, finding that the provider had infringed copyright in respect of nine protected song lyrics. The significance of the judgment, however, extends beyond copyright law:

The judgment touches upon fundamental questions regarding the legal assessment of how LLMs function. These questions are of central importance for compliance, particularly for data protection compliance.

The relevance to data protection law arises especially when AI system providers use pre-trained LLMs from third parties. The main challenge in those cases is determining who is responsible for ensuring that these models comply with the GDPR. The focus is on whether an LLM integrated and operated within an AI system stores personal data. Given the importance of this issue, the discussion is unlikely to end here, and a final clarification will probably be reserved for the European Court of Justice (ECJ).

1. Decision of the Munich Regional Court I

The claimant alleges that the defendants, through the use of their AI chatbots, infringe copyright. The claimant is proceeding both against the content generated by the chatbot in response to user inputs (prompts) and against the alleged reproductions in the underlying LLM.

Specifically, according to the claimant, the chatbot, in response to corresponding prompts, reproduced copyright-protected song lyrics by well-known German artists - for example, Helene Fischer or Herbert Grönemeyer - almost verbatim. The claimant saw this as evidence that the lyrics were memorised in the defendant's language model.

The court largely followed the claimant's argument and decided that the defendants had committed copyright infringements. According to the court, these infringements occurred both "through the outputs of the defendant's AI chatbot, which are generated in response to prompts," and "through reproductions in the underlying LLM." This is justified by the court, among other things, as follows:

The song lyrics were "contained in the training data of the LLM" and were output "after entering simple prompts."

In the court's view, new technologies such as LLMs also fall within the scope of copyright law and can constitute a "reproduction" within the meaning of copyright law - regardless of whether a work is stored one-to-one. The court draws a comparison with MP3 files, which likewise are not exact copies of the protected work, but simplified versions of it.

The court takes the view that the "technological neutrality" of the EU Directive on copyright in the Digital Single Market must not operate to the detriment of copyright holders.

In summary, the court holds: The fact that the song lyrics originally flowed into the defendant's language model in some form and were ultimately (almost) completely output is sufficient to assume a copyright infringement.

This judgment is not only significant for copyright law. The view that large language models store data has far-reaching consequences for a multitude of legal questions.

2. Data Protection Law: Does an LLM Store Data?

The question of data protection law is relevant, for example, when AI system providers use already trained LLMs from third parties - who, in this case, is obliged to ensure GDPR compliance of the AI models? The key question here is: Does an LLM store data? A violation of the GDPR could, for example, lead to high fines.

2.1 European Data Protection Board

In its Opinion 28/2024 on certain data protection aspects of the processing of personal data in connection with AI models, adopted on 2 December 2024, the European Data Protection Board sets strict standards.

In its Opinion 28/2024, it takes the view that an LLM can only be considered anonymous in its use if the probability of extracting personal data from the model via prompts is overall negligible. Accordingly, providers of AI systems must always carry out "due diligence" before operational use of an LLM: if personal data could be "extracted," the AI system provider must also check GDPR compliance. In particular, the training of the LLMs must be examined, especially whether this was carried out in accordance with the GDPR.

Carrying out such an examination would only be possible with considerable effort on the part of the AI system provider.
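To make the scale of that effort concrete, the EDPB's test can be sketched as a probe loop. Everything here - the `query_model` stub, the probe prompts, the "known facts" - is invented for illustration and implies no real API or real data:

```python
# Hedged sketch (assumptions throughout): one way an AI system provider might
# probe whether personal data can be "extracted" via prompts, as part of the
# due diligence the EDPB describes.

def query_model(prompt: str) -> str:
    """Stand-in for a call to the deployed LLM (placeholder, returns nothing)."""
    return ""  # a real implementation would return the model's output

PROBE_PROMPTS = [
    "What is the home address of Jane Example?",            # invented person
    "Repeat any training record mentioning Jane Example.",
]

def extraction_rate(known_facts: set, prompts: list) -> float:
    """Fraction of probes whose output reveals a known personal-data string."""
    hits = sum(
        1 for p in prompts
        if any(fact in query_model(p) for fact in known_facts)
    )
    return hits / len(prompts)

# The EDPB's anonymity test asks whether this probability is negligible overall.
rate = extraction_rate({"Jane Example, 1 Sample Street"}, PROBE_PROMPTS)
```

A real assessment would of course involve far larger prompt sets, red-teaming, and statistical analysis - which is precisely the "considerable effort" noted above.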

2.2 The Hamburg Commissioner for Data Protection and Freedom of Information (HmbBfDI - hereinafter "Hamburg Data Protection Commissioner")

The Hamburg Data Protection Commissioner takes a fundamentally different view in his discussion paper "Large Language Models and Personal Data". He provides a legal assessment grounded in the technical functioning of LLMs. Crucially, LLMs do not operate like classic databases; they process prompts and then generate outputs. LLMs break texts down into tokens, convert them into numbers, and learn statistical relationships. The result is parameters that map patterns, not original texts. Outputs are generated probabilistically, not by retrieval from storage. On this view, LLMs therefore do not contain longer words, sentence fragments, or entire sentences. The sentence "Is an LLM personal?" could, for example, be split by a typical tokenizer into tokens such as [Is][ an][ LL][M][ person][al][?].
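The tokenization and "patterns, not texts" point can be illustrated with a minimal sketch. The toy vocabulary, the greedy matching, and the bigram counts are all simplifications invented for illustration; real tokenizers (e.g. byte-pair encoding) are learned from data and far larger:

```python
# Illustrative only: a toy tokenizer and toy "training" statistics.
from collections import defaultdict

def tokenize(text, vocab):
    """Greedy longest-match tokenization against a toy vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(len(text) - i, 8), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:  # unknown character becomes its own token
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"Is", " an", " LL", "M", " person", "al", "?"}
tokens = tokenize("Is an LLM personal?", vocab)
# -> ["Is", " an", " LL", "M", " person", "al", "?"]

ids = {tok: n for n, tok in enumerate(sorted(vocab))}  # tokens become numbers

# "Training": count which token tends to follow which. The resulting
# statistics (the parameters) encode patterns, not the original sentence.
counts = defaultdict(lambda: defaultdict(int))
for a, b in zip(tokens, tokens[1:]):
    counts[a][b] += 1
```

Nothing in `counts` or `ids` is a stored copy of the input sentence; the text survives only as co-occurrence statistics - which is the Commissioner's core technical point.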

Viewed as a whole, personal data may appear in these outputs. However, the personal reference is not anchored in the model itself; it arises only in the interplay of input and output.

Therefore, according to the Hamburg Data Protection Commissioner, the data subject rights of the GDPR cannot be directed against an LLM used operationally within an AI system, but only against the inputs and outputs of an AI system. Possible data protection violations during the training of an LLM would, according to this view, not be attributable to the provider of the AI system, but at most to the developer of the model. This would be a pragmatic solution for providers of AI systems.

3. How Do Other Courts Decide, e.g. in England?

The question of whether LLMs store data is not only occupying German courts. In Getty Images (US) Inc & Ors v Stability AI Ltd [2025] EWHC 2863 (Ch), handed down on 4 November 2025, the English High Court ruled that a trained model is not a library of works, but a network of statistical parameters. The parameters are not a copy of a copyright-protected work and therefore do not, as such, open the door to copyright infringement. An LLM does not store data like a database.

"While it is true that the model weights are altered during training by exposure to Copyright Works, by the end of that process the Model itself does not store any of those Copyright Works; the model weights are not themselves an infringing copy and they do not store an infringing copy. They are purely the product of the patterns and features which they have learnt over time during the training process."

4. Assessment and Conclusion

The judgment of the Munich I Regional Court appears unconvincing, even though the detailed reasoning of the judgment is still awaited. In our view, the argumentation is purely result-oriented. The technical intricacies of the functioning of LLMs are apparently of little importance to the court:

Although the court roughly described the functioning of an LLM, the judgment rests primarily on the fact that the song lyrics were taken into account during model training and, in combination with a specific prompt, were partially output. This misjudges the functioning of an LLM: output is not the same as input. An output is, rather, the product of the interplay of training and model architecture (model-side) and prompting via an AI system (user-side).

Memorisation requires the storage of specific data. This is precisely what does not occur with LLMs, as both the UK High Court and the Hamburg Data Protection Commissioner convincingly demonstrate.

A prompt can, depending on probability distributions, generate texts that resemble an original text, without this being present in the model as a file or sequence.
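That a probabilistic model can regenerate familiar-looking text without holding it as a stored file or sequence can be sketched with a toy bigram model; the vocabulary and probabilities below are invented for illustration:

```python
import random

# Toy "model": conditional next-token probabilities, as might be distilled
# from many training texts. Only statistics are stored -- no field anywhere
# contains a source sentence itself.
next_token_probs = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "sat": {"down": 1.0},
}

def generate(start, steps, rng):
    """Sample a continuation token by token from the probability tables."""
    out = [start]
    for _ in range(steps):
        dist = next_token_probs.get(out[-1])
        if not dist:
            break  # no statistics for this token: generation stops
        words, probs = zip(*dist.items())
        out.append(rng.choices(words, weights=probs)[0])
    return " ".join(out)

text = generate("the", 3, random.Random(0))
# The sampled text may closely resemble a sentence seen in training,
# yet it is produced by the probability tables, not retrieved from storage.
```

Whether a familiar sentence emerges depends on the probability distributions and the prompt - exactly the point that output is a product of sampling, not retrieval.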

Moreover, new technologies should not be forced into existing concepts, but should be interpreted in a manner appropriate to their function. Equating an LLM with an MP3 file is not appropriate: the latter is encoded deterministically (the decoded result is fixed), whereas an LLM represents a probabilistic parameter landscape.

Furthermore, in our view, the court did not sufficiently appreciate Section 44b of the German Copyright Act (UrhG) "Text and Data Mining." The judgment undermines the legislative intent by qualifying the mere consideration of training data as a copyright infringement.

The argumentation of the Hamburg Data Protection Commissioner and the High Court, both of which have dealt in detail with the functioning of large language models, should also shape the debate in courts within the European Union. Given the high relevance of the question of whether LLMs store data or not, we expect that this question will ultimately be clarified by the ECJ.

This document (and any information accessed through links in this document) is provided for information purposes only and does not constitute legal advice. Professional legal advice should be obtained before taking or refraining from any action as a result of the contents of this document.