How The Atlantic's Music Training Database Exposes AI's Copyright Crisis

The Atlantic has created a searchable database of 21 million songs used to train AI models, revealing the scale of unlicensed music in training data and raising urgent copyright questions.

Daniel Evershaw (ML Engineer & Technical Writer), June 21, 2026, 6 min read

Last updated: June 21, 2026

Quick Answer

The Atlantic created a searchable database of over 21 million songs used to train AI models, revealing that Google and Stability AI are among the users. This transparency empowers artists and could reshape copyright law.

More than 21 million songs have been fed into AI models without explicit permission from artists or labels, and now the public can search every one of them. Atlantic reporter Alex Reisner uncovered four datasets used to train AI systems and made them fully searchable. Two of these datasets are enormous: one contains 12 million tracks, the other 9 million. The two smaller sets still contain more than 100,000 songs each. This database is a stark reminder that the foundation of many generative AI products rests on a vast, often unacknowledged, use of copyrighted creative work.

  • Two massive music datasets, of 12 million and 9 million tracks, have been used to train AI models.
  • The datasets have been downloaded thousands of times, making it impossible to know exactly who has used them, though Google and Stability AI are confirmed users.
  • The searchable database empowers artists, labels, and policymakers to see exactly what music has been used without permission.
  • This transparency could accelerate legal battles and regulatory action around AI training data and copyright.
  • The gap between the scale of unlicensed training data and current legal frameworks is now fully visible to the public.
  • The music industry may need new licensing models that account for bulk training data usage, not just individual song rights.

How Do These Music Datasets Actually Get Used to Train AI?

The process begins with massive collections of audio files, often scraped from the internet or compiled from existing digital libraries. In the case of the datasets uncovered by Reisner, the two largest collections, of 12 million and 9 million tracks, are likely drawn from sources like YouTube, personal music libraries, and potentially illegal file-sharing networks. These raw audio files are then processed into a format suitable for machine learning: typically, they are converted into spectrograms (visual representations of sound frequencies over time) or encoded into compressed latent representations. An AI model, often a transformer or diffusion architecture, is then trained on those representations: autoregressive transformers learn to predict the next segment of audio, while diffusion models learn to reconstruct spectrograms or latents from noise. Either way, the model learns statistical correlations between musical elements like melody, harmony, rhythm, and timbre across millions of examples. Because the datasets are so large, the model picks up a broad range of musical styles, enabling it to generate plausible but derivative compositions. The key issue is that this training process copies the musical patterns, and potentially the copyrighted expression, of every song in the dataset without any compensation or attribution to the original creators.
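
To make the spectrogram step concrete, here is a minimal sketch in Python using only the standard library. It illustrates the general technique and is not the preprocessing used for any of the datasets Reisner found, which the reporting does not describe. The function name magnitude_spectrogram, the frame and hop sizes, and the synthetic 500 Hz test tone are all assumptions chosen for the example. Production pipelines typically rely on optimized FFT libraries and mel-scaled spectrograms rather than the naive transform shown here.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_size=256, hop_size=128):
    """Return a list of frames, each a list of frame_size // 2 + 1 magnitudes.

    Hypothetical helper for illustration: a naive short-time Fourier transform.
    """
    if len(samples) < frame_size:
        raise ValueError("signal is shorter than one frame")
    # A Hann window tapers each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    num_bins = frame_size // 2 + 1
    # Precompute the DFT twiddle factors once; index by (k * n) % frame_size.
    twiddle = [cmath.exp(-2j * math.pi * m / frame_size)
               for m in range(frame_size)]
    spectrogram = []
    for start in range(0, len(samples) - frame_size + 1, hop_size):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        row = [abs(sum(frame[n] * twiddle[(k * n) % frame_size]
                       for n in range(frame_size)))
               for k in range(num_bins)]
        spectrogram.append(row)
    return spectrogram


# Assumed test input: a short synthetic 500 Hz sine tone, not real music.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 500.0 * n / SAMPLE_RATE) for n in range(2048)]

spec = magnitude_spectrogram(tone)

# Average each frequency bin over time and find the strongest one.
average = [sum(row[k] for row in spec) / len(spec) for k in range(len(spec[0]))]
peak_bin = max(range(len(average)), key=average.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / 256  # bin index -> frequency in Hz
print(len(spec), len(spec[0]), peak_hz)
```

Each row of spec is one time slice and each column is one frequency band. Stacked together, the rows form the image-like grid a model can be trained on. For this pure tone, the strongest column is the bin centered on 500 Hz.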

Artists should search their own catalogs in the database to understand if their work is in these datasets. This information is critical for any future legal claims or negotiations with AI companies.

The sheer scale of the datasets, 21 million tracks combined, changes the nature of the copyright debate. Previously, arguments about fair use for AI training often centered on the idea that models learn “patterns” rather than copying specific works. But when a model is trained on 12 million songs, it becomes far more likely that it has memorized fragments, chord progressions, and even entire hooks from individual tracks. This makes it harder for AI companies to argue that their models do not reproduce copyrighted expression. The legal test of “substantial similarity” in copyright infringement cases typically compares an allegedly infringing work to a specific original, and a searchable list of training tracks tells plaintiffs which originals to compare against. The defense that the model only learned “general style” is also harder to sustain when the training data includes entire discographies of thousands of artists. The database also exposes the asymmetry of power: artists had no way to know their music was in these datasets, while AI companies had full access. This transparency is likely to fuel class-action lawsuits and push regulators to require opt-in consent for training data, not just opt-out mechanisms.

| Aspect | Pre-Database Era | Post-Database Era | Impact on AI Industry |
|---|---|---|---|
| Artist Knowledge | No way to know if music was used | Full searchable visibility | Empowers legal action and licensing demands |
| Legal Defense | Fair use based on “pattern learning” | Harder to deny copying specific works | Increased litigation risk |
| Regulatory Pressure | Low, data was opaque | High, public can see scale of use | Likely to accelerate new laws |
| Licensing Models | Per-song or per-stream | Bulk dataset licenses needed | New revenue streams for labels |

What Should Artists and Labels Do With This New Transparency?

The database is a powerful tool, but it requires action. Artists and labels should immediately search their catalogs to see which of their songs appear in the four datasets. This information is the foundation for any legal claim. Beyond individual action, the industry needs to push for collective licensing frameworks. The music industry already has robust mechanisms for mechanical and performance royalties through organizations like ASCAP, BMI, and the Harry Fox Agency. These bodies could expand to offer bulk licenses for AI training data, setting a per-song fee or a revenue share from AI products. Labels should also reconsider their data-sharing agreements with tech companies. Many labels have already signed deals with AI firms, but these deals may undervalue the long-term worth of their catalogs. The database provides leverage: if a label can prove its entire catalog is in a dataset, it can demand a fairer share. Finally, artists should document any instances where their music has been used to generate output that competes with their own work, as this strengthens claims of market harm.

Who Benefits Most From This Searchable Database?

  • Individual artists: For the first time, independent musicians who may not have legal teams can see if their work is in these datasets. This empowers them to join class actions or demand removal.
  • Copyright lawyers and litigators: The database provides concrete evidence of mass copying, making it easier to build cases for infringement. It puts pressure on AI companies to explain how they sourced their training data.
  • Policymakers and regulators: Lawmakers in the EU, US, and elsewhere can use this data to draft legislation that requires transparency in training data. It provides a real-world example of the scale of unlicensed use.
  • Journalists and researchers: The database enables deeper investigation into which artists are most affected and which AI companies are the heaviest users of unlicensed data.
  • Music industry trade groups: Organizations like the RIAA can use this data to negotiate better terms with AI companies and to lobby for stronger copyright protections.

The database only shows which songs are in the datasets, not how the AI models actually use that data. A song being in a dataset does not automatically prove infringement in any specific output. Legal analysis is still required.

Which Warning Signs Predict Problems Ahead for AI Music Companies?

Several red flags emerge from this investigation. First, the fact that these datasets have been downloaded thousands of times means that dozens of AI startups, not just Google and Stability AI, may have trained on unlicensed music. Any company that used these datasets faces potential liability. Second, the lack of transparency from AI companies about their training data will become increasingly untenable. As more databases like this one emerge, companies that refuse to disclose their data sources will face reputational and legal damage. Third, the music industry is historically aggressive in defending its copyrights, as seen in the lawsuits against Napster, Grokster, and YouTube. AI companies are likely to face similar coordinated legal action. Finally, the regulatory environment is shifting fast. The EU’s AI Act requires providers of general-purpose AI models to publish a summary of the content used for training, and the US Copyright Office is actively studying AI and copyright. Companies that have not already secured licenses for their training data are sitting on a ticking legal time bomb.

The Atlantic’s database is more than a journalistic tool; it is a turning point in the relationship between AI and the creative industries. For the first time, the scale of unlicensed use is fully visible. The question now is whether AI companies will move toward licensing and transparency, or whether they will fight a protracted legal war that could define the future of generative AI. The music industry has shown it is willing to sue. The database gives it the ammunition to do so effectively.

Source: The Verge AI

Frequently Asked Questions

How can I search the database to see if my music was used?

The database is hosted by The Atlantic and is fully searchable online. You can enter an artist name, song title, or album to see if it appears in any of the four datasets. The database covers over 21 million tracks total.

Which AI companies are confirmed to have used these datasets?

The Atlantic’s reporting names Google and Stability AI as confirmed users of the datasets. However, because the datasets have been downloaded thousands of times, many other companies and researchers may have used them as well.

Does a song being in the database mean it was illegally copied?

Not necessarily. The database shows that the song was included in a training dataset, but whether that constitutes copyright infringement depends on the specific output of the AI model and the legal arguments around fair use. Legal analysis is needed for each case.

What should I do if I find my music in the database?

Document the evidence, including the dataset name and which of your songs appear. You may want to contact a lawyer specializing in copyright or AI law. You can also reach out to your music publisher or performance rights organization to discuss potential licensing or legal action.
