A recent investigation has uncovered that major tech companies, including Apple, Nvidia, and Anthropic, have been using YouTube video subtitles to train their artificial intelligence (AI) models without creators' permission. This practice goes against YouTube's rules and raises serious questions about data ethics in the AI industry.
The investigation, conducted by Proof News, revealed that these companies used subtitles from over 170,000 YouTube videos, spanning more than 45,000 channels. The data came from a wide range of sources, including educational channels like Harvard and MIT, news outlets such as the BBC, and popular YouTubers like PewDiePie and MrBeast.
This data was part of a larger collection called “The Pile,” created by EleutherAI, a non-profit AI research lab. The Pile bundles together many component datasets, YouTube Subtitles among them. Companies including Apple, Nvidia, and Salesforce used this collection to train their AI models, likely without knowing exactly where each piece of it came from.
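For readers curious how a component like YouTube Subtitles sits inside such a collection: The Pile is distributed as JSON Lines, where each record pairs its text with metadata naming the component dataset it came from. The sketch below shows how one might filter out just one component. The field names (`meta`, `pile_set_name`) and the `"YoutubeSubtitles"` label reflect the released format as best understood here, but treat them as assumptions rather than verified details.

```python
import json

def filter_subset(jsonl_lines, set_name):
    """Return the text of every record belonging to one Pile component.

    Assumes each line is a JSON object of the form
    {"text": ..., "meta": {"pile_set_name": ...}} (an assumption about
    the released format, not a verified schema).
    """
    texts = []
    for line in jsonl_lines:
        record = json.loads(line)
        if record.get("meta", {}).get("pile_set_name") == set_name:
            texts.append(record["text"])
    return texts

# Tiny made-up sample standing in for real Pile records.
sample = [
    json.dumps({"text": "subtitle text one",
                "meta": {"pile_set_name": "YoutubeSubtitles"}}),
    json.dumps({"text": "a github readme",
                "meta": {"pile_set_name": "Github"}}),
    json.dumps({"text": "subtitle text two",
                "meta": {"pile_set_name": "YoutubeSubtitles"}}),
]

print(filter_subset(sample, "YoutubeSubtitles"))
# → ['subtitle text one', 'subtitle text two']
```

The point of the metadata label is exactly what makes this story possible: because every record names its source dataset, investigators could trace which channels and videos ended up in the training mix.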
The use of this data raises several concerns. First, it appears to violate YouTube's Terms of Service, which prohibit harvesting material from the platform without permission. Second, the dataset includes subtitles from videos and channels that have since been deleted, undermining creators' ability to remove their content from the internet. Finally, some of the data contains biased or inappropriate material, which could carry over into the outputs of models trained on it.
Many creators were unaware that their content had been used in this way. Some, like Dave Farina of the YouTube channel Professor Dave Explains, argue that companies profiting from creators' work should provide compensation or face regulation.
The issue extends beyond YouTube. Similar concerns have been raised about AI companies using books and other copyrighted material without permission. Several authors have filed lawsuits against AI companies for alleged copyright violations.
As AI technology continues to advance rapidly, the debate over data usage, creator rights, and ethical AI training practices is likely to intensify. This incident highlights the need for clearer regulations and more transparent practices in the AI industry to protect content creators and ensure responsible AI development.