The Hidden World of YouTube: Fueling AI with Obscure Videos

Researchers from the University of Massachusetts Amherst have analyzed YouTube videos to understand their impact on AI training. Their findings reveal many videos made for small, personal audiences, a significant proportion of them created by children under 13. This research raises concerns about privacy and copyright as companies like OpenAI use these videos to develop AI models.


PTI | Amherst | Updated: 28-06-2024 11:06 IST | Created: 28-06-2024 11:06 IST
  • Country: United States

Amherst, Jun 28 (The Conversation) - As the artificial intelligence revolution gathers pace, data remains its lifeblood. OpenAI and Google have turned to YouTube as a rich source of training data. But what exactly does this YouTube archive contain? A team from the University of Massachusetts Amherst set out to investigate, analyzing random samples of YouTube videos to demystify this extensive dataset.

Their 85-page publication sheds light on the surprising contents of YouTube. They discovered many videos intended for personal use or small groups, with a significant proportion created by children under 13.

While most users experience YouTube through algorithmically recommended videos, a vast iceberg of obscure content remains unexplored. Researchers documented thousands of personal videos with minimal views but high engagement, indicating they were meant for a small audience, such as friends and family. This contrasts with the widely known popular content, exposing another layer of YouTube as a video-centered social network for close-knit groups.

The research gains urgency in the context of a New York Times exposé revealing that OpenAI and Google are leveraging these videos to train their large language models. Concerns about YouTube's terms of service, copyright issues, and the sheer volume of data—including content from kids—are growing.

The researchers, while not condemning Google, underscore that OpenAI's opacity about training materials and the potential inclusion of user-generated content from children raise serious ethical questions. Given the Federal Trade Commission's Children's Online Privacy Protection Rule, regulatory efforts are needed to ensure legal protections for user data, particularly as AI continues to evolve.

(This story has not been edited by Devdiscourse staff and is auto-generated from a syndicated feed.)
