
Thousands of pieces of suspected child sexual abuse material (CSAM) have been found in LAION-5B, a dataset of images and text used to train and build artificial intelligence (AI) models, according to a report.
LAION (Large-scale Artificial Intelligence Open Network) is a nonprofit organization that makes machine-learning technology available to the public.
The massive public dataset, which holds more than 5 billion images and related captions gathered from across the internet, contains at least 1,008 instances of CSAM, a new report from the Stanford Internet Observatory found.
The report warned that the dataset could enable AI products built on it, such as image-generation tools, to create new and potentially realistic child abuse content.
The methodology, outlined in a diagram in the report, was used to assess all URLs in the first segment of the LAION-2B-multi dataset and returned 27 positive PhotoDNA matches. PhotoDNA is a technology used to detect, disrupt, and report child pornography.
The team submitted the URLs to C3P, the Canadian Centre for Child Protection, for verification.
Once CSAM was positively identified, researchers performed a "k-nearest neighbor" search query for each of the 27 positive matches, resulting in 2,773 additional candidates for inspection by PhotoDNA. About 1,954 of these images were still live and retrievable in the dataset.
The research found that “88 additional PhotoDNA hits were found; 43 of them were unique instances
(as determined by the PhotoDNA unique match ID), with some images duplicated up to 8 times.”
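The report does not publish the code behind this expansion step, but the general technique is straightforward: take the embedding of each confirmed image and look up its nearest neighbors in the dataset's embedding space, then re-check those neighbors with a hash tool rather than flagging them automatically. The sketch below assumes precomputed CLIP-style embeddings stored as NumPy arrays; the file names, neighbor count, and use of scikit-learn are illustrative assumptions, not the researchers' actual tooling.

```python
# Minimal sketch of k-nearest-neighbor expansion over precomputed image
# embeddings (e.g., the CLIP embeddings distributed alongside LAION metadata).
# File names and parameters are assumptions for illustration only.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# embeddings: (N, D) matrix, one row per image in the dataset segment
# seed_vectors: (M, D) embeddings of images already confirmed by PhotoDNA
embeddings = np.load("segment_embeddings.npy")            # hypothetical file
seed_vectors = np.load("confirmed_match_embeddings.npy")  # hypothetical file

# Cosine distance is the usual choice for CLIP-style embeddings.
index = NearestNeighbors(n_neighbors=100, metric="cosine")
index.fit(embeddings)

# For each confirmed match, pull its nearest neighbors as new candidates
# to be verified by hash matching or human review, not auto-flagged.
distances, neighbor_ids = index.kneighbors(seed_vectors)
candidate_ids = sorted(set(neighbor_ids.ravel().tolist()))
print(f"{len(candidate_ids)} candidate images queued for hash verification")
```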
Duplication is also a concern: the more often an image is repeated in a training set, the more likely a model trained on that set is to produce output resembling the repeated training data.
A summary of the findings from PhotoDNA, MD5 (message-digest algorithm) matching, and the k-nearest neighbor search is shown in the report.
MD5 is a hash algorithm originally intended for use in digital signature applications; in this context it serves as a fingerprint for identifying exact copies of known files.
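As a rough illustration of how exact-hash matching works, the sketch below computes MD5 digests of local files and checks them against a list of known hash values. The file paths and the hash-list format are assumptions; in practice such hash lists are held by child-safety organizations and are not publicly distributed.

```python
# Minimal sketch of exact-match checking with MD5 digests against a list of
# known hash values. Paths and the list format are illustrative assumptions.
import hashlib
from pathlib import Path

def md5_digest(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical inputs: a newline-separated hash list and a folder of files.
known_hashes = set(Path("known_md5_list.txt").read_text().split())
image_files = [p for p in Path("downloaded_images").iterdir() if p.is_file()]
matches = [p for p in image_files if md5_digest(p) in known_hashes]
print(f"{len(matches)} exact MD5 matches out of {len(image_files)} files")
```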
Sites with matches included the content delivery networks of Reddit, Twitter, Blogspot, and WordPress, as well as those of mainstream adult sites such as XHamster and XVideos. A high percentage of hits, according to the report, came from sites dedicated to teen models or nudity.
The dataset -- created by the Germany-based nonprofit -- can also contain copyrighted material. AI image generators rely on datasets that pair images with text descriptions to learn a wide range of concepts and create pictures in response to prompts from users.
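As a rough sketch of what such a dataset looks like, LAION distributes its metadata as parquet files pairing image URLs with their scraped captions; a text-to-image model is trained to associate the two. The file name below is an assumption, and the column names follow the commonly distributed LAION metadata format.

```python
# Minimal sketch of inspecting image-text pair records in a LAION-style
# metadata shard. The file name is hypothetical; URL/TEXT column names
# follow the commonly distributed LAION parquet layout.
import pandas as pd

df = pd.read_parquet("laion_metadata_part_00000.parquet")  # hypothetical shard
for _, row in df.head(3).iterrows():
    # Each record pairs an image URL with the caption scraped alongside it.
    print(row["URL"], "->", row["TEXT"][:80])
```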
LAION told Bloomberg that the group has a “zero tolerance policy” for
illegal content and was temporarily removing LAION datasets from the internet “to ensure they are safe before republishing them.”
A spokesperson said that, prior to releasing its datasets, the nonprofit created and published filters for spotting and removing illegal content from them.
A subset of LAION-5B has been used to build multiple versions of Stable
Diffusion, a deep-learning model released in 2022 that can generate detailed images from text descriptions.