
Thousands of pieces of suspected child sexual abuse material (CSAM) have been found in LAION-5B, a dataset of images and text used to train and build artificial intelligence (AI) models, according to a report.
LAION (Large-scale Artificial Intelligence Open Network) is a nonprofit organization that makes machine-learning technology available to the public.
The massive public dataset, which holds more than 5 billion images and related captions gathered from across the internet, contains at least 1,008 instances of CSAM, a new report from the Stanford Internet Observatory found.
The report warned that the dataset could enable AI products built on it, such as image-generation tools, to create new and potentially realistic child abuse content.
The methodology, outlined in a diagram in the report, was used to assess all URLs in the first segment of the LAION-2B-multi dataset and returned 27 positive PhotoDNA matches. PhotoDNA is a technology used to detect, disrupt, and report child pornography.
The team submitted the URLs to C3P, the Canadian Centre for Child Protection, for verification.
Once CSAM was positively identified, researchers performed a "k-nearest neighbor" search query for each of the 27 positive matches, resulting in 2,773 additional candidates for inspection by PhotoDNA. About 1,954 of these images were still live and retrievable in the dataset.
The research found that “88 additional PhotoDNA hits were found; 43 of them were unique instances
(as determined by the PhotoDNA unique match ID), with some images duplicated up to 8 times.”
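The report does not publish the code behind this expansion step, but the general technique is straightforward: take the embedding of each confirmed image and look up its nearest neighbors in the dataset's embedding space, then re-check those neighbors with a hash tool rather than flagging them automatically. The sketch below assumes precomputed CLIP-style embeddings stored as NumPy arrays; the file names, neighbor count, and use of scikit-learn are illustrative assumptions, not the researchers' actual tooling.

```python
# Minimal sketch of k-nearest-neighbor expansion over precomputed image
# embeddings (e.g., the CLIP embeddings distributed alongside LAION metadata).
# File names and parameters are assumptions for illustration only.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# embeddings: (N, D) matrix, one row per image in the dataset segment
# seed_vectors: (M, D) embeddings of images already confirmed by PhotoDNA
embeddings = np.load("segment_embeddings.npy")            # hypothetical file
seed_vectors = np.load("confirmed_match_embeddings.npy")  # hypothetical file

# Cosine distance is the usual choice for CLIP-style embeddings.
index = NearestNeighbors(n_neighbors=100, metric="cosine")
index.fit(embeddings)

# For each confirmed match, pull its nearest neighbors as new candidates
# to be verified by hash matching or human review, not auto-flagged.
distances, neighbor_ids = index.kneighbors(seed_vectors)
candidate_ids = sorted(set(neighbor_ids.ravel().tolist()))
print(f"{len(candidate_ids)} candidate images queued for hash verification")
```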
Duplication is also a concern: the more often an image is repeated in a training set, the more likely a model trained on that set is to produce output resembling the repeated training data.
A summary of the findings from PhotoDNA, MD5 (message-digest algorithm) matching, and the k-nearest neighbor search is shown in the report.
MD5 is a hash algorithm originally intended for use in digital signature applications; in this context it serves as a fingerprint for identifying exact copies of known files.
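As a rough illustration of how exact-hash matching works, the sketch below computes MD5 digests of local files and checks them against a list of known hash values. The file paths and the hash-list format are assumptions; in practice such hash lists are held by child-safety organizations and are not publicly distributed.

```python
# Minimal sketch of exact-match checking with MD5 digests against a list of
# known hash values. Paths and the list format are illustrative assumptions.
import hashlib
from pathlib import Path

def md5_digest(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical inputs: a newline-separated hash list and a folder of files.
known_hashes = set(Path("known_md5_list.txt").read_text().split())
image_files = [p for p in Path("downloaded_images").iterdir() if p.is_file()]
matches = [p for p in image_files if md5_digest(p) in known_hashes]
print(f"{len(matches)} exact MD5 matches out of {len(image_files)} files")
```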
Sites with matches included the content delivery networks of Reddit, Twitter, Blogspot, and WordPress, as well as those of mainstream adult sites such as XHamster and XVideos. A high percentage of hits, according to the report, came from sites dedicated to teen models or nudity.
The dataset -- created by the Germany-based nonprofit -- can also contain copyrighted material. AI image generators rely on datasets that pair images with text descriptions to learn a wide range of concepts and create pictures in response to prompts from users.
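As a rough sketch of what such a dataset looks like, LAION distributes its metadata as parquet files pairing image URLs with their scraped captions; a text-to-image model is trained to associate the two. The file name below is an assumption, and the column names follow the commonly distributed LAION metadata format.

```python
# Minimal sketch of inspecting image-text pair records in a LAION-style
# metadata shard. The file name is hypothetical; URL/TEXT column names
# follow the commonly distributed LAION parquet layout.
import pandas as pd

df = pd.read_parquet("laion_metadata_part_00000.parquet")  # hypothetical shard
for _, row in df.head(3).iterrows():
    # Each record pairs an image URL with the caption scraped alongside it.
    print(row["URL"], "->", row["TEXT"][:80])
```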
LAION told Bloomberg that the group has a “zero tolerance policy” for
illegal content and was temporarily removing LAION datasets from the internet “to ensure they are safe before republishing them.”
A spokesperson said that, prior to releasing its datasets, the nonprofit created and published filters for spotting and removing illegal content from them.
A subset of LAION-5B has been used to build multiple versions of Stable
Diffusion, a deep-learning model released in 2022 that can generate detailed images from text descriptions.