As the government races to invest in AI research, federally funded researchers stand to encounter a troubling problem: datasets tainted with dangerous, and even illegal, content.
AI models are often trained on datasets containing billions of samples (text, images, and links) scraped from the open web. These datasets are useful because they're extremely large and cover a diverse range of topics, but they can also sweep up illicit imagery that goes undetected. This risk raises critical ethical and legal issues, and poses a challenge as federal agencies ramp up their efforts to support AI research.
The National Science Foundation, a federal agency that provides funding to scientific researchers, pointed to the need for a National Artificial Intelligence Research Resource in the aftermath of a major Stanford report that highlighted the presence of child sexual abuse material in LAION-5B — an open dataset created by the Large-scale Artificial Intelligence Open Network that's popular among researchers and has been used to train some generative AI systems. The report highlighted a major vulnerability with AI research, experts told FedScoop.
(LAION doesn’t have a relationship with NSF, but a researcher affiliated with an NSF-funded AI research institute appears to have collaborated on a paper presenting the dataset.)
According to the Stanford Internet Observatory, a research center that studies the abuse of the internet, the LAION-5B dataset represents nearly 6 billion samples, which include URLs, descriptions, and other data that might be associated with an image scraped from the internet. A report published in December determined that “having possession of a LAION-5B dataset populated even in late 2023 implies the possession of thousands of illegal images,” and in particular, child sexual abuse material. While LAION told FedScoop that it used filters before releasing the dataset, it’s since been taken down “in an abundance of caution.”
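To make the structure concrete: a LAION-style dataset does not ship the images themselves, but rows of URLs, captions, and model-generated scores that downstream filters rely on. The sketch below is illustrative only — the field names (`url`, `caption`, `unsafe_score`) are simplified stand-ins, not the dataset's exact schema — and it shows why metadata-only filtering can miss content the scoring model misjudges.

```python
# Simplified, hypothetical LAION-style records: each sample points to an
# image on the open web rather than containing the image itself.
samples = [
    {"url": "https://example.com/cat.jpg", "caption": "a cat", "unsafe_score": 0.01},
    {"url": "https://example.com/bad.jpg", "caption": "flagged", "unsafe_score": 0.97},
]

def keep(sample: dict, threshold: float = 0.5) -> bool:
    """Retain only samples whose predicted-unsafe score is below the threshold.

    The score comes from an automated classifier, so anything the
    classifier misses (a false negative) passes straight through.
    """
    return sample["unsafe_score"] < threshold

filtered = [s for s in samples if keep(s)]
```

A filter like this is only as good as the classifier that produced the scores, which is one reason undetected illegal material can survive release.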
“We collaborate with universities, researchers and NGOs to improve these filters and are currently working with the Internet Watch Foundation (IWF) to identify and remove content suspected of violating laws. We invite Stanford researchers to join LAION to improve our datasets and to develop efficient filters for detecting harmful content,” a spokesperson added.
Still, the issue highlights the challenge that working with openly available data presents, particularly as the Biden administration pushes federal agencies to develop their own models and conduct more AI research, per the White House’s October 2023 AI executive order. In addition to the immense trauma inflicted on victims, the potential use of child sexual abuse materials in AI systems raises legal concerns, since interacting with the material can constitute a crime.
The NSF doesn’t have plans to release dataset guidelines right now, but a spokesperson said the issues with LAION-5B exemplified why building alternative resources for AI development is important. The agency spokesperson pointed to the importance of the National Artificial Intelligence Research Resource, which would create new tools for academics to research and develop these kinds of technologies. A roadmap for the NAIRR was released by the White House early last year and the first meeting for the pilot program led by NSF and partners took place last November.
“This incident demonstrates the critical role for the independent research community in building a trustworthy and accountable AI ecosystem,” an NSF spokesperson said in an email to FedScoop. “It is essential for the research community to be funded to investigate, examine, and build trustworthy AI systems.”
“When established, the National Artificial Intelligence Research Resource (NAIRR) is envisioned to provide the academic community with access to trustworthy resources and tools that are needed to move towards a more equitable and responsible AI ecosystem,” they added.
Illicit, dangerous, and disturbing content is frequently included in open web databases. But child sexual abuse material, often referred to as CSAM, presents a specific challenge: searching for or looking at the material within a dataset, even to remove it, raises legal complexities. Along with government investigators, only certain organizations are legally allowed to study the presence of CSAM on the internet.
Researchers instead use technical workarounds, such as a form of hashing, to determine whether known illegal images appear in a dataset without viewing the images themselves, as the Stanford report did.
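The core idea of hash-based matching can be sketched in a few lines: compute a digest of each image's raw bytes and compare it against a list of digests of known illegal images supplied by an authorized clearinghouse. The hash set below is a placeholder, and real systems typically use perceptual hashes (such as Microsoft's PhotoDNA) that tolerate resizing and re-encoding, rather than the exact-match cryptographic hash shown here.

```python
import hashlib

# Hypothetical set of known-bad hashes, as would be supplied by an
# authorized clearinghouse. The value below is just the MD5 of an
# empty byte string, used here as a stand-in for demonstration.
KNOWN_HASHES = {
    "d41d8cd98f00b204e9800998ecf8427e",
}

def flag_if_known(image_bytes: bytes) -> bool:
    """Return True if the image's digest matches a known-bad entry.

    Only the hash is compared; the image content is never decoded,
    rendered, or viewed by the person running the check.
    """
    return hashlib.md5(image_bytes).hexdigest() in KNOWN_HASHES
```

Because matching happens purely on digests, a researcher can report how many entries in a dataset correspond to known illegal images without ever possessing or viewing the images themselves.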
“If you’re pulling all of this material off of the web into a dataset, then you’re also scraping a lot of really undesirable content because it is a reflection of human activity online and humans aren’t always doing great things, right?” said Eryk Salvaggio, an AI researcher who has written about this issue for Tech Policy Press.
“You throw out a net into the ocean, you catch fish, but you’re also going to pick up garbage,” Salvaggio added. “In this case, that garbage is highly traumatic and dangerous material that humans have posted online.”
The long-term impacts of a dataset like this could be wide-ranging. A primer on generative AI published by the Government Accountability Office in May of last year pointed out that AI image generators have been built using the LAION dataset. Beyond the concern that datasets can include child sexual abuse material, there is the added risk that such material ends up shaping the output of generative AI systems trained on it.
The risk for federal agencies
The LAION-5B incident raises questions for federal agencies looking to support AI research. NSF, which has a series of AI research initiatives and supports a network of National AI Research Institutes, does not track what datasets are used on the specific projects pursued by its principal investigators. The Department of Energy, which oversees the national lab system, declined to provide a comment.
The Intelligence Advanced Research Projects Activity “is aware that tainted datasets pose serious risks, which is why the data used in IARPA’s programs undergo review by Institutional Review Boards to ensure that it meets quality, legal, and privacy standards,” a spokesperson for the organization, which is housed in the Office of the Director of National Intelligence, said in an email to FedScoop.
“ODNI’s Civil Liberties, Privacy, and Transparency Team also routinely reviews and monitors research programs to confirm that they meet these benchmarks. If problematic data were to be identified, IARPA would take immediate steps to remediate it,” the spokesperson added.
The LAION-5B database analyzed by Stanford was sponsored by Hugging Face, a French-American AI company, and Stability AI, which made the image-generator Stable Diffusion, according to an appendix released with the paper. LAION told FedScoop that it did not have a relationship with the NSF.
Still, there is a real risk that federally funded researchers could use tools similar to LAION-5B. Many research institutions have cited LAION-5B in their work, according to Ritwik Gupta, an AI researcher based at the University of California at Berkeley who shared a database of such institutions with FedScoop.
Notably, a researcher affiliated with the NSF-funded Institute for Foundations of Machine Learning (IFML), an AI Research Institute based at the University of Texas, is listed as an author of the paper announcing the creation of the LAION-5B dataset. A blog post from the IFML about researcher collaborations details work related to LAION and LAION-5B, and the LAION-5B paper is also listed in the NSF’s public access repository. Neither the University of Texas at Austin, the researcher, nor IFML responded to requests for comment.
Inadvertently including CSAM in foundation models, whose development the Biden administration's executive order highlighted as a priority, is a risk if proper precautions are not taken, according to David Thiel, the chief technologist at the Stanford Internet Observatory and author of the analysis of LAION-5B.
“Images need to be sourced appropriately and scanned for known instances of CSAM and [Non-Consensual Intimate Images], as well as using models to detect and categorize potentially explicit material and/or imagery of children,” Thiel said.
Gupta, the Berkeley researcher, expressed a similar sentiment, saying that “the NSF, and all other government agencies which are funding AI research must thoroughly mandate and enforce the implementation of the [National Institute of Standards and Technology’s] Risk Management Framework into their grants. The NIST RMF is vague when it comes to screening for CSAM, but it largely covers how any dataset curated for AI should be properly acquired.”
NIST did not respond to a request for comment by the time of publication.