In addition to its extensive collection of books, the Institutional Data Initiative says it is working with the Boston Public Library to scan millions of newspaper articles that are now in the public domain, and that it is open to similar collaborations in the future. Exactly how the books dataset will be released has not yet been determined. The Institutional Data Initiative has asked Google to work with it on public distribution, but the details are still being worked out. Kent Walker, Google’s president of global affairs, said in a statement that the company is “proud to support” the project.
However IDI’s dataset is ultimately released, it will join a growing number of projects, startups, and initiatives that promise to give companies access to substantial, high-quality AI training material without the risk of running into copyright disputes. Companies such as Calliope Networks and ProRata have emerged to issue licenses and set up compensation schemes designed to reward creators and rights holders for providing AI training data.
There are other new public domain projects as well. Last spring, French AI startup Pleias launched its own public domain dataset, Common Corpus, which contains an estimated 3 to 4 million books and periodicals, according to project coordinator Pierre-Carl Langlais. Backed by the French Ministry of Culture, Common Corpus has been downloaded more than 60,000 times this month alone on the open source AI platform Hugging Face. Last week, Pleias announced it was releasing its first set of large language models trained on the dataset, which Langlais told WIRED will be the first models “trained exclusively on open data and compliant with the [EU] AI Act.”
Efforts are also underway to create similar image datasets. This summer, AI startup Spawning released Source.Plus, which includes public domain images from Wikimedia Commons as well as a variety of museums and archives. Major cultural institutions, like New York’s Metropolitan Museum of Art, have also long made their archives publicly accessible through independent projects.
Ed Newton-Rex, a former Stability AI executive who now runs a nonprofit that certifies ethically trained AI tools, says the growth of these datasets shows there is no need to steal copyrighted material to build high-performing, high-quality AI models. OpenAI previously told British lawmakers that it would be “impossible” to create products like ChatGPT without using copyrighted material. “Large public domain datasets like these further undermine the ‘necessity defense’ that some AI companies use to justify scraping copyrighted works to train their models,” Newton-Rex said.
But Newton-Rex remains skeptical about whether IDI and projects like it will actually change the status quo of AI training. “These datasets will only have a positive impact if they are used to replace scraped copyrighted works, perhaps in combination with licensing other data,” he says. “If they are simply added into the mix, as part of a dataset that also includes the unlicensed life’s work of creators, they will overwhelmingly benefit AI companies.”
Updated 12/24, 11:18 a.m. ET: This story has been updated with comment from Google.