Meta has opened up a machine learning resource that could one day supplant Wikipedia as the world’s largest publicly available knowledge-checking database.
Dubbed Sphere, it can be used to perform knowledge-intensive natural language processing, or KI-NLP, we're told. Concretely, this means it can answer complicated questions posed in natural language and find sources to back up assertions.
An example given of its use is to ask Sphere, “Who is Joëlle Sambi Nzeba?” Wikipedia has no entry for her, but Sphere said she was “born in Belgium and raised partly in Kinshasa (Congo). She currently lives in Brussels. She is a writer and slammer, alongside her activism in a feminist movement,” and linked to a website where it found this information about her work.
Wikipedia has pretty much served as the corpus of record, Meta eggheads wrote in a paper discussing Sphere’s design, saying the volunteer-run uber-wiki is “accurate, well-structured, and small enough to be used easily in test environments.”
Seeking to build something bigger and better than Wikipedia, Meta gathered content from all over the web – sans wikipedia.org – to form a “universal, unorganized and unstructured source of knowledge for several KI-NLP tasks at a time”. The result is Sphere, which is more or less a mountain of processed data that can be queried using a bunch of machine learning tools.
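Querying a mountain of data like this typically means dense retrieval: passages and questions are turned into vectors and matched by similarity. The sketch below is a toy illustration of that general idea in plain NumPy, not Meta's actual Sphere pipeline; the word-hashing "embedder" is a stand-in for the trained neural encoder a real system would use.

```python
import numpy as np

def embed(text, dim=256):
    """Toy embedding: hash word counts into a fixed-size unit vector.
    A production system would use a trained neural encoder instead."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word.strip("?.,!")) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# A tiny stand-in corpus of web passages
passages = [
    "Joelle Sambi Nzeba is a writer and slam poet based in Brussels.",
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "Dense retrieval matches questions and passages by vector similarity.",
]
index = np.stack([embed(p) for p in passages])  # one row per passage

def search(query, k=1):
    """Return the top-k passages by cosine similarity to the query."""
    scores = index @ embed(query)  # dot product of unit vectors
    top = np.argsort(scores)[::-1][:k]
    return [(passages[i], float(scores[i])) for i in top]
```

Asking `search("Who is Joelle Sambi Nzeba?")` surfaces the biographical passage first, since it shares the most word mass with the query.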
The team adds that Sphere “can match and surpass Wikipedia-based baselines” on some tasks in the KILT benchmark for knowledge-intensive language tasks. That is, Sphere performs better than AI systems built on Wikipedia content.
The main goal of Sphere was to see what effect replacing Wikipedia as a source had on the performance of knowledge-intensive systems. Although the team reported some issues, Sphere's performance indicates that, at the very least, it can add value to KI-NLP tasks beyond what Wikipedia corpora can offer.
The researchers behind Sphere say their work marks “the first time a general-purpose search index has improved language models on common-sense tasks.”
Sphere isn’t the only AI system Meta has released on GitHub: last week it released NLLB-200, the first translation AI to cross the 200-language threshold, or so the Facebook parent claims. Like Sphere, NLLB-200 has been put to work on Wikipedia: the former to automatically verify citations in edited articles, and the latter to improve the translation of pages into less commonly spoken languages.
When moving to a web corpus, we no longer have the certainty that any document is good, truthful or unique
Sphere goes beyond similar web corpora in terms of scale, consisting of 906 million passages across 134 million documents. The next largest in terms of passages/documents is the Internet-Augmented Dialogue generator, which pulls data from 250 million passages and 109 million documents.
But the internet comes with no quality or accuracy checks, which the researchers say is a key problem in actually deploying this thing. “Using Wikipedia as a source of knowledge allows researchers to assume the high quality of documents in the corpus. When transitioning to a web corpus, we no longer have confidence that a document is good, truthful or unique,” the researchers wrote.
The creators of Sphere believe future efforts should focus on assessing the quality of retrieved data, detecting false claims and contradictions, prioritizing trustworthy sources, and deciding when not to answer a question for lack of information. You know, the things that would actually make it useful.
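To get a feel for what even the most basic corpus hygiene looks like, here is a minimal sketch of deduplication and quality filtering over raw passages. The thresholds and heuristics are made up for illustration; they are not Meta's actual filters, which remain open research questions per the paper.

```python
import hashlib
import re

def clean_corpus(passages, min_words=5, max_symbol_ratio=0.3):
    """Illustrative web-corpus filters: drop short fragments,
    symbol-heavy scraping debris, and exact duplicates
    (after lowercasing and stripping non-word characters)."""
    seen, kept = set(), []
    for text in passages:
        if len(text.split()) < min_words:
            continue  # too short to be a useful passage
        symbols = sum(1 for c in text if not (c.isalnum() or c.isspace()))
        if symbols / max(len(text), 1) > max_symbol_ratio:
            continue  # mostly markup or symbols, likely debris
        key = hashlib.sha1(re.sub(r"\W+", "", text.lower()).encode()).hexdigest()
        if key in seen:
            continue  # duplicate of an earlier passage
        seen.add(key)
        kept.append(text)
    return kept
```

Note this catches only exact (normalized) duplicates and obvious junk; detecting falsehoods, contradictions, and near-duplicates, as the researchers call for, is far harder.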
If it succeeds in turning Sphere into a white-box AI serving reliable and trustworthy information, Meta said, Sphere “could be the next big break in NLP.” ®