Meta AI, the research and development team of Meta, recently developed a neural-network-based system called SIDE, capable of scanning hundreds of thousands of Wikipedia citations at a time and checking whether they really support the corresponding content.
Wikipedia is a free, multilingual online encyclopedia written and maintained by volunteers through open collaboration and a wiki-based editing system. It has some 6.5 million articles. Because Wikipedia is crowdsourced, it generally requires facts to be substantiated: quotations, controversial statements, and contentious material about living people must include a citation. Volunteers double-check Wikipedia’s footnotes, but as the site continues to grow, it is hard to keep pace with the more than 17,000 new articles added each month. Readers often question the accuracy of the entries they read. Human editors need the help of technology to identify gibberish or statements that lack citations, but determining whether a source actually backs up a claim is a complex task for AI, because it requires deep understanding to perform an accurate analysis.
To this end, the Meta AI research team created a new dataset, Sphere, of 134 million public web pages (divided into 906 million passages of 100 tokens each), an order of magnitude more data than the knowledge sources considered in current NLP research, and significantly more intricate than any previously used for this type of research. The next-largest dataset in terms of passages and documents is the one used for Internet-Augmented Dialogue Generation, which draws on 250 million passages and 109 million documents.
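As a rough illustration of what dividing a page into 100-token passages means, here is a minimal Python sketch. It assumes simple whitespace tokenization, which is almost certainly not the tokenizer used to build the dataset, and the function name split_into_passages is hypothetical.

```python
def split_into_passages(text, passage_len=100):
    """Split text into consecutive passages of at most `passage_len` tokens.

    Assumption: whitespace-separated words stand in for real tokens.
    """
    tokens = text.split()
    return [
        " ".join(tokens[start:start + passage_len])
        for start in range(0, len(tokens), passage_len)
    ]


# A toy "web page" of 250 tokens yields two full passages and one shorter one.
page = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(page)
print([len(p.split()) for p in passages])  # [100, 100, 50]
```

Fixed-size passages keep every unit small enough to be indexed and scored independently, which is what makes retrieval over hundreds of millions of passages tractable.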
This new dataset is the knowledge source for the neural network model, which finds citations that seem irrelevant and suggests a more applicable source, even pointing to the specific passage that supports the assertion. Natural language understanding (NLU) techniques are used to perform the tasks that allow the system to evaluate a citation. In NLU, a model translates human language (words, sentences, or paragraphs) into complex mathematical representations. The tool is designed to compare these representations to determine whether one assertion supports or contradicts another.
SIDE’s decision flow, from a Wikipedia claim to a new citation suggestion, works as follows:
Figure: SIDE workflow. Source: "Improving Wikipedia’s verifiability with AI".
The claim is first sent to the Sphere retrieval engine, which produces a list of potential candidate documents from the Sphere corpus. The sparse retrieval subsystem uses a seq2seq model to translate the citation context into query text, then matches the resulting query (a sparse bag-of-words vector) against a BM25 index of Sphere. The seq2seq model is trained on data from Wikipedia itself: the target queries are defined as the web page titles of existing Wikipedia citations. The dense retrieval subsystem is a neural network that learns from Wikipedia data to encode the citation context into a dense query vector. This vector is then matched against the vector encodings of all passages in Sphere, and the closest ones are returned.
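To make the two retrieval styles concrete, here is a small self-contained Python sketch. It is only an illustration under stated assumptions: the BM25 function uses the common Okapi formulation with default-style parameters (k1=1.5, b=0.75) chosen here, the dense vectors are hand-made toy values rather than outputs of a trained encoder, cosine similarity is assumed as the closeness measure, and all function names are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, passages, k1=1.5, b=0.75):
    """Score each passage against the query with the Okapi BM25 formula."""
    if not passages:
        return []
    docs = [p.lower().split() for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of passages that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query.lower().split()):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dense_top_k(query_vec, passage_vecs, k=2):
    """Return indices of the k passage vectors closest to the query vector."""
    ranked = sorted(
        range(len(passage_vecs)),
        key=lambda i: cosine(query_vec, passage_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


passages = [
    "the eiffel tower is in paris",
    "bananas are rich in potassium",
    "paris is the capital of france",
]
sparse = bm25_scores("eiffel tower paris", passages)
print(max(range(len(passages)), key=sparse.__getitem__))  # 0

# Toy 2-d encodings; a real system would use a learned encoder.
print(dense_top_k([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]))  # [0, 2]
```

The sparse path rewards exact term overlap, while the dense path can match passages that are close in meaning without sharing words, which is why the two subsystems complement each other.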
The verification engine then ranks the candidate documents and the original citation with respect to the claim. A neural network takes the claim and a document as input, and predicts how well the document supports the claim. For efficiency, it operates at the passage level and calculates a document’s verification score as the maximum over its per-passage scores. Verification scores are calculated by a BERT transformer that takes the concatenated claim and passage as input.
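The max-over-passages aggregation can be sketched in a few lines of Python. The scorer below is a deliberately crude, hypothetical stand-in (token overlap) for the BERT verifier, used only so the example runs without a trained model; the function names are assumptions as well.

```python
def overlap_score(claim, passage):
    """Hypothetical stand-in for the BERT verifier.

    Returns the fraction of claim tokens that also appear in the passage.
    """
    claim_tokens = set(claim.lower().split())
    passage_tokens = set(passage.lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & passage_tokens) / len(claim_tokens)


def document_score(claim, passages, score_fn=overlap_score):
    """Document verification score: the maximum of its per-passage scores."""
    return max((score_fn(claim, p) for p in passages), default=0.0)


claim = "the tower opened in 1889"
doc = ["it is made of iron", "the tower opened in 1889 to visitors"]
print(document_score(claim, doc))  # 1.0
```

Taking the maximum means a long page is judged by its single most supportive passage, so unrelated text elsewhere on the page does not dilute the score.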
In other words, the model creates and compares mathematical representations of the meaning of whole statements rather than of individual words. Because web pages can contain long stretches of text, the models evaluate content in chunks and consider only the most relevant passage when deciding whether to recommend a URL.
The indices pass potential sources to an evidence-ranking model, which compares the new text with the original citation. Using fine-grained language understanding, the model ranks the cited source and the retrieved alternatives by the likelihood that they support the claim. If the original citation is not ranked above the candidate documents, a new citation from the retrieved candidates is suggested.
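The final decision rule can be illustrated with a short Python sketch. It assumes verification scores have already been computed, treats a tie as "not ranked above" (an assumption, since the text does not say how ties are handled), and uses hypothetical example URLs and a hypothetical function name.

```python
def suggest_citation(original_score, candidate_scores):
    """Return the best candidate URL if the original is not ranked above it.

    candidate_scores maps a candidate URL to its verification score.
    Returns None when the original citation should be kept.
    """
    if not candidate_scores:
        return None
    best_url = max(candidate_scores, key=candidate_scores.get)
    # Assumption: a tie counts as the original not being ranked above.
    if candidate_scores[best_url] >= original_score:
        return best_url
    return None


candidates = {"https://example.org/a": 0.9, "https://example.org/b": 0.2}
print(suggest_citation(0.4, candidates))   # https://example.org/a
print(suggest_citation(0.95, candidates))  # None
```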
Sphere has been tested on the Knowledge Intensive Language Tasks (KILT) benchmark, and surpassed the state of the art on two of its tasks.
A computer system with human-level understanding of language has yet to be built, but projects like this, which teach algorithms to understand dense material with an ever-increasing degree of sophistication, are helping AI make sense of the real world. The Meta AI research and development team says the goal of this work is to create a platform that helps Wikipedia editors systematically spot citation issues and quickly fix the citation or correct the content of the corresponding article at scale. SIDE is open source and available for testing.