Key Initiatives Taken to Overcome Wikipedia Challenges


Wikipedia celebrated its 20th anniversary last year and is now the seventh most popular website in the world. It hosts more than 55 million articles and receives 15 billion visits every month. Unlike most platforms, which struggle to be inclusive across languages, Wikipedia is available in 309 languages.

That said, the platform faces several challenges, including inaccuracies, a lack of inclusiveness, and a gender gap. The Wikimedia Foundation is working to overcome these issues, and third-party organizations also offer innovative solutions. Here we list some of these initiatives.

Meta women biographies

Only about 20% of Wikipedia’s biographies are about women, and the share is even smaller for women from intersectional groups. The latest initiative from Meta (formerly Facebook) relies on AI to address this imbalance. The system can research and write first drafts of Wikipedia-style biographical entries. The open-source tool is an end-to-end AI model that automatically creates biographical articles about important real-world public figures. The model searches websites for relevant information, writes an entry about the person, and cites its sources. The FAIR team is also releasing a new dataset for evaluating the model’s performance on 1,527 biographies of women from marginalized groups.

The model uses a retrieval-augmented generation architecture based on large-scale pre-training. First, a retrieval step identifies only the relevant information in the evidence gathered for the topic. Then, the generation module creates the text, followed by the citation module, which builds the bibliography referring to the sources. The biography is created section by section, with the process repeating for each section and a caching mechanism providing better context.
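The retrieve, generate, and cite loop described above can be illustrated with a toy sketch. This is not Meta's model: the word-overlap retriever, the placeholder generator, the `Document` and `Biography` types, the example URLs, and the made-up subject "Jane Doe" are all assumptions chosen only to show how the steps fit together section by section.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    url: str
    text: str


@dataclass
class Biography:
    sections: dict = field(default_factory=dict)
    bibliography: list = field(default_factory=list)


def retrieve(query, documents, top_k=1):
    """Rank documents by word overlap with the query (stand-in for a learned retriever)."""
    query_words = set(query.lower().split())
    scored = []
    for doc in documents:
        overlap = len(query_words & set(doc.text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def generate(evidence, cache):
    """Placeholder generator: a real system would call a language model here.

    The cache holds text used in earlier sections so it is not repeated.
    """
    fresh = [doc.text for doc in evidence if doc.text not in cache]
    cache.extend(fresh)
    return " ".join(fresh)


def write_biography(subject, headings, documents):
    """Build the biography one section at a time, collecting sources as it goes."""
    bio = Biography()
    cache = []  # context carried across sections
    for heading in headings:
        evidence = retrieve(f"{subject} {heading}", documents)
        bio.sections[heading] = generate(evidence, cache)
        for doc in evidence:  # citation step: record each source once
            if doc.url not in bio.bibliography:
                bio.bibliography.append(doc.url)
    return bio


# Hypothetical evidence about a made-up person.
docs = [
    Document("https://example.org/a", "Jane Doe began her career as a chemist"),
    Document("https://example.org/b", "Jane Doe spent her early life in Lagos"),
    Document("https://example.org/c", "Unrelated text about weather"),
]
bio = write_biography("Jane Doe", ["early life", "career"], docs)
for heading, text in bio.sections.items():
    print(heading, "->", text)
print("Sources:", bio.bibliography)
```

In this sketch each section triggers its own retrieval, so the bibliography only lists pages that actually contributed text, and the unrelated page is never cited.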

Wikimedia Image/Caption Matching Contest on Kaggle

Wikimedia is a global movement whose mission is to provide free educational content. However, as Wikimedia explains, “Wikipedia articles lack images and Wikipedia images lack captions.” To alleviate this problem, the Wikipedia Image/Caption Matching Contest was introduced on Kaggle earlier this year. It challenges participants to develop systems that automatically associate images with their corresponding captions and article titles.
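One common way to frame such a matching task is to embed images and captions in a shared vector space and rank captions by similarity. The contest does not prescribe this method; the sketch below is only an illustration, and the three-dimensional vectors and caption strings are hypothetical stand-ins for the output of a real image and text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_captions(image_vec, caption_vecs):
    """Return captions sorted from best to worst match for the image."""
    scores = {caption: cosine(image_vec, vec) for caption, vec in caption_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical embeddings; a real system would compute these with trained encoders.
image_vec = [0.9, 0.1, 0.0]
caption_vecs = {
    "Eiffel Tower at night": [0.8, 0.2, 0.1],
    "A bowl of ramen": [0.0, 0.1, 0.9],
    "Paris skyline": [0.5, 0.5, 0.2],
}
ranking = rank_captions(image_vec, caption_vecs)
print(ranking)
```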

Google WIT Dataset

In September 2021, Google released the WIT (Wikipedia-based Image Text) dataset in partnership with the Wikimedia Foundation to help fill Wikipedia’s knowledge gaps. WIT is a large multimodal dataset created by extracting the various text selections associated with an image from Wikipedia articles and Wikimedia image links. Rigorous filtering was applied throughout development to keep the image-text sets high quality. The aim was a large, high-quality multilingual dataset with varied content. WIT improves on previous datasets in both language coverage and size: the curated set contains 37.5 million feature-rich image-text examples with 11.5 million unique images across 108 languages. Google believes this dataset will help researchers build better multimodal multilingual models and identify better representation techniques, thereby improving machine learning models on real-world tasks involving visio-linguistic data.

The hidden voices of IIT Madras

In March 2022, IIT Madras’s research wing, the Robert Bosch Center for Data Science and Artificial Intelligence (RBCDSAI), collaborated with business consulting firm SuperBloom Studios to launch “Hidden Voices”, an initiative to reduce the gender data gap in digital sources. Its primary focus is Wikipedia. The goal is to automatically generate biographies of notable women to help close the gender gap in data, taking into account factors such as the gender and interests of editors and contributions from external sources. Through information-theoretic approaches, ML-assisted self-identification, validation against external sources, and textual analysis methods, Hidden Voices aims to automatically generate a first Wikipedia-style biography draft. The initiative will then use this approach to generate Wikipedia articles for notable women in STEMM (Science, Technology, Engineering, Medicine and Management).


Wiki-Reliability

Another Wikimedia initiative, Wiki-Reliability is a large-scale dataset for studying the reliability of content on Wikipedia. Despite the platform’s reach, it is still maintained by a community of volunteer editors who keep its content accurate. Wiki-Reliability is the first dataset of English-language Wikipedia articles annotated with a wide range of content reliability issues. It was built on Wikipedia templates, the tags that expert Wikipedia editors use to indicate content issues. Its authors propose an approach for labeling nearly a million samples of Wikipedia article revisions as positive or negative, and they describe the downstream tasks this data enables. The dataset can be used to train large-scale models for content reliability prediction.
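A minimal sketch of template-based labeling follows. It rests on an assumption that is not spelled out above: the revision in which a reliability template is added is treated as positive (the issue is present) and the revision in which it is later removed is treated as negative (the issue is resolved). The function name, the revision IDs, and the history itself are hypothetical.

```python
def label_revisions(history, template):
    """Label the revisions where a template is added or removed.

    history: list of (revision_id, set_of_templates) in chronological order.
    Returns a list of (revision_id, label) pairs.
    """
    labels = []
    previously_tagged = False
    for rev_id, templates in history:
        tagged = template in templates
        if tagged and not previously_tagged:
            labels.append((rev_id, "positive"))  # template was just added
        elif previously_tagged and not tagged:
            labels.append((rev_id, "negative"))  # template was just removed
        previously_tagged = tagged
    return labels


# Made-up revision history for one article.
history = [
    (101, set()),
    (102, {"unreferenced"}),
    (103, {"unreferenced"}),
    (104, set()),
]
labels = label_revisions(history, "unreferenced")
print(labels)
```

Pairing each addition with its later removal gives a classifier two versions of the same article that differ mainly in whether the flagged issue is present.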

