This Tamil Wikipedia superhero has added over 22,500 articles!


Wikipedia has become a necessary touchpoint for online research, whether you like it or not: it’s the fourth most visited online destination in the world, behind Facebook, YouTube and Google. But where Google has Sundar Pichai, Facebook has Mark Zuckerberg, and other big tech platforms have familiar faces, Wikipedia’s brand is not personified by any particular individual. Rather, Wikipedia’s ubiquity rests on the shoulders of anonymous community contributors like Neechalkaran, who is carving out a Tamil wiki niche like no other.

Originally from Madurai, Tamil Nadu, Neechalkaran reconnected with reading and writing Tamil after a job at Infosys took him to Pune, Maharashtra. “As I was away from my native Tamil community, I kept in touch with Tamil by reading Tamil blogs online at the time, and eventually started my own Tamil blog on Blogspot,” explained Neechalkaran, who even went on to write Tamil poetry.

For the love of Tamil

Soon, he discovered Tamil Wikipedia and started contributing articles in his spare time, signing up there in 2010. A year or two later, he discovered a problem. “People who read my articles on Tamil Wikipedia and even on my own blog started saying that my articles contained a lot of grammatical errors,” Neechalkaran admitted candidly. But far from being discouraged, Neechal took it as a challenge to be solved through technology.

“It was really the turning point for me, when I spent time learning Tamil grammar and built Naavi, a Tamil spell checker focused on sandhi rules,” Neechal recalls. Tamil readers all over the world, including a few teachers, appreciated the spell checker, he said. This boosted his self-confidence and pushed him to learn not only JavaScript but also Python and C#, and to focus on solving some of the problems Tamil Wikipedia volunteers faced. At the time, Tamil Wikipedia contributors could write articles, but found it difficult to perform housekeeping tasks, like editing certain values across multiple pages or bulk editing: many small tasks that needed to be automated, according to Neechalkaran.
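To give a flavour of what a sandhi-aware checker looks for, here is a toy sketch of one well-known Tamil rule (vallinam migum): after certain trigger words, a following word beginning with a hard consonant takes a doubled consonant, so “அந்த பையன்” should be written “அந்தப் பையன்”. This is an illustrative stand-in, not Naavi’s actual code, whose rule coverage is far more complete.

```python
# Toy sandhi check: flag word pairs where a doubled hard consonant
# is expected but missing. Illustrative only; real checkers encode
# many more trigger classes and exceptions.

TRIGGERS = {"அந்த", "இந்த", "எந்த"}   # demonstratives that force doubling
HARD_INITIALS = {"க", "ச", "த", "ப"}  # vallinam (hard) consonants

def sandhi_warnings(text):
    """Return (prev_word, word) pairs likely missing a doubled consonant."""
    words = text.split()
    issues = []
    for prev, word in zip(words, words[1:]):
        if prev in TRIGGERS and word and word[0] in HARD_INITIALS:
            issues.append((prev, word))
    return issues

print(sandhi_warnings("அந்த பையன் வந்தான்"))   # flags the pair (அந்த, பையன்)
print(sandhi_warnings("அந்தப் பையன் வந்தான்"))  # doubling present, no issue
```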

It was in 2015-2016, when the government of Tamil Nadu was seeking to digitize and publish data at the village level, that Tamil Wikipedia got involved in the project, Neechalkaran said. “The Tamil Wikipedia community had several discussions with government officials on how best to publish over 13,000 raw data points at the panchayat level: this included assembly names, population, demographics, literacy rates, the number of buildings in an area, and so on,” Neechal said. “All the data was in Excel format, without any sentences. So internally, within the Tamil Wikipedia community, we discussed and built an article model with proper grammar, automating NLP-related tasks such as singular and plural handling, and arrived at a solid article template. We uploaded only 12,000 data points as articles, discarding the other raw data points due to discrepancies in the data,” he explained.
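The article model described above is, in essence, a fill-in-the-blanks template applied to each cleaned spreadsheet row. The sketch below uses hypothetical field names and an English rendering for readability; the community’s real template generated Tamil sentences, with their NLP tooling handling agreement such as singular versus plural.

```python
# Sketch of template-based article generation: one data row -> one
# short, grammatically fixed article. Field names are illustrative.

TEMPLATE = (
    "{name} is a village panchayat in {district} district, Tamil Nadu. "
    "As of the census, it had a population of {population} and a "
    "literacy rate of {literacy}%."
)

def row_to_article(row):
    """Turn one cleaned data row (a dict) into article text."""
    return TEMPLATE.format(**row)

row = {"name": "Example Panchayat", "district": "Madurai",
       "population": 4210, "literacy": 81.2}
print(row_to_article(row))
```

Rows with missing or inconsistent fields would fail this formatting step, which mirrors why roughly a thousand raw data points were discarded.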

Now that they had a proof of concept for automating government records on Tamil Wikipedia, Neechalkaran and the community went on to automate and publish the Hindu Religious and Charitable Endowments Council’s temple records, containing 24,000 items covering the names, addresses, locations and deities of Hindu temples across Tamil Nadu, in 2016-17.

Build the bot

Raw data in non-Unicode format in Google Sheets

Neechalkaran pointed out that before publishing the data received from the Tamil Nadu government, it first had to be sanitized to fit Wikipedia-compatible data standards. He built a tool to convert the incoming data to Unicode and save it to his Google Drive, then shared the Drive links with the entire Tamil Wikipedia community, which Neechal said included Tamil expats from Sri Lanka, Malaysia and elsewhere who also contributed their time and effort.
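The Unicode conversion step can be pictured as a substitution table from legacy glyph codes to Unicode Tamil. The two mappings below are invented stand-ins purely for illustration; real converters for legacy Tamil encodings such as TSCII or Bamini map hundreds of byte sequences with context-dependent reordering.

```python
# Toy legacy-to-Unicode converter. The mapping entries are hypothetical
# placeholders, not an actual encoding table.

LEGACY_TO_UNICODE = {
    "Ô": "தி",   # hypothetical legacy glyph -> Unicode Tamil
    "√": "ரு",   # hypothetical legacy glyph -> Unicode Tamil
}

def to_unicode(text):
    """Replace known legacy glyph sequences with Unicode Tamil text."""
    for legacy, uni in LEGACY_TO_UNICODE.items():
        text = text.replace(legacy, uni)
    return text

print(to_unicode("Ô√"))  # → "திரு"
```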

The next step was to integrate everything into the Neechal Bot, which was created using Google Apps Script. “I used Google Apps Script because I was more comfortable with JavaScript at the time, and Google Apps Script is essentially Google’s version of JavaScript. The fact that it integrates seamlessly with Google Sheets data also helped,” he explained. Much of the automation is built into Google Apps Script, allowing developers to start a project without worrying about web hosts or server infrastructure; it’s all free to get started.


Final Wikipedia article created by Neechal’s bot

“The bot I developed helped convert raw Google Sheets data into a readable format, and linking it to the Wikipedia API allowed me to create Tamil Wikipedia articles,” Neechal said, explaining how he had to request API access and bot approval from Wikipedia. He also drew on the Tamil NLP library he had built a few years earlier.
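On the Wikipedia side, the standard route for a bot like this is the MediaWiki Action API’s `edit` module: a POST carrying the page title, the generated wikitext, and a CSRF token, with approved bot accounts marking their edits with the `bot` flag. The sketch below only builds such a request payload; authentication and the actual HTTP call (which Neechal’s bot handles in Apps Script) are omitted.

```python
# Build a MediaWiki action=edit payload for creating/updating a page.
# Sending it requires a logged-in session and a real CSRF token
# (obtained via action=query&meta=tokens), which are out of scope here.

API_URL = "https://ta.wikipedia.org/w/api.php"

def build_edit_payload(title, wikitext, csrf_token,
                       summary="Bot: create article from dataset"):
    """Assemble the POST parameters for a MediaWiki edit request."""
    return {
        "action": "edit",
        "title": title,
        "text": wikitext,
        "summary": summary,
        "bot": True,          # mark as a bot edit (approved accounts only)
        "token": csrf_token,  # CSRF token from the tokens endpoint
        "format": "json",
    }

payload = build_edit_payload("Example Panchayat", "…wikitext…", "dummy-token")
print(payload["action"], payload["title"])
```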

Then something amazing happened: the power of community began to flourish. Thousands of visitors started organically reaching the Tamil Wikipedia pages created by Neechal’s bot.

“We had just created basic, simple articles in Tamil on Wikipedia, but people found enough value in them to come and edit them and add more data. For example, most of our articles did not contain photographs, but readers who had contextual photographs or images related to a page uploaded them freely to improve the article. They provided the ultimate validation of the basic idea of creating these articles on Wikipedia: that anyone from Tamil Nadu can contribute to these wiki pages and improve the data on their own village panchayat or constituency,” Neechal remarked.


Neechalkaran being felicitated in Canada, 2015

Neechalkaran’s “Neechal Bot”, which can undertake page creation, editing and statistics-gathering activities, has created over 22,500 articles with community consensus; by a conservative estimate, it would have taken 22 humans over three years to create that many Wikipedia pages manually. The bot can also perform many housekeeping activities on the Tamil, Bhojpuri and Hindi Wikipedias and other Wikimedia projects automatically. It collects periodic statistics in these languages and can update them on the corresponding Wikipedia pages.

Another brilliant tool Neechal has created is VaaniNLP, a one-of-a-kind open-source Tamil Python library. It is used by Thiral, a startup building an AI-based Tamil news aggregator. His latest work is a Tamil chatbot for Wikidata, which has executed over 72,000 edits so far across three languages: Tamil, Hindi and Bhojpuri.

For more technology news, product reviews, sci-tech features and updates, keep reading Digit.in.
