Wikipedia has become a necessary touchpoint for online research, whether you like it or not: it is the fourth most visited online destination in the world, behind Facebook, YouTube and Google. But where Google has Sundar Pichai, Facebook has Mark Zuckerberg, and other big tech platforms have familiar faces, Wikipedia’s brand is not personified by any particular individual. Rather, Wikipedia’s ubiquity rests on the shoulders of anonymous community contributors like Neechalkaran, who is carving out a Tamil wiki niche like no other.
Originally from Madurai, Tamil Nadu, Neechalkaran started reading and writing in Tamil after he took a job with Infosys that required moving to Pune, Maharashtra. “As I was away from my native Tamil community, I kept in touch with Tamil by reading Tamil blogs online at the time, and eventually started my own Tamil blog on Blogspot,” explained Neechalkaran, who even started writing Tamil poetry.
For the love of Tamil
Soon, he discovered Tamil Wikipedia and started contributing articles to it in his spare time, signing up there in 2010. A year or two later, he discovered a problem. “People who read my articles on Tamil Wikipedia and even on my own blog started saying that my articles contain a lot of grammatical errors,” Neechalkaran admitted candidly. But far from being discouraged, Neechal took it as a challenge to be solved through technology.
It was in 2015-2016, when the government of Tamil Nadu was seeking to digitize and publish village-level data, that the Tamil Wikipedia community got involved in the project, Neechalkaran said. “The Tamil Wikipedia community had several discussions with government officials on how best to publish over 13,000 raw data points related to the panchayat level. This included assembly names, population, demographics, literacy rate, the number of buildings in an area, etc.,” Neechal said. “All the data was in Excel format, without any sentences. So internally, within the Tamil Wikipedia community, we discussed and trained a model with proper grammar, automating NLP-related tasks such as handling singular and plural words, and we arrived at a perfect article model. We uploaded only 12,000 data points as articles, discarding the other raw data points due to discrepancies in the data,” he explained.
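The idea of turning spreadsheet rows into grammatical sentences can be illustrated with a minimal sketch. This is not the community's actual article model: the field names (`name`, `district`, `population`, `buildings`), the English sentence template and the placeholder row are all assumptions made for illustration, and the real work was done in Tamil, where number agreement is more involved.

```python
# Hypothetical sketch: render one spreadsheet row as a short article stub.
# Field names and the sentence template are illustrative assumptions only.

REQUIRED_FIELDS = ("name", "district", "population", "buildings")


def is_complete(row):
    """Return True when every required field is present and non-empty."""
    return all(row.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def pluralize(count, singular, plural):
    """Pick the singular or plural noun so the sentence agrees with the count."""
    noun = singular if count == 1 else plural
    return f"{count:,} {noun}"


def render_article(row):
    """Fill a fixed sentence template with the values from one row."""
    return (
        f"{row['name']} is a village panchayat in {row['district']} district. "
        f"It has {pluralize(row['population'], 'resident', 'residents')} "
        f"and {pluralize(row['buildings'], 'building', 'buildings')}."
    )


# Placeholder rows, not real census data.
rows = [
    {"name": "Example Panchayat", "district": "Example", "population": 1250, "buildings": 1},
    {"name": "Incomplete Panchayat", "district": "", "population": 300, "buildings": 40},
]

# Rows with missing values are skipped, mirroring how inconsistent data was discarded.
articles = [render_article(row) for row in rows if is_complete(row)]
```

Here only the first placeholder row produces an article, and the template chooses “building” over “buildings” because the count is one.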
Now that they had a proof of concept for automating government records on the Tamil Wikipedia, Neechalkaran and the community went on, in 2016-17, to automate and publish the Hindu Religious and Charitable Endowments Department’s temple records for Tamil Nadu: 24,000 items covering the names, addresses, places, deities and more of Hindu temples.
Build the bot
Raw data in non-Unicode format in Google Sheets
Neechalkaran pointed out that before the data received from the Tamil Nadu government could be published, it first had to be sanitized to fit Wikipedia-compatible data standards. He had to create a tool to convert the incoming data to Unicode and save it to his Google Drive. He also shared the links to his Google Drive with the entire Tamil Wikipedia community, which Neechal said was made up of Tamil expats from Sri Lanka, Malaysia and other places who also contributed their time and effort.
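The article does not say how the conversion tool works. One common way to convert text from a legacy font encoding is a lookup table applied longest-match first, sketched below. The mapping `TOY_MAP` is a made-up placeholder and does not correspond to any real Tamil font encoding; an actual converter would need the full table for the source encoding and may also have to reorder vowel signs.

```python
# Hypothetical sketch of table-driven conversion to Unicode.
# TOY_MAP is an invented placeholder, NOT a real legacy Tamil encoding.

TOY_MAP = {
    "ka": "க",
    "kaa": "கா",
    "ma": "ம",
    "la": "ல",
}


def convert(text, table):
    """Replace legacy sequences with Unicode, trying the longest keys first."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No mapping applies: keep the character (digits, spaces, punctuation).
            out.append(text[i])
            i += 1
    return "".join(out)


converted = convert("kaala 12", TOY_MAP)
```

Trying longer keys first matters: without it, “kaa” would be consumed as “ka” followed by a stray “a”.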
Final Wikipedia article created by Neechal’s bot
“The bot I developed helped convert raw Google Sheets data into a readable format, and linking it to the Wikipedia API allowed me to create Tamil Wikipedia articles,” Neechal said, explaining how he had to ask Wikipedia for API and bot access. He added that he also used the Tamil NLP library he had built a few years earlier.
Then something amazing happened, and the power of community began to flourish. Thousands of visitors started organically reaching the Tamil Wikipedia pages created by Neechal’s automated bot.
“We had just created basic and simple articles in Tamil on Wikipedia, but people found enough value in them to come and modify them and add more data. For example, most of our articles did not contain photographs, but people who had contextual photographs or images related to the page in question uploaded them freely to improve the article. They provided the ultimate validation of the basic idea of creating these articles on Wikipedia, precisely so that anyone from Tamil Nadu can always contribute to these Wiki pages and improve the data of their own village panchayat or constituency,” Neechal remarked.
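As a rough illustration of what creating an article through the Wikipedia API involves, the sketch below builds the form fields for a MediaWiki Action API `action=edit` request. The article does not describe which API calls Neechal’s bot actually makes, so this is an assumption: the helper name `build_edit_request`, the edit summary and the placeholder token are hypothetical, and nothing is sent over the network. A real bot would first log in with an approved bot account and fetch a CSRF token.

```python
# Hypothetical sketch: assemble a MediaWiki Action API edit request.
# Nothing is sent; a real bot needs an approved account and a real CSRF token.

API_URL = "https://ta.wikipedia.org/w/api.php"  # Tamil Wikipedia API endpoint


def build_edit_request(title, wikitext, csrf_token):
    """Return the POST fields for creating a new page via action=edit."""
    return {
        "action": "edit",
        "title": title,
        "text": wikitext,
        "summary": "Bot: create article from tabular data",  # illustrative summary
        "createonly": "1",  # do not overwrite a page that already exists
        "bot": "1",         # mark the edit as a bot edit
        "token": csrf_token,
        "format": "json",
    }


request_fields = build_edit_request(
    title="Example Panchayat",
    wikitext="Example Panchayat is a village panchayat in Example district.",
    csrf_token="PLACEHOLDER_TOKEN",
)
```

Setting `createonly` keeps the bot from clobbering pages that human editors have already written, which fits the community-consensus approach described in the article.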
Neechalkaran being honoured in Canada, 2015
Neechalkaran’s “Neechal Bot”, which can undertake page building, editing and statistics gathering activities, has created over 22,500 articles with community consensus. By a conservative estimate, it would have taken 22 people over three years to create as many Wikipedia pages manually. The bot can also perform many housekeeping activities on Tamil, Bhojpuri and Hindi Wikipedia and other Wikimedia projects automatically. It collects periodic statistics in these languages and can update them on the corresponding Wikipedia pages.
Another brilliant tool that Neechal has created is VaaniNLP, a one-of-a-kind open-source Tamil Python library. The tool is used by the startup Thiral, an AI-based Tamil news aggregator. His latest work is a Tamil chatbot for Wikidata, which has executed over 72,000 edits so far in three languages: Tamil, Hindi and Bhojpuri.
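The manual-effort estimate can be sanity-checked with simple arithmetic. The sketch below takes the article’s figures at face value (22,500 articles, 22 people, three years) and assumes, for illustration only, that work is spread evenly over 365 days a year.

```python
# Back-of-the-envelope check of the manual-effort estimate quoted above.
# Figures come from the article; the even 365-day spread is an assumption.

ARTICLES = 22_500
PEOPLE = 22
YEARS = 3


def articles_per_person_per_year(articles, people, years):
    """Average number of articles each person would write per year."""
    return articles / (people * years)


rate_per_year = articles_per_person_per_year(ARTICLES, PEOPLE, YEARS)
rate_per_day = rate_per_year / 365

print(f"{rate_per_year:.0f} articles per person per year")
print(f"{rate_per_day:.2f} articles per person per day")
```

That works out to roughly 341 articles per person per year, or a little under one article per person per day, sustained for three years.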
For more technology news, product reviews, sci-tech features and updates, keep reading Digit.in.