A paper entitled ‘A Survey of Konkani NLP Resources’ has been accepted for publication in Computer Science Review, a journal published by Elsevier. NT KURIOCITY gets more details.

RAMANDEEP KAUR | NT KURIOCITY

For the first time, a comprehensive survey of natural language processing (NLP) research on the Konkani language has been published in Computer Science Review, an Elsevier journal (impact factor 7.7). The journal is “aimed at a general computer science audience seeking a full and expert overview of the latest in computer science research”.

The paper is written by Annie Rajan, associate professor in the department of computer science at DCT’s Dhempe College. Her research supervisor, Ambuja Salgaonkar, associate professor at the University of Mumbai, and Ramprasad Joshi of BITS Pilani, Goa campus, are the co-authors.

NLP is a subfield of linguistics, computer science, and artificial intelligence concerned with the interaction between computers and human language. NLP technologies are key to letting billions of people around the globe access the facilities, learning resources, knowledge, and benefits of the internet, from education to communication and many other applications, in the language of their convenience.

Commonly used applications include part-of-speech (PoS) tagging; named entity recognition or NER (which locates and classifies named entities mentioned in unstructured text into pre-defined categories such as person names, organisations, and locations); sentiment analysis (identifying and categorising opinions expressed in a piece of text); morphological analysis (the study of words and their structure, including the identification of word stems from full word forms, eg the word ‘going’ has the stem ‘go’); translation; and segmentation.
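As a rough illustration of the stemming example (‘going’ to ‘go’), the Python sketch below strips a few common English suffixes. It is a toy under stated assumptions: the suffix list and the function name `toy_stem` are hypothetical, and it is not the method used in the paper or in Rajan’s Konkani tools.

```python
# Toy suffix-stripping stemmer for English (illustrative only).
# SUFFIXES and toy_stem are hypothetical names, not taken from the paper.
SUFFIXES = ("ing", "ed", "s")  # deliberately tiny and far from complete


def toy_stem(word: str, min_stem_len: int = 2) -> str:
    """Strip the first matching suffix if enough of the word remains."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        stem = lowered[: -len(suffix)]
        # Only strip when the remaining stem is long enough to be plausible.
        if lowered.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return lowered


if __name__ == "__main__":
    for w in ["going", "walked", "tools", "go"]:
        print(w, "->", toy_stem(w))
```

A real stemmer or morphological analyser for a language such as Konkani would likely need language-specific rules and a lexicon rather than a fixed suffix list like this one.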

Building the data sets behind these applications involves the active participation of both linguists and computer scientists: the linguists annotate the words of the language, and the computer scientists represent them in a form that the computer can decode and learn from.

However, while pursuing her MPhil (2014) in the Department of Computer Science at Goa University, Rajan realised that the Konkani language did not have any of the basic linguistic tools in a user-friendly format that could be used by a researcher or a general language speaker. She thus began her research in this area, continuing it in her PhD programme (2017) at the University of Mumbai.

“Morphological analyser was done by Shilpa Dessai from Pilar College but there was no tool wherein a lay person or any other researcher can go and put in the data and get an output. So, in my research work I have developed these tools (spelling checker, morphological analyser, stemmer) which are important for any kind of language development,” she says, adding that she is grateful to her MPhil guide, Jyoti Pawar (associate professor at Goa University), and her PhD guide, Ambuja Salgaonkar.

Rajan says that the research is aimed at enhancing resources and tools for the automatic processing of linguistic constructs in Konkani, the official language of Goa. “Automatic natural language processing (NLP) of Indian languages of small communities has been limited by the scarcity of language-specific digital linguistic resources. Creating and enriching the existing corpora for Konkani and providing annotations are fundamental requirements. We trained general-purpose models for sentiment analysis, named entity recognition, morphological analysis, and part of speech tagging for Konkani, and demonstrated their application to automatic sentence translation from Konkani to Hindi,” she explains.

At the moment, the public can access these NLP tools for Konkani free of cost on her website, www.annierajan.com. She is also ready to give these tools to any organisation, and plans to get in touch with the Central University for Indian Languages at Mysore or Technology Development for Indian Languages (TDIL).

On the benefits of this research, she says that while the volume and scope of digital transactions in regional languages are rapidly increasing all over the world, low-resource languages like Konkani are deprived of automatic NLP. She adds: “Moreover, the number of native Konkani speakers is getting reduced. Creating a comprehensive set of well-performing NLP tools for Konkani written in Devanagari, its official script, is a necessity for enhancing the presence and continuance of the language in this digital era. The language would get diminished otherwise.”

Major languages like Hindi, Tamil, Malayalam, and Marathi, she says, have made good progress in NLP-related applications. She says: “In comparison with this, low resource language like Konkani is in the infant stage and just making its presence on the map of NLP. Some works are done in Konkani but not in a consistent manner, this is because each researcher did a part of their work and did not proceed much deeper into the work they did. This is usually the problem with low resource language since this language does not have a sufficient annotated corpus for training the machine to learn.”

Their work, she says, will serve as a launching pad for researchers to develop more advanced resources and tools for the language and to advance automatic NLP of Konkani. Rajan adds that their contribution to the goal that Konkani speakers should benefit from accessing knowledge in their mother tongue conforms with what the HRD Ministry envisioned while articulating the National Mission on Education through ICT (NMEICT).

However, the challenges with regard to NLP of Konkani, she says, are unique due to the multiplicity of scripts and dialects, and the resulting lack of standardisation of both language and script. “More generally, the Konkani language, its evolution, and history have been affected by a variety of deep historical, regional, and cultural factors. Compared to the other languages with a similar number of native speakers, automation efforts in Konkani are delayed and slower to develop,” she says.

To showcase the various dialects of Konkani, she is also making a documentary titled ‘Documenting various dialects spoken in Konkani Language’. Presently pursuing her PhD on ‘Machine Translation Toolkit for Konkani’ in the Department of Computer Science at the University of Mumbai, she says that during her research she came across text material on various dialects of the language, but none in audio or video format. She therefore started recording the various dialects for her own understanding, an effort her PhD guide appreciated. “This recording has helped me to meet Konkani speakers from Goa, Karwar, Mangaluru, Kerala and they have given their opinion on the language. I am sure this will be valuable material for researchers to work on. All this material will be made available on YouTube for the public.”

Besides this, she has also created a YouTube channel to help users learn the Konkani terminology for words related to computer science. She says: “For interviews of government or private sector, there is a compulsion to know the terminologies in Konkani language, this can help students from the computer science field.”