Week from 05/23 to 05/29

June 02, 2020

Start of the week

This week I was assigned to develop a Language Detector.

Why?

To extend the NSpM framework to support multiple languages, we must first identify the source language of a question. That calls for a language detector that works on short sentences, since the questions are much shorter than full documents.

How?

Creating the dataset

The dataset was built from the QLAD-7 datasets, which were obtained through IRbench. Since those files contain additional information, I extracted only the sentences and their languages and created a new dataset from them.
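A minimal sketch of that extraction step is shown here. It assumes the source files use a JSON layout with a top-level `questions` list whose entries carry a `question` list of `{"language", "string"}` objects. That layout, the function name `extract_pairs`, and the sample data are assumptions for illustration, not the actual script used.

```python
import json

def extract_pairs(dataset):
    """Return (sentence, language) pairs from a question dataset dict.

    Assumed layout: {"questions": [{"question": [{"language": ..., "string": ...}]}]}
    """
    pairs = []
    for entry in dataset.get("questions", []):
        for variant in entry.get("question", []):
            text = variant.get("string", "").strip()
            lang = variant.get("language", "").strip()
            # Skip variants that lack either the sentence or the language tag.
            if text and lang:
                pairs.append((text, lang))
    return pairs

# Hypothetical two-variant sample in the assumed layout.
sample = json.loads("""
{"questions": [{"id": "1", "question": [
    {"language": "en", "string": "How high is the Zugspitze?"},
    {"language": "de", "string": "Wie hoch ist die Zugspitze?"},
    {"language": "it", "string": ""}
]}]}
""")

pairs = extract_pairs(sample)
for sentence, lang in pairs:
    print(lang, sentence)
```

Everything except the sentence and its language tag is dropped, which is all a language detector needs for training and evaluation.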

Testing existing tools

First, I tried out the Apache OpenNLP language detector tool, which claims to detect 103 languages. However, it requires at least two sentences in the same language. I therefore evaluated it on the QLAD-7 dataset for English, Spanish, German, Italian, French, Dutch, and Romanian. The accuracy score was 0.8593. Given that requirement, I assume the low score is a result of the shorter texts in the QLAD-7 dataset.
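For reference, the accuracy score is simply the share of sentences whose predicted language matches the labelled one. The sketch below computes it overall and per language for any detector passed in as a function. The `toy_detect` function and the four sample sentences are hypothetical stand-ins, not OpenNLP and not the real data.

```python
from collections import defaultdict

def evaluate(detect, samples):
    """Return (overall accuracy, per-language accuracy) for a detector function."""
    correct = 0
    per_lang = defaultdict(lambda: [0, 0])  # language -> [correct, total]
    for text, gold in samples:
        hit = detect(text) == gold
        correct += hit
        per_lang[gold][0] += hit
        per_lang[gold][1] += 1
    overall = correct / len(samples) if samples else 0.0
    return overall, {lang: c / n for lang, (c, n) in per_lang.items()}

# Hypothetical stand-in detector: it only knows English and Spanish.
def toy_detect(text):
    return "es" if text.startswith("¿") else "en"

samples = [
    ("Who wrote Dracula?", "en"),
    ("¿Quién escribió Drácula?", "es"),
    ("Wer schrieb Dracula?", "de"),
    ("Where is Paris?", "en"),
]

overall, by_language = evaluate(toy_detect, samples)
print(overall)      # 0.75
print(by_language)  # per-language breakdown
```

A per-language breakdown like this helps show whether a low overall score comes from one or two languages or is spread evenly across all of them.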

Building a Language detector on my own

Since this is a classification problem, I decided to build a simple Naive Bayes classifier. I trained the model on the QLAD-7 train dataset and tested it on the QLAD-8 test dataset. The model accuracy on the test dataset was 0.9883. The progress is tracked in a GitHub repo. Although the training dataset contained sentences in English, Spanish, German, Italian, French, Dutch, and Romanian, the test dataset contained only 4 languages. A confusion matrix was created from the model predictions.
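To illustrate the idea, here is a small, self-contained sketch of a multinomial Naive Bayes language classifier. The post does not say which features the real model uses, so character trigrams with add-one smoothing are an assumption here. The class name, the toy training sentences, and the `confusion_matrix` helper are likewise hypothetical and hand-rolled rather than taken from the repo.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=3):
    """Split lowercased, space-padded text into overlapping character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class NaiveBayesLangDetector:
    def __init__(self, n=3, alpha=1.0):
        self.n = n          # n-gram size (assumed, not from the post)
        self.alpha = alpha  # additive smoothing constant

    def fit(self, samples):
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.docs = Counter()               # language -> number of sentences
        for text, lang in samples:
            self.counts[lang].update(char_ngrams(text, self.n))
            self.docs[lang] += 1
        self.vocab = {g for c in self.counts.values() for g in c}
        self.totals = {lang: sum(c.values()) for lang, c in self.counts.items()}
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        n_docs = sum(self.docs.values())
        best, best_score = None, -math.inf
        for lang in self.docs:
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.docs[lang] / n_docs)
            denom = self.totals[lang] + self.alpha * len(self.vocab)
            for g in grams:
                score += math.log((self.counts[lang][g] + self.alpha) / denom)
            if score > best_score:
                best, best_score = lang, score
        return best

def confusion_matrix(model, samples):
    """Count (true language, predicted language) pairs."""
    return Counter((gold, model.predict(text)) for text, gold in samples)

# Toy training data, for illustration only.
train = [
    ("Who is the president of France?", "en"),
    ("Where was the author born?", "en"),
    ("Wer ist der Präsident von Frankreich?", "de"),
    ("Wo wurde der Autor geboren?", "de"),
    ("¿Quién es el presidente de Francia?", "es"),
    ("¿Dónde nació el autor?", "es"),
]
test = [("Who is the author?", "en"), ("Wer ist der Autor?", "de")]

model = NaiveBayesLangDetector().fit(train)
matrix = confusion_matrix(model, test)
for (gold, predicted), count in sorted(matrix.items()):
    print(gold, predicted, count)
```

Character n-grams tend to suit short questions because even a handful of words yields many features. A test set that covers only some of the training languages will leave the rows for the missing languages empty in the confusion matrix.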

End of the week

At the end of the week, I discussed my results with my mentors. They advised me to research language detection further and to evaluate more existing tools.
