Week from 05/23 to 05/29

Start of the week

This week I was assigned to develop a Language Detector. 

Why?

To extend the NSpM framework to support multiple languages, we must first identify the source language of a question. That requires a language detector that works on short sentences, since the questions are much shorter than document-length texts.

How?

Creating the dataset

The dataset was built from the QALD-7 datasets, which were obtained using IRbench here. Since the original files contain additional information, I extracted only the sentences and their languages into a new dataset.
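The extraction step can be sketched roughly as follows. This is only an illustration, not the script I actually used: it assumes each entry has a "question" list whose items carry "language" and "string" fields, and the `extract_pairs` helper and the inline sample are hypothetical.

```python
import json


def extract_pairs(dataset):
    """Return (sentence, language) pairs from a benchmark-style dict.

    Assumed layout (hypothetical): {"questions": [{"question":
    [{"language": "en", "string": "..."}, ...]}, ...]}.
    Everything else in each entry is ignored.
    """
    pairs = []
    for entry in dataset.get("questions", []):
        for variant in entry.get("question", []):
            text = variant.get("string", "").strip()
            lang = variant.get("language", "").strip()
            # Skip variants with a missing sentence or language tag.
            if text and lang:
                pairs.append((text, lang))
    return pairs


# Tiny made-up sample in the assumed layout.
sample = json.loads("""
{
  "questions": [
    {
      "id": "1",
      "question": [
        {"language": "en", "string": "Who wrote the book?"},
        {"language": "de", "string": "Wer hat das Buch geschrieben?"},
        {"language": "it", "string": ""}
      ],
      "query": {"sparql": "SELECT ..."}
    }
  ]
}
""")

pairs = extract_pairs(sample)
for sentence, language in pairs:
    print(language, sentence)
```

Empty strings are dropped because some entries may not have a translation for every language.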

Testing existing tools

First, I tried the Apache OpenNLP language detector, which claims to detect 103 languages. However, its authors note that it needs at least two sentences in the same language to work well. I evaluated it on the QALD-7 dataset for English, Spanish, German, Italian, French, Dutch, and Romanian. The accuracy was 0.8593. Given that note, I assume the low score results from the short texts in the QALD-7 dataset.

Building a Language detector on my own

Since this is a classification problem, I decided to build a simple Naive Bayes classifier. I trained it on the QALD-7 training set and tested it on the QALD-8 test set, where it reached an accuracy of 0.9883. The progress is in this github repo here. Although the training set contains sentences in English, Spanish, German, Italian, French, Dutch, and Romanian, the test set contains only 4 languages. A confusion matrix was created from the model's predictions.
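To show the idea behind such a classifier, here is a minimal sketch of a multinomial Naive Bayes detector. It is not the code from the repo: the character-trigram features, the Laplace smoothing value, the class name `NaiveBayesLanguageDetector`, and the toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split a sentence into overlapping character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLanguageDetector:
    """Multinomial Naive Bayes over character n-grams (illustrative)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha                      # Laplace smoothing
        self.ngram_counts = defaultdict(Counter)
        self.sentence_counts = Counter()
        self.vocab = set()

    def fit(self, sentences, labels):
        for sentence, label in zip(sentences, labels):
            grams = char_ngrams(sentence, self.n)
            self.ngram_counts[label].update(grams)
            self.sentence_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, sentence):
        grams = char_ngrams(sentence, self.n)
        total_sentences = sum(self.sentence_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            total = sum(counts.values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.sentence_counts[label] / total_sentences)
            for gram in grams:
                score += math.log(
                    (counts[gram] + self.alpha)
                    / (total + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, sentences, labels):
    """Fraction of sentences whose predicted language matches the label."""
    hits = sum(model.predict(s) == l for s, l in zip(sentences, labels))
    return hits / len(labels)


# Toy data, made up for this example.
train_sentences = [
    "What is the capital of France?", "Who wrote the book?",
    "Where is the river?",
    "Was ist die Hauptstadt von Frankreich?",
    "Wer hat das Buch geschrieben?", "Wo ist der Fluss?",
    "Cual es la capital de Francia?", "Quien escribio el libro?",
    "Donde esta el rio?",
]
train_labels = ["en"] * 3 + ["de"] * 3 + ["es"] * 3

detector = NaiveBayesLanguageDetector().fit(train_sentences, train_labels)
print(detector.predict("Who is the author of the book?"))
print(detector.predict("Wer ist der Autor von dem Buch?"))
```

Character n-grams are a common choice for short inputs because even a single question yields several dozen features, whereas word-level features would be very sparse.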



End of the week

At the end of the week, I discussed my results with my mentors. They advised me to research language detection further and to evaluate more existing tools.
