Week from 05/23 to 05/29
Start of the week

This week I was assigned to develop a language detector.

Why?

The approach for extending the NSpM framework to support multiple languages starts with identifying the source language. For that, we need a language detector that works on short sentences, since the questions are much shorter than full documents.

How?

Creating the dataset

The dataset was created from the QALD-7 datasets, which were obtained using IRbench. Since those datasets contain additional information, I extracted only the sentences and their languages and built a new dataset from them.

Testing existing tools

First, I tried the Apache OpenNLP language detector, which claims to detect 103 languages. However, it requires at least two sentences in the same language. Therefore I decided to evaluate it on the QALD-7 dataset for English, Spanish, German, Italian, French, Dutch, and Romanian. The accuracy score was 0...
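
To illustrate the dataset-creation and evaluation steps described above, here is a minimal Python sketch. It is not the code used in the project. It assumes the input follows the QALD-style JSON layout, where each entry in `questions` carries a `question` list of `{"language", "string"}` objects, and that languages are given as two-letter codes. The names `SAMPLE`, `extract_pairs`, and `accuracy` are hypothetical, and the `detect` argument stands in for whichever detector is being tested.

```python
# Assumed QALD-style layout; the sample questions are made up for illustration.
SAMPLE = {
    "questions": [
        {
            "id": "1",
            "question": [
                {"language": "en", "string": "Who wrote Hamlet?"},
                {"language": "de", "string": "Wer schrieb Hamlet?"},
                {"language": "hi", "string": ""},
            ],
        },
        {
            "id": "2",
            "question": [
                {"language": "en", "string": "Where is Berlin?"},
            ],
        },
    ]
}

# The seven languages evaluated in this post, as two-letter codes (assumption).
LANGUAGES = {"en", "es", "de", "it", "fr", "nl", "ro"}


def extract_pairs(qald, languages):
    """Return (sentence, language) pairs, dropping all other fields."""
    pairs = []
    for entry in qald.get("questions", []):
        for variant in entry.get("question", []):
            lang = variant.get("language")
            text = (variant.get("string") or "").strip()
            # Skip empty strings and languages outside the evaluated set.
            if lang in languages and text:
                pairs.append((text, lang))
    return pairs


def accuracy(pairs, detect):
    """Fraction of sentences for which detect(sentence) equals the label."""
    if not pairs:
        return 0.0
    correct = sum(1 for text, lang in pairs if detect(text) == lang)
    return correct / len(pairs)
```

Any detector can be plugged in as `detect`, as long as it returns codes in the same scheme as the labels; a tool that reports a different code scheme would need a small mapping step first.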