Week from 05/30 to 06/06

This week I started evaluating existing language detection tools such as OpenNLP, Tika and langdetect. The tools were evaluated on the QALD test datasets.

Why? 

Our approach to extending the NSpM framework to support multiple languages starts with identifying the source language. For that, we developed a language detector last week that can detect the language of short sentences. The goal this week was to compare how our model and the existing language detection tools perform on the QALD datasets, and to choose the better tool for our project.

How?

First, I extracted the QALD test and training datasets using IRbench and created CSV files from them. The QALD training datasets covered these languages:

QALD 3: English, German, French, Spanish, Italian and Dutch.

QALD 4: English, German, French, Spanish, Italian, Dutch and Romanian.

QALD 5: English, German, French, Spanish, Italian, Dutch and Romanian.

QALD 6: English, German, French, Spanish, Italian, Dutch, Romanian and Persian.

QALD 7: English, German, French, Spanish, Brazilian Portuguese, Italian, Dutch, Hindi, Romanian and Persian.

QALD 8: English, German, French, Spanish, Italian, Dutch, Hindi, Romanian and Persian.

QALD 9: English, German, French, Spanish, Brazilian Portuguese, Italian, Dutch, Hindi, Romanian, Persian, Portuguese and Russian.

The QALD test datasets covered these languages:

QALD 3: English, German, French, Spanish, Italian and Dutch.

QALD 4: English, German, French, Spanish, Italian, Dutch and Romanian.

QALD 5: English, German, French, Spanish, Italian, Dutch and Romanian.

QALD 6: English, German, French, Spanish, Italian, Dutch, Romanian and Persian.

QALD 7: English, German, French, Spanish, Brazilian Portuguese, Italian, Dutch, Hindi, Romanian and Persian.

QALD 8: English.

QALD 9: English, German, French, Spanish, Brazilian Portuguese, Italian, Dutch, Hindi, Romanian, Persian, Portuguese and Russian.
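To make the extraction step more concrete, here is a rough Python sketch of how multilingual questions could be flattened into CSV rows. It assumes the questions are available in the QALD JSON layout (a "questions" list where each entry has a "question" list of "language"/"string" pairs); the function names and the CSV columns are made up for illustration and are not the exact script or the IRbench output I used.

```python
import csv
import io


def qald_to_rows(qald):
    """Flatten a QALD-style dict into (id, language, text) rows."""
    rows = []
    for item in qald.get("questions", []):
        for variant in item.get("question", []):
            text = (variant.get("string") or "").strip()
            if text:  # skip missing or empty translations
                rows.append((str(item.get("id")), variant.get("language"), text))
    return rows


def write_csv(rows, handle):
    """Write the rows with a header line to any file-like object."""
    writer = csv.writer(handle)
    writer.writerow(["id", "language", "text"])  # hypothetical column names
    writer.writerows(rows)


# Tiny hand-written sample in the assumed layout (not real QALD data).
sample = {
    "questions": [
        {
            "id": "1",
            "question": [
                {"language": "en", "string": "Who is the mayor of Berlin?"},
                {"language": "de", "string": "Wer ist der Bürgermeister von Berlin?"},
                {"language": "fr", "string": ""},
            ],
        }
    ]
}

rows = qald_to_rows(sample)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

Skipping empty strings matters here because some QALD entries do not provide a translation for every language, and an empty row would otherwise count against every detector.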


This is how OpenNLP, Tika and langdetect performed on the QALD 7 test dataset, next to our own model:

LangTag(s) : 0.98
OpenNLP    : 0.90
langdetect : 0.95
Tika       : 0.91

LangTag(s) is the model we trained on the qald-7-training dataset.
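A comparison like this can be computed with a small evaluation loop. The sketch below assumes the score is plain accuracy (the fraction of questions whose detected language matches the label) and treats every tool as a function from text to a language code. The `evaluate` and `toy_detect` names are hypothetical, and the toy detector only stands in for a real tool so that the example runs without extra dependencies.

```python
from collections import defaultdict


def evaluate(detect, samples):
    """Return (overall accuracy, per-language accuracy) for a detector.

    detect:  callable mapping a text to a language code such as "en"
    samples: iterable of (expected_language, text) pairs
    """
    correct = 0
    total = 0
    per_lang = defaultdict(lambda: [0, 0])  # language -> [hits, count]
    for expected, text in samples:
        try:
            predicted = detect(text)
        except Exception:
            predicted = None  # a detector failure counts as a miss
        hit = predicted == expected
        correct += hit
        total += 1
        per_lang[expected][0] += hit
        per_lang[expected][1] += 1
    overall = correct / total if total else 0.0
    return overall, {lang: hits / count for lang, (hits, count) in per_lang.items()}


def toy_detect(text):
    """Stand-in detector: guesses from the first word, defaults to English."""
    first_word = text.split()[0].lower()
    return {"who": "en", "wer": "de", "qui": "fr"}.get(first_word, "en")


# Hand-written examples, not taken from the QALD datasets.
samples = [
    ("en", "Who is the mayor of Berlin?"),
    ("de", "Wer ist der Bürgermeister von Berlin?"),
    ("fr", "Qui est le maire de Berlin ?"),
    ("es", "¿Quién es el alcalde de Berlín?"),
]

overall, per_language = evaluate(toy_detect, samples)
print(overall)
print(per_language)
```

With langdetect installed, its `detect` function could be passed in place of `toy_detect`, as long as the labels in the CSV use the same language codes that the tool returns. OpenNLP and Tika are Java libraries, so they would need a thin wrapper to fit the same interface. Reporting accuracy per language as well helps to see which languages a tool confuses.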

Problems encountered?
  • Incomplete test datasets (the QALD 8 test dataset contained only English).

End of the week

After discussing with the mentor and the co-mentors, we decided to write a research paper on LangTag.
