My GSoC blog

Posts

Showing posts from July, 2020

Week from 06/07 to 06/13

July 13, 2020

This week my mentor and I started writing our research paper, "Evaluating Language Identification Methods over Linked Data". I was assigned the task of evaluating the language detection tools LangTag(s), LangTag(C), langdetect, openNLP and Tika. LangTag(s) is the model we trained on the qald-7 training dataset, and LangTag(C) is the model we trained on all the qald datasets from qald-3 to qald-9.

How are the evaluations done?

All the language detection tools are evaluated on both the keywords and the full sentences of the qald test datasets. First, I created .csv files from the qald datasets that contain only the question, keywords and language columns. We then evaluated the tools on the full qald datasets and also on the English, French and German questions separately, recording the accuracy of each tool.

Observations

We observed that for full sentences, our models perform better than the other three tools. But all the models p...
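To make the evaluation procedure concrete, here is a minimal sketch of how accuracy could be computed overall and per language from such a .csv file. The column names question, keywords and language come from the post; the helper names, the sample rows and the toy_detect function are hypothetical placeholders for illustration, not our actual evaluation script or any of the tools listed above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the layout described in the post:
# one row per question with question, keywords and language columns.
SAMPLE_CSV = "\n".join([
    "question,keywords,language",
    'Who is the mayor of Berlin?,"mayor, Berlin",en',
    'Wer ist der Bürgermeister von Berlin?,"Bürgermeister, Berlin",de',
    'Qui est le maire de Berlin ?,"maire, Berlin",fr',
])


def load_rows(handle):
    """Read the evaluation rows from an open CSV file handle."""
    return list(csv.DictReader(handle))


def accuracy_by_language(rows, detect, text_column):
    """Return accuracy per gold language plus an overall "all" entry.

    detect is any callable mapping a string to a language code;
    text_column is "question" (full sentence) or "keywords".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        gold = row["language"]
        hit = detect(row[text_column]) == gold
        for key in (gold, "all"):
            total[key] += 1
            correct[key] += int(hit)
    return {key: correct[key] / total[key] for key in total}


def toy_detect(text):
    """Placeholder detector (assumption): a few stop words, else English."""
    words = set(text.lower().replace("?", " ").replace(",", " ").split())
    if words & {"wer", "ist", "der"}:
        return "de"
    if words & {"qui", "est", "le"}:
        return "fr"
    return "en"


rows = load_rows(io.StringIO(SAMPLE_CSV))
for column in ("question", "keywords"):
    print(column, accuracy_by_language(rows, toy_detect, column))
```

Swapping toy_detect for a wrapper around a real tool keeps the rest unchanged, which is what makes it easy to compare several detectors on the same rows.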

Week from 05/30 to 06/06

July 01, 2020

This week I started evaluating existing language detection tools such as openNLP, Tika and langdetect. These tools are evaluated on the qald test datasets.

Why?

Our approach to extending the NSpM framework to support multiple languages first requires identifying the source language. For that, we developed a language detector last week that can detect the language of short sentences. The evaluation compares how our model and the existing tools perform on the qald datasets, so that we can choose the better tool for our project.

How?

First I extracted the qald test and training datasets using IRbench and then created csv files from them. The qald training datasets cover these languages:

QALD 3: English, German, French, Spanish, Italian and Dutch.
QALD 4: English, German, French, Spanish, Italian, Dutch and Romanian.
QALD 5: English, German, French, Spanish, Italian, Dutch and Romanian.
QALD 6: English, German, French, Spanish, Italian, Dutch, Romanian and Persian.
QALD 7: English, ...
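As an illustration of the csv creation step, the sketch below flattens a QALD-style JSON document into rows with question, keywords and language columns. It is not the IRbench extraction itself: the JSON layout (a "questions" list whose entries hold a "question" list with "language", "string" and "keywords" fields) is an assumption modelled on QALD-style files, and the function names and sample data are hypothetical.

```python
import csv
import io
import json

# Hypothetical QALD-style input (assumed layout, made-up content).
SAMPLE_JSON = json.dumps({
    "questions": [
        {
            "id": "1",
            "question": [
                {"language": "en",
                 "string": "Who is the mayor of Berlin?",
                 "keywords": "mayor, Berlin"},
                {"language": "de",
                 "string": "Wer ist der Bürgermeister von Berlin?",
                 "keywords": "Bürgermeister, Berlin"},
            ],
        }
    ]
})

FIELDNAMES = ["question", "keywords", "language"]


def qald_json_to_rows(data):
    """Flatten every language variant of every question into one row."""
    rows = []
    for item in data.get("questions", []):
        for variant in item.get("question", []):
            rows.append({
                "question": variant.get("string", ""),
                "keywords": variant.get("keywords", ""),
                "language": variant.get("language", ""),
            })
    return rows


def write_csv(rows, handle):
    """Write the rows with a header line to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


rows = qald_json_to_rows(json.loads(SAMPLE_JSON))
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

Keeping one row per language variant means the same file can later be filtered by the language column for the per-language evaluation.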