Week from 08/02 to 08/08

My mentor pointed out that the reason for our model's better performance may be that we are evaluating fewer languages.

| Approach   | #Languages |
|------------|------------|
| LangTag(S) | 10         |
| LangTag(C) | 12         |
| langdetect | 55         |
| Tika       | 18         |
| openNLP    | 103        |

Since our model supports only 12 languages while openNLP supports 103, this could be the reason for our model's better performance.

How were the other tools evaluated with a restricted set of languages?

My mentor suggested that we can obtain the desired result by taking the vector of probabilities each model uses to predict the language and restricting it to our supported languages. I was able to get the probability vectors of the openNLP and langdetect models, but this was not possible for Tika. I re-evaluated openNLP and langdetect on the QALD datasets, entity labels, and abstracts, and updated the results tables in the repo.
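The restriction step can be sketched as follows. This is a minimal illustration rather than the actual evaluation code: it assumes we already have a tool's probability vector as a mapping from language code to probability (the values and the `supported` set below are hypothetical), and it simply picks the most probable language among those our model supports.

```python
# Hypothetical probability vector returned by a language-detection tool
# (e.g. langdetect or openNLP), mapping language codes to probabilities.
prob_vector = {
    "en": 0.40, "de": 0.25, "nl": 0.20, "af": 0.10, "fr": 0.05,
}

# Illustrative subset standing in for the 12 languages our model supports.
supported = {"en", "de", "fr"}

def predict_restricted(probs, allowed):
    """Return the most probable language among the allowed set only."""
    restricted = {lang: p for lang, p in probs.items() if lang in allowed}
    if not restricted:
        return None  # the tool predicted none of the allowed languages
    return max(restricted, key=restricted.get)

print(predict_restricted(prob_vector, supported))  # -> en
```

Restricting the probability vector this way lets us compare all tools over the same label space, instead of penalizing our model for supporting fewer languages.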

Observations

We observed that both openNLP and langdetect performed much better when the languages were restricted.

