Week from 06/21 to 06/27

Since evaluating on the QALD datasets was not sufficient, I was asked to evaluate the language detection models on entity labels. I used the DBpedia knowledge base to extract the entity labels.

How were the datasets created?

The entity labels are extracted using the SPARQL query below, and the results are downloaded in CSV format.

SELECT DISTINCT ?label WHERE {
    ?s rdfs:label ?label .
    FILTER(lang(?label) = 'en')
} LIMIT 10000
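For example, this query can be run per language against the public DBpedia SPARQL endpoint. Below is a minimal Python sketch using SPARQLWrapper; the endpoint URL and the helper name are assumptions for illustration, not the exact script used.

# Sketch: fetch entity labels for one language from the public DBpedia
# SPARQL endpoint. Endpoint URL and helper name are assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://dbpedia.org/sparql"

def fetch_labels(lang, limit=10000):
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(f"""
        SELECT DISTINCT ?label WHERE {{
            ?s rdfs:label ?label .
            FILTER(lang(?label) = '{lang}')
        }} LIMIT {limit}
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["label"]["value"] for b in results["results"]["bindings"]]

labels_en = fetch_labels('en')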

Note that this query also returned entity labels consisting only of numeric data. These labels were removed, and the resulting dataset contains text and language columns. 10,000 entity labels were gathered for each of English, German, and Spanish; for the other languages, the count was below 1,000.
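The cleaning step can be expressed, for example, with pandas. This is a minimal sketch; the column names and the letter test are my assumptions:

# Sketch: drop labels that consist only of numeric data and assemble the
# text/language dataset. Column names are assumptions for illustration.
import pandas as pd

def build_dataset(labels, lang):
    df = pd.DataFrame({"text": labels, "language": lang})
    # Keep labels that contain at least one letter, dropping purely
    # numeric ones such as "1984" or "3.14".
    has_letter = df["text"].str.contains(r"[^\W\d_]", regex=True)
    return df[has_letter].reset_index(drop=True)

Repeating this per language and writing each frame to CSV gives the per-language datasets.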

How was the evaluation done?

Each language detection tool is evaluated on each entity-label dataset. 
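Concretely, accuracy is the share of labels in a dataset whose predicted language matches the dataset's language. A minimal sketch of one run follows; detect stands in for each tool's prediction call (e.g. langdetect.detect), and the per-label timing is my assumption about how the runtime was measured:

# Sketch of one evaluation run: accuracy over a single entity-label dataset
# plus the mean prediction time in seconds. `detect` is a placeholder for a
# tool's prediction function.
import time

def evaluate(detect, dataset):
    correct, elapsed = 0, 0.0
    for _, row in dataset.iterrows():
        start = time.perf_counter()
        predicted = detect(row["text"])
        elapsed += time.perf_counter() - start
        correct += int(predicted == row["language"])
    return correct / len(dataset), elapsed / len(dataset)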

The results are summarized below. Accuracy is reported per language, with the number of entity labels in each dataset (#Resources) given in parentheses; runtime is in seconds.

Approach   | EN (10,000) | DE (10,000) | RU (83) | IT (243) | ES (10,000) | FR (782) | PT (227) | AVG  | Runtime (s)
-----------|-------------|-------------|---------|----------|-------------|----------|----------|------|------------
LangTag(S) | 0.21        | 0.91        | -       | 0.25     | 0.09        | 0.34     | 0.36     | 0.36 | 0.00162
LangTag(C) | 0.26        | 0.88        | 0.12    | 0.35     | 0.15        | 0.36     | 0.44     | 0.34 | 0.00186
langdetect | 0.40        | 0.43        | 0.57    | 0.63     | 0.31        | 0.59     | 0.43     | 0.48 | 0.01761
Tika       | 0.24        | 0.39        | 0.50    | 0.68     | 0.15        | 0.59     | 0.35     | 0.41 | 0.41428
openNLP    | 0.16        | 0.18        | 0.12    | 0.30     | 0.15        | 0.33     | 0.25     | 0.21 | 0.01125


Observations

We can observe that every model performs poorly on short text. The highest average accuracy, 0.48, is achieved by the langdetect tool.
