Week from 07/05 to 07/11
To evaluate the performance of the language detection models LangTag(s), LangTag(C), langdetect, openNLP, and Tika, we designed benchmarks covering different text lengths and domains: (1) Short, (2) Moderate, and (3) Long. So far we have evaluated all the models on the Short and Moderate text benchmarks. This week I was tasked with evaluating them on Long texts.
How were the datasets created?
For long texts, we used the dbo:abstract values of the top 10,000 resources returned by the DBpedia SPARQL endpoint. The SPARQL query below was used to extract the abstracts (shown here with the language filter set to English).
SELECT DISTINCT ?abstract WHERE { ?s dbo:abstract ?abstract . FILTER(lang(?abstract) = 'en') } LIMIT 10000
I was able to extract 10,000 text resources for each language except Russian. From the gathered resources, I created datasets with a text column and a language column.
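To make the dataset construction step concrete, here is a minimal sketch of how the abstracts could be fetched and assembled into text/language datasets. It assumes the SPARQLWrapper and pandas libraries, and the language codes and output file name are illustrative, not necessarily the ones used in the project.

from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd

ENDPOINT = "https://dbpedia.org/sparql"
LANGS = ["en", "de", "fr", "es"]  # hypothetical subset of the benchmark languages

QUERY = """
SELECT DISTINCT ?abstract WHERE {{
  ?s dbo:abstract ?abstract .
  FILTER(lang(?abstract) = '{lang}')
}} LIMIT 10000
"""

def fetch_abstracts(lang):
    """Query DBpedia for up to 10,000 abstracts in the given language."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(QUERY.format(lang=lang))
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["abstract"]["value"] for b in results["results"]["bindings"]]

rows = []
for lang in LANGS:
    for text in fetch_abstracts(lang):
        rows.append({"text": text, "language": lang})

# Dataset with a text column and a language column, as described above.
pd.DataFrame(rows).to_csv("long_text_benchmark.csv", index=False)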
How is the evaluation done?
Every language detection model is evaluated on each dataset. Here are the results:
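The evaluation loop itself is simple: run each detector over every row of a dataset, compare the predicted language tag with the gold one, and record accuracy and total runtime. The sketch below illustrates this, assuming the CSV format described above and using langdetect as one example detector; the LangTag, openNLP, and Tika models would be wrapped behind the same detect(text) interface. Function and file names here are illustrative assumptions.

import time
import pandas as pd
from langdetect import detect  # one of the evaluated tools

def evaluate(detect_fn, dataset_path):
    """Return (accuracy, total runtime in seconds) of a detector on a dataset."""
    df = pd.read_csv(dataset_path)
    correct = 0
    start = time.time()
    for text, lang in zip(df["text"], df["language"]):
        try:
            if detect_fn(text) == lang:
                correct += 1
        except Exception:
            pass  # count failed detections as incorrect
    runtime = time.time() - start
    return correct / len(df), runtime

accuracy, runtime = evaluate(detect, "long_text_benchmark.csv")
print(f"langdetect: accuracy={accuracy:.4f}, runtime={runtime:.2f}s")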
Observations
We can observe that every language detection tool performs well in terms of accuracy. It is also clear that the runtimes of the LangTag models are lower than those of the other tools.