Week from 07/05 to 07/11

To evaluate the performance of the language detection models LangTag(S), LangTag(C), langdetect, openNLP, and Tika, we designed benchmarks covering different text lengths and domains: (1) Short, (2) Moderate, and (3) Long. So far we have evaluated all the models on the Short and Moderate text benchmarks. This week I was tasked with the evaluation on Long texts.

How were the datasets created?

For long texts, we used the dbo:abstract values of the top 10,000 resources returned by the DBpedia SPARQL endpoint. The SPARQL query below was used to extract the abstracts (shown here for English; the language tag in the filter was changed for each language).

```sparql
SELECT DISTINCT ?abstract WHERE {
  ?s dbo:abstract ?abstract .
  FILTER(lang(?abstract) = 'en')
} LIMIT 10000
```

I was able to extract 10,000 text resources for every language except Russian, which returned only 285 abstracts. From the gathered resources I created datasets with text and language columns, as sketched below.
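For reference, here is a minimal sketch of how such a dataset can be assembled, assuming the public DBpedia endpoint and the SPARQLWrapper and pandas Python libraries; the langs list and the long_texts_<lang>.csv file names are illustrative, not the exact script used.

```python
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd

ENDPOINT = "https://dbpedia.org/sparql"  # public DBpedia SPARQL endpoint

QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?abstract WHERE {{
  ?s dbo:abstract ?abstract .
  FILTER(lang(?abstract) = '{lang}')
}} LIMIT 10000
"""

def fetch_abstracts(lang):
    """Fetch up to 10,000 abstracts for one language and return a text/language DataFrame."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(QUERY.format(lang=lang))
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    texts = [b["abstract"]["value"] for b in bindings]
    return pd.DataFrame({"text": texts, "language": lang})

if __name__ == "__main__":
    for lang in ["en", "de", "ru", "it", "es", "fr"]:  # benchmark languages
        # illustrative file name for the per-language dataset
        fetch_abstracts(lang).to_csv(f"long_texts_{lang}.csv", index=False)
```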

How is the evaluation done?

Every language detection model is evaluated on each dataset. The results are shown below; a minimal sketch of how accuracy and runtime can be measured follows the table.

Accuracy per language (EN–FR), average accuracy (AVG), and runtime in seconds:

| Approach | EN | DE | RU | IT | ES | FR | AVG | Runtime (s) |
|------------|--------|--------|------|--------|--------|--------|------|-------------|
| #Resources | 10,000 | 10,000 | 285 | 10,000 | 10,000 | 10,000 | - | - |
| LangTag(S) | 0.96 | 0.99 | - | 0.99 | 0.99 | 0.99 | 0.98 | 0.00267 |
| LangTag(C) | 0.96 | 0.99 | 0.86 | 0.99 | 0.99 | 0.99 | 0.96 | 0.00287 |
| langdetect | 0.95 | 0.99 | 0.95 | 0.99 | 0.99 | 0.99 | 0.97 | 0.01657 |
| Tika | 0.95 | 0.99 | 0.95 | 0.99 | 0.98 | 0.99 | 0.97 | 0.43918 |
| openNLP | 0.79 | 0.81 | 0.13 | 0.76 | 0.78 | 0.71 | 0.66 | 0.01427 |
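For reference, here is a minimal sketch of the kind of evaluation loop that produces accuracy and runtime figures, using the langdetect Python library as an example detector. The CSV file names and the per-text runtime averaging are assumptions about the harness, not the exact code used for all five tools.

```python
import time
import pandas as pd
from langdetect import detect  # example detector; the same loop applies to any tool

def evaluate(csv_path, expected_lang):
    """Return (accuracy, average runtime per text in seconds) for one dataset."""
    df = pd.read_csv(csv_path)
    correct, total_time = 0, 0.0
    for text in df["text"]:
        start = time.perf_counter()
        predicted = detect(text)                   # ISO 639-1 code, e.g. 'en'
        total_time += time.perf_counter() - start
        if predicted == expected_lang:
            correct += 1
    return correct / len(df), total_time / len(df)

if __name__ == "__main__":
    for lang in ["en", "de", "ru", "it", "es", "fr"]:
        # illustrative file names matching the dataset sketch above
        acc, runtime = evaluate(f"long_texts_{lang}.csv", lang)
        print(f"{lang}: accuracy={acc:.2f}, runtime={runtime:.5f}s")
```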

Observations

We can observe that all tools except openNLP achieve high accuracy on long texts (average accuracy of 0.96 or above, versus 0.66 for openNLP). The LangTag models also have clearly lower runtimes than the other tools.
