Week from 07/05 to 07/11
To evaluate the performance of the language detection models LangTag(s), LangTag(C), langdetect, openNLP, and Tika, we designed benchmarks covering different text lengths and domains: (1) Short, (2) Moderate, and (3) Long. So far we have evaluated all the models on the Short and Moderate text benchmarks. This week I was tasked with evaluating them on Long texts.
How were the datasets created?
For long texts, we used the dbo:abstract values of the top 10,000 resources returned by the DBpedia SPARQL endpoint. The SPARQL query below was used to extract the abstracts (shown here with the language filter set to English).
SELECT DISTINCT ?abstract WHERE { ?s dbo:abstract ?abstract . FILTER(lang(?abstract) = 'en') } LIMIT 10000
I was able to extract 10,000 text resources for each language except Russian. From the gathered resources, I created datasets with a text column and a language column.
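To make the dataset construction step concrete, here is a minimal sketch of how the abstracts could be fetched and assembled into text/language datasets. It assumes the SPARQLWrapper and pandas libraries, and the language codes and output file name are illustrative, not necessarily the ones used in the project.

from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd

ENDPOINT = "https://dbpedia.org/sparql"
LANGS = ["en", "de", "fr", "es"]  # hypothetical subset of the benchmark languages

QUERY = """
SELECT DISTINCT ?abstract WHERE {{
  ?s dbo:abstract ?abstract .
  FILTER(lang(?abstract) = '{lang}')
}} LIMIT 10000
"""

def fetch_abstracts(lang):
    """Query DBpedia for up to 10,000 abstracts in the given language."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(QUERY.format(lang=lang))
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["abstract"]["value"] for b in results["results"]["bindings"]]

rows = []
for lang in LANGS:
    for text in fetch_abstracts(lang):
        rows.append({"text": text, "language": lang})

# Dataset with a text column and a language column, as described above.
pd.DataFrame(rows).to_csv("long_text_benchmark.csv", index=False)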
How is the evaluation done?
Every language detection model is evaluated on each dataset. Here are the results:
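The evaluation loop itself is simple: run each detector over every row of a dataset, compare the predicted language tag with the gold one, and record accuracy and total runtime. The sketch below illustrates this, assuming the CSV format described above and using langdetect as one example detector; the LangTag, openNLP, and Tika models would be wrapped behind the same detect(text) interface. Function and file names here are illustrative assumptions.

import time
import pandas as pd
from langdetect import detect  # one of the evaluated tools

def evaluate(detect_fn, dataset_path):
    """Return (accuracy, total runtime in seconds) of a detector on a dataset."""
    df = pd.read_csv(dataset_path)
    correct = 0
    start = time.time()
    for text, lang in zip(df["text"], df["language"]):
        try:
            if detect_fn(text) == lang:
                correct += 1
        except Exception:
            pass  # count failed detections as incorrect
    runtime = time.time() - start
    return correct / len(df), runtime

accuracy, runtime = evaluate(detect, "long_text_benchmark.csv")
print(f"langdetect: accuracy={accuracy:.4f}, runtime={runtime:.2f}s")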
Observations
We can observe that every language detection tool performs well in terms of accuracy. It is also clear that the runtimes of the LangTag models are lower than those of the other tools.