Week from 06/21 to 06/27

Since evaluating on the QALD datasets was not sufficient, I was asked to evaluate the language detection models on entity labels. I used the DBpedia knowledge base to extract the entity labels.

How are the datasets created?

The entity labels are extracted using the SPARQL query below, and the results are downloaded in CSV format.

```sparql
SELECT DISTINCT ?label WHERE {
  ?s rdfs:label ?label .
  FILTER(lang(?label) = 'en')
} LIMIT 10000
```
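As a side note, the same query can also be run programmatically instead of downloading the CSV by hand. Below is a minimal Python sketch using the SPARQLWrapper library against the public DBpedia endpoint; the endpoint URL and the JSON post-processing are my own additions, not part of the original workflow.

```python
# Minimal sketch: fetch English entity labels from DBpedia.
# Assumption: the public endpoint https://dbpedia.org/sparql is used;
# the original workflow downloaded the results as CSV instead.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT DISTINCT ?label WHERE {
        ?s rdfs:label ?label .
        FILTER(lang(?label) = 'en')
    } LIMIT 10000
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
labels = [b["label"]["value"] for b in results["results"]["bindings"]]
print(len(labels), "labels fetched")
```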

Note that this query also returns entity labels that consist only of numeric data. These labels are removed, and the resulting dataset contains text and language columns. 10,000 entity labels were gathered for English, German, and Spanish; for the other languages, the count was less than 1,000.
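For illustration, the numeric-only labels can be filtered out in a few lines. The sketch below uses pandas; the file name, column names, and the exact definition of "numeric" are hypothetical.

```python
import pandas as pd

# Sketch: drop labels that consist only of numeric data and build the
# text/language dataset. File and column names are hypothetical.
df = pd.read_csv("labels_en.csv")  # one column: label
df = df[~df["label"].str.fullmatch(r"[\d\s.,\-]+")]
df = df.rename(columns={"label": "text"})
df["language"] = "en"
df.to_csv("dataset_en.csv", index=False)
```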

How is the evaluation done?

Each language detection tool is evaluated on each entity-label dataset. 
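The exact evaluation script is not shown here, but the following sketch illustrates the setup for one tool (langdetect) on one dataset. The accuracy definition (fraction of labels whose detected language matches the language column) and the per-label runtime averaging are my assumptions about how the numbers below were produced.

```python
import time
import pandas as pd
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

# Sketch: evaluate one language detection tool on one entity-label dataset.
# Assumed columns: text, language (see the dataset construction above).
df = pd.read_csv("dataset_en.csv")

correct = 0
start = time.time()
for _, row in df.iterrows():
    try:
        if detect(row["text"]) == row["language"]:
            correct += 1
    except LangDetectException:
        pass  # labels with no detectable features count as incorrect

accuracy = correct / len(df)
runtime_per_label = (time.time() - start) / len(df)
print(f"accuracy={accuracy:.2f}, runtime={runtime_per_label:.5f}s per label")
```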

The table below reports each tool's accuracy per language (language columns and AVG) and its runtime in seconds (last column). #Resources gives the number of entity labels per language dataset.

| Approach | EN | DE | RU | IT | ES | FR | PT | AVG | Runtime (s) |
|---|---|---|---|---|---|---|---|---|---|
| #Resources | 10,000 | 10,000 | 83 | 243 | 10,000 | 782 | 227 | | |
| LangTag(S) | 0.21 | 0.91 | - | 0.25 | 0.09 | 0.34 | 0.36 | 0.36 | 0.00162 |
| LangTag(C) | 0.26 | 0.88 | 0.12 | 0.35 | 0.15 | 0.36 | 0.44 | 0.34 | 0.00186 |
| langdetect | 0.40 | 0.43 | 0.57 | 0.63 | 0.31 | 0.59 | 0.43 | 0.48 | 0.01761 |
| Tika | 0.24 | 0.39 | 0.50 | 0.68 | 0.15 | 0.59 | 0.35 | 0.41 | 0.41428 |
| openNLP | 0.16 | 0.18 | 0.12 | 0.30 | 0.15 | 0.33 | 0.25 | 0.21 | 0.01125 |


Observations

We can observe that every model performs poorly on short text. The highest average accuracy, 0.48, is obtained by the langdetect tool.
