Week from 06/21 to 06/27
Since evaluating only on the QALD datasets was not sufficient, I was asked to evaluate the language detection models on entity labels. I used the DBpedia knowledge base to extract the entity labels.
How were the datasets created?
The entity labels are extracted using the SPARQL query below, and the results are downloaded in CSV format.
select distinct ?label where {?s rdfs:label ?label. filter(lang(?label) = 'en')} LIMIT 10000
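The query above can be parameterized per language. A minimal sketch of how the query strings might be built (the template follows the query in the post; the function name and the idea of templating are assumptions, and any SPARQL client such as SPARQLWrapper could submit the result to the public DBpedia endpoint):

```python
# Build the per-language SPARQL query used to extract entity labels
# from DBpedia. Only the language tag and the limit vary per run.
QUERY_TEMPLATE = (
    "select distinct ?label where "
    "{{?s rdfs:label ?label. filter(lang(?label) = '{lang}')}} "
    "LIMIT {limit}"
)

def build_query(lang, limit=10000):
    """Return the label-extraction query for one language tag."""
    return QUERY_TEMPLATE.format(lang=lang, limit=limit)

# Example: the German variant of the query shown above.
print(build_query("de"))
```

The same template is then run once per language tag ('en', 'de', 'es', and so on) against the DBpedia endpoint.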
Note that this query also returned entity labels consisting only of numeric data. These labels were removed, and the resulting dataset contains text and language columns. 10,000 entity labels were gathered for each of English, German, and Spanish; for the other languages, the count was less than 1,000.
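The numeric-label filtering and dataset construction described above can be sketched as follows; `is_numeric_only` is a hypothetical helper (the post does not specify the exact check), and the two-column row layout mirrors the text/language columns mentioned:

```python
def is_numeric_only(label):
    """Heuristic check for labels like '1984' or '3.14' that carry
    no language signal (assumption: this is roughly the filter used)."""
    return label.strip().replace(".", "", 1).replace(",", "").isdigit()

def build_rows(labels, lang):
    """Keep non-numeric labels and attach the language column."""
    return [
        {"text": label, "language": lang}
        for label in labels
        if not is_numeric_only(label)
    ]

# Numeric-only labels are dropped; textual labels are kept.
rows = build_rows(["Berlin", "1984", "3.14", "Alan Turing"], "en")
```

Rows built this way for each language are then written out as the per-language CSV datasets.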
How was the evaluation done?
Each language detection tool is evaluated on each entity-label dataset.
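The per-tool evaluation can be sketched as a simple accuracy computation over one entity-label dataset; the `detect` callable stands in for whichever tool is under test (for instance langdetect's `detect` function), and the function shape here is an assumption rather than the exact harness used:

```python
def evaluate(labels, true_lang, detect):
    """Fraction of entity labels for which the tool's predicted
    language tag matches the dataset's language column."""
    correct = sum(1 for text in labels if detect(text) == true_lang)
    return correct / len(labels)

# With langdetect installed, the call would look like:
#   from langdetect import detect
#   accuracy = evaluate(en_labels, "en", detect)
# Averaging the per-language accuracies gives the tool's overall score.
```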
Observations
We can observe that every model performs poorly on short text. The highest average accuracy, 0.48, is obtained by the langdetect tool.