Posts

DBpedia Neural Multilingual QA - GSoC project summary

GSoC blog
This blog includes a detailed description of the work I carried out during the GSoC period, with the tasks described on a weekly basis.

Brief description of my work:
1. Implement airML using KBox and pip. This task was to package KBox and distribute it as a pip package called airML, which allows users to share and dereference ML models.
2. Share the Monument dataset with airML. Train the Monument dataset, place it in a public repository, and dereference it with airML.
3. Create the language detector dataset by iterating over all questions in IRbench.
4. Create a language detector model and train it. After creating the model, I evaluated other existing language detection models and wrote a research paper with my mentor.
5. Experiment with machine translation methods: creating datasets, annotating datasets, and evaluating different methods.

The full description of the tasks carried out and the obtained results is included in this blog.

Machine Translation Task

This experimentation was done as follows:
1. Dataset creation using the QALD datasets.
2. Train and evaluate the initial dataset on the TensorFlow neural machine translation with attention model.
3. Use DBpedia Spotlight to annotate the entities and create templates.
4. Manually evaluate the entity recognition process.
5. Train and evaluate the correctly annotated text on the TensorFlow neural machine translation with attention model.

1. Dataset creation using the QALD datasets.
From the QALD-3 to QALD-7 datasets, I created a dataset consisting of language pairs such as English-Spanish, English-German, etc. These pairs were created for all of the following languages: German, Spanish, French, Brazilian Portuguese, Portuguese, Italian, Dutch, Hindi, Romanian, Persian, and Russian.

2. Train and evaluate the initial dataset on the TensorFlow neural machine translation with attention model.
Using the TensorFlow neural machine translation with attention model, I trained on the initial datasets created as described above.
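The extraction code for the language pairs is not shown in the post; the sketch below is one way it could be done, assuming the common QALD JSON layout (a top-level "questions" list whose entries carry the question text in several languages). The file names and the build_pairs helper are illustrative rather than the project's actual code, and older QALD releases may use a slightly different layout.

```python
import json
from pathlib import Path

def build_pairs(qald_files, target_lang):
    """Collect (English, target-language) question pairs from QALD JSON files."""
    pairs = []
    for path in qald_files:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
        for q in data.get("questions", []):
            # Map language code -> question string for this question entry.
            texts = {v["language"]: v["string"] for v in q.get("question", [])}
            if "en" in texts and target_lang in texts:
                pairs.append((texts["en"], texts[target_lang]))
    return pairs

# Illustrative usage: English-Spanish pairs from hypothetical QALD train files.
files = ["qald-3-train.json", "qald-4-train.json", "qald-5-train.json",
         "qald-6-train.json", "qald-7-train.json"]
en_es_pairs = build_pairs(files, "es")
```

Each pair list can then be written out in the tab-separated source/target format used by the TensorFlow neural machine translation with attention tutorial.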

Week from 08/02 to 08/08

My mentor pointed out that the reason for our models' better performance may be that they are evaluated over fewer languages.

Approach     #Languages
LangTag(S)   10
LangTag(C)   12
langdetect   55
Tika         18
openNLP      103

Since our model only supports 12 languages while openNLP supports 103, that could be the reason for our models' better performance.

How were the other tools evaluated with the languages restricted?
My mentor suggested that by taking the vector of probabilities a model uses to predict the language, we can restrict its prediction to the languages we support. I was able to get the probability vectors of the openNLP and langdetect models, but for Tika it was not possible. I re-evaluated the openNLP and langdetect tools on the QALD datasets, entity labels, and abstracts, and updated the results tables in the repo.

Observations
It was observed that the openNLP and langdetect tools performed much better when the languages are restricted.
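The restriction code itself is not included in the post; the following is a minimal sketch of the idea for langdetect, assuming the restriction is done by taking the probability vector from detect_langs and keeping only the languages our own model supports. The SUPPORTED set and the helper name are illustrative.

```python
from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # make langdetect deterministic across runs

# Illustrative subset: the languages our own model supports.
SUPPORTED = {"en", "de", "es", "fr", "it", "ru", "pt", "nl", "hi", "ro", "fa"}

def detect_restricted(text, supported=SUPPORTED):
    """Return the most probable language among the supported ones only."""
    candidates = detect_langs(text)  # Language(lang, prob) objects, sorted by prob
    for cand in candidates:
        if cand.lang in supported:
            return cand.lang
    return None  # none of the supported languages appeared in the probability vector

print(detect_restricted("Wer ist der Bürgermeister von Berlin?"))  # expected: de
```

The same idea applies to openNLP: take its per-language probabilities and pick the best-scoring language from the restricted set instead of the global argmax.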

Week from 07/19 to 07/25

This week we started to focus on the machine translation task. First I tried to understand how the TensorFlow neural machine translation with attention model is implemented, and then I trained the model on the QALD datasets.

How were the datasets created?
From the QALD-3 to QALD-7 datasets, I created a dataset consisting of language pairs such as English-Spanish, English-German, etc. These pairs were created for all of the following languages: German, Spanish, French, Brazilian Portuguese, Portuguese, Italian, Dutch, Hindi, Romanian, Persian, and Russian.

How was the evaluation done?
Using the TensorFlow neural machine translation with attention model, I trained on the datasets created as described above and got the following results:

language     accuracy %
Spanish      60.6299
German       65.8595
French       63.3587
Russian      14.6666
Italian      31.6301
Portuguese   3.33333
pt_BR        4.54545
Hindi        37.3333
Dutch        61.9422
Persian      8.14479
Romanian     52.3026

Observations
It is observed that the results are very poor. This can be due to …
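The post does not spell out how the accuracy percentages are computed. As a purely hypothetical illustration, the sketch below computes a per-sentence exact-match accuracy over held-out pairs, with translate standing in for the trained attention model's inference call; the actual metric used in the project may differ.

```python
def exact_match_accuracy(test_pairs, translate):
    """Percentage of held-out source sentences whose translation matches the
    reference exactly, after lower-casing and whitespace normalisation."""
    def norm(s):
        return " ".join(s.lower().split())

    hits = sum(norm(translate(src)) == norm(ref) for src, ref in test_pairs)
    return 100.0 * hits / len(test_pairs)

# Illustrative usage with a trained English->Spanish model:
# acc = exact_match_accuracy(en_es_test_pairs, nmt_model.translate)
# print(f"accuracy % = {acc:.4f}")
```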

Week from 07/12 to 07/18

Since we are writing a research paper on "Assessing Language Identification Methods and Frameworks over Linked Data", I was tasked with doing some work on the paper this week. This is the work done during the week:
1. Created a readme file including the results obtained so far in the research. The complete results can be found in this Github repo.
2. Wrote the evaluation part of the research paper, including how the experimentation was done.
3. Refactored the code and opened a PR to the main repo.

Week from 07/05 to 07/11

To evaluate the performance of the language detection models LangTag(S), LangTag(C), langdetect, openNLP, and Tika, we designed benchmarks of different text lengths and domains: (1) Short, (2) Moderate, and (3) Long. Until now we had evaluated all the models on the Short and Moderate text benchmarks, so this week I was tasked with the evaluation on long texts.

How were the datasets created?
For long texts, we used the dbo:abstract values of the top 10,000 resources returned by the DBpedia SPARQL endpoint. The SPARQL query below was used for extracting the abstracts (with the language tag changed for each language):

select distinct ?abstract where { ?s dbo:abstract ?abstract . filter(lang(?abstract) = 'en') } limit 10000

I was able to extract 10,000 text resources for every language except Russian. From the gathered resources I created datasets with text and language columns.

How was the evaluation done?
Every language detection model is evaluated on each dataset. Here are the results:

Approach     EN      DE      RU    IT      ES      FR      AVG
#Resources   10,000  10,000  285   10,000  10,000  10,000
Accuracy     Runtime
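The post only shows the query itself. As a minimal sketch, the abstracts could be pulled from the public DBpedia endpoint with SPARQLWrapper and written into a text/language dataset as below; the endpoint URL is the standard public one, but the output file name and the exact tooling used in the project are assumptions.

```python
import csv
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://dbpedia.org/sparql"  # public DBpedia SPARQL endpoint

QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?abstract WHERE {
  ?s dbo:abstract ?abstract .
  FILTER(lang(?abstract) = '%s')
} LIMIT 10000
"""

def fetch_abstracts(lang):
    """Return up to 10,000 abstracts in the given language from DBpedia."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(QUERY % lang)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["abstract"]["value"] for b in results["results"]["bindings"]]

# Build a text/language dataset for the long-text benchmark (illustrative file name).
with open("long_text_benchmark.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "language"])
    for lang in ["en", "de", "ru", "it", "es", "fr"]:
        for abstract in fetch_abstracts(lang):
            writer.writerow([abstract, lang])
```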

Week from 06/21 to 06/27

Since evaluating on the QALD datasets alone was not sufficient, I was asked to evaluate the language detection models on entity labels. I used the DBpedia Knowledge Base for extracting the entity labels.

How were the datasets created?
The entity labels are extracted using the SPARQL query below, and the results are downloaded in CSV format:

select distinct ?label where { ?s rdfs:label ?label . filter(lang(?label) = 'en') } LIMIT 10000

It was noted that this query also returns entity labels that consist only of numeric data. These labels were removed, and the resulting dataset contains text and language columns. 10,000 entity labels were gathered for English, German, and Spanish; for the other languages, the count was less than 1,000.

How was the evaluation done?
Each language detection tool is evaluated on each entity-label dataset.

Approach     EN      DE      RU   IT    ES      FR    PT    AVG
#Resources   10,000  10,000  83   243   10,000  782   227
Accuracy     Runtime(s)
LangTag(S)   0.21    0.91    -    0.25  0.09    0.34  0.
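The numeric-only filtering step is not shown in the post; below is a minimal sketch of how the downloaded CSV labels could be cleaned and combined into a text/language dataset with pandas. The file names, the column name "label", and the exact filtering rule are assumptions for illustration.

```python
import pandas as pd

def clean_labels(csv_path, lang):
    """Load entity labels exported from the SPARQL endpoint and drop rows whose
    label consists only of numeric data (digits, punctuation, whitespace)."""
    df = pd.read_csv(csv_path)
    labels = df["label"].astype(str).str.strip()
    is_numeric_only = labels.str.fullmatch(r"[\d\s.,:/()-]+")
    kept = labels[~is_numeric_only]
    return pd.DataFrame({"text": kept, "language": lang})

# Illustrative usage: one exported CSV per language, combined into one dataset.
parts = [clean_labels("labels_en.csv", "en"),
         clean_labels("labels_de.csv", "de"),
         clean_labels("labels_es.csv", "es")]
dataset = pd.concat(parts, ignore_index=True)
dataset.to_csv("entity_label_benchmark.csv", index=False)
```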