Machine Translation Task
The experiment was carried out in the following steps:
- Dataset creation using QALD datasets.
- Train and evaluate the initial dataset on the TensorFlow Neural Machine Translation with Attention model.
- Use DBpedia Spotlight to annotate the entities and create templates.
- Manually evaluate the entity recognition process.
- Train and evaluate the correctly annotated text on the TensorFlow Neural Machine Translation with Attention model.
1. Dataset Creation using QALD datasets.
From the QALD-3 through QALD-7 datasets, I created a dataset of language pairs such as English-Spanish and English-German. Pairs were created for all of the following languages: German, Spanish, French, Brazilian Portuguese, Portuguese, Italian, Dutch, Hindi, Romanian, Persian, and Russian.
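A minimal sketch of the pairing step, assuming the QALD JSON layout in which each entry under "questions" carries a "question" list of objects with "language" and "string" fields. Older QALD releases ship XML and would need a different parser. The function name `build_pairs` is hypothetical, not taken from the original experiment.

```python
from typing import Dict, List, Tuple


def build_pairs(qald: dict, source_lang: str = "en") -> Dict[str, List[Tuple[str, str]]]:
    """Group (source, target) sentence pairs by target language code."""
    pairs: Dict[str, List[Tuple[str, str]]] = {}
    for entry in qald.get("questions", []):
        # Map language code -> question text for this entry (assumed field names).
        by_lang = {
            q["language"]: q["string"].strip()
            for q in entry.get("question", [])
            if q.get("language") and q.get("string")
        }
        source = by_lang.get(source_lang)
        if not source:
            continue  # no source-language text, so nothing to pair
        for lang, text in by_lang.items():
            if lang != source_lang:
                pairs.setdefault(lang, []).append((source, text))
    return pairs


# Hand-written sample in the assumed layout.
sample = {
    "questions": [
        {
            "question": [
                {"language": "en", "string": "Which river does the Brooklyn Bridge cross?"},
                {"language": "es", "string": "¿Por qué río cruza la Brooklyn Bridge?"},
            ]
        }
    ]
}
print(build_pairs(sample)["es"])
```

Entries without an English question are skipped, so every pair in the result has English as its source side.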
2. Train and evaluate the initial dataset on the TensorFlow Neural Machine Translation with Attention model.
I trained the "TensorFlow Neural Machine Translation with Attention" model on the initial datasets created above and got the following results,
After analyzing the results, I observed that the model failed on some words because they are not in its vocabulary. Most of these words are named entities such as CPU, Brooklyn Bridge, etc.
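The failure mode can be illustrated with a plain dictionary vocabulary: a word that never appeared in training has no index, so a direct lookup fails. The sketch below is not the tutorial's code. `build_vocab` and `find_oov` are hypothetical helpers, and lower-cased whitespace tokenization is an assumption.

```python
from typing import Dict, Iterable, List


def build_vocab(sentences: Iterable[str]) -> Dict[str, int]:
    """Assign an integer index to every lower-cased whitespace token."""
    vocab: Dict[str, int] = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            # Index 0 is left free for padding, so indices start at 1.
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def find_oov(sentence: str, vocab: Dict[str, int]) -> List[str]:
    """Return the tokens of `sentence` that have no index in `vocab`."""
    return [token for token in sentence.lower().split() if token not in vocab]


train = ["which river does the bridge cross ?"]
vocab = build_vocab(train)
print(find_oov("which river does the brooklyn bridge cross ?", vocab))  # → ['brooklyn']
```

Running a check like this over the test set before translation lists the words that the model cannot map to an index.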
3. Using DBpedia Spotlight to annotate the entities and create templates.
For this experiment, I used the English-Spanish language pair dataset. Creating the templates involved three steps:
- Annotate Text
- Get Resource and Entity
- Replace the Entity
For this task, I used DBpedia Spotlight, a tool that annotates mentions of DBpedia resources in text and thereby helps link information to the Linked Open Data cloud through DBpedia.
1. Annotate Text
The English sentences are annotated through the DBpedia Spotlight API by sending a request to the '/annotate' endpoint. Spotlight offers a "confidence" parameter that adjusts how readily it identifies an entity. With a confidence of 0.2 or 0.3, non-entity words such as "when" and "give" were annotated, so I used a confidence of 0.5 as a compromise. At that confidence level, however, the tool still fails to identify some entities.
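A request of this kind could look like the sketch below. It assumes the public demo endpoint at api.dbpedia-spotlight.org and the English model. A self-hosted Spotlight instance would use a different base URL. `build_annotate_url` and `annotate` are hypothetical names, and only the standard library is used.

```python
import json
import urllib.parse
import urllib.request

# Assumption: public demo endpoint, English model.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_annotate_url(text: str, confidence: float = 0.5) -> str:
    """Build the GET URL for the /annotate endpoint."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return f"{SPOTLIGHT_URL}?{query}"


def annotate(text: str, confidence: float = 0.5, timeout: float = 10.0) -> dict:
    """Call Spotlight and return the decoded JSON response (needs network access)."""
    request = urllib.request.Request(
        build_annotate_url(text, confidence),
        headers={"Accept": "application/json"},  # ask for JSON instead of HTML
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


print(build_annotate_url("Which river does the Brooklyn Bridge cross?"))
```

Keeping the URL construction separate from the network call makes it easy to compare confidence values such as 0.2, 0.3 and 0.5 on the same sentence.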
2. Get Resource and Entity
With the text annotated in the previous step, I extracted the DBpedia resource and the entity from the response given by Spotlight. For example, if the annotated entity was "New York", the resource would be "http://dbpedia.org/resource/New_York".
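The extraction can be sketched as follows, assuming the JSON response carries a "Resources" list whose items expose the "@surfaceForm" and "@URI" fields. The sample response is hand-written and trimmed to those two fields, and `extract_entities` is a hypothetical name.

```python
from typing import List, Tuple


def extract_entities(response: dict) -> List[Tuple[str, str]]:
    """Return (surface form, DBpedia resource URI) pairs from a Spotlight response."""
    entities = []
    # Assumption: "Resources" is missing when nothing passes the confidence threshold.
    for resource in response.get("Resources", []):
        entities.append((resource["@surfaceForm"], resource["@URI"]))
    return entities


# Hand-written response fragment, not real Spotlight output.
sample_response = {
    "Resources": [
        {"@URI": "http://dbpedia.org/resource/New_York", "@surfaceForm": "New York"}
    ]
}
print(extract_entities(sample_response))
```

The surface form is the text as it appears in the sentence, which is what gets replaced later. The URI is what identifies the entity across languages.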
3. Replace the Entity
The expected result of this step is as follows:
1) Which river does the Brooklyn Bridge cross? ==> Which river does the <entity> cross?
¿Por qué río cruza la Brooklyn Bridge? ==> ¿Por qué río cruza la <entidad>?
2) How many rivers and lakes are in South Carolina? ==> How many rivers and lakes are in <entity>?
¿Cuántos ríos y lagos hay en Carolina del Sur? ==> ¿Cuántos ríos y lagos hay en <entidad>?
In example 1) above, the entity is "Brooklyn Bridge", as annotated in the English text. The entity appears unchanged in the Spanish text, so it can be replaced directly.
In example 2), however, the entity is written differently in the Spanish text. To find the correct translation of the entity and replace it, I used a DBpedia SPARQL query:
query = select ?label where {<http://dbpedia.org/resource/South_Carolina> rdfs:label ?label. filter(lang(?label) = 'es')}
This query returns the Spanish label of the entity, "Carolina del Sur". Using this result, the entity is replaced in the Spanish text. The process is repeated for all text pairs, which yields the template dataset.
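The lookup and replacement could be sketched as below. It assumes the public DBpedia SPARQL endpoint and the standard SPARQL JSON results layout. `label_query`, `fetch_label` and `to_template` are hypothetical names, and replacing only the first exact match is a simplification.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

SPARQL_ENDPOINT = "https://dbpedia.org/sparql"  # assumption: public endpoint


def label_query(resource_uri: str, lang: str) -> str:
    """Build the rdfs:label query for one resource and one language."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
        f"SELECT ?label WHERE {{ <{resource_uri}> rdfs:label ?label . "
        f"FILTER(lang(?label) = '{lang}') }}"
    )


def fetch_label(resource_uri: str, lang: str, timeout: float = 10.0) -> Optional[str]:
    """Return the label in `lang`, or None if there is none (needs network access)."""
    params = urllib.parse.urlencode({"query": label_query(resource_uri, lang)})
    request = urllib.request.Request(
        f"{SPARQL_ENDPOINT}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        bindings = json.loads(response.read().decode("utf-8"))["results"]["bindings"]
    return bindings[0]["label"]["value"] if bindings else None


def to_template(sentence: str, surface: str, placeholder: str) -> Optional[str]:
    """Replace the first occurrence of `surface`; None signals that it was not found."""
    if surface not in sentence:
        return None
    return sentence.replace(surface, placeholder, 1)


print(to_template("¿Cuántos ríos y lagos hay en Carolina del Sur?", "Carolina del Sur", "<entidad>"))
```

Returning None when the label is absent from the sentence keeps failed replacements visible, so those pairs can be set aside instead of entering the template dataset unchanged.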
4. Manually evaluate the entity recognition process.
I used the resource returned by DBpedia Spotlight to annotate the target-language text. Because some resources were not identified correctly, the annotation process was flawed, so I had to check the pairs manually and keep only the correctly annotated text.
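A simple pre-filter can narrow down what needs manual inspection: a pair is only a candidate if both sides received their placeholder. This is a hypothetical helper, not part of the original experiment, and it does not replace the manual check, since a placeholder can still stand for the wrong resource.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def split_by_placeholder(
    pairs: Iterable[Pair],
    src_tag: str = "<entity>",
    tgt_tag: str = "<entidad>",
) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs into (both sides templated, at least one side not templated)."""
    complete: List[Pair] = []
    needs_review: List[Pair] = []
    for source, target in pairs:
        if src_tag in source and tgt_tag in target:
            complete.append((source, target))
        else:
            needs_review.append((source, target))
    return complete, needs_review


pairs = [
    ("Which river does the <entity> cross?", "¿Por qué río cruza la <entidad>?"),
    ("How many rivers and lakes are in <entity>?", "¿Cuántos ríos y lagos hay en Carolina del Sur?"),
]
complete, needs_review = split_by_placeholder(pairs)
print(len(complete), len(needs_review))  # → 1 1
```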
5. Train and evaluate the template dataset on the TensorFlow Neural Machine Translation with Attention model.
As the final step, the NMT model was trained on the correctly annotated training dataset and evaluated on the correctly annotated test dataset. The results were poor, which I attribute to the small size of the training dataset.
Conclusion
- The dataset is small, and the entity annotation method made it even smaller. As a result, the model was not able to learn the patterns.
- Because the dataset is small, the vocabulary is small as well, and 'Key Errors' were thrown during translation.
Future work
- Try different datasets.
- Try annotating the input-language and target-language texts separately using DBpedia Spotlight.