Machine Translation Task

This experiment was carried out in the following steps:

  1. Dataset creation using the QALD datasets.
  2. Train and evaluate the initial dataset on the TensorFlow NMT with attention model.
  3. Use DBpedia Spotlight to annotate the entities and create templates.
  4. Manually evaluate the entity recognition process.
  5. Train and evaluate the template dataset on the TensorFlow NMT with attention model.

1. Dataset creation using the QALD datasets.

From the QALD datasets (QALD-3 through QALD-7), I created a dataset consisting of language pairs such as English-Spanish, English-German, etc. Pairs were created for all of the following languages: German, Spanish, French, Brazilian Portuguese, Portuguese, Italian, Dutch, Hindi, Romanian, Persian, and Russian.
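
A minimal sketch of this pairing step, assuming the QALD multilingual JSON layout in which each entry under "questions" carries one string per language (the file name in the usage comment is only an example):

    import json

    # Sketch: build (English, target-language) sentence pairs from a QALD file.
    # Assumes each question entry looks like
    # {"question": [{"language": "en", "string": "..."}, ...], ...}.
    def build_pairs(qald_file, target_lang):
        with open(qald_file, encoding="utf-8") as f:
            data = json.load(f)
        pairs = []
        for q in data["questions"]:
            strings = {s["language"]: s["string"] for s in q["question"]}
            if "en" in strings and target_lang in strings:
                pairs.append((strings["en"], strings[target_lang]))
        return pairs

    # e.g. build_pairs("qald-7-train-multilingual.json", "es")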

2. Train and evaluate the initial dataset on the TensorFlow NMT with attention model.


Using the "Tensorflow Neural Machine translation with attention model", trained the initial datasets created as said above and got the following results,

Language       | Accuracy % | Train set size | Test set size | Error % | Wrong translation %
---------------|------------|----------------|---------------|---------|--------------------
Spanish        | 60.6299    | 513            | 381           | 17      | 22
German         | 65.8595    | 563            | 413           | 12      | 21
French         | 63.3587    | 557            | 393           | 13      | 23
Russian        | 14.6666    | 399            | 150           | 74      | 11
Italian        | 31.6301    | 563            | 411           | 12      | 55
Portuguese     | 3.33333    | 399            | 150           | 74      | 22
Portuguese_BR  | 4.54545    | 217            | 22            | 0       | 95
Hindi          | 37.3333    | 447            | 150           | 5       | 57
Dutch          | 61.9422    | 513            | 381           | 17      | 20
Persian        | 8.14479    | 567            | 221           | 32      | 59
Romanian       | 52.3026    | 534            | 159           | 21      | 25

After analyzing the results, I observed that the model failed on some words because they were not in its vocabulary. Most of these words are entities such as "CPU", "Brooklyn Bridge", etc.
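
This out-of-vocabulary effect is easy to surface directly; a small sketch, where train_pairs and test_pairs are hypothetical names for the sentence pairs built in step 1:

    # Sketch: list English test-set words that never appear in training,
    # i.e. words the model cannot have in its vocabulary.
    def oov_words(train_pairs, test_pairs):
        vocab = {w for en, _ in train_pairs for w in en.lower().split()}
        return {w for en, _ in test_pairs for w in en.lower().split()} - vocab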

3. Use DBpedia Spotlight to annotate the entities and create templates.

For this experiment, I used the English-Spanish language pair dataset. Creating the templates involved three steps:
  1. Annotate Text
  2. Get Resource and Entity
  3. Replace the Entity

For this task, I used DBpedia Spotlight, a tool for annotating mentions of DBpedia resources in text, which links the annotated text to the Linked Open Data cloud through DBpedia.

  1. Annotate Text

    Using the API provided by DBpedia Spotlight, the English sentences were annotated by sending requests to the '/annotate' endpoint. Spotlight offers a "confidence" parameter to adjust the sensitivity of entity identification. I observed that confidence values of 0.2 and 0.3 caused non-entity words such as "when" and "give" to be annotated, so a confidence of 0.5 was used for optimal performance. Even at that level, however, the tool fails to identify some entities.
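
A minimal sketch of such a request; the public endpoint URL below is an assumption (a self-hosted Spotlight instance exposes the same '/annotate' endpoint):

    import requests

    # Assumed public Spotlight endpoint; adjust for a self-hosted instance.
    SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

    # Sketch: annotate a sentence, returning Spotlight's JSON response.
    def annotate(text, confidence=0.5):
        resp = requests.get(
            SPOTLIGHT_URL,
            params={"text": text, "confidence": confidence},
            headers={"Accept": "application/json"},
        )
        resp.raise_for_status()
        return resp.json()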

  2. Get Resource and Entity

With the text annotated in the previous step, I extracted the DBpedia resource and the entity from the response given by Spotlight. For example, if the annotated entity was "New York", the resource would be "http://dbpedia.org/resource/New_York".
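
In JSON form, Spotlight lists the spotted entities under a "Resources" key, so the extraction can be sketched as follows (using the annotate() sketch above):

    # Sketch: pull (resource URI, surface form) pairs out of a Spotlight
    # response; the "Resources" key is absent when nothing was spotted.
    def extract_entities(annotation):
        return [(r["@URI"], r["@surfaceForm"])
                for r in annotation.get("Resources", [])]

    # e.g. [("http://dbpedia.org/resource/New_York", "New York")]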

  3. Replace the Entity

The expectation of this step was to get results such as the following:

     1) Which river does the Brooklyn Bridge cross? ==> Which river does the <entity> cross?
        ¿Por qué río cruza la Brooklyn Bridge? ==> ¿Por qué río cruza la <entidad>?

     2) How many rivers and lakes are in South Carolina? ==> How many rivers and lakes are in <entity>?
        ¿Cuántos ríos y lagos hay en Carolina del Sur? ==> ¿Cuántos ríos y lagos hay en <entidad>?

In example 1), the entity "Brooklyn Bridge" was annotated in the English text and appears unchanged in the Spanish text, so it can be replaced directly. In example 2), however, the entity is translated differently in the Spanish text. To identify the correct translation of the entity and replace it, I used DBpedia SPARQL queries, for example:

    SELECT ?label WHERE {
      <http://dbpedia.org/resource/South_Carolina> rdfs:label ?label .
      FILTER(lang(?label) = 'es')
    }

This query returns the Spanish label of the entity, "Carolina del Sur". Using this result, the entity is replaced in the Spanish text. This process was done for all text pairs, and the template dataset was created by replacing the entities.
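
A sketch of this lookup-and-replace step, using the SPARQLWrapper library against the public DBpedia endpoint; make_template and its <entity>/<entidad> markers follow the examples shown above:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")

    # Sketch: fetch the label of a DBpedia resource in the given language.
    def label_in(resource_uri, lang):
        sparql.setQuery(
            "SELECT ?label WHERE { <%s> rdfs:label ?label . "
            "FILTER(lang(?label) = '%s') }" % (resource_uri, lang)
        )
        sparql.setReturnFormat(JSON)
        bindings = sparql.query().convert()["results"]["bindings"]
        return bindings[0]["label"]["value"] if bindings else None

    # Sketch: build the template pair by masking the entity on both sides.
    def make_template(en_text, es_text, surface_form, resource_uri):
        es_label = label_in(resource_uri, "es") or surface_form
        return (en_text.replace(surface_form, "<entity>"),
                es_text.replace(es_label, "<entidad>"))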

4. Manually evaluate the entity recognition process.

The resource returned by DBpedia Spotlight was used to annotate the target-language text. Since some resources were not identified correctly, the annotation process was flawed, so I had to manually check and keep only the correctly annotated text pairs.

5. Train and evaluate the template dataset on the TensorFlow NMT with attention model.

As the final step, the NMT model was trained on the correctly annotated training set and evaluated on the correctly annotated test set. The results were poor, mainly because the training dataset was small.

Conclusion

  • The dataset used is small, and with the entity annotation method it became even smaller, so the model was not able to recognize patterns.
  • Since the dataset is small, the vocabulary was also small, and KeyErrors were thrown during translation for out-of-vocabulary words.

Future work

  • Try different datasets.
  • Try annotating the input-language and target-language texts separately using DBpedia Spotlight.
