Human translation is typically of higher quality, and in most cases more expensive, than machine translation. Yet this doesn’t mean that human translators never make mistakes.
Translation errors can lose or alter information relative to the original segment. When the business involves selling high-quality segment pairs that are used to improve machine translation, this is not a trivial issue, as these errors can degrade the quality of the resulting translations.
Some errors can be filtered out easily by checking whether a segment is actually written in the target language, or whether its length is plausible given the length of the source segment. However, these simple checks can’t detect more nuanced differences in meaning or structure between source and target segments. This is where sentence embeddings come into the picture.
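A length check of this kind might look like the sketch below. The function name, the character-based length measure, and the maximum ratio of 2.0 are illustrative assumptions, not the settings used in this research, and the second segment pair is made up for the example.

```python
def length_ratio_ok(source: str, target: str, max_ratio: float = 2.0) -> bool:
    """Return True if neither segment is more than max_ratio times longer than the other."""
    src_len, tgt_len = len(source.strip()), len(target.strip())
    if src_len == 0 or tgt_len == 0:
        return False  # an empty side is never a usable segment pair
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio <= max_ratio


pairs = [
    # Similar lengths: passes the check.
    ("Signature of acts adopted under codecision: see Minutes",
     "Bendru sprendimu priimtų teisės aktų pasirašymas (žr. protokolą)"),
    # Made-up pair with a suspiciously short target: filtered out.
    ("This is a fairly long source sentence.", "Ok"),
]

kept = [pair for pair in pairs if length_ratio_ok(*pair)]
print(f"kept {len(kept)} of {len(pairs)} segment pairs")
```

A check like this only catches gross mismatches. It says nothing about whether the two segments actually mean the same thing.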
Putting language into numbers
Multilingual sentence encoders such as LASER (developed by Facebook) and LaBSE (developed by Google) are trained on enormous amounts of text from around 100 languages, learning how to create accurate numerical representations, so-called sentence embeddings, for sentences in any of these languages. These embeddings capture aspects of sentence meaning, allowing direct comparison of sentences across languages.
This ability to compare sentences across languages is what we exploit to process our data. If two segments are accurate translations of each other, their meanings should match, which means their sentence embeddings should be similar. By choosing an appropriate similarity threshold, we can filter out bad translations.
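The comparison itself is commonly done with cosine similarity. The sketch below uses tiny made-up 4-dimensional vectors in place of real LASER or LaBSE embeddings, and the threshold of 0.8 is a hypothetical value; in practice the threshold has to be tuned on real data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def is_good_translation(src_emb, tgt_emb, threshold=0.8):
    """Keep a segment pair only if its embeddings are similar enough."""
    return cosine_similarity(src_emb, tgt_emb) >= threshold


# Toy stand-ins for sentence embeddings (real ones have hundreds of dimensions).
source = [0.9, 0.1, 0.3, 0.2]
close_target = [0.8, 0.2, 0.3, 0.1]
distant_target = [0.1, 0.9, -0.4, 0.5]

print(is_good_translation(source, close_target))
print(is_good_translation(source, distant_target))
```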
This makes sentence embeddings a valuable tool for building additional (and much-needed) filters for controlling translation quality, but they are not without their drawbacks.
Dimensions and noise
While sentence embeddings encode a lot of information about the segments they’re applied to, all this information makes them very large. The dimensionality of sentence embeddings created by the LASER model, for instance, is 1024, which means every single sentence is represented by 1024 numbers. Storing and retrieving such large embeddings for hundreds of thousands or millions of sentences can become a serious bottleneck in data processing.
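A back-of-the-envelope calculation shows the scale. The sketch assumes each value is stored as a 32-bit float (4 bytes), which is a common choice but an assumption here, and the reduced size of 128 dimensions is purely hypothetical.

```python
def embedding_storage_bytes(num_sentences, dim=1024, bytes_per_value=4):
    """Raw storage needed for num_sentences embeddings of the given size."""
    return num_sentences * dim * bytes_per_value


full_size = embedding_storage_bytes(1_000_000)              # 1024-dimensional
reduced_size = embedding_storage_bytes(1_000_000, dim=128)  # hypothetical reduced size

print(f"1024 dimensions: {full_size / 1e9:.2f} GB")   # 4.10 GB
print(f" 128 dimensions: {reduced_size / 1e9:.2f} GB")  # 0.51 GB
```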
Additionally, sentence embeddings aren’t flawless. While sentence embeddings can capture fairly sophisticated aspects of sentence meaning, they are far from understanding language the way humans do.
- Sentence embeddings are sensitive to punctuation, in ways humans aren’t:
  - The sentence embeddings LaBSE creates are not particularly similar for Lithuanian “Bendru sprendimu priimtų teisės aktų pasirašymas (žr. protokolą)” and English “Signature of acts adopted under codecision: see Minutes”, even though the two have the same meaning. This is likely due to the presence of parentheses in the Lithuanian sentence.
- Sentence embeddings are sensitive to overlapping phrases, even in sentences where the main meaning is different:
  - Both LASER and LaBSE create highly similar sentence embeddings for Bulgarian “Освобождаване от отговорност 2005: Европейски център за профилактика и контрол върху заболяванията (вот)” and English “Discharge 2005: European Food Safety Authority (vote)”, even though the Bulgarian sentence actually means “Discharge 2005: European Centre for Disease Prevention and Control (vote)”. The shared elements, such as “Discharge”, “European”, “(vote)”, and the year 2005, are enough to keep the models from telling the two meanings apart.
To improve the workflow of checking translation quality, it seems we need a way to reduce both the size of the sentence embeddings and the noise in them. This would allow us to use embeddings faster and more effectively.
If only we had a tool that was designed to reduce dimensionality by learning how to ignore the noise in the data…
Excursion to image processing
In the previous section, we were looking for a tool that reduces dimensionality and ignores noise. Autoencoders were designed to do exactly this. Autoencoders are simple neural networks with three main parts, the encoder, the bottleneck, and the decoder, and they are trained to reconstruct their input as closely as possible. The encoder reduces the input size step by step until it reaches the most compressed part of the model: the bottleneck. From the bottleneck, the decoder tries to reconstruct the original input at its original dimensionality.
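To make the three parts concrete, here is a minimal, untrained sketch with a single linear layer on each side. The toy sizes (8 inputs, a bottleneck of 3), the random weights, and the helper names are assumptions for illustration; a real autoencoder would typically be built in a deep learning framework, with non-linear layers and trained weights.

```python
import random


def make_layer(n_in, n_out, rng):
    """Weight matrix with n_out rows and n_in columns, filled with small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply_layer(weights, vector):
    """Plain matrix-vector product: one linear layer without bias or activation."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


rng = random.Random(0)
INPUT_DIM, BOTTLENECK_DIM = 8, 3  # toy sizes; LASER embeddings would have 1024 inputs

encoder = make_layer(INPUT_DIM, BOTTLENECK_DIM, rng)   # compresses 8 values to 3
decoder = make_layer(BOTTLENECK_DIM, INPUT_DIM, rng)   # expands 3 values back to 8

embedding = [rng.uniform(-1.0, 1.0) for _ in range(INPUT_DIM)]
code = apply_layer(encoder, embedding)       # the bottleneck representation
reconstruction = apply_layer(decoder, code)  # same size as the original input

print(len(embedding), len(code), len(reconstruction))  # 8 3 8
```

The `code` vector is what would serve as the reduced sentence embedding once the network has been trained.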
This reconstruction is naturally going to result in losing some information. However, this can be advantageous if the information we lose is just noise, and we retain information that is actually important. Autoencoders have found use in denoising images, for instance the ones we create with our smartphone cameras.
It was exactly these principles of autoencoders, the reduction of information to a bottleneck and the denoising of information, that inspired this research.
Research, research, research
How do we find out if we can use autoencoders to reduce sentence embedding noise and size? We need experiments.
Experiments involve training an autoencoder to reconstruct the original sentence embeddings, then extracting the reduced sentence embeddings from the bottleneck layer of the autoencoder, and using these as new sentence representations. Comparing performance with original and reduced sentence embeddings reveals whether we can benefit from autoencoders.
A common way of evaluating different setups in NLP is to measure performance on a specific task. In our research, we used two tasks: domain classification and cleaning. While cleaning involves checking the translation quality of sentences, domain classification involves deciding which domain a certain sentence belongs to.
For instance, a sentence such as The London-based bank later let all of its U.S. “Lehman Brothers” trademark registrations expire might belong to a legal or financial domain, but not to a technical one.
Another way of checking how well autoencoders perform is to check the similarity between sentence embeddings and their reconstructions, the output of the autoencoders. The more similar the reconstructed sentence embeddings are to the original sentence embeddings, the more information the autoencoder retains. This makes it all the more successful for compressing the size of sentence embeddings.
One of the main difficulties in using an autoencoder is choosing the right training parameters (often called hyperparameters). All neural networks are sensitive to these settings. The way they relate to expected network performance is far from obvious, which makes parameter tuning as much an art as a science.
The most important parameters we focused on were:
- the number of epochs, i.e., how many times the training data is passed through the model during training,
- batch size, i.e., the size of the subset of the data that is presented to the model at every training step,
- and learning rate, i.e., the degree to which model parameters change at every training step.
These parameters may affect the success of training an autoencoder in a number of different ways. Certain parameters might affect the speed of training: more epochs or smaller batch sizes prolong training time.
More epochs allow the model to learn better from the data presented to it. Too many epochs, however, might mean the model learns the training data too well (it overfits) and can’t generalize to new examples.
The learning rate determines how much a model learns from a given training step. Pick too small a learning rate and the model barely learns. Pick a very large one, and each training step can undo what the model learned previously, so training becomes unstable.
Finding the right balance between these parameters is a challenge that makes model training especially difficult.
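The sketch below shows where these three parameters enter a training loop. It trains a tiny linear autoencoder with mini-batch gradient descent on made-up 6-dimensional data that can be compressed to 2 dimensions without loss. The data, the layer sizes, and the values chosen for epochs, batch size, and learning rate are all illustrative assumptions, not the setup used in this research.

```python
import random


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def reconstruction_loss(encoder, decoder, data):
    """Mean squared reconstruction error (summed over dimensions) across the data."""
    total = 0.0
    for x in data:
        r = matvec(decoder, matvec(encoder, x))
        total += sum((ri - xi) ** 2 for ri, xi in zip(r, x))
    return total / len(data)


def train(data, bottleneck_dim, epochs, batch_size, learning_rate, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    encoder = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(bottleneck_dim)]
    decoder = [[rng.uniform(-0.1, 0.1) for _ in range(bottleneck_dim)] for _ in range(dim)]
    for _ in range(epochs):                      # one epoch = one pass over all data
        order = data[:]
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]  # one training step per batch
            grad_enc = [[0.0] * dim for _ in range(bottleneck_dim)]
            grad_dec = [[0.0] * bottleneck_dim for _ in range(dim)]
            for x in batch:
                z = matvec(encoder, x)           # bottleneck representation
                r = matvec(decoder, z)           # reconstruction
                err = [2.0 * (ri - xi) for ri, xi in zip(r, x)]
                grad_z = [sum(err[i] * decoder[i][j] for i in range(dim))
                          for j in range(bottleneck_dim)]
                for i in range(dim):
                    for j in range(bottleneck_dim):
                        grad_dec[i][j] += err[i] * z[j]
                for j in range(bottleneck_dim):
                    for m in range(dim):
                        grad_enc[j][m] += grad_z[j] * x[m]
            step = learning_rate / len(batch)    # learning rate scales every update
            for i in range(dim):
                for j in range(bottleneck_dim):
                    decoder[i][j] -= step * grad_dec[i][j]
            for j in range(bottleneck_dim):
                for m in range(dim):
                    encoder[j][m] -= step * grad_enc[j][m]
    return encoder, decoder


# Made-up data: 6 values per example, but only 2 independent ones (a and b).
data_rng = random.Random(1)
data = []
for _ in range(64):
    a, b = data_rng.uniform(-1.0, 1.0), data_rng.uniform(-1.0, 1.0)
    data.append([a, b, a, b, a, b])

untrained = train(data, bottleneck_dim=2, epochs=0, batch_size=8, learning_rate=0.05)
trained = train(data, bottleneck_dim=2, epochs=50, batch_size=8, learning_rate=0.05)
loss_before = reconstruction_loss(*untrained, data)
loss_after = reconstruction_loss(*trained, data)
print(f"loss before training: {loss_before:.4f}, after: {loss_after:.4f}")
```

Changing `epochs`, `batch_size`, or `learning_rate` in the two `train` calls is an easy way to see how each one affects the final loss on this toy problem.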
Experiments and results
We ran experiments on domain classification, cleaning (translation quality control), and the capability of autoencoders to reconstruct sentence embeddings.
For domain classification and the reconstruction of embeddings, we used the same English-language domain classification dataset. Input sentences were encoded with the Universal Sentence Encoder (USE) for domain classification, and with USE, LASER and LaBSE for reconstruction. For cleaning, we used a subset of the Europarl dataset covering 9 language pairs, each pairing English with another language.
We measured success on both classification tasks using the F1-score, a common NLP metric that is sensitive to both false negatives and false positives. With multiple classes, per-class F1-scores are commonly combined either as an average weighted by the number of examples in each class (weighted average F1-score) or as a plain average over all classes (macro average F1-score).
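The difference between the two averages is easiest to see on a small example. The labels and predictions below are made up, and the helper names are illustrative.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """F1-score for each label: 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Plain average over classes: every class counts equally."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Average weighted by how many true examples each class has."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Made-up domain labels: four "legal" sentences and two "technical" ones.
y_true = ["legal", "legal", "legal", "legal", "technical", "technical"]
y_pred = ["legal", "legal", "legal", "legal", "legal", "technical"]

print(f"macro F1:    {macro_f1(y_true, y_pred):.4f}")     # 0.7778
print(f"weighted F1: {weighted_f1(y_true, y_pred):.4f}")  # 0.8148
```

The macro average is lower here because the mistake falls on the smaller class, which the macro average weighs as heavily as the larger one. Libraries such as scikit-learn offer the same two options through the `average` argument of `f1_score`.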
On the reconstruction of sentence embeddings, we measured performance using average cosine similarity between the original and the reconstructed embeddings. The closer this is to 1, the more original and reconstructed embeddings resemble each other, and the more successful reconstruction is.
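As a rough sketch, the metric can be computed as follows. The three-dimensional vectors are toy values standing in for real embeddings and their reconstructions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def average_cosine_similarity(originals, reconstructions):
    """Mean cosine similarity over matching (original, reconstruction) pairs."""
    sims = [cosine_similarity(o, r) for o, r in zip(originals, reconstructions)]
    return sum(sims) / len(sims)


originals = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
reconstructions = [[1.0, 0.0, 0.0],   # perfect reconstruction: similarity 1.0
                   [0.0, 1.0, 1.0]]   # partly wrong direction: similarity about 0.71

average = average_cosine_similarity(originals, reconstructions)
print(f"average cosine similarity: {average:.4f}")  # 0.8536
```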
- The first conclusion of the experiments is that there was not a single task on which sentence embeddings reduced by autoencoders performed comparably to the original sentence embeddings.
  - On the domain classification task, reduced sentence embeddings score 0.05 lower on weighted average F1-score and 0.09 lower on macro average F1-score than the original USE sentence embeddings, which is not a satisfying result.
  - On the cleaning task, the best-performing reduced LASER sentence embeddings score 0.3 lower on macro average F1-score and 0.2 lower on weighted average F1-score than the original sentence embeddings, a truly bad result. LaBSE embeddings perform even worse on this task.
  - Finally, autoencoders don’t perform particularly well when it comes to reconstructing information from the original sentence embeddings. The highest average cosine similarity, 0.89, is reached when reducing and reconstructing LASER embeddings, which still signals a considerable loss of information. The cosine similarity between original and reconstructed sentence embeddings created with LaBSE and USE is much lower, around 0.3.
- The second conclusion is that while changing parameters such as learning rate, number of epochs and batch size affects performance, the impact is hard to predict. Training for more epochs helps on some tasks and with certain sentence embedding models, but far from all. Small batch sizes help on certain tasks while hurting performance on others. Finally, the learning rate seems to have only a moderate impact on performance.
- The final conclusion is that sentence embedding models seem to be fairly efficient already, and their embeddings cannot be effectively reduced in size with autoencoders while maintaining, let alone enhancing, their information content.
In total, we carried out 90+ different experiments over three different tasks using three different sentence embedding models. We changed training parameters such as the learning rate, the number of epochs and the batch size in order to come up with a network that can reduce the size of sentence embeddings while keeping the information contained in the embeddings. While parameter changes did affect the quality of the resulting sentence embeddings, their impact was not consistent across tasks. Consequently, we could not find a good use of autoencoders for our purposes.
All in all, this research shed light on the fact that despite their size, sentence embeddings created by various models are already efficient and capable.
This blog article has been written by Marcell Fekete, Junior Machine Learning Engineer at TAUS based on research carried out by him.