Abstract:We first present our work in machine translation, during which we used aligned sentences to train a neural network to embed n-grams of different languages into an $d$-dimensional space, such that n-grams that are the translation of each other are close with respect to some metric. Good n-grams to n-grams translation results were achieved, but full sentences translation is still problematic. We realized that learning semantics of sentences and documents was the key for solving a lot of natural language processing problems, and thus moved to the second part of our work: sentence compression. We introduce a flexible neural network architecture for learning embeddings of words and sentences that extract their semantics, propose an efficient implementation in the Torch framework and present embedding results comparable to the ones obtained with classical neural language models, while being more powerful.
What problem does this paper attempt to address?
The main problems that this paper attempts to solve are the challenges in machine translation and sentence compression (semantic extraction). Specifically:
1. **Machine translation problems**:
- Current translation tools, such as Google Translate, mainly rely on "mapping tables", that is, by searching a huge database to find known - translated sentence fragments and combine them into the final translation result. Although this method can give satisfactory results, it requires a large amount of data support and is not practical for individual users.
- The paper proposes a new solution: training a neural network to embed words and sentences into a \(d\)-dimensional space, in which words or sentences with similar meanings will be close to each other regardless of the language they belong to. In this way, translating a word is equivalent to finding the nearest - neighbor word in another language in this embedding space.
2. **Sentence compression (semantic extraction) problems**:
- In the information age, electronic devices generate a vast amount of data, making it difficult for humans to directly process and understand this data. Therefore, we need tools that can extract useful information from a large amount of unlabeled data.
- The goal of the paper is to develop a tool that can embed text into a \(d\)-dimensional space so that this representation contains all information about its meaning. In this way, a large amount of information can be easily classified, sorted, and clustered.
### Technical details of the machine translation part
To achieve the above goals, the author uses the following methods and techniques:
- **Dataset**: The Europarl parallel corpus is used, which is a multilingual dataset extracted from the minutes of the European Parliament meetings and contains versions in 11 different languages.
- **Network architecture**: A multi - layer perceptron (MLP) is constructed. The input is a pair of sentences (one sentence in each language). Each word is represented by its index in the global dictionary and is converted into a \(d\)-dimensional vector through a lookup table. Then the average of all word vectors in each sentence is calculated as the representation of the sentence.
- **Training algorithm**: For each line - aligned corpus, two samples are generated: one is a positive sample (containing a pair of corresponding sentences), and the other is a negative sample (in which one sentence is randomly replaced). Training is carried out by comparing the scores of these two samples.
- **Loss function**: The Margin Ranking Criterion is used, and the formula is as follows:
\[
L_w(x)=\max(0, m - f_w(x_{pos})+f_w(x_{neg}))
\]
where \(x_{pos}\) and \(x_{neg}\) are the positive sample and the negative sample respectively, and \(m\) is a fixed margin value.
### Technical details of the sentence compression part
- **General idea**: Semantic representations of words and sentences are learned through embedding techniques, and two main methods are proposed: auto - encoder networks and ranking networks.
- **Auto - encoder networks**: Try to learn a compact representation by reconstructing the input sentence.
- **Ranking networks**: Optimize the representation by learning the relative order between sentences.
In conclusion, this paper aims to solve the problems of machine translation and sentence compression in natural language processing through neural network embedding techniques, thereby improving the performance of automated tools in these tasks.