Most of us are not good writers, and if you are like me, you may have sometimes struggled to communicate your ideas in writing.
The emerging field of automated paraphrasing can provide a solution to this problem. With an ideal paraphrasing system, you would write your essay in your own, not-very-great style and then hand it to the system, which would modify your text to make it shorter, more formal, or less biased.
In this post, we will explore, in an admittedly shallow manner, how you can build your own paraphrasing system.
What is a Paraphrase
The concept of paraphrasing is most generally defined in terms of semantic equivalence (having the same meaning): a paraphrase is an alternative form in the same language expressing the same meaning as the original. These paraphrases may occur at several levels.
- Words having the same meaning are usually referred to as lexical paraphrases or, more commonly, synonyms, for example (hot, warm) and (eat, consume). However, this concept can also include hyperonymy, where one of the words is either more general or more specific than the other, for example (reply, say) and (landlady, hostess).
- Phrasal paraphrase refers to phrases sharing the same meaning. Although these phrases usually form full syntactic phrases like (work on, soften up) and (take over, assume control of), they may also be patterns with linked variables, for example (Y was built by X, X is the creator of Y).
- Two sentences that convey the same semantic content are termed sentential paraphrases, for example (I finished my work, I completed my assignment). Although it is possible to generate very simple sentential paraphrases by substituting words and phrases in the original sentence with their semantic equivalents, it is significantly more difficult to generate more interesting ones such as (He needed to make a quick decision in that situation, The scenario required him to make a split-second judgment).
What is It
The task of automated paraphrasing is basically an umbrella term covering several more specialized tasks. The most notable are:
- Paraphrase extraction: given a large corpus of text, the goal is to extract paraphrases, i.e. sentences, phrases, or words that have the same meaning. This is important for data generation and thus supports the rest of the tasks in this list.
- Sentence compression: given a sentence, the goal is to reduce its size, either by deleting unimportant parts or by merging words. See this list of examples of how sentence compression works in English from a linguistic point of view.
- Style transfer: following the success of style transfer in the domain of computer vision and image processing, the goal of style transfer in NLP is to alter a piece of text so that it follows a certain style of writing (e.g. imitating a famous writer, or sounding more or less formal).
- Text simplification: this can be considered a subtask of style transfer where the goal is to make the text simpler and more readable. Here is an example of this task:
- Original: Owls are the order Strigiformes, comprising 200 bird of prey species.
- Simplified: An owl is a bird. There are about 200 kinds of owls.
Why Should I Invest in It
Some of the main applications of an automated paraphrasing system include:
- Query and Pattern Expansion: the automatic generation of query variants for searching in an IR system (e.g. search engine)
Original: circuit details
Variant 1: details about the circuit
Variant 2: the details of circuits
- Expanding Sparse Human Reference Data: some tasks need human-generated output for either training or evaluation, most famously machine translation and abstractive summarization. Given the great cost of human labor, the ability to use paraphrasing to generate more synthetic data for either evaluation or training is very valuable.
- Improving the responsiveness of chatbots.
- Summarization: some paraphrasing approaches employ compression to generate the paraphrases; this can be helpful as a layer on top of an extractive summarizer.
How Can We Evaluate It
Similar to many NLG tasks, given a generated paraphrase and a reference paraphrase, we want to measure how accurate the generated paraphrase is relative to the reference. BLEU, ROUGE, and METEOR are usually used here. If you are not familiar with these metrics, please follow this gentle introduction.
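To make the metric side concrete, here is a toy, pure-Python sketch of the two quantities BLEU is built from: clipped n-gram precision and a brevity penalty. This is a simplification of the real metric (no smoothing, only up to bigrams); in practice you would use a library such as NLTK or sacrebleu.

```python
import math
from collections import Counter

def modified_ngram_precision(reference, hypothesis, n):
    """Clipped n-gram precision: hypothesis n-grams are credited at most
    as many times as they appear in the reference."""
    ref_ngrams = Counter(zip(*[reference[i:] for i in range(n)]))
    hyp_ngrams = Counter(zip(*[hypothesis[i:] for i in range(n)]))
    overlap = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
    return overlap / max(sum(hyp_ngrams.values()), 1)

def simple_bleu(reference, hypothesis, max_n=2):
    """Geometric mean of 1..max_n precisions times a brevity penalty."""
    precisions = [modified_ngram_precision(reference, hypothesis, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    brevity = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "i completed my assignment".split()
hyp = "i finished my assignment".split()
print(round(simple_bleu(ref, hyp), 3))  # → 0.5
```

Here the hypothesis matches 3 of 4 unigrams and 1 of 3 bigrams, so the score is sqrt(0.75 × 1/3) = 0.5.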
Get Me Some Data
In order to build any machine learning system, the main concern is having a large-enough and diverse-enough dataset for the model to learn from. In the case of paraphrasing, each data point is a pair of sentences/paragraphs that hold the same semantic content but are written in relatively different ways.
In this section, we list datasets specifically developed for the task of paraphrasing, as well as those that can be adapted to it with minimal effort.
For Multiple Languages
- PPDB is an automatically extracted database containing millions of paraphrases in 16 different languages, including Arabic. The goal of PPDB is to improve language processing by making systems more robust to language variability and unseen words. The entire PPDB resource is freely available here, but since the dataset is automatically generated, the quality can be low. This dataset includes phrasal-level paraphrases that are relatively short; here are some example pairs from the Arabic portion:
البرنامج الجارى لتامين ||| البرنامج الراهن لتامين
الجناة والضحايا : المساءلة ||| المجرمون والضحايا: المساءلة
الرسائل المقدمة بموجب المادة 22 ||| الرسائل الواردة بموجب المادة 22
- In , the authors build the first dataset for Arabic sentence compression; it is a very small dataset of fewer than 100 documents and is not publicly available yet.
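The `|||`-delimited pairs shown above can be loaded with a few lines of code. Note that full PPDB releases carry extra `|||`-separated fields (scores, alignments); the sketch below assumes plain two-field lines like the examples:

```python
def parse_pairs(lines):
    """Parse '|||'-separated paraphrase pairs, as in the examples above.

    Assumption: each line holds exactly the two phrases; real PPDB
    files have additional fields that would need to be skipped.
    """
    pairs = []
    for line in lines:
        fields = [f.strip() for f in line.split("|||")]
        if len(fields) >= 2 and all(fields[:2]):
            pairs.append((fields[0], fields[1]))
    return pairs

sample = ["hot ||| warm", "take over ||| assume control of"]
print(parse_pairs(sample))
# → [('hot', 'warm'), ('take over', 'assume control of')]
```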
Specifically for English
- Wikipedia offers a simplified version of its articles for English learners. Zhu et al. (2010) compiled a parallel corpus with more than 108K sentence pairs from 65,133 Wikipedia articles, allowing 1-to-1 and 1-to-N alignments; the latter type of alignment represents instances of sentence splitting. The original full corpus can be found here. This Simple Wikipedia is only available for English.
- Grammarly’s Yahoo Answers Formality Corpus (GYAFC) was constructed by identifying 110,000 informal responses containing between 5 and 25 words on Yahoo Answers. Each of these was then rewritten in more formal language by Amazon Mechanical Turk workers. However, this corpus covers only English.
- Another dataset of Shakespeare plays and their modern translations is also available. This corpus contains 17 plays and their modernizations from http://nfs.sparknotes.com, plus versions of eight of these plays from http://enotes.com. While the alignments appear to be mostly of high quality, they were produced using automatic sentence alignment, which may not perform the task as proficiently as a human. The dataset includes around 21K sentence pairs in total.
Utilizing Datasets from Other Tasks
- The data usually used in machine translation takes the form of pairs of sentences, where one sentence is from the source language, e.g. English, and the other is from the target language, say Arabic. In this case, for every source sentence there is only one target reference. However, some translation datasets, like this or this, include multiple references for the same source sentence, i.e. every example in the dataset looks like (E, F1, F2, F3), where F1-F3 are different translations of the same sentence E. It is reasonable to assume that different translations (references) constitute paraphrases of each other. The same logic applies to the summarization task, where multiple references are usually used to measure the performance of the model.
- The same idea can be extended to other tasks. For example, MSCOCO was originally an image captioning dataset containing over 120K images, with five different captions from five different annotators per image. All the annotations for one image describe the most prominent object or action in the image, which makes the dataset suitable for paraphrase generation (however, we know of no image captioning datasets in Arabic).
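Converting such multi-reference data into paraphrase pairs is a one-liner over combinations. The sketch below uses made-up placeholder strings in place of real translations or captions:

```python
from itertools import combinations

def references_to_paraphrases(examples):
    """Turn (source, [ref1, ref2, ...]) examples into paraphrase pairs.

    Each pair of references for the same source is assumed to be a
    paraphrase, per the reasoning above.
    """
    pairs = []
    for _source, refs in examples:
        pairs.extend(combinations(refs, 2))
    return pairs

# Placeholder data: one source sentence E1 with three references.
data = [("E1", ["F1a", "F1b", "F1c"])]
print(references_to_paraphrases(data))
# → [('F1a', 'F1b'), ('F1a', 'F1c'), ('F1b', 'F1c')]
```

Note that n references yield n·(n-1)/2 pairs, so five MSCOCO captions per image produce ten paraphrase pairs.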
How Can We Collect More Data
In this section, we explore various methods to build a dataset for paraphrasing models.
Bilingual Pivoting
Assuming two languages E and F, the basic assumption is that any two source strings e1 and e2 that both translate to the same reference string f1 have a similar meaning. In , the authors apply this approach to the EUROPARL machine translation dataset, specifically the English-Arabic translation pair. They first align the Arabic and English sentences to obtain sentence-to-sentence translation pairs, then find Arabic sentences that translate to the same English sentence (or to relatively similar sentences).
Using Automatic Machine Translation
The authors in  also suggest a second method using a machine translation dataset. In very simple terms: given translation data between, say, French and English, for every sentence pair (Fi, Ei), use machine translation to translate both sentences into Arabic. This generates two new Arabic sentences (A1, A2) that can be considered paraphrases.
The approach is depicted in the following figure:
The authors evaluated 200 pairs generated through this approach (which is very few considering that the generated dataset had over 2M pairs). They claim these sentences were found to be grammatical, or near-grammatical with minor mistakes, and highly similar in meaning. They were also sufficiently different in surface form, and thus usable as training data for a seq2seq paraphrasing model.
The main drawbacks of this method are:
- The possibility that the en→ar and fr→ar translation tools used were trained on the same Arabic data with similar model architectures and parameters, in which case the output would not be diverse enough to be useful as input to the paraphrasing system.
- The possibility that the French source was translated to English and then to Arabic.
- The low quality of the machine-translated sentences.
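Despite these drawbacks, the recipe itself is simple. The sketch below uses a tiny lookup table as a stand-in for a real machine translation system (an API call or a trained model in practice); the table exists only to make the example runnable:

```python
# Stand-in for a real MT system; these entries are invented.
TOY_MT = {
    ("fr", "ar", "fr sentence 1"): "ar translation of fr",
    ("en", "ar", "en sentence 1"): "ar translation of en",
}

def translate(text, src, tgt):
    """Placeholder for a real translation call."""
    return TOY_MT[(src, tgt, text)]

def pair_to_paraphrase(fr_sentence, en_sentence):
    """Translate both sides of an (fr, en) pair into Arabic; the two
    outputs are treated as a candidate paraphrase pair."""
    return translate(fr_sentence, "fr", "ar"), translate(en_sentence, "en", "ar")

print(pair_to_paraphrase("fr sentence 1", "en sentence 1"))
# → ('ar translation of fr', 'ar translation of en')
```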
One way to improve this approach is to start with a translation dataset that already includes Arabic sentences. For example, if we have an English-Arabic translation pair, only the English sentences need to be machine-translated to Arabic. This way we can gain two improvements:
- Lower effort in machine translation.
- Higher quality: by training the paraphrasing model (seq2seq) to take the machine-translated sentences as input and generate their human-written counterparts as output, we can ensure the fluency of the model's output.
Duplicate Questions/Posts on Online Community Sites
Many online community sites like Stack Exchange and Quora label duplicate questions in order to curate their content. These duplicate questions represent very close paraphrases; the Quora release alone contains over 400K question pairs labeled for duplication. However, we know of no site that follows this practice in Arabic.
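If you grab the Quora question-pairs release, extracting the paraphrases is straightforward. The sketch below assumes the tab-separated layout the dataset is commonly distributed in (column names as in that release); the sample rows are invented:

```python
import csv
import io

# A tiny in-memory sample mimicking the Quora question-pairs TSV layout.
SAMPLE = (
    "id\tqid1\tqid2\tquestion1\tquestion2\tis_duplicate\n"
    "0\t1\t2\tHow do I learn Python?\tWhat is the best way to learn Python?\t1\n"
    "1\t3\t4\tHow do I learn Python?\tHow old is the Earth?\t0\n"
)

def load_duplicates(tsv_text):
    """Keep only rows labeled as duplicates; these are our paraphrases."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(row["question1"], row["question2"])
            for row in reader if row["is_duplicate"] == "1"]

print(load_duplicates(SAMPLE))
# → [('How do I learn Python?', 'What is the best way to learn Python?')]
```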
Using Parallel Translations
In [Evaluating prose style transfer with the Bible], the authors utilize the various versions of the Bible as a source of paraphrases. The Bible is split into chapters and verses, where each verse is between one and six sentences long, and verses can be considered parallel across the various versions. Overall, there are 31,102 verses in the Bible. Each version is understood as embodying a unique writing style, and the versions in this corpus were created with a wide range of intentions: versions such as the Bible in Basic English were written to be simple enough to be understood by people with a very limited vocabulary, while others, like the King James Version, were written centuries ago and use very distinctive archaic language. The authors did not publish the whole dataset, mainly because of distribution licenses; however, they claim that re-scraping it is relatively simple.
There are 34 stylistically different versions of the English Bible (Old and New Testaments). In the case of Arabic, there are 8 versions, only 6 of which cover both testaments; two of these versions are available through Bible Gateway. The following table illustrates the variation between the Arabic Bible versions:
|Translation||Genesis 1:1–3 (التكوين)||John 3:16 (يوحنا)|
|Van Dyck (or Van Dyke)|
|في البدء خلق الله السماوات والأرض. وكانت الأرض خربة وخالية وعلى وجه الغمر ظلمة وروح الله يرف على وجه المياه. وقال الله: «ليكن نور» فكان نور.||لأنه هكذا أحب الله العالم حتى بذل ابنه الوحيد لكي لا يهلك كل من يؤمن به بل تكون له الحياة الأبدية.|
|Book of Life|
|في البدء خلق الله السماوات والأرض، وإذ كانت الأرض مشوشة ومقفرة وتكتنف الظلمة وجه المياه، وإذ كان روح الله يرفرف على سطح المياه، أمر الله: «ليكن نور» فصار نور.||لأنه هكذا أحب الله العالم حتى بذل ابنه الوحيد لكي لا يهلك كل من يؤمن به بل تكون له الحياة الأبدية.|
|Revised Catholic Translation (الترجمة الكاثوليكية المجددة)|
|في البدء خلق الله السموات والأرض وكانت الأرض خاوية خالية وعلى وجه الغمر ظلام وروح الله يرف على وجه المياه. وقال الله: «ليكن نور»، فكان نور.||فإن الله أحب العالم حتى إنه جاد بابنه الوحيد لكي لا يهلك كل من يؤمن به بل تكون له الحياة الأبدية.|
|في البدء خلق الله السماوات والأرض، وكانت الأرض خاوية خالية، وعلى وجه الغمر ظلام، وروح الله يرف على وجه المياه. وقال الله: «ليكن نور» فكان نور.||هكذا أحب الله العالم حتى وهب ابنه الأوحد، فلا يهلك كل من يؤمن به، بل تكون له الحياة الأبدية.|
|في البداية خلق الله السماوات والأرض. وكانت الأرض بلا شكل وخالية، والظلام يغطي المياه العميقة، وروح الله يرفرف على سطح المياه. وقال الله: «ليكن نور.» فصار نور.||أحب الله كل الناس لدرجة أنه بذل ابنه الوحيد لكي لا يهلك كل من يؤمن به، بل ينال حياة الخلود.|
|Easy-to-Read Version (النسخة سهل للقراءة)|
|في البدء خلق الله السماوات والأرض. كانت اﻷرض قاحلة وفارغة. وكان الظلام يلفّ المحيط، وروح الله تحوّم فوق المياه. في ذلك الوقت، قال الله: «ليكن نور.» فصار النور.||فقد أحبّ الله العالم كثيرا، حتى إنه قدّم ابنه الوحيد، لكي لا يهلك كل من يؤمن به، بل تكون له الحياة الأبدية.|
Finally, there are some dialectal translations of the Bible in the public domain, like this one. However, we are not sure whether they are available in an electronically accessible format.
How Can We Implement This Task
The approaches to this task can be roughly categorized into supervised, unsupervised, and ad-hoc.
In the supervised type, the task boils down to a machine translation task: given a relatively large parallel dataset in which the target sentences are paraphrases of the source sentences, train a translation model between them, for example, a model that translates between the Van Dyck and Book of Life versions of the Bible.
The two predominant approaches here are the same two used in machine translation: statistical machine translation, using frameworks like Moses, and neural machine translation following the encoder-decoder structure, where frameworks like OpenNMT can be of help.
The earliest work on paraphrase generation used statistical machine translation techniques to generate novel paraphrases . More recently, phrase-based statistical machine translation software was used to create paraphrases; in , the authors report a relatively high BLEU score of 50.4.
The Shakespeare dataset was used with a seq2seq model . Their results are impressive, showing improvements over statistical machine translation methods as measured by automatic metrics. They experiment with many settings, but to overcome the small amount of training data, their best models all require the integration of a human-produced dictionary that translates approximately 1,500 Shakespearean words to their modern equivalents.
In , the authors use both Moses phrase-based statistical translation and a neural seq2seq model. To overcome the limited size of the corpora when training the seq2seq model, they utilize the tagging trick from : they build a single seq2seq model that translates between the different versions in a "multi-lingual" fashion, accomplished by adding a small token at the start of the source sentence to indicate the target version. For example, if the target style for a verse pair is that of the American Standard Version, the source sentence starts with an 'ASV' token. They report BLEU scores as high as 71 for the statistical machine translation variant and 52 for the seq2seq model. The following figure shows some of the results.
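The tagging trick itself is trivial to implement: prepend a token naming the target version to each source verse before training. The token format and version abbreviation below are illustrative, not taken from the paper's code:

```python
def add_style_token(source, target_style):
    """Prepend a target-style token, mimicking the multilingual tagging
    trick described above (the exact token format is illustrative)."""
    return f"<{target_style}> {source}"

def tag_corpus(triples):
    """triples: (source_verse, target_verse, target_style) examples;
    returns (tagged_source, target) pairs ready for seq2seq training."""
    return [(add_style_token(src, style), tgt) for src, tgt, style in triples]

toy = [("In the beginning God created the heaven and the earth.",
        "In the beginning God made the heaven and the earth.", "BBE")]
print(tag_corpus(toy)[0][0])
# → <BBE> In the beginning God created the heaven and the earth.
```

At inference time, the same model can then produce any target style simply by changing the token on the input.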
The main bottleneck in many of the paraphrasing tasks is the lack of parallel datasets to train a supervised model, especially for tasks like style transfer, where the target of the training can be specialized (following a certain style like Shakespeare's, or a certain sentiment, positive vs. negative).
Most of the available methods for tasks like style transfer are unsupervised, meaning that it is possible to train such models without any parallel data. This is an amazing list of papers and code-bases in this field.
- One of the main methods used to generate more synthetic data, and a baseline for comparison with the other methods, is lexical synonym replacement. Basically, for some words in the article (mostly adjectives and nouns), the system automatically replaces the word with one of its synonyms from lexicons like WordNet; datasets like the aforementioned PPDB can also help since they contain paraphrases at the lexical, phrasal, and syntactic levels. To get an idea of the performance of such an approach, you can try some of the online tools that use it, like this, this, or this.
- The sentence compression literature includes several rule-based methods that delete or replace parts of a sentence in order to reduce its size. Some of the older approaches apply modifications to the parse tree of a sentence, employing rules in the form of a context-free grammar (CFG) that is either written by hand or learned stochastically; see the following figure for an example. We cannot cover these approaches in detail within the scope of this post, but further details can be found in this amazing summary .
- Another line of work is sentence fusion , , a task heavily used in multi-document summarization. This approach is used in QuillBot; check it out to get an idea of the performance of these systems. In very simple terms, the main approach is to cluster sentences into semantically similar clusters and then generate a single sentence for each cluster. For example, in , the main approach is to build a word graph from the cluster's sentences and then find the shortest path in this graph to create the fused sentence. For the following two sentences, the resulting graph is shown in the figure below; again, covering this task in detail is beyond the limits of this post, and the reader is invited to explore the references.
- In Asia Japan's Nikkei lost 9.6% while Hong Kong's Hang Seng index fell 8.3%.
- Elsewhere in Asia Hong Kong's Hang Seng index fell 8.3% to 12,618.
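To give a flavor of the word-graph idea, here is a heavily simplified sketch: shared words become shared nodes, and a breadth-first search finds the shortest start-to-end path as the fused sentence. Real word-graph systems add stopword handling, disambiguation of repeated words, and path re-ranking; none of that is modeled here.

```python
from collections import deque

def build_word_graph(sentences):
    """Adjacency map over words, with identical words merged into one
    node (a much-simplified version of a fusion word graph)."""
    graph = {}
    for sentence in sentences:
        tokens = ["<START>"] + sentence.split() + ["<END>"]
        for a, b in zip(tokens, tokens[1:]):
            graph.setdefault(a, set()).add(b)
    return graph

def shortest_fusion(graph):
    """BFS for the shortest <START>..<END> path = shortest fused sentence."""
    queue = deque([["<START>"]])
    seen = {"<START>"}
    while queue:
        path = queue.popleft()
        if path[-1] == "<END>":
            return " ".join(path[1:-1])
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Toy shortened variants of the example sentences above.
sents = ["Hang Seng index fell sharply", "the Hang Seng index fell 8.3%"]
print(shortest_fusion(build_word_graph(sents)))
```

The shortest path skips the optional leading "the", yielding a five-word fusion such as "Hang Seng index fell 8.3%".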
In this article, we have explored in a very broad manner the field of paraphrase generation. This is an extremely wide field with multiple sub-tasks. If you decide to build a paraphrasing system yourself, the following tips might help:
- There is a decent number of parallel datasets. Due to the nature of this task, it is also possible to reuse datasets built for other tasks like machine translation or image captioning.
- There are several ways to scrape datasets for this task, and some of them, like machine translation or bilingual pivoting, can generate large amounts of data.
- We also believe that further research can reveal other ways to collect data for this task.
- Methods based on supervised machine translation report high results in the literature. One curious phenomenon is that statistical machine translation models consistently outperform their neural counterparts throughout the reviewed literature; this can be attributed to the small size of the training data.
- Unsupervised NMT models, like the approaches used in style transfer, are very complicated. Yet the results reported in the literature are rather decent in comparison with the supervised approaches. Furthermore, most of them come with publicly available code-bases.
- Some of the ad-hoc methods like phrase substitution have their own pros and cons:
- They are very simple and easy to build and debug.
- Their quality depends on the quality of the lexicons used (like word2vec) or the paraphrase dataset (like PPDB); the main risk is that a chosen synonym is not suitable for the context.
- They only perform phrasal or lexical changes to the sentences.
The following table compares the aforementioned approaches:
|Approach||Cons||Pros|
|Neural machine translation using OpenNMT||Requires parallel data, which can be generated in various ways, or ready-made datasets like PPDB can be used. These models often depend on the domain of the training data, mainly because their vocabulary is limited, and can overfit horribly on new domains. Quality depends on the size and quality of the training dataset.||Relatively easy to train and optimize. Using the machine-translation data-generation approach, datasets can be built for a large number of domains. Can achieve high performance when large data is available.|
|Statistical machine translation||Also requires parallel data. Moses is not very simple to learn.||Results reported in the literature are higher.|
|Unsupervised methods||Relatively complicated and hard to understand. Their performance is lower than (yet comparable to) their supervised counterparts.||Several code-bases are available and can be used in a relatively off-the-shelf manner. Does not require parallel data, only non-parallel corpora of the two styles.|
|Phrase substitution||Results are often low; the generated sentences are templatic, and the substitutions may not suit the context.||Relatively simple to implement.|
Did you know that we use all this and other AI technologies in our app? Take a look at what you are reading now, applied in action. Try our Almeta News app. You can download it from Google Play: https://play.google.com/store/apps/details?id=io.almeta.almetanewsapp&hl=ar_AR
 J. Ganitkevitch, B. Van Durme, and C. Callison-Burch, “PPDB: The paraphrase database,” in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013, pp. 758–764.
 R. Belkebir and A. Guessoum, “TALAA-ASC: A sentence compression corpus for Arabic,” in 2015 IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA), 2015, pp. 1–8.
 F. Al-Raisi, A. Bourai, and W. Lin, “NEURAL SYMBOLIC ARABIC PARAPHRASING WITH AUTOMATIC EVALUATION,” Comput. Sci. Inf. Technol., vol. 1, 2018.
 C. Quirk, C. Brockett, and W. Dolan, “Monolingual machine translation for paraphrase generation,” in Proceedings of the 2004 conference on empirical methods in natural language processing, 2004, pp. 142–149.
 S. Wubben, A. Van Den Bosch, and E. Krahmer, “Paraphrase generation as monolingual translation: Data and evaluation,” in Proceedings of the 6th International Natural Language Generation Conference, 2010, pp. 203–207.
 H. Jhamtani, V. Gangal, E. Hovy, and E. Nyberg, “Shakespearizing modern language using copy-enriched sequence-to-sequence models,” ArXiv Prepr. ArXiv170701161, 2017.
 K. Carlson, A. Riddell, and D. Rockmore, “Evaluating prose style transfer with the Bible,” R. Soc. Open Sci., vol. 5, no. 10, p. 171920, 2018.
 M. Johnson et al., “Google’s multilingual neural machine translation system: Enabling zero-shot translation,” Trans. Assoc. Comput. Linguist., vol. 5, pp. 339–351, 2017.
 E. Pitler, “Methods for sentence compression,” 2010.
 M. T. Nayeem, T. A. Fuad, and Y. Chali, “Abstractive unsupervised multi-document summarization using paraphrastic sentence fusion,” in Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 1191–1204.
 K. Thadani and K. McKeown, “Supervised sentence fusion with single-stage inference,” in Proceedings of the Sixth International Joint Conference on Natural Language Processing, 2013, pp. 1410–1418.