How to detect cliches in text

In a previous article, we talked about the various factors that make an article more informative; using clichés was not one of them. This article is part of our research on measuring text informativeness. If you are interested, you can jump directly to the gist of our research or review its other parts.

You are still here? Cool! Let us start.

Clichés are overused, unoriginal expressions that appear in a context where something more novel might reasonably have been expected. They are predominantly multi-word expressions (MWEs) and are one aspect of the more general notion of formulaic language, which also includes:

  • idioms: phrases or expressions whose meaning can’t be understood from the ordinary meanings of the words in them, e.g. “Get off my back!” is an idiom meaning “Stop bothering me!”
  • binomials: expressions containing a pair of words joined by a conjunction (usually “and” or “or”), e.g. “rock and roll”, “more or less”
  • collocations: implicit constraints on which words combine with each other, e.g. “heavy rain” sounds natural while “strong rain” does not

Overall, clichés can be classified into three types based on how flexible their wording is:

  • Fixed expressions: “An eye for an eye”
  • Semi-fixed expressions: allow small changes, for example in prepositions: “Throw gas (gasoline) on (to) the fire”
  • Syntactically flexible expressions: “To rake someone over the coals” or “to haul someone over the coals”, both meaning to reprimand someone severely
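To make the three types concrete, here is a minimal sketch of how each could be matched with a regular expression. The patterns and the function name `find_cliches` are our own illustrative assumptions, not taken from any of the cited papers, and a real system would need far more patterns than these three.

```python
import re

# Hypothetical patterns, one per flexibility class from the list above.
PATTERNS = {
    # Fixed: the wording never changes.
    "fixed": re.compile(r"\ban eye for an eye\b", re.IGNORECASE),
    # Semi-fixed: the verb inflects, "gas"/"gasoline" and the preposition vary.
    "semi-fixed": re.compile(
        r"\b(?:throw(?:s|ing|n)?|threw)\s+gas(?:oline)?\s+"
        r"(?:onto|on to|on)\s+(?:the\s+)?fire\b",
        re.IGNORECASE,
    ),
    # Syntactically flexible: the verb inflects and the object slot is open
    # (here, one to three words between the verb and "over the coals").
    "flexible": re.compile(
        r"\b(?:rak(?:e|es|ed|ing)|haul(?:s|ed|ing)?)\s+"
        r"(?:\w+\s+){1,3}?over the coals\b",
        re.IGNORECASE,
    ),
}


def find_cliches(text):
    """Return (kind, matched_text) pairs in order of appearance."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), kind, match.group(0)))
    return [(kind, span) for _, kind, span in sorted(hits)]


sample = "She threw gasoline onto the fire, and the boss raked her over the coals."
matches = find_cliches(sample)
```

The more flexible the expression, the looser the pattern has to be, which is also where false matches start to creep in.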

Should writers “avoid cliches like the plague”?

In [1], the author collected a large set of Dutch (and Dutch-translated) novels, used crowdsourcing to rate their quality, and then studied the relationship between cliché usage and the ratings reported by readers. The following figure demonstrates the results. With a Pearson correlation coefficient of r = -0.32, there is a weak-to-moderate negative linear correlation (novels with more clichés tend to receive lower ratings). The p-value of 5.6e-10 means that a correlation this strong would be very unlikely to appear by chance if there were no real relationship, so the result is statistically significant.

image is taken from [1]
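To make the statistic concrete, here is a minimal sketch of how a Pearson coefficient is computed. The numbers are made up for illustration and are not the data from [1], and the function name `pearson_r` is ours. In practice, a library routine such as `scipy.stats.pearsonr` would also return the p-value.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up toy data: clichés per 10,000 tokens and a 1-5 reader rating.
cliche_density = [1, 2, 3, 4, 5]
rating = [5, 3, 4, 2, 1]

# A negative value means that more clichés go together with lower ratings.
r = pearson_r(cliche_density, rating)
```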

Furthermore, detecting clichés, and formulaic language more generally, matters well beyond the simple issue of writing quality. It can help in many other tasks, such as improving machine translation, since the ability to handle MWEs in any language-related task has a positive impact on the quality of its final output.

How Do Other People Do It? What Are Clichés Like, Linguistically?

In [1], the author studied several indicators associated with clichés, including:

  • Mean sentence length (number of tokens)
  • Common vocabulary: the percentage of tokens that are among the 3000 most common words in a large reference corpus
  • Direct speech: the percentage of sentences with direct speech punctuation, e.g. reported speech such as ‘he said that bla’ vs. direct speech such as ‘He: “bla”’
  • Compression ratio: the size of the text in bytes after compression divided by its uncompressed size

The study found that all of the above correlate significantly with the density of clichés. Basically, text that is dense in clichés stands out as having simpler language: it consists almost exclusively of common words, is more repetitive (a lower compression ratio indicates more repetitiveness), and contains shorter sentences.
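The four indicators are cheap to compute. Below is a minimal sketch under several assumptions of ours: a naive regex tokenizer and sentence splitter, `zlib` as the compressor, double quotation marks as the only direct speech punctuation, and a caller-supplied `common_words` set standing in for the 3000-word reference list. None of these choices come from [1].

```python
import re
import zlib


def text_indicators(text, common_words):
    """Compute four simple style indicators for a text (illustrative only)."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokens = re.findall(r"\w+", text.lower())
    if not sentences or not tokens:
        raise ValueError("text must contain at least one token")
    raw = text.encode("utf-8")
    quoted = sum(1 for s in sentences if re.search(r'["“”]', s))
    return {
        "mean_sentence_length": len(tokens) / len(sentences),
        "common_vocab_share": sum(t in common_words for t in tokens) / len(tokens),
        "direct_speech_share": quoted / len(sentences),
        # Lower means more repetitive; can exceed 1.0 for very short texts.
        "compression_ratio": len(zlib.compress(raw)) / len(raw),
    }


indicators = text_indicators(
    'It was a dark night. "Run!" she said. He ran.',
    common_words={"it", "was", "a", "she", "he", "said"},
)
```

Because the compression ratio is unreliable on very short inputs, these indicators make the most sense at the level of whole articles or chapters.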

Using Word Counts

In [2], the authors measured how clichéd a song is, as one factor in judging its quality, based on its lyrics and rhymes (the rhyming last word of each line, such as say, day, away). For lyrics they used n-grams, and for rhymes they used a ranked list of pairs such as (stay, away). They annotated a very small test set and then proposed several equations for a cliché metric for songs. Nonetheless, there are several issues with this work:

  • The test set is very small and was annotated by a single person, so it is subjective and under-representative.
  • The method isn’t applicable to our goal of determining how clichéd an arbitrary text is, since rhyming is not a typical feature of such texts.
  • Repetition in song lyrics motivated their n-gram score, but this is not a salient feature of the texts we consider.

In [3], the authors used n-gram frequencies to assess whether a text is clichéd. They found that in clichéd text the higher-order n-grams (3-, 4-, and 5-grams) tend to be more frequent; the accompanying figure illustrates the difference in n-gram distribution between clichéd and original text. However, as in [2], the authors used a very small test set to demonstrate their method.

Using Dictionaries and Lists

In [4], the authors try to detect Portuguese proverbs by collecting a closed list of them from sources like Wikipedia. The author of [1] also suggests using a large lexicon of formulaic language.

Formulaic Language Detection

As part of the effort to detect formulaic language, the authors of [5] empirically built a list of common collocations for Modern Standard Arabic by employing several metrics to rank bi-grams, including t-score, log-likelihood, etc.

Further efforts to detect longer sequences of formulaic language are discussed in [6] and [7]. In [7], the authors used ranked n-grams, in a similar manner to [5], to semi-automatically create a list of common Arabic collocations and long formulaic sequences.
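As an illustration of ranking bi-grams with an association metric, here is a minimal sketch of the t-score in its common corpus-linguistics form, t = (O - E) / sqrt(O), where O is the observed bi-gram count and E is the count expected if the two words were independent. The function name, the `min_count` filter, and the English toy sentence are our assumptions; this is not the pipeline used in [5].

```python
import math
from collections import Counter


def rank_bigrams_by_t_score(tokens, min_count=2):
    """Rank bi-grams by t-score, highest (most collocation-like) first."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), observed in bigrams.items():
        if observed < min_count:
            continue  # rare pairs give unreliable scores
        # Expected count if w1 and w2 occurred independently of each other.
        expected = unigrams[w1] * unigrams[w2] / n
        scores[(w1, w2)] = (observed - expected) / math.sqrt(observed)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


tokens = "heavy rain heavy rain heavy rain the cat the dog the bird the".split()
ranked = rank_bigrams_by_t_score(tokens)
```

The top of the ranked list is then reviewed by hand, which is what makes this kind of list building semi-automatic.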

Supervised Models

While it seems possible to annotate a large corpus for clichés and train a system to extract them, we didn’t come upon any study that did that.

How Can We Do It?

On the one hand, we can use word counts. Based on the results of [3], it is possible to use the following algorithm:

  • build histograms of the n-gram distributions of known original and known clichéd text
  • for each new text, compute the histogram of its higher-order n-grams
  • measure the distance between the text’s histogram and those of original and clichéd text (using KL divergence, for example), then choose the closest distribution and report a confidence.
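Here is one possible reading of these steps, sketched in Python. The binning scheme, the smoothing constant, the confidence formula, and all function names are our assumptions for illustration; they are not taken from [3].

```python
import math

BINS = 4  # bucket 0: unseen in reference, 1: count 1, 2: counts 2-3, 3: counts 4+


def frequency_histogram(tokens, reference_counts, n=3, smoothing=1e-3):
    """Distribution of a text's n-grams over reference-frequency buckets."""
    hist = [smoothing] * BINS  # smoothing keeps every bucket above zero
    for gram in zip(*(tokens[i:] for i in range(n))):
        count = reference_counts.get(gram, 0)
        bucket = 0 if count == 0 else min(BINS - 1, 1 + int(math.log2(count)))
        hist[bucket] += 1
    total = sum(hist)
    return [h / total for h in hist]


def kl_divergence(p, q):
    """KL(p || q) for two distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def classify(text_hist, cliched_hist, original_hist):
    """Pick the closer profile and return (label, confidence in [0, 1])."""
    d_cliched = kl_divergence(text_hist, cliched_hist)
    d_original = kl_divergence(text_hist, original_hist)
    label = "cliched" if d_cliched < d_original else "original"
    confidence = abs(d_cliched - d_original) / (d_cliched + d_original + 1e-12)
    return label, confidence


# Made-up profile histograms; real ones would be estimated from labelled text.
cliched_profile = [0.1, 0.1, 0.2, 0.6]
original_profile = [0.6, 0.2, 0.1, 0.1]
label, confidence = classify([0.15, 0.1, 0.25, 0.5], cliched_profile, original_profile)
```

KL divergence is asymmetric, so the direction chosen here (text against profile) is itself a design decision; a symmetric alternative such as Jensen-Shannon divergence would work in the same place.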

On the other hand, it is also possible to scrape or build a closed list of cliché expressions and detect them in text.
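A closed list can be applied with a simple longest-match scan over the tokens. The sketch below assumes a naive regex tokenizer and exact matching after lowercasing; the function name and the three-entry list are hypothetical, and a real list would hold thousands of entries.

```python
import re


def find_listed_cliches(text, cliche_list):
    """Return (token_index, phrase) for every listed cliché found in the text."""
    tokenize = lambda s: re.findall(r"[\w']+", s.lower())
    tokens = tokenize(text)
    phrases = {tuple(tokenize(c)) for c in cliche_list}
    phrases.discard(())  # ignore empty entries
    max_len = max((len(p) for p in phrases), default=0)
    hits, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so nested phrases are not double counted.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + length])
            if candidate in phrases:
                hits.append((i, " ".join(candidate)))
                i += length
                break
        else:
            i += 1
    return hits


cliche_list = ["like the plague", "at the end of the day", "the end"]
found = find_listed_cliches(
    "At the end of the day, we avoid clichés like the plague.", cliche_list
)
```

Exact matching like this only handles fixed expressions; semi-fixed and flexible ones need patterns or lemmatization on top.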

The question of using closed lists vs. n-gram distributions is basically a precision-recall trade-off. A manually curated list may have limited recall, but will yield higher precision (i.e., it will produce fewer false positives). Moreover, the n-gram technique cannot be used to detect whether a particular set of clichés is present in a large text, and it cannot locate the clichés; the n-gram method is, therefore, coarse-grained.
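A toy calculation shows the trade-off. The phrase sets below are invented for illustration and the function name is ours; they are not results from any experiment.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a predicted set against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example: four clichés are really present in the text.
gold = {"at the end of the day", "like the plague", "think outside the box", "low-hanging fruit"}

# A short curated list finds few clichés, but everything it flags is correct.
list_hits = {"at the end of the day", "like the plague"}

# A looser frequency-based method finds more of them, along with false alarms.
loose_hits = {"at the end of the day", "like the plague", "think outside the box",
              "in the morning", "on the table", "one of the"}

list_scores = precision_recall(list_hits, gold)    # (1.0, 0.5)
loose_scores = precision_recall(loose_hits, gold)  # (0.5, 0.75)
```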

| Method | Pros | Cons |
|---|---|---|
| Closed list | Can detect the actual clichéd text; high precision | Needs constant updates; low to mid recall, depending on the list accuracy; might be hard to acquire, depending on the language |
| Word counts | Easier to apply | Depends on corpus size; low precision |
| Semi-automatic list building | Can combine the best of both worlds | Much harder to build; has the same shortcomings as a closed list |

Any Available Resources?

  • This site contains a comprehensive manually maintained list of common cliché expressions
  • This is a list of Arabic proverbs from Wikiquote with their translation to English

References:

[1] A. van Cranenburgh, “Cliché Expressions in Literary and Genre Novels,” in Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 2018, pp. 34–43.

[2] A. Smith, C. Zee, and A. Uitdenbogerd, “In your eyes: Identifying clichés in song lyrics,” in Australasian Language Technology Workshop 2012 (ALTW 2012), 2012, pp. 88–96.

[3] P. Cook and G. Hirst, “Automatically assessing whether a text is clichéd, with applications to literary analysis,” in Proceedings of the 9th Workshop on Multiword Expressions, 2013, pp. 52–57.

[4] A. P. Rassi, J. Baptista, and O. Vale, “Automatic detection of proverbs and their variants,” in 3rd Symposium on Languages, Applications and Technologies, 2014.

[5] A. A. O. Alghamdi, E. Atwell, and C. Brierley, “An empirical study of Arabic formulaic sequence extraction methods,” in Proceedings of the LREC 2016, 2016, pp. 502–506.

[6] A. Alghamdi and E. Atwell, “Towards Comprehensive Computational Representations of Arabic Multiword Expressions,” in International Conference on Computational and Corpus-Based Phraseology, 2017, pp. 415–431.

[7] A. Alghamdi and E. Atwell, “Constructing a corpus-informed list of Arabic formulaic sequences (ArFSs) for language pedagogy and technology,” Int. J. Corpus Linguist., vol. 24, no. 2, pp. 202–228, 2019.
