What Makes an Article Informative and How Computers Can Measure Informativeness

The concept of an informative text is quite abstract, and it is hard to come up with a definitive formula to measure it. In this article we explore some of the features that we believe make an article more informative, and show how these features can be measured quantitatively.

This is our first piece on informativeness, but there is a lot more. If you are intrigued, you can jump directly to the gist or read the details in the individual articles:

Readability

This is based on the assumption that an informative article should be readable. Possible metrics include LIX, FOG, and others, and they are heavily used in the literature. We at Almeta have built our own readability measuring system; you can read more about it here.
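
As a flavour of how such metrics work, here is a minimal sketch of the LIX formula (average sentence length plus the percentage of long words); the crude regex tokenization is a simplification for illustration only, not how our production system works:

```python
import re

def lix(text: str) -> float:
    """Readability as LIX: average sentence length + percentage of long words (>6 chars)."""
    words = re.findall(r"[^\W\d_]+", text)                      # alphabetic tokens only
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)

print(lix("Max Mustermann has a best friend called Martin Muster. He is my best friend."))
```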

Skimmability

This is based on the assumption that an informative text can be skimmed easily, i.e. a text is informative when there are many pieces of information one can capture without knowing the author's context.

So for example: “He is my best friend” is less informative than “Max Mustermann has a best friend called Martin Muster”.

A simple metric for measuring informativity in a given text is the ratio of content words to non-content words.

Content words are nouns, proper nouns, verbs, and adjectives. Some definitions include adverbs and some prepositions, but a test showed that those were not useful. The content function ratio (CFR) is calculated like this:

CFR(text) = NumContentWords/NumFunctionWords
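
A minimal sketch of this ratio using spaCy part-of-speech tags (the `en_core_web_sm` model is just one example of a tagger; any POS tagger would do):

```python
import spacy

# assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# content words as defined above: nouns, proper nouns, verbs, and adjectives
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ"}

def content_function_ratio(text: str) -> float:
    tokens = [t for t in nlp(text) if t.is_alpha]
    content = [t for t in tokens if t.pos_ in CONTENT_POS]
    function = [t for t in tokens if t.pos_ not in CONTENT_POS]
    return len(content) / max(len(function), 1)   # avoid division by zero

print(content_function_ratio("He is my best friend."))
print(content_function_ratio("Max Mustermann has a best friend called Martin Muster."))
```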

Lexical Density

Lexical density is defined as the number of lexical words (or content words) divided by the total number of words.

Lexical words give a text its meaning and provide information regarding what the text is about. More precisely, lexical words are simply nouns, adjectives, verbs, and adverbs. Nouns tell us the subject, adjectives tell us more about the subject, verbs tell us what they do, and adverbs tell us how they do it.

Other kinds of words such as articles (a, the), prepositions (on, at, in), conjunctions (and, or, but), and so forth are more grammatical in nature and, by themselves, give little or no information about what a text is about. These non-lexical words are also called function words. Auxiliary verbs, such as “to be” (am, are, is, was, were, being), “do” (did, does, doing), “have” (had, has, having) and so forth, are also considered non-lexical as they do not provide additional meaning.

With the above in mind, lexical density is simply the percentage of words in the written (or spoken) language which give us information about what is being communicated.
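
Lexical density can be computed with the same machinery; a minimal sketch, again assuming a spaCy pipeline:

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # same pipeline as in the CFR sketch above

# lexical words per the definition above: nouns, adjectives, verbs, adverbs
# (spaCy tags auxiliary verbs as AUX, so they are excluded automatically)
LEXICAL_POS = {"NOUN", "PROPN", "ADJ", "VERB", "ADV"}

def lexical_density(text: str) -> float:
    tokens = [t for t in nlp(text) if t.is_alpha]
    if not tokens:
        return 0.0
    return sum(t.pos_ in LEXICAL_POS for t in tokens) / len(tokens)

print(lexical_density("Nouns tell us the subject and adjectives tell us more about it."))
```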

Lexical Richness (LR)

LR is a set of measures based on the lexical nature of the text, including:

  • Lexical richness itself: a measure of how many types (unique words and punctuation marks) occur in a given text relative to the number of tokens (individual words and punctuation marks). The higher the ratio, the more lexically rich the text, i.e. the more unique words (and punctuation marks) it contains compared to the total number it contains. A more advanced variant uses resources like WordNet to group synonyms into the same count;
  • Type-to-Token Ratio (TTR) and related metrics: the same idea applied at other levels of analysis, for example the ratio between the size of the POS-tag set and the vocabulary of the text; other metrics use more detailed POS tagging in order to find, say, the different tenses used, the usage of the passive voice, the number of nominal types, etc. (a minimal sketch of the plain type-token ratio is shown after this list);
  • Other LR metrics: it is hard to cover all of them here, but MTLD [2] and vocd [3] are among the most common.
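
The simplest of these, the plain type-token ratio, can be computed directly from a token list; a minimal sketch (tokenization is left to the caller):

```python
def type_token_ratio(tokens: list[str]) -> float:
    """Plain TTR: unique types over total tokens (higher = lexically richer)."""
    if not tokens:
        return 0.0
    return len(set(t.lower() for t in tokens)) / len(tokens)

tokens = "the cat sat on the mat and the dog sat on the rug".split()
print(type_token_ratio(tokens))   # 8 unique types / 13 tokens ≈ 0.62
```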

Term Informativeness

This task is based on the idea that some words carry more semantic content than others, which leads to the notion of term specificity, or informativeness.

The goal of this task is to assign every term a score representing its importance; [7], for example, describes a method for measuring term informativeness in context. You can get a better idea about this task by reading our special article about term informativeness.
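
As a very rough illustration (not the context-aware method of [7]), inverse document frequency can serve as a corpus-level proxy for term informativeness:

```python
import math
from collections import Counter

def idf_scores(documents: list[list[str]]) -> dict[str, float]:
    """IDF as a crude proxy for term informativeness: rarer terms score higher."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

docs = [
    "the reactor core reached criticality".split(),
    "the match was postponed because of rain".split(),
    "the core of the argument is simple".split(),
]
scores = idf_scores(docs)
print(scores["the"], scores["criticality"])   # frequent terms score low, rare terms high
```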

Metadiscourse Markers

Metadiscourse markers (such as ‘firstly’ and ‘in conclusion’) are very important since they refer explicitly to aspects of the organisation of a text or indicate a writer’s stance towards the text’s content or towards the reader.

Here we use the basic idea that if a text is better structured and properly interlinked, then it is both easier to follow and more informative.

They are important markers in more academic text styles, and the percentage of these markers can also help in measuring how well a text is structured. This is a good explanation of these markers; [4] is a more detailed one, [5] discusses their visualization, and [6] compares their usage in English and Arabic.
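
A simple way to quantify this is the share of words that belong to metadiscourse markers; a minimal sketch with a tiny, illustrative marker list (a real system would use a proper lexicon such as those discussed in [4]):

```python
import re

# a tiny illustrative list of metadiscourse markers, not an exhaustive lexicon
MARKERS = {"firstly", "secondly", "finally", "in conclusion", "however",
           "for example", "in other words", "therefore"}

def metadiscourse_ratio(text: str) -> float:
    """Share of words that belong to (possibly multi-word) metadiscourse markers."""
    lowered = text.lower()
    words = re.findall(r"[^\W\d_]+", lowered)
    if not words:
        return 0.0
    marker_words = 0
    for marker in MARKERS:
        marker_words += lowered.count(marker) * len(marker.split())
    return marker_words / len(words)

print(metadiscourse_ratio("Firstly, we define the task. In conclusion, it is measurable."))
```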

Text Denoising

This is based on the assumption that, in specialized topics (nuclear physics, genomics, …), the less readable passages (those with longer words and sentences) are the more informative ones, as they contain the details of the article.

This can be especially important for tasks related to keyword extraction, such as entity-level sentiment detection and relation extraction. Readability is measured here using the Fog Index and a Normalized Fog Index, as both focus on word length rather than other elements.
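
For reference, the standard Gunning Fog formula is:

Fog(text) = 0.4 × (NumWords/NumSentences + 100 × NumComplexWords/NumWords)

where complex words are words with three or more syllables (the Normalized Fog-Index rescales this score).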

The idea of text denoising is to build a filter that removes sentences whose readability is above a certain threshold, as these are deemed less helpful in tasks related to key-phrase generation; source code for the denoising procedure is provided by the author.
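
A minimal sketch of such a filter (the syllable counter is a rough heuristic and the threshold value is arbitrary, chosen for illustration only):

```python
import re

def count_syllables(word: str) -> int:
    """Very rough syllable heuristic: count groups of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fog_index(sentence: str) -> float:
    words = re.findall(r"[^\W\d_]+", sentence)
    if not words:
        return 0.0
    complex_words = [w for w in words if count_syllables(w) >= 3]
    # a single sentence, so average sentence length is simply len(words)
    return 0.4 * (len(words) + 100 * len(complex_words) / len(words))

def denoise(sentences: list[str], threshold: float = 15.0) -> list[str]:
    """Keep only the harder (higher-Fog) sentences, which carry the detail."""
    return [s for s in sentences if fog_index(s) >= threshold]

print(denoise(["It was nice.", "The isotopic composition determines the criticality margin."]))
```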

Cliché Detection

This is based on the assumption that clichéd text is not informative, and thus articles with fewer clichés should be more informative; the authors of [8] provide a method for detecting clichés automatically.

Text Coherence

This is based on the idea that a coherent text should be more informative. There is a full line of research in the NLP community on automating coherence assessment; [9] is a very good example.

Automated Essay Scoring

There is a very wide field of research on this task, and it encompasses many of the ideas mentioned above. The main drawbacks are the need for large training sets and the focus of current methods on neural models; see for example [11].

Supervised Informativeness Detection

This boils down to training a binary classifier to predict whether a text is informative or not and using its confidence as the metric:

  • Cons:
    • a large set of labelled data is needed;
    • the results will generally be domain-dependent;
    • sensitivity analysis is generally harder.
  • Pros:
    • performance should generally be better and more consistent.

The authors of [12] did exactly that on news articles and also suggested a method to automatically boost their training data.
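
A minimal sketch of this supervised setup (the toy labels, TF-IDF features, and logistic regression model are placeholders for illustration, not the configuration used in [12]):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy labelled data; a real system needs a large annotated corpus (see the cons above)
texts = [
    "Max Mustermann has a best friend called Martin Muster.",   # informative
    "He is my best friend.",                                    # not informative
    "The reactor reached criticality at 14:02 on Tuesday.",     # informative
    "It was really something, you know.",                       # not informative
]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# the classifier's confidence is used as the informativeness score
score = clf.predict_proba(["The meeting is scheduled for 3 pm in Berlin."])[0, 1]
print(score)
```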

Conclusion

In this article we tried to cover several aspects that mark a piece of text as informative, and how we can measure them in an information service. In our view, the only way to truly measure the informativeness of an article is to use several measurements as proxies for the more abstract idea of informativeness.

This was the first article in our informativeness series. If you are intrigued, you can jump directly to the gist or read the details in the individual articles:

Did you know that we use all these and other AI technologies in our app? Look at what you're reading now applied in action. Try our Almeta News app; you can download it from Google Play or Apple's App Store.

Further Reading

[1] R. Shams, “Identification of informativeness in text using natural language stylometry,” 2014.

[2] P. M. McCarthy, “An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD),” The University of Memphis, 2005.

[3] P. M. McCarthy and S. Jarvis, “vocd: A theoretical and empirical evaluation,” Lang. Test., vol. 24, no. 4, pp. 459–488, 2007.

[4] S. G. Sanford, “A comparison of metadiscourse markers and writing quality in adolescent written narratives,” 2012.

[5] D. Simsek, S. Buckingham Shum, A. Sandor, A. De Liddo, and R. Ferguson, “XIP Dashboard: visual analytics from automated rhetorical parsing of scientific metadiscourse,” 2013.

[6] A. H. Sultan, “A contrastive study of metadiscourse in English and Arabic linguistics research articles,” Acta Linguist., vol. 5, no. 1, p. 28, 2011.

[7] Z. Wu and C. L. Giles, “Measuring term informativeness in context,” in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013, pp. 259–269.

[8] P. Cook and G. Hirst, “Automatically assessing whether a text is clichéd, with applications to literary analysis,” in Proceedings of the 9th Workshop on Multiword Expressions, 2013, pp. 52–57.

[9] Z. Lin, H. T. Ng, and M.-Y. Kan, “Automatically evaluating text coherence using discourse relations,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, 2011, pp. 997–1006.

[10] C. Danescu-Niculescu-Mizil, G. Kossinets, J. Kleinberg, and L. Lee, “How opinions are received by online communities: a case study on amazon.com helpfulness votes,” in Proceedings of the 18th International Conference on World Wide Web, 2009, pp. 141–150.

[11] Y. Farag, H. Yannakoudakis, and T. Briscoe, “Neural automated essay scoring and coherence modeling for adversarially crafted input,” arXiv preprint arXiv:1804.06898, 2018.

[12] Y. Yang and A. Nenkova, “Detecting information-dense texts in multiple news domains,” in Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
