How to Measure Text Readability?

Readability is the ease with which a reader can understand a written text, which accordingly indicates how effectively the text will reach the target audience.

The readability of a text depends on its content (the complexity of its vocabulary and syntax), its presentation (typographic aspects such as font size, line height, and line length), and factors related to the reader (experience, interest, and motivation). However, the most important factors affecting readability are those related to the text itself.

If you are already familiar with how readability can be measured for Arabic content, you can jump directly to our next article to discover how we at Almeta have implemented this feature.

Why Measure Readability?

Measuring readability serves several goals:

  • The easier a text is to read and the clearer the ideas it contains, the more likely it is to attract and retain the attention of the reader.
  • Readability has been widely used in education to write and select books and assessments appropriate for students’ level.
  • It has been widely used in industry to write manuals and user instructions at a language level appropriate for average end users.
  • Several official agencies require their forms to meet a specific readability level in order to improve the spread of information among members of society, especially those with a lower level of education and limited literacy.
  • In medicine, the readability of instructions and other important documents, such as consent forms, is considered vital to ensure better medical treatment and accountability towards patients and their families.
  • Many researchers use text readability in web applications and information retrieval systems, where priority is given to displaying pages that best match a user’s reading level.

How to Measure Readability?

Judging readability aims to measure the grade level a person must have to read and comprehend a text. The approaches described in the research literature can be divided into two types: traditional approaches and data-driven approaches.

Traditional Approach

The common traditional approach to predicting readability is the use of readability formulas, which measure certain features of a text based on mathematical calculations. These measures rely on a handful of factors that tend to be simple and clear, and that can be divided into language-dependent factors (e.g., the number of syllables in a word) and language-independent factors (e.g., the mean number of words per sentence or the mean number of characters per word).
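
As a rough sketch of how the language-independent factors could be extracted, the Python snippet below counts words and sentences with simple regular expressions. The function name surface_features, the set of sentence terminators, and the definition of a word as a run of letters or digits are assumptions made for illustration; they do not come from any published formula.

```python
import re

# Assumed sentence terminators: ".", "!", "?" and the Arabic question mark (U+061F).
SENTENCE_END = re.compile(r"[.!?\u061F]+")
# Assumed word definition: a run of Unicode letters/digits (\w is Unicode-aware).
WORD = re.compile(r"\w+")


def surface_features(text):
    """Return simple language-independent counts and averages for `text`."""
    # Keep only the fragments that contain at least one word.
    sentences = [s for s in SENTENCE_END.split(text) if WORD.search(s)]
    words = WORD.findall(text)
    if not words:
        raise ValueError("text must contain at least one word")
    return {
        "words": len(words),
        "sentences": len(sentences),
        "avg_words_per_sentence": len(words) / len(sentences),
        "avg_chars_per_word": sum(len(w) for w in words) / len(words),
    }


features = surface_features("The cat sat. Did it run?")
print(features["words"], features["sentences"])
```

In this sketch, combining marks are not treated as word characters, so diacritized Arabic text would probably need its diacritics stripped before counting.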

What about Arabic?

Over 200 mathematical formulas have been published to assess text readability in different languages, but only a limited amount of research has been conducted on the Arabic language.

Applying Formulas from Other Languages to Arabic Text

The most common formulas that are used to measure the readability of Arabic are:

The Al-Heeti [1] readability formula gives the grade level required to comprehend the text. It uses the average word length in characters as its only feature.

Heeti = (AWL × 4.414) – 13.468

Where:

AWL: Average word length in the text, in characters.

The formula does not work as a good indicator of Arabic text readability, especially given that Arabic is a highly inflectional and derivational language and word length by itself does not reflect difficulty.
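
The calculation above can be sketched in Python as follows. The function name heeti_grade and the decision to pass in pre-tokenized words are illustrative assumptions; only the two constants come from the formula.

```python
def heeti_grade(words):
    """Grade level from the formula above; `words` is a list of word strings."""
    if not words:
        raise ValueError("need at least one word")
    # AWL: average word length in characters.
    awl = sum(len(w) for w in words) / len(words)
    return awl * 4.414 - 13.468


print(heeti_grade(["abcde"]))
```

With an average word length of 3 characters the formula gives 3 × 4.414 – 13.468, which is about -0.226. A negative grade level is one illustration of how fragile a single-feature formula can be.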

The ARI [2] readability formula gives the U.S. grade level needed to comprehend the text.

ARI Grade Level = (4.71 × ACW) + (0.5 × AWS) – 21.43

Where:

ACW: Average number of characters per word.

AWS: Average number of words per sentence.
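
A minimal Python sketch of this calculation is shown below, assuming the character, word, and sentence counts have already been obtained. The function name ari_grade and its parameter names are hypothetical.

```python
def ari_grade(num_chars, num_words, num_sentences):
    """ARI grade level computed from raw counts, following the formula above."""
    if num_words <= 0 or num_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    acw = num_chars / num_words      # ACW: average characters per word
    aws = num_words / num_sentences  # AWS: average words per sentence
    return 4.71 * acw + 0.5 * aws - 21.43


print(ari_grade(num_chars=500, num_words=100, num_sentences=5))
```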

The LIX [3] readability formula produces a score that indicates the difficulty of a given text according to the following scale:

Score            Meaning
0 – 24           Very easy
25 – 34          Easy
35 – 44          Standard
45 – 54          Difficult
55 and above     Very difficult

It is calculated as follows:

LIX = W/S + 100 × WD/W

Where:

W: Number of words in the text.

S: Number of sentences in the text.

WD: Number of difficult words in the text, where difficult words are defined as words consisting of more than six letters.
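
The score and its interpretation can be sketched in Python as below. The function names are hypothetical, and because the score bands above are given as whole numbers, the handling of fractional scores (for example, treating 24.5 as "Very easy") is an assumption of this sketch.

```python
def lix_score(words, num_sentences):
    """LIX = W/S + 100 * WD/W, where WD counts words longer than six letters."""
    if not words or num_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    difficult = sum(1 for w in words if len(w) > 6)
    return len(words) / num_sentences + 100 * difficult / len(words)


def lix_label(score):
    """Map a LIX score to the difficulty bands listed above."""
    bands = [(25, "Very easy"), (35, "Easy"), (45, "Standard"), (55, "Difficult")]
    for upper_bound, label in bands:
        if score < upper_bound:
            return label
    return "Very difficult"


# Ten words in two sentences, two of them longer than six letters.
sample_words = ["a"] * 8 + ["abcdefg"] * 2
score = lix_score(sample_words, num_sentences=2)
print(score, lix_label(score))
```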

These formulas were selected for their simplicity and, more importantly, because their parameters can easily be applied to Arabic texts. They were also chosen because they do not use language-dependent features such as the number of syllables in a word.

The authors of [4] tested how reliably the previously mentioned formulas measure the readability of Arabic text. Their results showed that:

  1. Even readability formulas that contain no language-dependent features are themselves language-dependent. Therefore, their constants should be adjusted in order to adapt them to the Arabic language.
  2. Average sentence length is a good indicator of Arabic text readability.
  3. The Al-Heeti formula relies on a single factor, the average word length, which is unreliable on its own for assessing the readability of Arabic text.
  4. LIX produced the most acceptable results among these formulas for Arabic text.

For these reasons, a better technique for measuring Arabic text readability is needed.

Arabic-based Readability Formulas

Two readability formulas have been designed specifically for Arabic: AARI and OSMAN.

1. AARI Metric

AARI [5] readability formula:

AARIBase = (3.28 × NOC) + (1.43 × ACW) + (1.24 × AWS)

Where:

NOC: Number of characters.

ACW: Average number of characters per word.

AWS: Average words per sentence.

The authors showed in their experiments that their formula outperforms the previously mentioned non-Arabic-based formulas.
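
The AARIBase expression above can be evaluated with a few lines of Python. This is only a sketch: the function name aari_base is hypothetical, and NOC is taken here as the total character count of the text, following the definition given above.

```python
def aari_base(noc, acw, aws):
    """AARIBase from the formula above.

    noc: number of characters (assumed to be the total for the text)
    acw: average number of characters per word
    aws: average number of words per sentence
    """
    return 3.28 * noc + 1.43 * acw + 1.24 * aws


print(aari_base(noc=100, acw=5, aws=10))
```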

2. OSMAN Metric

The OSMAN [6] readability formula is an open-source Arabic text readability metric written in Java. Its calculation uses Mishkal to diacritize the Arabic text so that word syllables can be extracted. This reliance on diacritized text is one of the weak points of the formula: Arabic texts are often not diacritized, and adding diacritics automatically can introduce errors. The authors also introduced a set of frequently misspelled letters, which they refer to as “Faseeh”. Misspelling those letters could result in prosaic Arabic (“Rakeek”, meaning weak pronunciation) and therefore affect text readability.

OSMAN = 200.791 – (1.015 × A/B) – 24.181 × (C/A + D/A + G/A + H/A)

Where:

A: the total number of words.

B: the total number of sentences.

C: the total number of hard words (words with more than 5 letters).

D: the number of syllables per word.

G: the number of complex words (words with more than 4 syllables).

H: the number of “Faseeh” words (complex words with any of the “Faseeh” letters)

Here too, the authors showed in their experiments that their formula outperforms the previously mentioned non-Arabic-based formulas. However, they did not compare their results with those of the AARI metric.
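
Given the six quantities A to H, the OSMAN score itself is a single expression. The Python sketch below assumes those quantities have already been computed; it does not perform diacritization or syllable extraction, and it is not the authors' Java implementation. The function and parameter names are hypothetical, and D is passed through exactly as the formula uses it.

```python
def osman_score(a_words, b_sentences, c_hard, d_syllables, g_complex, h_faseeh):
    """OSMAN score from precomputed quantities, following the formula above."""
    if a_words <= 0 or b_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    # Sum of the four ratios C/A, D/A, G/A and H/A.
    ratios = (
        c_hard / a_words
        + d_syllables / a_words
        + g_complex / a_words
        + h_faseeh / a_words
    )
    return 200.791 - 1.015 * (a_words / b_sentences) - 24.181 * ratios


print(osman_score(a_words=100, b_sentences=10, c_hard=10,
                  d_syllables=20, g_complex=5, h_faseeh=5))
```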

Corpus-Based Arabic Formula

The authors of [7] applied a corpus-based approach to build an Arabic readability formula, based on the claim that frequently used words are usually easier than rarely used ones.

They used the King Abdulaziz City for Science and Technology Arabic Corpus to characterize the state of the Arabic language. In the corpus, the word with the highest frequency is ranked last, so the ranking is reversed to place the easiest word first, and so on. The difficulty level derived from this ranking is then taken into account when the mean is computed.
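
The general idea of frequency-based ranking can be illustrated with the short Python sketch below. It is not the formula from [7]: the function names, the use of the mean rank as the difficulty score, and the rank assigned to words missing from the corpus are all assumptions made for illustration.

```python
from collections import Counter


def build_ranks(corpus_words):
    """Rank words by descending corpus frequency: rank 1 = most frequent = easiest."""
    counts = Counter(corpus_words)
    return {
        word: rank
        for rank, (word, _count) in enumerate(counts.most_common(), start=1)
    }


def mean_rank(text_words, ranks):
    """Average rank of the words in a text; a higher mean suggests a harder text."""
    if not text_words:
        raise ValueError("need at least one word")
    # Assumption: words not found in the corpus get the worst possible rank.
    unknown_rank = len(ranks) + 1
    return sum(ranks.get(w, unknown_rank) for w in text_words) / len(text_words)


# Toy reference corpus, used only to show the mechanics.
ranks = build_ranks(["a", "a", "a", "b", "b", "c"])
print(mean_rank(["a", "b"], ranks), mean_rank(["c", "z"], ranks))
```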

Traditional Approach Limitations

According to [4], extensive research has shown that the popular readability formulas are not 100% accurate; they provide only a “rough estimate” of the readability of a text.

Data-Driven Approach

Machine learning-based methods treat the problem as a classification problem. They require a training set in which each example is annotated with the class it belongs to, where the classes are usually grade levels. A commonly used data source for Arabic and other languages is GLOSS, which contains only 230 annotated MSA texts.
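
To make the idea of learning from annotated examples concrete, here is a deliberately small Python sketch of a nearest-centroid classifier. The choice of classifier, the two features (average word length and average sentence length), and the toy training values are all assumptions for illustration; they are not taken from any of the cited work or from GLOSS.

```python
import math


def train_centroids(examples):
    """examples: list of (feature_vector, label) pairs; returns label -> mean vector."""
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest (Euclidean distance) to `features`."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Toy training set: (avg word length, avg sentence length) -> hypothetical level.
training = [
    ((3.0, 8.0), "easy"),
    ((3.4, 10.0), "easy"),
    ((5.5, 22.0), "hard"),
    ((6.0, 25.0), "hard"),
]
centroids = train_centroids(training)
print(predict(centroids, (3.2, 9.0)))
```

A real system would use many more features and examples, and the features would normally be scaled so that no single one dominates the distance.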

Conclusion

In this post, we discussed the task of measuring text readability in NLP, its benefits, and methods for implementing it. We presented one approach based on predefined formulas and another based on machine learning.

We hope you have enjoyed this discussion. If you are interested, you can read our next article to see how these metrics work in the real world with real Arabic content.

Did you know that we use this and other AI technologies in our app? See what you have just read applied in action: try our Almeta News app. You can download it from Google Play: https://play.google.com/store/apps/details?id=io.almeta.almetanewsapp&hl=ar_AR

References

[1]. Flesch, Rudolf Franz. “A new readability yardstick.” The Journal of applied psychology 32 3 (1948): 221-33.

[2]. Smith, E. Anthony and R. J. Senter. “Automated readability index.” AMRL-TR. Aerospace Medical Research Laboratories (1967): 1-14.

[3]. Kootstra, G. “Project on exploratory Factor Analysis applied to foreign language learning.” Accessed via: http://www.let.rug.nl/~nerbonne/teach/remastats-meth-seminar/Factor-Analysis-Kootstra-04.PDF. 2004.

[4]. Al-Ajlan, Amani A. et al. “Towards the development of an automatic readability measurements for arabic language.” 2008 Third International Conference on Digital Information Management (2008): 506-511.

[5]. Tamimi, Abdel Karim Al et al. “AARI: automatic arabic readability index.” Int. Arab J. Inf. Technol. 11 (2014): 370-378.

[6]. El-Haj, Mahmoud, and Paul Rayson. “OSMAN―A Novel Arabic Readability Metric.” Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). 2016.

[7]. Daud, Nuraihan Mat, Haslina Hassan, and Normaziah Abdul Aziz. “A corpus-based readability formula for estimate of Arabic texts reading difficulty.” World Applied Sciences Journal 21 (2013): 168-173.

Further Reading

[1]. Nassiri, Naoual, Abdelhak Lakhouaja, and Violetta Cavalli-Sforza. “Modern standard arabic readability prediction.” International Conference on Arabic Language Processing. Springer, Cham, 2017.
