# Automatically Tagging Data for Content Informativity Scoring

In this task, the goal is to assign a given piece of text a tag (or number) representing how informative or detailed it is, usually by training a model to do so.

Here, we rely on the intuitive idea suggested in [1]: in a news article, the lead paragraph usually serves as a good summary, especially if the author uses the inverted pyramid style. Lead paragraphs that use this style are therefore more informative than those that do not. Based on this, the authors suggested that a corpus for informativeness prediction can be generated automatically from a corpus of human-summarized news articles, simply by comparing the human-written summary with the lead paragraph and assigning the lead a tag (informative/creative) based on their similarity.

## What Constitutes a Lead Paragraph

The boundaries of the lead paragraph are not clear across the different datasets, and there is no simple separator; the only assumption we can make is that the lead marks the start of the article. Based on this, we decided to take the first N sentences of the article as the lead paragraph. The value of N was chosen empirically using the elbow method: we observed the change in the overall standard deviation of the similarity (averaged across the 5 metrics) for lead sizes between 1 and 10 sentences, and found that 4 sentences gave the highest relative reduction in the overall standard deviation without impacting the relative mean similarity.
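The selection of N described above can be sketched roughly as follows. This is a minimal illustration, not our actual code: the helper names (`take_lead`, `std_per_lead_size`, `pick_lead_size`) are hypothetical, the numbers are made up, and the sketch only looks at the relative drop in standard deviation, leaving out the check on the mean similarity.

```python
from statistics import pstdev


def take_lead(sentences, n):
    """Return the first n sentences of an article as its lead."""
    return sentences[:n]


def std_per_lead_size(scores_by_n):
    """Standard deviation of the similarity scores for each lead size."""
    return {n: pstdev(scores) for n, scores in scores_by_n.items()}


def pick_lead_size(std_by_n):
    """Pick the lead size with the largest relative std drop versus the previous size."""
    sizes = sorted(std_by_n)
    reductions = {
        n: (std_by_n[prev] - std_by_n[n]) / std_by_n[prev]
        for prev, n in zip(sizes, sizes[1:])
    }
    return max(reductions, key=reductions.get)


# Made-up standard deviations, for illustration only (not our measurements).
example_std = {1: 0.30, 2: 0.28, 3: 0.26, 4: 0.20, 5: 0.19, 6: 0.185}
best_n = pick_lead_size(example_std)
```

With these made-up values the largest relative drop happens when moving from 3 to 4 sentences, so the sketch selects a lead size of 4.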

## What Similarity Metric to Use

In our previous article, we went with a simple similarity metric for our dataset selection process: Jaro-Winkler similarity, a normalized text similarity function similar to edit distance. We are now using a different similarity function based on [1]: the score is computed as the fraction of words in the lead that also appear in the summary.
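As a rough sketch, the overlap score could be computed as below. The function name `lead_summary_overlap` is hypothetical, and the sketch assumes plain whitespace tokenization and counts every word occurrence in the lead; a real pipeline for Arabic would likely normalize and tokenize the text more carefully.

```python
def lead_summary_overlap(lead, summary):
    """Fraction of the words in the lead that also appear in the summary."""
    lead_words = lead.split()  # assumption: whitespace tokenization
    if not lead_words:
        return 0.0
    summary_words = set(summary.split())
    hits = sum(1 for word in lead_words if word in summary_words)
    return hits / len(lead_words)


score = lead_summary_overlap("the cat sat on the mat", "the cat slept")
```

In this example three of the six lead words ("the", "cat", "the") appear in the summary, so the score is 0.5.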

## What Dataset to Use

We have carried out a detailed exploration of the available Arabic summarization datasets; you can read more about it in our previous post, Summarization for Informativeness. For now, we will detail the datasets we are currently using:

### Kalimat

Kalimat is a very large (20K articles) multi-purpose dataset whose original documents are scraped from several newspaper websites. It covers several features, including morphological analysis, single- and multi-document summarization, and NER, as well as the topic. The summaries here are extractive summaries generated automatically and are thus of lower quality than those of other datasets. The dataset is composed of 6 genres of articles (‘Culture’, ‘Economy’, ‘International’, ‘Local’, ‘Religion’, ‘Sports’).

To create an informativeness dataset from it we carried out the following steps:

• Filtering by compression ratio: There is a large number of very short articles for which the summary is basically a permuted version of the original text. To resolve this, we selected only articles with a compression ratio greater than 0.6, where the compression ratio is given by:

$CR = 1 - \frac{\text{numWordsCompressed}}{\text{numWordsOriginal}}$

• Filtering by similarity: It is reasonable to expect that our indirect annotations will be noisy. To obtain cleaner data for training our model, for each genre, we only use the leads with scores that fall below the 20th percentile or above the 80th percentile.
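The two filtering steps above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, words are counted by whitespace splitting, the percentiles use the "inclusive" interpolation of Python's `statistics.quantiles` (other percentile definitions give slightly different cut points), and we assume that low-similarity leads get the "creative" tag and high-similarity leads the "informative" tag.

```python
from statistics import quantiles


def compression_ratio(original, summary):
    """CR = 1 - numWordsCompressed / numWordsOriginal (whitespace word counts)."""
    n_original = len(original.split())
    if n_original == 0:
        raise ValueError("original text is empty")
    return 1 - len(summary.split()) / n_original


def keep_by_compression(articles, min_cr=0.6):
    """Keep the (original, summary) pairs whose compression ratio exceeds min_cr."""
    return [(o, s) for o, s in articles if compression_ratio(o, s) > min_cr]


def tag_extremes(scores):
    """Tag the leads of ONE genre; returns {index: tag} for the kept leads only."""
    cuts = quantiles(scores, n=5, method="inclusive")  # 20th, 40th, 60th, 80th
    p20, p80 = cuts[0], cuts[-1]
    tags = {}
    for i, score in enumerate(scores):
        if score < p20:
            tags[i] = "creative"      # assumption: low similarity -> creative
        elif score > p80:
            tags[i] = "informative"   # assumption: high similarity -> informative
    return tags


# Made-up similarity scores for one genre, for illustration only.
example_scores = [0.05, 0.10, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.90, 0.95]
example_tags = tag_extremes(example_scores)
```

Leads whose score falls between the two cut points are simply dropped, which is why this step shrinks the dataset so much. `tag_extremes` would be called once per genre, since the percentiles are computed per genre.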

The following table shows the results of the filtering process and the resulting dataset size:

Here are some of the issues we observed:

• The filtering process drastically reduced the size of the dataset.
• The distribution of the articles by similarity value is nearly the same for all genres except international news. In all of these genres, the distribution is clearly biased towards smaller similarity values, with the 80th percentile lower than 0.6. The following figure shows the distribution of the articles by similarity value for the culture genre (top) and the economy genre (bottom).
• For international news, we noted that the similarity values are concentrated mostly above 0.9 (the distribution is shown in the following graph); for this genre, the similarity value is therefore not significant.
• After randomly inspecting some of the results, we found some implausible tags. To investigate further, we manually annotated 5% of each genre to measure the accuracy of the automatic tags. The annotation was done using the lead paragraph only, with the goal of judging whether the lead follows the inverted pyramid style. The following table shows the results of this annotation process. Note that the religion genre was not considered.

Note that the overall results are not very good. Furthermore, the positive class had better results across all genres, while the negative class results were worse. This was caused by lower precision and is understandable given the inherent bias we saw in the distribution of the similarity values.

• Furthermore, the manual inspection showed that not all of the genres use language comparable to the news articles we process, most notably the religion genre and, to some extent, the culture and local news genres.
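For reference, comparing the automatic tags against the manual annotations can be sketched as below. This is a generic per-class precision/recall computation, not necessarily the exact procedure we used; the function name and the toy labels are hypothetical.

```python
def per_class_scores(manual, automatic, labels=("informative", "creative")):
    """Precision and recall of the automatic tags, treating manual tags as truth."""
    pairs = list(zip(manual, automatic))
    scores = {}
    for label in labels:
        true_pos = sum(1 for m, a in pairs if m == label and a == label)
        predicted = sum(1 for _, a in pairs if a == label)
        actual = sum(1 for m, _ in pairs if m == label)
        scores[label] = {
            "precision": true_pos / predicted if predicted else 0.0,
            "recall": true_pos / actual if actual else 0.0,
        }
    return scores


# Toy labels, for illustration only (not our annotation results).
manual_tags = ["informative", "informative", "informative", "creative"]
automatic_tags = ["informative", "informative", "creative", "creative"]
result = per_class_scores(manual_tags, automatic_tags)
```

In this toy example the "creative" class ends up with a precision of 0.5, because one of the two leads tagged "creative" automatically was judged informative by the annotator.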

### EASC

EASC is a small extractive dataset. It consists of 153 short articles extracted from Wikipedia and the Alwatan and Alrai newspapers, each with 5 manual reference summaries named A to E. Although most of the articles come from Wikipedia, nearly all of them adhere to the inverted pyramid scheme, so it is possible to use them to generate informativeness tags. However, the quality of the reference summaries varies, since they were generated using MTurk.

In this dataset, we only used the text-similarity filter, as it did not have the problem of long summaries seen in KALIMAT. Note that although the dataset does include genre annotations, the step was carried out on the whole dataset since it is already quite small. The following are the results of this step:

Note that the resulting dataset size is 62 samples in all cases. We can also see that all of the references showed a bias towards the negative class, with the 80th percentile lower than 0.6.

To ensure the quality of the resulting dataset, we manually inspected it and found that nearly all the leads were informative, since the inverted pyramid style is prominent in Wikipedia entries. We therefore do not believe the automated annotation scheme is viable for this dataset.

### MULTI-Ling 2011, 2013

Both of these datasets are too small for us to carry out this analysis. Through manual examination of the 30 leads found in these 2 datasets, we found that all of them are informative, since they are also Arabic Wikipedia entries.

## Conclusion

Overall, we found that using summarization datasets for informativeness training in Arabic is problematic. The EASC and MULTI-Ling datasets can be safely considered informative, but that leaves us without non-informative samples. Using the KALIMAT dataset for this purpose is risky, and even if it is considered as an option, the only genres that can be used are international and economy. Even in that case, the resulting dataset would still be too small.


# References

[1] Y. Yang and A. Nenkova, “Detecting information-dense texts in multiple news domains,” in Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.