How to Detect Clickbait Headlines using NLP?

Clickbait is a type of hyperlink on a web page with a catchy or provocative headline that is difficult for most users to resist: it tells you exactly what you are about to see, with just enough of a tease at the end to intrigue you into clicking. In most cases, the content of these pages is disappointing compared to their headlines.

The aim of the clickbait detection task is to detect such clickbait pages. A related field of research is identifying bad content on the web, such as spam and fake websites, where features like link structure and blacklists of URLs, hosts, and IPs have been found useful. However, clickbait articles are not spam or fake pages; they can be hosted on reputable news sites.

If you are already familiar with the task of clickbait detection and wish to know the details of our system, check out our next article. If you wish to know how clickbait detection can be formulated as a machine learning task, keep reading.

Clickbait Detection in NLP

The task is usually treated as a binary classification problem, where the classes are “is clickbait” and “is not clickbait”. It can also be treated as a regression problem, where the target denotes the degree to which a headline is clickbait, expressed as a percentage.
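
As a rough illustration, here is a minimal sketch of the binary-classification framing using scikit-learn with simple TF-IDF features. The headlines and labels are made-up placeholders; a real system would use a labeled dataset and a much richer feature set.

```python
# Minimal sketch: clickbait detection framed as binary text classification.
# The headlines and labels below are made-up placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

headlines = [
    "You Won't Believe What Happens Next",              # clickbait
    "10 Secrets Doctors Don't Want You To Know",        # clickbait
    "Government announces new budget for 2020",         # not clickbait
    "Study finds link between diet and heart disease",  # not clickbait
]
labels = [1, 1, 0, 0]  # 1 = clickbait, 0 = not clickbait

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(headlines, labels)

# predict_proba gives a score that can also be read as a clickbait "degree",
# which is one way to approximate the regression view of the task.
print(clf.predict_proba(["This Trick Will Change Your Life"])[0][1])
```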

We can consider the task in terms of social media or normal websites, where the difference is in the available information:

  • For a normal website, we have the article and its headline.
  • For a social media post, in addition to the article and its headline, we have the post text and other meta-information related to the post (number of shares, number of comments, number of likes, the time it was posted, the hashtags, etc.) and to the user (number of followers, number of accounts followed, etc.).

From Where Can We Get Data to Train a Clickbait Detector?

The most popular dataset in English is the Clickbait Challenge dataset [2]. Some competitors worked on extending it manually, while other researchers annotated their own datasets, drawing on websites known for using clickbait in their marketing strategies, such as buzzfeed.com, and on official websites such as cnn.com, where clickbait is expected to be rare.

To our knowledge, there is no Arabic clickbait detection dataset available. However, many websites are filled with clickbait headlines and can be considered good data sources.

Which Features Indicate Clickbait?

The clickbait detection task is known for its huge feature space, which can be separated into three categories.

Textual Features

These features relate to the headline, the main content of the article itself, the text of the post (if we are solving the problem in the social media domain), and the URL of the article.

The following are the main kinds of textual features:

Title-based

  • The presence/number of certain surface cues in the title: numbers, punctuation marks (exclamation marks, question marks, etc.), question words, stopwords, etc. (see the extraction sketch after this list)
  • POS-based: the presence of superlative adverbs and adjectives
  • Sentiment-based: presence/number of negative/positive sentiment words, sentiment polarity.
  • Lexicon-based: presence/number of specific phrases (click here, exclusive, won’t believe, happens next, don’t want, you know, etc…)
  • word2vec-based: the word2vec representation of the headline.
  • Words-based: N-gram, TFIDF.
  • Char-based: N-gram.
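
As a rough sketch of how some of these title-based cues can be extracted, the snippet below computes a few of them with plain Python. The cue lists are illustrative placeholders, not the lexicons used in the cited work.

```python
import re

# Hypothetical cue lists; real systems use much larger lexicons.
QUESTION_WORDS = {"what", "why", "how", "who", "which", "when"}
CLICKBAIT_PHRASES = ["click here", "won't believe", "happens next", "you know"]

def title_features(title: str) -> dict:
    words = re.findall(r"\w+", title.lower())
    return {
        "has_number": any(w.isdigit() for w in words),
        "num_exclamations": title.count("!"),
        "num_question_marks": title.count("?"),
        "starts_with_question_word": bool(words) and words[0] in QUESTION_WORDS,
        "num_clickbait_phrases": sum(p in title.lower() for p in CLICKBAIT_PHRASES),
        "title_length": len(words),
    }

print(title_features("You Won't Believe What Happens Next!"))
```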

Content-based

  • Length-based: average words per sentence.
  • Words-based: N-gram, TFIDF (see the sketch after this list)
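
A minimal sketch of these content-based features, assuming the article bodies are available as plain strings; the example documents are placeholders.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def avg_words_per_sentence(text: str) -> float:
    # Naive sentence split on end punctuation; good enough for illustration.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)

docs = ["First sentence here. Another short one.",
        "A single longer sentence with quite a few more words in it."]
print([avg_words_per_sentence(d) for d in docs])

# Word n-gram TF-IDF over the article bodies (placeholder corpus).
tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)
print(tfidf.shape)
```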

Similarity-based

The textual similarity between the headline and the content (all the content, first lines, summary, or meta description).
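
One simple way to approximate this is cosine similarity between TF-IDF vectors of the headline and the content. The snippet below is a sketch with placeholder text, not the exact similarity measure used in any particular paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

headline = "Scientists discover new planet in nearby system"
content = "Astronomers announced the discovery of a planet orbiting a nearby star."

vectors = TfidfVectorizer().fit_transform([headline, content])
similarity = cosine_similarity(vectors[0], vectors[1])[0][0]
# A low similarity can hint that the headline oversells the content.
print(similarity)
```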

Informality & Readability Based

Measuring the informality level and the reading difficulty of the text using metrics like LIX, RIX, the formality measure, etc.
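
As a sketch, LIX and RIX can be computed directly from word and sentence counts: LIX is the average sentence length plus the percentage of long words, and RIX is the number of long words per sentence, where a long word has more than six characters. The tokenization here is deliberately naive.

```python
import re

def lix_rix(text: str):
    """LIX = words/sentences + 100 * long_words/words; RIX = long_words/sentences."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    long_words = [w for w in words if len(w) > 6]
    n_s, n_w = max(len(sentences), 1), max(len(words), 1)
    lix = n_w / n_s + 100 * len(long_words) / n_w
    rix = len(long_words) / n_s
    return lix, rix

print(lix_rix("You won't believe what this incredible, extraordinary discovery means."))
```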

Forward-reference Based

The presence/number of four grammatical categories in headlines (a simple detection sketch follows the list):

  • Demonstratives (this, that, those, these).
  • Third-person personal pronouns (he, she, his, her, him).
  • The definite article (the).
  • Whether the title starts with an adverb.
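
A simple sketch of these forward-reference features; the adverb check here is a crude suffix heuristic, whereas a real system would rely on a POS tagger.

```python
import re

DEMONSTRATIVES = {"this", "that", "these", "those"}
THIRD_PERSON_PRONOUNS = {"he", "she", "his", "her", "him"}

def forward_reference_features(title: str) -> dict:
    words = re.findall(r"\w+", title.lower())
    return {
        "num_demonstratives": sum(w in DEMONSTRATIVES for w in words),
        "num_third_person_pronouns": sum(w in THIRD_PERSON_PRONOUNS for w in words),
        "num_definite_articles": sum(w == "the" for w in words),
        # Crude proxy; a POS tagger would be used in practice.
        "starts_with_adverb": bool(words) and words[0].endswith("ly"),
    }

print(forward_reference_features("This Is What Happens When He Finally Opens The Box"))
```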

URL-based

Frequencies of the dash, ampersand, uppercase letters, comma, period, equals sign, percentage sign, plus sign, tilde, and underscore, as well as the URL depth (number of forward slashes).
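
A small sketch of these URL features using only the standard library; the example URL is a placeholder.

```python
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    chars = "-&,.=%+~_"
    features = {f"count_{c}": url.count(c) for c in chars}
    features["count_uppercase"] = sum(ch.isupper() for ch in url)
    # URL depth as the number of path segments after the host.
    features["url_depth"] = len([p for p in urlparse(url).path.split("/") if p])
    return features

print(url_features("https://example.com/2019/08/you-wont-believe-this?ref=fb"))
```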

Visual Features

These features are related to the lead image in the article, or the post image if we are solving the problem in the social media domain, and are typically used in the following forms (a brief sketch follows the list):

  • Using pre-trained object recognition models in two ways:
    • Semantic features: the presence of an object in the image, where the name of the detected class is taken as a feature.
    • The embedding of the image taken from the last layer of the model.
  • Using an OCR to extract and analyze the text in the image.
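
A sketch of both ideas, assuming torchvision, PIL, and pytesseract are installed and that lead_image.jpg is the article's lead image. The choice of ResNet-18 is illustrative, not the model used in the cited work.

```python
# Sketch only: assumes torchvision, PIL, and pytesseract are available.
import torch
from torchvision import models, transforms
from PIL import Image
import pytesseract

image = Image.open("lead_image.jpg").convert("RGB")
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
batch = preprocess(image).unsqueeze(0)

model = models.resnet18(pretrained=True).eval()

# (1) Semantic feature: predicted object class index (mapped to a class name in practice).
with torch.no_grad():
    class_idx = model(batch).argmax(dim=1).item()

# (2) Image embedding: output of the network with the classification layer removed.
embedder = torch.nn.Sequential(*list(model.children())[:-1])
with torch.no_grad():
    embedding = embedder(batch).flatten()

# (3) OCR: text rendered inside the image.
image_text = pytesseract.image_to_string(image)
```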

Behavioral and MetaData Features

These features are considered only if we are solving the problem in the social media domain. They relate to the behavior of the user and the metadata of the post, and can be extracted from the following sources:

  • The number of accounts followed and the number of followers in the publisher's profile.
  • The number of likes, comments, and shares
  • The time of publishing
  • Hashtags and mentions

Modeling the Curiosity

A notable work to consider is [3], where the authors proposed modeling the curiosity evoked by clickbait headlines in terms of novelty, surprise, and information gap.

Novelty

To model the novelty the following steps were taken:

  • An LDA topic model with 200 topics was trained on the ABC news headlines dataset.
  • A probability distribution over the 200 topics was generated for each headline in clickbait and non-clickbait samples.
  • The information distance was calculated between the topics of headlines that users had already been exposed to and the topics present in the clickbait and non-clickbait samples.

They found that clickbait is significantly more novel than non-clickbait.
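
A toy sketch of the novelty idea using gensim, with Jensen-Shannon distance as one possible information distance; the tiny corpus here is a placeholder, whereas the original work trained 200 topics on the ABC news headlines dataset.

```python
from gensim import corpora, models
from scipy.spatial.distance import jensenshannon

# Placeholder corpus standing in for the headlines a reader has already seen.
seen_headlines = [
    "government announces new budget",
    "study finds link between diet and health",
    "local team wins championship game",
]
texts = [h.split() for h in seen_headlines]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

def topic_vector(headline: str):
    bow = dictionary.doc2bow(headline.split())
    return [p for _, p in lda.get_document_topics(bow, minimum_probability=0.0)]

seen = topic_vector("government announces new budget")
candidate = topic_vector("you won't believe this weird trick")
print(jensenshannon(seen, candidate))  # larger distance = more novel
```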

Surprise

To model the surprise factor, the following steps were taken (a toy sketch follows the list):

  • They took word bi-grams from clickbait and non-clickbait headlines.
  • Measured the frequency with which these bi-grams occurred in the ABC news headlines corpus.
  • Each headline was represented by the frequency of each bigram in it, which was called the surprise frequency vector.
  • The more frequently a bi-gram occurs, the lower its perceived surprise value when encountered by a reader.
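
A toy sketch of the surprise idea: count bi-gram frequencies in a reference corpus and represent each headline by those counts. The reference headlines below are placeholders standing in for the ABC news headlines corpus.

```python
from collections import Counter

# Tiny stand-in for a reference corpus such as the ABC news headlines dataset.
reference_headlines = [
    "government announces new budget",
    "new budget passes in parliament",
    "study finds link between diet and health",
]

def bigrams(text: str):
    words = text.lower().split()
    return list(zip(words, words[1:]))

reference_counts = Counter(bg for h in reference_headlines for bg in bigrams(h))

def surprise_vector(headline: str):
    # Lower corpus frequency = higher perceived surprise.
    return [reference_counts[bg] for bg in bigrams(headline)]

print(surprise_vector("government announces weird trick"))
```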

Information Gap

Some features discussed earlier in this post, such as the presence of a question or pronouns, were used to represent the information gap in the headlines.

Conclusion

In this post, we discussed some of the methods proposed in the literature to detect clickbait pages using NLP and ML techniques. We considered the problem as a binary classification task and presented a range of features used to characterize clickbait headlines.

Are you intrigued? Check out how we implemented this task at Almeta in our next article.

Did you know that we use all this and other AI technologies in our app? See what you’re reading now applied in action. Try our Almeta News app. You can download it from Google Play: https://play.google.com/store/apps/details?id=io.almeta.almetanewsapp&hl=ar_AR

References

[1] Biyani, Prakhar, Kostas Tsioutsiouliklis, and John Blackmer. “‘8 Amazing Secrets for Getting More Clicks’: Detecting Clickbaits in News Streams Using Article Informality.” Thirtieth AAAI Conference on Artificial Intelligence. 2016.

[2] Potthast, Martin, et al. “The clickbait challenge 2017: towards a regression model for clickbait strength.” arXiv preprint arXiv:1812.10847 (2018).

[3] Venneti, Lasya, and Aniket Alam. “How Curiosity can be modeled for a Clickbait Detector.” arXiv preprint arXiv:1806.04212 (2018).

Further Reading

[1] Geckil, Ayse et al. “A Clickbait Detection Method on News Sites.” 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) (2018): 932-937.

[2] Papadopoulou, Olga, et al. “A two-level classification approach for detecting clickbait posts using text-based features.” arXiv preprint arXiv:1710.08528 (2017).

[3] Rony, Md Main Uddin, Naeemul Hassan, and Mohammad Yousuf. “BaitBuster: A Clickbait Identification Framework.” Thirty-Second AAAI Conference on Artificial Intelligence. 2018.

[4] Ha, Yui, et al. “Characterizing clickbaits on instagram.” Twelfth International AAAI Conference on Web and Social Media. 2018.

[5] Potthast, Martin, et al. “Clickbait detection.” European Conference on Information Retrieval. Springer, Cham, 2016.

[6] Khater, Suhaib R., et al. “Clickbait Detection.” Proceedings of the 7th International Conference on Software and Information Engineering. ACM, 2018.

[7] Elyashar, Aviad, Jorge Bendahan, and Rami Puzis. “Detecting Clickbait in Online Social Media: You Won’t Believe How We Did It.” arXiv preprint arXiv:1710.06699 (2017).

[8] Wiegmann, Matti, et al. “Heuristic Feature Selection for Clickbait Detection.” arXiv preprint arXiv:1802.01191 (2018).

[9] Grigorev, Alexey. “Identifying clickbait posts on social media with an ensemble of linear models.” arXiv preprint arXiv:1710.00399 (2017).

[10] Cao, Xinyue, and Thai Le. “Machine learning based detection of clickbait posts in social media.” arXiv preprint arXiv:1710.01977 (2017).

[11] Chen, Yimin, Niall J. Conroy, and Victoria L. Rubin. “Misleading online content: Recognizing clickbait as false news.” Proceedings of the 2015 ACM on Workshop on Multimodal Deception Detection. ACM, 2015.

[12] Chakraborty, Abhijnan, et al. “Stop clickbait: Detecting and preventing clickbaits in online news media.” 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 2016.
