Event Detection in Media using NLP and AI

News stories are created every day at many news agencies. Users may receive news streams from multiple sources. Browsing in large-scale information spaces without guidance is not effective.

Consider, for example, a person who has just returned from a long vacation and wants to find out what happened while they were away. It is impossible to read the whole news collection, and it is unrealistic to formulate specific queries about unknown facts. As a result, it is difficult to retrieve or check all the potentially relevant stories.

Thus, it is useful to have an intelligent agent to automatically locate related stories in the continuous stream of news articles.

This is our first article on the event detection task; follow-up articles in this series will build on the concepts introduced here.

What is Event Detection?

Event detection is the task of detecting related stories from a continuous stream of news, where an event is defined as something happening in a certain place at a certain time.

News Event Analysis

Let’s mention some properties of news events that will help us in modeling the problem:

  1. News stories discussing the same event tend to be temporally proximate.
  2. A time gap between bursts of topically similar stories is often an indication of different events.
  3. A significant vocabulary shift and rapid changes in term frequency distribution are typical of stories reporting a new event.
  4. Events are typically reported in a relatively brief time window (e.g. 1-4 weeks) and contain fewer reports than broader topics.

Modeling The Problem

In order to compare news articles and determine which of them are most similar to each other, the first obvious step is to define a way to measure the distance between articles.

What do we need to measure a distance?

  • Representing the articles as vectors.
  • A function that takes two of those vectors and outputs a real non-negative number that reflects how close those vectors are to each other.
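The two requirements above can be sketched in a few lines. This is a minimal example assuming the articles have already been turned into term-count vectors over a shared vocabulary; cosine distance is one common choice of dissimilarity function.

```python
import math

def cosine_distance(a, b):
    """Non-negative dissimilarity between two article vectors:
    0.0 means identical direction, larger values mean less similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0  # treat an all-zero vector as maximally dissimilar
    return 1.0 - dot / norm

# Two toy term-count vectors over a shared 3-term vocabulary
doc_a = [3.0, 0.0, 1.0]
doc_b = [2.0, 1.0, 0.0]
print(cosine_distance(doc_a, doc_b))  # a small value: the articles are close
```

Any metric with these properties (a vector representation plus a non-negative dissimilarity) would fit the framework; cosine distance is simply the most common choice for sparse text vectors.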

How to Represent News Articles for Distance Measurement?

We can’t define a distance between the news articles in their typical unstructured form. Let’s explore the ways used for transforming a news article into the vector space.

Traditional Representation

This representation involves the term vectors or bag of words, whose entries are nonzero if the corresponding terms appear in the document. Each term in the vector is typically weighted using the classical term frequency-inverse document frequency (TF-IDF) approach.

However, per the third news event property mentioned in the previous section, the standard TF-IDF term weighting is typically modified when applied in this domain: an adaptive IDF, updated as new documents arrive, is used instead of a static one.
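The adaptive-IDF idea can be sketched as follows. This is an illustrative implementation, not a specific published formulation: the class name, the `+1`/`+0.5` smoothing constants, and the log base are assumptions, but the core idea — document frequencies updated incrementally from the stream rather than fixed from a background corpus — matches the text above.

```python
import math
from collections import defaultdict

class AdaptiveTfIdf:
    """TF-IDF with an incrementally updated (adaptive) IDF:
    document frequencies grow as new stories arrive from the stream."""

    def __init__(self):
        self.n_docs = 0
        self.df = defaultdict(int)  # term -> number of documents containing it

    def update(self, terms):
        """Fold one new document's terms into the document-frequency counts."""
        self.n_docs += 1
        for t in set(terms):
            self.df[t] += 1

    def weight(self, terms):
        """Weight a document's terms with the current (smoothed) IDF."""
        tf = defaultdict(int)
        for t in terms:
            tf[t] += 1
        return {t: tf[t] * math.log((self.n_docs + 1) / (self.df[t] + 0.5))
                for t in tf}

model = AdaptiveTfIdf()
model.update(["election", "vote"])
model.update(["election", "storm"])
weights = model.weight(["election", "storm"])
```

Because `df["election"]` is higher than `df["storm"]`, the rarer term ends up with the larger weight, which is exactly the burst-sensitivity the third news event property calls for.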

5W1H Representation

This approach tries to take advantage of the structure of a news article. 5W1H is a technique employed in journalism to gather all the information about a story before turning it into a news article. It consists of six questions: What, Who, Where, When, Why, and How. An article is not considered complete until all six questions are answered.

It seems that little agreement could be reached on what to consider part of What or Why, or even how to represent How, but Where, Who and When are more concrete and could give better results.

Who and Where can be represented as Named Entities (Proper Nouns, Organizations, Locations, etc.) that appear in the text.

Time information (When) is represented by the publication date of the article.

What has been represented in some works as topical information, namely the output vector of the LDA algorithm.
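A toy extractor for these 5W1H fields might look like the following. This is a sketch under strong assumptions: a real system would use an NER model (e.g. spaCy or Stanford NER) for Who/Where and an LDA topic model for What; here a simple capitalized-token heuristic stands in for named-entity recognition.

```python
import re
from datetime import date

def extract_5w_features(text, publish_date):
    """Toy 5W feature extractor: capitalized tokens stand in for the
    named entities a real NER system would find."""
    entities = set(re.findall(r"\b[A-Z][a-z]+\b", text))
    return {
        "who_where": entities,             # Who / Where: named-entity mentions
        "when": publish_date.isoformat(),  # When: the article's publish date
    }

feats = extract_5w_features("Angela Merkel met Macron in Paris",
                            date(2016, 5, 23))
print(feats["who_where"])
```

The point is only the shape of the representation: a set of entity mentions for Who/Where and a normalized date for When, onto which a What vector (e.g. LDA topics) would be stacked.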

How to Measure the Distance Between The Articles Vectors?

Typically, the distance between the vectorized representations is measured using traditional metrics such as cosine distance or Euclidean distance.

Another option that can be considered is to measure the intersection between entities and/or the time gap between the publication dates separately.
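Those two separate signals can be sketched like this; the function names and the use of Jaccard similarity for the entity intersection are illustrative choices, not a specific published method.

```python
from datetime import datetime

def entity_overlap(e1, e2):
    """Jaccard similarity between the named-entity sets of two articles."""
    if not e1 and not e2:
        return 0.0
    return len(e1 & e2) / len(e1 | e2)

def time_gap_hours(d1, d2):
    """Absolute gap, in hours, between two publication dates."""
    return abs((d1 - d2).total_seconds()) / 3600.0

a = {"Merkel", "Paris"}
b = {"Merkel", "Berlin", "Paris"}
gap = time_gap_hours(datetime(2016, 1, 2), datetime(2016, 1, 1))
print(entity_overlap(a, b), gap)
```

A final distance could then combine these signals, for instance as a weighted sum with the content-vector distance, with weights tuned on held-out data.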

How to “Detect The Events”?

We distinguish between supervised and unsupervised learning approaches.

Unsupervised Learning Approach

This is the most common approach in event detection, as it does not require annotated datasets.

Document-Pivot Approach

Typically, the event detection problem is treated as a stream clustering problem that aims to recognize patterns in an unordered, infinite and evolving stream of observations i.e. news stories.

As documents arrive, the system must decide whether each one reports a new or an old event. Therefore, the employed clustering approaches are typically incremental (greedy) algorithms that process the input stream sequentially, merging each story into the most similar existing cluster if the similarity exceeds a predefined threshold, or creating a new cluster otherwise.
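A minimal single-pass version of this greedy stream clustering might look as follows. The function names and the token-overlap similarity are assumptions for illustration; any article similarity from the previous section could be plugged in.

```python
def online_cluster(stream, similarity, threshold):
    """Greedy single-pass clustering: assign each incoming story to the most
    similar existing cluster if that similarity exceeds `threshold`,
    otherwise open a new cluster (i.e. declare a new event)."""
    clusters = []  # each cluster is a list of stories for one event
    for story in stream:
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = max(similarity(story, member) for member in cluster)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None:
            best.append(story)        # old event
        else:
            clusters.append([story])  # new event
    return clusters

def token_jaccard(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

docs = ["earthquake hits tokyo", "tokyo earthquake damage",
        "election results announced"]
events = online_cluster(docs, token_jaccard, threshold=0.2)
```

Note the single pass over the stream: each story is compared only against existing clusters and the decision is never revisited, which is what makes the approach feasible on an infinite stream.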

Feature-Pivot Approach

This approach relies on keyword clustering instead of document clustering, and has been found to be more effective in the social media domain.

This method is based on the assumption that documents describing the same event will contain similar sets of keywords. The solution steps are as follows:

  1. Build a network of keywords based on their co-occurrence in documents.
  2. Use community detection methods analogous to those used for social network analysis to discover and describe events.
  3. Constellations of keywords describing an event can be used to find related articles.
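The three steps above can be sketched as follows. As a simplifying assumption, connected components stand in for a proper community-detection algorithm (which would further split dense subgraphs); the function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def keyword_graph(documents):
    """Step 1: build an undirected co-occurrence graph over the keywords,
    linking two keywords whenever they appear in the same document."""
    adj = defaultdict(set)
    for doc in documents:
        for u, v in combinations(sorted(set(doc)), 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj

def components(adj):
    """Steps 2-3 (simplified): each connected component is one candidate
    event's keyword constellation; a real system would run community
    detection instead."""
    seen, groups = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, group = [node], set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            group.add(n)
            stack.extend(adj[n] - seen)
        groups.append(group)
    return groups

docs = [["quake", "tokyo"], ["quake", "tsunami"], ["election", "vote"]]
events = components(keyword_graph(docs))
```

Each resulting keyword constellation can then be matched back against articles (step 3) to collect the stories reporting that event.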

Supervised Learning Approach

Treating event detection as a supervised learning problem, which requires annotated datasets, is less common. However, we discuss it as a solution in this section.

Data Annotation

For a given time window, collect the news articles from specified web news portals. Each article should be assigned to a group of articles reporting the same event; if no previously annotated articles report the same event, a new group should be created for the article.

Auto Tagging Methods

Threshold-based: Calculate the similarity between articles; those whose similarity exceeds a predefined threshold are considered similar.

Pair-classification: Use a binary classifier that, given two articles, decides whether they are similar.
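The threshold-based variant can be sketched as a simple grouping pass; the pair-classification variant would replace the `similarity(...) >= threshold` test with a trained classifier's prediction. The function name and token-overlap similarity below are illustrative assumptions.

```python
def threshold_tag(articles, similarity, threshold):
    """Threshold-based auto-tagging: each article joins the event group of the
    first earlier article it is sufficiently similar to, otherwise it starts
    a new group. Returns one event label per article."""
    labels = []
    for i, art in enumerate(articles):
        label = len(set(labels))  # default: a brand-new event id
        for j in range(i):
            if similarity(art, articles[j]) >= threshold:
                label = labels[j]
                break
        labels.append(label)
    return labels

def token_jaccard(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

articles = ["tokyo quake", "quake in tokyo", "vote count"]
labels = threshold_tag(articles, token_jaccard, threshold=0.3)
print(labels)  # [0, 0, 1]
```

For pair-classification, a binary model (e.g. logistic regression over entity-overlap and time-gap features) would be trained on annotated article pairs and queried in place of the threshold test.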

How to Solve Event Detection Across Different Languages?

A few works have considered the problem of language-independent event detection. For instance, in [1] the articles were first clustered in each language separately.

To perform the online clustering, they measured the similarities between the articles using their entities and non-entities weighted by TF-IDF, along with their publication dates.

They were particularly interested in determining exactly where the event took place, in order to answer the Where question. To achieve this, they cast the problem as a word-level classification task in which each location mentioned in the article is a candidate, and used an SVM classifier to perform the classification.

Finally, they tried to identify the clusters in different languages that discuss the same event. They cast this task as a learning problem: from each pair of clusters under consideration, they extracted a set of features that can be used for training a pair-classification model. These features included:

  • Identification of concepts done by wikification, which is a process of entity linking that uses Wikipedia as the knowledge base. Each mentioned concept is annotated with a URI that is the link to the corresponding Wikipedia page.
  • Whether the event locations found for the two clusters are the same or not.
  • The absolute difference in hours between the events in the two clusters.
  • The similarity of the dates that are being mentioned in the articles in the two clusters.
  • The categories of the news articles in the two clusters; articles were categorized by content into the DMOZ taxonomy.
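Assembling this kind of cluster-pair feature vector might look like the following sketch. The field names, the dictionary representation of a cluster, and the use of Jaccard overlap on wikified concepts are illustrative assumptions, not the exact features of [1].

```python
from datetime import datetime

def cluster_pair_features(c1, c2):
    """Illustrative feature vector for deciding whether two clusters
    (possibly in different languages) describe the same event."""
    shared = len(c1["concepts"] & c2["concepts"])
    total = len(c1["concepts"] | c2["concepts"]) or 1
    return {
        # Overlap of language-neutral concept URIs from wikification
        "concept_overlap": shared / total,
        # Whether the detected event locations agree
        "same_location": c1["location"] == c2["location"],
        # Absolute difference in hours between the two events
        "time_diff_hours": abs((c1["time"] - c2["time"]).total_seconds()) / 3600.0,
    }

en = {"concepts": {"dbpedia:Earthquake", "dbpedia:Tokyo"},
      "location": "Tokyo", "time": datetime(2016, 4, 16, 1, 25)}
es = {"concepts": {"dbpedia:Earthquake", "dbpedia:Tokyo", "dbpedia:Tsunami"},
      "location": "Tokyo", "time": datetime(2016, 4, 16, 4, 25)}
f = cluster_pair_features(en, es)
```

Because the concepts are language-neutral URIs, the same feature computation works for an English-Spanish pair as for a monolingual one, which is the point of the wikification step.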

Another work to consider in this domain is [2], in which four NLP pipelines were developed for event detection in English, Spanish, Dutch, and Italian. The pipelines aimed to identify who did what, when, and where by adopting a common semantic representation.

Semantic interoperability across the four languages was achieved by projecting entities, event predicates and roles, time expressions, and concepts from the multilingual sources onto language-neutral semantic resources:

  • Named entities were linked to English DBpedia entity identifiers through existing cross-lingual links from the Spanish, Italian, and Dutch DBpedia.
  • Nominal and verbal event mentions were aligned to abstract representations through the Predicate Matrix [3].
  • Time expressions were all normalized to the ISO time format.

Conclusion

Event detection is the task of detecting related stories from a continuous stream of news. The key to solving this problem is to represent each news story with features that effectively describe the reported events. The solutions discussed include answering the questions of who, where, when, what, and how from the story. Machine learning methods, including both supervised and unsupervised approaches, have been applied to solve this problem.

This was our first article on the event detection task; follow-up articles in this series will build on the concepts introduced here.

Did you know that we use these and other AI technologies in our app? See what you’re reading now applied in action. Try our Almeta News app; you can download it from Google Play: https://play.google.com/store/apps/details?id=io.almeta.almetanewsapp&hl=ar_AR

References

[1] Leban, Gregor, Blaz Fortuna, and Marko Grobelnik. “Using News Articles for Real-time Cross-Lingual Event Detection and Filtering.” NewsIR@ ECIR. 2016.

[2] Agerri, Rodrigo, et al. “Multilingual event detection using the NewsReader pipelines.” LREC 2016 Workshop on Cross-Platform Text Mining and Natural Language Processing Interoperability, Portorož, Slovenia, 2016, pp. 42-46.

[3] De Lacalle, Maddalen Lopez, Egoitz Laparra, and German Rigau. “Predicate Matrix: extending SemLink through WordNet mappings.” LREC. 2014.

Further Reading

[1] Yang, Yiming, et al. “Learning approaches for detecting and tracking news events.” IEEE Intelligent Systems and their Applications 14.4 (1999): 32-43.

[2] Atefeh, Farzindar, and Wael Khreich. “A survey of techniques for event detection in twitter.” Computational Intelligence 31.1 (2015): 132-164.

[3] Parafita Martínez, Álvaro. “News similarity with natural language processing.” (2016).

[4] Sayyadi, Hassan, Matthew Hurst, and Alexey Maykov. “Event detection and tracking in social streams.” Third International AAAI Conference on Weblogs and Social Media. 2009.

[5] Kumaran, Giridhar, and James Allan. “Text classification and named entities for new event detection.” Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2004.

[6] Kumaran, Giridhar, and James Allan. “Using names and topics for new event detection.” Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2005.

[7] Hogenboom, Frederik, et al. “An Overview of Event Extraction from Text.” DeRiVE@ ISWC. 2011.

[8] Li, Zhiwei, et al. “A probabilistic model for retrospective news event detection.” Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2005.

[9] Lam, Wai, et al. “Using contextual analysis for news event detection.” International Journal of Intelligent Systems 16.4 (2001): 525-546.

[10] Edouard, Amosse. Event detection and analysis on short text messages. Diss. 2017.

[11] Brants, Thorsten and Francine Chen. “A System for new event detection.” SIGIR (2003).

[12] Rafea, Ahmed, and Nada A. GabAllah. “Topic Detection Approaches in Identifying Topics and Events from Arabic Corpora.” Procedia computer science 142 (2018): 270-277.
