Auto-Tagging Content with NLP

Many sites on the internet allow their users to specify tags for their content. The best-known example is Tumblr, where each post on the social network can carry a manually selected set of tags. These tags help group posts into related sets by topic and facilitate search.

Such tags can also be seen in news outlets or blogs, where authors often add them as metadata to improve their articles' ranking in search engines like Google. If you are a writer, you will recognize how tedious this can be.

In this article, we will explore the various ways this process can be automated with the help of NLP. Such an auto-tagging system can generate candidate tags for your posts or articles and let you select the most sensible ones.

We will also delve into the details of what resources you will need to implement such a system and which approach is most suitable for your case.

NER and NEL

Named Entity Recognition (NER) is the task of extracting named entities from the article text. Named Entity Linking (NEL), on the other hand, aims to link these named entities to a taxonomy such as Wikipedia. If you are not familiar with NER and NEL, you can review our previous article on this task.

One possible way to generate tag candidates is to extract all the named entities (or aspects) in the text, represented by, say, the Wikipedia entries of the named entities in the article, using a tool like Wikifier.

While this method can generate adequate candidates for other approaches like key-phrase extraction, it faces two issues:

  • Coverage: not all the tags in your article have to be named entities; they might well be arbitrary phrases.
  • Redundancy: not all the named entities mentioned in a text document are necessarily important for the article. For example, in the following sentence “في بيان أصدرته مساء اليوم الأربعاء وتسلم مراسلنا ناصر حاتم نسخة منه، إن قطاع الأمن الوطني للوزارة رصد، في إطار جهوده “لكشف مخططات جماعة الإخوان الإرهابية والدول الداعمة لها” (“In a statement issued this Wednesday evening, a copy of which was received by our correspondent Nasser Hatem, the ministry's National Security sector detected, as part of its efforts ‘to expose the schemes of the terrorist Brotherhood group and the states supporting it’…”), the named entity ناصر حاتم (Nasser Hatem) is irrelevant to the overall purpose of the article. Nonetheless, it would be suggested as a tag, which is not desired.

Key Phrase Extraction

In key-phrase extraction, the goal is to extract the major tokens in the text. There are several methods for doing this, and they generally fall into two main categories:

1. Unsupervised Methods

These are simple methods that rank the words in the article based on several metrics and retrieve the highest-ranking ones. They can be further classified into statistical and graph-based methods:

Graph-Based Methods

In these methods, the system represents the document as a graph and then ranks the phrases based on their centrality scores, which are commonly calculated using PageRank or a variant of it. The main differences between these methods lie in how they construct the graph and how the vertex weights are calculated. The algorithms in this category include TextRank, SingleRank, TopicRank, TopicalPageRank, PositionRank, and MultipartiteRank.
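
As a concrete illustration, the graph-based ranking described above can be sketched in a few lines of pure Python. This is a toy version of a TextRank-style ranker: the adjacent-word co-occurrence graph, the lack of part-of-speech filtering, and the plain whitespace tokenization are all simplifying assumptions, not the exact recipe of any one algorithm.

```python
from collections import defaultdict

def textrank_keywords(text, top_n=3, damping=0.85, iterations=50):
    # Toy tokenization: lowercase, split on whitespace, strip punctuation.
    words = [w.lower().strip(".,") for w in text.split()]
    # Build an undirected co-occurrence graph over adjacent words.
    graph = defaultdict(set)
    for a, b in zip(words, words[1:]):
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    # Iterate the classic PageRank update over the word graph.
    scores = {w: 1.0 for w in graph}
    for _ in range(iterations):
        scores = {node: (1 - damping) + damping *
                  sum(scores[nb] / len(graph[nb]) for nb in graph[node])
                  for node in graph}
    # Return the highest-ranking words as keyword candidates.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(textrank_keywords("graph based ranking methods rank graph nodes by centrality"))
```

Real implementations restrict graph nodes to nouns and adjectives and then merge adjacent high-scoring words into phrases; this sketch only shows the ranking core.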

Statistical Methods

In these methods, the candidates are ranked using their occurrence statistics, mostly based on TFIDF. Some of the methods in this category are:

  • TFIDF: this is the simplest possible method. We calculate the TFIDF score of every N-gram in the text and then select the N-grams with the highest scores.
  • KPMiner: [1] the main drawback of TFIDF is its inherent bias toward shorter n-grams, since they tend to receive larger scores. KPMiner modifies the candidate selection process to reduce erroneous candidates and then adds a boosting factor to adjust the TFIDF weights.
  • YAKE: [2] introduces a method that relies on local statistical features of every term and then generates the scores by combining consecutive words into keyphrases.
  • EmbedRank: [3] This simple method uses the following steps:
    • Candidates are phrases that consist of zero or more adjectives followed by one or multiple nouns
    • These candidates and the whole document are then represented using Doc2Vec or Sent2Vec
    • Afterwards, each candidate is ranked by its cosine similarity to the document vector
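
To make the statistical family concrete, here is a minimal sketch of the plain TFIDF baseline described above: score every n-gram of a document by term frequency times inverse document frequency and keep the top scorers. The toy corpus, the smoothed IDF, and the naive tokenization are assumptions for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    # All 1..n_max-grams of a token list, joined into strings.
    return [" ".join(tokens[i:i + n]) for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_keyphrases(docs, doc_index, top_n=3):
    tokenized = [d.lower().split() for d in docs]
    gram_sets = [set(ngrams(t)) for t in tokenized]  # for document frequency
    tf = Counter(ngrams(tokenized[doc_index]))       # term frequency in target doc
    scores = {}
    for gram, count in tf.items():
        df = sum(gram in gs for gs in gram_sets)
        idf = math.log(len(docs) / df) + 1.0         # smoothed IDF (an assumption)
        scores[gram] = count * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

corpus = ["neural keyphrase generation models",
          "keyphrase extraction with statistical models",
          "statistical ranking of candidate phrases"]
print(tfidf_keyphrases(corpus, 0))
```

Note how the bias discussed under KPMiner shows up immediately: rare unigrams and bigrams tie here only because every term occurs once; in longer documents, short n-grams dominate the ranking.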

2. Supervised Methods

  • KEA is a well-known algorithm for key-phrase extraction. It extracts candidates from documents using TFIDF, and a trained model is then used to restrict the candidate set.
  • Deep methods have also been suggested for this task: the task is converted into a sequence-tagging problem where the input is the article text and the output is the BIO annotation.
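
The sequence-tagging formulation above hinges on converting the gold keyphrases of a training article into token-level BIO labels (B = phrase start, I = inside, O = outside). A minimal sketch of that conversion, assuming exact string matching (real pipelines typically match stemmed or lowercased forms):

```python
def to_bio(tokens, keyphrases):
    # Start with everything outside any phrase.
    tags = ["O"] * len(tokens)
    for phrase in keyphrases:
        p = phrase.split()
        # Mark every exact occurrence of the phrase in the token sequence.
        for i in range(len(tokens) - len(p) + 1):
            if tokens[i:i + len(p)] == p:
                tags[i] = "B"
                for j in range(i + 1, i + len(p)):
                    tags[j] = "I"
    return tags

tokens = "deep models tackle keyphrase extraction".split()
print(to_bio(tokens, ["keyphrase extraction"]))
# -> ['O', 'O', 'O', 'B', 'I']
```

A sequence labeller (e.g. a BiLSTM-CRF or a fine-tuned transformer) is then trained on these (tokens, tags) pairs.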

Other Notes

  • A major distinction between key-phrase extraction methods is whether they use a closed or an open vocabulary. In the closed case, the extractor only selects candidates from a pre-specified set of key phrases. This often improves the quality of the generated tags but requires building that set, and it can reduce the number of extracted keywords by restricting them to the closed set.
  • Most of the aforementioned algorithms are already implemented in packages like pke.
  • Some articles suggest several post-processing steps to improve the quality of the extracted phrases:
    • In [3] the authors suggest using maximal marginal relevance (MMR) to improve the semantic diversity of the selected key phrases. They ran a manual experiment with 200 human participants and found that although reducing the phrases’ semantic overlap leads to no gains in F-score, the more diverse selection is preferred by humans. If you are not familiar with MMR, you can learn more about it in our previous article on multi-document summarization.
    • Several other approaches follow the same pattern to diversify their key phrases including [1, 2]
  • Several cloud services, including AWS Comprehend and Azure Cognitive Services, support key-phrase extraction as a paid feature. However, their performance on Arabic is not always good.
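
For readers who want to see what the MMR re-ranking from [3] amounts to, here is a toy sketch: candidates are greedily selected to balance relevance to the document against redundancy with the phrases already chosen. The three-dimensional embeddings and the λ = 0.5 trade-off are illustrative assumptions; in practice the vectors would come from Doc2Vec or Sent2Vec.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr(doc_vec, candidates, k=2, lam=0.5):
    """candidates: dict mapping phrase -> embedding vector."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        def score(phrase):
            relevance = cosine(remaining[phrase], doc_vec)
            # Penalize similarity to anything already selected.
            redundancy = max((cosine(remaining[phrase], candidates[s])
                              for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

doc = [1.0, 1.0, 0.0]
cands = {"neural tagging": [1.0, 0.9, 0.0],
         "neural taggers": [1.0, 0.8, 0.0],
         "data scraping": [0.0, 1.0, 1.0]}
print(mmr(doc, cands))  # -> ['neural tagging', 'data scraping']
```

Note that pure relevance ranking would have picked the two near-duplicate "neural" phrases; the redundancy penalty is what makes the second pick more diverse.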

Data

As mentioned above, most of these methods are unsupervised and thus require no training data. However, if you wish to use supervised methods, you will need training data for your models.

In the case of Arabic, no large-scale corpora are available; the largest we know of is AKEC, which is still too small for deep seq2seq models. However, many sites, especially news outlets, add key phrases to their articles, making it fairly simple to scrape a corpus of (article, key phrases) pairs. We already have such data at Almeta.

Pros And Cons

  • These methods are generally very simple and perform very well.
  • Most of these algorithms, like YAKE for example, are multi-lingual and usually require only a list of stop words to operate.
  • The unsupervised methods generalize easily to any domain and require no training data; even most of the supervised methods require only a small amount of training data.
  • Being extractive, these algorithms can only generate phrases that appear in the original text. This means the generated keyphrases can’t abstract the content, and they might not be suitable for grouping documents.
  • The quality of the key phrases depends on the domain and the algorithm used.

Key Phrase Generation

A major drawback of extractive methods is that, in most datasets (in Arabic and other languages), a significant portion of the keyphrases do not appear explicitly in the text [4].

Key-phrase generation instead treats the problem as a machine-translation task where the source language is the article’s main text and the target is usually the list of key phrases. Neural architectures specifically designed for machine translation, like seq2seq models, are the prominent method for tackling this task. Furthermore, the same tricks used to improve translation, including transformers, copy decoders, and byte-pair encoding of the text, are commonly used.
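
The translation framing above boils down to serializing each (article, key phrases) training pair into a source and a target sequence, a common setup for seq2seq keyphrase generation [4]. A minimal sketch, where the separator and end-of-sequence token names are assumptions:

```python
# Assumed special tokens; any vocabulary-reserved strings would do.
SEP, EOS = "<sep>", "<eos>"

def make_pair(article, keyphrases):
    # Source: the (normalized) article text the encoder will read.
    source = article.strip().lower()
    # Target: all key phrases as one sequence, joined by a separator token.
    target = f" {SEP} ".join(kp.lower() for kp in keyphrases) + f" {EOS}"
    return source, target

src, tgt = make_pair("Deep models generate keyphrases absent from the text.",
                     ["keyphrase generation", "seq2seq"])
print(tgt)  # -> keyphrase generation <sep> seq2seq <eos>
```

At inference time the decoder emits one token stream, which is split back on the separator to recover the predicted key phrases, including ones never seen in the source text.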

While supervised methods usually yield better key phrases than their extractive counterparts, this approach has some problems:

  • These methods are usually language- and domain-specific: a model trained on news articles would generalize miserably to Wikipedia entries. This increases the cost of incorporating other languages.
  • The deep models often require more computation for both the training and inference phases.
  • These methods require large quantities of training data to generalize. However, as mentioned above, for some domains such as news articles it is simple to scrape such data.

Text Tagging

Another approach is to treat the problem as a fine-grained classification task, where the input is the article and the system selects, from a pre-defined set of classes, one or more tags that best represent it. There are two main challenges for this approach: choosing a model that can predict from an often very large set of classes, and obtaining enough data to train it.

The first challenge is not simple; several competitions have tackled it, especially the LSHTC challenge series. The models often used for such tasks include boosting a large number of generative models [5] or using large neural models like those developed for the object detection task in computer vision.

The second challenge is rather simpler: it is possible to reuse the data from the key-phrase generation task for this approach. Another large source of categorized articles is public taxonomies like Wikipedia and DMOZ.

One interesting case of this task is when the tags have a hierarchical structure; examples include the tags commonly used by news outlets and the categories of Wikipedia pages. In this case, the model should exploit the hierarchical structure of the tags in order to generalize better. Several deep models have been suggested for this task, including HDLTex [6] and Capsule Networks [7, 8].

The drawbacks of this approach are similar to those of key-phrase generation, namely the inability to generalize across domains or languages and the increased computational cost.

Ad-hoc solutions

In [9] a very interesting method was suggested. The authors indexed the English Wikipedia using the Lucene search engine. Then, for every new article, they generated the tags using the following steps:

  • Use the new article (or a subset of its sentences, such as the summary or title) as a query to the search engine
  • Sort the results by their cosine similarity to the article and select the top N most similar Wikipedia articles
  • Extract the tags from the categories of the resulting Wikipedia articles and score them based on their co-occurrence
  • Filter out unneeded tags, especially administrative ones like (born in 1990, died in 1990, …), then return the top N tags
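
The scoring and filtering steps above can be sketched as follows, assuming the retrieval step has already returned the category lists of the top-N similar Wikipedia articles; the administrative-tag patterns and the toy data are illustrative assumptions:

```python
import re
from collections import Counter

# Assumed patterns for administrative categories to discard.
ADMIN_PATTERN = re.compile(r"born in|died in|articles with", re.IGNORECASE)

def rank_tags(retrieved_categories, top_n=2):
    # Score each category by how many retrieved articles carry it.
    counts = Counter(cat for cats in retrieved_categories for cat in cats)
    # Keep the most frequent non-administrative categories as tags.
    return [cat for cat, _ in counts.most_common()
            if not ADMIN_PATTERN.search(cat)][:top_n]

# Toy stand-in for the categories of the top-3 retrieved Wikipedia articles.
results = [["Machine learning", "Born in 1990", "Artificial intelligence"],
           ["Machine learning", "Natural language processing"],
           ["Machine learning", "Artificial intelligence"]]
print(rank_tags(results))  # -> ['Machine learning', 'Artificial intelligence']
```

The co-occurrence count acts as a vote: a category shared by many of the retrieved neighbours is likely relevant to the query article, while one-off administrative categories are filtered out.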

This is a fairly simple approach. However, it might not even be necessary to index the Wikipedia articles yourself, since Wikimedia already provides a free, open API that supports both querying Wikipedia entries and extracting their categories. That said, this service is somewhat limited in the end-points it supports and the results they return.

Customizable Text Classification by Tagging

Several commercial APIs like TextRazor provide one very useful service: customizable text classification. The user can define her own classes, in a manner similar to defining your own interests on sites like Quora, and the model then classifies new articles into these pre-defined classes.

Regardless of the method you choose to build your tagger, one very cool application of the tagging system arises when the categories come from a specific hierarchy. This can happen either with hierarchical taggers or even in key-phrase generation and extraction, by restricting the extracted key phrases to a specific lexicon, for example DMOZ or Wikipedia categories.

The customizable classification system can be implemented by letting the user define their own classes as sets of tags, for example drawn from Wikipedia: we can define the class football players as the set {Messi, Ronaldo, …}.

At test time, the tagging system generates the tags, and the generated tags are then grouped using the class sets.
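
A minimal sketch of this grouping step, with toy class sets: a document receives every class whose tag set overlaps the tags produced by the tagger.

```python
def classify(generated_tags, class_sets):
    # Assign every class whose user-defined tag set overlaps the generated tags.
    tags = set(generated_tags)
    return [name for name, members in class_sets.items() if tags & members]

# Toy user-defined classes, as in the football players example above.
classes = {"football players": {"Messi", "Ronaldo"},
           "politicians": {"Obama", "Merkel"}}
print(classify(["Messi", "Barcelona"], classes))  # -> ['football players']
```

A production version would match on normalized tag forms and could weight classes by how many of their tags occur, but the set-intersection idea is the whole mechanism.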

If the original categories come from a pre-defined taxonomy, as with Wikipedia or DMOZ, it is much easier to define special classes or to reuse the pre-defined taxonomy directly.

Conclusion

  • There are several approaches to implementing an automatic tagging system; they can be broadly categorized into key-phrase-based, classification-based, and ad-hoc methods.
  • For simple use cases, the unsupervised key-phrase extraction methods provide a simple multi-lingual solution to the tagging task, but their results might not be satisfactory for all cases, and they can’t generate abstract concepts that summarize the whole meaning of the article.
  • More advanced supervised approaches like key-phrase generation and supervised tagging provide better and more abstractive results at the expense of reduced generalization and increased computation. They also take longer to implement because of the time spent collecting data and training the models. However, it is fairly simple to build large-enough datasets for this task automatically.
  • The approach presented in [9] is fairly general and simple, and it is possible to leverage the Wikimedia APIs to implement it with little effort.
  • The simplest way to build a tagging system, in our opinion, is to combine shallow key-phrase extraction with tags from Wikimedia to generate adequate tags. If the quality of the generated tags is not satisfactory for your application, or if you want to support only a limited set of tags, you may want to consider stronger options like key-phrase generation or supervised tagging.
  • One fascinating application of an auto-tagger is the ability to build a user-customizable text classification system. Such a system can be more useful if the tags come from an already established taxonomy.

Did you know that we use these and other AI technologies in our app? See what you’re reading now applied in action. Try our Almeta News app. You can download it from Google Play: https://play.google.com/store/apps/details?id=io.almeta.almetanewsapp&hl=ar_AR

References

[1] El-Beltagy, Samhaa R., and Ahmed Rafea. “Kp-miner: Participation in semeval-2.” Proceedings of the 5th international workshop on semantic evaluation. 2010.

[2] Campos, Ricardo, et al. “YAKE! Keyword extraction from single documents using multiple local features.” Information Sciences 509 (2020): 257-289.

[3] Bennani-Smires, Kamil, et al. “Simple Unsupervised Keyphrase Extraction using Sentence Embeddings.” arXiv preprint arXiv:1801.04470 (2018).

[4] Meng, Rui, et al. “Deep keyphrase generation.” arXiv preprint arXiv:1704.06879 (2017).

[5] Puurula, Antti, Jesse Read, and Albert Bifet. “Kaggle LSHTC4 winning solution.” arXiv preprint arXiv:1405.0546 (2014).

[6] Kowsari, Kamran, et al. “Hdltex: Hierarchical deep learning for text classification.” 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2017.

[7] Sinha, Koustuv, et al. “A hierarchical neural attention-based text classifier.” Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018.

[8] Aly, Rami, Steffen Remus, and Chris Biemann. “Hierarchical multi-label classification of text with capsule networks.” Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. 2019.

[9] Syed, Zareen, Tim Finin, and Anupam Joshi. “Wikipedia as an ontology for describing documents.” UMBC Student Collection (2008).
