Multidimensional Topic Modelling. The What? and The How?

Yes, understandably, you might be wondering: is this related to Rick and Morty?

Well, unfortunately, no. But you should really keep reading, because multidimensional topic modelling is really cool.

In this short piece, we will explore the fundamental idea behind multi-dimensional topic modelling and even give you a list of some open-source implementations, so stay tuned.

The What?

Topic modelling is the task of modelling the process of document creation, mostly by approximating the factors behind a document (topics, aspects, perspectives, styles, …) as probability distributions over words.

One Dimension

The aforementioned factors are called latent variables: hidden factors that influence the document generation process and the choice of words. Most commonly, these distributions are assumed to follow a Dirichlet distribution.
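To make this concrete, here is a minimal sketch of the generative story using numpy (the vocabulary, hyperparameters, and seed are invented for illustration): each topic is a Dirichlet-distributed distribution over words, each document is a Dirichlet-distributed distribution over topics, and a word is generated by sampling a topic and then a word from it.

```python
import numpy as np

rng = np.random.default_rng(0)

num_topics = 3          # Z latent components
vocab = ["match", "goal", "market", "stocks", "vote", "senate"]

# Each topic is a distribution over the vocabulary, drawn from a Dirichlet.
# A small concentration parameter (< 1) yields sparse, "peaky" topics.
topic_word = rng.dirichlet(alpha=[0.1] * len(vocab), size=num_topics)

# Each document is a distribution over topics, also Dirichlet-distributed.
doc_topic = rng.dirichlet(alpha=[0.5] * num_topics)

# Generating one word: pick a topic, then pick a word from that topic.
z = rng.choice(num_topics, p=doc_topic)
w = rng.choice(vocab, p=topic_word[z])
print(f"topic {z} -> word '{w}'")
```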

LDA [1] is one of the oldest and most successful topic models. It assumes we have a set of Z latent components (usually called “topics”), and that each data point (document) has a discrete distribution over these topics.

The set of latent components relates to a single latent variable: LDA tends to learn distributions that correspond to semantic topics (such as SPORTS or ECONOMICS) which dominate the choice of words in a document, rather than syntax, perspective, or other aspects of document content.
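As a quick illustration of fitting LDA in practice, here is a hedged sketch using the gensim library (the four-document toy corpus is invented; in real use you would need far more data):

```python
from gensim import corpora, models

# A toy corpus: two sports-flavoured and two economics-flavoured documents.
texts = [
    ["match", "goal", "team", "coach"],
    ["team", "goal", "season", "league"],
    ["market", "stocks", "inflation", "rates"],
    ["stocks", "market", "earnings", "rates"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

# Fit LDA with Z = 2 latent topics.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      random_state=0, passes=10)

# Each learned topic is a distribution over words.
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```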

Two Dimensions

Better modelling can be achieved by using more than a single set of latent components.

Imagine that instead of a one-dimensional vector of Z topics, we have a two-dimensional matrix with Z1 components along one dimension (rows) and Z2 components along the other (columns).

This structure makes sense if your data is composed of two different factors. The two dimensions might correspond to factors such as news topic and political perspective (if we are modelling newspaper editorials), or research topic and discipline (if we are modelling scientific papers). Individual cells of the matrix would represent pairs such as (ECONOMICS, CONSERVATIVE) or (GRAMMAR, LINGUISTICS). This is the idea behind two-dimensional models like TAM [2] and SAGE.
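Here is a minimal sketch of this structure (an illustration of the data layout, not an implementation of TAM or SAGE; the factor names and sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

topics = ["ECONOMICS", "SPORTS"]            # Z1 components (rows)
perspectives = ["CONSERVATIVE", "LIBERAL"]  # Z2 components (columns)
vocab_size = 8

# One word distribution per (topic, perspective) cell of the Z1 x Z2 matrix.
word_dists = {
    (t, p): rng.dirichlet([0.1] * vocab_size)
    for t in topics
    for p in perspectives
}

# A word is generated by first choosing a cell, then sampling from its
# word distribution.
cell = ("ECONOMICS", "CONSERVATIVE")
word_id = rng.choice(vocab_size, p=word_dists[cell])
print(cell, "->", word_id)
```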

A Ton of Dimensions

We can expand the idea even further by assuming K factors modelled with a K-dimensional array, where each cell of the array points to a word distribution corresponding to that particular K-tuple.

For example, in addition to topic and perspective, we might want to model a third factor, the author’s gender, in newspaper editorials, yielding triples such as (ECONOMICS, CONSERVATIVE, MALE).

Conceptually, each K-tuple functions as a topic in the original LDA (with an associated word distribution), except that K-tuples imply a structure: e.g., the pairs (ECONOMICS, CONSERVATIVE) and (ECONOMICS, LIBERAL) are related.
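The same sketch generalizes to K factors, with each cell of the K-dimensional array indexed by a K-tuple (again, the factor names and sizes are purely illustrative):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# K = 3 factors: topic, perspective, and author gender.
factors = [
    ["ECONOMICS", "SPORTS"],
    ["CONSERVATIVE", "LIBERAL"],
    ["MALE", "FEMALE"],
]
vocab_size = 8

# One word distribution per K-tuple (cell of the K-dimensional array).
word_dists = {
    cell: rng.dirichlet([0.1] * vocab_size)
    for cell in itertools.product(*factors)
}

# Tuples that share components are structurally related, e.g. these two
# cells differ only in the perspective factor:
a = ("ECONOMICS", "CONSERVATIVE", "MALE")
b = ("ECONOMICS", "LIBERAL", "MALE")
print(len(word_dists), "cells;", a, "and", b, "share the topic ECONOMICS")
```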

Related algorithms

Other related approaches include the contrastive opinion summarization task. The goal of this task is to extract sentences from positive and negative sets of opinions on a topic and generate a comparative summary containing sentence pairs that are both contrastive with each other and representative of the given sets of opinions.

The method reported in [3] frames summarization as an optimization problem and uses natural language processing (NLP) and optimization techniques to generate a representative and comparative summary from customer reviews about a topic, product, or service.
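As a toy sketch of the pair-selection intuition only (this is not the optimization method of [3]): a contrastive pair should discuss the same aspect from opposing sides, so one crude proxy is to pick the positive/negative sentence pair with the highest word overlap. All sentences below are invented.

```python
from itertools import product

def jaccard(a, b):
    """Word-overlap similarity between two sentences."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

positive = ["battery life is great", "the battery lasts all day"]
negative = ["battery life is terrible", "screen scratches easily"]

# A contrastive pair should talk about the same aspect (high overlap)
# while coming from opposing opinion sets.
best = max(product(positive, negative), key=lambda p: jaccard(*p))
print("most contrastive pair:", best)
```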

The How?

In the following table, we list some of the available implementations of the aforementioned algorithms:

Name | Language | Description
Factorial LDA | Java | Implementation of factorial LDA
ccLDA and TAM | Java | Implementation of TAM [2] and ccLDA
VODUM | Java | Implementation of the Viewpoint and Opinion Discovery Unification Model [4]
Contrastive Summarization | Python | Implementation of the contrastive summarization model from [3]
SeaNMF | Python | Implementation of the semantics-assisted non-negative matrix factorization model from [5]
STTM | Java | A library of short text topic modelling algorithms

Conclusion

In this article, we explored the idea of expanding topic modelling to cover various aspects of document content, and by now you might be thinking…

I told you, didn’t I?

But seriously, how am I going to make a million dollars from this knowledge? Well, if you want to see a cool application, check out this piece to find out how multi-dimensional topic modelling can be used to discover different political views in opinionated text.

Did you know that we use these and other AI technologies in our app? See what you’re reading about now applied in action: try our Almeta News app. You can download it from Google Play or Apple’s App Store.

Further reading

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, no. Jan, pp. 993–1022, 2003.

[2] M. Paul and R. Girju, “A two-dimensional topic-aspect model for discovering multi-faceted topics,” in Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.

[3] H. D. Kim and C. Zhai, “Generating comparative summaries of contradictory opinions in text,” in Proceedings of the 18th ACM conference on Information and knowledge management, 2009, pp. 385–394.

[4] T. Thonet, G. Cabanac, M. Boughanem, and K. Pinel-Sauvagnat, “VODUM: A topic model unifying viewpoint, topic and opinion discovery,” in European Conference on Information Retrieval, 2016, pp. 533–545.

[5] T. Shi, K. Kang, J. Choo, and C. K. Reddy, “Short-text topic modeling via non-negative matrix factorization enriched with local word-context correlations,” in Proceedings of the 2018 World Wide Web Conference, 2018, pp. 1105–1114.
