How to Rank Articles Based on How Informative They Are – Using Snorkel

Let’s start with a simple question: what constitutes an informative article? According to the Oxford dictionary:

informative/ɪnˈfɔːmətɪv/ adjective: informative
providing useful or interesting information

However, this is still an abstract concept. Yes, it is much simpler to flag an article as spammy or unprofessional (see our previous article here),
but when it comes to a normal article the matter becomes more complicated.
Imagine you are given 10 articles and asked to rank each on a scale from 1 to 10 based on how “informative” they are. How long would it take you? An hour? Half an hour?

If you have ever gone through such an experience, you will remember struggling with exact numbers: you will know when an article is a 0 or a 10, but what about deciding between a 6 and a 7?

Let’s try another task: given 5 pairs of articles, for each pair choose the article that seems more informative to you. It is clear that this second task is much simpler.

This is the difference between regression (the former) and ranking (the latter). We humans are much better suited to the latter than the former: it is much easier to choose between chocolate and vanilla ice cream than to rate chocolate on a scale of 10.
The same analogy carries over to AI models, mainly because building datasets for pairwise ranking with human annotators is simpler than building them for direct score mapping.

Let us assume that you have built a system that, given 2 articles, can decide which of them is more informative. In this article, we will explore a single question: how can we use this system to build an informative feed, i.e. sort the articles of your feed based on their informativeness?

This article is the latest part of our research on informativeness, and we have A LOT of it. If you are interested in more details, you can check each individual piece, or review the gist of it.

Offline Ranking

What is it?

Let us start with a simpler task by assuming that the articles form a closed set: we have N articles and, for these articles, M pairwise comparisons. How can we sort the N elements?

One option is to use an ordinary sorting algorithm like quicksort, in which case sorting the articles by pairwise comparisons requires roughly M = N log(N) comparisons. Apart from needing a lot of comparisons, the main problem with this method is that it assumes the comparisons are consistent, meaning that we assume that:

If a > b and b > c, then a > c

where a, b, c are 3 elements to be compared and > represents our ranking system.
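
To make this concrete, here is a minimal Python sketch (not from the original article) of sorting a closed set of articles with a custom comparator. The more_informative function is a hypothetical stand-in for the pairwise ranking system, implemented here with a trivial word-count placeholder:

from functools import cmp_to_key

def more_informative(a: str, b: str) -> bool:
    # Placeholder pairwise judgment (longer = more informative); in practice
    # this would be the pairwise ranking system described above.
    return len(a.split()) > len(b.split())

def compare(a: str, b: str) -> int:
    # Map the pairwise decision onto the -1 / +1 contract expected by sorting.
    return -1 if more_informative(a, b) else 1

def sort_articles(articles):
    # sorted() performs on the order of N log(N) pairwise calls and silently
    # assumes the comparator is transitive -- the assumption questioned next.
    return sorted(articles, key=cmp_to_key(compare))

print(sort_articles(["short one", "a much longer and wordier article", "text of medium length"]))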


However, this transitivity constraint is not necessarily satisfied in practice, due to noise, errors in the model, etc., and if it is violated the whole logic of the sorting algorithm breaks down.

A method to tackle this issue is the Bradley–Terry model. This very simple model states that, given a pair of individuals i and j, the probability that i > j is:

P(i > j) = \frac{p_i}{p_i + p_j}

where p_i is a positive real-valued score assigned to individual i. The comparison i > j can be read as “i is preferred to j“, “i ranks higher than j“, or “i beats j“, depending on the application.

This model can be parameterized as follows:

P(i > j) = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}}

This means that, given a limited number of comparisons M, we can fit the Bradley–Terry model on these comparisons to estimate each \beta_i for i from 1 to N. Each \beta_i represents the strength (score) of element i, and thus we can sort the elements based on this score.

However, there is still an issue. At the start we assumed that our set of news articles is closed, but in our real application new articles are added periodically. The above models can’t assign a score to these new articles without refitting the model every time an article is added, which is computationally very expensive. This leads us to the next section.

How can I implement it?

There are plenty of ready-made toolkits: use the very simple elo package if your comparisons are transitive, or employ choix if you need models like Bradley–Terry.
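
As a minimal sketch (a toy example, assuming the choix package and a handful of made-up comparisons), fitting Bradley–Terry scores from pairwise outcomes could look like this:

import numpy as np
import choix  # pip install choix

n_items = 4  # N articles, indexed 0..3
# Each tuple (i, j) records one comparison in which article i beat article j.
comparisons = [(0, 1), (0, 2), (1, 2), (3, 0), (3, 2), (1, 2)]

# Estimate the Bradley-Terry strength parameters (the beta_i above);
# alpha adds a little regularization since the data is sparse.
params = choix.ilsr_pairwise(n_items, comparisons, alpha=0.01)

# A higher parameter means a stronger item; sort articles accordingly.
print("estimated strengths:", params)
print("articles from most to least informative:", np.argsort(-params))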

Online Ranking

What is it?

The goal here is to use the pairwise comparisons to learn a model that can assign a score to each individual article, including new ones. Several methods have been introduced in the literature to handle this, but let us focus on a single line of research: RankNet [1] and then LambdaRank [2]. A full yet simple mathematical introduction to these algorithms is presented in [4], and this amazing blog post is also a good read. Here we will try to provide a simplified version.

RankNet:

Let us start by defining an ML model that takes an article, represented as an input vector, and returns a score representing how “good” this article is. The underlying model can be any model whose output is a differentiable function of the model parameters. RankNet training works as follows. The training data is partitioned by query. A query here means a dichotomy of the data; for example, in our application it means comparing articles that talk about the same event. At a given point during training, RankNet maps an input feature vector x from R^n to a number f(x), where n is the size of the feature vector and f(x) is the score of the article. For a given pairwise comparison between elements i and j that belong to the same query, we can calculate the scores s_i = f(x_i) and s_j = f(x_j). RankNet models the probability of i being better than j as:
P(\textrm{rank}(i) > \textrm{rank}(j)) = \frac{1}{1+e^{-(s_i-s_j)}}

Based on this modeling, we can apply the negative log-likelihood to create a cost function that can be optimized using gradient descent. Note that this optimization process updates the weights of the original ML model from which we are extracting the scores. Once training is finished, the model can be deployed, since it can take a new article and assign a score to it.
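
As a rough illustration (plain NumPy, with a hypothetical linear scoring model standing in for the real one), the RankNet probability and its pairwise cross-entropy loss for a single comparison can be written as:

import numpy as np

def score(w, x):
    # Hypothetical scoring model f(x); here simply a linear model w . x.
    return np.dot(w, x)

def ranknet_loss(w, x_i, x_j, y_ij):
    # Pairwise cross-entropy loss for one comparison within a query.
    # y_ij = 1 if article i should rank above article j, else 0.
    s_i, s_j = score(w, x_i), score(w, x_j)
    p_ij = 1.0 / (1.0 + np.exp(-(s_i - s_j)))  # P(rank(i) > rank(j))
    return -(y_ij * np.log(p_ij) + (1 - y_ij) * np.log(1 - p_ij))

# Toy usage: two 3-dimensional article vectors, with i labeled as the better one.
w = np.array([0.5, -0.2, 0.1])
x_i, x_j = np.array([1.0, 0.0, 2.0]), np.array([0.2, 1.0, 0.5])
print(ranknet_loss(w, x_i, x_j, y_ij=1))

In a real implementation, the gradient of this loss with respect to w is what updates the scoring model’s weights.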

LambdaRank:

The main problem with this model is that the loss only uses the difference s_i - s_j to optimize the ranking. While this works in many cases, most of the time we care more about the top results than the bottom ones. In other words, the loss is the same for any pair of items i, j regardless of whether i and j are ranked in 1st and 10th place or in 100th and 110th place, since the score difference is the same and therefore the weight update will also be the same.

LambdaRank adds a small detail to handle this issue: pairs near the top of the ranking should matter more than pairs near the bottom. Instead of changing the probability model, LambdaRank scales the gradient (the “lambda”) of each pair by how much the ranking metric would change if i and j swapped positions:

\lambda_{ij} = \frac{-|\Delta(i,j)|}{1 + e^{s_i - s_j}}

where |\Delta(i,j)| is the change in the ranking metric caused by swapping the positions of i and j. Several metrics have been proposed to implement this delta, the most famous being NDCG (normalized discounted cumulative gain). NDCG is commonly used in information retrieval and measures the quality of a ranking; its details are out of scope for now, and you can read more about it here.
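
For concreteness, here is a small sketch (assuming graded relevance labels and the common (2^rel - 1) gain) of how the |\Delta(i,j)| factor can be computed as the change in NDCG when two items swap positions:

import numpy as np

def dcg(relevances):
    # DCG with the (2^rel - 1) / log2(pos + 1) gain; positions are 1-based.
    positions = np.arange(1, len(relevances) + 1)
    return np.sum((2.0 ** np.asarray(relevances) - 1) / np.log2(positions + 1))

def delta_ndcg(relevances, i, j):
    # |change in NDCG| when the items at positions i and j (0-based) are swapped.
    ideal = dcg(sorted(relevances, reverse=True))  # IDCG, used for normalization
    swapped = list(relevances)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(swapped) - dcg(relevances)) / ideal

# Swapping a highly relevant item near the top with a poor one lower down
# yields a much larger delta than the same swap far down the list.
print(delta_ndcg([3, 2, 3, 0, 1, 2], 0, 3))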

How to do it?

It is possible, and may be beneficial, to implement the algorithm using your favorite ML toolkit; for example, we tried to implement it in Keras by defining a custom loss function (the LambdaRank loss). However, the best off-the-shelf solution we found is LGBMRanker from Microsoft’s LightGBM.
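
As a minimal sketch (toy random features and made-up relevance labels grouped by query, not our production setup), training an LGBMRanker with the lambdarank objective could look like this:

import numpy as np
from lightgbm import LGBMRanker

rng = np.random.default_rng(0)

# Toy data: 2 queries (events) with 5 and 7 candidate articles, 10 features each.
X_train = rng.normal(size=(12, 10))
y_train = rng.integers(0, 4, size=12)  # graded relevance per article (0..3)
group_sizes = [5, 7]                   # number of articles in each query group

ranker = LGBMRanker(
    objective="lambdarank",  # LambdaRank-style gradients
    metric="ndcg",
    n_estimators=100,
    min_child_samples=1,     # only because the toy dataset is tiny
)
ranker.fit(X_train, y_train, group=group_sizes)

# Score the unseen articles of a new query and sort them by predicted score.
X_new = rng.normal(size=(4, 10))
print(np.argsort(-ranker.predict(X_new)))

The group argument tells the ranker which articles belong to the same query, mirroring the query partition described for RankNet above.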

Is this the only way to do it?

Absolutely not. Several methods have been devised to generate full rankings from a limited number of pairwise comparisons, and we have only explored a single line of research. Other methods include rank embeddings [5] and approaches based on modified heuristics like [6]. [7] is a good initial overview of the field.

Few Pairwise Comparisons

Based on our discussion so far, the main thing missing to build your amazing feed ranking system is enough pairwise comparisons between items to use as training data for LambdaRank or any other algorithm. While later in the life of the system it will be easy to harness users’ feedback and interactions to enhance this dataset, you need initial data to build the model, and that means manual annotation 🙁

Enter Snorkel

Snorkel is a system for programmatically building and managing training datasets without manual labeling. In Snorkel, users can develop large training datasets in hours or days rather than hand-labeling them over weeks or months.

Snorkel currently exposes three key programmatic operations:

  • Labeling data: e.g., using heuristic rules or distant supervision techniques
  • Transforming data: e.g., rotating or stretching images to perform data augmentation
  • Slicing data: into different critical subsets for monitoring or targeted improvement

Snorkel then automatically models, cleans, and integrates the resulting training data using novel, theoretically grounded techniques. Don’t believe me? Read the paper [8].

Snorkel is such a big deal that we are planning to dedicate a full article just to it, but if you are out of patience, visit their amazing site and jump into the tutorials. Here, let us focus on how we can use Snorkel for our ranking system; for now, we are only interested in labeling the data.

Labeling Functions

Snorkel is based on the idea that we can create several labeling functions (LFs) and then combine them to programmatically annotate the dataset. These labeling functions can be seen as simple rules that vote on the label of a data point. For example, in our task the input to a labeling function would be 2 articles, A and B, and the output would be one of 3 labels: -1 (A > B), +1 (A < B), or 0 (abstaining, i.e. the labeling function can’t compare them). Inventing such labeling functions should be a simple task (most certainly simpler than manually annotating thousands or millions of data points).

What can be used as a labeling function?

The short answer is anything. The long answer is that, because their only requirement is that they map a data point to a label (or abstain), they can wrap a wide variety of forms of supervision. Examples include, but are not limited to:

  • Keyword searches: looking for specific words in a sentence
  • Pattern matching: looking for specific syntactical patterns
  • Third-party models: using a pre-trained model (usually a model for a different task than the one at hand)
  • Distant supervision: using an external knowledge base
  • Crowdworker labels: treating each crowd worker as a black-box function that assigns labels to subsets of the data

More importantly, these LFs can be noisy (not always accurate) and can even conflict with and correlate with one another; it is the role of Snorkel to combine them and build a probabilistically annotated dataset.

Are all labeling functions the same?

Obviously not. Assume that for our article-comparison task we use 3 LFs, each of which takes 2 articles as input and outputs a label in the set {-1, 0, 1} (a minimal code sketch follows the list):

  • LF1: ranks the 2 articles based on the number of cliche terms in each of them.
  • LF2: ranks the 2 articles based on the readability of the 2 articles.
  • LF3: ranks the 2 articles based on the ratio of named entities in each of them. (to measure how detailed the article is)
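
As a rough sketch of what such LFs could look like with Snorkel (the cliche list, threshold, and column names here are hypothetical; note also that Snorkel itself reserves -1 for abstaining and expects non-negative class labels, so the {-1, 0, +1} scheme above is remapped):

from snorkel.labeling import labeling_function

# Snorkel convention: class labels are non-negative integers, -1 means abstain.
ABSTAIN, A_WINS, B_WINS = -1, 0, 1

CLICHES = {"breaking", "shocking", "you won't believe"}  # hypothetical list

def cliche_count(text):
    text = text.lower()
    return sum(text.count(c) for c in CLICHES)

@labeling_function()
def lf_cliche_terms(pair):
    # LF1: prefer the article with fewer cliche terms; abstain on ties.
    a, b = cliche_count(pair.article_a), cliche_count(pair.article_b)
    if a == b:
        return ABSTAIN
    return A_WINS if a < b else B_WINS

@labeling_function()
def lf_named_entities(pair):
    # LF3: prefer the article with the higher ratio of named entities,
    # assuming precomputed ner_ratio_a / ner_ratio_b fields on each data point.
    if abs(pair.ner_ratio_a - pair.ner_ratio_b) < 0.01:
        return ABSTAIN
    return A_WINS if pair.ner_ratio_a > pair.ner_ratio_b else B_WINS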

It is easy to see that the 3 LFs will behave differently and may well have different accuracies when used to label data. But how can we evaluate them, and hopefully improve upon them? According to Snorkel, LF development typically involves multiple iterations of ideation, refining, evaluation, and debugging. A typical cycle consists of the following steps:

  1. Look at examples to generate ideas for LFs
  2. Write an initial version of an LF
  3. Spot check its performance by looking at its output on data points in the training set (or development set if available)
  4. Refine and debug to improve coverage or accuracy as necessary

And here is the catch: while in theory it is possible to use Snorkel to build your dataset with 0 labeled data, in practice it is important to have at least a very small development set, on the order of 100 data points. This development data will be used to help you build and evaluate the performance of the LFs. However, annotating such a small number of data points by hand is not a big deal and can be accomplished in hours to days.
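
As a sketch of the rest of the pipeline (assuming a pandas DataFrame of article pairs with the hypothetical columns used above), applying the LFs, inspecting their coverage and conflicts, and then combining their votes into probabilistic labels could look like this:

import pandas as pd
from snorkel.labeling import LFAnalysis, PandasLFApplier
from snorkel.labeling.model import LabelModel

lfs = [lf_cliche_terms, lf_named_entities]  # the LFs sketched earlier

# df_train: one row per article pair, with the columns the LFs expect.
df_train = pd.DataFrame({
    "article_a": ["Shocking news today ...", "The ministry announced ..."],
    "article_b": ["A detailed report on ...", "You won't believe this ..."],
    "ner_ratio_a": [0.02, 0.08],
    "ner_ratio_b": [0.07, 0.01],
})

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)  # label matrix: one column per LF

# Coverage, overlaps, and conflicts of each LF (pass the dev set labels here
# to also get empirical accuracies).
print(LFAnalysis(L=L_train, lfs=lfs).lf_summary())

# Combine the noisy votes into probabilistic pairwise labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=500, seed=123)
print(label_model.predict_proba(L=L_train))  # P(A wins), P(B wins) per pair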

To Wrap it All Up

OK, so here is the simple plan to build your amazing online ranking system:

  1. Start by collecting a large amount of unlabeled data in the form of pairs. Let us call this set the Training Set.
  2. Manually annotate at least 2 small sets of data: one for developing labeling functions with Snorkel (around 100 data points) and a larger test set to evaluate the overall performance of the pairwise ranker.
  3. Develop LFs using your development set and your imagination, then use them with Snorkel to automatically annotate your Training Set.
  4. Use any of the aforementioned online ranking algorithms to train a ranking model
  5. Enjoy.

In this article, we simply explained the workflow of creating a feed ranking system. In the next articles, we will walk you through the implementation details of every step of this plan.

Did you know that we use all this and other AI technologies in our app? See what you’re reading about now applied in action. Try our Almeta News app. You can download it from Google Play: https://play.google.com/store/apps/details?id=io.almeta.almetanewsapp&hl=ar_AR

Further Reading

[1] Burges, Christopher, et al. “Learning to rank using gradient descent.” Proceedings of the 22nd International Conference on Machine learning (ICML-05). 2005.

[2] Burges, Christopher J., Robert Ragno, and Quoc V. Le. “Learning to rank with nonsmooth cost functions.” Advances in neural information processing systems. 2007.

[3] Wu, Qiang, et al. “Adapting boosting for information retrieval measures.” Information Retrieval 13.3 (2010): 254-270.

[4] Burges, Christopher JC. “From ranknet to lambdarank to lambdamart: An overview.” Learning 11.23-581 (2010): 81.

[5] Jamieson, Kevin G., and Robert Nowak. “Active ranking using pairwise comparisons.” Advances in Neural Information Processing Systems. 2011.

[6] Shah, Nihar B., and Martin J. Wainwright. “Simple, robust and optimal ranking from pairwise comparisons.” The Journal of Machine Learning Research 18.1 (2017): 7246-7283.

[7] Jamieson, Kevin G., and Robert Nowak. “Active ranking using pairwise comparisons.” Advances in Neural Information Processing Systems. 2011.

[8] Ratner, Alexander J., et al. “Data programming: Creating large training sets, quickly.” Advances in neural information processing systems. 2016.
