RomCom — The Personalized Recommendation System For Almeta

The nice thing about working at Almeta is that we are our own users.

As a platform for the best Arabic content on the web, we want to deliver the best content for each user. For us. For each Arabic-speaking reader.

As usual, many companies think of “personalization” as the way to go to improve engagement, reach, or even acquisition: just let the user tell you what he wants, fit a model to his needs, and let him see what he wants to see, not what he should see. We tend to disagree.

Given our own unique thoughts, interests, and curiosity as humans, each of us has a very distinct taste. As a company, we could easily think that we can employ AI to deliver a personalization system that “fits” each one of us as users.

But we disagree. We don’t want ourselves, or our users, to be locked in by our own interests, tastes, and preferences. Exploring curious spaces, genuinely good content, and novel ideas is essential for our growth and even our wellbeing as humans. We shouldn’t forget that. For more on this, please read our blog post here.

We work hard to make AI work for our users at Almeta; delivering what we think is best for us, the Arab readers.

To this end, we have envisioned a personalized recommendation system that tries to cover most aspects of any user’s interests, as well as timeless content, novel ideas, and evergreen pieces.

In this post, we scope our recommendation system, “RomCom”, technically. We’ll start with the usual way of building a personalization system and expand on it later.

Recommendation Systems Through The Eyes of Search Engines

Recommendations and search are two sides of the same coin. Both rank content for a user based on “relevance”; the only difference is whether a keyword query is provided.

By translating the problem of recommending content to a user into a search problem over the user’s implied interests, we can base our recommender system on a search engine.

Throughout this post, we would like to share how we are planning to do exactly that.

ElasticSearch

As our search engine, we decided to go with ElasticSearch.

Elasticsearch is the central component of the Elastic Stack, a set of open-source tools for data ingestion, enrichment, storage, analysis, and visualization. It is a distributed, open-source search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured.

What Are the Possible Recommendations and How Do We Handle Them?

Every article we grab from the web is a possible recommendation. Each article is indexed as an ElasticSearch document using its title, content, and other semantic features generated by our NLP services, such as its summary, genres, entities, tags, etc.
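As a rough sketch (the field names and values here are illustrative, not our exact schema), an indexed article could look like the following, reusing the same index/type convention as the profile example later in this post:

PUT recs/article/1
{
   "title":"article_1_title",
   "content":"article_1_full_text",
   "summary":"article_1_auto_summary",
   "genres":["tech", "economy"],
   "topics":["blockchain"],
   "entities":["bill_gates"],
   "tags":["fintech"],
   "publisher":"tech_news"
}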

However, it’s not just about textual content: Almeta is planning to be the repository for any kind of web content that may trigger a user’s interest, from articles, images, and videos to podcasts, posts, accounts, etc.

The good news is that all of these can simply be handled the same way!

Each kind of content grabbed from the web would have its own index in ElasticSearch, where the elements are indexed using the fields that best express the content, properties, and semantics of that specific kind.

Let’s take some examples:

Suppose we’re grabbing YouTube videos. Each video has a title, description, categories, tags, etc. All of these, along with other metadata like the video’s channel, its language, whether it has subtitles, and the languages of those subtitles, can be part of the video index.
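A hedged sketch of such a video entry, with purely illustrative field names and values, could be:

PUT recs/video/1
{
   "title":"video_1_title",
   "description":"video_1_description",
   "categories":["education", "tech"],
   "tags":["deep_learning"],
   "channel":"channel_1",
   "language":"ar",
   "has_subtitles":true,
   "subtitle_languages":["ar", "en"]
}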

Elements that are associated with textual content, e.g. images and videos embedded in articles, can have an enhanced index by linking them to their parent textual content, so that these elements can, for instance, belong to certain events, topics, genres, etc.

Other kinds of recommendations are the documents’ features themselves, e.g. a publisher, topic, genre, etc. These features also characterize the user’s preferences/interests, as we will see later in this post.

The index of these features would be an aggregation over the indexes of all the documents that belong to them, e.g. a publisher’s entry would hold all the topics and all the genres of the articles they have published.
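For instance, a publisher’s aggregated index entry could look like the following sketch, where the weights are illustrative proportions of that publisher’s articles carrying each genre or topic (not real values):

PUT recs/publisher/1
{
   "name":"tech_news",
   "genres":{
      "tech":0.7,
      "economy":0.3
   },
   "topics":{
      "deep_learning":0.4,
      "blockchain":0.6
   }
}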

Now that we have prepared our recommendations, let’s investigate how we can recommend them.

How to Recommend Content?

In our recommender system, we would have three types of content recommendations.

Personalized Recommendations

These are recommendations directed to each user alone based on his interests.

Determining The User’s Interests

The main building block of personal recommendations is obviously determining the user’s interests.
A user would have a profile that is either collected implicitly from that user’s behavior, i.e. generated from the semantics of the content he interacts with, or stated explicitly by the user, for example by selecting the topics he wants to follow, his location, career, education, age, etc.

Each user’s profile can be indexed the same way as a document. Following is an example of an explicitly stated profile:

PUT recs/user/1
{
   "first_name":"user1",
   "last_name":"user1_lastname",
   "email":"user1@xxx.com",
   "job":"programmer",
   "geolocation":"lebanon",
   "education":"computer science",
   "followed_genres":["politics", "economy", "tech"],
   "followed_entities":["bill_gates", "elon_musk"],
   "followed_topics":["blockchain", "deep_learning"],
   "followed_events":["lebanese_revolution"],
   "followed_publishers":["aljazeera", "tech_news"]
}

Typically, the values in the followed genres, entities, and so on would be IDs rather than raw terms.

To incorporate the user’s behavior into his explicit preferences, a weight can be added to each preference such that, for instance, reading more “tech” articles increases the “tech” preference weight. This weight can simply be proportional to the number of articles read by the user that match this preference.

Following is an example of weighted genres:

{
   "followed_genres":[
      {
         "genre":"tech",
         "weight":0.5
      },
      {
         "genre":"health",
         "weight":0.8
      }
   ]
}

The weight increase can differ according to the type of interaction with the content, e.g. sharing increases the weight more than liking, while both liking and sharing increase the weight more than just reading.
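As an illustration only (the actual increments would need tuning), the interaction types could be mapped to weight increments like this:

{
   "interaction_weight_increment":{
      "read":0.1,
      "like":0.3,
      "share":0.5
   }
}

Sharing a “tech” article would then raise the “tech” genre weight in the user’s profile by 0.5, while merely reading one would raise it by 0.1.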

So far we have talked only about explicitly stated preferences, so what about the implicitly extracted ones?

Implicit interests are ones derived from the semantics of the content the user has interacted with but does not follow explicitly, e.g. a user has been observed reading many “tech” articles recently while not following the “tech” genre.
This data can be indexed within the same user profile as follows:

{
   "recently_interacted_genres": [
      {
         "genre":"politics",
         "weight":0.5
      },
      {
         "genre":"science",
         "weight":0.8
      }
   ]
}

Personalized Recommendations Scenarios

The recommendations would be provided to the user from different perspectives. All of these recommendations can be acquired by simple ElasticSearch queries built according to each perspective’s context.

“Based on what you’re following” Scenario

Recommending content based on the user’s explicitly stated preferences.

We can incorporate the user’s explicit preferences into an Elasticsearch query such that all content relevant to his interests is retrieved. Specifically, we can generate a “should” query so that any document matching at least one clause of the query is eligible to be returned to the user. Documents with more than one match will be scored higher.

{
   "query": {
      "bool": {
         "should": [
            {
               "term": {
                  "genre":"tech"
               }
            },
            {
               "term": {
                  "genre":"politics"
               }
            },
            {
               "term": {
                  "topic":"deep_learning"
               }
            },
            {
               "term": {
                  "topic":"blockchain"
               }
            },
            {
               "term": {
                  "event":"lebanese_revolution"
               }
            },
            {
               "term": {
                  "publisher":"aljazeera"
               }
            },
            {
               "term": {
                  "publisher":"tech_news"
               }
            }            
         ]
      }
   }
}

To incorporate the preference weights into the relevance score, we can use the “boost” parameter or go further and use a function score.
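For example, the weighted genres from the earlier profile example could be mapped to per-clause boosts in a sketch like this:

{
   "query": {
      "bool": {
         "should": [
            {
               "term": {
                  "genre": {
                     "value":"tech",
                     "boost":0.5
                  }
               }
            },
            {
               "term": {
                  "genre": {
                     "value":"health",
                     "boost":0.8
                  }
               }
            }
         ]
      }
   }
}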

“Based on your reading history” Scenario

Recommending content based on the user’s implicit preferences.

This can be done in exactly the same way as the previous feature, but this time incorporating into the query the implicit preferences derived from the user’s behavior.

“What people with the same interests are reading?” Scenario

Recommend content that is being viewed by similar users.

This requires indexing the ids of the content viewed recently by a user in his profile, for example, having a “recently_viewed” field in the user’s index.

{
   "recently_viewed": ["article_1", "article_2", "article_3"]
}

To retrieve these recommendations for a user, we would first retrieve users similar to him, where the similarity is determined based on the users’ preferences/interests. This can be achieved with a query similar to the ones used in the two previous features to retrieve content that meets the user’s interests; however, this time we’re querying the users index.
To get the most viewed content among these similar users, we can use the significant terms aggregation feature of ElasticSearch.
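A minimal sketch of such a request, reusing the flat profile fields from the first profile example above: the query clauses select the similar users, the aggregation runs over their “recently_viewed” field, and "size": 0 skips the user hits since only the aggregation is needed.

GET recs/user/_search
{
   "size":0,
   "query": {
      "bool": {
         "should": [
            {
               "term": {
                  "followed_genres":"tech"
               }
            },
            {
               "term": {
                  "followed_topics":"deep_learning"
               }
            }
         ]
      }
   },
   "aggs": {
      "viewed_by_similar_users": {
         "significant_terms": {
            "field":"recently_viewed"
         }
      }
   }
}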
The following is an example of the resulting aggregation buckets:

{
   "buckets": [
      {
         "key":"article_1",
         "doc_count":27617,
         "score":0.0599,
         "bg_count":53182
      },
      {
         "key":"article_2",
         "doc_count":3640,
         "score":0.371,
         "bg_count":66799
      }
   ]
}

“Preference for you” Scenario

Recommending preferences i.e. publisher, topic, genre, event, etc.

In a previous section, we discussed how to index the preferences. We can recommend these preferences in four ways:

  1. Using the user’s interests to query a kind of preference, e.g. recommending the publishers who frequently publish content that meets the user’s interests. This can be achieved with a query similar to the ones used in the first two features, but querying the publishers this time.
  2. Recommend preferences similar to the highest-weighted preferences in the user’s profile, e.g. recommend entities similar to the highest-weighted entities in the user’s profile.
    Similar preferences can be retrieved by querying the preferences using the attributes of the one in hand.
    Following our example, an entity preference would aggregate the attributes from all the content it belongs to. The “Bill Gates” entity, for example, would include:
    • “tech” as its highest-scored genre,
    • related entities such as “Microsoft” among its highest-weighted entities, along with others like “Apple”, “Facebook”, “Steve Jobs”, etc.,
    • and topics like “windows 10 release”, and so on.
    That’s why entities that share these attributes, such as “Steve Jobs” and “Mark Zuckerberg”, can be strong recommendation candidates.
  3. Recommend the implicit preferences, e.g. a publisher the user views a lot but doesn’t follow explicitly.
  4. Recommend similar users’ interests that are not stated in the profile of the user in hand.
    This is very similar to the “What people with the same interests are reading?” scenario: we would first retrieve the similar users, then, to get their interests, use the significant terms aggregation feature of ElasticSearch, applying the aggregation to the field of the interest in hand.

“I want to get better at…” Scenario

A feature to propose content for the user according to what he wants to get better at.

What a user wants to get better at can be thought of as skills, e.g. drawing, speaking English, writing, etc.

Regardless of how these skills are acquired from the user, they can be handled like any other user preference.

In order to index a skill, we would query the different kinds of content using the skill’s terms, e.g. “drawing”. The query result would be, for instance, articles with the following topics:

{
   "query_result": [
      {
         "topics": {
            "deep_learning":0.5,
            "nlp":0.2
         }
      },
      {
         "topics": {
            "deep_learning":0.1,
            "nlp":0.4
         }
      }
   ]
}

To build the skill’s index entry, we aggregate the retrieved results; aggregating the topics in the example result above would give the skill the following topics:

PUT recs/skill/1
{
   "topics": {
      "deep_learning":0.3,
      "nlp":0.3
   }
}

The produced skill can then be used to query the desired content.
Focusing on actionable content can be a useful semantic feature for this kind of recommendation too; like the other NLP semantic properties, it can be calculated offline for the different documents.

“Stop seeing this” Feature

This can be thought of as a kind of user interaction that penalizes the weights of the user’s preferences that the content in hand matches.

Another solution is excluding such content at query time with a negated clause in an ElasticSearch bool query (i.e. “must_not”), as described here.
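A minimal sketch, assuming the disliked content is characterized by a hypothetical publisher id: the interest clauses stay in “should”, while the unwanted attribute is excluded with “must_not”.

{
   "query": {
      "bool": {
         "should": [
            {
               "term": {
                  "genre":"tech"
               }
            }
         ],
         "must_not": [
            {
               "term": {
                  "publisher":"publisher_x"
               }
            }
         ]
      }
   }
}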

“See more like this” Feature

This can be thought of as a kind of user interaction that increases the weights of the user’s preferences that the content in hand matches.

Ranking The Personalized Recommended Content

In the previous recommendation scenarios, we may often end up with a huge amount of content that meets the user’s interests equally well. Hence, we need a way to rank this content and surface the best of it.

To this end, we recall our previous post, which discussed several content ranking metrics like novelty, freshness, user engagement, etc.
These metrics can be calculated for each piece of content offline using NLP services, and their scores can then be incorporated into ElasticSearch’s relevance formula using a function score.
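A sketch of such a query, assuming the offline metrics are stored on each document under hypothetical numeric fields like “novelty_score” and “freshness_score”:

{
   "query": {
      "function_score": {
         "query": {
            "bool": {
               "should": [
                  {
                     "term": {
                        "genre":"tech"
                     }
                  }
               ]
            }
         },
         "functions": [
            {
               "field_value_factor": {
                  "field":"novelty_score",
                  "missing":0
               }
            },
            {
               "field_value_factor": {
                  "field":"freshness_score",
                  "missing":0
               }
            }
         ],
         "score_mode":"sum",
         "boost_mode":"sum"
      }
   }
}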

Content-Based Recommendations

This kind of recommendation can occur in two different scenarios.

“Similar content” Scenario

This is the simplest recommendation scenario. Suppose we would like to show documents that are similar to the document the user is currently viewing; how do we do this?

Elasticsearch simply has a query feature called the “More Like This” query, also known as the MLT query, that tackles this case.

The MLT query finds documents that are similar to a given text.

How does it work?

The MLT query extracts the text from the input document, analyzes it, and selects the top K terms with the highest TF-IDF scores to form a disjunctive query of these terms.
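A minimal sketch of an MLT query against the hypothetical article index sketched earlier, asking for documents similar to an already-indexed article, following the typed-index convention used in this post:

GET recs/article/_search
{
   "query": {
      "more_like_this": {
         "fields": ["title", "content"],
         "like": [
            {
               "_index":"recs",
               "_type":"article",
               "_id":"article_1"
            }
         ],
         "min_term_freq":1,
         "max_query_terms":25
      }
   }
}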

In a similar scenario, suppose a user has just decided to follow a specific publisher; we would like to show publishers similar to the one he has just followed.

For any given content or preference, similar elements can be retrieved by incorporating the element’s attributes into the query, as discussed for similar cases in the previous section.

“People also viewed” Scenario

In this scenario, given an element a user has interacted with, we show him elements of the same kind that other users who interacted with that element also interacted with.

The implementation is very similar to the “What people with the same interests are reading?” scenario. We would first retrieve the users who interacted with the same element, e.g. followed the same topic the user has just followed. Then, to get the other elements those users interacted with, we can use the significant terms aggregation feature of ElasticSearch; in our example, the aggregation would be applied to the “followed_topics” field in the user index.
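A sketch of such a request, reusing the profile fields from the earlier example: match the users who follow the same topic, then aggregate over their “followed_topics” field.

GET recs/user/_search
{
   "size":0,
   "query": {
      "term": {
         "followed_topics":"blockchain"
      }
   },
   "aggs": {
      "also_followed_topics": {
         "significant_terms": {
            "field":"followed_topics"
         }
      }
   }
}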

Public Recommendations

Here we recommend content that may trigger the user’s interest but is not directed specifically at him. These recommendations are shared across the whole user base.
The public recommendations come in four different scenarios.

“Popular on Almeta” Scenario

We haven’t incorporated a popularity factor into our personalized content recommendations. On the contrary, we tried to eliminate its effect by using the significant terms aggregation in some cases instead of the plain terms aggregation.

This is because what’s globally popular can crowd out what’s genuinely interesting in personalized recommendations. If you used popularity as the basis for recommendations, every user would be recommended the same thing and the recommendations wouldn’t be personalized. This problem is referred to as the “Oprah Book Club” problem: a bias towards cross-cutting popularity that isn’t particularly useful.

However, what is interesting to a lot of people may well be interesting to a specific user. So we can have a separate “popular” tab in our app, where, no matter what the article’s genre, topic, event, etc. is, content is weighted by users’ interactions, e.g. like count, share count, view count, etc.

So we simply need an ElasticSearch query that ranks content by its interaction volume.
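A sketch of such a query, assuming the interaction counters are stored on each document under hypothetical fields like “likes_count”, “shares_count”, and “views_count”:

{
   "query": {
      "function_score": {
         "query": {
            "match_all": {}
         },
         "functions": [
            {
               "field_value_factor": {
                  "field":"likes_count",
                  "factor":2,
                  "missing":0
               }
            },
            {
               "field_value_factor": {
                  "field":"shares_count",
                  "factor":3,
                  "missing":0
               }
            },
            {
               "field_value_factor": {
                  "field":"views_count",
                  "modifier":"log1p",
                  "missing":0
               }
            }
         ],
         "score_mode":"sum",
         "boost_mode":"replace"
      }
   }
}

The factors above merely weigh shares higher than likes and dampen raw view counts with a log; they are assumptions, not tuned values.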

“Local News” Scenario

Users may be more interested in things close to them, surrounding them. So we will have a local news tab for things taking place in their current location.

The user’s current geolocation can be defined explicitly by him or detected by the app and saved as part of his profile. To get the local news, we simply need to query news articles by the place they took place in, where this piece of information, like the other semantics, is extracted offline by an NLP service.
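A sketch, assuming each news document carries a hypothetical “location” field extracted offline and a “published_at” date, filtered against the user’s geolocation and sorted by recency:

GET recs/article/_search
{
   "query": {
      "bool": {
         "filter": [
            {
               "term": {
                  "location":"lebanon"
               }
            }
         ]
      }
   },
   "sort": [
      {
         "published_at": {
            "order":"desc"
         }
      }
   ]
}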

“Trends” Scenario

What is interesting to a user can also be global or local trends, which are not the same as what is popular: this is content that belongs to trending topics.

Trending topics would be determined offline using an NLP service too; then we would query the content that belongs to them.

“Recommended full read” Scenario

These are articles with good content that is worth reading in full.

Independently of the user’s interests, this content is recommended according to the metrics discussed in a previous post, like novelty, freshness, user engagement, etc.

These metrics can be calculated for each piece of content offline using NLP services, and their scores can then be incorporated into ElasticSearch’s relevance formula using a function score, just as in the personalized ranking above.

Conclusion

In this post, we scoped “RomCom”, Almeta’s personalized recommendation system, technically. We translated the recommendation problem into a search problem and, using ElasticSearch as our search engine, showed a strategy for implementing each recommendation scenario.

Did you know that we use all of this and other AI technologies in our app? See what you’re reading now applied in action: try our Almeta News app. You can download it from Google Play or Apple’s App Store.

Further Reading

  1. https://www.elastic.co/blog/looking-at-content-recommendation-through-a-search-lens
  2. https://qbox.io/blog/mlt-similar-documents-in-elasticsearch-more-like-this-query
  3. https://dzone.com/articles/high-quality-recommendation-systems-with-elastic-1
  4. https://dzone.com/articles/high-quality-recommendation-systems-with-elastic-2
  5. https://smart-factory.net/product-recommendation-with-machine-learning-using-elasticsearch/
  6. https://www.youtube.com/watch?v=ZoTfTDAwEkY
