User Profiling Using AWS ElasticSearch – RomCom use case.

Personal differences and preferences marks a very important part of our identity, and optimizing the user experiences based on them can be a great tool in improving users engagement.

In our previous post to tackled the issue of personalized recommendations and how can ElasticSearch make the process extreemly simpler. However in order to build a robust personal recomendation system it is paramount to have an idea of each user. Who are they and what do they like. This is commonly refered to as a user profile.

In this post we will present a road map to enabling user profiling with the help of AWS ElasticSearch, AWS Lambda and AWS SNS. we will compare the various possible solutions to takling this issue and leave you to take the decision that best suites yor case.

before you go any further if you are not familiar with suggestion systems and how they can be implemented in ElasticSearch please review our previous post on it.

Task description

Our main Goal here is to build a user profile and then use it in suggestions. This entails the following main steps:

  1. An action is taken by the user, these actions have the following features:
    1. this action can come from multiple platforms, android/ios apps, your website, … 
    2. The types of actions we are following should be dynamic and allow for adding new actions or removing old ones
    3. Actions can be serialized into a pre-defined structure
    4. Furthermore there are a variety of actions types and each have a different impact on the user profile. The following is an initial list of actions we might want to track
      1. Clicking on an article/ event /genre /tag /publisher …
      2. Clicking on an article source/ event description / entity …
      3. Subscribing to a publisher/ genre/ event/ topic/ tag …
      4. Unsubscribing to a publisher/ genre/ event/ topic/ tag …
      5. Liking or disliking an article 
      6. See more like this, don’t see more like this 
      7. Add geo-location 
      8. Sign up 
  2. The action is sent from the platform (app/ web) to intermidatory, which aggregates the information from multiple users and multiple actions per user. 
  3. Then sends the modified data to ElasticSearch to create/update user profile
  4. To generate new suggestions we simply issues queries to the user index in ElasticSearch 

The main variable in this design is the choice of intermedatory service and how it aggregates and indexes data into ElasticSearch to generate users profiles.

Using an ETL stream (LogStash) as an example

These are streams that allow for grabbing data, applying transformations to it and then loads it in data-pool.

This approach is suitable for you in 2 main situations:

  • You only need to apply simple transformations to the actions being regestered, this can come in handy in case you are aggreagating multiple formats from multiple sources
  • You are regestering a lot of actions and want to use a reliable medium to transfer them to ElasticSearch without losing any data.

In the case of LogStash this data pool is ElasticSearch. Main features for using LogStash includes:

Pros of this design:

  • Aggregation filters and rules are configured directly within the stream 
  • No need to store and index historical actions
  • Real time processing and updates to ElasticSearch which means user profiles are always up to date
  • Indexing is done proactivly 

Cons of the Design:

  • Real Time processing, which means that ElasticSearch can be under-pressure as the number of users increases, this can be mitigated to an extent by controlling the indexing interval of ElasticSearch
  • The data transformations are limited to the capabilities of the ETL service
  • Indexing is done implicitly which can reduce control on timing, and call-backs

To illustrate let us consider a case of clicking on an article:

  • The app sends a fire-and-forget call to the stream endpoint containing the action
  • Stream internally filters the input based on the action type and applies the appropriate transformation rules.  For example, in LogStash it is possible to aggregate multiple objects based on a certain field see this example or calculating metrics based on events like in here for a definitive list of all log-stash processing plugins see this 
  • Stream internally indexes its output to ElasticSearch in real time to update user profile
  • Now it is possible to use the updated profile to issue suggestions

The following UML illustrates this use case.

However, The main draw-back of this approach is that only simple input transformations are possible, these are mainly intended for simple input format unification and similar use cases.

Using ElasticSearch Alerts + SNS + Lambda

AWS ElasticSearch supports Alerts, see the documentation. In very short terms. It is possible to create a monitor that will run once every set time based on a schedule, the monitor executes an ES query using DSL then sends the results to one or more Triggers.

Triggers are basically conditions that evaluate the monitor results and if they evaluate to True an Alert is triggered.

Every Alert has a destination, a message body and a message subject.

These Alerts can have multiple destinations including e-mails and AWS SNS. linking them with AWS SNS is great for 2 reasons:

  1. SNS can be linked directly with lambdas or indirectly using AWS SQS
  2. SNS is virtually free when linked with Lambdas for the first 1M pushes per-month

So using the alerts the flow would be something like this:

To illustrate let us reconsider a case of clicking on an article, and how will it be applied using this design:

The process can be split in 2 main parts: indexing actions and processing Actions:

Indexing

  • The app sends a fire-and-forget call to the stream endpoint containing the action
  • The Stream simply indexes its output to ElasticSearch in real time into a temporary index called /Actions, this index will store all the actions that are being registired for a limited period of time. this is possible with the help of time rolled indexes in ElasticSearch.

The UML of this part is illustrated in the follwing graph

Processing

  • We define an Alert on the /Actions index that is triggered whenever a new action/ a new number of actions is indexed.
  • The trigger have an SNS topic as its destination let’s call it for now UserActionsTopic
  • We define a lambda function let us call it UserProfileUpdater
  • UserProfileUpdater would subscribe to UserActionsTopic and by that any Alert sent through SNS will be sent to UserProfileUpdater
  • UserProfileUpdater will be then responsable for processing the actions and modifying the user’s profile
  • Now it is possible to use the updated profile to issue suggestions

What Stream To Use

In the previous UML we assumed that the actions are sent to ElasticSearch through a stream, but what streams should we use?

  1. Active Stream: This represents the previously mentioned ETL-Stream option. an example of this is AWS kinesis or Logstach, this type will allow you to carry out simple transformations on the input to unify them into a single format.
  2. Static Stream: Here the stream only collects data and sends them to Time rolled data lake (in this case the /Actions index)
  3. No Stream at all: instead of sending users actions to the non-ETL stream the app sends the actions to a lambda function that have a single function of indexing the actions into ElasticSearch. This is a good option to save cost for new services.

To choose one of them we need to discuss two major factors regarding the streaming services:

  1. Does the streaming service allow for historical “cold” data processing 
  2. How costly is it 

The following table illustrates the streaming services I reviewed

AWS Kinesis Kinesis +AWS kinesis Data analytics LogStashKafkaApache Spark Streaming + Apache Storm
Aggregation capacityPlain Kinesis supports processing at the record level through filtering Lambda, but can’t aggregate any historical data The data Analytics service supports real time analytics of kinesis data, it allows aggregation of up to one hour historical data depending on the stream throughput and storage type see the limits here It allows aggregation of multiple records in the same request but no aggregation of historical data Kafka will only cache the last few minutes’ worth of data in-memory, which means attempting to read ‘cold’ data from Kafka will cause significant production issuesI haven’t find any source that supports “cold data” processing
Metrics and analyticsSupports accumulated general historical metrics such as average response time, or top 10 results in a game but can’t do user specific processingBoth general and categorized (user specific) metrics are supportedOnly general metrics are supportedOnly general metrics are supportedNo
serverlessyesyesNoyesNo
Cost per month (Assuming our current use case with less than 1M actions per month and users interaction 24/7 )For a throughput of 1Mbps cost is 13.4$  + 0.018$ per additional million put requestKinesis cost  + 102.2$ for a single KPU that uses SQL and its running storage Hosting on an EC2 nano size will cost around 4$ per months151$Not calculated
Supported LanguagesEvery language supported by lambdaSQL or JavaLog stash configuration syntaxMultiple languages
ComplexitysimpleRelatively simpleRelatively simplecomplicated

Based on this you can determin the stream service to use based on your service requirements:

  • In case you have a limited number of users or you are indexing a limited number of actions per-month then you should opt to the Lambda as a stream option since the cost of it would be much lower
  • If your users increase and the cost of Lambda starts to pile up or if you need to do small tranformations on the actions input then you might wanna use something like log-stash or kenisis
  • As your users base increases you might start considering AWS analytics option

Conclusion

In this post we explored various methods to enable user profiling in order to empower your suggestions system, we believe that using AWS alerts with SNS and Lambda as a processing pipline and employing logstash to index the users actions into ElasticSearch can be adequate for the most common use cases.

Leave a Reply

Your email address will not be published. Required fields are marked *