AWS ElasticSearch – Implementation Plan

Many web-based applications follow a simple pattern: collect data, process it to extract value, then let users access the results. In most of these applications, users also want a search service so they can retrieve specific items from the data collection.

In our previous articles we looked at several possible frameworks for developing a search service quickly.

In this article we explore how you can deploy your own search service with the help of AWS ElasticSearch and AWS CloudFormation, in a couple of hours and possibly for free.

Configuring and Building the Domain

An AWS ElasticSearch domain is equivalent to an ElasticSearch cluster; the number of instances and their types depend heavily on the size of your data. In this tutorial we assume your data already lives in an AWS DynamoDB table. AWS ElasticSearch supports connectors for several other data sources; you can read about the other options here.

The first step is to calculate your storage needs. Let us start with a simple example: assume you want to build a small search service for an e-book store, and you have a small DynamoDB table of about 300 MB. According to this, we will need around 500 MB of storage, which is well below the free-tier limit of 10 GB per month.
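As a back-of-the-envelope check, the sizing guidance in the AWS documentation boils down to a simple formula: source data × (1 + replica count) × 1.45, where the 1.45 factor covers index overhead plus reserved OS/service space. A minimal sketch using the numbers from the e-book example (the replica count of zero is an assumption for a single-node setup):

```python
# Back-of-the-envelope storage estimate, following the sizing formula from the
# AWS ElasticSearch docs: source data x (1 + replicas) x 1.45, where the 1.45
# factor accounts for index overhead plus reserved OS/service space.
source_data_mb = 300   # size of the DynamoDB table in the e-book example
replicas = 0           # assumption: single-node setup with no replica shards

min_storage_mb = source_data_mb * (1 + replicas) * 1.45
print(f"Minimum storage: {min_storage_mb:.0f} MB")  # ~435 MB, i.e. ~500 MB with headroom
```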

The next step is to find the optimal number of shards for your cluster, which again depends on the size of your data. A good rule of thumb is to keep shard size between 10 and 50 GB. In our previous example, with such a small data size and assuming growth stays below 50 GB for the foreseeable future (i.e. around 100 times our current size), a single shard is enough. You can review this for further details.
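The same reasoning can be written as a quick calculation. A small sketch, assuming the common heuristic of (source data + room to grow) × 1.1 / target shard size; the growth figure is an assumption for the e-book example:

```python
import math

# Shard-count heuristic: (source data + room to grow) x 1.1 / target shard size,
# keeping each shard within the recommended 10-50 GB range.
source_gb = 0.5          # current index size, from the storage estimate above
growth_gb = 40.0         # assumption: generous growth while staying under ~50 GB total
target_shard_gb = 50     # upper end of the recommended shard size

shards = max(1, math.ceil((source_gb + growth_gb) * 1.1 / target_shard_gb))
print(shards)            # -> 1 shard is enough at this scale
```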

Afterwards you need to decide on the type and number of instances to use. This is a crucial step, since the whole pricing plan is based on hourly instance rates. You can use the guidelines here to make an initial informed guess, but in reality you will need to deploy the cluster and re-tune it for your needs. The free tier offers t2.small instances, which are good enough for initial experiments as long as your data is limited; for the book-store example, a single instance is more than enough, and we can easily add more instances later. One thing to note is the maximum storage size per instance type: for t2.small this limit is 35 GB.

The next step is selecting your security options. It is recommended to at least use HTTPS for external requests. More advanced options include node-to-node encryption (encrypting traffic between the instances of the same cluster within AWS) and encryption at rest (encrypting the data stored on your nodes). You might need these options for highly sensitive data. Note, however, that they cannot be activated on an already built domain, so you should decide at this point: changing the security options of an existing AWS ElasticSearch cluster requires creating a new cluster and migrating to it.
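For reference, these options map onto a handful of fields in the domain configuration. A hedged sketch of what they look like as boto3 parameters (the CloudFormation resource uses the same names); note that small instance types such as t2 do not support encryption at rest, so these settings assume a larger instance:

```python
# Security-related fields of an AWS ElasticSearch domain, as they appear in
# boto3's create_elasticsearch_domain. Node-to-node and at-rest encryption
# must be chosen at creation time and cannot be enabled later.
security_options = {
    "DomainEndpointOptions": {"EnforceHTTPS": True},    # reject plain-HTTP requests
    "NodeToNodeEncryptionOptions": {"Enabled": True},   # encrypt intra-cluster traffic
    "EncryptionAtRestOptions": {"Enabled": True},       # encrypt data on the nodes
}
```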

The final option you might consider is availability zones: replicas of your cluster placed in separate data centers within the same AWS region, in order to provide higher availability. On the free tier you can only use a single availability zone, but you can pay a bit more to activate up to two more zones.

Uploading And Indexing Current Data

Since we are using DynamoDB, we can use its stream service to propagate table updates to the ElasticSearch domain through a Lambda function, following this approach. Uploading the existing data is possible either through the built-in REST API or, as mentioned in the previous article, using Kinesis. The REST API is a good fit when the data size is limited, while Kinesis may be a better option for larger migration operations.
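To make the flow concrete, here is a minimal sketch of such a stream-processing Lambda. It assumes a hypothetical books index, a string partition key named id, and a domain access policy that already authorizes the function; a production version would sign its requests rather than call the endpoint directly:

```python
import json
import os
import urllib.request

ES_ENDPOINT = os.environ["ES_ENDPOINT"]  # e.g. https://my-domain.eu-west-1.es.amazonaws.com
INDEX = "books"                          # hypothetical index name

def handler(event, context):
    """Mirror DynamoDB stream records into the ElasticSearch index."""
    for record in event["Records"]:
        doc_id = record["dynamodb"]["Keys"]["id"]["S"]  # assumes a string key named "id"
        if record["eventName"] == "REMOVE":
            method, body = "DELETE", None
        else:  # INSERT or MODIFY: index the new image of the item
            image = record["dynamodb"]["NewImage"]
            doc = {k: next(iter(v.values())) for k, v in image.items()}  # strip DynamoDB types
            method, body = "PUT", json.dumps(doc).encode()
        req = urllib.request.Request(f"{ES_ENDPOINT}/{INDEX}/_doc/{doc_id}",
                                     data=body, method=method,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)
```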

One key aspect to consider is the shape of the documents to be indexed. ElasticSearch can index JSON documents with multiple fields; however, you should only index the fields you believe are eligible for search queries.
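For the e-book example, a hypothetical mapping might analyze only the searchable fields and keep everything else out of the analyzers; the field names and endpoint below are placeholders:

```python
import json
import urllib.request

ES_ENDPOINT = "https://my-domain.eu-west-1.es.amazonaws.com"  # placeholder endpoint

# Hypothetical mapping for the e-book index: full-text search on title/author/body,
# exact matches on the ISBN, and no indexing at all for fields that are never searched.
mapping = {
    "mappings": {
        "properties": {
            "title":     {"type": "text"},
            "author":    {"type": "text"},
            "body":      {"type": "text"},
            "isbn":      {"type": "keyword"},                  # exact-match only
            "cover_url": {"type": "keyword", "index": False},  # stored, never searched
        }
    }
}
req = urllib.request.Request(f"{ES_ENDPOINT}/books", data=json.dumps(mapping).encode(),
                             method="PUT", headers={"Content-Type": "application/json"})
urllib.request.urlopen(req)
```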

Monitoring and Alerting

Kibana is a basic dashboard for viewing search statistics and insights. We can follow this to configure it; we might also need to modify the ES IAM roles from here, or use Cognito as described here.

After Kibana is ready, we can set up alerting for cluster issues by following this and this.

Dev, Staging and Deployment

Having three separate stages for development, staging and production is an essential part of a CI/CD pipeline; however, this is not directly supported by AWS. For ElasticSearch, a different scheme in the blue/green style is supported: one environment is live while the other is being modified. AWS applies this automatically when we perform breaking changes to the configuration, such as changing instance types or counts, in order to prevent downtime.

However, if you need three stages, it is possible to achieve this with any AWS service by simply replicating your stack in three different regions. The cost of the extra regions will be minimal, since only one of them serves real users.

Querying

The simplest option is for the front end to query AWS directly; see this. However, the option suggested by AWS is to set up a small proxy server that accepts requests from clients. The proxy server (usually a small Lambda function) is authenticated using IAM, so it can call ElasticSearch directly. This setup prevents public access to the Elastic cluster, which would otherwise expose non-search endpoints and could therefore damage the ES service; with a server authenticated against a specific IAM role, the IAM policy can restrict which actions that server may perform. The suggested solution is to use Amazon API Gateway to restrict users to a subset of the ElasticSearch APIs, and AWS Lambda to sign requests from API Gateway to Amazon ES.
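A minimal sketch of such a proxy function, using botocore's SigV4 signer; the domain endpoint, region, and books index are placeholders, and only the _search endpoint of that one index is reachable through it:

```python
import urllib.request

import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

ES_HOST = "my-domain.eu-west-1.es.amazonaws.com"  # placeholder domain endpoint
REGION = "eu-west-1"                              # placeholder region
URL = f"https://{ES_HOST}/books/_search"          # the only ES endpoint we expose

def handler(event, context):
    """Forward a search body from API Gateway to ES, signed with the Lambda's role."""
    body = event.get("body") or "{}"              # API Gateway proxy integration payload
    creds = boto3.Session().get_credentials()
    aws_req = AWSRequest(method="POST", url=URL, data=body,
                         headers={"Content-Type": "application/json", "Host": ES_HOST})
    SigV4Auth(creds, "es", REGION).add_auth(aws_req)  # attach the SigV4 auth headers
    req = urllib.request.Request(URL, data=body.encode(), method="POST",
                                 headers=dict(aws_req.headers))
    with urllib.request.urlopen(req) as resp:
        return {"statusCode": 200, "body": resp.read().decode()}
```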

Customizing Elastic Pipeline

This customization covers several options that are independent of AWS and mostly related to ElasticSearch configuration. In this step you are most likely going to modify the following:

  • Configuring the ingest pipeline: the processing steps applied to the text before indexing. ElasticSearch supports Arabic directly, but the default processing pipeline might not be optimal; you might wish to use custom algorithmic lemmatizers or normalizers (see the sketch after this list).
  • Configuring the relevance formula: e.g. deciding whether hits in the title are more or less important than hits in, say, the article body.
  • Configuring synonyms: this can really boost the performance of the search service.
  • Configuring Facets.
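As a sketch of what the analyzer and synonym items might look like, assuming the hypothetical books index and placeholder endpoint from earlier (analysis settings have to be supplied when the index is created, or to a closed index):

```python
import json
import urllib.request

ES_ENDPOINT = "https://my-domain.eu-west-1.es.amazonaws.com"  # placeholder endpoint

# Hypothetical index settings: a custom analyzer that lowercases tokens and
# expands a small synonym list, applied to the "title" field.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "book_synonyms": {
                    "type": "synonym",
                    "synonyms": ["ebook, e-book, digital book"],
                }
            },
            "analyzer": {
                "book_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "book_synonyms"],
                }
            },
        }
    },
    "mappings": {
        "properties": {"title": {"type": "text", "analyzer": "book_analyzer"}}
    },
}
req = urllib.request.Request(f"{ES_ENDPOINT}/books", data=json.dumps(settings).encode(),
                             method="PUT", headers={"Content-Type": "application/json"})
urllib.request.urlopen(req)
```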

All of these steps are optional and may not be needed if the search service performs acceptably with the default options.

IaC Implementation

There are great merits in using Infrastructure as Code for your services: it allows you to easily modify, rebuild or replicate your infrastructure. This style is extremely valuable when using services like AWS.

Now let us look at how the previous steps can be translated into IaC:

  • Building the Elastic cluster (domain) and configuring it: this is extremely simple to do using AWS CloudFormation (see the sketch after this list).
  • Linking ElasticSearch with DynamoDB: it is possible to create a stream from DynamoDB to a custom Lambda function. The stream can specify a batch size for the Lambda calls.
  • The proxy API: this is the backend server that links the app (or any other application outside AWS) with the ElasticSearch service, in order to ensure control over the ElasticSearch operations (limiting them to querying) and to ensure security. This is a simple Lambda function in front of ElasticSearch.
  • Configuring ElasticSearch: recall that AWS ElasticSearch simply hosts and manages an ElasticSearch service; any customization of ElasticSearch behavior has to be done through HTTP requests to its endpoints (see the earlier example of specifying a custom text analyzer). It is not possible to include these configurations in the CloudFormation YAML files. One way around this is to add these requests as a configuration step in a CI/CD service like CircleCI that runs after the deployment step.
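As a sketch of the first item, the same domain configuration can also be expressed through boto3; the fields map one-to-one onto the Properties of an AWS::Elasticsearch::Domain resource in a CloudFormation template. Names, versions and sizes below are placeholders for the e-book example:

```python
import boto3

es = boto3.client("es", region_name="eu-west-1")  # region is a placeholder

# Create the free-tier-sized domain from the e-book example; the same fields
# appear under Properties of an AWS::Elasticsearch::Domain resource.
es.create_elasticsearch_domain(
    DomainName="ebook-search",                    # hypothetical domain name
    ElasticsearchVersion="7.10",
    ElasticsearchClusterConfig={
        "InstanceType": "t2.small.elasticsearch",
        "InstanceCount": 1,
        "ZoneAwarenessEnabled": False,            # single availability zone
    },
    EBSOptions={"EBSEnabled": True, "VolumeType": "gp2", "VolumeSize": 10},
)
```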

Did you know that we use all of this, along with other AI technologies, in our app? Take a look at what you are reading about applied in action: try our Almeta News app. You can download it from Google Play or Apple's App Store.
