AWS CloudSearch Service

What is the common denominator between a news aggregator, an electronic shopping site, and a music streaming service? All of these applications collect data, and in all of them users want a search service in order to find specific items in that data collection.

Whatever application you are building, in this article we will look at one option for setting up your search service easily, especially if you already rely on AWS.

Amazon CloudSearch is a fully managed service in the AWS Cloud that lets you easily set up, manage, and scale a search solution for your website or application.

Why CloudSearch

There are several factors that might lead you to choose CloudSearch to build your search service:

Simplicity

A major factor in selecting CloudSearch is how much it simplifies the process of setting up a search service. Most of the popular alternatives, such as Solr or Lucene, require some level of technical experience to build even an initial search service. With CloudSearch, the whole process can be done in less than an hour.
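As an illustration of how little setup is involved, here is a minimal sketch using the Python SDK (boto3); the domain name and field definition are made up for this example:

```python
import boto3

# Control-plane client for creating and configuring search domains.
client = boto3.client("cloudsearch", region_name="eu-west-1")

# Create the search domain (it takes a few minutes to become active).
client.create_domain(DomainName="news-articles")

# Define a text field to index; other supported types include
# literal, int, double, date, and latlon.
client.define_index_field(
    DomainName="news-articles",
    IndexField={"IndexFieldName": "title", "IndexFieldType": "text"},
)
```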

Scalability

Another issue to address when building a search service is managing its resource requirements. These manifest mainly in the size of the inverted index (the ability to store more documents) and the computational resources needed to serve customers' search requests.

This might require allocating large resources up front to store these documents, or even distributing your index across several nodes in a cluster.

However, with CloudSearch, this whole process is reduced to configuring your environment and letting AWS handle the allocation of the instances.

A search domain has one or more search instances, each with a finite amount of RAM and CPU resources for indexing data and processing requests. You can configure scaling options to control the instance type that is used, the number of instances your search index is distributed across (partition count), and the number of replicas of each index partition (replication count). The single limitation is that all instances in a domain are always of the same type.
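In boto3, all three options map to a single configuration call; a minimal sketch with illustrative values:

```python
import boto3

client = boto3.client("cloudsearch", region_name="eu-west-1")

# Note: per the AWS docs, a preset partition count is only honored
# for the search.m3.2xlarge instance type.
client.update_scaling_parameters(
    DomainName="news-articles",
    ScalingParameters={
        "DesiredInstanceType": "search.m3.2xlarge",  # instance type
        "DesiredPartitionCount": 2,    # split the index across 2 partitions
        "DesiredReplicationCount": 2,  # keep 2 copies of each partition
    },
)
```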

These three parameters affect your search service in different ways:

Instance Type

Smaller instances are cheaper. However, you should consider allocating larger instances in two situations:

  • Increasing upload capacity: this refers to the maximum volume of documents that can be added to the search service and re-indexed; smaller instances have lower upload limits (fewer documents).
  • Improving search speed: intuitively, a larger instance means more CPU and thus faster queries. It is also possible to improve performance by increasing the partitioning of your index.

Replication

CloudSearch natively supports auto-scaling to handle increases in traffic; this is done by fully replicating the whole inverted index.

This process can be particularly costly (see the pricing section for details). However, replicas also increase the fault tolerance of the system: if a single node fails, the other replicas or partitions remain functional.

Availability

Apart from replicating nodes to scale the service, you might also want to run instances in more than one location so that the service stays available to users even if one data center fails.

CloudSearch supports this through the Multi-AZ option: the service provisions and maintains extra instances for your search domain in a second Availability Zone to ensure high availability. A domain can be deployed in at most two Availability Zones.

If your domain is running on a single search instance, enabling the Multi-AZ option adds a second search instance in a different Availability Zone, which doubles the cost of running your domain. Similarly, if your index is split across multiple partitions, a new instance is deployed in the second Availability Zone for each partition.
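Enabling the option is a one-line configuration change; a sketch in boto3 with an illustrative domain name:

```python
import boto3

client = boto3.client("cloudsearch", region_name="eu-west-1")

# Turn on Multi-AZ: CloudSearch adds instances in a second Availability
# Zone, which roughly doubles the instance cost of the domain.
client.update_availability_options(DomainName="news-articles", MultiAZ=True)
```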

Management and Security

As a SaaS offering, CloudSearch can save a ton of manual work in managing and expanding the cluster of servers behind the search service through auto-scaling. It also provides baseline security through HTTPS and IAM roles.

Search Features

  • Autocomplete suggestions
  • Free text search and Boolean operations (AND, OR, …)
  • Faceted (field-based) search: searching on a specific field of the indexed data. The indexing process uses the JSON format and indexes objects rather than plain text, which makes it possible to restrict a search to any field of the JSON object, e.g. to search by title, body, metadata, tags, and so on. This can also be used to filter search results, for example restricting text search results by date.
  • Field weighting: weighting the impact of different fields of the indexed object on the overall relevance score. For example, hits in the title field might be considered more important than hits in the article body.
  • Customizable relevance ranking and query-time rank expressions
  • Geospatial search: searching in fields that contain longitude and latitude information.
  • Highlighting: generally speaking, "highlighting" means marking up a document to visually indicate the words a visitor used to perform a search. With some search engines, an advantage of this kind of highlighting is that when you open the document, it "jumps" to the first instance of the word.
  • Support for 34 languages.
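Several of these features come together in a single query. A sketch against a domain's search endpoint; the endpoint URL and the field names (title, tags, published_at) are made up for illustration:

```python
import boto3

# Data-plane client: queries go to the domain's own search endpoint
# (the URL below is a placeholder).
search = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://search-news-articles-xxxx.eu-west-1.cloudsearch.amazonaws.com",
)

response = search.search(
    query="climate change",                                # free text search
    queryParser="simple",
    filterQuery="published_at:['2019-01-01T00:00:00Z',}",  # restrict by a date field
    facet='{"tags": {"sort": "count", "size": 10}}',       # faceted search
    highlight='{"title": {}}',                             # highlight hits in title
    size=20,
)
for hit in response["hits"]["hit"]:
    print(hit["id"])
```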

Limitations

  • Text processing: the system can index text-based fields in 34 different languages. However, the ability to customize the text processing is extremely limited: it is not possible to add custom processing steps such as stemming, tokenization, or stop-word removal, and while the default options can give acceptable initial results, they are far from optimal.
  • Document uploading: documents should be uploaded to the service and indexed in batches in order to save processing and RAM costs. Each batch has a maximum size of 5 MB, and each individual document is limited to 1 MB.
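A sketch of such a batch upload in CloudSearch's JSON batch format; the document endpoint URL and the fields are illustrative:

```python
import json
import boto3

# Data-plane client for the domain's document endpoint (placeholder URL).
docs = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://doc-news-articles-xxxx.eu-west-1.cloudsearch.amazonaws.com",
)

# A batch can mix add and delete operations; keep the serialized batch
# under 5 MB and each individual document under 1 MB.
batch = [
    {"type": "add", "id": "article-1",
     "fields": {"title": "Example headline", "tags": ["news"]}},
    {"type": "delete", "id": "article-0"},
]

docs.upload_documents(
    documents=json.dumps(batch).encode("utf-8"),
    contentType="application/json",
)
```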

Pricing

Customers are billed according to their monthly usage across the following dimensions (note that new users get a 12-month free usage tier):

  • Search instances: see the following section.
  • Document batch uploads: $0.10 per 1,000 Batch Upload Requests (the maximum size for each batch is 5 MB).
  • IndexDocuments requests: the re-indexing step updates the index after adding, modifying, or deleting documents. This step costs $0.98 per GB of data stored in your search domain.
  • Data transfer: the amount of data sent to the service or retrieved from it, which essentially depends on the traffic size. The costs are shown in the following table. Note, however, that data transferred between CloudSearch and other AWS services in the same region is free.
Data Transfer IN
  All data transfer in: $0.00 per GB

Data Transfer OUT
  Up to 1 GB / month: $0.00 per GB
  Next 9.999 TB / month: $0.09 per GB
  Next 40 TB / month: $0.085 per GB
  Next 100 TB / month: $0.07 per GB
  Greater than 150 TB / month: $0.05 per GB

Search Instances

The following table lists the recommended data size for each instance type, along with the hourly rate of each instance using the eu-west region as a reference. Note that pricing is per instance-hour consumed for each search instance, from the time an instance is launched until it is terminated; each partial instance-hour consumed is billed as a full hour.

Instance name       Recommended data size   Hourly rate
search.m1.small     up to 1 GB              $0.063
search.m3.large     1–8 GB                  $0.208
search.m3.xlarge    8–16 GB                 $0.416
search.m3.2xlarge   16–32 GB                $0.832

If your index is larger than this, you can partition it across multiple instances.

Use Case

To understand the impact of using AWS CloudSearch, let us compare its costs against a conventional self-hosted Solr service. For now, let us ignore the human development and management costs of such a service:

Assume we have a DynamoDB table containing a set of text documents, where the size of the table including the documents and the various fields (creation_date, id, …) is 200 MB and the number of documents is around 17k.

  • For this volume of documents, we can use a single instance of the small type, so the monthly instance cost amounts to 24 × 30 × $0.063 = $45.36. (This assumes international traffic from multiple regions, where the instance needs to stay on the whole time; if that is not the case, the cost drops dramatically, e.g. for 8 active hours per day it boils down to about $15.)
  • Uploading the documents to the service requires 200 / 5 = 40 batches; at $0.10 per 1,000 batch upload requests, this costs well under one cent.
  • For indexing the documents, since the overall data size is less than 1 GB, the IndexDocuments cost is $0.98.
  • Finally, for the outbound traffic: since we are using DynamoDB, it is enough to return a list of document ids as the search results. Assuming 20 results per query, where each id is an integer, the response per query is about 80 bytes; even with 1M requests per month, the data transfer cost is still $0.00 (well within the free 1 GB).

The overall cost comes to just over $46 per month.
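As a sanity check, here is the arithmetic above in a few lines of Python, using the prices listed in the pricing section:

```python
# Back-of-the-envelope monthly cost for the use case above (eu-west prices).
HOURS_PER_MONTH = 24 * 30

instance = HOURS_PER_MONTH * 0.063   # search.m1.small, always on -> $45.36
uploads = (200 / 5) / 1000 * 0.10    # 40 batches at $0.10 per 1,000 requests
indexing = 0.98                      # < 1 GB of data at $0.98 per GB
transfer = 0.0                       # ~80 MB/month, within the free 1 GB

print(f"${instance + uploads + indexing + transfer:.2f} per month")  # ≈ $46.34
```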

In contrast, a manually built and managed, self-hosted search service can run on AWS EC2 on a considerably more powerful m5.large (8 GB RAM, 2 vCPUs) for around $77 per month using on-demand pricing. Similar money goes further with providers like DigitalOcean, where better specifications (8 GB RAM, 4 vCPUs) can be obtained for about $80 per month. However, this ignores the human effort of actually developing the service using frameworks such as Solr.

What About Other Services?

The comparison below summarizes, very concisely, the differences between several managed search services and how they stack up against a traditional player like Solr (CS = AWS CloudSearch, ES = AWS Elasticsearch Service, Azure = Azure Cognitive Search). Note that there are several other services we did not cover in order to keep this article short; some of the main contenders are Algolia and Swiftype.

  • Boolean operations: CS yes; ES yes; Azure yes; Solr yes.
  • Facets: CS yes; ES yes; Azure yes; Solr yes, very advanced via pivot faceting.
  • Auto-complete: CS limited to a single lexicon file; ES yes; Azure yes; Solr yes.
  • Relevance customization: CS limited; ES yes; Azure yes; Solr yes.
  • Field-based search: yes for all four.
  • Geospatial search: yes for all four.
  • Highlighting: yes for all four.
  • Support for Arabic: CS yes; ES yes, and customization is possible; Azure yes; Solr yes, but you need to do the processing yourself.
  • Customization: CS limited; ES yes, to a fine degree; Azure yes; Solr yes, very detailed customization down to the workflow of the core.
  • Auto-correction ("did you mean"): CS no; ES yes; Azure yes; Solr yes.
  • Text search customization: CS limited; ES yes; Azure yes, with custom lexical analyzers and language analyzers from Lucene or Microsoft; Solr yes.
  • Ease of development and deployment: CS yes, fairly trivial; ES somewhat easy to start with (an initial V1) given basic Elasticsearch knowledge, but a fully functional version needs a full ELK stack and a solid background; Azure somewhat similar to ES, since both are based on the same core; Solr no.
  • Auto-scaling: CS yes; ES no, but manual scaling is very simple via the API or the AWS console; Azure no; Solr no, though some hosting services help with this.
  • Auto-suggestion ("more like this"): CS no; ES yes; Azure yes; Solr yes.
  • Restarting core: CS no; ES yes; Azure no; Solr yes, and some operations such as changing the schema files or modifying the configuration require a restart or re-indexing.
  • Failure tolerance: CS yes, embedded in the service; ES yes, via the alert system; Azure yes, with up to a 99.9% SLA; Solr no, it needs to be maintained manually (assuming a non-managed deployment).
  • Additional features: ES integrates with Kibana for visualization and Logstash as a data pipeline, and offers built-in alerting, SQL-like querying, network isolation with Amazon VPC, at-rest and in-transit encryption using AWS KMS keys, authentication and access control with Amazon Cognito, data durability through hourly snapshots retained for 14 days at no additional cost, and up to 3 replicas across three different Availability Zones (versus CloudSearch's two). Azure adds OCR and image search, an integrated analytics and visualization panel, search across multiple formats including PDFs, and data enrichment with machine learning models.
  • Free tier: CS see the pricing section above; ES a free tier with no time boundary, covering up to 750 hours per month of a t2.small.elasticsearch instance and 10 GB per month of optional EBS storage; Azure yes, but very limited (not sufficient to cover the use case above, for example); Solr no.
  • Pricing for the previous use case (eu-west region): CS about $46 (see the use case above); ES $29.18 on-demand, and reserved instances can bring up to 52% savings; Azure the starting plan can cost up to $72.72, though its starting specifications are higher; Solr see the use case section above.

Conclusions

  • AWS CloudSearch gives non-technical people a fairly simple and cost-effective way to add search capabilities to their services.
  • If your data does not require detailed text analysis features or advanced NLP, then AWS CloudSearch could be the perfect PaaS for you.
  • For small teams and low traffic, the AWS Elasticsearch Service could be a more cost-efficient alternative, as it can trim costs (especially with long-term commitments) at the expense of increased development and management complexity.
  • Developing your search service from scratch on frameworks like Solr is always an option and can often be cheaper from the servers' point of view, but you should only consider it if you can afford the human cost of development, management, and maintenance, or if your data is extremely sensitive and thus cannot be entrusted to cloud services.
  • Azure Cognitive Search offers a wider range of features in a much simpler manner, but at increased cost. You might opt for it if you have a large business with multiple sub-systems and data types in place.
  • For a small team with limited traffic, we suggest the following path:
    • Deploy an AWS CloudSearch service very quickly, accepting the limited ability to tweak it. If it works well, we have shipped fast and can do more in-depth research later.
    • If that option falls short, move to the AWS Elasticsearch Service for a reduced price and more control (but more complexity).

Did you know that we use all this and other AI technologies in our app? See what you have just read applied in action: try our Almeta News app. You can download it from Google Play: https://play.google.com/store/apps/details?id=io.almeta.almetanewsapp&hl=ar_AR
