Sequential Clustering

Search queries, passport scans, barcode scans, your online shopping history, your photos on Instagram, your tweets on Twitter, voice messages, daily news articles, and much more… All of these contain a huge amount of data.

Data generation is growing exponentially. This is where the term Big Data comes in: it characterizes this data by its large volume, its high velocity (the speed at which data is generated), and its wide variety.

But without a way to extract insights from the data generated every day, it would be worthless!

Due to the growing need for big data analysis, many data mining algorithms have been developed that allow us to make sense of these ever-increasing amounts of information in real time.

Our focus in this post is on the velocity attribute of big data. We briefly discuss how to handle infinite data streams, with a major focus on unsupervised learning algorithms.

What’s The Difficulty of Handling Data Streams?

Let’s consider a traditional learning process. As usual, the process starts by getting a static training dataset, with labels if it’s a supervised learning problem. The next step would be to perform feature engineering, scaling, selection, etc. We can then fit multiple machine learning models on the data, try to fine-tune their parameters, and finally select the best one to be deployed and make live predictions.

A model of this kind is limited to the patterns it was trained on and cannot adapt to changes in the incoming data over time. Every time you want to adapt it to new changes, you are forced to repeat the entire training process from scratch.

This is computationally expensive and requires that all relevant data is stored for periodic re-evaluation.

So, How to Handle These Streams?

The solution is known as online learning, or incremental learning. In contrast to traditional batch learning, it embraces the fact that data arrives sequentially and uses each new observation to update the current best model for future data.
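As a minimal illustration of the incremental idea (not tied to any of the tools mentioned in this post), the sketch below keeps an online estimate of a mean. The class name `RunningMean` and the toy stream are assumptions made for this example. The point is that the model is updated one observation at a time and never needs the past data to be stored.

```python
class RunningMean:
    """Online estimate of a mean: stores only a count and the current estimate."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental update: shift the estimate toward x by 1/count of the gap.
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean


# A short list stands in for an unbounded stream of observations.
stream = [4.0, 8.0, 6.0, 2.0]

model = RunningMean()
for x in stream:
    model.update(x)  # each observation is seen once and then discarded

print(model.mean)  # 5.0
```

A batch approach would have to keep all four values and recompute the average whenever a new one arrives. Here the cost per observation is constant, which is what makes the approach viable for streams.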

Sequential Clustering

Sequential clustering, also known as incremental, online, or stream clustering, is a family of unsupervised online learning algorithms.

It’s a suitable choice when we aim to recognize patterns in an unordered, infinite and evolving stream of observations. It enables updating the existing clusters and integrating new observations into the existing model by identifying emerging structures and removing outdated structures incrementally.
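To make these ideas concrete, here is a toy sketch of a sequential clusterer. It is not one of the published algorithms listed in this post. The class names and the parameters `radius`, `decay`, and `min_weight` are assumptions chosen for illustration. Each new observation either updates the nearest cluster or starts a new one, and clusters that stop receiving data fade out and are removed.

```python
import math


class MicroCluster:
    """A cluster summary: a center and a weight that fades over time."""

    def __init__(self, point):
        self.center = list(point)
        self.weight = 1.0


class SequentialClusterer:
    def __init__(self, radius=1.0, decay=0.1, min_weight=0.1):
        self.radius = radius          # max distance for joining a cluster
        self.fade = 2.0 ** (-decay)   # per-step fading factor for weights
        self.min_weight = min_weight  # clusters below this are dropped
        self.clusters = []

    def learn_one(self, point):
        # 1. Fade every existing cluster so old structure loses influence.
        for cluster in self.clusters:
            cluster.weight *= self.fade

        # 2. Find the nearest existing cluster, if there is one.
        nearest = min(
            self.clusters,
            key=lambda c: math.dist(c.center, point),
            default=None,
        )

        if nearest is not None and math.dist(nearest.center, point) <= self.radius:
            # 3a. Integrate the point: weighted incremental mean of the center.
            nearest.weight += 1.0
            step = 1.0 / nearest.weight
            nearest.center = [
                c + step * (p - c) for c, p in zip(nearest.center, point)
            ]
        else:
            # 3b. Emerging structure: start a new cluster at this point.
            self.clusters.append(MicroCluster(point))

        # 4. Remove outdated clusters whose weight has faded away.
        self.clusters = [
            c for c in self.clusters if c.weight >= self.min_weight
        ]


clusterer = SequentialClusterer(radius=1.0, decay=0.1, min_weight=0.1)
for point in [(0.0, 0.1), (0.2, 0.0), (10.0, 10.1), (0.1, 0.1), (9.9, 10.0)]:
    clusterer.learn_one(point)

print(len(clusterer.clusters))  # 2: one near (0, 0) and one near (10, 10)
```

If the stream later contains only points near (10, 10), the cluster near (0, 0) keeps fading and is eventually dropped. Real algorithms such as DenStream or CluStream use more refined summaries and maintenance rules, but the update, create, and forget steps follow the same general pattern.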

Available Tools for Sequential Clustering

Here is a list of available implementations of sequential clustering algorithms in different languages:

Name | Language | Implemented Algorithms
stream | R package | D-Stream, D-Stream with Attraction, DBSTREAM, BICO and BIRCH
Massive Online Analysis (MOA) | Java framework | StreamKM++, CluStream, ClusTree, DenStream, D-Stream, BICO and COBWEB
streamMOA | R package | Interfaces to the Java implementations of CluStream, DenStream and ClusTree
subspaceMOA | Java framework | PreDeConStream and HDDStream
evoStream | R, C++, Python | Evolutionary stream clustering
BIRCH | C | BIRCH
BIRCH | Python | BIRCH
BICO | C++ | BICO
BICO | Python | BICO
StreamKM++ | C++ | StreamKM++

Conclusion

In this post, we discussed the difficulties of handling data streams and introduced the online learning approach. We then described sequential clustering and listed tools that implement several sequential clustering algorithms, so that you can try them yourself.

If you wish to see a real-life example of sequential clustering, check out our series on event detection to learn how we have used sequential clustering to detect articles describing the same event.

Did you know that we use this and other AI technologies in our app? To see what you have just read applied in practice, try our Almeta News app. You can download it from Google Play: https://play.google.com/store/apps/details?id=io.almeta.almetanewsapp&hl=ar_AR

Further Reading

[1] Carnein, Matthias, and Heike Trautmann. “Optimizing data stream representation: An extensive survey on stream clustering algorithms.” Business & Information Systems Engineering (2019): 1-21.
