K-means Clustering & Its Real Use-Cases in the Security Domain

Varnikaverma
6 min read · Jul 19, 2021

K-means clustering is a widely used unsupervised machine learning algorithm. It partitions unlabeled data into groups and is often the first tool applied to unsupervised learning problems.

The history

The term “k-means” was first used by James MacQueen in 1967 in his paper “Some Methods for Classification and Analysis of Multivariate Observations”. The standard algorithm was first proposed by Stuart Lloyd at Bell Labs in 1957 as a technique for pulse-code modulation, although it was not published as a journal article until 1982. E. W. Forgy published essentially the same method in 1965, which is why it is sometimes called the Lloyd-Forgy method.

What is k-means?

Clustering is the task of dividing a population or set of data points into groups such that points in the same group are more similar to each other than to points in other groups. In simple words, the aim is to segregate data points with similar traits and assign them to clusters. The goal of the k-means algorithm is to find groups in the data, with the number of groups represented by the variable k. The algorithm works iteratively to assign each data point to one of k groups based on the features that are provided. In the reference image, k=2, and two clusters are identified from the source dataset.

The outputs of executing a k-means on a dataset are:

  • k centroids: one centroid for each of the k clusters identified in the dataset.
  • Labels for the complete dataset: each data point is assigned to exactly one of the clusters.

How does the K-means clustering algorithm work?

K-means clustering tries to group similar items into clusters. It measures the similarity between items, typically as the distance between them, and groups items that are close together. The algorithm works in three steps:

  1. Select the value of k.
  2. Initialize the centroids.
  3. Assign each point to the group of its nearest centroid, then move each centroid to the average of its group. Repeat this step until the groups stop changing.
  • Figure 1 shows data for two different items. The first item is shown in blue and the second in red. Here I am choosing k = 2 and picking two starting points as the initial centroids. There are different methods for choosing a suitable value of k, such as the elbow method.
  • In figure 2, we join the two selected points with a line and draw a perpendicular line through its midpoint. Each data point is assigned to the selected point on its side of that perpendicular, which is the nearer one. Each selected point then moves to the centroid (the average) of the points assigned to it. Notice that some of the red points now fall on the blue side, so they now belong to the blue group.
  • The same process continues in figure 3. We join the two centroids, draw the perpendicular line, and recompute the centroids. The two centroids move again, and a few more red points are reassigned to blue.
  • The same process happens in figure 4. It continues until the assignments stop changing and we are left with two clearly separated clusters.
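The three steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the `max_iters` and `seed` parameters, and the choice to initialize the centroids by sampling k data points at random are assumptions made for this example, and the toy data is invented.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups.

    Returns (centroids, labels), where labels[i] is the cluster index
    assigned to points[i].
    """
    rng = random.Random(seed)
    # Step 2: initialize the centroids by picking k data points at random
    # (an assumption for this sketch; other initializations exist).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Step 3a: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3b: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centroids, labels


# Toy data, invented for illustration: two well-separated groups.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(data, k=2)
print(centroids)
print(labels)
```

The two return values correspond to the two outputs listed earlier: the k centroids and one label per data point. Which cluster gets index 0 depends on the random initialization, so only the grouping is meaningful, not the label numbers.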

Use-Cases in the Security Domain

1. Identifying Fake News

Fake news is not a new phenomenon, but it is one that is becoming prolific.

What the problem is: Fake news is being created and spread at a rapid rate due to technology innovations such as social media.

How clustering works: The algorithm takes in the content of articles (the corpus), examines the words used, and clusters them. These clusters help the algorithm determine which pieces are genuine and which are fake news. Certain words are found more commonly in sensationalized, click-bait articles. A high percentage of such terms in an article indicates a higher probability that the material is fake news.
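As a rough illustration of the "percentage of specific terms" idea, the sketch below computes the share of words in a text that come from a watch list. The word list, the function name `sensational_fraction`, and the sample headline are all made up for this example; a real system would derive its terms from the clustered corpus, and this number would be one feature among many rather than a verdict on its own.

```python
import re

# Hypothetical list of sensational terms, invented for illustration only.
SENSATIONAL_TERMS = {"shocking", "unbelievable", "secret", "exposed", "miracle"}


def sensational_fraction(text):
    """Return the share of words in `text` that appear in SENSATIONAL_TERMS."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in SENSATIONAL_TERMS)
    return hits / len(words)


headline = "Shocking secret exposed by local officials"
print(sensational_fraction(headline))  # 0.5 (3 of the 6 words are on the list)
```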

2. Spam filter

You know the junk folder in your email inbox? It is the place where emails that have been identified as spam by the algorithm end up.

What the problem is: Spam emails are at best an annoying part of modern-day marketing techniques, and at worst, an example of people phishing for your personal data. To keep these emails out of your main inbox, email companies use algorithms.

How clustering works: K-means clustering techniques have proven to be an effective way of identifying spam. They work by looking at the different sections of the email (header, sender, and content). The data is then grouped together, and the resulting groups can be classified to identify which are spam. Including clustering in the classification process has been reported to improve the accuracy of the filter to 97%.

3. Classifying network traffic

Imagine you want to understand the different types of traffic coming to your website. You are particularly interested in understanding which traffic is spam or coming from bots.

What the problem is: As more and more services begin to use APIs on your application, or as your website grows, it is important you know where the traffic is coming from.

How clustering works: K-means clustering is used to group together characteristics of the traffic sources. Once the clusters are created, you can classify the traffic types. The process has been reported to be faster and more accurate than the earlier AutoClass method. With precise information on traffic sources, you are able to grow your site and plan capacity effectively.

4. Identifying fraudulent or criminal activity

Machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection.

What the problem is: The challenge is telling genuine activity apart from fraudulent activity.

How clustering works: By analysing GPS logs, the algorithm is able to group similar behaviors. Based on the characteristics of the groups, you are then able to classify them into those that are genuine and those that are fraudulent. Using historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns.
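The proximity check described above might look like the following sketch. It assumes clustering has already been run on historical claims and that an analyst has marked each resulting cluster as fraudulent or not. The centroid values, the two scaled features, and the function name `flag_claim` are hypothetical.

```python
import math

# Hypothetical centroids from an earlier clustering of historical claims.
# Each feature is assumed to be scaled to the range 0..1 beforehand.
CLUSTERS = [
    {"centroid": (0.2, 0.8), "fraudulent": False},
    {"centroid": (0.9, 0.1), "fraudulent": True},
]


def flag_claim(features, clusters=CLUSTERS):
    """Return True if the nearest cluster centroid is marked as fraudulent."""
    nearest = min(clusters, key=lambda c: math.dist(features, c["centroid"]))
    return nearest["fraudulent"]


print(flag_claim((0.85, 0.15)))  # True: closest to the fraudulent centroid
print(flag_claim((0.25, 0.70)))  # False: closest to the genuine centroid
```

Scaling the features matters here, because a feature with a much larger numeric range would otherwise dominate the distance.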

5. Document analysis

There are many different reasons why you would want to run an analysis on a document. In this scenario, you want to be able to organize the documents quickly and efficiently.

What the problem is: Imagine you are limited in time and need to organize information held in documents quickly. To complete this task you need to understand the theme of the text, compare it with other documents, and classify it.

How clustering works: Hierarchical clustering has been used to solve this problem. The algorithm is able to look at the text and group it into different themes. Using this technique, you can cluster and organize similar documents quickly using the characteristics identified in the paragraph.
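To make the hierarchical approach concrete, here is a small bottom-up (agglomerative) sketch. It is one possible variant, not necessarily the method used in practice: the single-linkage merge rule, the Jaccard distance on word sets, the `threshold` parameter, and the sample documents are all assumptions chosen for illustration.

```python
def jaccard_distance(a, b):
    """Distance between two word sets: 0 means identical, 1 means no overlap."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def agglomerate(docs, threshold):
    """Repeatedly merge the two closest clusters until none are within threshold.

    Returns a list of clusters, each a list of indices into `docs`.
    """
    word_sets = [set(d.lower().split()) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    jaccard_distance(word_sets[a], word_sets[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters


# Sample documents, invented for illustration.
docs = [
    "firewall rule update",
    "firewall rule change",
    "quarterly sales report",
    "annual sales report",
]
print(agglomerate(docs, threshold=0.6))  # [[0, 1], [2, 3]]
```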

Conclusion

K-means clustering is one of the most popular clustering algorithms and is usually the first thing practitioners apply when solving clustering tasks to get an idea of the structure of the dataset. It’s a simple and quick algorithm that can be applied to large datasets to separate them into different partitions; analysis of these partitions may provide a better characterization of data in the security domain.
