K-means clustering and its real use-cases in the security domain

Lalita Rajpoot
10 min read · Jul 19, 2021

k-means clustering

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (the cluster center or centroid), which serves as a prototype of the cluster.

Introduction

  • K-means clustering is one of the most popular clustering algorithms.
  • It gets its name from its defining property: it tries to find an optimal, user-specified number of clusters, k, in any dataset. The quality of the clusters and their separability depend on the data and on implementation details, but it is a fairly straightforward iterative algorithm.
  • It involves a random centroid-initialization step followed by two steps, the cluster assignment step and the centroid calculation step, which are executed iteratively until a stable set of means is arrived at.
  • Cluster Assignment: Assign each data point to one of the K clusters based on its distance from them. A point is assigned to the cluster whose centroid it is closest to.
  • Move Centroid: After cluster assignment, each centroid is moved to the mean of the cluster formed, and the process is repeated. After a certain number of steps the centroids will no longer move, and the iterations can stop.

K-Means Algorithm

Input:

  • K, number of clusters
  • Training set, $x^{(1)}, x^{(2)}, \cdots, x^{(m)}$

where:

  • $x^{(i)} \in \mathbb{R}^n$; since there are no bias terms, the $x_0 = 1$ convention is dropped

Algorithm:

  • Randomly initialize $K$ cluster centroids, $\mu_1, \mu_2, \cdots, \mu_K \in \mathbb{R}^n$
  • Then,

$$
\begin{aligned}
&\text{Repeat } \{\\
&\quad \text{for } i = 1 \text{ to } m:\quad c^{(i)} := \arg\min_{k} \lVert x^{(i)} - \mu_k \rVert^2 \quad \text{(index, from 1 to } K\text{, of the closest centroid)}\\
&\quad \text{for } k = 1 \text{ to } K:\quad \mu_k := \frac{\sum_{i:\,c^{(i)}=k} x^{(i)}}{\lvert\{\, i : c^{(i)} = k \,\}\rvert} \quad \text{(mean of the points assigned to cluster } k\text{)}\\
&\}
\end{aligned}
\tag{1}
$$
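For concreteness, a direct NumPy translation of one iteration of (1) might look like the following sketch (illustrative code, not from the original article; the function name k_means_step is my own):

import numpy as np

def k_means_step(X, mu):
    # cluster assignment: c(i) = argmin_k ||x(i) - mu_k||^2
    dists = ((X[:, np.newaxis, :] - mu[np.newaxis, :, :]) ** 2).sum(axis=2)
    c = dists.argmin(axis=1)
    # move centroids: mu_k = mean of the points assigned to cluster k
    # (a centroid with no assigned points is left where it is)
    new_mu = np.array([X[c == k].mean(axis=0) if (c == k).any() else mu[k]
                       for k in range(len(mu))])
    return c, new_mu

Repeating k_means_step until the assignments stop changing implements the full loop in (1).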

It is common to apply k-means to non-separated data. This has particular applications in segmentation problems, like market segmentation or population division based on pre-selected features.

Optimization Objective

Notation:

  • $c^{(i)}$ — index of the cluster $\{1, 2, \cdots, K\}$ to which example $x^{(i)}$ is currently assigned
  • $\mu_k$ — cluster centroid $k$, $\mu_k \in \mathbb{R}^n$
  • $\mu_{c^{(i)}}$ — cluster centroid of the cluster to which example $x^{(i)}$ is assigned

Following the above notation, the cost function of the k-means clustering is given by,

$$
J\big(c^{(1)}, c^{(2)}, \cdots, c^{(m)}, \mu_1, \mu_2, \cdots, \mu_K\big) = \frac{1}{m} \sum_{i=1}^{m} \lVert x^{(i)} - \mu_{c^{(i)}} \rVert^2
\tag{2}
$$

Hence the optimization objective is,

$$
\min_{c^{(1)}, \cdots, c^{(m)},\ \mu_1, \cdots, \mu_K} J\big(c^{(1)}, c^{(2)}, \cdots, c^{(m)}, \mu_1, \mu_2, \cdots, \mu_K\big)
\tag{3}
$$

The cost function in (2) is called the distortion cost function, or simply the distortion, of k-means clustering.

It can be argued that the k-means algorithm in (1) implements this cost-function optimization. The first step, cluster assignment, is nothing but minimization of the cost w.r.t. $c^{(1)}, c^{(2)}, \cdots, c^{(m)}$, since it assigns each data point to the nearest cluster. Similarly, the second step, moving the centroids, is minimization of the cost w.r.t. $\mu_1, \mu_2, \cdots, \mu_K$, since the optimal centroid position for minimizing the distortion of a given set of points is their mean.

One handy way of checking whether the clustering algorithm is working correctly is to plot the distortion as a function of the number of iterations. Because both steps of k-means explicitly minimize the cost, the distortion should decrease monotonically (or stay approximately constant) as the iterations proceed; if it ever increases, something is wrong.
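For instance, one can record the distortion after every iteration and plot it (a toy sketch reusing the k_means_step function from above; the data and cluster count are arbitrary):

import matplotlib.pyplot as plt
import numpy as np

def distortion_J(X, mu, c):
    # J = (1/m) * sum_i ||x(i) - mu_{c(i)}||^2, as in equation (2) above
    return ((X - mu[c]) ** 2).sum() / len(X)

X = np.random.rand(200, 2)                          # toy data
mu = X[np.random.choice(len(X), 3, replace=False)]  # K = 3
history = []
for _ in range(15):
    c, mu = k_means_step(X, mu)
    history.append(distortion_J(X, mu, c))

plt.plot(history)  # should never go up
plt.xlabel("iteration")
plt.ylabel("distortion")
plt.show()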

Random Initialization

There are various ways of picking $K < m$ initial cluster centroids, but the most recommended one is to pick $K$ training examples at random and initialize $\mu_1, \mu_2, \cdots, \mu_K$ to these $K$ examples.

Depending on the initialization, k-means can converge to different centroids or get stuck in a local optimum. One possible solution is to try multiple random initializations and choose the result with the least distortion. It is fairly usual to run k-means around 50–1000 times with random initializations to make sure it does not get stuck in a poor local optimum.

Generally, the trick of multiple random initializations helps only when the number of clusters is small, i.e., between 2 and 10; for larger numbers of clusters, additional random initializations are less likely to improve the distortion cost function.
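A minimal sketch of the multiple-restarts trick, again reusing k_means_step from above (the run count and K are illustrative):

import numpy as np

def run_k_means(X, K, n_iter=20):
    # one full k-means run from a fresh random initialization
    mu = X[np.random.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        c, mu = k_means_step(X, mu)
    J = ((X - mu[c]) ** 2).sum() / len(X)
    return c, mu, J

X = np.random.rand(500, 2)  # toy data
# try 100 random initializations and keep the lowest-distortion result
c_best, mu_best, J_best = min((run_k_means(X, K=3) for _ in range(100)),
                              key=lambda result: result[2])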

Choosing the Number of Clusters

  • One way of choosing the number of clusters is by manually visualizing the data.
  • Sometimes it is ambiguous how many clusters exist in a dataset; in such cases it is more useful to choose the number of clusters based on the end goal, i.e., the number of clusters that best serves the downstream task the data will be used for.
  • Elbow Method: Plot the distortion as a function of the number of clusters, $K$; this method says the optimal number of clusters is at the point where the elbow occurs, as for line B in the plot below. It is a reasonable way of choosing the number of clusters, but it does not always work, because sometimes the plot looks like line A, which has no clear elbow to choose.

As a rule in k-means, the distortion decreases as the number of clusters $K$ increases, but beyond some point adding clusters yields little further decrease in distortion.

How Does K-Means Clustering Work?


The goal of the K-Means algorithm is to find clusters in the given input data. One simple approach is trial and error: specify a value of K (e.g., 3, 4, 5) and keep changing it until the resulting clusters look best.

Example

IPython Notebook

import cv2
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm  # used later for the elbow plots

Given an image, it is reshaped into a vector of pixels for ease of processing,

img = cv2.imread('image_1.png')  # load the input image (path is illustrative)
l, w, ch = img.shape
vec_img = img.reshape(-1, ch).astype(int)

Following this, K points are randomly chosen and assigned as centroids,

def choose_random(K, vec):
    m = len(vec)
    # sample without replacement so no two centroids start identical
    idx = np.random.choice(m, K, replace=False)
    # float centroids so later mean updates are not truncated to int
    return vec[idx].astype(float)

K = 16  # number of clusters/colors (illustrative choice)
mu = choose_random(K, vec_img)

The two basic steps of k-means clustering, cluster assignment and moving centroids can be implemented as follows,

def cluster_assignment(mu, vec):
    # squared distance of every point to every centroid; pick the nearest
    return ((vec - mu[:, np.newaxis]) ** 2).sum(axis=2).argmin(axis=0)

def move_centroid(mu, c, vec):
    for i in range(len(mu)):
        vec_sub = vec[c == i]
        if len(vec_sub):  # guard against empty clusters
            mu[i] = np.mean(vec_sub, axis=0)
    return mu

The distortion cost fuction is calculated as follows,

def distortion(mu, c, vec):
    return ((mu[c] - vec) ** 2).sum() / vec.shape[0]

Once all the modules are in place, k-means iterates over the cluster assignment and centroid-moving steps until the change in distortion falls below a threshold (here, 1),

c = cluster_assignment(mu, vec_img)  # initial assignment (c was previously undefined)
last_dist = distortion(mu, c, vec_img) + 100
curr_dist = last_dist - 100
while last_dist - curr_dist > 1:
    last_dist = curr_dist
    c = cluster_assignment(mu, vec_img)
    mu = move_centroid(mu, c, vec_img)
    curr_dist = distortion(mu, c, vec_img)
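With the loop finished, the compressed image can be reconstructed by replacing every pixel with the color of its centroid (a small addition to the code above; l, w, ch come from the reshape step):

# replace each pixel by its centroid and restore the original shape
compressed = mu[c].reshape(l, w, ch).astype(np.uint8)
plt.imshow(cv2.cvtColor(compressed, cv2.COLOR_BGR2RGB))  # cv2 stores BGR
plt.axis('off')
plt.show()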

The following plots were obtained after running k-means for image compression on two different images,

The following code runs k-means for different values of K to produce elbow plots. It assumes a k_means helper that wraps the initialization, assignment, and update steps above and returns the final centroids, assignments, and distortion.

def elbow(img):
    K_hist = []
    dist_hist = []
    for K in tqdm(range(1, 10)):
        K_hist.append(K)
        # k_means wraps the initialization/assignment/update loop above
        mu, c, dist = k_means(img, K, plot=False)
        dist_hist.append(dist)
    plt.plot(K_hist, dist_hist)
    plt.xlabel("K")
    plt.ylabel("final distortion")

plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.title('elbow plot of image 1')
elbow(img_1)
plt.subplot(1, 2, 2)
elbow(img_2)
plt.title('elbow plot of image 2')

Looking at the two plots, it is evident that while the elbow plot gives an optimal value of two for image 1, there is no well-defined elbow for image 2.

Use-cases of Clustering Algorithms in the Real World

1. Identifying Fake News

Fake news is not a new phenomenon, but it is one that is becoming prolific.

What the problem is: Fake news is being created and spread at a rapid rate due to technology innovations such as social media. The issue gained attention recently during the 2016 US presidential campaign. During this campaign, the term Fake News was referenced an unprecedented number of times.

How clustering works: In a paper published by two computer science students at the University of California, Riverside, clustering algorithms are used to identify fake news based on its content.

The algorithm works by taking in the content of the fake news article (the corpus), examining the words used, and then clustering them. These clusters are what help the algorithm determine which pieces are genuine and which are fake. Certain words appear more commonly in sensationalized, click-bait articles; when an article contains a high percentage of such terms, there is a higher probability that the material is fake news.
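The paper's exact pipeline is not reproduced here, but a minimal scikit-learn sketch of the general idea, clustering articles by their word usage, might look like this (the article texts and the cluster count are placeholders):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

articles = [
    "SHOCKING! You won't believe what this celebrity did next...",
    "The committee published its quarterly budget report on Tuesday.",
    # ... more article texts
]

# represent each article by the words it uses, weighted by TF-IDF
X = TfidfVectorizer(stop_words="english").fit_transform(articles)

# group articles with similar vocabulary together
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

Clusters dominated by sensational, click-bait vocabulary would then be the candidates for fake news.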

2. Spam filter

You know the junk folder in your email inbox? It is the place where emails that have been identified as spam by an algorithm end up.

Many machine learning courses, such as Andrew Ng’s famed Coursera course, use the spam filter as a classic example application.

What the problem is: Spam emails are at best an annoying part of modern-day marketing techniques, and at worst an example of people phishing for your personal data. To keep these emails out of your main inbox, email companies use algorithms whose purpose is to correctly flag an email as spam or not.

How clustering works: K-Means clustering techniques have proven to be an effective way of identifying spam. The method works by looking at the different sections of the email (header, sender, and content) and grouping the data together. These groups can then be classified to identify which emails are spam. Including clustering in the classification process has been reported to improve the filter’s accuracy to 97%. This is excellent news for people who want to be sure they are not missing out on their favorite newsletters and offers.

3. Marketing and Sales

Personalization and targeting in marketing are big business.

This is achieved by looking at specific characteristics of a person and sharing with them campaigns that have been successful with other, similar people.

What the problem is: If you are a business trying to get the best return on your marketing investment, it is crucial that you target people in the right way. If you get it wrong, you risk not making any sales, or worse, damaging customer trust.

How clustering works: Clustering algorithms are able to group together people with similar traits and likelihood to purchase. Once you have the groups, you can run tests on each group with different marketing copy that will help you better target your messaging to them in the future.
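As a sketch of what such segmentation could look like in code (the customer features here are hypothetical):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# hypothetical features: [age, annual spend, visits per month]
customers = np.array([
    [23,  300.0,  2],
    [45, 1500.0,  8],
    [31,  700.0,  4],
    [52, 2100.0, 10],
])

# features on different scales should be standardized before clustering
X = StandardScaler().fit_transform(customers)
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

Each resulting segment can then be targeted with its own campaign.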

4. Classifying network traffic

What the problem is: As more and more services begin to use APIs on your application, or as your website grows, it is important you know where the traffic is coming from. For example, you want to be able to block harmful traffic and double down on areas driving growth. However, it is hard to know which is which when it comes to classifying the traffic.

How clustering works: K-means clustering is used to group together characteristics of the traffic sources. When the clusters are created, you can then classify the traffic types. The process is faster and more accurate than the previous AutoClass method. By having precise information on traffic sources, you are able to grow your site and plan capacity effectively.
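A toy sketch of the idea (the per-flow features and values are invented for illustration):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# hypothetical per-flow features: [packets/s, mean packet size, duration (s)]
flows = np.array([
    [1200,  64,  0.5],   # burst of tiny packets
    [  15, 900, 30.0],   # steady bulk transfer
    [  20, 850, 25.0],
    [1500,  60,  0.3],
])

X = StandardScaler().fit_transform(flows)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

Once the flows are grouped, each cluster can be inspected and labeled (e.g., bulk transfers vs. scan-like bursts) to drive blocking or capacity decisions.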

5. Identifying fraudulent or criminal activity

In this scenario, we are going to focus on fraudulent taxi driver behavior. However, the technique has been used in multiple scenarios.

What is the problem: You need to look into fraudulent driving activity. The challenge is how to identify which behavior is genuine and which is fraudulent.

How clustering works: By analysing the GPS logs, the algorithm is able to group similar behaviors. Based on the characteristics of the groups, you are then able to classify them into those that are genuine and those that are fraudulent.

6. Document analysis

There are many different reasons why you would want to run an analysis on a document. In this scenario, you want to be able to organize the documents quickly and efficiently.

What the problem is: Imagine you are limited in time and need to organize the information held in documents quickly. To complete this task you need to: understand the theme of the text, compare it with other documents, and classify it.

How clustering works: Hierarchical clustering has been used to solve this problem. The algorithm is able to look at the text and group it into different themes. Using this technique, you can cluster and organize similar documents quickly, using the characteristics identified in the text.
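A minimal sketch of hierarchical document clustering with SciPy (the documents and the number of themes are placeholders):

from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Quarterly earnings exceeded expectations this year.",
    "The new vaccine trial reported promising results.",
    "Revenue growth slowed in the fourth quarter.",
    "The clinical study confirms the treatment is safe.",
]

# TF-IDF features, then agglomerative (Ward) clustering on the dense matrix
X = TfidfVectorizer(stop_words="english").fit_transform(documents).toarray()
Z = linkage(X, method="ward")
themes = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 themes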

7. Fantasy Football and Sports

Ok so up until this point we have looked into different business problems and how clustering algorithms have been applied to solve them.

But now for the critical issues — fantasy football!

What is the problem: Who should you have in your team? Which players are going to perform best for your team and allow you to beat the competition? The challenge at the start of the season is that there is very little, if any, data available to help you identify the winning players.

How clustering works: When there is little performance data available to train your model on, unsupervised learning has an advantage. In this type of machine learning problem, you can find similar players based on their characteristics; this has been done using K-Means clustering. Ultimately, this means you can assemble a better team more quickly at the start of the year, giving you an edge.
