K-Means Clustering and Its Real Use Cases in the Security Domain

Lalita Rajpoot
Feb 13, 2022

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. It is used for clustering problems, that is, grouping unlabelled data, rather than for supervised classification. K-means partitions the unlabelled data into groups, called clusters, so that the data points in each cluster have similar features.

“A cluster refers to a collection of data points aggregated together because of certain similarities.”

What is Unsupervised Learning?

Unsupervised learning is a machine learning technique in which there are no labels for the training data. Instead, the algorithm tries to learn the underlying patterns or distributions that govern the data.

What is clustering?

Clustering is one of the most common exploratory data analysis techniques used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different.
Unlike supervised learning, clustering is considered an unsupervised learning method because there are no ground-truth labels against which we could compare the output of the clustering algorithm to evaluate its performance.

What is K-means?

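K-means is an algorithm that identifies k centroids and then allocates every data point to the cluster of its nearest centroid, while keeping each cluster as compact as possible (that is, keeping the distances between the data points and their centroid as small as possible).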

Here “K” refers to the number of centroids we need in the dataset. A centroid is the imaginary or real location representing the center of the cluster.
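
One common way to make "compact clusters" concrete is the within-cluster sum of squared distances: for every data point, take the squared Euclidean distance to the centroid of its cluster and add them all up. The snippet below is a minimal sketch of that idea. The function name `wcss` and the toy data are made up for illustration and do not come from any particular library.

```python
def wcss(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total

# Hypothetical toy data: two tight groups on a line.
points = [(1.0, 0.0), (3.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
centroids = [(2.0, 0.0), (11.0, 0.0)]

good_labels = [0, 0, 1, 1]  # each point assigned to its nearest centroid
bad_labels = [0, 1, 0, 1]   # two points assigned to the far centroid

print(wcss(points, centroids, good_labels))  # 4.0
print(wcss(points, centroids, bad_labels))   # 130.0
```

The assignment that sends every point to its nearest centroid gives the smaller total, which is the quantity k-means tries to reduce.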

Where Can I Apply K-Means?

K-means is typically applied to data that has a relatively small number of dimensions, is numeric, and is continuous. Think of a scenario in which you want to make groups of similar things from a randomly distributed collection of things; k-means is very suitable for such scenarios.

How Does K-Means Clustering Work?

[Figure: flowchart showing how k-means clustering works]

The following stages will help us understand how the K-means clustering technique works:

  • Step 1: First, we need to specify the number of clusters, K, that the algorithm should generate.
  • Step 2: Next, choose K data points at random to seed the clusters, and give every data point an initial cluster assignment.
  • Step 3: Compute the centroid of each cluster.
  • Step 4: Repeat the steps below until the centroids stabilize, that is, until the assignment of data points to clusters no longer changes.
  • 4.1 Compute the squared distance between each data point and each centroid.
  • 4.2 Assign each data point to the cluster whose centroid is closest.
  • 4.3 Recompute each centroid by averaging all of the data points in its cluster.
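
The steps above can be sketched in plain Python as follows. This is only an illustrative sketch, not a production implementation: the function names, the `max_iter` and `seed` parameters, and the toy data are assumptions made for this example, and the initial centroids are simply K randomly chosen data points.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Steps 2-3: use k randomly chosen data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 4.1-4.2: assign each point to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 4 stopping rule: assignments did not change, so we are done.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4.3: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return centroids, labels

# Hypothetical toy data: one group near (1, 1) and one near (8.5, 8.5).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)     # the first three points share one label, the last three the other
print(centroids)
```

Which cluster ends up with label 0 or 1 depends on the random initialization, so only the grouping is meaningful, not the label numbers themselves.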

Interesting use cases for k-means

1. Document Classification

Cluster documents into multiple categories based on tags, topics, and the content of the document. This is a very standard clustering problem, and k-means is a highly suitable algorithm for this purpose. Some initial processing is needed to represent each document as a vector, using term frequency to identify the commonly used terms that help characterize the document. The document vectors are then clustered to identify groups of similar documents.
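
As a rough illustration of the "document as a vector" step, the sketch below builds simple term-frequency vectors with the standard library only. The tokenizer, the function names, and the three sample sentences are all made up for this example; a real pipeline would usually add stop-word removal and a weighting scheme such as TF-IDF.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    words = (w.strip(".,!?;:").lower() for w in text.split())
    return [w for w in words if w]

def term_frequency_vectors(documents):
    """Return the shared vocabulary and one term-frequency vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty documents
        vectors.append([counts[term] / total for term in vocabulary])
    return vocabulary, vectors

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical documents: two security alerts and one unrelated report.
docs = [
    "Firewall blocked the malware download",
    "Malware detected by the firewall",
    "Quarterly sales report for the finance team",
]
vocabulary, vectors = term_frequency_vectors(docs)

# The two security-related documents end up closer to each other
# than either is to the finance document.
print(squared_distance(vectors[0], vectors[1]))
print(squared_distance(vectors[0], vectors[2]))
```

Vectors like these are the numeric input that a k-means routine would then group into clusters.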

2. Insurance Fraud Detection

Machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. Using historical data on fraudulent claims, it is possible to flag new claims based on their proximity to clusters that indicate fraudulent patterns.
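
To give a feel for what "proximity to clusters" could look like in code, here is a small sketch. It assumes the centroids have already been learned from historical claims and that an analyst has marked one cluster as fraud-like. The two features, the centroid values, and the cluster names are entirely hypothetical.

```python
import math

# Hypothetical centroids learned from past claims, using two scaled features:
# (claim amount, number of prior claims). The names are analyst-assigned.
cluster_names = ["typical", "fraud_like"]
centroids = [(0.2, 0.1), (0.9, 0.8)]

def nearest_cluster(claim, centroids):
    """Return the index of the closest centroid and the distance to it."""
    distances = [math.dist(claim, c) for c in centroids]
    index = min(range(len(centroids)), key=lambda i: distances[i])
    return index, distances[index]

new_claim = (0.85, 0.7)  # hypothetical incoming claim, already scaled
index, distance = nearest_cluster(new_claim, centroids)
print(cluster_names[index], round(distance, 3))  # fraud_like 0.112
```

A claim that lands near a fraud-like centroid is only a candidate for review, since being close to a cluster does not prove that the claim is fraudulent.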

3. Cyber-Profiling Criminals

Cyber profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber profiling is derived from criminal profiling, which provides information to the investigation division to help classify the types of criminals who were at the crime scene.

4. Call Detail Record Analysis

A Call Detail Record (CDR) is the information captured by telecom companies during a customer's call, SMS, and internet activity. This information provides greater insight into the customer's needs when it is combined with customer demographics. Most telecom companies use CDR information to detect fraud by clustering user profiles, to reduce customer churn by analyzing usage activity, and to target profitable customers using RFM (recency, frequency, monetary) analysis.

5. Behavioral Segmentation

  • Segment by purchase history
  • Segment by activities on application, website, or platform
  • Define personas based on interests
  • Create profiles based on activity monitoring

Thanks for reading my article.

I hope it helps you explore the idea of K-means clustering.
