Parallel K-Means Clustering Algorithm on Map Reduce Framework

Authors

  • Seung-Hee Kim
  • Jun-Ki Min

Abstract

Clustering partitions the points of a given data set into several groups so that points within a group are similar to one another while dissimilar points fall into different groups. Among the many clustering algorithms, K-Means, a center-based method, is one of the most widely used. In this paper, we propose an efficient parallel K-Means algorithm, called MPKMeans, which uses the MapReduce framework to process large-volume data sets. In MPKMeans, we maintain for each center the distance to its closest neighboring center, and we use these minimal distances to check whether a point needs its distances to all centers computed. Additionally, in contrast to the existing parallel algorithm, MPKMeans is composed of map phases only, which eliminates the overhead of the shuffle and reduce phases.
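The distance check described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the exact test MPKMeans applies, so this sketch assumes the standard triangle-inequality bound: if a point lies within half the distance from its current center to that center's closest neighboring center, no other center can be closer and the full scan is skipped. The function names (`nearest_center_gaps`, `assign`) are hypothetical, and the code runs on a single machine rather than inside a MapReduce job.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_center_gaps(centers):
    """For each center, the distance to its closest other center.

    Assumes at least two centers.
    """
    gaps = []
    for i, c in enumerate(centers):
        gaps.append(min(dist(c, o) for j, o in enumerate(centers) if j != i))
    return gaps


def assign(point, prev, centers, gaps):
    """Return (center_index, skipped_full_scan) for one point.

    prev is the index of the center the point was assigned to previously.
    Assumed pruning rule (triangle inequality): if
    d(point, centers[prev]) <= gaps[prev] / 2, then every other center
    is at least as far away, so the assignment cannot change.
    """
    d_prev = dist(point, centers[prev])
    if d_prev <= gaps[prev] / 2:
        return prev, True
    # Otherwise fall back to computing the distance to every center.
    best = min(range(len(centers)), key=lambda j: dist(point, centers[j]))
    return best, False


centers = [(0.0, 0.0), (10.0, 0.0)]
gaps = nearest_center_gaps(centers)
near = assign((1.0, 0.0), 0, centers, gaps)   # within half the gap: scan skipped
moved = assign((6.0, 0.0), 0, centers, gaps)  # outside the bound: full scan
```

In a map-only setting, each mapper could apply a test like `assign` independently to the points in its input split, since the check needs only the current centers and their precomputed gaps.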

Published

2019-12-12

Section

Articles