Description

k-means clustering is a method of vector quantization, originally from signal processing, that is widely used for cluster analysis in data mining. It aims to partition n observations into k clusters such that each observation belongs to the cluster with the nearest mean, which serves as the prototype of that cluster.

(https://en.wikipedia.org/wiki/K-means_clustering)
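
To make the definition concrete, the following is a minimal sketch of the standard assign-and-update iteration (Lloyd's algorithm) in plain Python. It is not the implementation used by this component: the function name `kmeans`, the random initialization, and the stopping rule are assumptions chosen for illustration, although `max_iter` and `epsilon` are meant to mirror the `maxIter` and `epsilon` parameters listed below.

```python
import math
import random


def kmeans(points, k, max_iter=20, epsilon=1e-4, seed=0):
    """Illustrative Lloyd's iteration: assign to nearest mean, then recompute means."""
    rng = random.Random(seed)
    # Random initialization (an assumption): k distinct observations become centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each observation joins the cluster with the nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        # Stop once no center moved by epsilon or more between two rounds.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < epsilon:
            break
    return centers, labels


points = [(0, 0, 0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
          (9, 9, 9), (9.1, 9.1, 9.1), (9.2, 9.2, 9.2)]
centers, labels = kmeans(points, k=2)
```

On these six points the first three observations end up in one cluster and the last three in the other. Which cluster receives id 0 and which receives id 1 depends on the initialization.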

Parameters

| Name | Description | Type | Required? | Default Value |
| --- | --- | --- | --- | --- |
| predictionDistanceCol | Column name of the prediction distance. | String | | |
| distanceType | Distance type used for clustering. Supports EUCLIDEAN and COSINE. | String | | "EUCLIDEAN" |
| vectorCol | Name of the vector column. | String | | |
| maxIter | Maximum number of iterations. | Integer | | 20 |
| initMode | Method used to choose the initial centers. Supports K_MEANS_PARALLEL and RANDOM. | String | | "K_MEANS_PARALLEL" |
| initSteps | Number of initialization steps when initMode is K_MEANS_PARALLEL. | Integer | | 2 |
| k | Number of clusters. | Integer | | 2 |
| epsilon | Convergence threshold. The algorithm is considered converged when the distance between the centers of two consecutive rounds is lower than epsilon. | Double | | 1.0E-4 |
| predictionCol | Column name of the prediction. | String | | |
| predictionDetailCol | Column name of the prediction result, including detailed information. | String | | |
| reservedCols | Names of the columns to be retained in the output table. | String[] | | null |
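
The `distanceType` and `epsilon` parameters can be illustrated with a short sketch. The helper names below are hypothetical and the definitions are assumptions, not taken from the library: cosine distance is written as one minus cosine similarity, and convergence is read as "every center moved less than `epsilon` since the previous round".

```python
import math


def euclidean_distance(a, b):
    # Straight-line distance between two vectors of equal length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    # Assumed definition: 1 - cosine similarity. Undefined for a zero vector.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def has_converged(old_centers, new_centers, epsilon=1.0e-4):
    # Assumed reading of epsilon: every center moved less than epsilon.
    return all(
        euclidean_distance(old, new) < epsilon
        for old, new in zip(old_centers, new_centers)
    )


far_apart = euclidean_distance((1, 1, 1), (9, 9, 9))   # large: the points differ
same_direction = cosine_distance((1, 1, 1), (9, 9, 9))  # near zero: same direction
```

The two distance types can disagree sharply: `(1, 1, 1)` and `(9, 9, 9)` are far apart under the Euclidean distance but point in the same direction, so their cosine distance is approximately zero. Under this definition the cosine distance is also undefined for an all-zero vector.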

Script Example

Code

    import numpy as np
    import pandas as pd
    from pyalink.alink import *

    # Start a local execution environment with parallelism 1.
    useLocalEnv(1)

    # Each row is [id, vector]; vector strings may be space- or comma-delimited.
    data = np.array([
        [0, "0 0 0"],
        [1, "0.1,0.1,0.1"],
        [2, "0.2,0.2,0.2"],
        [3, "9 9 9"],
        [4, "9.1 9.1 9.1"],
        [5, "9.2 9.2 9.2"]
    ])
    df = pd.DataFrame({"id": data[:, 0], "vec": data[:, 1]})
    inOp = BatchOperator.fromDataframe(df, schemaStr='id int, vec string')

    # Train a 2-cluster model on the "vec" column and write the cluster id to "pred".
    kmeans = KMeans().setVectorCol("vec").setK(2).setPredictionCol("pred")
    print(kmeans.fit(inOp).transform(inOp).collectToDataframe())

Results

Prediction
| rowID | id | vec | pred |
| --- | --- | --- | --- |
| 0 | 0 | 0 0 0 | 1 |
| 1 | 1 | 0.1,0.1,0.1 | 1 |
| 2 | 2 | 0.2,0.2,0.2 | 1 |
| 3 | 3 | 9 9 9 | 0 |
| 4 | 4 | 9.1 9.1 9.1 | 0 |
| 5 | 5 | 9.2 9.2 9.2 | 0 |
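
When checking output like the table above, two points help: the vector strings use both spaces and commas as delimiters, and the cluster ids themselves are arbitrary, so a run that swaps 0 and 1 describes the same clustering. The sketch below uses two hypothetical helpers, `parse_vector` and `same_partition`, written for illustration only and not part of the library.

```python
import re


def parse_vector(text):
    # Split on commas and/or whitespace, covering both styles in the example data.
    return [float(token) for token in re.split(r"[,\s]+", text.strip())]


def same_partition(labels_a, labels_b):
    # Two labelings describe the same clustering when their labels map one-to-one,
    # even if the cluster ids themselves are swapped.
    pairs = set(zip(labels_a, labels_b))
    return len(pairs) == len(set(labels_a)) == len(set(labels_b))


vectors = [parse_vector(s) for s in ["0 0 0", "0.1,0.1,0.1", "9.2 9.2 9.2"]]
pred = [1, 1, 1, 0, 0, 0]  # the "pred" column from the table above
swapped_ok = same_partition(pred, [0, 0, 0, 1, 1, 1])
different = same_partition(pred, [0, 0, 1, 1, 1, 1])
```

Here `swapped_ok` is true because only the ids are exchanged, while `different` is false because the third row moves to the other cluster.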