Description

Bisecting k-means is a kind of hierarchical clustering algorithm.

Bisecting k-means algorithm starts from a single cluster that contains all points. Iteratively it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible.

Parameters

Name Description Type Required? Default Value
minDivisibleClusterSize Minimum divisible cluster size Integer 1
k Number of clusters. Integer 4
distanceType Distance type for clustering, support EUCLIDEAN and COSINE. String “EUCLIDEAN”
vectorCol Name of a vector column String
maxIter Maximum iterations, The default value is 10 Integer 10
predictionCol Column name of prediction. String
predictionDetailCol Column name of prediction result, it will include detailed info. String
reservedCols Names of the columns to be retained in the output table String[] null

Script Example

Code

  1. import numpy as np
  2. import pandas as pd
  3. data = np.array([
  4. [0, "0 0 0"],
  5. [1, "0.1,0.1,0.1"],
  6. [2, "0.2,0.2,0.2"],
  7. [3, "9 9 9"],
  8. [4, "9.1 9.1 9.1"],
  9. [5, "9.2 9.2 9.2"]
  10. ])
  11. df = pd.DataFrame({"id": data[:, 0], "vec": data[:, 1]})
  12. inOp = BatchOperator.fromDataframe(df, schemaStr='id int, vec string')
  13. kmeans = BisectingKMeans().setVectorCol("vec").setK(2).setPredictionCol("pred")
  14. kmeans.fit(inOp).transform(inOp).collectToDataframe()

Results

Prediction
  1. rowId id vec pred
  2. 0 0 0 0 0 0
  3. 1 1 0.1,0.1,0.1 0
  4. 2 2 0.2,0.2,0.2 0
  5. 3 3 9 9 9 1
  6. 4 4 9.1 9.1 9.1 1
  7. 5 5 9.2 9.2 9.2 1