Description

Calculates evaluation metrics for clustering results.

predictionCol is required for evaluation. labelCol is optional; if it is given, NMI/Purity/RI/ARI will be calculated. vectorCol is also optional; if it is given, SilhouetteCoefficient/SSB/SSW/Compactness/SEPERATION/DAVIES_BOULDIN/CALINSKI_HARABAZ will be calculated. If only predictionCol is given, only K/ClusterArray/CountArray will be calculated.
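To make the label-based metrics more concrete, here is a minimal pure-Python sketch of two of them, Purity and RI (Rand index), using their common textbook definitions. The helper names `purity` and `rand_index` and the sample data are hypothetical, and the operator's internal implementation is assumed, not confirmed, to follow the same definitions.

```python
from collections import Counter
from itertools import combinations


def purity(pred, label):
    # Group the true labels by predicted cluster, then count how many
    # samples carry the most frequent label of their cluster.
    clusters = {}
    for p, l in zip(pred, label):
        clusters.setdefault(p, []).append(l)
    majority = sum(Counter(m).most_common(1)[0][1] for m in clusters.values())
    return majority / len(pred)


def rand_index(pred, label):
    # Fraction of sample pairs on which the two partitions agree:
    # both put the pair together, or both keep it apart.
    pairs = list(combinations(range(len(pred)), 2))
    agree = sum((pred[i] == pred[j]) == (label[i] == label[j]) for i, j in pairs)
    return agree / len(pairs)


# Hypothetical data: six samples, one of which lands in the "wrong" cluster.
pred = [0, 0, 0, 1, 1, 1]
label = ["a", "a", "b", "b", "b", "b"]

print("Purity:", purity(pred, label))
print("RI:", rand_index(pred, label))
```

Both functions need only the prediction and the label, which is why these metrics become available once labelCol is set and do not depend on vectorCol.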

Parameters

| Name | Description | Type | Required? | Default Value |
| --- | --- | --- | --- | --- |
| labelCol | Name of the label column in the input table | String | No | null |
| vectorCol | Name of a vector column | String | No | null |
| predictionCol | Column name of prediction. | String | Yes | |
| distanceType | Distance type for clustering; supports EUCLIDEAN and COSINE. | String | No | "EUCLIDEAN" |
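The choice of distanceType can change the vector-based metrics substantially. The NumPy sketch below contrasts the two options on a pair of vectors that point in the same direction but differ in magnitude. It assumes cosine distance means one minus cosine similarity, which is the usual definition but is not confirmed here to be the operator's exact formula; the function names are hypothetical.

```python
import numpy as np


def euclidean(a, b):
    # Straight-line distance between the two points.
    return float(np.linalg.norm(a - b))


def cosine_distance(a, b):
    # Assumed definition: 1 - cosine similarity.
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([9.0, 9.0, 9.0])
b = np.array([0.1, 0.1, 0.1])

print("EUCLIDEAN:", euclidean(a, b))
print("COSINE:", cosine_distance(a, b))
```

Under the Euclidean distance the two vectors are far apart, while under the cosine distance they are essentially identical because only direction matters.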

Script Example

Code

    # Assumes the PyAlink package is installed; it provides BatchOperator and EvalClusterBatchOp.
    from pyalink.alink import *
    import numpy as np
    import pandas as pd

    # Six 3-dimensional vectors forming two well-separated clusters (id 0 and id 1).
    data = np.array([
        [0, "0 0 0"],
        [0, "0.1,0.1,0.1"],
        [0, "0.2,0.2,0.2"],
        [1, "9 9 9"],
        [1, "9.1 9.1 9.1"],
        [1, "9.2 9.2 9.2"]
    ])
    df = pd.DataFrame({"id": data[:, 0], "vec": data[:, 1]})
    inOp = BatchOperator.fromDataframe(df, schemaStr='id int, vec string')

    # Only predictionCol and vectorCol are set, so no label-based metrics are computed.
    metrics = EvalClusterBatchOp().setVectorCol("vec").setPredictionCol("id").linkFrom(inOp).collectMetrics()

    print("Total Samples Number:", metrics.getCount())
    print("Cluster Number:", metrics.getK())
    print("Cluster Array:", metrics.getClusterArray())
    print("Cluster Count Array:", metrics.getCountArray())
    print("CP:", metrics.getCompactness())
    print("DB:", metrics.getDaviesBouldin())
    print("SP:", metrics.getSeperation())
    print("SSB:", metrics.getSsb())
    print("SSW:", metrics.getSsw())
    print("CH:", metrics.getCalinskiHarabaz())

Results

    Total Samples Number: 6
    Cluster Number: 2
    Cluster Array: ['0', '1']
    Cluster Count Array: [3.0, 3.0]
    CP: 0.11547005383792497
    DB: 0.014814814814814791
    SP: 15.588457268119896
    SSB: 364.5
    SSW: 0.1199999999999996
    CH: 12150.000000000042
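
The reported values can be cross-checked without Alink. The NumPy sketch below recomputes SSW, SSB, CH, CP, SP and DB for the same six points using common textbook definitions with the Euclidean distance. These formulas are an assumption about what the operator computes, not taken from its source, although they do agree with the results listed here up to floating-point rounding. All variable names are illustrative.

```python
import numpy as np

# Same six points and cluster assignments as in the script example.
X = np.array([[0.0] * 3, [0.1] * 3, [0.2] * 3, [9.0] * 3, [9.1] * 3, [9.2] * 3])
pred = np.array([0, 0, 0, 1, 1, 1])

clusters = np.unique(pred)
k, n = len(clusters), len(X)
centroids = np.array([X[pred == c].mean(axis=0) for c in clusters])
counts = np.array([(pred == c).sum() for c in clusters])

# Euclidean distance of every sample to the centroid of its own cluster.
dist = np.linalg.norm(X - centroids[np.searchsorted(clusters, pred)], axis=1)

# SSW: within-cluster sum of squares. SSB: between-cluster sum of squares.
ssw = float((dist ** 2).sum())
ssb = float((counts * ((centroids - X.mean(axis=0)) ** 2).sum(axis=1)).sum())

# CH (Calinski-Harabasz): ratio of between- to within-cluster dispersion.
ch = (ssb / (k - 1)) / (ssw / (n - k))

# CP (compactness): average over clusters of the mean distance to the centroid.
scatter = np.array([dist[pred == c].mean() for c in clusters])
cp = float(scatter.mean())

# SP (separation): mean distance between distinct centroid pairs.
cd = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
sp = float(cd[np.triu_indices(k, 1)].mean())

# DB (Davies-Bouldin): for each cluster take the worst (scatter_i + scatter_j) / d_ij,
# then average. The diagonal is divided by infinity so it contributes zero.
ratio = (scatter[:, None] + scatter[None, :]) / np.where(cd > 0, cd, np.inf)
db = float(ratio.max(axis=1).mean())

print("CP:", cp)
print("DB:", db)
print("SP:", sp)
print("SSB:", ssb)
print("SSW:", ssw)
print("CH:", ch)
```

A large CH together with a small DB reflects what the data suggests visually: two tight clusters whose centroids are far apart.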