Description

Gaussian Mixture is a kind of clustering algorithm.

Gaussian Mixture clustering fits multivariate Gaussian Mixture Models (GMMs) by expectation maximization (EM). A GMM represents a composite distribution made up of independent Gaussian distributions, each with an associated “mixing” weight that specifies its contribution to the composite.
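Written out, the composite distribution described above takes the following standard form. The symbols are notation assumed here for illustration and do not appear in the original text: w_j is the mixing weight of component j, and \mu_j and \Sigma_j are its mean vector and covariance matrix.

```latex
p(x) = \sum_{j=1}^{k} w_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j),
\qquad w_j \ge 0, \qquad \sum_{j=1}^{k} w_j = 1
```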

Given a set of sample points, this class maximizes the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than the tolerance (tol) or until the maximum number of iterations (maxIter) is reached. The process generally converges, but it is not guaranteed to find a global optimum.
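The iteration described above can be sketched in plain Python. This is a minimal illustration for 2-D points only and is not Alink's implementation: the function names (gaussian_pdf_2d, fit_gmm), the initialization (means taken from evenly spaced sample points, identity covariances) and the small regularization term reg are all assumptions made for the sketch. Only k, tol and max_iter correspond to the parameters documented here.

```python
import math

# Sample points copied from the script example in this document.
POINTS = [
    (-0.6264538, 0.1836433), (-0.8356286, 1.5952808), (0.3295078, -0.8204684),
    (0.4874291, 0.7383247), (0.5757814, -0.3053884), (1.5117812, 0.3898432),
    (-0.6212406, -2.2146999), (11.1249309, 9.9550664), (9.9838097, 10.9438362),
    (10.8212212, 10.5939013), (10.9189774, 10.7821363), (10.0745650, 8.0106483),
    (10.6198257, 9.9438713), (9.8442045, 8.5292476), (9.5218499, 10.4179416),
]


def gaussian_pdf_2d(x, mean, cov):
    """Density of a 2-D Gaussian with a full 2x2 covariance at point x."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mean)^T cov^-1 (x - mean), using the closed-form 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def fit_gmm(points, k=2, tol=0.01, max_iter=100, reg=1e-6):
    """Fit a k-component 2-D GMM by EM; returns (weights, means, covs, ll, n_iter)."""
    n = len(points)
    weights = [1.0 / k] * k
    # Assumed initialization: evenly spaced sample points as means, identity covariances.
    means = [points[(i * n) // k] for i in range(k)]
    covs = [((1.0, 0.0), (0.0, 1.0)) for _ in range(k)]
    prev_ll, ll, n_iter = None, float("-inf"), 0

    for n_iter in range(1, max_iter + 1):
        # E-step: responsibilities of each component for each point, plus log-likelihood.
        resp, ll = [], 0.0
        for x in points:
            dens = [weights[j] * gaussian_pdf_2d(x, means[j], covs[j]) for j in range(k)]
            total = sum(dens)
            ll += math.log(total)
            resp.append([d / total for d in dens])

        # Stop once the log-likelihood changes by less than tol.
        if prev_ll is not None and abs(ll - prev_ll) < tol:
            break
        prev_ll = ll

        # M-step: re-estimate weights, means and covariances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mx = sum(r[j] * x[0] for r, x in zip(resp, points)) / nj
            my = sum(r[j] * x[1] for r, x in zip(resp, points)) / nj
            sxx = sum(r[j] * (x[0] - mx) ** 2 for r, x in zip(resp, points)) / nj + reg
            syy = sum(r[j] * (x[1] - my) ** 2 for r, x in zip(resp, points)) / nj + reg
            sxy = sum(r[j] * (x[0] - mx) * (x[1] - my) for r, x in zip(resp, points)) / nj
            weights[j] = nj / n
            means[j] = (mx, my)
            covs[j] = ((sxx, sxy), (sxy, syy))

    return weights, means, covs, ll, n_iter


weights, means, covs, log_likelihood, iterations = fit_gmm(POINTS, k=2, tol=0.01, max_iter=100)
print("weights:", weights)
print("means:", means)
print("iterations:", iterations)
```

Because EM only finds a local optimum, the fitted weights depend on the starting point. A different initialization, such as the one the library uses internally, may give a different mixture for the same data.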

Parameters

| Name | Description | Type | Required? | Default Value |
| --- | --- | --- | --- | --- |
| tol | Iteration tolerance. | Double | | 0.01 |
| vectorCol | Name of a vector column. | String | | |
| k | Number of clusters. | Integer | | 2 |
| maxIter | Maximum number of iterations. | Integer | | 100 |

Script Example

Code

    from pyalink.alink import *
    import numpy as np
    import pandas as pd

    # Each row is one 2-D sample, encoded as a space-separated vector string.
    data = np.array([
        ["-0.6264538 0.1836433"],
        ["-0.8356286 1.5952808"],
        ["0.3295078 -0.8204684"],
        ["0.4874291 0.7383247"],
        ["0.5757814 -0.3053884"],
        ["1.5117812 0.3898432"],
        ["-0.6212406 -2.2146999"],
        ["11.1249309 9.9550664"],
        ["9.9838097 10.9438362"],
        ["10.8212212 10.5939013"],
        ["10.9189774 10.7821363"],
        ["10.0745650 8.0106483"],
        ["10.6198257 9.9438713"],
        ["9.8442045 8.5292476"],
        ["9.5218499 10.4179416"],
    ])
    df_data = pd.DataFrame({
        "features": data[:, 0],
    })
    # Wrap the DataFrame as a batch operator with a single string column.
    data = dataframeToOperator(df_data, schemaStr='features string', op_type='batch')

    # Train a GMM on the vector column, with the tolerance set to 0.
    gmm = GmmTrainBatchOp() \
        .setVectorCol("features") \
        .setTol(0.)
    model = gmm.linkFrom(data)
    model.print()

Results

       model_id                                          model_info
    0         0  {"vectorCol":"\"features\"","numFeatures":"2",...
    1   1048576  {"clusterId":0,"weight":0.7354489748549162,"me...
    2   2097152  {"clusterId":1,"weight":0.26455102514508383,"m...