Clustering - GaussianMixture - 《Alink v1.0.1 Document》

Description
Parameters
Script Example
- Code
- Results

Description

Gaussian Mixture is a kind of clustering algorithm.

Gaussian Mixture clustering performs expectation maximization for multivariate Gaussian Mixture Models (GMMs). A GMM represents a composite distribution of independent Gaussian distributions with associated “mixing” weights specifying each’s contribution to the composite.

Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than convergenceTol, or until it has reached the max number of iterations. While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.

Parameters

Name	Description	Type	Required？	Default Value
tol	Iteration tolerance.	Double		0.01
vectorCol	Name of a vector column	String	✓
k	Number of clusters.	Integer		2
maxIter	Maximum iterations, The default value is 100	Integer		100
vectorCol	Name of a vector column	String	✓
predictionCol	Column name of prediction.	String	✓
predictionDetailCol	Column name of prediction result, it will include detailed info.	String
reservedCols	Names of the columns to be retained in the output table	String[]		null

Script Example

Code

 data = np.array([
    ["-0.6264538 0.1836433"],
    ["-0.8356286 1.5952808"],
    ["0.3295078 -0.8204684"],
    ["0.4874291 0.7383247"],
    ["0.5757814 -0.3053884"],
    ["1.5117812 0.3898432"],
    ["-0.6212406 -2.2146999"],
    ["11.1249309 9.9550664"],
    ["9.9838097 10.9438362"],
    ["10.8212212 10.5939013"],
    ["10.9189774 10.7821363"],
    ["10.0745650 8.0106483"],
    ["10.6198257 9.9438713"],
    ["9.8442045 8.5292476"],
    ["9.5218499 10.4179416"],
])
df_data = pd.DataFrame({
    "features": data[:, 0],
})
data = dataframeToOperator(df_data, schemaStr='features string', op_type='batch')
gmm = GaussianMixture() \
    .setPredictionCol("cluster_id") \
    .setVectorCol("features") \
    .setPredictionDetailCol("cluster_detail")
    .setTol(0.)
gmm.fit(data).transform(data).print()

Results

                 features  cluster_id              cluster_detail
0    -0.6264538 0.1836433           0   1.0 4.275273913994647E-92
1    -0.8356286 1.5952808           0  1.0 1.0260377730322135E-92
2    0.3295078 -0.8204684           0  1.0 1.0970173367582936E-80
3     0.4874291 0.7383247           0    1.0 3.30217313232611E-75
4    0.5757814 -0.3053884           0   1.0 3.163811360527691E-76
5     1.5117812 0.3898432           0  1.0 2.1018052308786076E-62
6   -0.6212406 -2.2146999           0   1.0 6.772270268625197E-97
7    11.1249309 9.9550664           1  3.1567838012477083E-56 1.0
8    9.9838097 10.9438362           1  1.9024447346702333E-51 1.0
9   10.8212212 10.5939013           1  2.8009730987296404E-56 1.0
10  10.9189774 10.7821363           1  1.7209132744891575E-57 1.0
11   10.0745650 8.0106483           1   2.864269663513225E-43 1.0
12   10.6198257 9.9438713           1    5.77327399194046E-53 1.0
13    9.8442045 8.5292476           1  2.5273123050926845E-43 1.0
14   9.5218499 10.4179416           1  1.7314580596765865E-46 1.0