Description

Latent Dirichlet Allocation (LDA) is a topic model designed for text documents. It represents each document as a mixture of latent topics and each topic as a distribution over words.
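As background (this is the standard LDA generative model, not anything specific to this implementation), each document draws a topic mixture from a Dirichlet prior with concentration parameter alpha, and each topic draws a word distribution from a Dirichlet prior with concentration parameter beta; these correspond to the `alpha` and `beta` parameters in the table below.

    % Standard LDA generative process (Blei, Ng & Jordan, 2003);
    % \alpha and \beta are the concentration parameters in the table below.
    \theta_d \sim \mathrm{Dirichlet}(\alpha)            % per-document topic mixture
    \phi_k \sim \mathrm{Dirichlet}(\beta)               % per-topic word distribution
    z_{d,n} \sim \mathrm{Multinomial}(\theta_d)         % topic of word n in document d
    w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}})   % observed word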

Parameters

| Name | Description | Type | Required? | Default Value |
| --- | --- | --- | --- | --- |
| topicNum | Number of topics. | Integer | | |
| alpha | Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). | Double | | -1.0 |
| beta | Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. | Double | | -1.0 |
| method | Optimizer method: "em" or "online". | String | | "em" |
| onlineLearningOffset | (For online optimizer) A positive learning parameter that downweights early iterations. Larger values make early iterations count less. | Double | | 1024.0 |
| learningDecay | (For online optimizer) Learning rate, set as an exponential decay rate. This should be in (0.5, 1.0] to guarantee asymptotic convergence. | Double | | 0.51 |
| subsamplingRate | (For online optimizer) Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1]. | Double | | 0.05 |
| optimizeDocConcentration | (For online optimizer only, currently) Indicates whether the docConcentration (Dirichlet parameter for the document-topic distribution) will be optimized during training. | Boolean | | true |
| numIter | Number of iterations. | Integer | | 10 |
| vocabSize | The maximum vocabulary size. If the total number of distinct words exceeds this value, the words with lower document frequency are filtered out. | Integer | | 262144 |
| selectedCol | Name of the selected column used for processing. | String | | |
| predictionCol | Column name of the prediction result. | String | | |
| predictionDetailCol | Column name of the prediction result; it will include detailed info. | String | | |
| reservedCols | Names of the columns to be retained in the output table. | String[] | | null |
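For the online optimizer, onlineLearningOffset and learningDecay jointly control the step size: in online variational Bayes for LDA (Hoffman et al., 2010), the learning rate at mini-batch t is typically (onlineLearningOffset + t)^(-learningDecay), which is why a larger offset downweights early iterations. Below is a minimal configuration sketch touching these parameters; it assumes the setters follow the usual set<ParamName> convention implied by the table (e.g. setOnlineLearningOffset), and the topic/iteration counts are purely illustrative.

    # Sketch: configuring the online optimizer explicitly. Setter names are
    # assumed from the parameter table above; the online-optimizer values
    # shown are just the documented defaults.
    lda = (
        Lda()
        .setSelectedCol("doc")              # text column to model
        .setTopicNum(10)                    # illustrative topic count
        .setMethod("online")                # online variational optimizer
        .setOnlineLearningOffset(1024.0)    # downweights early mini-batches
        .setLearningDecay(0.51)             # decay exponent; keep in (0.5, 1.0]
        .setSubsamplingRate(0.05)           # corpus fraction per mini-batch
        .setOptimizeDocConcentration(True)  # also optimize alpha during training
        .setNumIter(50)                     # illustrative iteration count
        .setPredictionCol("pred")
    )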

Script Example

Code

    import numpy as np
    import pandas as pd
    from pyalink.alink import *

    # Each row is one document given as a space-separated bag of words.
    data = np.array(["a b b c c c c c c e e f f f g h k k k",
                     "a b b b d e e e h h k",
                     "a b b b b c f f f f g g g g g g g g g i j j",
                     "a a b d d d g g g g g i i j j j k k k k k k k k k",
                     "a a a b c d d d d d d d d d e e e g g j k k k",
                     "a a a a b b d d d e e e e f f f f f g h i j j j j",
                     "a a b d d d g g g g g i i j j k k k k k k k k k",
                     "a b c d d d d d d d d d e e f g g j k k k",
                     "a a a a b b b b d d d e e e e f f g h h h",
                     "a a b b b b b b b b c c e e e g g i i j j j j j j j k k",
                     "a b c d d d d d d d d d f f g g j j j k k k",
                     "a a a a b e e e e f f f f f g h h h j"])
    df = pd.DataFrame({"doc": data})
    data = dataframeToOperator(df, schemaStr="doc string", op_type="batch")

    # Train LDA with the online optimizer and predict each document's topic.
    op = Lda()\
        .setSelectedCol("doc")\
        .setTopicNum(6)\
        .setMethod("online")\
        .setPredictionCol("pred")
    pipeline = Pipeline().add(op)
    model = pipeline.fit(data)
    model.transform(data).collectToDataframe()
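To also inspect each document's distribution over topics, the predictionDetailCol parameter from the table above can be set on the same pipeline. A sketch reusing the example's data, where "pred_detail" is a placeholder column name:

    # Sketch: additionally emit per-document prediction details via
    # predictionDetailCol ("pred_detail" is a placeholder column name).
    op = Lda()\
        .setSelectedCol("doc")\
        .setTopicNum(6)\
        .setMethod("online")\
        .setPredictionCol("pred")\
        .setPredictionDetailCol("pred_detail")
    model = Pipeline().add(op).fit(data)
    print(model.transform(data).collectToDataframe())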

Results

Model
| model_id | model_info |
| --- | --- |
| 0 | {"logPerplexity":"22.332946259667825","betaArray":"[0.2,0.2,0.2,0.2,0.2]","logLikelihood":"-915.6507966463809","method":"\"online\"","alphaArray":"[0.16926092344987234,0.17828690973899627,0.17282213771078062,0.18555794554097874,0.15898463316059516]","topicNum":"5","vocabularySize":"11"} |
| 1048576 | {"m":5,"n":11,"data":[6135.5227952852865,7454.918734235136,6569.887273287071,7647.590029783137,7459.37196542985,6689.783286316853,8396.842418256507,7771.120258275389,7497.94247894282,7983.617922597562,7975.470848777338,7114.413879475893,8420.381073064213,6747.377398176922,6959.728145538011,7368.902852508116,7635.5968635989275,6734.522904998126,6792.566021565353,6487.885790775943,8086.932892160501,8443.888239756887,7227.0417299467745,7561.023252667202,6264.97808011349,6964.080980387547,8234.247108608217,8263.190977757107,7872.088651923572,7725.669369347696,7591.453097717432,7733.627117746213,6595.2753568320295,8158.346230399092,7765.777648163369,6456.891859572009,6814.768507000475,6612.17371610521,6506.877213010642,7166.140342089344,7588.370517354863,7645.016947338933,8929.620632942893,6855.855247335312,7263.088264847597,7993.009126022237,7302.794183756114,6074.524636118613,6386.578740892538,8465.84700774072,7242.276290933901,7257.474039179472,7676.72445702261,6733.70550536632,7577.265607033211]} |
| 2097152 | {"f0":"d","f1":0.36772478012531734,"f2":0} |
| 3145728 | {"f0":"k","f1":0.36772478012531734,"f2":1} |
| 4194304 | {"f0":"g","f1":0.08004270767353636,"f2":2} |
| 5242880 | {"f0":"b","f1":0.0,"f2":3} |
| 6291456 | {"f0":"a","f1":0.0,"f2":4} |
| 7340032 | {"f0":"e","f1":0.36772478012531734,"f2":5} |
| 8388608 | {"f0":"j","f1":0.26236426446749106,"f2":6} |
| 9437184 | {"f0":"f","f1":0.4855078157817008,"f2":7} |
| 10485760 | {"f0":"c","f1":0.6190392084062235,"f2":8} |
| 11534336 | {"f0":"h","f1":0.7731898882334817,"f2":9} |
| 12582912 | {"f0":"i","f1":0.7731898882334817,"f2":10} |
Prediction
| doc | pred |
| --- | --- |
| a b b b d e e e h h k | 1 |
| a a b d d d g g g g g i i j j j k k k k k k k k k | 3 |
| a a a a b b d d d e e e e f f f f f g h i j j j j | 3 |
| a a b d d d g g g g g i i j j k k k k k k k k k | 1 |
| a a a a b b b b d d d e e e e f f g h h h | 3 |
| a b c d d d d d d d d d f f g g j j j k k k | 3 |
| a b b c c c c c c e e f f f g h k k k | 2 |
| a b b b b c f f f f g g g g g g g g g i j j | 0 |
| a a a b c d d d d d d d d d e e e g g j k k k | 3 |
| a b c d d d d d d d d d e e f g g j k k k | 3 |
| a a b b b b b b b b c c e e e g g i i j j j j j j j k k | 3 |
| a a a a b e e e e f f f f f g h h h j | 0 |