Description

Extracts all words from the dataset and records the document frequency (DF), word count (WC), and inverse document frequency (IDF) of every word as a model.
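
As a rough illustration of the first two statistics, the plain-Python sketch below counts DF and WC over documents that are already tokenized. It does not call Alink: `count_statistics` and the toy English tokens are hypothetical names and data chosen for the example, and the operator's internal implementation may differ.

```python
from collections import Counter

def count_statistics(docs):
    """Return (document_frequency, word_count) for a list of token lists."""
    doc_freq = Counter()
    word_count = Counter()
    for tokens in docs:
        word_count.update(tokens)      # every occurrence counts toward WC
        doc_freq.update(set(tokens))   # each document counts at most once toward DF
    return doc_freq, word_count

# Hypothetical pre-tokenized documents.
docs = [
    ["used", "book", "medical", "imaging"],
    ["used", "book", "book", "chess"],
    ["used", "poetry"],
]
doc_freq, word_count = count_statistics(docs)
print("DF:", dict(doc_freq))
print("WC:", dict(word_count))
```

Here "book" occurs three times in total but in only two documents, so its WC and DF differ.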

Parameters

Name | Description | Type | Required? | Default Value
maxDF | A word that appears in more than maxDF documents is excluded from the dictionary. The value can be an exact count or a fraction of the total number of documents; a value in [0, 1) is treated as a fraction. | Double | | 1.7976931348623157E308
selectedCol | Name of the column to process. | String | |
minDF | A word that appears in fewer than minDF documents is excluded from the dictionary. The value can be an exact count or a fraction of the total number of documents; a value in [0, 1) is treated as a fraction. | Double | | 1.0
featureType | Feature type. Supported values: IDF, WORD_COUNT, TF_IDF, Binary, TF. | String | | "WORD_COUNT"
vocabSize | Maximum number of words in the dictionary. If the total number of words exceeds this value, the words with the lowest document frequency are filtered out. | Integer | | 262144
minTF | A word that occurs fewer than minTF times in a document is ignored for that document. The value can be an exact count or a fraction of the document's token count; a value in [0, 1) is treated as a fraction. | Double | | 1.0
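
The count-or-fraction rule shared by maxDF and minDF, together with the vocabSize cap, can be sketched as follows. This is only an interpretation of the parameter descriptions above, not Alink's implementation: `resolve_threshold` and `filter_vocabulary` are hypothetical helpers, and breaking ties alphabetically when truncating to vocabSize is an assumption.

```python
DOUBLE_MAX = 1.7976931348623157e308  # default maxDF listed in the table

def resolve_threshold(value, total):
    """Treat a value in [0, 1) as a fraction of `total`, anything else as an exact count."""
    if 0 <= value < 1:
        return value * total
    return value

def filter_vocabulary(doc_freq, doc_count, min_df=1.0, max_df=DOUBLE_MAX, vocab_size=262144):
    """Keep words whose DF lies within [minDF, maxDF], then cap the dictionary size."""
    low = resolve_threshold(min_df, doc_count)
    high = resolve_threshold(max_df, doc_count)
    kept = [(word, df) for word, df in doc_freq.items() if low <= df <= high]
    # Words with lower DF are dropped first; the alphabetical tie-break is an assumption.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return dict(kept[:vocab_size])

doc_freq = {"used": 5, "press": 2, "chess": 1}   # hypothetical DF values over 5 documents
defaults = filter_vocabulary(doc_freq, doc_count=5)
strict = filter_vocabulary(doc_freq, doc_count=5, min_df=2, max_df=0.9)
print(defaults)
print(strict)
```

With the defaults nothing is filtered, because minDF = 1.0 falls outside [0, 1) and is therefore read as an exact count of one document. In the second call, maxDF = 0.9 resolves to 4.5 documents and removes "used", while minDF = 2 removes "chess". Under the same reading, minTF would be resolved against a document's token count rather than the number of documents.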

Script Example

Code

  # Assumes PyAlink is installed and an execution environment is available.
  from pyalink.alink import *
  import numpy as np
  import pandas as pd

  # Sample documents: second-hand book titles. The text stays in Chinese
  # because the segmenter below splits Chinese text into words.
  data = np.array([
      [0, u'二手旧书:医学电磁成像'],
      [1, u'二手美国文学选读( 下册 )李宜燮南开大学出版社 9787310003969'],
      [2, u'二手正版图解象棋入门/谢恩思主编/华龄出版社'],
      [3, u'二手中国糖尿病文献索引'],
      [4, u'二手郁达夫文集( 国内版 )全十二册馆藏书']])
  df = pd.DataFrame({"id": data[:, 0], "text": data[:, 1]})
  inOp1 = BatchOperator.fromDataframe(df, schemaStr='id int, text string')
  inOp2 = StreamOperator.fromDataframe(df, schemaStr='id int, text string')

  # Batch: segment the text, train the model, then vectorize the same documents.
  segment = SegmentBatchOp().setSelectedCol("text").linkFrom(inOp1)
  train = DocCountVectorizerTrainBatchOp().setSelectedCol("text").linkFrom(segment)
  predictBatch = DocCountVectorizerPredictBatchOp().setSelectedCol("text").linkFrom(train, segment)
  # Collect the trained model and the predictions (the original passed an undefined name, kmeans).
  [model, predict] = collectToDataframes(train, predictBatch)
  print(model)
  print(predict)

  # Stream: reuse the batch-trained model to vectorize a stream of documents.
  segment = SegmentStreamOp().setSelectedCol("text").linkFrom(inOp2)
  predictStream = DocCountVectorizerPredictStreamOp(train).setSelectedCol("text").linkFrom(segment)
  predictStream.print(refreshInterval=-1)
  StreamOperator.execute()

Results

Model
  rowID model_id model_info
  0 0 {"minTF":"1.0","featureType":"\"WORD_COUNT\""}
  1 1048576 {"f0":"二手","f1":0.0,"f2":0}
  2 2097152 {"f0":"/","f1":1.0986122886681098,"f2":1}
  3 3145728 {"f0":"出版社","f1":0.6931471805599453,"f2":2}
  4 4194304 {"f0":"(","f1":0.6931471805599453,"f2":3}
  5 5242880 {"f0":")","f1":0.6931471805599453,"f2":4}
  6 6291456 {"f0":"9787310003969","f1":1.0986122886681098,...
  7 7340032 {"f0":":","f1":1.0986122886681098,"f2":6}
  8 8388608 {"f0":"下册","f1":1.0986122886681098,"f2":7}
  9 9437184 {"f0":"中国","f1":1.0986122886681098,"f2":8}
  10 10485760 {"f0":"主编","f1":1.0986122886681098,"f2":9}
  11 11534336 {"f0":"书","f1":1.0986122886681098,"f2":10}
  12 12582912 {"f0":"入门","f1":1.0986122886681098,"f2":11}
  13 13631488 {"f0":"全","f1":1.0986122886681098,"f2":12}
  14 14680064 {"f0":"医学","f1":1.0986122886681098,"f2":13}
  15 15728640 {"f0":"十二册","f1":1.0986122886681098,"f2":14}
  16 16777216 {"f0":"华龄","f1":1.0986122886681098,"f2":15}
  17 17825792 {"f0":"南开大学","f1":1.0986122886681098,"f2":16}
  18 18874368 {"f0":"国内","f1":1.0986122886681098,"f2":17}
  19 19922944 {"f0":"图解","f1":1.0986122886681098,"f2":18}
  20 20971520 {"f0":"思","f1":1.0986122886681098,"f2":19}
  21 22020096 {"f0":"成像","f1":1.0986122886681098,"f2":20}
  22 23068672 {"f0":"文学","f1":1.0986122886681098,"f2":21}
  23 24117248 {"f0":"文献","f1":1.0986122886681098,"f2":22}
  24 25165824 {"f0":"文集","f1":1.0986122886681098,"f2":23}
  25 26214400 {"f0":"旧书","f1":1.0986122886681098,"f2":24}
  26 27262976 {"f0":"李宜燮","f1":1.0986122886681098,"f2":25}
  27 28311552 {"f0":"正版","f1":1.0986122886681098,"f2":26}
  28 29360128 {"f0":"版","f1":1.0986122886681098,"f2":27}
  29 30408704 {"f0":"电磁","f1":1.0986122886681098,"f2":28}
  30 31457280 {"f0":"糖尿病","f1":1.0986122886681098,"f2":29}
  31 32505856 {"f0":"索引","f1":1.0986122886681098,"f2":30}
  32 33554432 {"f0":"美国","f1":1.0986122886681098,"f2":31}
  33 34603008 {"f0":"谢恩","f1":1.0986122886681098,"f2":32}
  34 35651584 {"f0":"象棋","f1":1.0986122886681098,"f2":33}
  35 36700160 {"f0":"选读","f1":1.0986122886681098,"f2":34}
  36 37748736 {"f0":"郁达夫","f1":1.0986122886681098,"f2":35}
  37 38797312 {"f0":"馆藏","f1":1.0986122886681098,"f2":36}
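
The "f1" values in the model rows are consistent with a smoothed IDF of the form ln((1 + N) / (1 + DF)), where N = 5 is the number of sample documents. This formula is inferred from the printed numbers rather than taken from Alink's source, and `smoothed_idf` is a hypothetical helper; the sketch below simply recomputes three of the values.

```python
import math

def smoothed_idf(doc_count, doc_freq):
    """Smoothed IDF, assumed from the model output: ln((1 + N) / (1 + DF))."""
    return math.log((1.0 + doc_count) / (1.0 + doc_freq))

doc_count = 5  # number of sample documents
# DF values read off the sample titles: "二手" is in all five,
# "出版社" in two, and "旧书" in one.
samples = {"二手": 5, "出版社": 2, "旧书": 1}
idf = {word: smoothed_idf(doc_count, df) for word, df in samples.items()}
for word, value in idf.items():
    print(word, value)
```

Under this assumption a word that appears in every document gets an IDF of 0, which matches the "f1" of "二手" in the model.
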
Output Data
  rowID id text
  0 0 $37$0:1.0 6:1.0 13:1.0 20:1.0 24:1.0 28:1.0
  1 1 $37$0:1.0 2:1.0 3:1.0 4:1.0 5:1.0 7:1.0 16:1.0...
  2 2 $37$0:1.0 1:2.0 2:1.0 9:1.0 11:1.0 15:1.0 18:1...
  3 3 $37$0:1.0 8:1.0 22:1.0 29:1.0 30:1.0
  4 4 $37$0:1.0 3:1.0 4:1.0 10:1.0 12:1.0 14:1.0 17:...
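
Each value in the text column appears to be a sparse vector written as `$size$index:value index:value ...`, where the indices line up with the "f2" field of the model rows. The sketch below parses that string form in plain Python; `parse_sparse_vector` is a hypothetical helper, and this reading of the format is inferred from the output above rather than from an Alink specification.

```python
def parse_sparse_vector(text):
    """Parse '$size$i:v i:v ...' into (size, {index: value})."""
    # Splitting on the first two '$' gives an empty prefix, the size, and the entries.
    _, size, body = text.split("$", 2)
    entries = {}
    for pair in body.split():
        index, value = pair.split(":")
        entries[int(index)] = float(value)
    return int(size), entries

# Row 3 of the output data above.
size, entries = parse_sparse_vector("$37$0:1.0 8:1.0 22:1.0 29:1.0 30:1.0")
print(size, entries)
```

For this row the indices 0, 8, 22, 29, and 30 correspond to the model words "二手", "中国", "文献", "糖尿病", and "索引", which are the segments of the input title "二手中国糖尿病文献索引", each with a word count of 1.0.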