Description

ApproxVectorSimilarityTopNLSHBatchOp searches the second dataset for the top-N nearest neighbors of every record in the first dataset. It is an approximate method based on locality-sensitive hashing (LSH).

Each of the two datasets must contain at least two columns: a vector column and an id column.

The class supports two distance types: EUCLIDEAN and JACCARD.
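To make the two distance types concrete, here is a minimal pure-Python sketch of both. The helper names `euclidean_distance` and `jaccard_distance` are hypothetical, and treating the non-zero indices of a vector as its set for the Jaccard case is an assumption made for illustration; the operator's internal handling may differ.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|, where A and B are the sets of non-zero indices.

    Assumption: a vector is read as the set of positions holding non-zero values.
    """
    set_a = {i for i, x in enumerate(a) if x != 0}
    set_b = {i for i, x in enumerate(b) if x != 0}
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two all-zero vectors
    return 1.0 - len(set_a & set_b) / len(union)


print(euclidean_distance([0, 0, 0], [1, 1, 1]))
print(jaccard_distance([1, 0, 1, 0], [1, 1, 0, 0]))
```

Euclidean distance suits dense numeric vectors, while Jaccard distance compares which positions are populated and ignores the magnitudes.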

The output contains four columns: leftId, rightId, distance, rank.
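For reference, the exact result that the operator approximates can be sketched as an exhaustive search. This is not the operator's implementation: `exact_top_n` is a hypothetical helper, Euclidean distance is assumed, and ranks are assumed to start at 1 with the closest match first.

```python
import math


def exact_top_n(left, right, top_n=5):
    """Return (left_id, right_id, distance, rank) rows by exhaustive search."""
    rows = []
    for left_id, left_vec in left:
        # Score every right-hand record, then sort by distance (ties by id).
        scored = sorted(
            (math.dist(left_vec, right_vec), right_id)
            for right_id, right_vec in right
        )
        for rank, (distance, right_id) in enumerate(scored[:top_n], start=1):
            rows.append((left_id, right_id, distance, rank))
    return rows


records = [(0, (0.0, 0.0, 0.0)), (1, (1.0, 1.0, 1.0)), (2, (2.0, 2.0, 2.0))]
for row in exact_top_n(records, records, top_n=2):
    print(row)
```

An LSH-based search only compares records that share a hash bucket, so it can return fewer rows per record than this exhaustive version.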

Parameters

| Name | Description | Type | Required? | Default Value |
| --- | --- | --- | --- | --- |
| distanceType | Distance type used for the similarity search; supports EUCLIDEAN and JACCARD. | String | | "EUCLIDEAN" |
| topN | Number of nearest neighbors to return for each record | Integer | | 5 |
| leftCol | Name of the vector column from the left table | String | | |
| rightCol | Name of the vector column from the right table | String | | |
| outputCol | Name of the output column | String | | |
| leftIdCol | Name of the id column from the left table | String | | |
| rightIdCol | Name of the id column from the right table | String | | |
| projectionWidth | Bucket length, used in bucketed random projection LSH | Double | | 1.0 |
| numHashTables | The number of hash tables | Integer | | 1 |
| selectedCol | Name of the selected column used for processing | String | | |
| numProjectionsPerTable | The number of hash functions within every hash table | Integer | | 1 |
| seed | Random seed | Long | | 0 |
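The LSH parameters are easier to follow with a generic sketch of bucketed random projection hashing. This is an illustration under stated assumptions, not the operator's actual code: the function names are hypothetical, directions are drawn from a standard Gaussian, and the random offset that some LSH variants add before bucketing is left out.

```python
import math
import random


def make_hash_tables(dim, num_hash_tables=1, num_projections_per_table=1, seed=0):
    """Draw one Gaussian random direction per hash function, grouped by table."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)]
         for _ in range(num_projections_per_table)]
        for _ in range(num_hash_tables)
    ]


def bucket_keys(vec, tables, projection_width=1.0):
    """One key per table: floor(dot(vec, direction) / width) for each direction."""
    keys = []
    for directions in tables:
        key = tuple(
            math.floor(sum(v * d for v, d in zip(vec, direction)) / projection_width)
            for direction in directions
        )
        keys.append(key)
    return keys


def are_candidates(vec_a, vec_b, tables, projection_width=1.0):
    """Two vectors are compared only if they share a bucket in at least one table."""
    keys_a = bucket_keys(vec_a, tables, projection_width)
    keys_b = bucket_keys(vec_b, tables, projection_width)
    return any(ka == kb for ka, kb in zip(keys_a, keys_b))


tables = make_hash_tables(dim=3, num_hash_tables=2, num_projections_per_table=2, seed=0)
print(bucket_keys([1.0, 1.0, 1.0], tables, projection_width=1.0))
print(are_candidates([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], tables))
```

In this sketch, a larger `projection_width` or more hash tables makes collisions more likely (more candidates, higher recall, more work), while more projections per table makes each bucket narrower (fewer candidates). A fixed `seed` makes the hash functions reproducible.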

Script Example

Code

from pyalink.alink import *  # provides BatchOperator and ApproxVectorSimilarityTopNLSHBatchOp
import numpy as np
import pandas as pd

# Each row holds an id and a space-separated vector string.
data = np.array([
    [0, "0 0 0"],
    [1, "1 1 1"],
    [2, "2 2 2"]
])
df = pd.DataFrame({"id": data[:, 0], "vec": data[:, 1]})
source = BatchOperator.fromDataframe(df, schemaStr='id int, vec string')

# Use the same dataset as both the left and the right input.
op = (
    ApproxVectorSimilarityTopNLSHBatchOp()
    .setLeftIdCol("id")
    .setRightIdCol("id")
    .setLeftCol("vec")
    .setRightCol("vec")
    .setOutputCol("output"))
op.linkFrom(source, source).collectToDataframe()

Results

Output Data
rowID  id_right  id_left  output  rank
0      0         0        0.0     1
1      1         1        0.0     1
2      2         2        0.0     1