Similarity - ApproxVectorSimilarityTopN(batch) - 《Alink v1.0.1 Document》

Description

ApproxVectorSimilarityTopNLSHBatchOp finds, for every record in the first dataset, its top-N nearest neighbors in the second dataset. It is an approximate method based on locality-sensitive hashing (LSH), so it may miss some true neighbors.

Each of the two datasets must contain at least two columns: a vector column and an id column.

The class supports two distance types: EUCLIDEAN and JACCARD.
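To make the two distance types concrete, here is a minimal sketch in plain Python. It is not Alink's implementation. In particular, treating a vector as the set of its nonzero indices for the Jaccard case, and returning 0.0 for two empty sets, are assumptions made for this illustration.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|, where A and B are the sets of nonzero indices.

    Assumption: vectors are interpreted as sets of their nonzero positions.
    """
    set_a = {i for i, x in enumerate(a) if x != 0}
    set_b = {i for i, x in enumerate(b) if x != 0}
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two all-zero vectors
    return 1.0 - len(set_a & set_b) / len(union)


print(euclidean_distance([0, 0, 0], [1, 1, 1]))
print(jaccard_distance([1, 0, 1], [1, 1, 0]))
```

Euclidean distance suits dense numeric vectors, while Jaccard distance compares which positions are set and ignores magnitudes.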

The output contains four columns: leftId, rightId, distance, rank.
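As a reference point for what the approximate search is estimating, the following sketch computes the exact top-N by brute force and emits rows in the four-column shape described above. The function name `exact_top_n` is hypothetical, and starting `rank` at 1 is an assumption inferred from the sample output on this page. Alink's operator does not work this way internally.

```python
import math


def exact_top_n(left, right, top_n=5):
    """Brute-force top-N search with Euclidean distance.

    left, right: lists of (id, vector) pairs.
    Returns rows of (leftId, rightId, distance, rank), rank starting at 1.
    """
    rows = []
    for left_id, left_vec in left:
        # Sort all right-hand records by distance; ties are broken by id.
        scored = sorted(
            (math.dist(left_vec, right_vec), right_id)
            for right_id, right_vec in right
        )
        for rank, (distance, right_id) in enumerate(scored[:top_n], start=1):
            rows.append((left_id, right_id, distance, rank))
    return rows


records = [(0, [0.0, 0.0, 0.0]), (1, [1.0, 1.0, 1.0]), (2, [2.0, 2.0, 2.0])]
result = exact_top_n(records, records, top_n=2)
for row in result:
    print(row)
```

The brute-force version compares every left record with every right record, which is the cost that an LSH-based method tries to avoid on large datasets.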

Parameters

Name	Description	Type	Required?	Default Value
distanceType	Distance type used to compare vectors; supports EUCLIDEAN and JACCARD.	String		"EUCLIDEAN"
topN	Number of nearest neighbors returned for each record	Integer		5
leftCol	Name of the tensor column from left table	String	✓
rightCol	Name of the tensor column from the right table	String	✓
outputCol	Name of the output column	String	✓
leftIdCol	Name of the id column from left table	String	✓
rightIdCol	Name of the id column from right table	String	✓
projectionWidth	Bucket length, used in bucket random projection LSH.	Double		1.0
numHashTables	The number of hash tables	Integer		1
selectedCol	Name of the selected column used for processing	String	✓
numProjectionsPerTable	The number of hash functions within every hash table	Integer		1
seed	Random seed	Long		0
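The LSH parameters in the table are easier to read next to a small example. The sketch below shows the textbook bucketed random projection scheme, where each hash function is floor((a·x + b) / w). It is an illustration under stated assumptions, not Alink's code: the function names are hypothetical, and the mapping of `projectionWidth` to w, `numHashTables` to the number of tables, `numProjectionsPerTable` to the hash functions per table, and `seed` to the random generator is inferred from the parameter descriptions.

```python
import math
import random


def make_hash_tables(dim, num_hash_tables=1, num_projections_per_table=1,
                     projection_width=1.0, seed=0):
    """Draw random projections: one (direction, offset) pair per hash function."""
    rng = random.Random(seed)
    tables = []
    for _ in range(num_hash_tables):
        projections = []
        for _ in range(num_projections_per_table):
            direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            offset = rng.uniform(0.0, projection_width)
            projections.append((direction, offset))
        tables.append(projections)
    return tables


def bucket_keys(vector, tables, projection_width=1.0):
    """Return one bucket key per table; a key is a tuple of integer hashes."""
    keys = []
    for projections in tables:
        key = tuple(
            math.floor(
                (sum(d * x for d, x in zip(direction, vector)) + offset)
                / projection_width
            )
            for direction, offset in projections
        )
        keys.append(key)
    return keys


def candidate_pairs(left, right, tables, projection_width=1.0):
    """Pairs (left_id, right_id) that share a bucket in at least one table."""
    index = {}
    for right_id, vec in right:
        for t, key in enumerate(bucket_keys(vec, tables, projection_width)):
            index.setdefault((t, key), set()).add(right_id)
    pairs = set()
    for left_id, vec in left:
        for t, key in enumerate(bucket_keys(vec, tables, projection_width)):
            for right_id in index.get((t, key), ()):
                pairs.add((left_id, right_id))
    return pairs


records = [(0, [0.0, 0.0, 0.0]), (1, [1.0, 1.0, 1.0]), (2, [2.0, 2.0, 2.0])]
tables = make_hash_tables(dim=3, num_hash_tables=2,
                          num_projections_per_table=3,
                          projection_width=1.0, seed=0)
pairs = candidate_pairs(records, records, tables, projection_width=1.0)
print(sorted(pairs))
```

In this scheme, only pairs that land in the same bucket are compared. A larger bucket width or more hash tables yields more candidate pairs, while more hash functions per table makes each bucket more selective.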

Script Example

Code

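# Assumes the PyAlink package is installed; it provides BatchOperator
# and ApproxVectorSimilarityTopNLSHBatchOp.
from pyalink.alink import *
import pandas as pd

# Each record has an integer id and a vector written as a space-separated string.
df = pd.DataFrame({
    "id": [0, 1, 2],
    "vec": ["0 0 0", "1 1 1", "2 2 2"]
})
source = BatchOperator.fromDataframe(df, schemaStr='id int, vec string')

# Configure the id and vector columns of both inputs and the output column name.
op = (
    ApproxVectorSimilarityTopNLSHBatchOp()
    .setLeftIdCol("id")
    .setRightIdCol("id")
    .setLeftCol("vec")
    .setRightCol("vec")
    .setOutputCol("output"))

# Match the dataset against itself and collect the result as a pandas DataFrame.
result = op.linkFrom(source, source).collectToDataframe()
print(result)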

Results

Output Data

rowID  id_right  id_left  output  rank
0      0         0        0.0     1
1      1         1        0.0     1
2      2         2        0.0     1