Similarity - ApproxVectorSimilarityJoin(batch) - 《Alink v1.0.1 Document》

Description

ApproxVectorSimilarityJoinLSHBatchOp joins pairs of vectors, one from each of two datasets, whose distance is below a threshold. It is an approximate method based on locality-sensitive hashing (LSH).

The two datasets must each contain at least two columns: a vector column and an id column.

The class supports two distance types: EUCLIDEAN and JACCARD.
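To make the two distance types concrete, here is a minimal plain-Python sketch. It is not Alink's implementation: the function names are hypothetical, and the Jaccard variant assumes each vector is treated as the set of its non-zero positions, which may differ from how the operator handles its input.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length dense vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def jaccard_distance(u, v):
    """1 - |A & B| / |A | B|, where A and B are the sets of non-zero positions.

    Assumption: vectors are interpreted as sets of non-zero indices.
    """
    a = {i for i, x in enumerate(u) if x != 0}
    b = {i for i, x in enumerate(v) if x != 0}
    union = a | b
    if not union:
        return 0.0  # two all-zero vectors are treated as identical here
    return 1.0 - len(a & b) / len(union)

d_euclid = euclidean_distance([0, 0, 0], [1, 1, 1])       # sqrt(3)
d_jaccard = jaccard_distance([1, 0, 1, 0], [1, 1, 0, 0])  # 1 - 1/3
```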

The output contains three columns: the left id, the right id, and the distance between the two vectors (written to the column named by outputCol).
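The semantics of the join can be illustrated with an exact, brute-force version that compares every left vector with every right vector. This is only a reference sketch in plain Python: the operator itself uses LSH to avoid the full comparison, the helper names and toy data are hypothetical, and treating the threshold as a strict upper bound is an assumption.

```python
import math

def parse_vector(text):
    """Parse a space-separated vector string such as "1 1 1"."""
    return [float(x) for x in text.split()]

def exact_similarity_join(left, right, threshold):
    """Return (left_id, right_id, distance) for every pair below the threshold.

    left and right are lists of (id, vector_string) rows.
    """
    rows = []
    for left_id, left_vec in left:
        u = parse_vector(left_vec)
        for right_id, right_vec in right:
            dist = math.dist(u, parse_vector(right_vec))  # Euclidean distance
            if dist < threshold:  # assumption: strict comparison
                rows.append((left_id, right_id, dist))
    return rows

left_rows = [("a", "0 0"), ("b", "3 4")]
right_rows = [("x", "0 1"), ("y", "3 3")]
joined = exact_similarity_join(left_rows, right_rows, threshold=1.5)
```

An exact join like this costs one distance computation per pair of rows, which is the cost the approximate LSH method is meant to reduce.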

Parameters

Name	Description	Type	Required?	Default Value
distanceType	Distance type; supports EUCLIDEAN and JACCARD.	String		"EUCLIDEAN"
distanceThreshold	Distance threshold below which two vectors are joined	Double		1.7976931348623157E308
leftCol	Name of the vector column from the left table	String	✓
rightCol	Name of the vector column from the right table	String	✓
outputCol	Name of the output column	String	✓
leftIdCol	Name of the id column from the left table	String	✓
rightIdCol	Name of the id column from the right table	String	✓
projectionWidth	Bucket length, used in bucket random projection LSH.	Double		1.0
numHashTables	The number of hash tables	Integer		1
selectedCol	Name of the selected column used for processing	String	✓
numProjectionsPerTable	The number of hash functions within every hash table	Integer		1
seed	Random seed	Long		0
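The roles of projectionWidth, numHashTables, numProjectionsPerTable, and seed can be illustrated with a generic bucketed random projection scheme. The sketch below is an assumption about how such an LSH family typically works, not a description of Alink's internals (which may, for example, add a random offset). All function names are hypothetical.

```python
import numpy as np

def make_hash_tables(dim, num_hash_tables=1, num_projections_per_table=1, seed=0):
    """Draw one matrix of random Gaussian directions per hash table."""
    rng = np.random.default_rng(seed)
    return [rng.normal(size=(num_projections_per_table, dim))
            for _ in range(num_hash_tables)]

def bucket_keys(vec, tables, projection_width=1.0):
    """Project the vector onto each direction and cut the line into buckets
    of length projection_width; one tuple of bucket indices per table."""
    vec = np.asarray(vec, dtype=float)
    return [tuple(int(b) for b in np.floor(t @ vec / projection_width))
            for t in tables]

def is_candidate(u, v, tables, projection_width=1.0):
    """Two vectors become a candidate pair if they share a bucket key
    in at least one hash table."""
    keys_u = bucket_keys(u, tables, projection_width)
    keys_v = bucket_keys(v, tables, projection_width)
    return any(ku == kv for ku, kv in zip(keys_u, keys_v))

tables = make_hash_tables(dim=3, num_hash_tables=2, num_projections_per_table=3, seed=0)
zero_keys = bucket_keys([0, 0, 0], tables)
same_vector_matches = is_candidate([1, 1, 1], [1, 1, 1], tables)
```

Under this scheme, a larger projectionWidth or more hash tables makes it more likely that nearby vectors collide, while more projections per table makes each table more selective. Only candidate pairs would then have their true distance compared against distanceThreshold.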

Script Example

Code

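# -*- coding=UTF-8 -*-
# Assumes a PyAlink session is available (for example, started with useLocalEnv(1)).
from pyalink.alink import *
import numpy as np
import pandas as pd

# Three 3-dimensional vectors stored as space-separated strings, each with an id.
data = np.array([
    [0, "0 0 0"],
    [1, "1 1 1"],
    [2, "2 2 2"]
])
df = pd.DataFrame({"id": data[:, 0], "vec": data[:, 1]})
source = BatchOperator.fromDataframe(df, schemaStr='id int, vec string')

# Self-join: the same table is used as both the left and the right input.
op = (
    ApproxVectorSimilarityJoinLSHBatchOp()
    .setLeftIdCol("id")
    .setRightIdCol("id")
    .setLeftCol("vec")
    .setRightCol("vec")
    .setOutputCol("output")
    .setDistanceThreshold(2.0))
result = op.linkFrom(source, source).collectToDataframe()
print(result)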

Results

Output Data

rowID  id_left    id_right    output
0    0    0    0.0
1    1    1    0.0
2    2    2    0.0
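
The output above lists only self-matches. As a cross-check, the exact Euclidean distances for the same three vectors can be computed in plain Python. Neighbouring vectors are sqrt(3) ≈ 1.732 apart, which is below the threshold of 2.0, so an exact join would also return them. Their absence here is consistent with the method being approximate. The precise cause (presumably the default single hash table and projection width of 1.0) is an assumption, not something the output confirms.

```python
import math

# Same vectors and threshold as in the script example.
vectors = {0: [0.0, 0.0, 0.0], 1: [1.0, 1.0, 1.0], 2: [2.0, 2.0, 2.0]}
threshold = 2.0

# Exact join: every (left, right) pair whose distance is below the threshold.
exact_pairs = [
    (i, j, math.dist(u, v))
    for i, u in vectors.items()
    for j, v in vectors.items()
    if math.dist(u, v) < threshold
]

for left_id, right_id, dist in exact_pairs:
    print(left_id, right_id, round(dist, 3))
```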