Binary Quantization from Scratch

Setup: Install Dependencies, Imports & Download Embeddings

```python
!pip install matplotlib tqdm pandas numpy datasets --quiet --upgrade
```

```python
import numpy as np
import pandas as pd
from datasets import load_dataset
from tqdm import tqdm
```

👨🏾‍💻 Code Walkthrough

Here’s an overview of the code that follows:

  1. Loading Data: OpenAI embeddings are loaded from Parquet files (up to 1M embeddings can be loaded; this notebook uses 100K) and concatenated into one array.
  2. Binary Conversion: A new array with the same shape is initialized with zeros, and the positions where the original vectors are positive are set to 1.
  3. Accuracy Function: The accuracy function compares original vectors with binary vectors for a given index, limit, and oversampling rate. The comparison is done by taking dot products and logical XOR, sorting the results, and measuring the intersection.
  4. Testing: The accuracy is tested for oversampling rates of 1, 2, 3, and 5, reaching a correctness of ~0.98–0.99 at an oversampling of 3 and above.

💿 Loading Data

```python
# Download from the Hugging Face Hub
ds = load_dataset(
    "Qdrant/dbpedia-entities-openai3-text-embedding-3-large-3072-100K", split="train"
)
openai_vectors = np.array(ds["text-embedding-3-large-3072-embedding"])
del ds
```

```python
# Binarize: 1 where the original value is strictly positive, 0 elsewhere
openai_bin = np.zeros_like(openai_vectors, dtype=np.int8)
openai_bin[openai_vectors > 0] = 1
```

```python
n_dim = openai_vectors.shape[1]
n_dim
```

```
3072
```
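The motivation for binarization is memory footprint. A rough per-vector sketch, assuming float32 originals and bit-packing (e.g. via `np.packbits`); note that the `int8` array used in this notebook trades some of that saving for simplicity:

```python
# Back-of-the-envelope memory comparison per 3072-dim vector
n_dim = 3072
float_bytes = n_dim * 4    # float32 originals: 4 bytes per dimension = 12288 bytes
packed_bytes = n_dim // 8  # 1 bit per dimension if packed (e.g. np.packbits) = 384 bytes
int8_bytes = n_dim         # this notebook's int8-per-bit layout = 3072 bytes
print(float_bytes / packed_bytes)  # -> 32.0
print(float_bytes / int8_bytes)    # -> 4.0
```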

🎯 Accuracy Function

We will use the accuracy function to compare the original vectors with the binary vectors for a given index, limit, and oversampling rate. For each query, exact search takes the top `limit` neighbors by dot product on the float vectors, binary search takes the top `limit * oversampling` neighbors by Hamming similarity (computed via logical XOR) on the binary vectors, and the accuracy is the fraction of the exact results recovered by the binary search.
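For intuition, here is the XOR-based Hamming similarity on a pair of toy 4-dimensional binary vectors (made-up values):

```python
import numpy as np

a = np.array([1, 0, 1, 1], dtype=np.int8)
b = np.array([1, 1, 1, 0], dtype=np.int8)
n_dim = a.shape[0]

# XOR marks the positions where the bits differ; similarity counts the matches
differing = int(np.logical_xor(a, b).sum())
hamming_similarity = n_dim - differing
print(hamming_similarity)  # -> 2 (bits agree at positions 0 and 2)
```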

```python
def accuracy(idx, limit: int, oversampling: int):
    # Exact search: top-`limit` neighbors by dot product on the float vectors
    scores = np.dot(openai_vectors, openai_vectors[idx])
    dot_results = np.argsort(scores)[-limit:][::-1]
    # Binary search: Hamming similarity = n_dim minus the number of differing bits
    bin_scores = n_dim - np.logical_xor(openai_bin, openai_bin[idx]).sum(axis=1)
    bin_results = np.argsort(bin_scores)[-(limit * oversampling):][::-1]
    # Fraction of the exact top-`limit` recovered by the oversampled binary search
    return len(set(dot_results).intersection(set(bin_results))) / limit
```
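In a production binary-quantization setup, the oversampled binary search is typically followed by a rescoring pass that re-ranks the candidates with the exact float vectors. The sketch below shows that two-stage idea; `search_with_rescoring` is our own name, and it runs on synthetic stand-in data rather than the OpenAI embeddings:

```python
import numpy as np

# Synthetic stand-in for the embeddings (hypothetical data: 1000 vectors, 64 dims)
rng = np.random.default_rng(42)
vectors = rng.normal(size=(1000, 64))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
binary = (vectors > 0).astype(np.int8)
n_dim = vectors.shape[1]

def search_with_rescoring(idx, limit, oversampling):
    # First pass: cheap Hamming-similarity scores over all binary vectors
    bin_scores = n_dim - np.logical_xor(binary, binary[idx]).sum(axis=1)
    candidates = np.argsort(bin_scores)[-(limit * oversampling):]
    # Second pass: exact dot products, but only on the small candidate set
    exact = np.dot(vectors[candidates], vectors[idx])
    return candidates[np.argsort(exact)[-limit:][::-1]]

top = search_with_rescoring(0, limit=5, oversampling=4)
print(top[0])  # -> 0, the query itself ranks first after rescoring
```

Rescoring keeps the expensive float arithmetic proportional to `limit * oversampling` rather than to the full collection size.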

📊 Results

```python
number_of_samples = 10
limits = [3, 10]
sampling_rate = [1, 2, 3, 5]
results = []


def mean_accuracy(number_of_samples, limit, sampling_rate):
    # Average the accuracy over the first `number_of_samples` query vectors
    return np.mean(
        [accuracy(i, limit=limit, oversampling=sampling_rate) for i in range(number_of_samples)]
    )


for i in tqdm(sampling_rate):
    for j in tqdm(limits):
        result = {
            "sampling_rate": i,
            "limit": j,
            "mean_acc": mean_accuracy(number_of_samples, j, i),
        }
        print(result)
        results.append(result)
```
```
{'sampling_rate': 1, 'limit': 3, 'mean_acc': 0.9}
{'sampling_rate': 1, 'limit': 10, 'mean_acc': 0.8300000000000001}
{'sampling_rate': 2, 'limit': 3, 'mean_acc': 1.0}
{'sampling_rate': 2, 'limit': 10, 'mean_acc': 0.9700000000000001}
{'sampling_rate': 3, 'limit': 3, 'mean_acc': 1.0}
{'sampling_rate': 3, 'limit': 10, 'mean_acc': 0.9800000000000001}
{'sampling_rate': 5, 'limit': 3, 'mean_acc': 1.0}
{'sampling_rate': 5, 'limit': 10, 'mean_acc': 0.99}
```

🔄 Binary Conversion

As noted in the walkthrough, we use 0 as the threshold for the binary conversion: all values greater than 0 are set to 1, and the rest remain 0. This is a simple and effective way to turn the continuous values of OpenAI embeddings into binary ones. The table below summarizes the accuracies measured above.
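The thresholding is easy to see on a toy vector (made-up values):

```python
import numpy as np

vec = np.array([0.12, -0.05, 0.0, 0.33])
binary = np.zeros_like(vec, dtype=np.int8)
binary[vec > 0] = 1  # strictly positive values become 1; zeros and negatives stay 0
print(binary)        # -> [1 0 0 1]
```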

```python
results = pd.DataFrame(results)
results
```

|   | sampling_rate | limit | mean_acc |
|---|---------------|-------|----------|
| 0 | 1             | 3     | 0.90     |
| 1 | 1             | 10    | 0.83     |
| 2 | 2             | 3     | 1.00     |
| 3 | 2             | 10    | 0.97     |
| 4 | 3             | 3     | 1.00     |
| 5 | 3             | 10    | 0.98     |
| 6 | 5             | 3     | 1.00     |
| 7 | 5             | 10    | 0.99     |
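A pivot of the table makes the trend easier to read, with one row per oversampling rate and one column per limit; the numbers below are hard-coded from the run above:

```python
import pandas as pd

# Measured mean accuracies from the experiment above
results = pd.DataFrame({
    "sampling_rate": [1, 1, 2, 2, 3, 3, 5, 5],
    "limit": [3, 10, 3, 10, 3, 10, 3, 10],
    "mean_acc": [0.90, 0.83, 1.00, 0.97, 1.00, 0.98, 1.00, 0.99],
})
pivot = results.pivot(index="sampling_rate", columns="limit", values="mean_acc")
print(pivot)
```

The pattern is clear: accuracy degrades as `limit` grows, and oversampling buys it back, with diminishing returns beyond ~3x.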