🤗 Huggingface vs ⚡ FastEmbed️
Comparing the performance of Huggingface’s 🤗 Transformers and ⚡ FastEmbed️ on a simple task on the following machine: Apple M2 Max, 32 GB RAM.
📦 Imports
Importing the necessary libraries for this comparison.
```python
!pip install matplotlib transformers torch fastembed -qq
```
```python
import time
from typing import Callable

import matplotlib.pyplot as plt
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoModel, AutoTokenizer

from fastembed import TextEmbedding
```
```python
import fastembed

fastembed.__version__
```
'0.2.6'
📖 Data
`documents` is a list of strings; each string is one document.
```python
documents: list[str] = [
    "Chandrayaan-3 is India's third lunar mission",
    "It aimed to land a rover on the Moon's surface - joining the US, China and Russia",
    "The mission is a follow-up to Chandrayaan-2, which had partial success",
    "Chandrayaan-3 will be launched by the Indian Space Research Organisation (ISRO)",
    "The estimated cost of the mission is around $35 million",
    "It will carry instruments to study the lunar surface and atmosphere",
    "Chandrayaan-3 landed on the Moon's surface on 23rd August 2023",
    "It consists of a lander named Vikram and a rover named Pragyan similar to Chandrayaan-2. Its propulsion module would act like an orbiter.",
    "The propulsion module carries the lander and rover configuration until the spacecraft is in a 100-kilometre (62 mi) lunar orbit",
    "The mission used GSLV Mk III rocket for its launch",
    "Chandrayaan-3 was launched from the Satish Dhawan Space Centre in Sriharikota",
    "Chandrayaan-3 was launched earlier in the year 2023",
]
len(documents)
```
12
Setting up 🤗 Huggingface
We’ll use Huggingface Transformers (with PyTorch) to generate embeddings, and we’ll use the same model across both libraries for a fair(er?) comparison.
```python
class HF:
    """HuggingFace Transformer implementation of FlagEmbedding"""

    def __init__(self, model_id: str) -> None:
        self.model = AutoModel.from_pretrained(model_id)
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

    def embed(self, texts: list[str]):
        encoded_input = self.tokenizer(
            texts, max_length=512, padding=True, truncation=True, return_tensors="pt"
        )
        model_output = self.model(**encoded_input)
        # CLS pooling: take the first token's hidden state, then L2-normalize
        sentence_embeddings = model_output[0][:, 0]
        sentence_embeddings = F.normalize(sentence_embeddings)
        return sentence_embeddings


model_id = "BAAI/bge-small-en-v1.5"
hf = HF(model_id=model_id)
hf.embed(documents).shape
```
torch.Size([12, 384])
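Since `embed` L2-normalizes its output, the dot product of two embeddings equals their cosine similarity. As a quick sanity check of the setup (a hypothetical query for illustration, not part of the benchmark), we can run a toy semantic search over the documents:

```python
# Toy semantic search (hypothetical query, not part of the benchmark).
# Embeddings are unit-normalized, so dot product == cosine similarity.
query_embedding = hf.embed(["When did Chandrayaan-3 land on the Moon?"])
scores = (query_embedding @ hf.embed(documents).T).squeeze(0)
print(documents[int(scores.argmax())])  # best-matching document
```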
Setting up ⚡️ FastEmbed
Sorry, there isn’t a lot to set up here. We point FastEmbed at the same Flag Embedding model (`BAAI/bge-small-en-v1.5`) that we used with Huggingface above.
```python
embedding_model = TextEmbedding(model_name=model_id)
```
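Note that FastEmbed’s `embed` returns a generator rather than a tensor, so nothing is computed until you consume it. A minimal sketch to materialize the embeddings and inspect their shape:

```python
# embed() yields numpy arrays lazily; materializing the generator with list()
# triggers the actual computation
embeddings = list(embedding_model.embed(documents))
len(embeddings), embeddings[0].shape
```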
📊 Comparison
We’ll compare the minimum, maximum, and mean embedding times across k runs. Let’s write a function to do that:
🚀 Calculating Stats
```python
import types


def calculate_time_stats(
    embed_func: Callable, documents: list, k: int
) -> tuple[float, float, float]:
    times = []
    for _ in range(k):
        # Timing the embed_func call
        start_time = time.time()
        embeddings = embed_func(documents)
        # Force computation if embed_func returns a generator
        if isinstance(embeddings, types.GeneratorType):
            list(embeddings)
        end_time = time.time()
        times.append(end_time - start_time)
    # Returning mean, max, and min time for the call
    return (sum(times) / k, max(times), min(times))


hf_stats = calculate_time_stats(hf.embed, documents, k=100)
print(f"Huggingface Transformers (Average, Max, Min): {hf_stats}")
fst_stats = calculate_time_stats(embedding_model.embed, documents, k=100)
print(f"FastEmbed (Average, Max, Min): {fst_stats}")
```
Huggingface Transformers (Average, Max, Min): (0.04711266994476318, 0.0658111572265625, 0.043084144592285156)
FastEmbed (Average, Max, Min): (0.04384247303009033, 0.05654191970825195, 0.04293417930603027)
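One caveat with the helper above: `time.time()` is a wall clock with limited resolution on some platforms. A variant using the monotonic, higher-resolution `time.perf_counter()` would look like this (a sketch with a hypothetical name; the numbers above were measured with `time.time()`):

```python
def calculate_time_stats_precise(
    embed_func: Callable, documents: list, k: int
) -> tuple[float, float, float]:
    # Same logic as calculate_time_stats, but with time.perf_counter(),
    # which is monotonic and has higher resolution than time.time()
    times = []
    for _ in range(k):
        start = time.perf_counter()
        embeddings = embed_func(documents)
        if isinstance(embeddings, types.GeneratorType):
            list(embeddings)  # force computation for generator-based embedders
        times.append(time.perf_counter() - start)
    return (sum(times) / k, max(times), min(times))
```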
📈 Results
Let’s run the comparison and see the results.
```python
def plot_character_per_second_comparison(
    hf_stats: tuple[float, float, float],
    fst_stats: tuple[float, float, float],
    documents: list,
):
    # Calculating total characters in documents
    total_characters = sum(len(doc) for doc in documents)

    # Calculating characters per second for each model
    hf_chars_per_sec = total_characters / hf_stats[0]  # Mean time is at index 0
    fst_chars_per_sec = total_characters / fst_stats[0]

    # Plotting the bar chart
    models = ["HF Embed (Torch)", "FastEmbed"]
    chars_per_sec = [hf_chars_per_sec, fst_chars_per_sec]
    bars = plt.bar(models, chars_per_sec, color=["#1f356c", "#dd1f4b"])
    plt.ylabel("Characters per Second")
    plt.title("Characters Processed per Second Comparison")

    # Adding the number at the top of each bar
    for bar, chars in zip(bars, chars_per_sec):
        plt.text(
            bar.get_x() + bar.get_width() / 2,
            bar.get_height(),
            f"{chars:.1f}",
            ha="center",
            va="bottom",
            color="#1f356c",
            fontsize=12,
        )
    plt.show()


plot_character_per_second_comparison(hf_stats, fst_stats, documents)
```
(Bar chart: characters processed per second, HF Embed (Torch) vs FastEmbed.)
Are the Embeddings the same?
This is an important question: speed is only useful if the two libraries produce the same vectors. Let’s check with cosine similarity.
```python
import numpy as np


def calculate_cosine_similarity(embeddings1: Tensor, embeddings2: Tensor) -> float:
    """Calculate cosine similarity between two sets of embeddings"""
    return F.cosine_similarity(embeddings1, embeddings2).mean().item()


# Stack the generator output into a single numpy array first: building a
# Tensor from a list of numpy arrays is much slower and raises a UserWarning
fastembed_embeddings = Tensor(np.array(list(embedding_model.embed(documents))))
calculate_cosine_similarity(hf.embed(documents), fastembed_embeddings)
```
0.9999992847442627
This indicates the embeddings from the two libraries are effectively identical: the mean cosine similarity for `BAAI/bge-small-en-v1.5` is ~0.9999. This gives us confidence that we are not sacrificing accuracy for speed.
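The mean can hide per-document outliers, so it is worth glancing at the individual similarities too. A small sketch, reusing `fastembed_embeddings` from the cell above:

```python
# Per-document cosine similarity between the two libraries' embeddings
per_doc = F.cosine_similarity(hf.embed(documents), fastembed_embeddings)
for doc, sim in zip(documents, per_doc.tolist()):
    print(f"{sim:.6f}  {doc[:60]}")
```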