Fuzzy String Matching in 30 Lines

Different behavior on Jupyter Notebook

Be aware of the following when running this tutorial in Jupyter Notebook: some Python built-in attributes, such as __file__, do not exist there. Replace __file__ with the path of any existing file on your system.

Now that you understand the fundamental concepts, let’s put them into practice and build a simple end-to-end demo.

We will use Jina to implement fuzzy search over source code: given a snippet of source code and a query, find all lines that are similar to the query. It is like grep, but in fuzzy mode.

Preliminaries

Client-Server architecture

[Figure: simple client-server architecture diagram]

Server

Character embedding

Let’s first build a simple Executor for character embedding:

    import numpy as np
    from jina import DocumentArray, Executor, requests
    class CharEmbed(Executor):  # a simple character embedding with mean-pooling
        offset = 32  # ASCII code of the space character, the first printable char
        dim = 127 - offset + 1  # last pos reserved for `UNK`
        char_embd = np.eye(dim) * 1  # one-hot embedding for all chars
        @requests
        def foo(self, docs: DocumentArray, **kwargs):
            for d in docs:
                r_emb = [ord(c) - self.offset if self.offset <= ord(c) <= 127 else (self.dim - 1) for c in d.text]
                d.embedding = self.char_embd[r_emb, :].mean(axis=0)  # average pooling
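To see what the Executor computes without running Jina, the same one-hot mean-pooling can be reproduced in plain NumPy. This is only a minimal sketch: the helper name `embed` and the module-level constants are hypothetical and not part of the Jina API.

```python
import numpy as np

OFFSET = 32                 # ASCII code of the space character
DIM = 127 - OFFSET + 1      # 96 slots; the last one doubles as `UNK`
CHAR_EMBD = np.eye(DIM)     # row i is the one-hot vector for slot i


def embed(text: str) -> np.ndarray:
    """Hypothetical helper: mean of the one-hot vectors of all characters."""
    idx = [ord(c) - OFFSET if OFFSET <= ord(c) <= 127 else DIM - 1 for c in text]
    return CHAR_EMBD[idx, :].mean(axis=0)


vec = embed('aab')
print(vec.shape)                 # (96,)
print(vec[ord('a') - OFFSET])    # share of 'a' in the text
print(vec[ord('b') - OFFSET])    # share of 'b' in the text
```

The result is a normalized character histogram, so character order is ignored: under this sketch, `embed('abc')` and `embed('cba')` are identical. That is what makes the search fuzzy, and also why it cannot tell anagrams apart.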

Indexer with Euclidean distance

    from jina import DocumentArray, Executor, requests
    class Indexer(Executor):
        _docs = DocumentArray()  # for storing all documents in memory
        @requests(on='/index')
        def foo(self, docs: DocumentArray, **kwargs):
            self._docs.extend(docs)  # extend stored `docs`
        @requests(on='/search')
        def bar(self, docs: DocumentArray, **kwargs):
            docs.match(self._docs, metric='euclidean', limit=20)
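Conceptually, the search step ranks the stored Documents by the Euclidean distance between their embeddings and the query embedding, and keeps the closest ones. The idea can be sketched in NumPy as a brute-force scan. The helper `top_k_euclidean` is hypothetical, and this is an illustration of the ranking, not Jina's actual implementation.

```python
import numpy as np


def top_k_euclidean(query: np.ndarray, index: np.ndarray, k: int = 20):
    """Hypothetical helper: return (positions, distances) of the k nearest rows."""
    dists = np.linalg.norm(index - query, axis=1)  # distance to every stored row
    order = np.argsort(dists)[:k]                  # smallest distance first
    return order, dists[order]


index = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])  # three stored embeddings
query = np.array([1.0, 0.0])

positions, distances = top_k_euclidean(query, index, k=2)
print(positions)   # [1 0]
print(distances)   # [0. 1.]
```

A smaller distance means a closer match, which is why the best results appear first with the lowest scores.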

Put it together in a Flow

    from jina import Flow
    f = (Flow(port_expose=12345, protocol='http', cors=True)
         .add(uses=CharEmbed, replicas=2)
         .add(uses=Indexer))  # build a Flow with 2 replicas of CharEmbed, though unnecessary

Start the Flow and index data

    from jina import Document
    with f:
        f.post('/index', (Document(text=t.strip()) for t in open(__file__) if t.strip()))  # index all lines of _this_ file
        f.block()  # block and listen for requests

Caution

open(__file__) opens the current file and uses it for indexing. Note that in some environments, such as Jupyter Notebook and Google Colab, __file__ is not defined. In that case, replace it with a path to an existing file, for example open('my-source-code.py').
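One way to make the indexing step work in both settings is to fall back to an explicit path when `__file__` is missing. The sketch below assumes a file named `my-source-code.py` exists as the fallback; the helpers `source_path` and `non_empty_lines` are hypothetical names introduced for illustration.

```python
from pathlib import Path


def source_path(fallback: str = 'my-source-code.py') -> Path:
    """Return this script's path, or `fallback` where `__file__` is undefined."""
    try:
        return Path(__file__)   # defined when run as a regular script
    except NameError:
        return Path(fallback)   # e.g. Jupyter Notebook or Google Colab


def non_empty_lines(path: Path):
    """Yield stripped, non-empty lines, mirroring the filter used for indexing."""
    with open(path) as fp:
        for line in fp:
            if line.strip():
                yield line.strip()
```

With these helpers, the generator passed to `f.post('/index', ...)` could be written as `(Document(text=t) for t in non_empty_lines(source_path()))`.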

Query via Swagger UI

Open http://localhost:12345/docs (an extended Swagger UI) in your browser, click the /search tab, and enter:

    {
      "data": [
        {
          "text": "@requests(on=something)"
        }
      ]
    }

That means **we want to find the lines from the above code snippet that are most similar to @requests(on=something).** Now click the Execute button!

[Figure: animation of the query being executed in Swagger UI]
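The same request can presumably be sent without a browser. The sketch below builds it with Python's standard library, under the assumption that the /search tab corresponds to a POST to http://localhost:12345/search that accepts the JSON body shown above. The helper `send` is a hypothetical name, and the shape of the response is not specified here.

```python
import json
import urllib.request

payload = {'data': [{'text': '@requests(on=something)'}]}

req = urllib.request.Request(
    'http://localhost:12345/search',           # assumed endpoint, see lead-in
    data=json.dumps(payload).encode('utf-8'),  # JSON body as bytes
    headers={'Content-Type': 'application/json'},
    method='POST',
)


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response; needs the Flow running."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode('utf-8'))


# result = send(req)  # uncomment while the server from the previous step is up
```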

Query from Python

Let’s do it in Python then! Keep the above server running and start a simple client:

    from jina import Client, Document
    from jina.types.request import Response
    def print_matches(resp: Response):  # the callback function invoked when task is done
        for idx, d in enumerate(resp.docs[0].matches[:3]):  # print top-3 matches
            print(f'[{idx}]{d.scores["euclidean"].value:2f}: "{d.text}"')
    c = Client(protocol='http', port=12345)  # connect to localhost:12345
    c.post('/search', Document(text='request(on=something)'), on_done=print_matches)

This prints the following results:

    [email protected][S]:connected to the gateway at localhost:12345!
    [0]0.168526: "@requests(on='/index')"
    [1]0.181676: "@requests(on='/search')"
    [2]0.218218: "from jina import Document, DocumentArray, Executor, Flow, requests"