Filter Table Rows by Attributes

Tip

Find the full source code and run TagsHasher on Jina Hub.

Big news: you can use Jina to filter table rows by their attributes! An amazing feature that has only existed since… 47 years ago, aka SQL! Jina, as a neural search framework, surely won't implement a SQL database from scratch. The question here is: can we leverage what we learned about neural search (embedding, indexing, nearest-neighbour matching) to enable SQL-like features such as filter and select?

Yes! Jina can do this. In this article, I will walk you through filtering tabular data with Jina and without SQL (also no GPT-3). Let's call this mini-project neuretro-SQL.

Feature hashing

The first thing you want to learn is feature hashing. I already gave a tutorial on it here. I strongly recommend reading that first before continuing.

In general, feature hashing is a great way to embed an unbounded number of features into fixed-size vectors. We will leverage the same idea here to embed the columns of the tabular data into fixed-size vectors.
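To make the idea concrete, here is a minimal, self-contained sketch of feature hashing (the function `hash_features` is hypothetical, not a Jina API): each feature name is hashed to a column index, so any number of named features fits into a vector of fixed length `n_dim`.

```python
import hashlib


def hash_features(features: dict, n_dim: int = 8) -> list:
    """Map an arbitrary set of named features into a fixed n_dim-sized vector."""
    vec = [0] * n_dim
    for name, value in features.items():
        # hash the feature *name* to decide which dimension it lands in
        col = int(hashlib.md5(name.encode('utf-8')).hexdigest(), base=16) % n_dim
        vec[col] += value
    return vec


v = hash_features({'Length': 94, 'Popularity': 14, 'Year': 1978})
print(len(v))  # 8, regardless of how many features we pass in
```

Collisions (two names landing on the same column) are possible but tolerable; a larger `n_dim` makes them rarer.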

Load CSV as DocumentArray

Let’s look at an example CSV file. Here I use a film dataset that looks like the following:

(Screenshot: film-dataset.png, a preview of the film CSV.)

Let’s load the data from the web and put them into a DocumentArray:

    import io
    from jina import Document, DocumentArray
    from jina.types.document.generators import from_csv

    # Load some online CSV file dataset
    src = Document(
        uri='https://perso.telecom-paristech.fr/eagan/class/igr204/data/film.csv'
    ).load_uri_to_text('iso8859')
    da = DocumentArray(from_csv(io.StringIO(src.text), dialect='auto'))

    <jina.types.arrays.document.DocumentArray (length=1660) at 5697295312>

Here we use the Document API to download the data, convert it into the right charset, and load it via our CSV API as a DocumentArray.

Looks like we got 1660 Documents in total. Let's take one sample and have a look:

    print(da[5].json())

    {
      "id": "16a9745c-3d99-11ec-a97f-1e008a366d49",
      "tags": {
        "*Image": "NicholasCage.png",
        "Actor": "Gere, Richard",
        "Actress": "Adams, Brooke",
        "Awards": "No",
        "Director": "Malick, Terrence",
        "Length": "94",
        "Popularity": "14",
        "Subject": "Drama",
        "Title": "Days of Heaven",
        "Year": "1978"
      }
    }

This Document has two non-empty attributes, id and tags, and all values in tags correspond to the column values in the CSV data. Now our task is clear: we want to filter Documents from this DocumentArray according to their values in .tags, but with no SQL: pure Jina, pure neural search.

Embed columns as vectors

To embed columns into vectors, we first notice that each “column-item” in .tags is actually a Tuple[str, Any] pair. The first part, a string, represents the column title, e.g. “Actor”, “Actress”, “Director”. We can simply reuse our previous hash function:

    import hashlib

    h = lambda x: int(hashlib.md5(str(x).encode('utf-8')).hexdigest(), base=16) % 256

    h('Actor')     # 163
    h('Director')  # 111
    h('Length')    # 117

Now that we have indices, the actual value at that index, namely the Any part of the Tuple[str, Any] pair, needs some extra thought. First, some values are numbers, e.g. integers or floats; these are good hashes by themselves (equal numbers are semantically identical, so 1996 can simply hash to 1996 with no collision) and need no extra hash function. Boolean values are the same: 0 and 1 are representative enough. Strings can be handled in the same way as above. What about lists, tuples and dicts? We can serialize them into JSON strings and then apply our string hash. The final hash function looks like the following:

    import json  # needed for serializing lists/tuples/dicts below

    def _any_hash(self, v):
        try:
            return int(v)  # parse int parameter
        except (ValueError, TypeError):  # TypeError: v is a list/tuple/dict
            try:
                return float(v)  # parse float parameter
            except (ValueError, TypeError):
                if not v:
                    # ignore it when the parameter is empty
                    return 0
                if isinstance(v, str):
                    v = v.strip()
                    if v.lower() in {'true', 'yes'}:  # parse boolean parameter
                        return 1
                    if v.lower() in {'false', 'no'}:
                        return 0
                if isinstance(v, (tuple, dict, list)):
                    v = json.dumps(v, sort_keys=True)
                return int(self.hash(str(v).encode('utf-8')).hexdigest(), base=16)

If you apply this directly, you will get extremely big integers as embedding values. So big that you don't even want to look at them or store them (for numerical stability reasons). So we need to bound them. Remember that in the full-text feature hashing example, we introduced n_dim to "horizontally" bound the dimensions of the embedding vector. We can follow the same spirit and introduce another variable max_val to "vertically" bound the values of the vector:

    import hashlib

    from jina import Executor

    class TagsHasher(Executor):
        def __init__(self, n_dim: int = 256, max_val: int = 65536, sparse: bool = False, **kwargs):
            super().__init__(**kwargs)
            self.n_dim = n_dim
            self.max_val = max_val
            self.hash = hashlib.md5
            self.sparse = sparse

Here we give a larger number to max_val than to n_dim. This is because the likelihood of a collision in the vertical direction is in general much higher than in the horizontal direction (otherwise, it would imply there are more variants in the column names than in the column values, which would suggest the table-maker simply "transpose" the whole table for better readability).
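To see both bounds in action, here is a small standalone snippet (the helper `h` mimics the MD5-based integer hash above; the variable names are mine, not part of TagsHasher): the hashed column name is bounded horizontally by n_dim, while the hashed column value is bounded vertically by max_val.

```python
import hashlib


def h(x) -> int:
    # same MD5-based integer hash as used throughout this article
    return int(hashlib.md5(str(x).encode('utf-8')).hexdigest(), base=16)


n_dim, max_val = 256, 65536

col = h('Year') % n_dim      # horizontal bound: which dimension the column occupies
val = h('Drama') % max_val   # vertical bound: the value stored in that dimension

print(0 <= col < n_dim, 0 <= val < max_val)  # True True
```

The raw hash `h(x)` is a 128-bit integer; taking it modulo max_val keeps the stored values in a numerically comfortable range while still spreading distinct column values apart.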

The final embedding procedure is then very simple:

    import numpy as np

    from jina import requests

    # inside TagsHasher:
    @requests
    def encode(self, docs: DocumentArray, **kwargs):
        if self.sparse:
            from scipy.sparse import csr_matrix
        for idx, doc in enumerate(docs):
            if doc.tags:
                idxs, data = [], []  # sparse
                table = np.zeros(self.n_dim)  # dense
                for k, v in doc.tags.items():
                    h = self._any_hash(k)
                    sign_h = np.sign(h)
                    col = h % self.n_dim
                    val = self._any_hash(v)
                    sign_v = np.sign(val)
                    val = val % self.max_val
                    idxs.append((0, col))
                    val = sign_h * sign_v * val
                    data.append(val)
                    table[col] += val
                if self.sparse:
                    doc.embedding = csr_matrix(
                        (data, zip(*idxs)), shape=(1, self.n_dim)
                    )
                else:
                    doc.embedding = table

Put all together

Let’s encode our loaded DocumentArray:

    from jina import Executor

    th = Executor.load_config('jinahub://TagsHasher')
    th.encode(da)

Now let’s build some filters as Document:

    filters = [
        {"Subject": "Comedy"},
        {"Year": 1987},
        {"Subject": "Comedy", "Year": 1987}
    ]
    qa = DocumentArray([Document(tags=f) for f in filters])

Encode the filters with TagsHasher to get their embeddings:

    th.encode(qa)

Now that we have embeddings for both indexed docs da (i.e. our film CSV table), and the query docs qa (our filters), we can use .match function to find nearest neighbours.

    qa.match(da, limit=5, exclude_self=True, metric='jaccard', use_scipy=True)

Note that here I use the Jaccard distance instead of the cosine distance. This is because the closeness of values on each feature is meaningless, as each value is the result of a hash function; whereas in FeatureHasher's example, the value represents the term frequency of a word, so it was meaningful there. This needs to be kept in mind when using TagsHasher.
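A small illustration of why this matters (the vectors below are made up, and the two helpers are illustrative re-implementations of the distances, not Jina APIs): two tag embeddings that occupy the same dimensions but carry wildly different hashed values. Jaccard on the nonzero patterns treats them as identical, while cosine distance is dominated by the arbitrary hashed magnitudes.

```python
import math

# Hypothetical embeddings: same tag columns occupied, very different hashed values
a = [52113.0, 0.0, 7.0, 0.0]
b = [3.0, 0.0, 60001.0, 0.0]


def jaccard_bool(u, v):
    # Jaccard distance on the nonzero patterns: fraction of occupied
    # dimensions where only one of the two vectors is nonzero
    union = sum(1 for x, y in zip(u, v) if x != 0 or y != 0)
    diff = sum(1 for x, y in zip(u, v) if (x != 0) != (y != 0))
    return diff / union


def cosine_dist(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return 1 - dot / (math.hypot(*u) * math.hypot(*v))


print(jaccard_bool(a, b))  # 0.0: same columns occupied, a perfect match
print(cosine_dist(a, b))   # close to 1: misled by the meaningless magnitudes
```

Because hashed values only answer "do these two Documents fill the same columns with the same hashes?", a set-overlap metric like Jaccard is the right fit here.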

Finally, let’s see some results. Here I only print top-5 matches.

    for d in qa:
        print('my filter is:', d.tags.json())
        for m in d.matches:
            print(m.tags.json())

"Subject": "Comedy"

    {
      "*Image": "NicholasCage.png",
      "Actor": "Chase, Chevy",
      "Actress": "",
      "Awards": "No",
      "Director": "",
      "Length": "",
      "Popularity": "82",
      "Subject": "Comedy",
      "Title": "Valkenvania",
      "Year": "1990"
    }
    {
      "*Image": "paulNewman.png",
      "Actor": "Newman, Paul",
      "Actress": "",
      "Awards": "No",
      "Director": "",
      "Length": "",
      "Popularity": "28",
      "Subject": "Comedy",
      "Title": "Secret War of Harry Frigg, The",
      "Year": "1968"
    }
    {
      "*Image": "NicholasCage.png",
      "Actor": "Murphy, Eddie",
      "Actress": "",
      "Awards": "No",
      "Director": "",
      "Length": "",
      "Popularity": "56",
      "Subject": "Comedy",
      "Title": "Best of Eddie Murphy, Saturday Night Live, The",
      "Year": "1989"
    }
    {
      "*Image": "NicholasCage.png",
      "Actor": "Mastroianni, Marcello",
      "Actress": "",
      "Awards": "No",
      "Director": "Fellini, Federico",
      "Length": "",
      "Popularity": "29",
      "Subject": "Comedy",
      "Title": "Ginger & Fred",
      "Year": "1993"
    }
    {
      "*Image": "NicholasCage.png",
      "Actor": "Piscopo, Joe",
      "Actress": "",
      "Awards": "No",
      "Director": "",
      "Length": "60",
      "Popularity": "14",
      "Subject": "Comedy",
      "Title": "Joe Piscopo New Jersey Special",
      "Year": "1987"
    }

"Year": 1987.0

    {
      "*Image": "NicholasCage.png",
      "Actor": "",
      "Actress": "Madonna",
      "Awards": "No",
      "Director": "",
      "Length": "50",
      "Popularity": "75",
      "Subject": "Music",
      "Title": "Madonna Live, The Virgin Tour",
      "Year": "1987"
    }
    {
      "*Image": "NicholasCage.png",
      "Actor": "Piscopo, Joe",
      "Actress": "",
      "Awards": "No",
      "Director": "",
      "Length": "60",
      "Popularity": "14",
      "Subject": "Comedy",
      "Title": "Joe Piscopo New Jersey Special",
      "Year": "1987"
    }
    {
      "*Image": "NicholasCage.png",
      "Actor": "Everett, Rupert",
      "Actress": "",
      "Awards": "No",
      "Director": "",
      "Length": "95",
      "Popularity": "25",
      "Subject": "Drama",
      "Title": "Hearts of Fire",
      "Year": "1987"
    }
    {
      "*Image": "NicholasCage.png",
      "Actor": "Lambert, Christopher",
      "Actress": "Sukowa, Barbara",
      "Awards": "No",
      "Director": "Cimino, Michael",
      "Length": "",
      "Popularity": "41",
      "Subject": "Drama",
      "Title": "Sicilian, The",
      "Year": "1987"
    }
    {
      "*Image": "NicholasCage.png",
      "Actor": "Hubley, Whip",
      "Actress": "",
      "Awards": "No",
      "Director": "Rosenthal, Rick",
      "Length": "98",
      "Popularity": "87",
      "Subject": "Action",
      "Title": "Russkies",
      "Year": "1987"
    }

{"Subject": "Comedy", "Year": 1987.0}

    {
      "*Image": "NicholasCage.png",
      "Actor": "Piscopo, Joe",
      "Actress": "",
      "Awards": "No",
      "Director": "",
      "Length": "60",
      "Popularity": "14",
      "Subject": "Comedy",
      "Title": "Joe Piscopo New Jersey Special",
      "Year": "1987"
    }
    {
      "*Image": "NicholasCage.png",
      "Actor": "Murphy, Eddie",
      "Actress": "",
      "Awards": "No",
      "Director": "Murphy, Eddie",
      "Length": "90",
      "Popularity": "51",
      "Subject": "Comedy",
      "Title": "Eddie Murphy Raw",
      "Year": "1987"
    }
    {
      "*Image": "NicholasCage.png",
      "Actor": "McCarthy, Andrew",
      "Actress": "Cattrall, Kim",
      "Awards": "No",
      "Director": "Gottlieb, Michael",
      "Length": "",
      "Popularity": "23",
      "Subject": "Comedy",
      "Title": "Mannequin",
      "Year": "1987"
    }
    {
      "*Image": "NicholasCage.png",
      "Actor": "Williams, Robin",
      "Actress": "",
      "Awards": "No",
      "Director": "Levinson, Barry",
      "Length": "120",
      "Popularity": "37",
      "Subject": "Comedy",
      "Title": "Good Morning, Vietnam",
      "Year": "1987"
    }
    {
      "*Image": "NicholasCage.png",
      "Actor": "Boys, The Fat",
      "Actress": "",
      "Awards": "No",
      "Director": "Schultz, Michael",
      "Length": "86",
      "Popularity": "69",
      "Subject": "Comedy",
      "Title": "Disorderlies",
      "Year": "1987"
    }

Not bad!