DocumentArray

A DocumentArray is a list of Document objects. You can construct, delete, insert, sort and traverse a DocumentArray like a Python list; it implements the full Python list interface.
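
For example, the usual list operations behave as you would expect. A quick sketch:

    from jina import DocumentArray, Document

    da = DocumentArray([Document(text='hello'), Document(text='world')])
    da.append(Document(text='goodbye'))  # append like a list
    da.insert(0, Document(text='hi'))    # insert at an index
    del da[0]                            # delete by index
    da.reverse()                         # reverse in-place
    print(len(da))                       # 3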

Hint

We also provide a memory-efficient version of DocumentArray called DocumentArrayMemmap. It shares almost the same API as DocumentArray, which means you can easily use it as a drop-in replacement when your data is big. You can find more about it here.
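
As a rough sketch of the drop-in usage (assuming, per its own documentation, that DocumentArrayMemmap is constructed from an on-disk folder path; './my_memmap' here is a hypothetical path):

    from jina import DocumentArray, DocumentArrayMemmap

    da = DocumentArray.empty(1000)
    dam = DocumentArrayMemmap('./my_memmap')  # data lives on disk, loaded on demand
    dam.extend(da)                            # same list-like API as DocumentArray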

Construct

You can construct a DocumentArray in different ways:

From empty Documents

    from jina import DocumentArray

    da = DocumentArray.empty(10)

From list of Documents

    from jina import DocumentArray, Document

    da = DocumentArray([Document(...), Document(...)])

From generator

    from jina import DocumentArray, Document

    da = DocumentArray((Document(...) for _ in range(10)))

From another DocumentArray

    from jina import DocumentArray, Document

    da = DocumentArray((Document() for _ in range(10)))
    da1 = DocumentArray(da)

From JSON, CSV, ndarray, files, …

You can find more details about those APIs in FromGeneratorMixin.

    da = DocumentArray.from_ndjson(...)
    da = DocumentArray.from_csv(...)
    da = DocumentArray.from_files(...)
    da = DocumentArray.from_lines(...)
    da = DocumentArray.from_ndarray(...)

Access elements

Like a Python list or dict, elements in a DocumentArray can be accessed via an integer index, a string id or a slice:

    from jina import DocumentArray, Document

    da = DocumentArray([Document(id='hello'), Document(id='world'), Document(id='goodbye')])
    da[0]
    da[1:2]
    da['world']

Output:

    <jina.types.document.Document id=hello at 5699749904>
    <jina.types.arrays.document.DocumentArray length=1 at 5705863632>
    <jina.types.document.Document id=world at 5736614992>

Tip

To access Documents with nested Documents, please refer to Traverse nested elements.

Bulk access contents

You can quickly access the .text, .blob, .buffer and .embedding of all Documents in a DocumentArray without writing a for-loop.

DocumentArray provides the plural counterparts, i.e. .texts, .buffers, .blobs and .embeddings, that allow you to get and set these properties in one shot. This is much more efficient than looping.

    from jina import DocumentArray

    da = DocumentArray.empty(2)
    da.texts = ['hello', 'world']
    print(da[0], da[1])

Output:

    <jina.types.document.Document ('id', 'text') at 4520833232>
    <jina.types.document.Document ('id', 'text') at 5763350672>

When accessing .blobs or .embeddings, the ndarray is automatically raveled/unraveled for you (it can be a NumPy/TensorFlow/PyTorch/SciPy/PaddlePaddle array).

    import numpy as np
    import scipy.sparse
    from jina import DocumentArray

    sp_embed = np.random.random([10, 256])
    sp_embed[sp_embed > 0.1] = 0
    sp_embed = scipy.sparse.coo_matrix(sp_embed)  # a sparse 10x256 matrix

    da = DocumentArray.empty(10)
    da.embeddings = sp_embed
    print('da.embeddings.shape=', da.embeddings.shape)

    for d in da:
        print('d.embedding.shape=', d.embedding.shape)

Output:

    da.embeddings.shape= (10, 256)
    d.embedding.shape= (1, 256)
    d.embedding.shape= (1, 256)
    d.embedding.shape= (1, 256)
    d.embedding.shape= (1, 256)
    d.embedding.shape= (1, 256)
    d.embedding.shape= (1, 256)
    d.embedding.shape= (1, 256)
    d.embedding.shape= (1, 256)
    d.embedding.shape= (1, 256)
    d.embedding.shape= (1, 256)

Bulk access to attributes

get_attributes() lets you fetch multiple attributes from the Documents in one shot:

    import numpy as np
    from jina import DocumentArray, Document

    da = DocumentArray([Document(id=1, text='hello', embedding=np.array([1, 2, 3])),
                        Document(id=2, text='goodbye', embedding=np.array([4, 5, 6])),
                        Document(id=3, text='world', embedding=np.array([7, 8, 9]))])
    da.get_attributes('id', 'text', 'embedding')

Output:

    [('1', '2', '3'), ('hello', 'goodbye', 'world'), (array([1, 2, 3]), array([4, 5, 6]), array([7, 8, 9]))]

Import/Export

DocumentArray provides the following methods for importing from/exporting to different formats.

Description | Export Method | Import Method
LZ4-compressed binary string/file | .to_bytes() (or bytes(…) for a more Pythonic way), .save_binary() | .load_binary()
JSON string/file | .to_json(), .save_json() | .load_json(), .from_ndjson()
CSV file | .save_csv() | .load_csv(), .from_lines(), .from_csv()
pandas.DataFrame object | .to_dataframe() | .from_dataframe()
Local files | | .from_files()
numpy.ndarray object | | .from_ndarray()
Jina Cloud Storage (experimental) | .push() | .pull()

See also

The .from_*() functions often utilize generators. When used independently, they can be more memory-efficient. See generators.
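
As a sketch of the memory-efficient, generator-based usage (the import path below is an assumption based on the FromGeneratorMixin docs; check the linked API reference):

    # hedged sketch: assumes the standalone generator helpers live in
    # jina.types.document.generators
    from jina.types.document.generators import from_files

    for d in from_files('*.jpg'):  # yields one Document at a time, lazily
        ...                        # process each Document without building the full array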

Sharing DocumentArray across machines

Caution

This is an experimental feature introduced in Jina 2.5.4. The behavior of this feature might change in the future.

Since Jina 2.5.4 we have introduced a new IO feature: push() and pull(), which allows you to share a DocumentArray object across machines.

Consider you are working on a GPU machine via Google Colab/Jupyter. After preprocessing and embedding, you have everything you need in a DocumentArray. You can easily transfer it to your local laptop via:

    from jina import DocumentArray

    da = DocumentArray(...)  # heavy lifting, processing, GPU tasks, ...
    da.push(token='myda123')

Then on your local laptop, simply:

    from jina import DocumentArray

    da = DocumentArray.pull(token='myda123')

Now you can continue the work locally, analyzing da or visualizing it. Your friends and colleagues who know the token myda123 can also pull that DocumentArray, which makes this handy for quickly sharing results.

For more information about this feature, please refer to PushPullMixin.

Danger

The lifetime of the storage is not guaranteed at the moment: it could be a day, it could be a week. Do not use it for persistence in production. Consider it only as a temporary transmission channel or a clipboard.

Embed via model

Important

The embed() function supports both CPU and GPU, which can be specified by its device argument.

Important

You can use a PyTorch, Keras, ONNX or PaddlePaddle model as the embedding model.

When a DocumentArray has .blobs set, you can use a deep neural network to embed() it, which means filling DocumentArray.embeddings. For example, our DocumentArray looks like the following:

    from jina import DocumentArray
    import numpy as np

    docs = DocumentArray.empty(10)
    docs.blobs = np.random.random([10, 128]).astype(np.float32)

And our embedding model is a simple MLP in PyTorch/Keras/ONNX/Paddle:

PyTorch

    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(in_features=128, out_features=128),
        torch.nn.ReLU(),
        torch.nn.Linear(in_features=128, out_features=32),
    )

Keras

    import tensorflow as tf

    model = tf.keras.Sequential(
        [
            tf.keras.layers.Dense(128, activation='relu'),
            tf.keras.layers.Dense(32),
        ]
    )

ONNX

Preliminary: you first need to export a DNN model to ONNX via the API/CLI. For example, let's use the PyTorch one:

    data = torch.rand(1, 128)
    torch.onnx.export(
        model, data, 'mlp.onnx',
        do_constant_folding=True,  # whether to execute constant folding for optimization
        input_names=['input'],     # the model's input names
        output_names=['output'],   # the model's output names
        dynamic_axes={
            'input': {0: 'batch_size'},  # variable-length axes
            'output': {0: 'batch_size'},
        },
    )

Then load it as an InferenceSession:

    import onnxruntime

    model = onnxruntime.InferenceSession('mlp.onnx')

Paddle

    import paddle

    model = paddle.nn.Sequential(
        paddle.nn.Linear(in_features=128, out_features=128),
        paddle.nn.ReLU(),
        paddle.nn.Linear(in_features=128, out_features=32),
    )

Now, you can simply do:

    docs.embed(model)
    print(docs.embeddings)

Output (truncated):

    tensor([[-0.1234,  0.0506, -0.0015,  0.1154, -0.1630, -0.2376,  0.0576, -0.4109,
              0.0052,  0.0027,  0.0800, -0.0928,  0.1326, -0.2256,  0.1649, -0.0435,
             -0.2312, -0.0068, -0.0991,  0.0767, -0.0501, -0.1393,  0.0965, -0.2062,
    ...

Hint

By default, .embeddings is in the model framework's format. If you want it to always be a numpy.ndarray, use .embed(..., to_numpy=True).

You can also use a pretrained model for embedding:

    import torchvision

    model = torchvision.models.resnet50(pretrained=True)
    docs.embed(model)

You can also visualize .embeddings using the Embedding Projector; find more details here.

Hint

On a large DocumentArray, you can set batch_size via .embed(..., batch_size=128).

Find nearest neighbours

Important

The match() function supports both CPU and GPU, which can be specified by its device argument.

Once embeddings are set, one can use the match() function to find the nearest-neighbour Documents from another DocumentArray based on their .embeddings.

The following image visualizes how DocumentArrayA finds limit=5 matches from the Documents in DocumentArrayB. By default, the cosine similarity is used to evaluate the score between Documents.

[Image: match illustration]

More generally, given two DocumentArray objects da_1 and da_2, the function da_1.match(da_2, metric=some_metric, normalization=(0, 1), limit=N) finds, for each Document in da_1, the N Documents from da_2 with the lowest metric values according to some_metric.

Note that,

  • da_1.embeddings and da_2.embeddings can be a NumPy ndarray, SciPy sparse matrix, TensorFlow tensor, PyTorch tensor or Paddle tensor.

  • metric can be 'cosine', 'euclidean', 'sqeuclidean' or a callable that takes two ndarray parameters and returns an ndarray.

  • by default, .match() returns distance, not similarity. One can use normalization to do min-max normalization: the min distance will be rescaled to a, the max distance will be rescaled to b, and all other values will be rescaled into the range [a, b]. For example, to convert the distance into a [0, 1] score, one can use .match(normalization=(1, 0)), as shown in the sketch after this list.

  • limit represents the number of nearest neighbours.
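
For instance, here is a minimal sketch of the normalization described above, converting Euclidean distances into [0, 1] similarity-like scores (the random embeddings are just for illustration):

    import numpy as np
    from jina import DocumentArray

    da1 = DocumentArray.empty(2)
    da1.embeddings = np.random.random([2, 5])
    da2 = DocumentArray.empty(10)
    da2.embeddings = np.random.random([10, 5])

    # normalization=(1, 0): min distance -> 1, max distance -> 0
    da1.match(da2, metric='euclidean', normalization=(1, 0), limit=3)
    for m in da1[0].matches:
        print(m.scores['euclidean'].value)  # values now lie in [0, 1], higher = closer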

The following example finds for each element in da1 the three closest Documents from the elements in da2 according to Euclidean distance.

Dense embedding

    import numpy as np
    from jina import DocumentArray

    da1 = DocumentArray.empty(4)
    da1.embeddings = np.array(
        [[0, 0, 0, 0, 1], [1, 0, 0, 0, 0], [1, 1, 1, 1, 0], [1, 2, 2, 1, 0]]
    )

    da2 = DocumentArray.empty(5)
    da2.embeddings = np.array(
        [
            [0.0, 0.1, 0.0, 0.0, 0.0],
            [1.0, 0.1, 0.0, 0.0, 0.0],
            [1.0, 1.2, 1.0, 1.0, 0.0],
            [1.0, 2.2, 2.0, 1.0, 0.0],
            [4.0, 5.2, 2.0, 1.0, 0.0],
        ]
    )

    da1.match(da2, metric='euclidean', limit=3)

    query = da1[2]
    print(f'query emb = {query.embedding}')
    for m in query.matches:
        print('match emb =', m.embedding, 'score =', m.scores['euclidean'].value)

Output:

    query emb = [1 1 1 1 0]
    match emb = [1.  1.2 1.  1.  0. ] score = 0.20000000298023224
    match emb = [1.  2.2 2.  1.  0. ] score = 1.5620499849319458
    match emb = [1.  0.1 0.  0.  0. ] score = 1.6763054132461548

Sparse embedding

    import numpy as np
    import scipy.sparse as sp
    from jina import DocumentArray

    da1 = DocumentArray.empty(4)
    da1.embeddings = sp.csr_matrix(np.array(
        [[0, 0, 0, 0, 1], [1, 0, 0, 0, 0], [1, 1, 1, 1, 0], [1, 2, 2, 1, 0]]
    ))

    da2 = DocumentArray.empty(5)
    da2.embeddings = sp.csr_matrix(np.array(
        [
            [0.0, 0.1, 0.0, 0.0, 0.0],
            [1.0, 0.1, 0.0, 0.0, 0.0],
            [1.0, 1.2, 1.0, 1.0, 0.0],
            [1.0, 2.2, 2.0, 1.0, 0.0],
            [4.0, 5.2, 2.0, 1.0, 0.0],
        ]
    ))

    da1.match(da2, metric='euclidean', limit=3)

    query = da1[2]
    print(f'query emb = {query.embedding}')
    for m in query.matches:
        print('match emb =', m.embedding, 'score =', m.scores['euclidean'].value)

Output:

    query emb =   (0, 0)	1
      (0, 1)	1
      (0, 2)	1
      (0, 3)	1
    match emb =   (0, 0)	1.0
      (0, 1)	1.2
      (0, 2)	1.0
      (0, 3)	1.0  score = 0.20000000298023224
    match emb =   (0, 0)	1.0
      (0, 1)	2.2
      (0, 2)	2.0
      (0, 3)	1.0  score = 1.5620499849319458
    match emb =   (0, 0)	1.0
      (0, 1)	0.1  score = 1.6763054132461548

Keep only ID

By default, A.match(B) copies the top-K matched Documents from B to A.matches. When these matches are big, copying them can be time-consuming. In this case, one can leverage .match(..., only_id=True) to keep only the id:

    from jina import DocumentArray
    import numpy as np

    A = DocumentArray.empty(2)
    A.texts = ['hello', 'world']
    A.embeddings = np.random.random([2, 10])

    B = DocumentArray.empty(3)
    B.texts = ['long-doc1', 'long-doc2', 'long-doc3']
    B.embeddings = np.random.random([3, 10])

Only ID

    A.match(B, only_id=True)
    for m in A.traverse_flat('m'):
        print(m.json())

Output:

    {
      "adjacency": 1,
      "id": "4a8ad5fe4f9b11ec90e61e008a366d48",
      "scores": {
        "cosine": {
          "value": 0.08097544
        }
      }
    }
    ...

Default (keep all attributes)

    A.match(B)
    for m in A.traverse_flat('m'):
        print(m.json())

Output:

    {
      "adjacency": 1,
      "embedding": {
        "cls_name": "numpy",
        "dense": {
          "buffer": "csxkKGfE7T+/JUBkNzHiP3Lx96W4SdE/SVXrOxYv7T9Fmb+pp3rvP8YdsjGsXuw/CNbxUQ7v2j81AjCpbfjrP6g5iPB9hL4/PHljbxPi1D8=",
          "dtype": "<f8",
          "shape": [
            10
          ]
        }
      },
      "id": "9078d1ec4f9b11eca9141e008a366d48",
      "scores": {
        "cosine": {
          "value": 0.15957883
        }
      },
      "text": "long-doc1"
    }
    ...

GPU support

If .embeddings is a TensorFlow tensor, PyTorch tensor or Paddle tensor, the .match() function can work directly on the GPU. To do that, simply set device='cuda'. For example:

    from jina import DocumentArray
    import numpy as np
    import torch

    da1 = DocumentArray.empty(10)
    da1.embeddings = torch.tensor(np.random.random([10, 256]))
    da2 = DocumentArray.empty(10)
    da2.embeddings = torch.tensor(np.random.random([10, 256]))

    da1.match(da2, device='cuda')

Tip

When a DocumentArray/DocumentArrayMemmap contains too many Documents to fit into GPU memory, one can set batch_size to alleviate the problem of OOM on the GPU.

    da1.match(da2, device='cuda', batch_size=256)

Let’s do a simple benchmark on CPU vs. GPU .match():

    from jina import DocumentArray

    Q = 10
    M = 1_000_000
    D = 768

    da1 = DocumentArray.empty(Q)
    da2 = DocumentArray.empty(M)

on CPU via Numpy

    import numpy as np

    da1.embeddings = np.random.random([Q, D]).astype(np.float32)
    da2.embeddings = np.random.random([M, D]).astype(np.float32)

    %timeit da1.match(da2, only_id=True)

Output:

    6.18 s ± 7.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

on GPU via PyTorch

    import torch

    da1.embeddings = torch.tensor(np.random.random([Q, D]).astype(np.float32))
    da2.embeddings = torch.tensor(np.random.random([M, D]).astype(np.float32))

    %timeit da1.match(da2, device='cuda', batch_size=1_000, only_id=True)

Output:

    3.97 s ± 6.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note that in the above GPU example we did a conversion. In practice there is no need for it: .embedding/.blob, as well as their bulk versions .embeddings/.blobs, can store PyTorch/TensorFlow/Paddle/SciPy tensors natively. That is, in practice you just assign the result directly into .embeddings in your Encoder via:

    da.embeddings = torch_model(da.blobs)  # <- no .numpy() is necessary

and then just use .match(da).

Evaluate matches

You can easily evaluate the performance of matches via evaluate(), provided that you have the ground truth of the matches.

Jina provides some common metrics used in the information-retrieval community that allow one to evaluate nearest-neighbour matches, including precision, recall, R-precision, hit rate, NDCG, etc. The full list of functions can be found in evaluation.

For example, let's create a DocumentArray with random embeddings and match it to itself:

    import numpy as np
    from jina import DocumentArray

    da = DocumentArray.empty(10)
    da.embeddings = np.random.random([10, 3])
    da.match(da, exclude_self=True)

Now da.matches contains the matches. Let's use it as the ground truth and create imperfect matches by mixing ten "noise Documents" into every d.matches:

    import copy

    da2 = copy.deepcopy(da)
    for d in da2:
        d.matches.extend(DocumentArray.empty(10))
        d.matches = d.matches.shuffle()

    print(da2.evaluate(da, metric='precision_at_k', k=5))

Now we should have the average Precision@5 close to 0.5.

    0.5399999999999999

Note that this value is an average over all Documents of da2. If you want to look at the individual evaluations, you can check the evaluations attribute, e.g.:

    for d in da2:
        print(d.evaluations['precision_at_k'].value)

Output:

    0.4000000059604645
    0.6000000238418579
    0.5
    0.5
    0.5
    0.4000000059604645
    0.5
    0.4000000059604645
    0.5
    0.30000001192092896

Note that evaluate() works only when the two DocumentArrays have the same length and their Documents are aligned by a hash function. The default hash function simply uses id; you can specify your own, as sketched below.
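
As a minimal sketch, assuming evaluate() accepts a hash_fn callable (check the evaluation API docs), aligning Documents by their .text instead of the default id might look like:

    # hedged sketch: ``hash_fn`` is assumed from the evaluation mixin API
    da2.evaluate(da, metric='precision_at_k', k=5, hash_fn=lambda d: d.text)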

Traverse nested elements

The traverse_flat() function is an extremely powerful tool for iterating over nested and recursive Documents. You get a generator as the return value, which yields Documents on the provided traversal paths. You can use or modify these Documents, and the changes will be applied in-place.

Syntax of traversal path

The .traverse_flat() function accepts a traversal_paths string, which is defined as follows:

    path1,path2,path3,...

Tip

Its syntax is similar to subscripts in numpy.einsum(), but without the -> operator.

Note that,

  • paths are separated by a comma ,;

  • each path is a string representing a route from the top-level Documents to the destination. You can use c to select chunks and m to select matches;

  • a path can be a single letter, e.g. c or m, or multiple letters, e.g. ccc, cmc, depending on how deep you want to go;

  • to select top-level Documents, you can use r;

  • a path can only go deeper, not back; you can use a comma , to "reset" the path back to the top level.

Example

Let’s look at an example. Assume you have the following Document structure:

The nested Document is constructed as follows:


    from jina import DocumentArray, Document

    root = Document(id='r1')

    chunk1 = Document(id='r1c1')
    root.chunks.append(chunk1)
    root.chunks[0].matches.append(Document(id='r1c1m1'))

    chunk2 = Document(id='r1c2')
    root.chunks.append(chunk2)
    chunk2_chunk1 = Document(id='r1c2c1')
    chunk2_chunk2 = Document(id='r1c2c2')
    root.chunks[1].chunks.extend([chunk2_chunk1, chunk2_chunk2])
    root.chunks[1].chunks[0].matches.extend([Document(id='r1c2c1m1'), Document(id='r1c2c1m2')])

    chunk3 = Document(id='r1c3')
    root.chunks.append(chunk3)

    da = DocumentArray([root])
    root.plot()

[Image: the nested Document structure]

Now one can use da.traverse_flat('c') to get all Chunks of the root Document, and da.traverse_flat('m') to get all Matches of the root Document.

This allows us to combine c and m to find Chunks/Matches at a deeper level:

  • da.traverse_flat('cm') will find all Matches of the Chunks of the root Document.

  • da.traverse_flat('cmc') will find all Chunks of the Matches of the Chunks of the root Document.

  • da.traverse_flat('c,m') will find all Chunks and Matches of the root Document.

Examples


    for m in da.traverse_flat('cm'):
        print(m.json())

Output:

    {
      "adjacency": 1,
      "granularity": 1,
      "id": "r1c1m1"
    }
    for m in da.traverse_flat('ccm'):
        print(m.json())

Output:

    {
      "adjacency": 1,
      "granularity": 2,
      "id": "r1c2c1m1"
    }
    {
      "adjacency": 1,
      "granularity": 2,
      "id": "r1c2c1m2"
    }
    for ma in da.traverse('cm', 'ccm'):
        for m in ma:
            print(m.json())

Output:

    {
      "adjacency": 1,
      "granularity": 1,
      "id": "r1c1m1"
    }
    {
      "adjacency": 1,
      "granularity": 2,
      "id": "r1c2c1m1"
    }
    {
      "adjacency": 1,
      "granularity": 2,
      "id": "r1c2c1m2"
    }

When calling da.traverse_flat('cm,ccm') the result in our example will be:

    DocumentArray([
        Document(id='r1c1m1', adjacency=1, granularity=1),
        Document(id='r1c2c1m1', adjacency=1, granularity=2),
        Document(id='r1c2c1m2', adjacency=1, granularity=2)
    ])

jina.types.arrays.mixins.traverse.TraverseMixin.traverse_flat_per_path() is another method for Document traversal. It works like traverse_flat() but groups Documents into DocumentArrays based on the traversal path. When calling da.traverse_flat_per_path('cm,ccm'), the resulting generator yields the following DocumentArrays:

    DocumentArray([
        Document(id='r1c1m1', adjacency=1, granularity=1),
    ])
    DocumentArray([
        Document(id='r1c2c1m1', adjacency=1, granularity=2),
        Document(id='r1c2c1m2', adjacency=1, granularity=2)
    ])

Flatten Document

If you simply want to traverse all chunks and matches regardless of their level, you can use flatten(). It returns a DocumentArray with all chunks and matches flattened into the top level, with no nested structure left. For example:
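
A minimal sketch (the nested structure here is constructed inline just for illustration):

    from jina import DocumentArray, Document

    d = Document(chunks=[Document(), Document(matches=[Document()])])
    da = DocumentArray([d])

    flat = da.flatten()  # root, chunks and matches all end up at the top level
    print(len(flat))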

Batching

One can batch a large DocumentArray into small ones via batch(). This is useful when a DocumentArray is too big to process at once. It is particularly useful for DocumentArrayMemmap, which ensures the data gets loaded on demand and in a conservative manner.

    from jina import DocumentArray

    da = DocumentArray.empty(1000)
    for b_da in da.batch(batch_size=256):
        print(len(b_da))

Output:

    256
    256
    256
    232

Tip

For processing batches in parallel, please refer to map_batch().

Parallel processing

See also

  • map(): processes elements in parallel, one by one; returns an iterator of elements;

  • map_batch(): processes elements in parallel, batch by batch; returns an iterator of batches;

  • apply(): like .map(), but returns a DocumentArray;

  • apply_batch(): like .map_batch(), but returns a DocumentArray.

Working with a large DocumentArray element-wise can be time-consuming. The naive way is to run a for-loop and enumerate all Documents one by one. Jina provides map() to speed things up quite a lot. It is like the Python built-in map() function, but maps the given function to every element of the DocumentArray in parallel. There is also map_batch(), which works at the minibatch level, as sketched below.
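
For instance, a minimal map_batch() sketch (the uppercase transform is just a hypothetical per-batch function):

    from jina import DocumentArray

    da = DocumentArray.empty(1000)
    da.texts = [f'doc{i}' for i in range(1000)]

    def upper_batch(batch):
        # a hypothetical transform applied to each minibatch (a DocumentArray)
        batch.texts = [t.upper() for t in batch.texts]
        return batch

    for b in da.map_batch(upper_batch, batch_size=256):
        print(len(b))  # 256, 256, 256, 232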

Let's see an example where we want to preprocess ~6000 image Documents. First we fill the URI of each Document:

    from jina import DocumentArray

    docs = DocumentArray.from_files('*.jpg')  # 6000 image Documents with .uri set

To load and preprocess docs, we have:

    def foo(d):
        return (d.load_uri_to_image_blob()
                 .set_image_blob_normalization()
                 .set_image_blob_channel_axis(-1, 0))

This loads the image from file into .blob, applies some normalization and sets the channel axis. Now let's compare the time difference between doing things sequentially and using DocumentArray.map() with different backends.

For-loop

    for d in docs:
        foo(d)

Map with process backend

    for d in docs.map(foo, backend='process'):
        pass

Map with thread backend

    for d in docs.map(foo, backend='thread'):
        pass

Output:

    map-process ... map-process takes 5 seconds (5.55s)
    map-thread ... map-thread takes 10 seconds (10.28s)
    foo-loop ... foo-loop takes 18 seconds (18.52s)

One can see a significant speedup with .map().

When to choose process or thread backend?

It depends on what your func in .map(func) looks like:

  • First, if you want func to modify elements in-place, then you can only use the thread backend. With the process backend you can only rely on the return values of .map(); modifications made inside func are lost (see the sketch after this list).

  • Second, follow the common advice: use thread for IO-bound func and process for CPU-bound func.
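
For instance, a sketch of collecting results with the process backend (here foo is the preprocessing function from above; the in-place changes made inside the worker processes are not visible in docs, so the array is rebuilt from the returned Documents):

    # rebuild the DocumentArray from the Documents returned by the workers
    docs = DocumentArray(docs.map(foo, backend='process'))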

Tip

If you only modify elements in-place, and do not need return values, you can write:

    da = DocumentArray(...)
    da.apply(func)

Visualization

If a DocumentArray contains all image Documents, you can plot all images in one sprite image using plot_image_sprites().

    from jina import DocumentArray

    docs = DocumentArray.from_files('*.jpg')
    docs.plot_image_sprites()

[Image: sprite image]

If a DocumentArray has valid .embeddings, you can visualize the embeddings interactively using plot_embeddings().

Hint

Note that .plot_embeddings() applies to any DocumentArray, not just image ones. For an image DocumentArray, you can go one step further and attach the image sprite to the visualization points:

    da.plot_embeddings(image_sprites=True)

For example:

    import numpy as np
    from jina import DocumentArray

    docs = DocumentArray.from_files('*.jpg')
    docs.embeddings = np.random.random([len(docs), 256])  # some random embeddings
    docs.plot_embeddings(image_sprites=True)

[Image: embedding projector]

Sampling

DocumentArray provides a .sample function that samples k elements without replacement. It accepts two parameters, k and seed. k defines the number of elements to sample, and seed helps you generate pseudo-random results. Note that k should always be less than or equal to the length of the DocumentArray.

To make use of the function:

    from jina import Document, DocumentArray

    da = DocumentArray()  # initialize an empty DocumentArray
    for idx in range(100):
        da.append(Document(id=idx))  # append 100 Documents into `da`

    sampled_da = da.sample(k=10)  # sample 10 Documents
    sampled_da_with_seed = da.sample(k=10, seed=1)  # sample 10 Documents with a seed

Shuffle

DocumentArray provides a .shuffle function that shuffles the entire DocumentArray. It accepts a seed parameter, which helps you generate pseudo-random results; by default, seed is None.

To make use of the function:

    from jina import Document, DocumentArray

    da = DocumentArray()  # initialize an empty DocumentArray
    for idx in range(100):
        da.append(Document(id=idx))  # append 100 Documents into `da`

    shuffled_da = da.shuffle()  # shuffle the DocumentArray
    shuffled_da_with_seed = da.shuffle(seed=1)  # shuffle the DocumentArray with a seed

Split by .tags

DocumentArray provides a .split function that splits the DocumentArray into multiple DocumentArrays according to the tag value (stored in tags) of each Document. It returns a Python dict where Documents with the same tag value are grouped together, with their order preserved from the original DocumentArray.

To make use of the function:

    from jina import Document, DocumentArray

    da = DocumentArray()
    da.append(Document(tags={'category': 'c'}))
    da.append(Document(tags={'category': 'c'}))
    da.append(Document(tags={'category': 'b'}))
    da.append(Document(tags={'category': 'a'}))
    da.append(Document(tags={'category': 'a'}))

    rv = da.split(tag='category')
    assert len(rv['c']) == 2  # category `c` is a DocumentArray with 2 Documents

Pythonic list interface

One can see DocumentArray as a Python list. Hence, many Python high-level iterator functions/tools can be used on DocumentArray as well.

Iterate via itertools

As DocumentArray is an Iterable, you can also use Python’s built-in itertools module on it. This enables advanced “iterator algebra” on the DocumentArray.

For instance, you can group a DocumentArray by parent_id:

    from jina import DocumentArray, Document
    from itertools import groupby

    da = DocumentArray([Document(parent_id=f'{i % 2}') for i in range(6)])
    groups = groupby(sorted(da, key=lambda d: d.parent_id), lambda d: d.parent_id)
    for key, group in groups:
        print((key, len(list(group))))

Output:

    ('0', 3)
    ('1', 3)

Filter

You can use Python’s built-in filter() to filter elements in a DocumentArray object:

    from jina import DocumentArray, Document

    da = DocumentArray([Document() for _ in range(6)])
    for j in range(6):
        da[j].scores['metric'] = j

    for d in filter(lambda d: d.scores['metric'].value > 2, da):
        print(d)

Output:

    {'id': 'b5fa4871-cdf1-11eb-be5d-e86a64801cb1', 'scores': {'values': {'metric': {'value': 3.0}}}}
    {'id': 'b5fa4872-cdf1-11eb-be5d-e86a64801cb1', 'scores': {'values': {'metric': {'value': 4.0}}}}
    {'id': 'b5fa4873-cdf1-11eb-be5d-e86a64801cb1', 'scores': {'values': {'metric': {'value': 5.0}}}}

You can build a DocumentArray object from the filtered results:

    from jina import DocumentArray, Document

    da = DocumentArray([Document(weight=j) for j in range(6)])
    da2 = DocumentArray(d for d in da if d.weight > 2)
    print(da2)

Output:

    DocumentArray has 3 items:
    {'id': '3bd0d298-b6da-11eb-b431-1e008a366d49', 'weight': 3.0},
    {'id': '3bd0d324-b6da-11eb-b431-1e008a366d49', 'weight': 4.0},
    {'id': '3bd0d392-b6da-11eb-b431-1e008a366d49', 'weight': 5.0}

Sort

DocumentArray is a subclass of MutableSequence; therefore you can use Python's built-in sort to sort elements in a DocumentArray object:

    from jina import DocumentArray, Document

    da = DocumentArray(
        [
            Document(tags={'id': 1}),
            Document(tags={'id': 2}),
            Document(tags={'id': 3})
        ]
    )
    da.sort(key=lambda d: d.tags['id'], reverse=True)
    print(da)

This sorts the elements of da in-place by their tags['id'] value in descending order, printing:

    <jina.types.arrays.document.DocumentArray length=3 at 5701440528>
    {'id': '6a79982a-b6b0-11eb-8a66-1e008a366d49', 'tags': {'id': 3.0}},
    {'id': '6a799744-b6b0-11eb-8a66-1e008a366d49', 'tags': {'id': 2.0}},
    {'id': '6a799190-b6b0-11eb-8a66-1e008a366d49', 'tags': {'id': 1.0}}