DocumentArrayMemmap

When a DocumentArray object contains a large number of Documents, holding it in memory can be very demanding. DocumentArrayMemmap is a drop-in replacement for DocumentArray in this scenario.

Important

DocumentArrayMemmap shares almost the same API as DocumentArray. The exceptions are insert, in-place reverse, and in-place sort.

How does it work?

A DocumentArrayMemmap stores all Documents directly on disk, while keeping a small lookup table and a fixed-size buffer pool of Documents in memory. The lookup table contains the offset and length of each Document, so it is much smaller than the full DocumentArray. Elements are loaded into memory on demand when they are accessed. Loaded Documents are kept in the buffer pool so that they can be modified.
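The offset-and-length idea can be illustrated with a minimal, self-contained sketch. It is not Jina's implementation: the class TinyDiskStore, its methods, and the JSON encoding are hypothetical and chosen only for illustration. Records are appended to a single file, and only a small dictionary mapping each id to its (offset, length) pair stays in memory.

```python
import json
import os
import tempfile


class TinyDiskStore:
    """Toy append-only store: records live in one file, the index lives in memory."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # record id -> (offset, length)
        open(self.path, "ab").close()  # make sure the body file exists

    def append(self, record_id, record):
        payload = json.dumps(record).encode("utf-8")
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)  # position at the end to learn the offset
            offset = f.tell()
            f.write(payload)
        self.index[record_id] = (offset, len(payload))

    def get(self, record_id):
        # Only the requested record is read from disk, on demand.
        offset, length = self.index[record_id]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return json.loads(f.read(length).decode("utf-8"))


store = TinyDiskStore(os.path.join(tempfile.mkdtemp(), "body.bin"))
store.append("a", {"text": "hello"})
store.append("b", {"text": "world"})
loaded = store.get("b")  # reads only the bytes belonging to record "b"
```

The in-memory index here holds two small tuples regardless of how large the records are, which is the property the lookup table relies on.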

Construct

  from jina import DocumentArrayMemmap
  dam = DocumentArrayMemmap()  # use a local temporary folder as storage
  dam2 = DocumentArrayMemmap('./my-memmap')  # use './my-memmap' as storage

Delete

To delete all contents of a DocumentArrayMemmap object, call .clear(). This also removes all of its content on disk.

You can also check the disk usage of a DocumentArrayMemmap with the .physical_size property.

Convert to/from DocumentArray

  from jina import Document, DocumentArray, DocumentArrayMemmap
  da = DocumentArray([Document(text='hello'), Document(text='world')])
  # convert DocumentArray to DocumentArrayMemmap
  dam = DocumentArrayMemmap()
  dam.extend(da)
  # convert DocumentArrayMemmap to DocumentArray
  da = DocumentArray(dam)

Advanced

Warning

DocumentArrayMemmap is in general used for one-way access, either read-only or write-only. Interleaving reading and writing on a DocumentArrayMemmap is not safe and not recommended in production.

Understand buffer pool

Recently added, modified or accessed Documents are kept in an in-memory buffer pool. This allows all changes to Documents to be applied first in memory and then persisted to disk lazily (i.e. when they leave the buffer pool or when the dam object’s destructor is called). To persist changed Documents immediately, call .flush().

The size of the pool can be configured with the constructor argument buffer_pool_size (1,000 by default). Only the buffer_pool_size most recently accessed, modified or added Documents exist in the pool. Documents are replaced following the LRU (least recently used) strategy.

  from jina import DocumentArrayMemmap
  dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
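To make the LRU behaviour concrete, here is a toy sketch in plain Python. It illustrates the general technique under simplifying assumptions and is not the actual buffer pool: LRUBufferPool, touch, and the dict standing in for disk storage are hypothetical names. An item is written to "disk" when it is evicted or when flush is called.

```python
from collections import OrderedDict


class LRUBufferPool:
    """Toy LRU pool: keeps at most `size` items, writes evicted items to `disk`."""

    def __init__(self, size, disk):
        self.size = size
        self.disk = disk           # stand-in for on-disk storage (a plain dict)
        self.pool = OrderedDict()  # key -> item, least recently used first

    def touch(self, key, item):
        """Add an item, or mark an existing one as most recently used."""
        self.pool[key] = item
        self.pool.move_to_end(key)
        while len(self.pool) > self.size:
            old_key, old_item = self.pool.popitem(last=False)
            self.disk[old_key] = dict(old_item)  # persist a snapshot on eviction

    def flush(self):
        """Persist everything that is still in the pool."""
        for key, item in self.pool.items():
            self.disk[key] = dict(item)


disk = {}
pool = LRUBufferPool(size=2, disk=disk)
first = {"text": "hello"}
pool.touch("a", first)
pool.touch("b", {"text": "hello"})
pool.touch("c", {"text": "hello"})  # "a" is evicted and written to disk

first["text"] = "goodbye"  # edit through a reference the pool no longer tracks
pool.flush()               # persists "b" and "c" only
```

In this sketch, the edit made through first after eviction never reaches the stored copy of "a", which is the kind of lost update a stale reference can cause.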

Warning

The buffer pool ensures that in-memory modified Documents are persisted to disk. Therefore, you should not reference Documents manually and modify them if they might be outside of the buffer pool. The next section explains the best practices when modifying Documents.

Modify elements

Modifying elements of a DocumentArrayMemmap is possible because accessed and modified Documents are kept in the buffer pool:

  from jina import DocumentArrayMemmap, Document
  d1 = Document(text='hello')
  d2 = Document(text='world')
  dam = DocumentArrayMemmap('./my-memmap')
  dam.extend([d1, d2])
  dam[0].text = 'goodbye'
  print(dam[0].text)  # goodbye

However, you should not modify Documents that you reference manually and that might no longer be in the buffer pool. Here are some practices to avoid:

  1. Keep more references than the buffer pool size and modify them:

    ❌ Don’t

    from jina import Document, DocumentArrayMemmap
    docs = [Document(text='hello') for _ in range(100)]
    dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
    dam.extend(docs)
    # most of these references have already left the buffer pool
    for doc in docs:
        doc.text = 'goodbye'
    print(dam[50].text)  # hello

    ✅ Do

    Use the dam object to modify instead:

    from jina import Document, DocumentArrayMemmap
    docs = [Document(text='hello') for _ in range(100)]
    dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
    dam.extend(docs)
    # iterate over dam so each Document is in the buffer pool when modified
    for doc in dam:
        doc.text = 'goodbye'
    print(dam[50].text)  # goodbye

    It’s also okay if you hold references to fewer Documents than the buffer pool size:

    from jina import Document, DocumentArrayMemmap
    docs = [Document(text='hello') for _ in range(100)]
    dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=1000)
    dam.extend(docs)
    # all 100 Documents fit in the pool, so these references stay valid
    for doc in docs:
        doc.text = 'goodbye'
    print(dam[50].text)  # goodbye
  2. Modify a reference that might have left the buffer pool:

    ❌ Don’t

    from jina import Document, DocumentArrayMemmap
    dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
    my_doc = Document(text='hello')
    dam.append(my_doc)
    # my_doc leaves the buffer pool after extend
    dam.extend([Document(text='hello') for _ in range(99)])
    my_doc.text = 'goodbye'
    print(dam[0].text)  # hello

    ✅ Do

    Get the Document from the dam object and then modify it:

    from jina import Document, DocumentArrayMemmap
    dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
    my_doc = Document(text='hello')
    dam.append(my_doc)
    # my_doc leaves the buffer pool after extend
    dam.extend([Document(text='hello') for _ in range(99)])
    dam[my_doc.id].text = 'goodbye'  # or dam[0].text = 'goodbye'
    print(dam[0].text)  # goodbye

To summarize, it’s a best practice to rely on the dam object to reference the Documents that you modify.

Maintain consistency

Consider two DocumentArrayMemmap objects that share the same on-disk storage ./my-memmap but live in different processes or threads. Each object keeps its own version of the lookup table and buffer pool in memory, so after some write operations one object's in-memory state may no longer match what is on disk. .reload() and .flush() solve this issue:

  from jina import Document, DocumentArrayMemmap
  d1 = Document(text='hello')
  d2 = Document(text='world')
  dam = DocumentArrayMemmap('./my-memmap')
  dam2 = DocumentArrayMemmap('./my-memmap')
  dam.extend([d1, d2])
  assert len(dam) == 2
  assert len(dam2) == 0  # dam2 still holds its old lookup table
  dam2.reload()
  assert len(dam2) == 2
  dam.clear()
  assert len(dam) == 0
  assert len(dam2) == 2  # stale again until the next reload
  dam2.reload()
  assert len(dam2) == 0
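The staleness shown above is not specific to Jina. The following toy sketch reproduces it with two handles over one shared file. It is only an analogy: SharedLog and its methods are hypothetical, and the real lookup table is more elaborate than a cached list.

```python
import json
import os
import tempfile


class SharedLog:
    """Toy handle over a shared file; each handle caches its own view in memory."""

    def __init__(self, path):
        self.path = path
        open(self.path, "a").close()  # make sure the file exists
        self.items = []
        self.reload()

    def reload(self):
        """Rebuild the in-memory view from what is currently on disk."""
        with open(self.path, encoding="utf-8") as f:
            self.items = [json.loads(line) for line in f]

    def append(self, item):
        """Write to disk and update this handle's own view only."""
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(item) + "\n")
        self.items.append(item)

    def __len__(self):
        return len(self.items)


path = os.path.join(tempfile.mkdtemp(), "log.jsonl")
writer = SharedLog(path)
reader = SharedLog(path)

writer.append({"text": "hello"})
stale = len(reader)   # 0: the reader's cached view predates the write
reader.reload()
fresh = len(reader)   # 1: the view now matches the file
```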

You don’t need to call .flush() if you only add new Documents. However, if you modify an attribute of a Document, you need to call .flush() so that the change is written to disk before another object reloads:

  from jina import Document, DocumentArrayMemmap
  d1 = Document(text='hello')
  dam = DocumentArrayMemmap('./my-memmap')
  dam2 = DocumentArrayMemmap('./my-memmap')
  dam.append(d1)
  d1.text = 'goodbye'  # changed in memory only, not yet on disk
  assert len(dam) == 1
  assert len(dam2) == 0
  dam2.reload()
  assert len(dam2) == 1
  assert dam2[0].text == 'hello'
  dam.flush()  # persist the modified Document
  dam2.reload()
  assert dam2[0].text == 'goodbye'