Open-Domain Question-Answering on Long Documents

The following tutorial will take you through a solution to question-answering on long documents. This is an inherently difficult task, due to the fuzziness of human language and the infinite number of questions one could ask.

One way to solve this is by predicting answers using a neural network that was trained on pairs of questions and their corresponding answers. In many cases, however, such a dataset is not available, as is the case for most software documentation. Let’s say we want to build a chatbot to answer questions about the Jina documentation. What if I told you that there is a way to reframe this task as a search problem, and that this would alleviate the need for a large dataset of matching questions and answers?

How, you ask? Let me explain!

Overview

Our approach to the problem leverages the Doc2query method, which, from a piece of text, predicts different questions the text could potentially answer. For example, given a sentence such as Jina is an open source framework for neural search., the model predicts questions such as What is Jina? or Is Jina open source?

The idea here is to predict several questions for every part of the original text document, in our case the Jina documentation. Then we use an encoder to create a vector representation for each of the predicted questions. These representations are stored and provide the index for our body of text. When a user prompts the bot with a question, we encode it in the same way we encoded our generated questions. Now we can run a similarity search on the encodings: the encoding of the user’s query is compared with the encodings of the generated questions in our index to find the closest match.

Once the matching answer is found, we can return it to the user.
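To make the matching step more concrete, here is a minimal sketch of what the similarity search boils down to, using plain numpy and made-up toy embeddings in place of a real encoder:

    import numpy as np

    # Toy embeddings for three generated questions (real vectors would come from an encoder)
    question_embeddings = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
    # Toy embedding of the user's query
    query_embedding = np.array([0.85, 0.15])

    # Cosine similarity between the query and every indexed question
    similarities = question_embeddings @ query_embedding / (
        np.linalg.norm(question_embeddings, axis=1) * np.linalg.norm(query_embedding)
    )

    # The most similar generated question points us to the answer it was generated from
    print(int(np.argmax(similarities)))  # -> 0

In the actual solution, Jina’s indexer takes care of this comparison for us.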

Now that you have a general idea of what we will be doing, the following section will show you how to define our Flows in Jina. Then we will take a look at how to implement the necessary Executors for our search-based question-answering system.

Indexing the text document

Let’s imagine we extracted a bunch of sentences from Jina’s documentation and stored them in a DocumentArray, as shown below.

    from jina import DocumentArray, Executor, requests, Document, Flow

    example_sentences = [
        'Document is the basic data type that Jina operates with',
        'Executor processes a DocumentArray in-place',
        ...,
        'Jina uses the concept of a flow to tie different executors together',
    ]

    docs = DocumentArray([Document(content=sentence) for sentence in example_sentences])

As described in the last section, we first need to predict potential questions for each of the elements in the DocumentArray. Then we have to use another model to create vector encodings from the predicted questions. Finally, we store them as the index.

At this point we have enough information to start defining our Flows.

Without further ado, let’s build!

    indexing_flow = (
        Flow()
        # Generate potential questions using doc2query
        .add(
            name='question_transformer',
            uses=QuestionGenerator,
        )
        # Encode the generated questions
        .add(
            name='text_encoder',
            uses=TextEncoder,
            uses_with={'parameters': {'traversal_paths': 'c'}},
        )
        # Index answers and generated questions
        .add(
            name='simple_indexer',
            uses=SimpleIndexer,
        )
    )

    with indexing_flow:
        # Run the indexing on all extracted sentences
        indexing_flow.post(on='/index', inputs=docs, on_done=print)

Searching the user’s query against the index

After having defined the Flow for indexing our document, we are now ready to work on answering user queries. Incoming queries also need to be encoded. For that, we use the same encoder that we used for encoding our generated questions. Then we use the SimpleIndexer to perform a similarity search, in order to retrieve the generated questions and, eventually, the answers to the query.

The Flow for searching is much simpler than the one for indexing and looks like this:

    query_flow = (
        Flow()
        # Create vector representations from query
        .add(name='query_transformer', uses=TextEncoder)
        # Use encoded question to search our index
        .add(
            name='simple_indexer',
            uses=SimpleIndexer,
        )
    )

    with query_flow:
        # Run question through the query flow and return answer
        search_results = query_flow.post(
            on='/search', inputs=user_queries, return_results=True, on_done=print
        )

Now that we have seen the overall structure of the approach and have defined our Flows, we can code up the Executors.

Building the Executor to Generate Potential Questions

The first Executor we implement is the QuestionGenerator. It is a wrapper around the model that predicts potential questions which a given piece of text can answer.

Apart from that, it just loops over all provided parts of input text. After potential questions are predicted for each of the inputs, they are stored as chunks alongside the original text.

    from transformers import T5Tokenizer, T5ForConditionalGeneration


    class QuestionGenerator(Executor):
        @requests
        def doc2query(self, docs: DocumentArray, **kwargs):
            """Generates potential questions for each answer"""

            # Load the pretrained doc2query model and tokenizer
            self._tokenizer = T5Tokenizer.from_pretrained(
                'castorini/doc2query-t5-base-msmarco'
            )
            self._model = T5ForConditionalGeneration.from_pretrained(
                'castorini/doc2query-t5-base-msmarco'
            )

            for d in docs:
                input_ids = self._tokenizer.encode(d.content, return_tensors='pt')
                # Generate potential queries for each piece of text
                outputs = self._model.generate(
                    input_ids=input_ids,
                    max_length=64,
                    do_sample=True,
                    num_return_sequences=10,
                )
                # Decode the outputs to text and store them as chunks
                for output in outputs:
                    question = self._tokenizer.decode(
                        output, skip_special_tokens=True
                    ).strip()
                    d.chunks.append(Document(text=question))
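Before wiring the Executor into a Flow, you can get a feeling for its output by calling it directly on a single Document. The snippet below is only a quick sanity-check sketch; generation is sampled, so the questions will differ from run to run:

    # Run the question generator on one sentence outside of a Flow (sketch)
    generator = QuestionGenerator()
    sample = DocumentArray(
        [Document(content='Jina is an open source framework for neural search.')]
    )
    generator.doc2query(sample)

    # Print the generated candidate questions
    print([chunk.text for chunk in sample[0].chunks])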

We want to give credit where credit is due and mention the paper that introduced the doc2query approach, Document Expansion by Query Prediction (Nogueira et al., 2019).

Building the Encoder

The next step is to build the Executor that we will use to create vector representations of human-readable text.

    import torch
    from sentence_transformers import SentenceTransformer


    class TextEncoder(Executor):
        def __init__(self, parameters: dict = {'traversal_paths': 'r'}, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.model = SentenceTransformer(
                'paraphrase-mpnet-base-v2', device='cpu', cache_folder='.'
            )
            self.parameters = parameters

        @requests(on=['/search', '/index'])
        def encode(self, docs: DocumentArray, **kwargs):
            """Wraps encoder from sentence-transformers package"""
            traversal_paths = self.parameters.get('traversal_paths')
            target = docs.traverse_flat(traversal_paths)

            with torch.inference_mode():
                embeddings = self.model.encode(target.texts)
                target.embeddings = embeddings

Similar to the QuestionGenerator, the TextEncoder is simply a wrapper around the SentenceTransformer from the sentence_transformers package. When provided with a DocumentArray containing text, it encodes the text of each element and stores the result in that element’s embedding attribute.
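As a quick sanity check, the encoder can also be called directly on a small DocumentArray outside of a Flow. This is only a sketch; the shape in the comment assumes the paraphrase-mpnet-base-v2 model, which produces 768-dimensional vectors:

    # Encode two short texts directly with the Executor (sketch)
    encoder = TextEncoder()
    sample = DocumentArray(
        [Document(text='What is Jina?'), Document(text='Is Jina open source?')]
    )
    encoder.encode(sample)

    print(sample[0].embedding.shape)  # e.g. (768,) for paraphrase-mpnet-base-v2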

Now let’s move on to the last part and create the indexer.

Putting it Together with the Indexer

The indexer is the only one of our Executors that handles more than one task, namely indexing and search.

When it is used to perform indexing, index() is called. This stores all provided documents, together with their embeddings, as a DocumentArrayMemmap.

However, when the SimpleIndexer is used to handle an incoming query, the search() function is called. It performs a similarity search and ranks the results: since the matches are the generated questions (stored as chunks), their similarity scores are summed up per parent document, so that the best-matching original sentences can be returned as answers.

    from collections import defaultdict

    from jina import DocumentArrayMemmap


    class SimpleIndexer(Executor):
        """Simple indexer class"""

        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            self._docs = DocumentArrayMemmap(".")

        @requests(on='/index')
        def index(self, docs: 'DocumentArray', **kwargs):
            # Store the documents and their embeddings in the index
            if docs:
                self._docs.extend(docs)

        @requests(on='/search')
        def search(self, docs: 'DocumentArray', **kwargs):
            """Append best matches to each document in docs"""

            # Match the query against the index using cosine similarity
            docs.match(
                DocumentArray(self._docs),
                metric='cosine',
                normalization=(1, 0),
                limit=100,
                traversal_rdarray=['c'],
            )

            for d in docs:
                match_similarity = defaultdict(float)

                # Sum up the cosine similarity of every matched question per parent document
                for m in d.matches:
                    match_similarity[m.parent_id] += m.scores['cosine'].value

                sorted_similarities = sorted(
                    match_similarity.items(), key=lambda v: v[1], reverse=True
                )

                # Rank matches by similarity and collect them
                d.matches.clear()
                for k, _ in sorted_similarities:
                    m = Document(self._docs[k], copy=True)
                    d.matches.append(m)
                    # Only return the top 10 answers
                    if len(d.matches) >= 10:
                        break

                # Remove the embedding as it is not needed anymore
                d.pop('embedding')

The ranking of the results is reflected in the order of the matches attribute of each returned document. Hence, to show the best answer to the user, we can simply print the first match of the first document stored in search_results.

    # Print the answer text to our question
    print(search_results[0].docs[0].matches.texts[0])

We have now seen how to implement a question-answering bot using Jina without the need for a large dataset of matching questions and answers. In practice, we would need to experiment with several parameters, such as how the answers are initially extracted from the original text. In this tutorial, we assumed that every sentence is one potential answer. However, in reality, it is likely that some user queries require multiple sentences or complete paragraphs to answer.
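If you want to experiment with that extraction step, a very naive starting point is to split the raw documentation text into candidate answers yourself. The file name and the splitting rule below are purely illustrative assumptions; a real setup would likely use a proper sentence splitter or paragraph-level chunks:

    # Naive extraction of candidate answers from raw documentation text (illustrative only)
    with open('jina_docs.txt') as f:  # hypothetical file containing the documentation
        raw_text = f.read()

    example_sentences = [s.strip() for s in raw_text.split('.') if s.strip()]
    docs = DocumentArray([Document(content=sentence) for sentence in example_sentences])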