Extractor

The Extractor pipeline joins a prompt, context data store and generative model together to extract knowledge.

The data store can be an embeddings database or a similarity instance with associated input text. The generative model can be a prompt-driven large language model (LLM), an extractive question-answering model or a custom pipeline. This is known as prompt-driven search or retrieval augmented generation (RAG).

Example

The following shows a simple example using this pipeline.

from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor

# LLM prompt
def prompt(question):
    return f"""
    Answer the following question using the provided context.

    Question:
    {question}

    Context:
    """

# Input data
data = [
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed, " +
    "forming a Manhattan-sized iceberg",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
    "The National Park Service warns against sacrificing slower friends " +
    "in a bear attack",
    "Maine man wins $1M from $25 lottery ticket",
    "Make huge profits without work, earn up to $100,000 a day"
]

# Build embeddings index
embeddings = Embeddings({"content": True})
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

# Create and run pipeline
extractor = Extractor(embeddings, "google/flan-t5-base")
extractor([{"query": "What was won?", "question": prompt("What was won?")}])
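The data store does not have to be an embeddings index. The following sketch swaps in a Similarity pipeline and an extractive question-answering model, reusing the data list from above as ad-hoc context via the texts argument; the model names are illustrative examples.

from txtai.pipeline import Extractor, Similarity

# Rank context with a zero-shot similarity pipeline instead of an embeddings index
similarity = Similarity("valhalla/distilbart-mnli-12-3")

# Extractive question-answering model
extractor = Extractor(similarity, "distilbert-base-cased-distilled-squad")

# Queue entries are (name, query, question, snippet) tuples, context comes from texts
extractor([("answer", "What was won?", "What was won?", False)], data)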

See the links below for more detailed examples.

Notebook | Description
Prompt-driven search with LLMs | Embeddings-guided and prompt-driven search with Large Language Models (LLMs)
Prompt templates and task chains | Build model prompts and connect tasks together with workflows
Build RAG pipelines with txtai | Guide on retrieval augmented generation including how to create citations
Integrate LLM frameworks | Integrate llama.cpp, LiteLLM and custom generation frameworks
Extractive QA with txtai | Introduction to extractive question-answering with txtai
Extractive QA with Elasticsearch | Run extractive question-answering queries with Elasticsearch
Extractive QA to build structured data | Build structured datasets using extractive question-answering

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Allow documents to be indexed
writable: True

# Content is required for extractor pipeline
embeddings:
  content: True

extractor:
  path: google/flan-t5-base

workflow:
  search:
    tasks:
      - task: extractor
        template: |
          Answer the following question using the provided context.

          Question:
          {text}

          Context:
        action: extractor

Run with Workflows

Built-in tasks make using the extractor pipeline easier.

from txtai.app import Application

# Create and run pipeline with workflow
app = Application("config.yml")
app.add([
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed, " +
    "forming a Manhattan-sized iceberg",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
    "The National Park Service warns against sacrificing slower friends " +
    "in a bear attack",
    "Maine man wins $1M from $25 lottery ticket",
    "Make huge profits without work, earn up to $100,000 a day"
])
app.index()

list(app.workflow("search", ["What was won?"]))
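If the configuration defines an extractor, the pipeline can also be invoked directly on the Application instance, bypassing the workflow and its prompt template. A minimal sketch, assuming the extract method mirrors the API's extract endpoint:

# Direct pipeline call on the Application instance (sketch)
app.extract([{"query": "What was won?", "question": "What was won?"}])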

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name": "search", "elements": ["What was won"]}'
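The same call can be made from Python with the requests library; a short sketch against the endpoint started above:

import requests

# POST the workflow request to the local txtai API instance
response = requests.post(
    "http://localhost:8000/workflow",
    json={"name": "search", "elements": ["What was won"]}
)
print(response.json())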

Methods

Python documentation for the pipeline.

__init__

Builds a new extractor.

Parameters:

Name | Description | Default
similarity | similarity instance (embeddings or similarity pipeline) | required
path | path to model, supports a LLM, Questions or custom pipeline | required
quantize | True if model should be quantized before inference, False otherwise | False
gpu | if gpu inference should be used (only works if GPUs are available) | True
model | optional existing pipeline model to wrap | None
tokenizer | Tokenizer class | None
minscore | minimum score to include context match | None
mintokens | minimum number of tokens to include context match | None
context | topn context matches to include, defaults to 3 when not set | None
task | model task (language-generation, sequence-sequence or question-answering), defaults to auto-detect | None
output | output format: 'default' returns (name, answer), 'flatten' returns answers and 'reference' returns (name, answer, reference) | 'default'
template | prompt template, must have parameters for {question} and {context}, defaults to "{question} {context}" when not set | None
separator | context separator | ' '
kwargs | additional keyword arguments to pass to pipeline model | {}
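To illustrate how these parameters combine, the sketch below builds an extractor that returns flattened answers over a wider context window. The values shown are examples only and assume the content-enabled embeddings index from the example above.

# Illustrative parameter combination (values are examples, not recommendations)
extractor = Extractor(
    embeddings,                       # similarity instance
    "google/flan-t5-base",            # LLM path
    context=5,                        # include top 5 context matches
    output="flatten",                 # return answers only
    template="{question} {context}",  # default prompt template, shown explicitly
    separator="\n"                    # join context matches with newlines
)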

Source code in txtai/pipeline/text/extractor.py

def __init__(
    self,
    similarity,
    path,
    quantize=False,
    gpu=True,
    model=None,
    tokenizer=None,
    minscore=None,
    mintokens=None,
    context=None,
    task=None,
    output="default",
    template=None,
    separator=" ",
    **kwargs,
):
    """
    Builds a new extractor.

    Args:
        similarity: similarity instance (embeddings or similarity pipeline)
        path: path to model, supports a LLM, Questions or custom pipeline
        quantize: True if model should be quantized before inference, False otherwise.
        gpu: if gpu inference should be used (only works if GPUs are available)
        model: optional existing pipeline model to wrap
        tokenizer: Tokenizer class
        minscore: minimum score to include context match, defaults to None
        mintokens: minimum number of tokens to include context match, defaults to None
        context: topn context matches to include, defaults to 3
        task: model task (language-generation, sequence-sequence or question-answering), defaults to auto-detect
        output: output format, default returns (name, answer), flatten returns answers and reference returns (name, answer, reference)
        template: prompt template, it must have a parameter for {question} and {context}, defaults to "{question} {context}"
        separator: context separator
        kwargs: additional keyword arguments to pass to pipeline model
    """

    # Similarity instance
    self.similarity = similarity

    # Model can be a LLM, Questions or custom pipeline
    self.model = self.load(path, quantize, gpu, model, task, **kwargs)

    # Tokenizer class, use default method if not set
    self.tokenizer = tokenizer if tokenizer else Tokenizer() if hasattr(self.similarity, "scoring") and self.similarity.isweighted() else None

    # Minimum score to include context match
    self.minscore = minscore if minscore is not None else 0.0

    # Minimum number of tokens to include context match
    self.mintokens = mintokens if mintokens is not None else 0.0

    # Top n context matches to include for context
    self.context = context if context else 3

    # Output format
    self.output = output

    # Prompt template
    self.template = template if template else "{question} {context}"

    # Context separator
    self.separator = separator

__call__

Finds answers to input questions. This method runs queries to find the top n best matches and uses that as the context. A model is then run against the context for each input question, with the answer returned.

Parameters:

Name | Description | Default
queue | input question queue (name, query, question, snippet), can be a list of tuples/dicts/strings or a single input element | required
texts | optional list of text for context, otherwise runs embeddings search | None
kwargs | additional keyword arguments to pass to pipeline model | {}

Returns:

list of answers matching input format (tuple or dict) containing fields as specified by output format
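The sketch below shows the accepted input formats, reusing the extractor and data list from the example above. A single string, a list of dicts and a list of (name, query, question, snippet) tuples are all valid, and texts supplies ad-hoc context in place of the embeddings search.

# Single string input
extractor("What was won?")

# List of dicts
extractor([{"query": "What was won?", "question": "What was won?"}])

# List of (name, query, question, snippet) tuples
extractor([("answer", "What was won?", "What was won?", None)])

# Rank against an ad-hoc list of texts instead of the embeddings index
extractor([{"query": "What was won?", "question": "What was won?"}], texts=data)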

Source code in txtai/pipeline/text/extractor.py

def __call__(self, queue, texts=None, **kwargs):
    """
    Finds answers to input questions. This method runs queries to find the top n best matches and uses that as the context.
    A model is then run against the context for each input question, with the answer returned.

    Args:
        queue: input question queue (name, query, question, snippet), can be list of tuples/dicts/strings or a single input element
        texts: optional list of text for context, otherwise runs embeddings search
        kwargs: additional keyword arguments to pass to pipeline model

    Returns:
        list of answers matching input format (tuple or dict) containing fields as specified by output format
    """

    # Save original queue format
    inputs = queue

    # Convert queue to list, if necessary
    queue = queue if isinstance(queue, list) else [queue]

    # Convert dictionary inputs to tuples
    if queue and isinstance(queue[0], dict):
        # Convert dict to tuple
        queue = [tuple(row.get(x) for x in ["name", "query", "question", "snippet"]) for row in queue]

    if queue and isinstance(queue[0], str):
        # Convert string questions to tuple
        queue = [(None, row, row, None) for row in queue]

    # Rank texts by similarity for each query
    results = self.query([query for _, query, _, _ in queue], texts)

    # Build question-context pairs
    names, queries, questions, contexts, topns, snippets = [], [], [], [], [], []
    for x, (name, query, question, snippet) in enumerate(queue):
        # Get top n best matching segments
        topn = sorted(results[x], key=lambda y: y[2], reverse=True)[: self.context]

        # Generate context using ordering from texts, if available, otherwise order by score
        context = self.separator.join(text for _, text, _ in (sorted(topn, key=lambda y: y[0]) if texts else topn))

        names.append(name)
        queries.append(query)
        questions.append(question)
        contexts.append(context)
        topns.append(topn)
        snippets.append(snippet)

    # Run pipeline and return answers
    answers = self.answers(names, questions, contexts, [[text for _, text, _ in topn] for topn in topns], snippets, **kwargs)

    # Apply output formatting to answers and return
    return self.apply(inputs, queries, answers, topns)