Textractor

The Textractor pipeline extracts and splits text from documents. This pipeline uses either an Apache Tika backend (if Java is available) or BeautifulSoup4.

Example

The following shows a simple example using this pipeline.

```python
from txtai.pipeline import Textractor

# Create and run pipeline
textract = Textractor()
textract("https://github.com/neuml/txtai")
```

See the link below for a more detailed example.

| Notebook | Description |
|----------|-------------|
| Extract text from documents | Extract text from PDF, Office, HTML and more |

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

```yaml
# Create pipeline using lower case class name
textractor:

# Run pipeline with workflow
workflow:
  textract:
    tasks:
      - action: textractor
```

Run with Workflows

```python
from txtai.app import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("textract", ["https://github.com/neuml/txtai"]))
```

Run with API

```bash
CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"textract", "elements":["https://github.com/neuml/txtai"]}'
```
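The request body pairs the workflow name with the list of inputs, mirroring the arguments to `app.workflow` in the Python example above. A minimal sketch of building the same payload (illustration only, does not send the request):

```python
import json

# Same body as the curl call: workflow name plus list of workflow inputs
payload = {
    "name": "textract",
    "elements": ["https://github.com/neuml/txtai"],
}

print(json.dumps(payload))
```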

Methods

Python documentation for the pipeline.

Source code in txtai/pipeline/data/textractor.py

```python
def __init__(self, sentences=False, lines=False, paragraphs=False, minlength=None, join=False, tika=True):
    if not TIKA:
        raise ImportError('Textractor pipeline is not available - install "pipeline" extra to enable')

    super().__init__(sentences, lines, paragraphs, minlength, join)

    # Determine if Tika (default if Java is available) or Beautiful Soup should be used
    # Beautiful Soup only supports HTML, Tika supports a wide variety of file formats, including HTML.
    self.tika = self.checkjava() if tika else False
```
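The backend decision above can be sketched as a standalone function (hypothetical `select_backend`, not part of txtai): Tika is used when requested and Java is available, otherwise the pipeline falls back to BeautifulSoup4.

```python
def select_backend(tika=True, java_available=True):
    """Mirror Textractor's backend choice: Tika when requested and
    Java is available, otherwise fall back to BeautifulSoup4."""
    if tika and java_available:
        return "tika"
    return "beautifulsoup4"

print(select_backend())                      # tika
print(select_backend(java_available=False))  # beautifulsoup4
print(select_backend(tika=False))            # beautifulsoup4
```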

Source code in txtai/pipeline/data/segmentation.py

```python
def __call__(self, text):
    """
    Segments text into semantic units.

    This method supports text as a string or a list. If the input is a string, the return
    type is text|list. If text is a list, a list is returned. This could be a
    list of text or a list of lists depending on the tokenization strategy.

    Args:
        text: text|list

    Returns:
        segmented text
    """

    # Get inputs
    texts = [text] if not isinstance(text, list) else text

    # Extract text for each input file
    results = []
    for value in texts:
        # Get text
        value = self.text(value)

        # Parse and add extracted results
        results.append(self.parse(value))

    return results[0] if isinstance(text, str) else results
```
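The string-versus-list convention in `__call__` can be illustrated with a standalone sketch (hypothetical `segment` function, using a simple line split in place of the real extract-and-parse step): a string input returns a single result, a list input returns a list of results.

```python
def segment(text):
    """Return one result for a string input, a list of results for a list input."""
    # Normalize the input to a list
    texts = [text] if not isinstance(text, list) else text

    # Stand-in for self.parse(self.text(value)): split each input into lines
    results = [value.splitlines() for value in texts]

    # Unwrap single results for string inputs
    return results[0] if isinstance(text, str) else results

print(segment("a\nb"))         # ['a', 'b']
print(segment(["a\nb", "c"]))  # [['a', 'b'], ['c']]
```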