Tabular

The Tabular pipeline splits tabular data into rows and columns. It is most useful for creating (id, text, tag) tuples to load into Embeddings indexes.

Example

The following shows a simple example using this pipeline.

  from txtai.pipeline import Tabular

  # Create and run pipeline
  tabular = Tabular("id", ["text"])
  tabular("path to csv file")
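The call above returns a list of (id, text, tag) tuples. The expected shape can be sketched without txtai; the following is an illustrative mimic using only the standard library, not the library implementation:

```python
import csv
import io

# In-memory stand-in for the CSV file passed to the pipeline
data = io.StringIO("id,text\n0,hello world\n1,tabular data\n")

# Mimic Tabular("id", ["text"]): one (id, text, tag) tuple per row,
# with the tag left as None
rows = [(row["id"], row["text"], None) for row in csv.DictReader(data)]
# rows -> [("0", "hello world", None), ("1", "tabular data", None)]
```

These tuples can be passed directly to an Embeddings index method.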

See the link below for a more detailed example.

Notebook | Description
Transform tabular data with composable workflows | Transform, index and search tabular data | Open In Colab

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

  # Create pipeline using lower case class name
  tabular:
    idcolumn: id
    textcolumns:
      - text

  # Run pipeline with workflow
  workflow:
    tabular:
      tasks:
        - action: tabular
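For reference, the YAML above parses to the following Python structure (a sketch assuming standard YAML semantics, which is what Application receives):

```python
# Equivalent of config.yml after YAML parsing
config = {
    # Pipeline definition using lower case class name
    "tabular": {
        "idcolumn": "id",
        "textcolumns": ["text"],
    },
    # Workflow that runs the pipeline
    "workflow": {
        "tabular": {"tasks": [{"action": "tabular"}]},
    },
}
```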

Run with Workflows

  from txtai.app import Application

  # Create and run pipeline with workflow
  app = Application("config.yml")
  list(app.workflow("tabular", ["path to csv file"]))

Run with API

  CONFIG=config.yml uvicorn "txtai.api:app" &

  curl \
    -X POST "http://localhost:8000/workflow" \
    -H "Content-Type: application/json" \
    -d '{"name":"tabular", "elements":["path to csv file"]}'
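The same request can be built from Python with the standard library. A sketch of constructing the call (the hostname and port assume the uvicorn command above; nothing is sent until urlopen runs):

```python
import json
from urllib.request import Request, urlopen

# Same payload as the curl command
payload = {"name": "tabular", "elements": ["path to csv file"]}

request = Request(
    "http://localhost:8000/workflow",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# With the API running:
# results = json.loads(urlopen(request).read())
```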

Methods

Python documentation for the pipeline.

__init__(self, idcolumn=None, textcolumns=None, content=False)

Creates a new Tabular pipeline.

Parameters:

Name | Description | Default
idcolumn | column name to use for row id | None
textcolumns | list of columns to combine as a text field | None
content | if True, a dict per row is generated with all fields. If content is a list, a subset of fields is included in the generated rows. | False
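The effect of the content parameter can be sketched on a single row. This is an illustrative mimic of the documented behavior, not the library code:

```python
def build(row, content):
    # Mimic how content shapes what is generated for a row
    if content is True:
        # Full dict per row with all fields
        return dict(row)
    if isinstance(content, list):
        # Subset of fields included in the generated row
        return {field: row[field] for field in content}
    # content=False: no dict is generated
    return None

row = {"id": 0, "text": "hello", "score": 0.5}
build(row, True)            # {"id": 0, "text": "hello", "score": 0.5}
build(row, ["id", "text"])  # {"id": 0, "text": "hello"}
```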

Source code in txtai/pipeline/data/tabular.py

  def __init__(self, idcolumn=None, textcolumns=None, content=False):
      """
      Creates a new Tabular pipeline.

      Args:
          idcolumn: column name to use for row id
          textcolumns: list of columns to combine as a text field
          content: if True, a dict per row is generated with all fields. If content is a list, a subset of fields
                   is included in the generated rows.
      """

      if not PANDAS:
          raise ImportError('Tabular pipeline is not available - install "pipeline" extra to enable')

      self.idcolumn = idcolumn
      self.textcolumns = textcolumns
      self.content = content

__call__(self, data)

Splits data into rows and columns.

Parameters:

Name | Description | Default
data | input data | required

Returns:

list of (id, text, tag)

Source code in txtai/pipeline/data/tabular.py

  def __call__(self, data):
      """
      Splits data into rows and columns.

      Args:
          data: input data

      Returns:
          list of (id, text, tag)
      """

      items = [data] if not isinstance(data, list) else data

      # Combine all rows into single return element
      results = []
      dicts = []

      for item in items:
          # File path
          if isinstance(item, str):
              _, extension = os.path.splitext(item)
              extension = extension.replace(".", "").lower()

              if extension == "csv":
                  df = pd.read_csv(item)

              results.append(self.process(df))

          # Dict
          if isinstance(item, dict):
              dicts.append(item)

          # List of dicts
          elif isinstance(item, list):
              df = pd.DataFrame(item)
              results.append(self.process(df))

      # Process accumulated dicts as a single batch
      if dicts:
          df = pd.DataFrame(dicts)
          results.extend(self.process(df))

      return results[0] if not isinstance(data, list) else results
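Note the input/output convention in __call__: a scalar input yields a scalar result, a list input yields a list of results. That pattern can be isolated into a small generic helper:

```python
def apply(data, handler):
    # Wrap a scalar input into a one-element batch
    items = [data] if not isinstance(data, list) else data

    # Process every item in the batch
    results = [handler(item) for item in items]

    # Unwrap the result when the caller passed a scalar
    return results[0] if not isinstance(data, list) else results

apply("a", str.upper)         # "A"
apply(["a", "b"], str.upper)  # ["A", "B"]
```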