Fetching from TSV and CSV

This is the simplest way to pass data to the indexer. It was created because of the limitations of xmlpipe2: there, the indexer must map each attribute and field tag in the XML file to a corresponding schema element, and this mapping takes time that grows with the number of fields and attributes in the schema. tsvpipe has no such issue, as each field and attribute is simply a dedicated column in the TSV file, so in some cases tsvpipe can work slightly faster than xmlpipe2.

File format

The first column in a TSV/CSV file must be the document ID. The remaining columns must mirror the declaration of fields and attributes in the schema definition. Note that you don't need to declare the document ID in the schema: it is always assumed to be present, must be in the first column, and must be a unique, positive, non-zero 64-bit integer.

The difference between tsvpipe and csvpipe is the delimiter and the quoting rules. tsvpipe uses the tab character as a hardcoded delimiter and has no quoting rules. csvpipe has the csvpipe_delimiter option for the delimiter (the default value is ,) and follows quoting rules such as the following (a sample quoted row is shown after the list):

  • any field may be quoted
  • fields containing a line break, double quote, or comma should be quoted
  • a double quote character in a field must be represented by two double quote characters
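
For illustration, here is a hypothetical row (not taken from the examples below) where the second column contains a comma and the third contains embedded double quotes, so both must be quoted:

  1,"Crosby, Stills & Nash","He said ""hello"""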

Declaration of TSV stream

The tsvpipe_command directive is mandatory and contains the shell command that is invoked to produce the TSV stream to be indexed. The command can simply read a TSV file, but it can also be a program that generates the tab-delimited content on the fly.
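
For instance (the paths, credentials, and query below are placeholders, not part of any example in this section), the command can read a ready-made file or stream rows straight out of a database; mysql in batch mode happens to emit tab-separated output. The two directives below are alternatives, not meant to be used together:

  # read an existing tab-separated file
  tsvpipe_command = cat /path/to/data.tsv

  # or generate the stream on the fly; --batch makes mysql print
  # tab-separated rows, --skip-column-names drops the header line
  tsvpipe_command = mysql --batch --skip-column-names -u user -pPASS mydb \
    -e "SELECT id, title, body FROM documents"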

TSV indexed columns

The following directives can be used to declare the types of the indexed columns:

  • tsvpipe_field - declares a text field
  • tsvpipe_field_string - declares a text field/string attribute; the column will be indexed as a text field and also stored as a string attribute
  • tsvpipe_attr_uint - declares an integer attribute
  • tsvpipe_attr_timestamp - declares a timestamp attribute
  • tsvpipe_attr_bool - declares a boolean attribute
  • tsvpipe_attr_float - declares a float attribute
  • tsvpipe_attr_bigint - declares a big integer attribute
  • tsvpipe_attr_multi - declares a multi-value attribute with integers
  • tsvpipe_attr_multi_64 - declares a multi-value attribute with 64-bit integers
  • tsvpipe_attr_string - declares a string attribute
  • tsvpipe_attr_json - declares a JSON attribute
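
Columns are mapped to these declarations by position: the first column is always the document ID, the second goes to the first declared directive, the third to the second, and so on. For instance, with the hypothetical declarations below, every TSV row must contain the ID, then the title text, then the price:

  tsvpipe_field = title
  tsvpipe_attr_uint = price

and a matching tab-separated row would be:

  42	Red kettle	1999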

Example of a source using a TSV file

  source tsv_test
  {
    type = tsvpipe
    tsvpipe_command = cat /tmp/rock_bands.tsv
    tsvpipe_field = name
    tsvpipe_attr_multi = genre_tags
  }

The contents of /tmp/rock_bands.tsv (tab-separated):

  1	Led Zeppelin	35,23,16
  2	Deep Purple	35,92
  3	Frank Zappa	35,23,16,92,33,24
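
To build something searchable from this source, it would be referenced from a table definition and processed with indexer; the table name and path below are illustrative only:

  table rock_bands
  {
    source = tsv_test
    path = /path/to/rock_bands
  }

  indexer rock_bands --rotate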

Declaration of CSV stream

The csvpipe_command directive is mandatory and contains the shell command that is invoked to produce the CSV stream to be indexed. The command can simply read a CSV file, but it can also be a program that generates the comma-delimited content on the fly.
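
If the data uses a delimiter other than a comma, the csvpipe_delimiter option mentioned earlier overrides it; the file name, semicolon delimiter, and columns below are just an illustration:

  source semicolon_test
  {
    type = csvpipe
    csvpipe_command = cat /tmp/products.csv
    csvpipe_delimiter = ;
    csvpipe_field = title
    csvpipe_attr_uint = price
  }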

CSV indexed columns

The following directives can be used to declare the types of the indexed columns:

  • csvpipe_field - declares a text field
  • csvpipe_field_string - declares a text field/string attribute; the column will be indexed as a text field and also stored as a string attribute
  • csvpipe_attr_uint - declares an integer attribute
  • csvpipe_attr_timestamp - declares a timestamp attribute
  • csvpipe_attr_bool - declares a boolean attribute
  • csvpipe_attr_float - declares a float attribute
  • csvpipe_attr_bigint - declares a big integer attribute
  • csvpipe_attr_multi - declares a multi-value attribute with integers
  • csvpipe_attr_multi_64 - declares a multi-value attribute with 64-bit integers
  • csvpipe_attr_string - declares a string attribute
  • csvpipe_attr_json - declares a JSON attribute

Example of a source using a CSV file

  source csv_test
  {
    type = csvpipe
    csvpipe_command = cat /tmp/rock_bands.csv
    csvpipe_field = name
    csvpipe_attr_multi = genre_tags
  }

The contents of /tmp/rock_bands.csv:

  1,"Led Zeppelin","35,23,16"
  2,"Deep Purple","35,92"
  3,"Frank Zappa","35,23,16,92,33,24"

Main+delta schema

A frequent situation is that the total dataset is too big to be rebuilt from scratch often, while the number of new records is rather small. Example: a forum with 1,000,000 archived posts but only 1,000 new posts per day.

In this case, "live" (almost real-time) table updates can be implemented using the so-called "main+delta" scheme.

The idea is to set up two sources and two tables: one "main" table for the data that changes rarely (if ever), and one "delta" table for the new documents. In the example above, the 1,000,000 archived posts would go to the main table, while the 1,000 newly inserted posts per day would go to the delta table. The delta table can then be rebuilt very frequently, making new documents available for search within minutes. Deciding which documents go to which table and rebuilding the main table can also be fully automated. One option is to maintain a counter table that tracks the ID at which the document set is split, and to update it whenever the main table is rebuilt.

Example: Fully automated live updates

  # in MySQL
  CREATE TABLE sph_counter
  (
    counter_id INTEGER PRIMARY KEY NOT NULL,
    max_doc_id INTEGER NOT NULL
  );

  # in sphinx.conf
  source main
  {
    # ...
    sql_query_pre = SET NAMES utf8
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents
    sql_query = SELECT id, title, body FROM documents \
      WHERE id<=( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
  }

  source delta : main
  {
    sql_query_pre = SET NAMES utf8
    sql_query = SELECT id, title, body FROM documents \
      WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
  }

  table main
  {
    source = main
    path = /path/to/main
    # ... all the other settings
  }

  # note how all other settings are copied from main,
  # but source and path are overridden (they MUST be)
  table delta : main
  {
    source = delta
    path = /path/to/delta
  }
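
A minimal way to schedule the rebuilds, assuming the default indexer tool and the table names from the example (the exact schedule is only an illustration), is to rebuild the delta frequently and the main table during off-peak hours:

  # every few minutes (e.g. from cron): pick up the newest documents
  indexer --rotate delta

  # once a day, off-peak: rebuild main (its sql_query_pre advances the
  # max_doc_id counter), then rebuild delta so it no longer contains
  # the documents that just moved into main
  indexer --rotate main delta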

A better split variable is a timestamp column rather than the ID, since timestamps can track not only new documents but also modified ones.
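
A sketch of that variant, assuming the documents table has an updated_at column and reusing sph_counter with a second counter row (the source names, column name, and counter id here are hypothetical, not part of the original example):

  source main_ts
  {
    # ...
    sql_query_pre = SET NAMES utf8
    sql_query_pre = REPLACE INTO sph_counter \
      SELECT 2, UNIX_TIMESTAMP(MAX(updated_at)) FROM documents
    sql_query = SELECT id, title, body FROM documents \
      WHERE UNIX_TIMESTAMP(updated_at)<=( SELECT max_doc_id FROM sph_counter WHERE counter_id=2 )
  }

  source delta_ts : main_ts
  {
    sql_query_pre = SET NAMES utf8
    sql_query = SELECT id, title, body FROM documents \
      WHERE UNIX_TIMESTAMP(updated_at)>( SELECT max_doc_id FROM sph_counter WHERE counter_id=2 )
  }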

For datasets where documents can be modified or deleted, the delta table should also provide a list of documents that have changed, so that they can be suppressed and not used in search queries. This is achieved with the feature called kill lists. The document IDs to be killed can be provided in an auxiliary query defined by sql_query_killlist. The delta table must specify the tables to which the kill list will be applied using the killlist_target directive. The effect of a kill list is permanent on the target table, meaning that even if a search is made without the delta table, the suppressed documents will not appear in the results.
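
Continuing the example above, here is a sketch of how this could be wired up; the deleted_documents bookkeeping table is an assumption for illustration, not part of the original example:

  source delta : main
  {
    sql_query_pre = SET NAMES utf8
    sql_query = SELECT id, title, body FROM documents \
      WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
    # IDs to suppress in the target table; deleted_documents is a
    # hypothetical table where the application records removed documents
    sql_query_killlist = SELECT id FROM deleted_documents
  }

  table delta : main
  {
    source = delta
    path = /path/to/delta
    # apply this table's kill list to the main table
    killlist_target = main
  }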

Note how we override sql_query_pre in the delta source. This override has to be explicit; otherwise the REPLACE query would also run when building the delta source, effectively nullifying it. However, the first time a directive is issued in an inherited source, it removes all inherited values for that directive, so the encoding setup is lost as well. Therefore sql_query_pre in the delta cannot simply be left empty; we need to issue the encoding setup query explicitly once again.

Adding data from tables