Advanced HTML tokenization

Stripping HTML tags

html_strip

  html_strip = {0|1}

Whether to strip HTML markup from incoming full-text data. Optional, default is 0. Known values are 0 (disable stripping) and 1 (enable stripping).

Both HTML tags and entities are considered markup and get processed.

HTML tags are removed, but their contents (i.e., everything between <p> and </p>) are left intact by default. You can choose to keep and index attributes of the tags (e.g., the HREF attribute in an A tag, or ALT in an IMG one). Several well-known inline tags are removed completely; all other tags are treated as block-level and replaced with whitespace. For example, te<strong>st</strong> text will be indexed as a single keyword ‘test’, whereas te<p>st</p> will be indexed as two keywords, ‘te’ and ‘st’. The known inline tags are: A, B, I, S, U, BASEFONT, BIG, EM, FONT, IMG, LABEL, SMALL, SPAN, STRIKE, STRONG, SUB, SUP, TT.
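
The inline/block rule described above can be modelled with a few lines of Python. This is only a toy sketch for building intuition, not the stripper's actual implementation; the function name strip_tags and the regular expression are assumptions made for the illustration.

```python
import re

# Inline tags as listed in the text; any other tag is treated as block level.
INLINE_TAGS = {
    "a", "b", "i", "s", "u", "basefont", "big", "em", "font", "img",
    "label", "small", "span", "strike", "strong", "sub", "sup", "tt",
}

# Matches an opening or closing tag and captures its name (simplified).
TAG_RE = re.compile(r"</?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")


def strip_tags(text: str) -> str:
    """Remove inline tags entirely and replace all other tags with a space."""
    def replace(match):
        return "" if match.group(1).lower() in INLINE_TAGS else " "
    return TAG_RE.sub(replace, text)


print(strip_tags("te<strong>st</strong> text").split())  # ['test', 'text']
print(strip_tags("te<p>st</p>").split())                 # ['te', 'st']
```

Splitting the result on whitespace stands in for tokenization here; the real tokenizer applies its own character tables.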

HTML entities are decoded and replaced with the corresponding UTF-8 characters. The stripper supports both numeric forms (such as &#239;) and text forms (such as &oacute; or &nbsp;). All entities specified by the HTML4 standard are supported.
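
To see what decoding the entities mentioned above produces, Python's standard html.unescape can serve as a stand-in. Note that it follows HTML5 rules, so it is an approximation of the stripper's HTML4 entity handling rather than a description of it.

```python
import html

# Numeric and named forms decode to single Unicode characters.
numeric = html.unescape("&#239;")   # 'ï' (U+00EF)
named = html.unescape("&oacute;")   # 'ó' (U+00F3)
nbsp = html.unescape("&nbsp;")      # non-breaking space (U+00A0)

# The UTF-8 bytes that would end up in the indexed text.
print(numeric.encode("utf-8"))  # b'\xc3\xaf'
print(named.encode("utf-8"))    # b'\xc3\xb3'
```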

Stripping should work with properly formed HTML and XHTML but, just like most browsers, may produce unexpected results on malformed input (such as HTML with stray <'s or unclosed >'s).

Only the tags themselves, and also HTML comments, are stripped. To strip the contents of the tags too (e.g., to strip embedded scripts), see the html_remove_elements option. There are no restrictions on tag names; i.e., everything that looks like a valid tag start, tag end, or comment will be stripped.

SQL:

  CREATE TABLE products(title text, price float) html_strip = '1'

JSON:

  POST /cli -d "
  CREATE TABLE products(title text, price float) html_strip = '1'"

PHP:

  $index = new \Manticoresearch\Index($client);
  $index->setName('products');
  $index->create([
      'title'=>['type'=>'text'],
      'price'=>['type'=>'float']
  ],[
      'html_strip' => '1'
  ]);

Python:

  utilsApi.sql('CREATE TABLE products(title text, price float) html_strip = \'1\'')

javascript:

  res = await utilsApi.sql('CREATE TABLE products(title text, price float) html_strip = \'1\'');

Java:

  utilsApi.sql("CREATE TABLE products(title text, price float) html_strip = '1'");

CONFIG:

  table products {
    html_strip = 1
    type = rt
    path = tbl
    rt_field = title
    rt_attr_float = price
  }

html_index_attrs

  html_index_attrs = img=alt,title; a=title;

A list of markup attributes to index when stripping HTML. Optional, default is empty (do not index markup attributes).

Specifies HTML markup attributes whose contents should be retained and indexed even though other HTML markup is stripped. The format is a per-tag enumeration of indexable attributes, as shown above.
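
The per-tag format can be read as "tag=attr[,attr...]" entries separated by semicolons. The following sketch parses such a value into a dictionary; parse_html_index_attrs is a hypothetical helper written for this illustration and is not part of any Manticore client.

```python
from typing import Dict, List


def parse_html_index_attrs(value: str) -> Dict[str, List[str]]:
    """Split 'tag=attr,attr; tag=attr;' into {tag: [attr, ...]}."""
    result: Dict[str, List[str]] = {}
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # skip the empty piece after a trailing semicolon
        tag, _, attrs = entry.partition("=")
        result[tag.strip()] = [a.strip() for a in attrs.split(",") if a.strip()]
    return result


print(parse_html_index_attrs("img=alt,title; a=title;"))
# {'img': ['alt', 'title'], 'a': ['title']}
```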

SQL:

  CREATE TABLE products(title text, price float) html_index_attrs = 'img=alt,title; a=title;' html_strip = '1'

JSON:

  POST /cli -d "
  CREATE TABLE products(title text, price float) html_index_attrs = 'img=alt,title; a=title;' html_strip = '1'"

PHP:

  $index = new \Manticoresearch\Index($client);
  $index->setName('products');
  $index->create([
      'title'=>['type'=>'text'],
      'price'=>['type'=>'float']
  ],[
      'html_index_attrs' => 'img=alt,title; a=title;',
      'html_strip' => '1'
  ]);

Python:

  utilsApi.sql('CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = \'1\'')

javascript:

  res = await utilsApi.sql('CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = \'1\'');

Java:

  utilsApi.sql("CREATE TABLE products(title text, price float) html_index_attrs = 'img=alt,title; a=title;' html_strip = '1'");

CONFIG:

  table products {
    html_index_attrs = img=alt,title; a=title;
    html_strip = 1
    type = rt
    path = tbl
    rt_field = title
    rt_attr_float = price
  }

html_remove_elements

  html_remove_elements = element1[, element2, ...]

A list of HTML elements for which to strip contents along with the elements themselves. Optional, default is an empty string (do not strip contents of any elements).

This feature lets you strip element contents, i.e., everything between the opening and closing tags. It is useful for removing embedded scripts, CSS, etc. The short tag form for empty elements (e.g., <br/>) is properly supported; that is, the text that follows such a tag will not be removed.

The value is a comma-separated list of element (tag) names whose contents should be removed. Tag names are case insensitive.
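
The behaviour described above can be approximated with a regular expression. This is a simplified model under stated assumptions, not the stripper's code: remove_elements is a hypothetical helper, removed elements are replaced with a single space, and nested elements of the same name are not handled.

```python
import re


def remove_elements(text: str, elements: str) -> str:
    """Drop listed elements together with their contents (case-insensitive)."""
    names = [n.strip() for n in elements.split(",") if n.strip()]
    if not names:
        return text  # empty list: nothing is removed
    alternation = "|".join(re.escape(n) for n in names)
    # (?<!/) keeps the short form (e.g. <br/>) from being read as an opening tag,
    # so the text following it is left alone.
    pattern = re.compile(
        rf"<({alternation})\b[^>]*(?<!/)>.*?</\1\s*>",
        re.IGNORECASE | re.DOTALL,
    )
    return pattern.sub(" ", text)


print(remove_elements("a<SCRIPT type='x'>var x=1;</script>b", "style, script"))  # a b
print(remove_elements("x<br/>y", "br"))  # x<br/>y
```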

SQL:

  CREATE TABLE products(title text, price float) html_remove_elements = 'style, script' html_strip = '1'

JSON:

  POST /cli -d "
  CREATE TABLE products(title text, price float) html_remove_elements = 'style, script' html_strip = '1'"

PHP:

  $index = new \Manticoresearch\Index($client);
  $index->setName('products');
  $index->create([
      'title'=>['type'=>'text'],
      'price'=>['type'=>'float']
  ],[
      'html_remove_elements' => 'style, script',
      'html_strip' => '1'
  ]);

Python:

  utilsApi.sql('CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = \'1\'')

javascript:

  res = await utilsApi.sql('CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = \'1\'');

Java:

  utilsApi.sql("CREATE TABLE products(title text, price float) html_remove_elements = 'style, script' html_strip = '1'");

CONFIG:

  table products {
    html_remove_elements = style, script
    html_strip = 1
    type = rt
    path = tbl
    rt_field = title
    rt_attr_float = price
  }

Extracting important parts from HTML

index_sp

  index_sp = {0|1}

Whether to detect and index sentence and paragraph boundaries. Optional, default is 0 (do not detect and index).

This directive enables indexing of sentence and paragraph boundaries, which the SENTENCE and PARAGRAPH operators require. Sentence boundary detection is based on plain text analysis, so setting index_sp = 1 is enough to enable it. Paragraph detection, however, is based on HTML markup and happens in the HTML stripper, so to index paragraph boundaries you must also enable the stripper with html_strip = 1. Both types of boundaries are detected using a few built-in rules, listed below.

Sentence boundary detection rules are as follows.

  • Question marks and exclamation marks (? and !) are always sentence boundaries.
  • A trailing dot (.) is a sentence boundary, except:
    • When followed by a letter. That’s considered part of an abbreviation (as in “S.T.A.L.K.E.R” or “Goldman Sachs S.p.A.”).
    • When followed by a comma. That’s considered an abbreviation followed by a comma (as in “Telecom Italia S.p.A., founded in 1994”).
    • When followed by a space and a lowercase letter. That’s considered an abbreviation within a sentence (as in “News Corp. announced in February”).
    • When preceded by a space and a capital letter, and followed by a space. That’s considered a middle initial (as in “John D. Doe”).
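
The rules above translate almost directly into code. The sketch below is an illustrative reading of those rules, not the server's implementation; the function names are hypothetical, and Python's isalpha/islower/isupper are assumed to be an adequate notion of "letter", "lowercase", and "capital".

```python
from typing import List


def is_sentence_boundary(text: str, i: int) -> bool:
    """Decide whether the character at position i ends a sentence."""
    ch = text[i]
    if ch in "?!":
        return True
    if ch != ".":
        return False
    nxt = text[i + 1] if i + 1 < len(text) else ""
    nxt2 = text[i + 2] if i + 2 < len(text) else ""
    if nxt.isalpha():                    # part of an abbreviation: S.p.A
        return False
    if nxt == ",":                       # abbreviation followed by a comma
        return False
    if nxt == " " and nxt2.islower():    # abbreviation within a sentence
        return False
    if i >= 2 and text[i - 2] == " " and text[i - 1].isupper() and nxt == " ":
        return False                     # middle initial: John D. Doe
    return True


def sentence_boundaries(text: str) -> List[int]:
    """Return the positions of all characters that end a sentence."""
    return [i for i in range(len(text)) if is_sentence_boundary(text, i)]


print(sentence_boundaries("John D. Doe left. Why?"))  # [16, 21]
```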

Paragraph boundaries are inserted at every block-level HTML tag. Namely, these are (as taken from the HTML 4 standard) ADDRESS, BLOCKQUOTE, CAPTION, CENTER, DD, DIV, DL, DT, H1, H2, H3, H4, H5, LI, MENU, OL, P, PRE, TABLE, TBODY, TD, TFOOT, TH, THEAD, TR, and UL.

Both sentences and paragraphs increment the keyword position counter by 1.

SQL:

  CREATE TABLE products(title text, price float) index_sp = '1' html_strip = '1'

JSON:

  POST /cli -d "
  CREATE TABLE products(title text, price float) index_sp = '1' html_strip = '1'"

PHP:

  $index = new \Manticoresearch\Index($client);
  $index->setName('products');
  $index->create([
      'title'=>['type'=>'text'],
      'price'=>['type'=>'float']
  ],[
      'index_sp' => '1',
      'html_strip' => '1'
  ]);

Python:

  utilsApi.sql('CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = \'1\'')

javascript:

  res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = \'1\'');

Java:

  utilsApi.sql("CREATE TABLE products(title text, price float) index_sp = '1' html_strip = '1'");

CONFIG:

  table products {
    index_sp = 1
    html_strip = 1
    type = rt
    path = tbl
    rt_field = title
    rt_attr_float = price
  }

index_zones

  index_zones = h*, th, title

A list of in-field HTML/XML zones to index. Optional, default is empty (do not index zones).

Zones can be formally defined as follows. Everything between an opening tag and a matching closing tag is called a span, and the aggregate of all spans sharing the same tag name is called a zone. For instance, everything between the occurrences of <H1> and </H1> in the document field belongs to the H1 zone.

Zone indexing, enabled by the index_zones directive, is an optional extension of the HTML stripper, so it also requires that the stripper is enabled (with html_strip = 1). The value of index_zones should be a comma-separated list of the tag names and wildcards (ending with a star) that should be indexed as zones.

Zones can nest and overlap arbitrarily. The only requirement is that every opening tag has a matching closing tag. You can also have an arbitrary number of both zones (as in unique zone names, such as H1) and spans (all the occurrences of those H1 tags) in a document. Once indexed, zones can be used for matching with the ZONE operator; see extended_query_syntax.
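
To make the span/zone terminology concrete, the following sketch collects the spans of each zone from a small HTML fragment. It is a conceptual model only: zone_spans and is_zone are hypothetical names, a trailing star is interpreted as a simple prefix match, and the tag regex is deliberately simplified.

```python
import re
from typing import Dict, List

TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def is_zone(tag: str, patterns: List[str]) -> bool:
    """Match a tag name against exact names and 'prefix*' wildcards."""
    for p in patterns:
        if p.endswith("*"):
            if tag.startswith(p[:-1]):
                return True
        elif tag == p:
            return True
    return False


def zone_spans(html: str, index_zones: str) -> Dict[str, List[str]]:
    """Return {zone name: [span contents, ...]} for the configured zones."""
    patterns = [z.strip().lower() for z in index_zones.split(",") if z.strip()]
    spans: Dict[str, List[str]] = {}
    open_tags = []  # (tag name, offset where its content starts)
    for m in TAG_RE.finditer(html):
        tag = m.group(2).lower()
        if not is_zone(tag, patterns):
            continue
        if m.group(1) != "/":
            open_tags.append((tag, m.end()))
            continue
        # Closing tag: pair it with the most recent opening tag of the same name.
        for k in range(len(open_tags) - 1, -1, -1):
            if open_tags[k][0] == tag:
                _, start = open_tags.pop(k)
                spans.setdefault(tag, []).append(html[start:m.start()])
                break
    return spans


print(zone_spans("<h1>Title</h1><p>body</p><H1>Again</H1>", "h*, th, title"))
# {'h1': ['Title', 'Again']}
```

Here the h1 zone consists of two spans, while the P tag is ignored because it is not listed in index_zones.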

SQL:

  CREATE TABLE products(title text, price float) index_zones = 'h*, th, title' html_strip = '1'

JSON:

  POST /cli -d "
  CREATE TABLE products(title text, price float) index_zones = 'h*, th, title' html_strip = '1'"

PHP:

  $index = new \Manticoresearch\Index($client);
  $index->setName('products');
  $index->create([
      'title'=>['type'=>'text'],
      'price'=>['type'=>'float']
  ],[
      'index_zones' => 'h*,th,title',
      'html_strip' => '1'
  ]);

Python:

  utilsApi.sql('CREATE TABLE products(title text, price float) index_zones = \'h*, th, title\' html_strip = \'1\'')

javascript:

  res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_zones = \'h*, th, title\' html_strip = \'1\'');

Java:

  utilsApi.sql("CREATE TABLE products(title text, price float) index_zones = 'h*, th, title' html_strip = '1'");

CONFIG:

  table products {
    index_zones = h*, th, title
    html_strip = 1
    type = rt
    path = tbl
    rt_field = title
    rt_attr_float = price
  }

Creating a distributed table

Manticore supports distributed tables. They look like regular plain or real-time tables, but internally they are just a ‘proxy’: a named collection of child tables that are used for the actual searching. When a query is directed at such a table, it is distributed among all tables in the collection. The server collects the responses and processes them as necessary:

  • applies sorting
  • recalculates final values of aggregates, etc

From the client’s standpoint this is transparent, as if you had queried a single table.
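
The collect-and-merge step can be pictured with a small model. This is a conceptual sketch only, not how the server is implemented: the result layout (a "total" count plus "rows" of (weight, doc_id) pairs already sorted by descending weight) and the function name merge_child_results are assumptions made for the illustration.

```python
import heapq
from itertools import islice
from typing import Dict, List


def merge_child_results(child_results: List[Dict], limit: int) -> Dict:
    """Merge per-child result sets into one sorted, limited result set."""
    # Each child's rows are already sorted by descending weight, so a k-way
    # merge on the negated weight yields a globally sorted stream.
    merged = heapq.merge(
        *(r["rows"] for r in child_results),
        key=lambda row: -row[0],
    )
    return {
        # Aggregates are recalculated over all children; here just a total.
        "total": sum(r["total"] for r in child_results),
        "rows": list(islice(merged, limit)),
    }


child_a = {"total": 3, "rows": [(9, 1), (5, 2), (1, 3)]}
child_b = {"total": 2, "rows": [(7, 10), (2, 11)]}
print(merge_child_results([child_a, child_b], limit=3))
# {'total': 5, 'rows': [(9, 1), (7, 10), (5, 2)]}
```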

Distributed tables can be composed from any other tables fitting your requirements.

Nested distributed tables must be declared with agent, even if they are on the same machine. Distributed tables cannot be declared with local; if they are, they will be ignored.

Percolate and template tables should not be mixed with plain and/or RT tables.

A distributed table is defined with type ‘distributed’ in the configuration file or via the SQL CREATE TABLE statement.

In a configuration file

  table foo {
    type = distributed
    local = bar
    local = bar1, bar2
    agent = 127.0.0.1:9312:baz
    agent = host1|host2:tbl
    agent = host1:9301:tbl1|host2:tbl2 [ha_strategy=random retry_count=10]
    ...
  }

Via SQL

  CREATE TABLE distributed_index type='distributed' local='local_index' agent='127.0.0.1:9312:remote_index'

Children

Either way, the key component of a distributed table is its list of children (the tables it points to).

  • Lines starting with local = enumerate local tables, served by the same server. Several local tables may be written as several local = lines or combined into one comma-separated list.
  • Lines starting with agent = enumerate remote tables, served anywhere. Each line represents one agent, or endpoint.

Each agent can include several external locations and options specifying how to work with them.
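
As a purely syntactic illustration of the agent values shown in the configuration example, the sketch below splits one value into its '|'-separated locations and its bracketed options. The helper split_agent is hypothetical, and the sketch makes no claim about how the server interprets each location (default ports, shared table names, and so on).

```python
import re
from typing import Dict, List, Tuple

AGENT_RE = re.compile(r"\s*([^\[\]]+?)\s*(?:\[(.*)\])?\s*")


def split_agent(value: str) -> Tuple[List[str], Dict[str, str]]:
    """Split an agent value into raw location strings and key=value options."""
    m = AGENT_RE.fullmatch(value)
    if m is None:
        raise ValueError("unrecognized agent value: %r" % value)
    locations = [loc.strip() for loc in m.group(1).split("|")]
    options = dict(opt.split("=", 1) for opt in (m.group(2) or "").split())
    return locations, options


print(split_agent("host1:9301:tbl1|host2:tbl2 [ha_strategy=random retry_count=10]"))
# (['host1:9301:tbl1', 'host2:tbl2'], {'ha_strategy': 'random', 'retry_count': '10'})
```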

Note that for remote tables the server knows nothing about the table type, which may cause errors if, say, you issue CALL PQ to a remote table ‘foo’ that is not a percolate table.

Creating a local distributed table

A distributed table in Manticore Search doesn’t hold any data. Instead, it acts as a ‘master node’ that proxies the query to other tables and merges the responses it receives from the ‘node’ tables. A distributed table can connect to local tables or to tables located on other servers. The simplest example of a distributed table looks like this:

Configuration file:

  table index_dist {
    type = distributed
    local = index1
    local = index2
    ...
  }

RT mode:

  CREATE TABLE local_dist type='distributed' local='index1' local='index2';

PHP:

  $params = [
      'body' => [
          'settings' => [
              'type' => 'distributed',
              'local' => [
                  'index1',
                  'index2'
              ]
          ]
      ],
      'index' => 'products'
  ];
  $index = new \Manticoresearch\Index($client);
  $index->create($params);

Python:

  utilsApi.sql('CREATE TABLE local_dist type=\'distributed\' local=\'index1\' local=\'index2\'')

javascript:

  res = await utilsApi.sql('CREATE TABLE local_dist type=\'distributed\' local=\'index1\' local=\'index2\'');

Java:

  utilsApi.sql("CREATE TABLE local_dist type='distributed' local='index1' local='index2'");