3.3. Search

Search indexes enable you to query a database by using the Lucene Query Parser Syntax. A search index uses one, or multiple, fields from your documents. You can use a search index to run queries, find documents based on the content they contain, or work with groups, facets, or geographical searches.

To create a search index, you add a JavaScript function to a design document in the database. An index builds after processing one search request or after the server detects a document update. The index function takes the following parameters:

1. Field name - The name of the field you want to use when you query the index. If you set this parameter to default, then this field is queried if no field is specified in the query syntax.

2. Data that you want to index, for example, doc.address.country.

3. (Optional) The third parameter includes the following fields: boost, facet, index, and store. These fields are described in more detail later.

By default, a search index response returns 25 rows. The number of rows that is returned can be changed by using the limit parameter. Each response includes a bookmark field. You can include the value of the bookmark field in later queries to look through the responses.
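Paging with limit and bookmark can be sketched as follows. This is a minimal sketch, not part of CouchDB: the helper name buildSearchPath, the database name animals_db, and the bookmark placeholder are assumptions made for illustration.

```javascript
// Hypothetical helper: builds the request path for one page of search results.
function buildSearchPath(db, ddoc, indexName, query, limit, bookmark) {
  const params = new URLSearchParams({ q: query, limit: String(limit) });
  if (bookmark) {
    // Pass the bookmark from the previous response to continue after its rows.
    params.set("bookmark", bookmark);
  }
  return `/${db}/_design/${ddoc}/_search/${indexName}?${params.toString()}`;
}

// First page: no bookmark yet.
const firstPage = buildSearchPath("animals_db", "search_example", "animals", "class:bird", 10);

// Later page: reuse the bookmark value returned with the previous page.
const nextPage = buildSearchPath(
  "animals_db", "search_example", "animals", "class:bird", 10,
  "BOOKMARK_FROM_PREVIOUS_RESPONSE"
);

console.log(firstPage);
console.log(nextPage);
```

URLSearchParams percent-encodes the query, so class:bird is sent as class%3Abird.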

Example design document that defines a search index:

    {
        "_id": "_design/search_example",
        "indexes": {
            "animals": {
                "index": "function(doc){ ... }"
            }
        }
    }

A search index will inherit the partitioning type from the options.partitioned field of the design document that contains it.

3.3.1. Index functions

Attempting to index by using a data field that does not exist fails. To avoid this problem, use the appropriate guard clause.

    index("myfield", "this is a string");
    index("myfield", 123);

The function that is contained in the index field is a JavaScript function that is called for each document in the database. The function takes the document as a parameter, extracts some data from it, and then calls the built-in index function to index that data.

The index function takes three parameters, where the third parameter is optional.

The first parameter is the name of the field you intend to use when querying the index, and which is specified in the Lucene syntax portion of subsequent queries. An example appears in the following query:

    query=color:red

The Lucene field name color is the first parameter of the index function.

The query parameter can be abbreviated to q, so another way of writing the query is as follows:

    q=color:red

If the special value "default" is used when you define the name, you do not have to specify a field name at query time. The effect is that the query can be simplified:

    query=red

The second parameter is the data to be indexed. Keep the following information in mind when you index your data:

  • This data must be only a string, number, or boolean. Other types will cause an error to be thrown by the index function call.
  • If an error is thrown when running your function, for this reason or others, the document will not be added to that search index.

The third, optional, parameter is a JavaScript object with the following fields:

Index function (optional parameter)

  • boost - A number that specifies the relevance in search results. Content that is indexed with a boost value greater than 1 is more relevant than content that is indexed without a boost value. Content with a boost value less than 1 is less relevant. The value is a positive floating point number. Default is 1 (no boosting).
  • facet - Creates a faceted index. See Faceting. Values are true or false. Default is false.
  • index - Whether the data is indexed, and if so, how. If set to false, the data cannot be used for searches, but can still be retrieved from the index if store is set to true. See Analyzers. Values are true or false. Default is true.
  • store - If true, the value is returned in the search result; otherwise, the value is not returned. Values are true or false. Default is false.

Example search index function:

    function(doc) {
        index("default", doc._id);
        if (doc.min_length) {
            index("min_length", doc.min_length, {"store": true});
        }
        if (doc.diet) {
            index("diet", doc.diet, {"store": true});
        }
        if (doc.latin_name) {
            index("latin_name", doc.latin_name, {"store": true});
        }
        if (doc.class) {
            index("class", doc.class, {"store": true});
        }
    }
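To see which calls an index function makes for a given document, you can run it outside CouchDB against a stand-in for index. The following is a minimal sketch: the recording index stub, the name animalIndex, and the sample document are assumptions made for illustration, and the function body is a shortened version of the example above.

```javascript
// Stand-in for the index function that CouchDB provides; it records each call.
const indexed = [];
function index(name, value, options) {
  indexed.push({ name, value, options: options || {} });
}

// Shortened version of the example search index function.
function animalIndex(doc) {
  index("default", doc._id);
  if (doc.min_length) {
    index("min_length", doc.min_length, { store: true });
  }
  if (doc.diet) {
    index("diet", doc.diet, { store: true });
  }
}

// Hypothetical sample document.
animalIndex({ _id: "badger", min_length: 0.6, diet: "omnivore" });

console.log(indexed.map((entry) => entry.name)); // [ 'default', 'min_length', 'diet' ]
```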

3.3.1.1. Index guard clauses

The index function requires the name of the data field to index as the second parameter. However, if that data field does not exist for the document, an error occurs. The solution is to use an appropriate ‘guard clause’ that checks if the field exists, and contains the expected type of data, before any attempt to create the corresponding index.

Example of a guard that tests only truthiness, and therefore fails to check reliably whether the index data field exists (a valid value such as 0 is skipped):

    if (doc.min_length) {
        index("min_length", doc.min_length, {"store": true});
    }

You can use the JavaScript typeof operator to implement the guard clause test. If the field exists and has the expected type, typeof returns the expected type name, so the guard clause test succeeds and it is safe to call the index function. If the field does not exist, typeof does not return the expected type name, so the test fails and no attempt is made to index the field.

JavaScript considers a result to be false if one of the following values is tested:

  • undefined
  • null
  • The number +0
  • The number -0
  • NaN (not a number)
  • "" (the empty string)

Using a guard clause to check whether the required data field exists, and holds a number, before an attempt to index:

    if (typeof(doc.min_length) === 'number') {
        index("min_length", doc.min_length, {"store": true});
    }

Use a generic guard clause test to ensure that the type of the candidate data field is defined.

Example of a ‘generic’ guard clause:

    if (typeof(doc.min_length) !== 'undefined') {
        // The field exists, and does have a type, so we can proceed to index using it.
        ...
    }
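The difference between a truthiness test and a typeof test can be checked with a small harness. This is a sketch under stated assumptions: inside CouchDB the index function is available globally, whereas here it is passed in as a parameter so that the calls can be recorded, and the names truthyGuard, typeGuard, and run are hypothetical.

```javascript
// Truthiness guard: skips the field when the value is falsy, even if it is valid.
function truthyGuard(doc, index) {
  if (doc.min_length) {
    index("min_length", doc.min_length, { store: true });
  }
}

// Type guard: indexes any numeric value, including 0.
function typeGuard(doc, index) {
  if (typeof doc.min_length === "number") {
    index("min_length", doc.min_length, { store: true });
  }
}

// Hypothetical harness: runs a guard against a document and collects indexed values.
function run(guard, doc) {
  const values = [];
  guard(doc, (name, value) => values.push(value));
  return values;
}

console.log(run(truthyGuard, { min_length: 0 })); // []
console.log(run(typeGuard, { min_length: 0 }));   // [ 0 ]
console.log(run(typeGuard, { min_length: "2" })); // []
console.log(run(typeGuard, {}));                  // []
```

The truthiness guard silently drops a document whose min_length is 0, while the type guard indexes it and still rejects missing or non-numeric values.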

3.3.2. Analyzers

Analyzers are settings that define how to recognize terms within text. Analyzers can be helpful if you need to index multiple languages.

Here’s the list of generic analyzers, and their descriptions, that are supported by search:

  • classic - The standard Lucene analyzer, circa release 3.1.
  • email - Like the standard analyzer, but tries harder to match an email address as a complete token.
  • keyword - Input is not tokenized at all.
  • simple - Divides text at non-letters.
  • standard - The default analyzer. It implements the Word Break rules from the Unicode Text Segmentation algorithm.
  • whitespace - Divides text at white space boundaries.
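
To build an intuition for how the simpler analyzers differ, the following sketch emulates three of them in plain JavaScript. These are rough approximations written for illustration, not the Lucene implementations; the object name emulatedAnalyzers and the sample text are assumptions, and the lowercasing in the simple emulation reflects the behavior of the Lucene simple analyzer rather than anything stated in the list above.

```javascript
// Rough emulations for illustration only; not the Lucene implementations.
const emulatedAnalyzers = {
  // keyword: the whole input is a single token.
  keyword: (text) => [text],
  // whitespace: split at runs of white space, dropping empty tokens.
  whitespace: (text) => text.split(/\s+/).filter((token) => token.length > 0),
  // simple: keep runs of letters only (lowercased, as the Lucene simple analyzer does).
  simple: (text) => text.toLowerCase().match(/\p{L}+/gu) || [],
};

const sample = "Contact ablanks@renovations.com";
for (const [name, analyze] of Object.entries(emulatedAnalyzers)) {
  console.log(name, analyze(sample));
}
// keyword    [ 'Contact ablanks@renovations.com' ]
// whitespace [ 'Contact', 'ablanks@renovations.com' ]
// simple     [ 'contact', 'ablanks', 'renovations', 'com' ]
```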

Example analyzer document:

    {
        "_id": "_design/analyzer_example",
        "indexes": {
            "INDEX_NAME": {
                "index": "function (doc) { ... }",
                "analyzer": "$ANALYZER_NAME"
            }
        }
    }

3.3.2.1. Language-specific analyzers

These analyzers omit common words in the specific language, and many also remove prefixes and suffixes. The name of the language is also the name of the analyzer. See package org.apache.lucene.analysis for more information.

Language      Analyzer
arabic        org.apache.lucene.analysis.ar.ArabicAnalyzer
armenian      org.apache.lucene.analysis.hy.ArmenianAnalyzer
basque        org.apache.lucene.analysis.eu.BasqueAnalyzer
bulgarian     org.apache.lucene.analysis.bg.BulgarianAnalyzer
brazilian     org.apache.lucene.analysis.br.BrazilianAnalyzer
catalan       org.apache.lucene.analysis.ca.CatalanAnalyzer
cjk           org.apache.lucene.analysis.cjk.CJKAnalyzer
chinese       org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer
czech         org.apache.lucene.analysis.cz.CzechAnalyzer
danish        org.apache.lucene.analysis.da.DanishAnalyzer
dutch         org.apache.lucene.analysis.nl.DutchAnalyzer
english       org.apache.lucene.analysis.en.EnglishAnalyzer
finnish       org.apache.lucene.analysis.fi.FinnishAnalyzer
french        org.apache.lucene.analysis.fr.FrenchAnalyzer
german        org.apache.lucene.analysis.de.GermanAnalyzer
greek         org.apache.lucene.analysis.el.GreekAnalyzer
galician      org.apache.lucene.analysis.gl.GalicianAnalyzer
hindi         org.apache.lucene.analysis.hi.HindiAnalyzer
hungarian     org.apache.lucene.analysis.hu.HungarianAnalyzer
indonesian    org.apache.lucene.analysis.id.IndonesianAnalyzer
irish         org.apache.lucene.analysis.ga.IrishAnalyzer
italian       org.apache.lucene.analysis.it.ItalianAnalyzer
japanese      org.apache.lucene.analysis.ja.JapaneseAnalyzer
japanese      org.apache.lucene.analysis.ja.JapaneseTokenizer
latvian       org.apache.lucene.analysis.lv.LatvianAnalyzer
norwegian     org.apache.lucene.analysis.no.NorwegianAnalyzer
persian       org.apache.lucene.analysis.fa.PersianAnalyzer
polish        org.apache.lucene.analysis.pl.PolishAnalyzer
portuguese    org.apache.lucene.analysis.pt.PortugueseAnalyzer
romanian      org.apache.lucene.analysis.ro.RomanianAnalyzer
russian       org.apache.lucene.analysis.ru.RussianAnalyzer
spanish       org.apache.lucene.analysis.es.SpanishAnalyzer
swedish       org.apache.lucene.analysis.sv.SwedishAnalyzer
thai          org.apache.lucene.analysis.th.ThaiAnalyzer
turkish       org.apache.lucene.analysis.tr.TurkishAnalyzer

3.3.2.2. Per-field analyzers

The perfield analyzer configures multiple analyzers for different fields.

Example of defining different analyzers for different fields:

    {
        "_id": "_design/analyzer_example",
        "indexes": {
            "INDEX_NAME": {
                "analyzer": {
                    "name": "perfield",
                    "default": "english",
                    "fields": {
                        "spanish": "spanish",
                        "german": "german"
                    }
                },
                "index": "function (doc) { ... }"
            }
        }
    }

3.3.2.3. Stop words

Stop words are words that do not get indexed. You define them within a design document by turning the analyzer string into an object.

The default stop words for the standard analyzer are included below:

    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with"

Example of defining non-indexed (‘stop’) words:

    {
        "_id": "_design/stop_words_example",
        "indexes": {
            "INDEX_NAME": {
                "analyzer": {
                    "name": "portuguese",
                    "stopwords": [
                        "foo",
                        "bar",
                        "baz"
                    ]
                },
                "index": "function (doc) { ... }"
            }
        }
    }

3.3.2.4. Testing analyzer tokenization

You can test the results of analyzer tokenization by posting sample data to the _search_analyze endpoint.

Example of using HTTP to test the keyword analyzer:

    POST /_search_analyze HTTP/1.1
    Content-Type: application/json

    {"analyzer":"keyword", "text":"ablanks@renovations.com"}

Example of using the command line to test the keyword analyzer:

    curl "https://$HOST:5984/_search_analyze" -H 'Content-Type: application/json' \
        -d '{"analyzer":"keyword", "text":"ablanks@renovations.com"}'

Result of testing the keyword analyzer:

    {
        "tokens": [
            "ablanks@renovations.com"
        ]
    }

Example of using HTTP to test the standard analyzer:

    POST /_search_analyze HTTP/1.1
    Content-Type: application/json

    {"analyzer":"standard", "text":"ablanks@renovations.com"}

Example of using the command line to test the standard analyzer:

    curl "https://$HOST:5984/_search_analyze" -H 'Content-Type: application/json' \
        -d '{"analyzer":"standard", "text":"ablanks@renovations.com"}'

Result of testing the standard analyzer:

    {
        "tokens": [
            "ablanks",
            "renovations.com"
        ]
    }

3.3.3. Queries

After you create a search index, you can query it.

  • Issue a partition query using: GET /$DATABASE/_partition/$PARTITION_KEY/_design/$DDOC/_search/$INDEX_NAME
  • Issue a global query using: GET /$DATABASE/_design/$DDOC/_search/$INDEX_NAME

Specify your search by using the query parameter.

Example of using HTTP to query a partitioned index:

    GET /$DATABASE/_partition/$PARTITION_KEY/_design/$DDOC/_search/$INDEX_NAME?include_docs=true&query="*:*"&limit=1 HTTP/1.1
    Content-Type: application/json

Example of using HTTP to query a global index:

    GET /$DATABASE/_design/$DDOC/_search/$INDEX_NAME?include_docs=true&query="*:*"&limit=1 HTTP/1.1
    Content-Type: application/json

Example of using the command line to query a partitioned index:

    curl "https://$HOST:5984/$DATABASE/_partition/$PARTITION_KEY/_design/$DDOC/_search/$INDEX_NAME?include_docs=true&query=*:*&limit=1"

Example of using the command line to query a global index:

    curl "https://$HOST:5984/$DATABASE/_design/$DDOC/_search/$INDEX_NAME?include_docs=true&query=*:*&limit=1"

3.3.3.1. Query Parameters

A full list of query parameters can be found in the API Reference.

You must enable faceting before you can use the following parameters:

  • counts
  • drilldown
  • ranges

3.3.3.2. Relevance

When a query returns more than one result, the results can be sorted. By default, the sorting order is determined by ‘relevance’.

Relevance is measured according to Apache Lucene Scoring. As an example, if you search a simple database for the word example, two documents might contain the word. If one document mentions the word example 10 times, but the second document mentions it only twice, then the first document is considered to be more ‘relevant’.

If you do not provide a sort parameter, relevance is used by default. The highest scoring matches are returned first.

If you provide a sort parameter, then matches are returned in that order, ignoring relevance.

If you want to use a sort parameter, and also include ordering by relevance in your search results, use the special fields -<score> or <score> within the sort parameter.

3.3.3.3. POSTing search queries

Instead of using the GET HTTP method, you can also use POST. The main advantage of POST queries is that they can have a request body, so you can specify the request as a JSON object. Each parameter in the query string of a GET request corresponds to a field in the JSON object in the request body.

Example of using HTTP to POST a search request:

    POST /db/_design/ddoc/_search/searchname HTTP/1.1
    Content-Type: application/json

Example of using the command line to POST a search request:

    curl "https://$HOST:5984/db/_design/ddoc/_search/searchname" -X POST -H 'Content-Type: application/json' -d @search.json

Example JSON document that contains a search request:

    {
        "q": "index:my query",
        "sort": "foo",
        "limit": 3
    }
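The correspondence between GET query-string parameters and fields of the POST body can be sketched with a small converter. This is a hypothetical helper, not a CouchDB API: the name queryStringToBody is an assumption, and the sketch assumes that values which parse as JSON (numbers, booleans, arrays, quoted strings) should be sent as typed JSON values, while the Lucene query itself always stays a string.

```javascript
// Hypothetical helper: converts a GET query string into a JSON body for POST.
function queryStringToBody(queryString) {
  const body = {};
  for (const [name, raw] of new URLSearchParams(queryString)) {
    if (name === "q" || name === "query") {
      body[name] = raw; // keep the Lucene query as a plain string
      continue;
    }
    try {
      body[name] = JSON.parse(raw); // numbers, booleans, arrays, quoted strings
    } catch (err) {
      body[name] = raw; // not valid JSON, so keep the raw string
    }
  }
  return body;
}

const body = queryStringToBody('q=class:bird&limit=3&include_docs=true&counts=["diet"]');
console.log(JSON.stringify(body));
// {"q":"class:bird","limit":3,"include_docs":true,"counts":["diet"]}
```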

3.3.4. Query syntax

The CouchDB search query syntax is based on the Lucene syntax. Search queries take the form of name:value unless the name is omitted, in which case they use the default field, as demonstrated in the following examples:

Example search query expressions:

    // Birds
    class:bird

    // Animals that begin with the letter "l"
    l*

    // Carnivorous birds
    class:bird AND diet:carnivore

    // Herbivores that start with letter "l"
    l* AND diet:herbivore

    // Medium-sized herbivores
    min_length:[1 TO 3] AND diet:herbivore

    // Herbivores that are 2m long or less
    diet:herbivore AND min_length:[-Infinity TO 2]

    // Mammals that are at least 1.5m long
    class:mammal AND min_length:[1.5 TO Infinity]

    // Find "Meles meles"
    latin_name:"Meles meles"

    // Mammals that are herbivores or omnivores
    diet:(herbivore OR omnivore) AND class:mammal

    // Return all results
    *:*

Queries over multiple fields can be logically combined, and groups and fields can be further grouped. The available logical operators are case-sensitive and are AND, +, OR, NOT and -. Range queries can run over strings or numbers.

If you want a fuzzy search, you can run a query with ~ to find terms like the search term. For instance, look~ finds the terms book and took.

You can alter the importance of a search term by adding ^ and a positive number. This alteration makes matches containing the term more or less relevant, proportional to the power of the boost value. The default value is 1, which means no increase or decrease in the strength of the match. A decimal value between 0 and 1 reduces importance, making the match strength weaker. A value greater than 1 increases importance, making the match strength stronger.

Wildcard searches are supported, for both single (?) and multiple (*) character searches. For example, dat? would match date and data, whereas dat* would match date, data, database, and dates. Wildcards must come after the search term.

Use *:* to return all results.

If the search query does not specify the "group_field" argument, the response contains a bookmark. If this bookmark is later provided as a URL parameter, the response skips the rows that were seen already, making it quick and easy to get the next set of results.

The following characters require escaping if you want to search on them:

    + - && || ! ( ) { } [ ] ^ " ~ * ? : \ /

To escape one of these characters, use a preceding backslash character (\).
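
Escaping user-supplied text before it is placed in a query can be sketched as follows. This is a minimal sketch: the function name escapeLucene is hypothetical, and it places a single backslash before each listed character or two-character operator, as described above.

```javascript
// Hypothetical helper: puts a backslash before each character (or && / ||)
// from the list of characters that require escaping.
function escapeLucene(term) {
  return term.replace(/&&|\|\||[+\-!(){}\[\]^"~*?:\\\/]/g, (match) => "\\" + match);
}

console.log(escapeLucene("(1+1):2")); // \(1\+1\)\:2
console.log(escapeLucene("AC/DC"));   // AC\/DC
console.log(escapeLucene("a&&b"));    // a\&&b
```

A single & or | is left untouched, because only the doubled forms appear in the list.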

The response to a search query contains an order field for each of the results. The order field is an array where the first element is the field or fields that are specified in the sort parameter. See the sort parameter. If no sort parameter is included in the query, then the order field contains the Lucene relevance score. If you use the ‘sort by distance’ feature as described in geographical searches, then the first element is the distance from a point. The distance is measured by using either kilometers or miles.

3.3.4.1. Faceting

CouchDB Search also supports faceted searching, enabling discovery of aggregate information about matches quickly and easily. You can match all documents by using the special ?q=*:* query syntax, and use the returned facets to refine your query. To indicate that a field must be indexed for faceted queries, set {"facet": true} in its options.

Example of a search index function that enables faceting on two fields:

    function(doc) {
        index("type", doc.type, {"facet": true});
        index("price", doc.price, {"facet": true});
    }

To use facets, all the documents in the index must include all the fields that have faceting enabled. If your documents do not include all the fields, you receive a bad_request error with the following reason, “The field_name does not exist.” If each document does not contain all the fields for facets, create separate indexes for each field. If you do not create separate indexes for each field, you must include only documents that contain all the fields. Verify that the fields exist in each document by using a single if statement.

Example if statement to verify that the required fields exist in each document:

    if (typeof doc.town == "string" && typeof doc.name == "string") {
        index("town", doc.town, {facet: true});
        index("name", doc.name, {facet: true});
    }

3.3.4.2. Counts

The counts facet syntax takes a list of fields, and returns the number of query results for each unique value of each named field.

Example of a query using the counts facet syntax:

    ?q=*:*&counts=["type"]

Example response after using the counts facet syntax:

    {
        "total_rows":100000,
        "bookmark":"g...",
        "rows":[...],
        "counts":{
            "type":{
                "sofa": 10,
                "chair": 100,
                "lamp": 97
            }
        }
    }
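
What the counts facet reports can be illustrated with an in-memory equivalent. This sketch only mirrors the semantics described above (one count per unique value of the named field); it is not how the server computes facets, and the function name countFacet and the sample documents are assumptions.

```javascript
// Illustration only: counts the unique values of one field across matching documents.
function countFacet(docs, field) {
  const counts = {};
  for (const doc of docs) {
    const value = doc[field];
    counts[value] = (counts[value] || 0) + 1;
  }
  return counts;
}

// Hypothetical documents that matched the query.
const matches = [
  { type: "sofa" },
  { type: "chair" },
  { type: "chair" },
  { type: "lamp" },
];

console.log(countFacet(matches, "type")); // { sofa: 1, chair: 2, lamp: 1 }
```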

3.3.4.3. Drilldown

You can restrict results to documents with a dimension equal to the specified label. Restrict the results by adding drilldown=["dimension","label"] to a search query. You can include multiple drilldown parameters to restrict results along multiple dimensions.

    GET /things/_design/inventory/_search/fruits?q=*:*&drilldown=["state","old"]&drilldown=["item","apple"]&include_docs=true HTTP/1.1

For better language interoperability, you can achieve the same by supplying a list of lists:

    GET /things/_design/inventory/_search/fruits?q=*:*&drilldown=[["state","old"],["item","apple"]]&include_docs=true HTTP/1.1

You can also supply a list of lists for drilldown in bodies of POST requests.

Note that multiple values for a single key in a drilldown are combined with an OR relation, and multiple keys are combined with an AND relation.

Using a drilldown parameter is similar to using key:value in the q parameter, but the drilldown parameter returns values that the analyzer might skip.

For example, if the analyzer did not index a stop word like "a", using drilldown returns it when you specify drilldown=["key","a"].
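
The OR and AND relations between drilldown values and keys can be illustrated with an in-memory filter. This is a sketch of the semantics only, not of the server implementation: the function name matchesDrilldowns and the sample documents are hypothetical, and the sketch assumes that each drilldown has the form [key, value1, value2, ...].

```javascript
// Illustration only: each drilldown is [key, value1, value2, ...].
// Values for one key are ORed; separate keys are ANDed.
function matchesDrilldowns(doc, drilldowns) {
  return drilldowns.every(([key, ...values]) => values.includes(doc[key]));
}

// Hypothetical documents.
const fruits = [
  { item: "apple", state: "old" },
  { item: "apple", state: "new" },
  { item: "pear", state: "old" },
  { item: "plum", state: "old" },
];

// state must be "old" AND item must be "apple" OR "pear".
const drilldowns = [["state", "old"], ["item", "apple", "pear"]];
const hits = fruits.filter((doc) => matchesDrilldowns(doc, drilldowns));

console.log(hits.map((doc) => doc.item)); // [ 'apple', 'pear' ]
```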

3.3.4.4. Ranges

The range facet syntax reuses the standard Lucene syntax for ranges to return counts of results that fit into each specified category. Inclusive range queries are denoted by brackets ([, ]). Exclusive range queries are denoted by curly brackets ({, }).

Example of a request that uses faceted search for matching ranges:

    ?q=*:*&ranges={"price":{"cheap":"[0 TO 100]","expensive":"{100 TO Infinity}"}}

Example results after a ranges check on a faceted search:

    {
        "total_rows":100000,
        "bookmark":"g...",
        "rows":[...],
        "ranges": {
            "price": {
                "expensive": 278682,
                "cheap": 257023
            }
        }
    }
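
The effect of inclusive and exclusive bounds on range counts can be illustrated in memory. This is a sketch of the semantics only, not the server implementation: the function names parseRange, inRange, and rangeFacet and the sample prices are assumptions, and the parser handles only numeric bounds.

```javascript
// Illustration only: parses a range such as "[0 TO 100]" or "{100 TO Infinity}".
function parseRange(text) {
  const match = /^([\[{])\s*(\S+)\s+TO\s+(\S+?)\s*([\]}])$/.exec(text);
  if (!match) {
    throw new Error("Invalid range: " + text);
  }
  return {
    min: Number(match[2]),
    max: Number(match[3]),
    minInclusive: match[1] === "[", // [ includes the bound, { excludes it
    maxInclusive: match[4] === "]",
  };
}

function inRange(value, range) {
  const aboveMin = range.minInclusive ? value >= range.min : value > range.min;
  const belowMax = range.maxInclusive ? value <= range.max : value < range.max;
  return aboveMin && belowMax;
}

// Counts how many values fall into each labeled range.
function rangeFacet(values, ranges) {
  const counts = {};
  for (const [label, text] of Object.entries(ranges)) {
    const range = parseRange(text);
    counts[label] = values.filter((value) => inRange(value, range)).length;
  }
  return counts;
}

// Hypothetical prices; 100 is counted as cheap only, because {100 excludes it.
const prices = [5, 100, 100.5, 2500];
console.log(rangeFacet(prices, { cheap: "[0 TO 100]", expensive: "{100 TO Infinity}" }));
// { cheap: 2, expensive: 2 }
```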

3.3.5. Geographical searches

In addition to searching by the content of textual fields, you can also sort your results by their distance from a geographic coordinate using Lucene’s built-in geospatial capabilities.

To sort your results in this way, you must index two numeric fields, representing the longitude and latitude.

You can then query by using the special <distance...> sort field, which takes five parameters:

  • Longitude field name: The name of your longitude field (lon in the example).
  • Latitude field name: The name of your latitude field (lat in the example).
  • Longitude of origin: The longitude of the place you want to sort by distance from.
  • Latitude of origin: The latitude of the place you want to sort by distance from.
  • Units: The units to use: km for kilometers or mi for miles. The distance is returned in the order field.

You can combine sorting by distance with any other search query, such as range searches on the latitude and longitude, or queries that involve non-geographical information.

That way, you can search in a bounding box, and narrow down the search with extra criteria.

Example geographical data:

    {
        "name":"Aberdeen, Scotland",
        "lat":57.15,
        "lon":-2.15,
        "type":"city"
    }

Example of a search index function for the geographic data:

    function(doc) {
        if (doc.type && doc.type == 'city') {
            index('city', doc.name, {'store': true});
            index('lat', doc.lat, {'store': true});
            index('lon', doc.lon, {'store': true});
        }
    }

An example of using HTTP for a query that sorts cities in the northern hemisphere by their distance to New York:

    GET /examples/_design/cities-designdoc/_search/cities?q=lat:[0+TO+90]&sort="<distance,lon,lat,-74.0059,40.7127,km>" HTTP/1.1

An example of using the command line for a query that sorts cities in the northern hemisphere by their distance to New York:

    curl "https://$HOST:5984/examples/_design/cities-designdoc/_search/cities?q=lat:\[0+TO+90\]&sort=\"<distance,lon,lat,-74.0059,40.7127,km>\""

Example (abbreviated) response, containing a list of northern hemisphere cities sorted by distance to New York:

    {
        "total_rows": 205,
        "bookmark": "g1A...XIU",
        "rows": [
            {
                "id": "city180",
                "order": [
                    8.530665755719783,
                    18
                ],
                "fields": {
                    "city": "New York, N.Y.",
                    "lat": 40.78333333333333,
                    "lon": -73.96666666666667
                }
            },
            {
                "id": "city177",
                "order": [
                    13.756343205985946,
                    17
                ],
                "fields": {
                    "city": "Newark, N.J.",
                    "lat": 40.733333333333334,
                    "lon": -74.16666666666667
                }
            },
            {
                "id": "city178",
                "order": [
                    113.53603438866077,
                    26
                ],
                "fields": {
                    "city": "New Haven, Conn.",
                    "lat": 41.31666666666667,
                    "lon": -72.91666666666667
                }
            }
        ]
    }
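
The distance ordering in the response can be approximated locally with the haversine formula. This is a sketch under stated assumptions: the haversine formula and the mean Earth radius of 6371 km are choices made for illustration and are not necessarily what Lucene uses, so the computed distances are close to, but not identical to, the order values above. The helper names haversineKm and distanceSort are hypothetical.

```javascript
const EARTH_RADIUS_KM = 6371; // mean Earth radius; an assumption of this sketch

function toRadians(degrees) {
  return (degrees * Math.PI) / 180;
}

// Great-circle distance between two points, by the haversine formula.
function haversineKm(lat1, lon1, lat2, lon2) {
  const dLat = toRadians(lat2 - lat1);
  const dLon = toRadians(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
}

// Builds the special distance sort field from its five parameters.
function distanceSort(lonField, latField, lon, lat, units) {
  return `<distance,${lonField},${latField},${lon},${lat},${units}>`;
}

const origin = { lat: 40.7127, lon: -74.0059 }; // New York, as in the example query
const cities = [
  { city: "New Haven, Conn.", lat: 41.31666666666667, lon: -72.91666666666667 },
  { city: "New York, N.Y.", lat: 40.78333333333333, lon: -73.96666666666667 },
  { city: "Newark, N.J.", lat: 40.733333333333334, lon: -74.16666666666667 },
];

// Sort the cities by their approximate distance from the origin, nearest first.
const sorted = cities
  .map((c) => ({ ...c, km: haversineKm(origin.lat, origin.lon, c.lat, c.lon) }))
  .sort((a, b) => a.km - b.km);

console.log(sorted.map((c) => c.city));
console.log(distanceSort("lon", "lat", origin.lon, origin.lat, "km"));
// <distance,lon,lat,-74.0059,40.7127,km>
```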

3.3.6. Highlighting search terms

Sometimes it is useful to get the context in which a search term was mentioned so that you can display more emphasized results to a user.

To get more emphasized results, add the highlight_fields parameter to the search query. Specify the field names for which you would like excerpts, with the highlighted search term returned.

By default, the search term is placed in <em> tags to highlight it, but the highlight can be overridden by using the highlight_pre_tag and highlight_post_tag parameters.

The length of the fragments is 100 characters by default. A different length can be requested with the highlights_size parameter.

The highlights_number parameter controls the number of fragments that are returned, and defaults to 1.

In the response, a highlights field is added, with one subfield per field name.

For each field, you receive an array of fragments with the search term highlighted.

Example of using HTTP to search with highlighting enabled:

    GET /movies/_design/searches/_search/movies?q=movie_name:Azazel&highlight_fields=["movie_name"]&highlight_pre_tag="**"&highlight_post_tag="**"&highlights_size=30&highlights_number=2 HTTP/1.1
    Authorization: ...

Example of using the command line to search with highlighting enabled:

    curl "https://$HOST:5984/movies/_design/searches/_search/movies?q=movie_name:Azazel&highlight_fields=\[\"movie_name\"\]&highlight_pre_tag=\"**\"&highlight_post_tag=\"**\"&highlights_size=30&highlights_number=2"

Example of highlighted search results:

    {
        "highlights": {
            "movie_name": [
                " on the Azazel Orient Express",
                " Azazel manuals, you"
            ]
        }
    }