Elasticsearch custom tokenizers – nGram

If you’ve been trying to query an Elasticsearch index for partial string matches (similar to SQL’s “LIKE” operator), like I did initially, you may be surprised to learn that the default ES setup does not offer such functionality.

Here’s an example using a “match” query (read more about the Query DSL here):

curl -XGET 'http://search.my-server.com/blog/users/_search?pretty=true' -d '
{
  "query" : {
    "match" : {
      "username" : "mar"
    }
  }
}'

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
     "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

whereas when I search for the full username, the result is the following:

curl -XGET 'http://search.my-server.com/blog/users/_search?pretty=true' -d '
{
  "query" : {
    "match" : {
      "username" : "mariusz"
    }
  }
}'

{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 5.5108595,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "users",
      "_id" : "835",
      "_score" : 5.5108595,
      "_source" : {
        "id" : 835,
        "version" : 5,
        "creationTime" : "2013/11/29 03:13:27 PM,UTC",
        "modificationTime" : "2014/01/03 01:50:17 PM,UTC",
        "username" : "mariusz",
        "firstName" : "Mariusz",
        "lastName" : "Przydatek",
        "homeAddress" : [],
        "email" : "me@mariuszprzydatek.com",
        "interests" : ["Start-ups", "VC", "Java", "Elasticsearch", "AngularJS"],
        "websiteUrl" : "http://mariuszprzydatek.com",
        "twitter" : "https://twitter.com/mprzydatek",
        "avatar" : "http://www.gravatar.com/avatar/8d8a9d08eddb126c3301070af22f9933.png",
      }
    } ]
  }
}

Wondering why that is? I’ll save you the trouble of digging through the Elasticsearch docs and explain it right here.

It’s all about how your data (the “username” field, to be precise) gets indexed by Elasticsearch; specifically, which of the many built-in tokenizers is used to create the search tokens.

By default, ES uses the “standard” tokenizer (more details about it here). What we need instead is the nGram tokenizer (details).
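You can sanity-check this yourself with the “_analyze” API. Here’s a minimal sketch (on the 0.90-era ES used in this post, the text to analyze goes in the raw request body; newer versions expect a JSON body instead), and you should get back something like the response shown:

curl -XGET 'http://search.my-server.com/blog/_analyze?tokenizer=standard&pretty=true' -d 'mariusz'

{
  "tokens" : [ {
    "token" : "mariusz",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}

The standard tokenizer splits text on word boundaries, so a single-word username produces exactly one token – and only exact matches will ever hit it.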

 

Here’s how you can check how your data has actually been “tokenized” by ES:

curl -XGET 'http://search.my-server.com/blog/users/_search?pretty=true' -d '
{
  "query" : {
    "match" : {
      "username" : "mariusz"
    }
  },
  "script_fields" : {
    "terms" : {
      "script" : "doc[field].values",
      "params" : {
        "field" : "username"
      }
    }
  }
}'

{
  "took" : 1191,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 5.5053496,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "users",
      "_id" : "835",
      "_score" : 5.5053496,
      "fields" : {
        "terms" : [ "mariusz" ]
      }
    } ]
  }
}

As you can see near the end of the JSON above, there’s only one token created for the “username” field: “mariusz”. No wonder querying for the partial string “mar” wasn’t working.

To enable partial string search, you need to do the following:

  1. remove the whole current index (I know – sorry – there’s no other way; tokenization happens at indexing time, so the data has to be re-indexed)
  2. create a new custom tokenizer
  3. create a new custom analyzer
  4. create a new index that has the new tokenizer/analyzer set as defaults

Let’s start by removing the old index:

curl -XDELETE 'http://search.my-server.com/blog'

{
  "ok" : true,
  "acknowledged" : true
}

 

Now we can combine steps 2, 3, and 4 in a single command:

curl -XPUT 'http://search.my-server.com/blog/' -d '
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "default": {
          "type" : "custom",
          "tokenizer" : "my_ngram_tokenizer",
          "filter" : "lowercase"
        }
      },
      "tokenizer" : {
        "my_ngram_tokenizer" : {
          "type" : "nGram",
          "min_gram" : "3",
          "max_gram" : "20",
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  }
}'

{
  "ok" : true,
  "acknowledged" : true
}
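
Before putting any data back in, you can verify the new analyzer with the same “_analyze” API as before (again a sketch; “default” is the analyzer name we just registered in the index settings):

curl -XGET 'http://search.my-server.com/blog/_analyze?analyzer=default&pretty=true' -d 'mariusz'

This time the response should list the n-grams of “mariusz” – “mar”, “ari”, “riu” and so on, up to the full “mariusz” – instead of a single token.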

 

Now let’s add the same data (the profile of user mariusz) back in and see how it gets tokenized.
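
Re-adding the document might look something like this (a sketch, trimmed to a couple of the original fields – use your actual document body and id):

curl -XPUT 'http://search.my-server.com/blog/users/22' -d '
{
  "username" : "mariusz",
  "firstName" : "Mariusz",
  "lastName" : "Przydatek"
}'

With the document back in the index, we can repeat the “script_fields” trick, this time searching for the partial string “mar”: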

curl -XGET 'http://search.my-server.com/blog/users/_search?pretty=true' -d '
{
  "query" : {
    "match" : {
      "username" : "mar"
    }
  },
  "script_fields" : {
    "terms" : {
      "script" : "doc[field].values",
      "params" : {
        "field" : "username"
      }
    }
  }
}'

{
  "took" : 309,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.26711923,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "users",
      "_id" : "22",
      "_score" : 0.26711923,
      "fields" : {
        "terms" : [ "ari", "ariu", "arius", "ariusz", "ius", "iusz",
                    "mar", "mari", "mariu", "marius", "mariusz", "riu",
                    "rius", "riusz", "sz", "usz" ]
      }
    } ]
  }
}

Wow, that was a hell of a ride 🙂 As you can see, there are way more tokens created now – and, as the query above already shows, searching for the partial string “mar” finally returns our user.

 

Take care!