mariuszprzydatek.com

Elasticsearch custom tokenizers – nGram

Elasticsearch, Others February 18, 2014

If you’ve been trying to query an Elasticsearch index for partial string matches (similar to SQL’s “LIKE” operator), as I did initially, you may be surprised to learn that the default ES setup does not support them.

Here’s an example using a “match” query (the Query DSL documentation is listed under Resources below):

curl -XGET 'http://search.my-server.com/blog/users/_search?pretty=true' -d '
{
  "query" : {
    "match" : {
      "username" : "mar"
    }
  }
}'

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

whereas when I search for the full username, the result is as follows:

curl -XGET 'http://search.my-server.com/blog/users/_search?pretty=true' -d '
{
  "query" : {
    "match" : {
      "username" : "mariusz"
    }
  }
}'

{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 5.5108595,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "users",
      "_id" : "835",
      "_score" : 5.5108595,
      "_source" : {
        "id" : 835,
        "version" : 5,
        "creationTime" : "2013/11/29 03:13:27 PM,UTC",
        "modificationTime" : "2014/01/03 01:50:17 PM,UTC",
        "username" : "mariusz",
        "firstName" : "Mariusz",
        "lastName" : "Przydatek",
        "homeAddress" : [],
        "email" : "me@mariuszprzydatek.com",
        "interests" : ["Start-ups", "VC", "Java", "Elasticsearch", "AngularJS"],
        "websiteUrl" : "http://mariuszprzydatek.com",
        "twitter" : "https://twitter.com/mprzydatek",
        "avatar" : "http://www.gravatar.com/avatar/8d8a9d08eddb126c3301070af22f9933.png"
      }
    } ]
  }
}

Wondering why that is? I’ll save you the trouble of studying the Elasticsearch docs and explain it here and now.

It’s all about how your data (the “username” field, to be precise) is indexed by Elasticsearch; specifically, which of the many built-in tokenizers is used to create the search tokens.

By default, ES uses the “standard” tokenizer. What we need instead is the nGram tokenizer (documentation for both is listed under Resources below).

Here’s how you can check how your data has actually been “tokenized” by ES:

curl -XGET 'http://search.my-server.com/blog/users/_search?pretty=true' -d '
{
  "query" : {
    "match" : {
      "username" : "mariusz"
    }
  },
  "script_fields" : {
    "terms" : {
      "script" : "doc[field].values",
      "params" : {
        "field" : "username"
      }
    }
  }
}'

{
  "took" : 1191,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 5.5053496,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "users",
      "_id" : "835",
      "_score" : 5.5053496,
      "fields" : {
        "terms" : [ "mariusz" ]
      }
    } ]
  }
}

As you can see near the end of the JSON above, only one token was created for the username field: “mariusz”. No wonder querying for the partial string “mar” didn’t work.
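To make the reason concrete, here is a minimal Java sketch of the idea: a query term only matches if it equals an indexed term exactly. The class and method names are hypothetical, and a plain set stands in for the inverted index; this is an illustration, not how Elasticsearch is implemented internally.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration: a set of indexed terms stands in for the inverted index.
class TermLookupSketch {

  // A query term matches only if it equals an indexed term exactly; prefixes do not count.
  static boolean matches(Set<String> indexedTerms, String queryTerm) {
    return indexedTerms.contains(queryTerm.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    // With the standard tokenizer, "mariusz" was indexed as a single term.
    Set<String> indexedTerms = Collections.singleton("mariusz");
    System.out.println(matches(indexedTerms, "mariusz")); // true
    System.out.println(matches(indexedTerms, "mar"));     // false
  }
}
```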

To allow partial string search, you need to do the following:

1. remove the whole current index (I know, sorry, but there’s no other way: the data has to be re-tokenized, and tokenization happens at indexing time)
2. create a new custom tokenizer
3. create a new custom analyzer
4. create a new index that has the new tokenizer/analyzer set as defaults

Let’s start with removing the old index:

curl -XDELETE 'http://search.my-server.com/blog'

{
  "ok" : true,
  "acknowledged" : true
}

Now we can combine steps 2, 3 and 4 in a single command:

curl -XPUT 'http://search.my-server.com/blog/' -d '
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "default": {
          "type" : "custom",
          "tokenizer" : "my_ngram_tokenizer",
          "filter" : "lowercase"
        }
      },
      "tokenizer" : {
        "my_ngram_tokenizer" : {
          "type" : "nGram",
          "min_gram" : "3",
          "max_gram" : "20",
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  }
}'

{
 "ok" : true,
 "acknowledged" : true
}
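To get a feel for what a tokenizer with these settings emits, here is a rough Java sketch. It is a hypothetical stand-in, not Elasticsearch code: it assumes the input is split on anything that is not a letter or digit (the token_chars setting), lowercased, and then cut into every substring of length min_gram to max_gram.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the tokenizer/analyzer configured above; not Elasticsearch code.
class NGramSketch {

  // Lowercases the input, splits it on anything that is not a letter or digit,
  // and emits every substring of each chunk with a length of minGram..maxGram.
  static List<String> ngrams(String text, int minGram, int maxGram) {
    List<String> grams = new ArrayList<>();
    for (String chunk : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
      for (int start = 0; start < chunk.length(); start++) {
        for (int len = minGram; len <= maxGram && start + len <= chunk.length(); len++) {
          grams.add(chunk.substring(start, start + len));
        }
      }
    }
    return grams;
  }

  public static void main(String[] args) {
    List<String> grams = ngrams("mariusz", 3, 20);
    System.out.println(grams.size());          // 15
    System.out.println(grams.contains("mar")); // true
  }
}
```

With min_gram set to 3, the shortest tokens have three characters, so a two-letter query such as “ma” would still find nothing.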

Now let’s add the same data (the profile of user mariusz) and see how it gets tokenized:

curl -XGET 'http://search.my-server.com/blog/users/_search?pretty=true' -d '
{
  "query" : {
    "match" : {
      "username" : "mar"
    }
  },
  "script_fields" : {
    "terms" : {
      "script" : "doc[field].values",
      "params" : {
        "field" : "username"
      }
    }
  }
}'

{
  "took" : 309,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.26711923,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "users",
      "_id" : "22",
      "_score" : 0.26711923,
      "fields" : {
        "terms" : [ "ari", "ariu", "arius", "ariusz", "ius", "iusz",
                    "mar", "mari", "mariu", "marius", "mariusz", "riu",
                    "rius", "riusz", "usz" ]
      }
    } ]
  }
}

Wow, that was a hell of a ride 🙂 Now you can see that many more tokens were created. I’ll leave it up to you to check whether querying for the partial string “mar” works now.

Take care!

Resources:

ES QueryDSL (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl.html)
Standard tokenizer (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-standard-tokenizer.html)
nGram tokenizer (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html)
Elasticsearch analysis docs (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis.html)

Token-based Authentication Plugin for ActiveMQ

Java, JMS January 4, 2014

This post is part of the ActiveMQ Custom Security Plugins series.

As with the IP-based Authentication Plugin for ActiveMQ, to limit connectivity to the ActiveMQ server based on a token (assuming the connecting client, e.g. a browser using JavaScript over the STOMP protocol, provides such a token when trying to establish a connection with the broker), we’ll need to override the addConnection() method of the BrokerFilter.class.

For the purpose of this example, I’ll be using Redis as the data store against which I’ll check the tokens of connecting clients, to decide whether a client is allowed to establish a connection with the broker (the token exists in Redis) or not (otherwise). To hit Redis from Java I’ll be using the Jedis driver.

Step 1: Implementation of the plugin logic:

import org.apache.activemq.broker.Broker;
import org.apache.activemq.broker.BrokerFilter;
import org.apache.activemq.broker.ConnectionContext;
import org.apache.activemq.command.ConnectionInfo;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import redis.clients.jedis.Jedis;
import java.util.Map;

public class TokenAuthenticationBroker extends BrokerFilter {

  private final Logger logger = LoggerFactory.getLogger(getClass());
  public final static String REDIS_KEY = "authentication:activemq:tokens";

  Map<String, String> redisConfig;

  public TokenAuthenticationBroker(Broker next, Map<String, String> redisConfig) {
    super(next);
    this.redisConfig = redisConfig;
  }

  @Override
  public void addConnection(ConnectionContext context, ConnectionInfo info) throws Exception {
    String host = redisConfig.get("host");
    int port = Integer.parseInt(redisConfig.get("port"));

    // The client is expected to send its token as the user name
    String token = context.getUserName();
    if (token == null) {
      throw new SecurityException("No token provided");
    }

    logger.debug("Establishing Redis connection using [host={}, port={}] ", host, port);
    Jedis jedis = new Jedis(host, port);
    String response;
    try {
      logger.debug("Querying Redis using [key={}, token={}] ", REDIS_KEY, token);
      // Tokens are fields of a Redis hash; the value identifies the owning user
      response = jedis.hget(REDIS_KEY, token);
    } finally {
      // Always release the Redis connection, even if the lookup fails
      jedis.disconnect();
    }

    if (response == null) {
      throw new SecurityException("Token not found in the data store");
    }
    logger.debug("Found token [{}] belonging to user: {}. Allowing connection", token, response);
    super.addConnection(context, info);
  }
}

As you can see in the example above, the token provided by the connecting client can be read in ActiveMQ directly from the context (using the getUserName() method, assuming the client sends the token as a query parameter named “username”). Having the token, the next thing we need to do is query the Redis store (under the REDIS_KEY) and check whether the token exists (the hget() method invoked on the jedis object). Depending on the value of the response, we decide whether to call addConnection() or throw a SecurityException.
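If you’d like to unit-test the allow/deny decision without a running Redis or ActiveMQ, one option is to pull it out into a small helper. The sketch below is only an illustration under that assumption: TokenCheckSketch and requireKnownToken() are hypothetical names, and an in-memory map stands in for the Redis hash.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: the allow/deny decision, separated from Redis and ActiveMQ.
class TokenCheckSketch {

  // Returns the user the token belongs to, or throws if the token is missing or unknown.
  static String requireKnownToken(String token, Function<String, String> lookup) {
    if (token == null || token.isEmpty()) {
      throw new SecurityException("No token provided");
    }
    String user = lookup.apply(token);
    if (user == null) {
      throw new SecurityException("Token not found in the data store");
    }
    return user;
  }

  public static void main(String[] args) {
    // An in-memory map stands in for the Redis hash (token -> user).
    Map<String, String> tokens = new HashMap<>();
    tokens.put("abc123", "mariusz");

    System.out.println(requireKnownToken("abc123", tokens::get)); // mariusz
    try {
      requireKnownToken("unknown", tokens::get);
    } catch (SecurityException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

In the real plugin, the lookup function would wrap the hget() call on REDIS_KEY.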

After the actual plug-in logic has been implemented, the plug-in must be configured and installed. For this purpose, we need an implementation of the BrokerPlugin.class, which is used to expose the configuration of a plug-in and to install the plug-in into the ActiveMQ broker.

Step 2: Implementation of the plugin “installer”:

import org.apache.activemq.broker.Broker;
import org.apache.activemq.broker.BrokerPlugin;
import java.util.Map;

public class TokenAuthenticationPlugin implements BrokerPlugin {

  Map<String, String> redisConfig;

  @Override
  public Broker installPlugin(Broker broker) throws Exception {
    return new TokenAuthenticationBroker(broker, redisConfig);
  }

  public Map<String, String> getRedisConfig() {
    return redisConfig;
  }

  public void setRedisConfig(Map<String, String> redisConfig) {
    this.redisConfig = redisConfig;
  }
}

The installPlugin() method above is used to instantiate the plug-in and return a new intercepted broker for the next plug-in in the chain. The TokenAuthenticationPlugin.class also contains getter and setter methods used to configure the TokenAuthenticationBroker. These setter and getter methods are available via a Spring beans–style XML configuration in the ActiveMQ XML configuration file (example below).

Step 3: Configuring the plugin in activemq.xml:

<!-- "/apache-activemq/conf/activemq.xml" -->
<broker brokerName="localhost" dataDirectory="${activemq.base}/data" xmlns="http://activemq.apache.org/schema/core">
  <plugins>
    <bean id="tokenAuthenticationPlugin" class="com.mycompany.mysystem.activemq.TokenAuthenticationPlugin" xmlns="http://www.springframework.org/schema/beans">
      <property name="redisConfig">
        <map>
          <entry key="host" value="localhost" />
          <entry key="port" value="6379" />
        </map>
      </property>
    </bean>
  </plugins>
</broker>
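Since the host and port arrive as plain strings in the redisConfig map, it may be worth validating them once instead of parsing the port on every connection. Here is a minimal sketch of that idea; RedisSettingsSketch is a hypothetical helper of my own, not part of ActiveMQ or Jedis.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of ActiveMQ or Jedis): validates the redisConfig map once.
final class RedisSettingsSketch {

  final String host;
  final int port;

  private RedisSettingsSketch(String host, int port) {
    this.host = host;
    this.port = port;
  }

  // The map values are strings, so the port has to be parsed and range-checked.
  static RedisSettingsSketch from(Map<String, String> config) {
    String host = config.get("host");
    String rawPort = config.get("port");
    if (host == null || host.trim().isEmpty()) {
      throw new IllegalArgumentException("redisConfig is missing 'host'");
    }
    if (rawPort == null) {
      throw new IllegalArgumentException("redisConfig is missing 'port'");
    }
    int port;
    try {
      port = Integer.parseInt(rawPort.trim());
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException("redisConfig 'port' is not a number: " + rawPort, e);
    }
    if (port < 1 || port > 65535) {
      throw new IllegalArgumentException("redisConfig 'port' is out of range: " + port);
    }
    return new RedisSettingsSketch(host.trim(), port);
  }

  public static void main(String[] args) {
    // Same keys and values as in the XML configuration above.
    Map<String, String> config = new HashMap<>();
    config.put("host", "localhost");
    config.put("port", "6379");
    RedisSettingsSketch settings = from(config);
    System.out.println(settings.host + ":" + settings.port); // localhost:6379
  }
}
```

Calling something like this from the plugin’s setter or constructor would surface a typo in activemq.xml at broker start-up rather than on the first client connection.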

That’s all there is to it 🙂

Happy Coding!

Resources:

ActiveMQ Custom Security Plugins (https://mariuszprzydatek.com/2014/01/03/activemq-custom-security-plugins/)
IP-based Authentication Plugin for ActiveMQ (https://mariuszprzydatek.com/2014/01/04/ip-based-authentication-plugin-for-activemq/)

IP-based Authentication Plugin for ActiveMQ

Java, JMS January 4, 2014

To limit the connectivity to the ActiveMQ server based on IP address, we’ll need to override the addConnection() method of the BrokerFilter.class, mentioned in my initial post on ActiveMQ Custom Security Plugins.

Example implementation (from the book “ActiveMQ in Action”):

import org.apache.activemq.broker.Broker;
import org.apache.activemq.broker.BrokerFilter;
import org.apache.activemq.broker.ConnectionContext;
import org.apache.activemq.command.ConnectionInfo;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IPAuthenticationBroker extends BrokerFilter {

  List<String> allowedIPAddresses;
  Pattern pattern = Pattern.compile("^/([0-9\\.]*):(.*)");

  public IPAuthenticationBroker(Broker next, List<String> allowedIPAddresses) {
    super(next);
    this.allowedIPAddresses = allowedIPAddresses;
  }

  @Override
  public void addConnection(ConnectionContext context, ConnectionInfo info) throws Exception {
    String remoteAddress = context.getConnection().getRemoteAddress();
    Matcher matcher = pattern.matcher(remoteAddress);
    if (matcher.matches()) {
      String ip = matcher.group(1);
      if (!allowedIPAddresses.contains(ip)) {
        throw new SecurityException("Connecting from IP address " + ip + " is not allowed" );
      }
    } else {
      throw new SecurityException("Invalid remote address " + remoteAddress);
    }
    super.addConnection(context, info);
  }
}

As you can see, the implementation above performs a simple check of the IP address using a regular expression to determine the ability to connect. If that IP address is allowed to connect, the call is delegated to the BrokerFilter.addConnection() method. If that IP address isn’t allowed to connect, an exception is thrown.
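To see what the regular expression accepts, here is a small stand-alone sketch that applies the same pattern to two sample strings. The sample addresses are my own assumptions; the exact format returned by getRemoteAddress() may depend on the transport and the ActiveMQ version, so it’s worth logging a real value before relying on the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-alone check of the pattern used above; the sample addresses are assumptions.
class RemoteAddressSketch {

  private static final Pattern PATTERN = Pattern.compile("^/([0-9\\.]*):(.*)");

  // Returns the IP part of the remote address, or null if the pattern does not match.
  static String extractIp(String remoteAddress) {
    Matcher matcher = PATTERN.matcher(remoteAddress);
    return matcher.matches() ? matcher.group(1) : null;
  }

  public static void main(String[] args) {
    System.out.println(extractIp("/127.0.0.1:54321"));      // 127.0.0.1
    System.out.println(extractIp("tcp://127.0.0.1:54321")); // null
  }
}
```

An address that does not start with a slash falls into the “Invalid remote address” branch of the plugin, so the connection is refused even if the IP is on the allowed list.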

After the actual plug-in logic has been implemented, the plug-in must be configured and installed. For this purpose, we need an implementation of the BrokerPlugin.class, which is used to expose the configuration of a plug-in and to install the plug-in into the ActiveMQ broker.

import org.apache.activemq.broker.Broker;
import org.apache.activemq.broker.BrokerPlugin;
import java.util.List;

public class IPAuthenticationPlugin implements BrokerPlugin {

  List<String> allowedIPAddresses;

  public Broker installPlugin(Broker broker) throws Exception {
    return new IPAuthenticationBroker(broker, allowedIPAddresses);
  }

  public List<String> getAllowedIPAddresses() {
    return allowedIPAddresses;
  }

  public void setAllowedIPAddresses(List<String> allowedIPAddresses) {
    this.allowedIPAddresses = allowedIPAddresses;
  }
}

The installPlugin() method above is used to instantiate the plug-in and return a new intercepted broker for the next plug-in in the chain. The IPAuthenticationPlugin.class also contains getter and setter methods used to configure the IPAuthenticationBroker. These setter and getter methods are available via a Spring beans–style XML configuration in the ActiveMQ XML configuration file (example below).

<!-- "\apache-activemq\conf\activemq.xml" -->
<broker brokerName="localhost" dataDirectory="${activemq.base}/data" xmlns="http://activemq.apache.org/schema/core">
  <plugins>
    <bean id="ipAuthenticationPlugin" class="com.mycompany.mysystem.activemq.IPAuthenticationPlugin" xmlns="http://www.springframework.org/schema/beans">
      <property name="allowedIPAddresses">
        <list>
          <value>127.0.0.1</value>
        </list>
      </property>
    </bean>
  </plugins>
</broker>
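The “intercepted broker for the next plug-in in the chain” idea can be modelled in a few lines of plain Java. The sketch below uses simplified stand-in types of my own (Broker, CoreBroker, IpFilter, install()); they are not the ActiveMQ interfaces and only illustrate how each filter wraps the next broker and decides whether to delegate.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified stand-ins for illustration only; these are NOT the ActiveMQ interfaces.
class FilterChainSketch {

  interface Broker {
    void addConnection(String remoteIp);
  }

  // The innermost broker: records the connections that made it through the chain.
  static class CoreBroker implements Broker {
    final List<String> accepted = new ArrayList<>();

    @Override
    public void addConnection(String remoteIp) {
      accepted.add(remoteIp);
    }
  }

  // A filter wraps the next broker and decides whether to delegate to it.
  static class IpFilter implements Broker {
    private final Broker next;
    private final List<String> allowed;

    IpFilter(Broker next, List<String> allowed) {
      this.next = next;
      this.allowed = allowed;
    }

    @Override
    public void addConnection(String remoteIp) {
      if (!allowed.contains(remoteIp)) {
        throw new SecurityException("Connecting from IP address " + remoteIp + " is not allowed");
      }
      next.addConnection(remoteIp);
    }
  }

  // Plays the role of installPlugin(): takes a broker and returns the wrapped one.
  static Broker install(Broker broker, List<String> allowed) {
    return new IpFilter(broker, allowed);
  }

  public static void main(String[] args) {
    CoreBroker core = new CoreBroker();
    Broker chain = install(core, Collections.singletonList("127.0.0.1"));

    chain.addConnection("127.0.0.1");
    System.out.println(core.accepted); // [127.0.0.1]

    try {
      chain.addConnection("10.0.0.5");
    } catch (SecurityException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```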

To summarize, creating a custom security plugin using the ActiveMQ plugin API consists of the following three steps:

1. Implementing the plugin logic (overriding methods of the BrokerFilter.class – first code snippet above)
2. Coding the plugin “installer” (implementing the BrokerPlugin.class – second code snippet)
3. Configuring the plugin in the activemq.xml file (Spring beans-style XML – third code snippet)

Happy coding!

Resources:

ActiveMQ Custom Security Plugins (https://mariuszprzydatek.com/2014/01/03/activemq-custom-security-plugins/)
