Category Archives: Others

Web Storage – client-side data storage

While investigating the best solution for client-side data storage, I came across the W3C Web Storage specification, which may be of interest to you as well.

 

The specification “…defines an API for persistent data storage of key-value pair data in Web clients”. It describes two different types of storage:

  • Session storage – remembers data only for the current session and forgets it as soon as the browser tab or window is closed
  • Local storage – stores data across multiple browser sessions (persistent storage), so the page (or window) can be closed and the data is still preserved within the browser

 

Both mechanisms use the same Storage interface:

interface Storage {
  readonly attribute unsigned long length;
  DOMString? key(unsigned long index);
  getter DOMString getItem(DOMString key);
  setter creator void setItem(DOMString key, DOMString value);
  deleter void removeItem(DOMString key);
  void clear();
};
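To make the interface concrete, here is a minimal in-memory sketch of it. Browsers of course supply the real thing as `window.sessionStorage` and `window.localStorage`; this is illustrative only, and the class name `MemoryStorage` is mine:

```javascript
// A minimal in-memory sketch of the Storage interface above;
// illustrative only, browsers provide this natively.
class MemoryStorage {
  constructor() {
    this._map = new Map();
  }
  get length() {
    return this._map.size;
  }
  key(index) {
    // Returns the name of the nth key, or null if out of range
    const keys = [...this._map.keys()];
    return index < keys.length ? keys[index] : null;
  }
  getItem(key) {
    // Per the spec, a missing key yields null (not undefined)
    return this._map.has(key) ? this._map.get(key) : null;
  }
  setItem(key, value) {
    // Keys and values are always stored as strings
    this._map.set(String(key), String(value));
  }
  removeItem(key) {
    this._map.delete(key);
  }
  clear() {
    this._map.clear();
  }
}
```

Usage mirrors the real objects, e.g. `const s = new MemoryStorage(); s.setItem('key', 'value');`.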

 

The storage facility is similar to traditional HTTP cookie storage but offers several benefits:

  • Storage capacity: Browsers provide a minimum of 5 MB of storage per web storage object (IE has allowed 10 MB, but it varies by storage type and browser).
  • Data transmission: Objects are not sent automatically with each request but must be requested.
  • Client side access: Servers cannot directly write to web storage which provides some additional controls from client-side scripting.
  • Data storage: Array-like name/value pairs provide a more flexible data model.

 

Basic operations on both Web Storage mechanisms look like this:

// session storage
  sessionStorage.setItem('key', 'value');         // set
  var item = sessionStorage.getItem('key');       // retrieve
  sessionStorage.removeItem('key');               // remove
  sessionStorage.clear();                         // clear all
  var no_of_items = sessionStorage.length;        // no. of current items

// local storage
  localStorage.setItem('key', 'value');           // set
  var item = localStorage.getItem('key');         // retrieve
  localStorage.removeItem('key');                 // remove
  localStorage.clear();                           // clear all
  var no_of_items = localStorage.length;          // no. of current items
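Before relying on either mechanism, it's worth feature-detecting it. The sketch below is a common defensive pattern (not from the spec): some browsers throw on access or on `setItem` (e.g. older Safari in private mode), so a try/catch is safer than checking `typeof` alone:

```javascript
// Hedged sketch: feature-detect Web Storage before using it.
// type is 'localStorage' or 'sessionStorage'.
function storageAvailable(type) {
  try {
    const storage = window[type];
    const testKey = '__storage_test__';
    storage.setItem(testKey, testKey);   // throws if storage is disabled or full
    storage.removeItem(testKey);
    return true;
  } catch (e) {
    return false;                        // unavailable, disabled, or quota hit
  }
}
```

Typical usage: `if (storageAvailable('localStorage')) { /* safe to use */ }`.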

 

The specification also provides a StorageEvent interface, fired whenever the storage area changes. It exposes the following attributes:

  • storageArea – the storage object that was affected (session or local)
  • key – the key that was changed
  • oldValue – the old value of the key
  • newValue – the new value of the key
  • url – the URL of the page whose storage area was changed
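As an illustration of consuming those attributes, here is a small sketch of a handler. In browsers, the 'storage' event fires in *other* open tabs of the same origin when shared local storage changes; the wiring line is shown as a comment since it only runs in a browser:

```javascript
// Sketch: summarize a StorageEvent using the attributes listed above.
function describeStorageChange(event) {
  // Per the spec, event.key is null when clear() was called
  if (event.key === null) {
    return `storage cleared by ${event.url}`;
  }
  return `${event.key}: "${event.oldValue}" -> "${event.newValue}" (from ${event.url})`;
}

// In a browser you would wire it up like this:
// window.addEventListener('storage', (e) => console.log(describeStorageChange(e)));
```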

 

Privacy Implications:

As discussed in the W3C spec and in other forums, privacy considerations are addressed both in the spec’s design and in the varying user-agent controls present in today’s common web browsers. Within the spec, user agents have the option to:

  • Restrict access to local storage for “third-party domains”, i.e. domains that do not match the top-level domain (e.g., those that sit within iframes). Unlike with cookies, sub-domains are considered separate domains.
  • Set session- and time-based expirations so data can be finite rather than permanent.
  • Use whitelisting and blacklisting features for access control.

 

Key facts:

  • Storage per origin: All storage from the same origin shares the same storage space. An origin is a tuple of scheme/host/port (or a globally unique identifier). For example, http://www.example.org and http://abc.example.org are two separate origins, as are http://example.org and https://example.org, as well as http://example.org:80 and http://example.org:8000.
  • Storage limit: As of now, most browsers that have implemented Web Storage place the limit at 5 MB per domain. You should be able to change this limit on a per-domain basis in the browser settings:
    • Chrome: Advanced>Privacy> Content>Cookies
    • Safari: Privacy>Cookies and Other Website Data; “Details”
    • Firefox: Tools> Clear Recent History > Cookies
    • IE: Internet Options> General> Browsing History>Delete> Cookies and Website Data
  • Security considerations: Storage is assigned on a per-origin basis. Someone might use DNS spoofing to impersonate a particular domain and thereby gain access to that domain’s storage area on a user’s computer. SSL can be used to prevent this, so users can be sure the site they are viewing really belongs to that domain name.
  • Where not to use it: If different users have sites under different pathnames on a single domain, they all share the storage area of the whole origin and can therefore access each other’s data. Hence, it is advisable for people on free hosts whose sites live in different directories of the same domain (for example, freehostingspace.org/user1/ and freehostingspace.org/user2/) not to use Web Storage on their pages for the time being.
  • Web Storage is not part of the HTML5 spec: It is a whole specification in itself.

 

Cookies:

Cookies and Web Storage really serve different purposes. Cookies are primarily for reading server-side, whereas Web Storage can only be read client-side. So the question is, in your app, who needs the data — the client or the server?

  • If it’s your client (your JavaScript), then by all means use Web Storage; sending all that data in the HTTP headers on every request would be a waste of bandwidth.
  • If it’s your server, Web Storage isn’t so useful because you’d have to forward the data along somehow (with Ajax or hidden form fields or something). This might be okay if the server only needs a small subset of the total data for each request.

 

Web Storage vs. Cookies:

  • Web Storage:
    • Pros
      • Support by most modern browsers
      • Stored directly in the browser
      • Same-origin rules apply to local storage data
      • Is not sent with every HTTP request
      • ~5MB storage per domain (that’s 5120KB)
    • Cons
      • Not supported by anything before:
        • IE 8
        • Firefox 3.5
        • Safari 4
        • Chrome 4
        • Opera 10.5
        • iOS 2.0
        • Android 2.0
      • If the server needs the stored client information, you have to send it explicitly.
  • Cookies:
    • Pros
      • Legacy support (it’s been around forever)
      • Persistent data
      • Expiration dates
    • Cons
      • Each domain stores all its cookies in a single string, which can make parsing data difficult
      • Data is not encrypted
      • Cookies are sent with every HTTP request
      • Limited size (4KB)
      • SQL injection can be performed from a cookie

 

If you’re interested in Cookies, you can read more here.

 

Finally, if you’re looking for a client-side data storage solution for AngularJS, you may want to take a look at angular-cache.

 

 

 

Take care!

 

 

 

Elasticsearch custom tokenizers – nGram

If you’ve been trying to query an Elasticsearch index for partial string matches (similar to SQL’s “LIKE” operator), as I did initially, you may be surprised to learn that the default ES setup does not offer such functionality.

 

Here’s an example using a “match” query (read more about the Query DSL here):

curl -XGET 'http://search.my-server.com/blog/users/_search?pretty=true' -d '
{
  "query" : {
    "match" : {
      "username" : "mar"
    }
  }
}'

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
     "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

whereas, when I search for the full username, the result is the following:

curl -XGET 'http://search.my-server.com/blog/users/_search?pretty=true' -d '
{
  "query" : {
    "match" : {
      "username" : "mariusz"
    }
  }
}'

{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 5.5108595,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "users",
      "_id" : "835",
      "_score" : 5.5108595,
      "_source" : {
        "id" : 835,
        "version" : 5,
        "creationTime" : "2013/11/29 03:13:27 PM,UTC",
        "modificationTime" : "2014/01/03 01:50:17 PM,UTC",
        "username" : "mariusz",
        "firstName" : "Mariusz",
        "lastName" : "Przydatek",
        "homeAddress" : [],
        "email" : "me@mariuszprzydatek.com",
        "interests" : ["Start-ups", "VC", "Java", "Elasticsearch", "AngularJS"],
        "websiteUrl" : "http://mariuszprzydatek.com",
        "twitter" : "https://twitter.com/mprzydatek",
        "avatar" : "http://www.gravatar.com/avatar/8d8a9d08eddb126c3301070af22f9933.png",
      }
    } ]
  }
}

Wondering why that is? I’ll save you the trouble of studying the Elasticsearch docs and provide the explanation here and now.

 

It’s all about how your data (the “username” field, to be precise) is indexed by Elasticsearch; specifically, which built-in tokenizer (one of many) is used to create the search tokens.

By default, ES uses the “standard” tokenizer (more details about it here). What we need instead is the nGram tokenizer (details).
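To see what the nGram tokenizer does in practice, here is a rough sketch (my own, not Elasticsearch’s code) of the idea: emit every substring whose length lies between min_gram and max_gram characters:

```javascript
// Sketch of n-gram tokenization: all substrings of text with lengths
// between minGram and maxGram (mimics the idea, not ES's exact behavior).
function nGrams(text, minGram, maxGram) {
  const tokens = [];
  for (let len = minGram; len <= Math.min(maxGram, text.length); len++) {
    for (let i = 0; i + len <= text.length; i++) {
      tokens.push(text.slice(i, i + len));
    }
  }
  return tokens;
}
```

For example, `nGrams('mariusz', 3, 20)` includes `'mar'`, `'ariu'`, and `'mariusz'`, which is why a query for the partial string “mar” can then match the indexed document.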

 

Here’s how you can check how your data has actually been “tokenized” by ES:

curl -XGET 'http://search.my-server.com/blog/users/_search?pretty=true' -d '
{
  "query" : {
    "match" : {
      "username" : "mariusz"
    }
  },
  "script_fields" : {
    "terms" : {
      "script" : "doc[field].values",
      "params" : {
        "field" : "username"
      }
    }
  }
}'

{
  "took" : 1191,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 5.5053496,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "users",
      "_id" : "835",
      "_score" : 5.5053496,
      "fields" : {
        "terms" : [ "mariusz" ]
      }
    } ]
  }
}

So, as you can see near the end of the JSON above, only one token was created for the username field: “mariusz”. No wonder querying for the partial string “mar” wasn’t working.

 

To enable partial string search, you need to do the following:

  1. remove the whole current index (I know – sorry – there’s no other way; the data has to be re-tokenized, and that happens at indexing time)
  2. create a new custom tokenizer
  3. create a new custom analyzer
  4. create new index that has the new tokenizer/analyzer set as defaults

 

Let’s start with removing the old index:

curl -XDELETE 'http://search.my-server.com/blog'

{
  "ok" : true,
  "acknowledged" : true
}

 

Now we can combine steps 2, 3, and 4 in a single command:

curl -XPUT 'http://search.my-server.com/blog/' -d '
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "default": {
          "type" : "custom",
          "tokenizer" : "my_ngram_tokenizer",
          "filter" : "lowercase"
        }
      },
      "tokenizer" : {
        "my_ngram_tokenizer" : {
          "type" : "nGram",
          "min_gram" : "3",
          "max_gram" : "20",
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  }
}'

{
 "ok" : true,
 "acknowledged" : true
}

 

Now let’s re-add the same data (the profile of user mariusz) and see how it got tokenized:

curl -XGET 'http://search.my-server.com/blog/users/_search?pretty=true' -d '
{
  "query" : {
    "match" : {
      "username" : "mar"
    }
  },
  "script_fields" : {
    "terms" : {
      "script" : "doc[field].values",
      "params" : {
        "field" : "username"
      }
    }
  }
}'

{
  "took" : 309,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.26711923,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "users",
      "_id" : "22",
      "_score" : 0.26711923,
      "fields" : {
        "terms" : [ "ari", "ariu", "arius", "ariusz", "ius", "iusz",
                    "mar", "mari", "mariu", "marius", "mariusz", "riu",
                    "rius", "riusz", "sz", "usz" ]
      }
    } ]
  }
}

 

 

Wow, quite a ride it was 🙂 Now you can see far more tokens created. I’ll leave it to you to check whether querying for the partial string “mar” works now.

 

Take care!

 

 

Amazon AWS – CloudWatch – monitoring

Basic information about the Amazon CloudWatch service:

 

AWS Free Tier availability:

  • 10 Metrics,
  • 10 Alarms,
  • 1,000,000 API requests

 

 

Functionality:

  • Monitoring AWS resources automatically (without installing additional software):
    • Basic Monitoring for Amazon EC2 instances: ten pre-selected metrics at five-minute frequency, free of charge.
    • Detailed Monitoring for Amazon EC2 instances: seven pre-selected metrics at one-minute frequency, for an additional charge.
    • Amazon EBS volumes: eight pre-selected metrics at five-minute frequency, free of charge.
    • Elastic Load Balancers: ten pre-selected metrics at one-minute frequency, free of charge.
    • Amazon RDS DB instances: thirteen pre-selected metrics at one-minute frequency, free of charge.
    • Amazon SQS queues: eight pre-selected metrics at five-minute frequency, free of charge.
    • Amazon SNS topics: four pre-selected metrics at five-minute frequency, free of charge.
    • Amazon ElastiCache nodes: twenty-nine pre-selected metrics at one-minute frequency, free of charge.
    • Amazon DynamoDB tables: seven pre-selected metrics at five-minute frequency, free of charge.
    • AWS Storage Gateways: eleven pre-selected gateway metrics and five pre-selected storage volume metrics at five-minute frequency, free of charge.
    • Amazon Elastic MapReduce job flows: twenty-three pre-selected metrics at five-minute frequency, free of charge.
    • Auto Scaling groups: seven pre-selected metrics at one-minute frequency, optional and charged at standard pricing.
    • Estimated charges on your AWS bill: you can also choose to enable metrics to monitor your AWS charges.
  • Submitting Custom Metrics generated by your own applications (or by AWS resources not mentioned above) and having them monitored by Amazon CloudWatch. You can submit these metrics to Amazon CloudWatch via a simple Put API request.
  • Setting alarms on any of your metrics to receive notifications or take other automated actions when your metric crosses your specified threshold. You can also use alarms to detect and shut down EC2 instances that are unused or underutilized.
  • Viewing graphs and statistics for any of your metrics, and getting a quick overview of all your alarms and monitored AWS resources in one location on the Amazon CloudWatch dashboard.
  • Using Auto Scaling to add or remove Amazon EC2 instances dynamically based on your Amazon CloudWatch metrics.
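The “simple Put API request” for custom metrics can be sketched as follows. The parameter names mirror the shape the AWS SDK expects for a PutMetricData call, while the namespace and metric values below are made-up examples:

```javascript
// Sketch: build the payload for a CloudWatch PutMetricData request.
// Namespace/metric values are illustrative, not real AWS resources.
function buildPutMetricParams(namespace, metricName, value, unit) {
  return {
    Namespace: namespace,                 // e.g. 'MyApp/Backend'
    MetricData: [{
      MetricName: metricName,             // e.g. 'SignupCount'
      Timestamp: new Date(),              // when the datapoint was observed
      Unit: unit,                         // e.g. 'Count', 'Milliseconds'
      Value: value
    }]
  };
}

// With the AWS SDK for JavaScript this would be sent roughly as:
// new AWS.CloudWatch().putMetricData(buildPutMetricParams(...), callback);
```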

 

 

 
