Dec 20, 2016

My first look at Elasticsearch

Elasticsearch is a document-oriented database. The entire object is stored in a denormalized state, which improves retrieval performance because you do not need to join tables, as with a relational database. On the other hand, updates become a little more complicated, since you have to update every document that contains the data rather than a single referenced table. Relational databases also let you specify constraints (such as foreign keys), which is not the case with document-oriented databases.
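To make the trade-off concrete, here is a hypothetical sketch (the field names and values are made up) of a blog post stored as one denormalized document, comments embedded and all:

```python
# A hypothetical denormalized document: the post and its comments live
# together, so a read needs no join.
post_doc = {
    "title": "My first look at Elasticsearch",
    "author": "anon",
    "comments": [
        {"user": "alice", "text": "Nice intro"},
        {"user": "bob", "text": "Thanks!"},
    ],
}

# In a relational schema the same data would sit in a posts table and a
# comments table joined on post_id. Renaming a user there touches one
# row; here it means updating every document that embeds that user.
comment_count = len(post_doc["comments"])
```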

Elasticsearch is schema-flexible, which means you do not have to specify a schema up front; rather, it is recommended to define a schema (mapping) based on your search needs.

Basic concepts

  • Cluster
  • Node
  • Index - An index is a collection of documents that are somewhat similar. An index is stored as a set of shards, and during a search Elasticsearch merges the results from all searched shards. So if you have documents with only minor differences in structure, it is better to store them in a single index with multiple types, which results in fewer merges. But if the structures are completely different, it is better to use separate indices, because of the way Elasticsearch stores data: fields that exist in one type also consume resources for documents of types where that field does not exist. (localhost:9200/my_index)
  • Type - A type is defined for documents that share a set of properties. You can have multiple types in a single index. (localhost:9200/my_index/my_type)
  • Document
  • Shards & Replicas - Elasticsearch provides the ability to subdivide your index into multiple pieces called shards, which horizontally split/scale the data and also help performance by allowing you to distribute and parallelize operations across shards. Replicas are copies of shards that provide failover and extra read capacity.
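As a rough sketch of how a document ends up on a particular shard: Elasticsearch computes something like hash(routing) % number_of_shards (it actually uses a Murmur3 hash of the routing value, usually the document id; md5 below is just a stand-in for a stable hash):

```python
import hashlib

def route_to_shard(doc_id: str, number_of_shards: int) -> int:
    """Toy sketch of primary-shard routing: hash the routing value
    (here the document id) and take it modulo the shard count."""
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return h % number_of_shards

# The same id always routes to the same shard, which is why the number
# of primary shards is fixed at index creation time.
shards = [route_to_shard(f"doc-{i}", 5) for i in range(100)]
```

This also shows why a search has to fan out to every shard and merge the results: any shard may hold matching documents.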

Mapping
Enabled - Elasticsearch tries to index all of the fields you give it, but sometimes you want to just store a field without indexing it, because you never need to query by that field.
Dynamic - By default, fields are dynamically added to a document's mapping. The dynamic setting (true, false, strict) controls whether new fields can be added dynamically or not. Setting dynamic does not alter the content of the source.
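Both settings can be combined in one mapping. Below is a hypothetical sketch (the index, type, and field names are made up) of a mapping body where one field is stored but not indexed and unknown fields are rejected:

```python
import json

# Hypothetical mapping: "session_data" is stored but never indexed
# ("enabled": false), and "dynamic": "strict" makes indexing fail for
# documents that contain fields not declared here.
mapping = {
    "mappings": {
        "my_type": {
            "dynamic": "strict",
            "properties": {
                "title": {"type": "string"},
                "session_data": {"type": "object", "enabled": False},
            },
        }
    }
}

# This JSON is what you would PUT to localhost:9200/my_index.
body = json.dumps(mapping)
```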

Analysis
Analysis is the process of converting text into tokens or terms, which are added to the inverted index for searching. Analyzers are used for both indexing and searching (search-time analysis applies to match queries, not term queries). By default the same analyzer is used for both indexing and searching, but you can configure a different analyzer for search.

An analyzer does the following:

  • A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters.
  • A tokenizer receives the stream of characters, breaks it up into individual tokens, and outputs a stream of tokens, e.g. whitespace, standard (splits on whitespace and punctuation), ngram (with configurable min/max gram lengths).
  • A token filter receives the token stream and may add, remove, or change tokens, e.g. lowercase, stop (drops stopwords such as "the" and "a" so they are not indexed), snowball (a stemmer that removes suffixes such as "ed", "ing", etc.).
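The three stages above can be sketched as a tiny pipeline in plain Python. This is only an illustration of the flow, not Elasticsearch's actual implementation; the filters chosen (an HTML-stripping character filter, a whitespace tokenizer, lowercase and stop token filters) are my own example:

```python
import re

STOPWORDS = {"the", "a", "an", "is"}

def char_filter(text: str) -> str:
    # Character filter: strip HTML-like tags from the raw character stream.
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text: str) -> list:
    # Whitespace tokenizer: break the character stream into tokens.
    return text.split()

def token_filters(tokens: list) -> list:
    # Lowercase filter, then a stop filter that drops common words.
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOPWORDS]

def analyze(text: str) -> list:
    # Full pipeline: character filter -> tokenizer -> token filters.
    return token_filters(tokenize(char_filter(text)))

terms = analyze("The <b>Quick</b> fox is fast")
# terms == ["quick", "fox", "fast"]
```

The resulting terms are what would go into the inverted index, which is why a match query (analyzed) can find "Quick" via the lowercased term while a term query for the exact string "Quick" would not.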
