Installation

In this section, we will install Sphinx.

To install Sphinx, run:

sudo apt-get install sphinxsearch

Now you have successfully installed Sphinx on your server. Before starting the Sphinx daemon, let’s configure it.
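
If you want to confirm that the package installed correctly (an optional check, not part of the original steps), you can ask dpkg about it:

dpkg -l sphinxsearch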

Creating the Database For File Index and Search

In this section, we will set up a database and create a table that holds the paths of the text files that we want Sphinx to index and search.

Log in to the MySQL server shell.

mysql -u root -p

Enter the password for the MySQL root user when asked. Your prompt will change to mysql>.

Create a database named sphinx_index and switch to it:

CREATE DATABASE sphinx_index;
USE sphinx_index;

Create a table to hold the file paths:

CREATE TABLE fileindex (id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY, text VARCHAR(100) NOT NULL);

Add a file path to the fileindex table:

INSERT INTO fileindex (text) VALUES ('/path/to/files');
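
Optionally, you can verify that the row was stored before leaving the shell (this check is not part of the original steps):

SELECT id, text FROM fileindex;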

Then exit the MySQL shell.

Configuring Sphinx

In this section, we will configure the Sphinx configuration file.

Create the sphinx.conf file.

sudo gedit /etc/sphinxsearch/sphinx.conf

Sphinx configuration consists of 3 main blocks that are essential to run. They are index, searchd, and source. Each of these blocks is described below, and at the end of this step, the entirety of sphinx.conf is included for you to paste into the file.

The source block contains the source type and the username and password for the MySQL server. The first column of the SQL query should be a unique ID. The SQL query runs on every indexing pass and dumps the data to the Sphinx index file. Below are descriptions of each field and the source block itself.

  • sql_host: Hostname for the MySQL host. In our example, this is the localhost. This can be a domain or IP address.
  • sql_user: Username for the MySQL login. In our example, this is root.
  • sql_pass: Password for the MySQL user. In our example, this is the root MySQL user’s password.
  • sql_db: Name of the database that stores the data. In our example, this is sphinx_index.
  • sql_query: The query that dumps data into the Sphinx index. The first column must be a unique document ID.
  • sql_query_pre: Pre-fetch query, or pre-query. These are used to set up encoding, mark records that are going to be indexed, update internal counters, and set various per-connection SQL server options and variables. Perhaps the most frequent pre-query usage is to specify the encoding that the server will use for the rows it returns. Note that Sphinx accepts only UTF-8 text.
  • sql_field_string: Combined string attribute and full-text field declaration.
  • sql_file_field: Reads document contents from the file system instead of the database. This offloads the database, prevents cache thrashing on the database side, and can be much faster in some cases.
source src1
{
type = mysql
sql_host = localhost
sql_user = root
sql_pass = 3337033
sql_db = sphinx_index
sql_port = 3306 # optional, default is 3306
sql_query_pre = SET CHARACTER_SET_RESULTS=utf8
sql_query_pre = SET NAMES utf8
sql_query = SELECT id,text from fileindex
sql_file_field = text
sql_field_string = text

}

The index component contains the source and the path to store the data.

  • source: Name of the source block. In our example, this is src1.
  • path: The path where the index files are saved.
  • docinfo: Document attribute values (docinfo) storage mode. Optional, default is ‘extern’. Known values are ‘none’, ‘extern’ and ‘inline’.
index filename
{
source = src1
path = /var/lib/sphinxsearch/data/files
docinfo = extern
}

The searchd component contains the port and other variables to run the Sphinx daemon.

  • listen: The port on which the Sphinx daemon listens. In our example, this is 9312.
  • query_log: The path to the query log.
  • pid_file: The path to the PID file of the Sphinx daemon.
  • max_matches: Maximum number of matches to return per search term.
  • seamless_rotate: Prevents searchd stalls while rotating indexes with huge amounts of data to precache. Optional, default is 1 (enable seamless rotation).
  • preopen_indexes: Whether to forcibly preopen all indexes on startup.
  • unlink_old: Whether to unlink old index copies on successful rotation.
  • log: Log file name. Optional, default is ‘searchd.log’.
  • read_timeout: Network client request read timeout, in seconds.
  • max_children: Maximum number of children to fork (or, in other words, concurrent searches to run in parallel). Optional, default is 0 (unlimited).
  • binlog_path: Binary log (aka transaction log) files path. Optional, default is the build-time configured data directory.
searchd
{
listen = 9312
log = /var/log/sphinxsearch/searchd.log
query_log = /var/log/sphinxsearch/query.log
read_timeout = 5
max_children = 30
pid_file = /var/run/sphinxsearch/searchd.pid
max_matches = 1000
seamless_rotate = 1
preopen_indexes = 1
unlink_old = 1
binlog_path = /var/lib/sphinxsearch/data

}

Adding Data to the Index

In this section, we’ll add data to the Sphinx index.

Add data to the index using the configuration we created earlier:

sudo indexer --all --rotate

You should get something that looks like the following.

Sphinx 2.2.10-id64-release (2c212e0)
Copyright (c) 2001-2015, Andrew Aksyonoff
Copyright (c) 2008-2015, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/etc/sphinxsearch/sphinx.conf'...
WARNING: key 'max_matches' was permanently removed from Sphinx configuration. Refer to documentation for details.
indexing index 'filename'...
collected 1 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 1 docs, 41896 bytes
total 0.073 sec, 566705 bytes/sec, 13.52 docs/sec
total 8 reads, 0.014 sec, 9.0 kb/call avg, 1.8 msec/call avg
total 12 writes, 0.000 sec, 4.6 kb/call avg, 0.0 msec/call avg
rotating indices: successfully sent SIGHUP to searchd (pid=1087).

Starting Sphinx

First, open /etc/default/sphinxsearch to check whether the Sphinx daemon is turned on or off.

sudo nano /etc/default/sphinxsearch

To enable Sphinx, find the line START and set it to yes.

START=yes

Then, save and close the file.

Finally, start the Sphinx daemon.

sudo service sphinxsearch start
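
To confirm that the daemon is listening on the configured port (an optional check, not part of the original steps):

sudo netstat -tlnp | grep 9312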

Testing Search

To search the indexed content, we use one of the official native SphinxAPI implementations for PHP, Perl, Python, Ruby, or Java, or a third-party API port or plugin for Perl, C#, Haskell, or Ruby on Rails.
The official native SphinxAPIs are included in the distribution package.
Download the Sphinx source from GitHub and change into its API directory:

cd API

Create Search.py with the following contents:

from sphinxapi import *

client = SphinxClient()
client.SetServer('127.0.0.1', 9312)
result = client.Query('text to search')
print result

Execute this script with Python 2.7. If the search is successful, the printed result should look like the following:

{'status': 0, 'matches': [{'id': 1, 'weight': 2500, 'attrs': {'text': '/home/arf/Downloads/ElasticSearch.md'}}], 'fields': ['text'], 'time': '0.000', 'total_found': 1, 'warning': '', 'attrs': [['text', 7]], 'words': [{'docs': 1, 'hits': 1, 'word': 'text'}, {'docs': 1, 'hits': 159, 'word': 'to'}, {'docs': 1, 'hits': 32, 'word': 'search'}], 'error': '', 'total': 1}

If the search finds no matches, the result should look like the following:

{'status': 0, 'matches': [], 'fields': ['text'], 'time': '0.000', 'total_found': 0, 'warning': '', 'attrs': [['text', 7]], 'words': [{'docs': 0, 'hits': 0, 'word': 'eqfc'}], 'error': '', 'total': 0}

Other Descriptions

Sphinx works well with database systems for indexing and searching fields.
Unfortunately, Sphinx can’t index .doc and .pdf file types directly. You’ll need to either import the textual contents into a database or convert them into an XML format that Sphinx can understand.

References

Sphinx Documentation
How To Install and Configure Sphinx on Ubuntu 14.04
How to index plain text files for search in Sphinx
Sphinx Search Engine & Python API
Full-Text Search with Sphinx and PHP
Indexing Word Documents and PDFs with Sphinx


Installation

Elasticsearch requires Java 7. Check your Java installation:

java -version
echo $JAVA_HOME

Download package

curl -L -O https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.4.5.tar.gz

Then extract it as follows

tar -xvf elasticsearch-1.4.5.tar.gz

It will then create a bunch of files and folders in your current directory. We then go into the bin directory as follows:

cd elasticsearch-1.4.5/bin

And now we are ready to start our node and single cluster

./elasticsearch

If everything goes well, you should see a bunch of messages that look like below:

./elasticsearch
[2014-03-13 13:42:17,218][INFO ][node ] [New Goblin] version[1.4.5], pid[2085], build[5c03844/2014-02-25T15:52:53Z]
[2014-03-13 13:42:17,219][INFO ][node ] [New Goblin] initializing ...
[2014-03-13 13:42:17,223][INFO ][plugins ] [New Goblin] loaded [], sites []
[2014-03-13 13:42:19,831][INFO ][node ] [New Goblin] initialized
[2014-03-13 13:42:19,832][INFO ][node ] [New Goblin] starting ...
[2014-03-13 13:42:19,958][INFO ][transport ] [New Goblin] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.8.112:9300]}
[2014-03-13 13:42:23,030][INFO ][cluster.service] [New Goblin] new_master [New Goblin][rWMtGj3dQouz2r6ZFL9v4g][mwubuntu1][inet[/192.168.8.112:9300]], reason: zen-disco-join (elected_as_master)
[2014-03-13 13:42:23,100][INFO ][discovery ] [New Goblin] elasticsearch/rWMtGj3dQouz2r6ZFL9v4g
[2014-03-13 13:42:23,125][INFO ][http ] [New Goblin] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.8.112:9200]}
[2014-03-13 13:42:23,629][INFO ][gateway ] [New Goblin] recovered [1] indices into cluster_state
[2014-03-13 13:42:23,630][INFO ][node ] [New Goblin] started

We can override either the cluster name or the node name. This can be done from the command line when starting Elasticsearch as follows:

./elasticsearch --cluster.name my_cluster_name --node.name my_node_name

By default, Elasticsearch uses port 9200 to provide access to its REST API. This port is configurable if necessary.
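
As a quick check (not part of the original text), once the server is up the root endpoint returns basic node and version information:

curl http://localhost:9200/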

Introduction

ElasticSearch can be used both as a search engine and as a data store.

Node and cluster

Every instance of ElasticSearch is called a node. Several nodes are grouped in a cluster.
This is the base of the cloud nature of ElasticSearch.
To join two or more nodes in a cluster, the following rules must be observed:

  • The version of ElasticSearch must be the same (v0.20, v0.90, v1.4, and so on) or the join is rejected.
  • The cluster name must be the same.
  • The network must be configured to support multicast discovery (it is configured to do so by default) so that the nodes can communicate with each other. (A minimal configuration sketch follows this list.)
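
Both the cluster name and the node name can also be set persistently in config/elasticsearch.yml instead of on the command line. A minimal sketch, reusing the example names from the startup command shown earlier:

cluster.name: my_cluster_name
node.name: my_node_name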

Plugins

ElasticSearch natively provides a large set of functionalities that can be extended with additional plugins.
During a node startup, a lot of required services are automatically started. The most important are:

  • Cluster services: These manage the cluster state, intra-node communication, and synchronization.
  • Indexing Service: This manages all indexing operations, initializing all active indices and shards.
  • Mapping Service: This manages the document types stored in the cluster (we’ll discuss mapping in Chapter 3, Managing Mapping).
  • Network Services: These are services such as HTTP REST services (default on port 9200), internal ES protocol (port 9300) and the Thrift server (port 9500), applicable only if the Thrift plugin is installed.
  • Plugin Service: This enables us to enhance the basic ElasticSearch functionality in a customizable manner. (It’s discussed in Chapter 2, Downloading and Setting Up, for installation and Chapter 12, Plugin Development, for detailed usage.)
  • River Service: It is a pluggable service running within ElasticSearch cluster, pulling data (or being pushed with data) that is then indexed into the cluster. (We’ll see it in Chapter 8, Rivers.)
  • Language Scripting Services: They allow you to add new language scripting support to ElasticSearch.

Mapping

Our main data container is called index (plural indices) and it can be considered as a database in the traditional SQL world. In an index, the data is grouped into data types called mappings in ElasticSearch. A mapping describes how the records are composed (fields).
Every record that must be stored in ElasticSearch must be a JSON object.
Natively, ElasticSearch is a schema-less data store; when you insert records, it processes them, splits them into fields, and updates the schema to manage the inserted data.
To manage huge volumes of records, ElasticSearch uses the common approach of splitting an index into multiple shards so that they can be spread over several nodes. Shard management is transparent to the user; all common record operations are managed automatically in the ElasticSearch application layer. Every record is stored in only one shard; the sharding algorithm is based on the record ID, so many operations that require loading and changing of records/objects can be done without hitting all the shards, only the shard (and its replicas) that contains your object. The following table compares the ElasticSearch structure with its SQL and MongoDB counterparts:

ElasticSearch         SQL              MongoDB
Index (Indices)       Database         Database
Shard                 Shard            Shard
Mapping/Type          Table            Collection
Field                 Field            Field
Object (JSON Object)  Record (Tuples)  Record (BSON Object)

Index

An index can have one or more replicas; the shards are called primary if they are part of the primary replica, and secondary ones if they are part of replicas.
To maintain consistency in write operations, the following workflow is executed:

  • The write operation is first executed in the primary shard
  • If the primary write is successfully done, it is propagated simultaneously to all the secondary shards
  • If a primary shard becomes unavailable, a secondary one is elected as primary (if available) and then the flow is re-executed

During search operations, if there are replicas, a valid set of shards is chosen randomly between primary and secondary shards to improve performance. ElasticSearch has several allocation algorithms to better distribute shards on nodes. For reliability, replicas are allocated in a way that if a single node becomes unavailable, there is always at least one replica of each shard still available on the remaining nodes. (A quick way to inspect shard allocation follows.)
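
To see how primary and replica shards are currently allocated, you can query the cluster health API (a supplementary check, not part of the original text); the response includes fields such as active_primary_shards and active_shards:

curl 'http://localhost:9200/_cluster/health?pretty=true'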

To create a mapping, perform the following steps:
1. You can create a mapping implicitly by adding a new document to ElasticSearch:
On a Linux shell:

#create an index
curl -XPUT http://127.0.0.1:9200/test
#{acknowledged:true}
#put a document
curl -XPUT http://127.0.0.1:9200/test/mytype/1 -d '{"name":"Paul", "age":35}'
#{"ok":true,"_index":"test","_type":"mytype","_id":"1","_version":1}
#get the mapping and pretty print it
curl -XGET http://127.0.0.1:9200/test/mytype/_mapping?pretty=true
  2. This is how the resulting mapping, auto-created by ElasticSearch, should look:
{
"mytype" : {
"properties" : {
"age" : {
"type" : "long"
},
"name" : {
"type" : "string"
}
}
}
}

Array

Every field is automatically managed as an array. For example, in order to store tags for a
document, this is how the mapping must be:

{
"document" : {
"properties" : {
"name" : {"type" : "string", "index":"analyzed"},
"tag" : {"type" : "string", "store" : "yes" , "index":"not_analyzed"},...
...
}
}
}

This mapping is valid for indexing this document:

{"name": "document1", "tag": "awesome"}

It can also be used for the following document:

{"name": "document2", "tag": ["cool", "awesome", "amazing"]}

Mapping an Object

You can rewrite the mapping of the order type from the Mapping base types recipe using an array of items:

{
"order" : {
"properties" : {
"id" : {"type" : "string", "store" : "yes", "index":"not_analyzed"},
"date" : {"type" : "date", "store" : "no", "index":"not_analyzed"},
"customer_id" : {"type" : "string", "store" : "yes", "index":"not_analyzed"},
"sent" : {"type" : "boolean", "store" : "no", "index":"not_analyzed"},
"item" : {
"type" : "object",
"properties" : {
"name" : {"type" : "string", "store" : "no", "index":"analyzed"},
"quantity" : {"type" : "integer", "store" : "no", "index":"not_analyzed"},
"vat" : {"type" : "double", "store" : "no", "index":"not_analyzed"}
}
}
}
}
}

ElasticSearch speaks native JSON, so every complex JSON structure can be mapped into it.
When ElasticSearch is parsing an object type, it tries to extract fields and processes them as
its defined mapping; otherwise it learns the structure of the object using reflection.
The following are the most important attributes for an object:
* properties : This is a collection of fields or objects (we consider them as columns in the SQL world).
* enabled : This is enabled if the object needs to be processed. If it’s set to false , the data contained in the object is not indexed as it cannot be searched (the default value is true ).
* dynamic : This allows ElasticSearch to add new field names to the object using reflection on the values of inserted data (the default value is true ). If it’s set to false, when you try to index an object containing a new field type, it’ll be rejected silently. If it’s set to strict , when a new field type is present in the object, an error is raised and the index process is skipped. Controlling the dynamic parameter allows you to be
safe about changes in the document structure.
* include_in_all : This adds the object values to the special _all field, which is used to aggregate the text of all the document fields (the default value is true ). The most-used attribute is properties , which allows you to map the fields of the object to ElasticSearch fields. Disabling the indexing part of the document reduces the index size; however, the data cannot be searched. In other words, you end up with a smaller file on disk, but there is a cost in functionality. (A short illustrative mapping follows this list.)
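
For illustration only (the type and field names below are hypothetical, not from the original recipes), a mapping that rejects documents containing unknown fields and disables indexing of one sub-object could be put like this:

curl -XPUT 'http://localhost:9200/myindex/order_meta/_mapping' -d '
{
  "order_meta" : {
    "dynamic" : "strict",
    "properties" : {
      "note" : {"type" : "string", "index" : "analyzed"},
      "raw_payload" : {"type" : "object", "enabled" : false}
    }
  }
}'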

Insert Index

An index is similar to the database concept in SQL: a container for types (like tables in SQL) and for documents (like records in SQL).

The URL format for creating an index is:

http://<server>/<index_name>

To create an index, we will perform the following steps:
1. Using the command line, we can execute a PUT call:

curl -XPUT http://127.0.0.1:9200/myindex -d '
{
"settings" : {
"index" : {
"number_of_shards" : 2,
"number_of_replicas" : 1
}
}
}'
  2. The result returned by ElasticSearch, if everything goes well, should be:

    {"acknowledged":true}

  3. If the index already exists, then a 400 error is returned:

    {"error":"IndexAlreadyExistsException[[myindex] Already exists]","status":400}

Settings:

"settings" : {
"number_of_shards" : 2,
"number_of_replicas" : 1
}

During index creation, the replication can be set with two parameters in the settings/index object:

  • number_of_shards : This controls the number of shards that compose the index (every shard can store up to 2^32 documents).
  • number_of_replicas : This controls the number of replicas (how many times your data is replicated in the cluster for high availability). A good practice is to set this value to at least 1.

Open and close Index

If you want to keep your data but save resources (memory/CPU), a good alternative to deleting an index is to close it.
ElasticSearch allows you to open or close an index to put it in online/offline mode.

For opening/closing an index, we will perform the following steps:
1. From the command line, we can execute a POST call to close an index:

curl -XPOST http://127.0.0.1:9200/myindex/_close
  2. If the call is successful, the result returned by ElasticSearch should be:
{"acknowledged":true}
  3. To open an index from the command line, enter:
curl -XPOST http://127.0.0.1:9200/myindex/_open
  4. If the call is successful, the result returned by ElasticSearch should be:
{"acknowledged":true}

PUT Mapping

This recipe shows how to put a type of mapping in an index. This kind of operation can be considered the ElasticSearch version of an SQL create table.

The HTTP method for putting a mapping is PUT (POST also works).
The URL format for putting a mapping is:

http://<server>/<index_name>/<type_name>/_mapping

To put a mapping in an Index, we will perform the following steps:
1. If we consider the type order of the previous chapter, the call will be:

curl -XPUT http://localhost:9200/myindex/order/_mapping -d '
{
"order" : {
"properties" : {
"id" : {"type" : "string", "store" : "yes", "index":"not_analyzed"},
"date" : {"type" : "date", "store" : "no" , "index":"not_analyzed"},
"customer_id" : {"type" : "string", "store" : "yes" , "index":"not_analyzed"},
"sent" : {"type" : "boolean", "index":"not_analyzed"},
"name" : {"type" : "string", "index":"analyzed"},
"quantity" : {"type" : "integer", "index":"not_analyzed"},
"vat" : {"type" : "double", "index":"no"}
}
}
}'
  2. If successful, the result returned by ElasticSearch should be:
{"acknowledged":true}

Getting a mapping

The HTTP method to get a mapping is GET.
The URL formats for getting a mapping are:

http://<server>/_mapping
http://<server>/<index_name>/_mapping
http://<server>/<index_name>/<type_name>/_mapping

To get a mapping from the type of an index, we will perform the following steps:
1. If we consider the type order of the previous chapter, the call will be:

curl -XGET 'http://localhost:9200/myindex/order/_mapping?pretty=true'

The pretty argument in the URL will pretty print the response output.
2. The result returned by ElasticSearch should be:

{
"myindex" : {
"mappings" : {
"order" : {
"properties" : {
"customer_id" : {
"type" : "string",
"index" : "not_analyzed",
"store" : true
},
... truncated
}
}
}
}
}

The mapping is stored at the cluster level in ElasticSearch. The call checks both index and
type existence, and then returns the stored mapping.

Refreshing an index

ElasticSearch allows the user to control the state of the searcher using forced refresh on an index. If not forced, the new indexed document will only be searchable after a fixed time interval (usually 1 second).

The URL formats for refreshing an index are:

http://<server>/<index_name(s)>/_refresh

The URL format for refreshing all the indices in a cluster is:

http://<server>/_refresh

The HTTP method used for both URLs is POST.
To refresh an index, we will perform the following steps:
1. If we consider the type order of the previous chapter, the call will be:

curl -XPOST 'http://localhost:9200/myindex/_refresh'
  2. The result returned by ElasticSearch should be:
{"_shards":{"total":4,"successful":2,"failed":0}}

Flushing an index

ElasticSearch, for performance reasons, stores some data in memory and on a transaction log. If we want to free memory, empty the transaction log, and be sure that our data is safely written on disk, we need to flush an index.
ElasticSearch automatically provides a periodic disk flush, but forcing a flush can be useful, for example:

  • When we have to shut down a node, to prevent stale data
  • To have all the data in a safe state (for example, after a big indexing operation to have all the data flushed and refreshed)

The HTTP method used for the URL operations is POST.
The URL format for flushing an index is:

http://<server>/<index_name(s)>/_flush[?refresh=True]

The URL format for flushing all the indices in a Cluster is:

http://<server>/_flush[?refresh=True]

To flush an index, we will perform the following steps:
1. If we consider the type order of the previous chapter, the call will be:

curl -XPOST 'http://localhost:9200/myindex/_flush?refresh=True'
  2. The result returned by ElasticSearch, if everything goes well, should be:
{"_shards":{"total":4,"successful":2,"failed":0}}

The result contains the shard operation status.

Optimizing an index

The core of ElasticSearch is based on Lucene, which stores the data in segments on the disk. During the life of an Index, a lot of segments are created and changed. With the increase of segment numbers, the speed of search decreases due to the time required to read all of them. The optimize operation allows us to consolidate the index for faster search
performance, reducing segments.

The URL format to optimize one or more indices is:

http://<server>/<index_name(s)>/_optimize

The URL format to optimize all the indices in a cluster is:

http://<server>/_optimize

The HTTP method used is POST.
To optimize an index, we will perform the following steps:
1. If we consider the Index created in the Creating an index recipe, the call will be:

curl -XPOST 'http://localhost:9200/myindex/_optimize'
  2. The result returned by ElasticSearch should be:
{"_shards":{"total":4,"successful":2,"failed":0}}

The result contains the shard operation status.

Checking if an index or type exists

A common pitfall is to query for indices and types that don’t exist. To prevent this issue, ElasticSearch gives the user the ability to check for index and type existence.
This check is often used during an application startup to create indices and types that are required for it to work correctly.

The HTTP method to check the existence is HEAD.
The URL format for checking an index is:

http://<server>/<index_name>/

The URL format for checking a type is:

http://<server>/<index_name>/<type>/

To check if an index exists, we will perform the following steps:
1. If we consider the index created in the Creating an index recipe in this chapter, the call will be:

curl -i -XHEAD 'http://localhost:9200/myindex/'

The -i curl option dumps the server headers.
2. If the index exists, an HTTP status code 200 is returned. If missing, then a 404 error is returned.

To check if a type exists, we will perform the following steps:

  1. If we consider the mapping created in the putting a mapping in an index recipe (in this chapter), the call will be:
curl -i -XHEAD 'http://localhost:9200/myindex/order/'
  2. If the type exists, an HTTP status code 200 is returned. If missing, then a 404 error is returned.

Managing index settings

Index settings are important because they allow us to control several important ElasticSearch functionalities such as sharding/replicas, caching, term management, routing, and analysis.

To manage the index settings, we will perform the steps given as follows:
1. To retrieve the settings of your current Index, the URL format is the following:

http://<server>/<index_name>/_settings
  2. We are reading information via the REST API, so the method will be GET, and an example of a call using the index created in the Creating an index recipe is:
curl -XGET 'http://localhost:9200/myindex/_settings'
  3. The response will be something similar to:
{
"myindex" : {
"settings" : {
"index" : {
"uuid" : "pT65_cn_RHKmg1wPX7BGjw",
"number_of_replicas" : "1",
"number_of_shards" : "2",
"version" : {
"created" : "1020099"
}
}
}
}
}

The response attributes depend on the index settings. In this case, the response shows the number of replicas (1), the number of shards (2), and the index creation version (1020099). The UUID represents the unique ID of the index.
4. To modify the index settings, we need to use the PUT method. A typical settings change is to increase the replica number:

curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{"index":{ "number_of_replicas": "2"}}'

ElasticSearch provides a lot of options to tune the index behavior, such as:

  • Replica management:

  • index.number_of_replicas : This is the number of replicas each shard has

  • index.auto_expand_replicas : This parameter allows us to define a dynamic number of replicas related to the number of nodes. Setting index.auto_expand_replicas to 0-all allows us to create an index that is replicated on every node (very useful for settings or cluster-propagated data such as language options/stopwords).

  • Refresh interval (by default 1s ): In the previous recipe, Refreshing an index, we saw how to manually refresh an index. The index settings ( index.refresh_interval ) control the rate of automatic refresh.

  • Cache management: These settings ( index.cache.* ) control the cache size and its life. It is not common to change them (refer to ElasticSearch documentation for all the available options at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-cache.html ).
  • Write management: ElasticSearch provides several settings to block read/write operations in an index and changing metadata. They live in index.blocks settings.
  • Shard allocation management: These settings control how the shards must be allocated. They live in the index.routing.allocation.* namespace.
    There are other index settings that can be configured for very specific needs. In every new version of ElasticSearch, the community extends these settings to cover new scenarios.

There is more…

The refresh_interval parameter provides several tricks to optimize indexing speed. It controls the rate of refresh, and refreshing reduces index performance due to the opening and closing of files. A good practice is to disable the refresh interval (set it to -1) during a big indexing bulk and restore the default behavior after it. This can be done with the following steps:
1. Disabling the refresh:

curl -XPOST 'http://localhost:9200/myindex/_settings' -d '
{"index":{"refresh_interval": "-1"}}'
  2. Bulk indexing some millions of documents
  3. Restoring the refresh:
curl -XPOST 'http://localhost:9200/myindex/_settings' -d '
{"index":{"refresh_interval": "1s"}}'
  4. Optionally, optimizing the index for search performance:
curl -XPOST 'http://localhost:9200/myindex/_optimize'

Using index aliases

Real-world applications have a lot of indices and queries that span multiple indices.
This scenario requires defining all the index names on which we need to perform queries; aliases allow us to group them under a common name.
Some common scenarios of this usage are:
* Log indices divided by date (such as log_YYMMDD ) for which we want to create an alias for the last week, the last month, today, yesterday, and so on. This pattern is commonly used in log applications such as logstash ( http://logstash.net/ ).
* Collecting website contents in several indices (New York Times, The Guardian, and so on) that we want to refer to with a single index alias called sites.

The URL formats for controlling aliases are:

http://<server>/_aliases
http://<server>/<index>/_alias/<alias_name>

To manage the index aliases, we will perform the following steps:
1. We need to read the status of the aliases for all indices via the REST API, so the method will be GET, and an example of a call is:

curl -XGET 'http://localhost:9200/_aliases'
  2. It should give a response similar to this:
{
"myindex": {
"aliases": {}
},
"test": {
"aliases": {}
}
}

Aliases can be changed with add and delete commands (a combined-actions example is shown at the end of this recipe).
3. To read an alias for a single Index, we use the _alias endpoint:
curl -XGET 'http://localhost:9200/myindex/_alias'
The result should be:

{
"myindex" : {
"aliases" : {
"myalias1" : { }
}
}
}
4. To add an alias:
curl -XPUT 'http://localhost:9200/myindex/_alias/myalias1'

The result should be:

{"acknowledged":true}

This action adds the myindex index to the myalias1 alias.
5. To delete an alias:

curl -XDELETE 'http://localhost:9200/myindex/_alias/myalias1'

The result should be:

{"acknowledged":true}

The delete action has now removed myindex from the alias myalias1 .
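
The _aliases endpoint also accepts a list of actions, so an alias can be moved from one index to another in a single call. A minimal sketch, reusing the example index and alias names from this recipe:

curl -XPOST 'http://localhost:9200/_aliases' -d '
{
  "actions" : [
    { "remove" : { "index" : "myindex", "alias" : "myalias1" } },
    { "add" : { "index" : "test", "alias" : "myalias1" } }
  ]
}'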

Indexing a document

In ElasticSearch, there are two vital operations, namely indexing and searching.
Indexing means inserting one or more documents into an index; this is similar to the insert command of a relational database.
In Lucene, the core engine of ElasticSearch, inserting or updating a document has the same cost. In Lucene and ElasticSearch, update means replace.

To index a document, several REST entry points can be used:

Method    URL
POST      http://<server>/<index_name>/<type>
PUT/POST  http://<server>/<index_name>/<type>/<id>
PUT/POST  http://<server>/<index_name>/<type>/<id>/_create

We will perform the following steps:
1. If we consider the type order mentioned in earlier chapters, the call to index a
document will be:

curl -XPOST 'http://localhost:9200/myindex/order/2qLrAfPVQvCRMe7Ku8r0Tw' -d '{
"id" : "1234",
"date" : "2013-06-07T12:14:54",
"customer_id" : "customer1",
"sent" : true,
"in_stock_items" : 0,
"items":[
{"name":"item1", "quantity":3, "vat":20.0},
{"name":"item2", "quantity":2, "vat":20.0},
{"name":"item3", "quantity":1, "vat":10.0}
]
}'
  2. If the index operation is successful, the result returned by ElasticSearch should be:
{
"_index":"myindex",
"_type":"order",
"_id":"2qLrAfPVQvCRMe7Ku8r0Tw",
"_version":1,
"created":true
}

Some additional information is returned from an indexing operation such as:
* An auto-generated ID, if not specified
* The version of the indexed document as per the Optimistic Concurrency Control
* Information if the record has been created

ElasticSearch allows the passing of several query parameters in the index API URL for
controlling how the document is indexed. The most commonly used ones are:
* routing : This controls the shard to be used for indexing, for example:

curl -XPOST 'http://localhost:9200/myindex/order?routing=1'
  • parent : This defines the parent of a child document and uses this value to apply
    routing. The parent object must be specified in the mappings, such as:
curl -XPOST 'http://localhost:9200/myindex/order?parent=12'
  • timestamp : This is the timestamp to be used in indexing the document. It must be activated in the mappings, such as in the following:
curl -XPOST 'http://localhost:9200/myindex/order?timestamp=2013-01-25T19%3A22%3A22'
  • consistency ( one / quorum / all ): By default, an index operation succeeds only if a quorum (>replicas/2+1) of active shards is available. The write consistency value can be changed for indexing:
curl -XPOST 'http://localhost:9200/myindex/order?consistency=one'
  • replication ( sync / async ) : ElasticSearch returns replication from an index operation when all the shards of the current replication group have executed the operation. Setting the replication async allows us to execute the index synchronously
    only on the primary shard and asynchronously on other shards, returning from the call faster.
    curl -XPOST 'http://localhost:9200/myindex/order?replication=async'
  • version : This allows us to use the Optimistic Concurrency Control ( http://en.wikipedia.org/wiki/Optimistic_concurrency_control ). At first, in the indexing of a document, version is set as 1 by default. At every update, this value is incremented. Optimistic Concurrency Control is a way to manage concurrency in every insert or update operation. The already passed version value is the last seen version (usually returned by a GET or a search). The indexing happens only if the current index version value is equal to the passed one:
curl -XPOST 'http://localhost:9200/myindex/order?version=2'
  • op_type : This can be used to force a create on a document. If a document with the same ID exists, the index operation fails.
curl -XPOST 'http://localhost:9200/myindex/order?op_type=create'...
  • refresh : This forces a refresh after having the document indexed. It allows us to have the documents ready for search after indexing them:
curl -XPOST 'http://localhost:9200/myindex/order?refresh=true'...
  • ttl : This allows defining a time to live for a document. All documents in which the ttl has expired are deleted and purged from the index. This feature is very useful to define records with a fixed life. It only works if ttl is explicitly enabled in mapping.
    The value can be a date-time or a time value (a numeric value ending with s , m , h , d ).
    The following is the command:
curl -XPOST 'http://localhost:9200/myindex/order?ttl=1d'
  • timeout : This defines the time to wait for the primary shard to become available. Sometimes the primary shard is not in a writable state (it is relocating or recovering from a gateway), and by default the write operation times out after 1 minute.
curl -XPOST 'http://localhost:9200/myindex/order?timeout=5m' ...

Getting a document

After having indexed a document during your application life, it most likely will need to be retrieved.
The GET REST call allows us to get a document in real time without the need of a refresh.

The GET method allows us to return a document given its index, type and ID.
The REST API URL is:

http://<server>/<index_name>/<type>/<id>

To get a document, we will perform the following steps:
1. If we consider the document we indexed in the previous recipe, the call will be:

curl -XGET 'http://localhost:9200/myindex/order/2qLrAfPVQvCRMe7Ku8r0Tw?pretty=true'
  2. The result returned by ElasticSearch should be the indexed document:
{
"_index":"myindex",
"_id":"2qLrAfPVQvCRMe7Ku8r0Tw",
"_version":1,
"found":true,
"_source": {
"id" : "1234",
"date" : "2013-06-07T12:14:54",
"customer_id" : "customer1",
"sent" : true,
"items":[
{"name":"item1", "quantity":3, "vat":20.0},
{"name":"item2", "quantity":2, "vat":20.0},
{"name":"item3", "quantity":1, "vat":10.0}
]
}
}

Our indexed data is contained in the _source parameter, but other information is returned as well:

  • _index : This is the index that stores the document
  • _type : This denotes the type of the document
  • _id : This denotes the ID of the document
  • _version : This denotes the version of the document
  • found : This denotes if the document has been found
  3. If the record is missing, a 404 error is returned as the status code and the returned JSON will be:
{
"_id": "2qLrAfPVQvCRMe7Ku8r0Tw",
"_index": "myindex",
"_type": "order",
"found": false
}

ElasticSearch GET API doesn’t require a refresh on the document. All the GET calls are in real time. This call is fast because ElasticSearch is implemented to search only on the shard that contains the record without other overhead. The IDs are often cached in memory for faster lookup.
The source of the document is only available if the _source field is stored (the default settings in ElasticSearch).
There are several additional parameters that can be used to control the GET call:

  • fields : This allows us to retrieve only a subset of fields. This is very useful to reduce
    bandwidth or to retrieve calculated fields such as the attachment mapping ones:
curl 'http://localhost:9200/myindex/order/2qLrAfPVQvCRMe7Ku8r0Tw?fields=date,sent'
  • routing : This allows us to specify the shard to be used for the GET operation. To retrieve a document that was indexed with a routing value, the same routing value must be supplied at GET time:
curl 'http://localhost:9200/myindex/order/2qLrAfPVQvCRMe7Ku8r0Tw?routing=customer_id'
  • refresh : This allows us to refresh the current shard before doing the GET operation. (It must be used with care because it slows down indexing and introduces some overhead):
curl http://localhost:9200/myindex/order/2qLrAfPVQvCRMe7Ku8r0Tw?refresh=true
  • preference : This allows controlling which shard replica is chosen to execute the GET operation. Generally, ElasticSearch chooses a random shard for the GET call.
    Possible values are listed below, followed by an example call:

  • _primary : This is used for the primary shard.

  • _local : This is used for trying the local shard first and then falling back to a random choice. Using the local shard reduces the bandwidth usage and should generally be used with auto-replicating shards (with the replica set to 0 ).
  • custom value : This is used for selecting shard-related values such as the customer_id , username , and so on.
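
For example (this particular call is illustrative and not from the original text), to prefer the local shard copy when fetching the document:

curl 'http://localhost:9200/myindex/order/2qLrAfPVQvCRMe7Ku8r0Tw?preference=_local'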

Search

The HTTP method used to execute a search is GET (but POST works too), and the REST URL is:

http://<server>/_search
http://<server>/<index_name(s)>/_search
http://<server>/<index_name(s)>/<type_name(s)>/_search

ElasticSearch was born as a search engine. In this recipe, we’ll see that a search in ElasticSearch is not just limited to matching documents; it can also calculate additional information required to improve the search quality.

To execute a search and view the results, perform the following steps:
1. From the command line, execute a search, as follows:

curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{"query":{"match_all":{}}}'

In this case, we have used a match_all query which means that all the documents
are returned. We’ll discuss this kind of query in the Matching all documents recipe in
this chapter.
2. The command, if everything is all right, will return the following result:

{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 1.0,
"hits" : [ {
"_index" : "test-index",
"_type" : "test-type",
"_id" : "1",
"_score" : 1.0,
"_source" : {"position": 1,
"parsedtext": "Joe Testere nice guy",
"name": "Joe Tester",
"uuid": "11111"}
}, {
"_index" : "test-index",
"_type" : "test-type",
"_id" : "2",
"_score" : 1.0,
"_source" : {"position": 2,
"parsedtext": "Bill Testere nice guy",
"name": "BillBaloney",
"uuid": "22222"}
}, {
"_index" : "test-index",
"_type" : "test-type",
"_id" : "3",
"_score" : 1.0,
"_source" : {"position": 3,
"parsedtext": "Bill is not\nnice guy",
"name": "Bill Clinton",
"uuid": "33333"}
}]
}
}

The result contains a lot of information, as follows:

  • took : This is the time, in milliseconds, required to execute the query.
  • timed_out : This indicates whether a timeout has occurred during the search.
    This is related to the timeout parameter of the search. If a timeout occurred,
    you will get partial or no results.
  • _shards : This is the status of the shards, which can be divided into the following:

  • total : This is the total number of shards.

  • successful : This is the number of shards in which the query was successful.
  • failed : This is the number of shards in which the query failed, because
    some error or exception occurred during the query.
  • hits : This represents the results and is composed of the following:
  • total : This is the total number of documents that match the query.
  • max_score : This is the match score of the first document. Usually this is 1 if no match scoring was computed, for example in sorting or filtering.
  • hits : This is a list of the result documents.
    The result document has a lot of fields that are always available and other fields that depend
    on the search parameters. The following are the most important fields:
  • _index : This is the index that contains the document.
  • _type : This is the type of the document.
  • _id : This is the ID of the document.
  • _source : This is the document’s source (the default is returned, but it can be disabled).
  • _score : This is the query score of the document.
  • sort : These are the values that are used to sort, if the documents are sorted.
  • highlight : These are the highlighted segments, if highlighting was requested.
  • fields : This denotes that selected fields can be retrieved without the need to fetch the whole source object. (A further example query follows this list.)
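
As a further illustration (the query text is made up; the index, type, and field names are the ones used in this recipe), a match query restricted to a single field with explicit paging could look like this:

curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?pretty=true' -d '
{
  "from": 0,
  "size": 2,
  "query": {
    "match": {"parsedtext": "nice guy"}
  }
}'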

Highlighting results

When the highlight parameter is passed to the search object, ElasticSearch tries to
execute it on the document’s results.
The highlighting phase, which is after the document fetching phase, tries to extract the
highlight by following these steps:
1. It collects the terms available in the query.
2. It initializes the highlighter with the parameters passed during the query.
3. It extracts the fields we are interested in and tries to load them if they are stored;
otherwise they are taken from the source.
4. It executes the query on a single field in order to detect the more relevant parts.
5. It adds the highlighted fragments that are found in the resulting hit.

In order to search and highlight the results, perform the following steps:
1. From the command line, execute a search with a highlight parameter:

curl -XGET 'http://127.0.0.1:9200/test-index/_search?pretty=true&from=0&size=10' -d '
{
"query": {
"query_string": {"query": "joe"}},
"highlight": {
"pre_tags": ["<b>"],
"fields": {
"parsedtext": {"order": "score"},
"name": {"order": "score"}},
"post_tags": ["</b>"]
}
}'
  2. If everything works all right, the command will return the following result:
{
... truncated ...
"hits" : {
"total" : 1,
"max_score" : 0.44194174,
"hits" : [ {
"_index" : "test-index",
"_type" : "test-type",
"_id" : "1",
"_score" : 0.44194174,
"_source" : {
"position": 1,
"parsedtext": "Joe Testere nice guy",
"name": "JoeTester",
"uuid": "11111"},
"highlight" : {
"name" : [ "<b>Joe</b> Tester" ],
"parsedtext" : [ "<b>Joe</b> Testere nice guy" ]
}
}]
}
}

As you can see, in the results, there is a new field called highlight , which contains the highlighted fields along with an array of fragments.

Install Attachment Plugin

  1. Install Attachment plugin
bin/plugin install elasticsearch/elasticsearch-mapper-attachments/3.0.2

Sometimes, your plugin is not available online, or a standard installation fails, so you need to install your plugin manually.

  1. Copy your ZIP file to the plugins directory of your ElasticSearch home installation.
  2. If the directory named plugins doesn’t exist, create it.
  3. Unzip the contents of the plugin to the plugins directory.
  4. Remove the zip archive to clean up unused files.

You will need to restart ElasticSearch for it to load the plugin.
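
To confirm that the plugin was picked up after the restart (a supplementary check, not part of the original steps), the nodes info API lists the plugins loaded on each node:

curl 'http://localhost:9200/_nodes/plugins?pretty=true'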

Create Attachment Index

The command below creates the attachment index:

curl -XPUT 'http://localhost:9200/attachment'

Create Attachment Map

curl -XPUT 'http://localhost:9200/attachment/person/_mapping' -d '{
"person": {
"properties": {
"file": {
"type": "attachment",
"fields": {
"content": {
"type": "string",
"term_vector":"with_positions_offsets",
"store": true
}
}
}
}
}
}'

Load Attachment File

These commands store the base64-encoded contents of the file in the attach_file variable:

export file_path='/path/to/file'
export attach_file=$(base64 $file_path | perl -pe 's/\n/\\n/g')

Load File Content to Attachment Index

curl -XPUT "http://localhost:9200/attachment/person/1?refresh=true" -d "
{
\"file\": \"$attach_file\"
}"

Search and Highlighting result

To search for a specific string, insert your search term in place of String To Search:

curl -XGET 'http://localhost:9200/attachment/_search' -d '{
"fields": [],
"query": {
"match": {
"file.content": "String To Search"
}
},
"highlight": {
"fields": {
"file.content": {
}
}
}
}'

If you store the content "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg==" in place of $attach_file and search for the terms king or queen, Elasticsearch gives back:

{
"took": 9,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.13561106,
"hits": [
{
"_index": "test",
"_type": "person",
"_id": "1",
"_score": 0.13561106,
"highlight": {
"file.content": [
"\"God Save the &lt;em&gt;Queen&lt;/em&gt;\" (alternatively \"God Save the &lt;em&gt;King&lt;/em&gt;\"\n"
]
}
}
]
}
}

References

ElasticSearch CookBook
ElasticSearch Installation
ElasticSearch Mapper Attachments
Indexing Attachment file to elastic search


Elasticsearch

  • Main Web: http://www.elasticsearch.org/
  • Development URL: https://github.com/elasticsearch/elasticsearch
  • License: Apache 2
  • Environment: Java

Elasticsearch was created in 2010 by Shay Banon after forgoing work on another search solution, Compass, also built on Lucene and created in 2004.

Marketing Points

  • Real time data, analytics
  • Distributed, scaled horizontally. Add nodes for capacity.
  • High availability, reorganizing clusters of nodes.
  • Multi-tenancy. Multiple indices in a cluster, added on the fly.
  • Full text search via Lucene. Most powerful full text search capabilities in any open source product
  • Document oriented. Store structured JSON docs.
  • Conflict management
  • Schema free with the ability to assign specific knowledge at a later time
  • Restful API
  • Document changes are recorded in transaction logs in multiple nodes.
  • Elasticsearch provides a RESTful API endpoint for all requests from all languages.
  • Own Zen Discovery module for cluster management.

Technical Info

  • Built on Lucene
  • Data is stored with PUT and POST requests and retrieved with GET requests. Can check for existence of a document with HEAD requests. JSON documents can be deleted with DELETE requests.
  • Requests can be made with JSON query language rather than a query string.
  • Full text docs are stored in memory. A new option in 1.0 allows for doc values, which are stored on disk.
  • Suggesters are built in to suggest corrections or completions.
  • Plugin system available for custom functionality.
  • Possible admin interface via Elastic-HQ
  • The mapper attachments plugin lets Elasticsearch index file attachments in over a thousand formats (such as PPT, XLS, PDF) using the Apache text extraction library Tika.
  • It is easier to scale with ElasticSearch
  • ElasticSearch had the ability to store and retrieve data, so we did not need a separate data store. The hidden gem behind ES is that it really is a NoSQL-style DB with lots of momentum in the industry. The primary reason for choosing ES was the ability to scale easily: adding servers, rebalancing, ease of use, and cluster management.
  • Elasticsearch was built to be real time from the beginning.

Sphinx

  • Main Web: http://sphinxsearch.com/
  • Development URL: http://sphinxsearch.com/bugs/my_view_page.php
  • License: GPLv2
  • Environment: C++

Sphinx was created in 2001 by Andrew Aksyonoff to solve a personal need for a search solution and has remained a standalone project.

Marketing Points

  • Supports on the fly (real time) and offline batch index creation.
  • Arbitrary attributes can be stored in the index.
  • Can index SQL DBs
  • Can batch index XMLpipe2 and (?) tsvpipe documents
  • 3 different APIs, native libraries provided for SphinxAPI
  • DB like querying features.
  • Can scale horizontally with Percona

Technical Info

  • Real time indexes can only be populated using SphinxQL
  • Disk based indexes can be built from SQL DBs, TSV, or custom XML format.
  • Example PHP API file to be included in projects communicating with Sphinx.
  • Uses fsockopen in PHP to make a connection with the Sphinx service similar to how a MySQL connection would be made.
  • Sphinx is more tightly integrated with MySQL, and was also used for big data
  • Sphinx integrates more tightly with RDBMSs, especially MySQL.
  • Sphinx is designed to only retrieve document ids.
  • The data to be indexed can generally come from very different sources: SQL databases, plain text files, HTML files, mailboxes, and so on. From Sphinx point of view, the data it indexes is a set of structured documents, each of which has the same set of fields and attributes. This is similar to SQL, where each row would correspond to a document, and each column to either a field or an attribute.
  • Depending on what source Sphinx should get the data from, different code is required to fetch the data and prepare it for indexing.
  • Note that the original contents of the fields are not stored in the Sphinx index. The text that you send to Sphinx gets processed, and a full-text index (a special data structure that enables quick searches for a keyword) gets built from that text. But the original text contents are then simply discarded. Sphinx assumes that you store those contents elsewhere anyway.
  • Moreover, it is impossible to fully reconstruct the original text, because the specific whitespace, capitalization, punctuation, etc will all be lost during indexing.
  • Sphinx – C++ based and very fast, but it required a second database to store and retrieve the data. So essentially, we used Sphinx for our search and it would return the document IDs corresponding to the search results. We then queried a key-value store in order to retrieve the actual content. We didn’t like the need to have multiple technologies and wanted a single NoSQL-style DB that had a free-text search index combined. If I recall correctly, the licensing for our commercial use was also a little restrictive.
  • Use Sphinx if you want to search through tons of documents/files real quick. It indexes real fast too. I would recommend not to use it in an app that involves JSON or parsing XML to get the search results. Use it for direct DB searches. It works great on MySQL.
  • Sphinx can’t index document types such as pdf, ppt, doc directly. You’ll need to either import the textual contents into a database, or into an XML format that Sphinx can understand.
  • Sphinx provides language specific wrappers for the API to communicate with the service.
  • Sphinx is definitely designed around a SQL type structure, though it has been modified over time to support other data stores.
  • Decisions like implementing xmlpipe2 and tsvpipe by Sphinx as data sources are somewhat confusing. I think the standard formats offered with Solr and Elasticsearch make more sense.
  • Sphinx started as a batch indexer and moved (rightly) to real time over time. See Sphinx real time caveats.

Solr

  • Main Web: http://lucene.apache.org/solr/
  • Development URL: https://issues.apache.org/jira/browse/SOLR
  • License: Apache 2
  • Environment: Java

Solr was created in 2004 at CNet by Yonik Seeley and granted to the Apache Software Foundation in 2006 to become part of the Lucene project.

Marketing Points

  • Rest-like API
  • Documents added via XML, JSON, CSV, or binary over HTTP.
  • Query with GET and receive XML, JSON, CSV, or binary results.
  • XML configuration
  • Extensible plugin architecture
  • AJAX based admin interface
  • Use apache Zookeeper for cluster management

Technical Info

  • Solr is a web service that is built on top of the Lucene library. You can talk to it over HTTP from any programming language – so you can take advantage of the power of Lucene without having to write any Java code at all. Solr also adds a number of features that Lucene leaves out such as sharding and replication.
  • Solr is near real-time.
    Solr shouldn’t be used to solve real-time problems. For search engines, Solr is pretty much game and works flawlessly.
  • Solr works fine on High Traffic web-applications (I read somewhere that it is not suited for this, but I am backing up that statement). It utilizes the RAM, not the CPU.
  • Solr is highly scalable. Have a look on SolrCloud
  • Solr can be integrated with Hadoop to build distributed applications
  • Solr can index proprietary formats like Microsoft Word, PDF, etc.
  • In Solr you can directly get whole documents with pretty much any kind of data, making it more independent of any external data store and it saves the extra roundtrip
  • Use Solr if you intend to use it in your web-app(example-site search engine). It will definitely turn out to be great, thanks to its API. You will definitely need that power for a web-app.

References

Choosing a stand-alone full-text search server: Sphinx or SOLR?
Comparison of full text search engine – Lucene, Sphinx, Postgresql, MySQL?
Open Source Search Comparison
ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits for which usage?
How do Solr, Lucene, Sphinx and Searchify compare?
Which one do you think is better for a big data website: Solr, ElasticSearch, or Sphinx? Why?
Solr and Elasticsearch, a performance study
Building 50TB-scale search engine with MySQL and Sphinx
