Installation

In this section, we will install Sphinx.

To install Sphinx, run:

sudo apt-get install sphinxsearch

Now you have successfully installed Sphinx on your server. Before starting the Sphinx daemon, let’s configure it.

Creating the Database For File Index and Search

In this section, we will set up a database and create a table that holds the paths of the text files we want Sphinx to index and search.

Log in to the MySQL server shell.

mysql -u root -p

Enter the password for the MySQL root user when asked. Your prompt will change to mysql>.

Create a database named sphinx_index and switch to it:

CREATE DATABASE sphinx_index;
USE sphinx_index;

Create the table that will hold the file paths:

CREATE TABLE fileindex (
    id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    text VARCHAR(100) NOT NULL
);

Add a file path to the fileindex table:

INSERT INTO fileindex (text) VALUES ('/path/to/file');

Then exit the MySQL shell.

Configuring Sphinx

In this section, we will fill in the Sphinx configuration file.

Create the sphinx.conf file.

sudo gedit /etc/sphinxsearch/sphinx.conf

Sphinx configuration consists of three main blocks that are essential to run: source, index, and searchd. Each of these blocks is described below, and together they make up the sphinx.conf you will paste into the file.

The source block contains the source type and the username and password for the MySQL server. The first column of the SQL query should be a unique id. The SQL query runs on every indexing run and dumps the data to the Sphinx index file. Below are descriptions of each field, followed by the source block itself.

  • sql_host: Hostname of the MySQL server. In our example, this is localhost. This can be a domain or an IP address.
  • sql_user: Username for the MySQL login. In our example, this is root.
  • sql_pass: Password for the MySQL user. In our example, this is the root MySQL user’s password.
  • sql_db: Name of the database that stores the data. In our example, this is sphinx_index.
  • sql_query: The query that dumps data from the database into the Sphinx index.
  • sql_query_pre: Pre-fetch query, or pre-query. These are used to set up encoding, mark records that are going to be indexed, update internal counters, and set various per-connection SQL server options and variables. Perhaps the most frequent pre-query usage is specifying the encoding that the server will use for the rows it returns; note that Sphinx accepts only UTF-8 text.
  • sql_field_string: Combined string attribute and full-text field declaration.
  • sql_file_field: Reads document contents from the file system instead of the database. This offloads the database, prevents cache thrashing on the database side, and can be much faster in some cases.
source src1
{
    type             = mysql
    sql_host         = localhost
    sql_user         = root
    sql_pass         = 3337033
    sql_db           = sphinx_index
    sql_port         = 3306 # optional, default is 3306
    sql_query_pre    = SET CHARACTER_SET_RESULTS=utf8
    sql_query_pre    = SET NAMES utf8
    sql_query        = SELECT id, text FROM fileindex
    sql_file_field   = text
    sql_field_string = text
}

The index block names the source to index and the path where the index data is stored.

  • source: Name of the source block. In our example, this is src1.
  • path: The path where the index is saved.
  • docinfo: Document attribute value (docinfo) storage mode. Optional; the default is ‘extern’. Known values are ‘none’, ‘extern’, and ‘inline’.
index filename
{
    source  = src1
    path    = /var/lib/sphinxsearch/data/files
    docinfo = extern
}

The searchd block contains the port and other variables needed to run the Sphinx daemon.

  • listen: The port on which the Sphinx daemon will listen. In our example, this is 9312.
  • query_log: The path to the query log.
  • pid_file: The path to the PID file of the Sphinx daemon.
  • max_matches: Maximum number of matches to return per search term.
  • preopen_indexes: Whether to forcibly preopen all indexes on startup.
  • unlink_old: Whether to unlink old index copies on successful rotation.
  • log: Log file name. Optional; the default is ‘searchd.log’.
  • read_timeout: Network client request read timeout, in seconds.
  • max_children: Maximum number of children to fork (in other words, concurrent searches to run in parallel). Optional; the default is 0 (unlimited).
  • seamless_rotate: Prevents searchd stalls while rotating indexes with huge amounts of data to precache. Optional; the default is 1 (seamless rotation enabled).
  • binlog_path: Binary log (aka transaction log) file path. Optional; the default is the build-time configured data directory.
searchd
{
    listen          = 9312
    log             = /var/log/sphinxsearch/searchd.log
    query_log       = /var/log/sphinxsearch/query.log
    read_timeout    = 5
    max_children    = 30
    pid_file        = /var/run/sphinxsearch/searchd.pid
    max_matches     = 1000
    seamless_rotate = 1
    preopen_indexes = 1
    unlink_old      = 1
    binlog_path     = /var/lib/sphinxsearch/data
}

Adding Data to the Index

In this section, we’ll add data to the Sphinx index.

Add the data to the index using the configuration we created earlier:

sudo indexer --all --rotate

You should see output that looks like the following.

Sphinx 2.2.10-id64-release (2c212e0)
Copyright (c) 2001-2015, Andrew Aksyonoff
Copyright (c) 2008-2015, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/etc/sphinxsearch/sphinx.conf'...
WARNING: key 'max_matches' was permanently removed from Sphinx configuration. Refer to documentation for details.
indexing index 'filename'...
collected 1 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 1 docs, 41896 bytes
total 0.073 sec, 566705 bytes/sec, 13.52 docs/sec
total 8 reads, 0.014 sec, 9.0 kb/call avg, 1.8 msec/call avg
total 12 writes, 0.000 sec, 4.6 kb/call avg, 0.0 msec/call avg
rotating indices: successfully sent SIGHUP to searchd (pid=1087).

Starting Sphinx

First, open /etc/default/sphinxsearch to check whether the Sphinx daemon is turned on or off.

sudo nano /etc/default/sphinxsearch

To enable Sphinx, find the START line and set it to yes:

START=yes

Then, save and close the file.

Finally, start the Sphinx daemon.

sudo service sphinxsearch start
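
To confirm the daemon is actually accepting connections, a quick check is to open a TCP connection to the listen port from the configuration above. This is a minimal sketch in Python; the port 9312 matches our sphinx.conf.

import socket

# Try to reach searchd on the listen port from sphinx.conf.
try:
    conn = socket.create_connection(('127.0.0.1', 9312), timeout=5)
    print('searchd is listening on port 9312')
    conn.close()
except socket.error as exc:
    print('searchd is not reachable: %s' % exc)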

Testing Search

To search the indexed content, we use one of the official native SphinxAPI implementations for PHP, Perl, Python, Ruby, and Java, or a third-party API port or plugin for Perl, C#, Haskell, or Ruby on Rails.
The official native SphinxAPIs are included in the distribution package.
Download the Sphinx source from GitHub and change into its API directory:

git clone https://github.com/sphinxsearch/sphinx.git
cd sphinx/api

Create Search.py:

from sphinxapi import *

# Connect to the searchd daemon we started earlier and run a query.
client = SphinxClient()
client.SetServer('127.0.0.1', 9312)
result = client.Query('text to search')
print(result)

Execute this code with Python 2.7. If the search is successful, the result should look like this:

{'status': 0, 'matches': [{'id': 1, 'weight': 2500, 'attrs': {'text': '/home/arf/Downloads/ElasticSearch.md'}}], 'fields': ['text'], 'time': '0.000', 'total_found': 1, 'warning': '', 'attrs': [['text', 7]], 'words': [{'docs': 1, 'hits': 1, 'word': 'text'}, {'docs': 1, 'hits': 159, 'word': 'to'}, {'docs': 1, 'hits': 32, 'word': 'search'}], 'error': '', 'total': 1}

If the search is unsuccessful, the result should look like the following:

{'status': 0, 'matches': [], 'fields': ['text'], 'time': '0.000', 'total_found': 0, 'warning': '', 'attrs': [['text', 7]], 'words': [{'docs': 0, 'hits': 0, 'word': 'eqfc'}], 'error': '', 'total': 0}

Other Descriptions

Sphinx works well with database systems for indexing and searching fields.
Unfortunately, Sphinx can’t index .doc and .pdf file types directly. You’ll need to either import the textual contents into a database, or into an XML format that Sphinx can understand.
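
For example, once you have extracted the text from such files yourself, a small script can wrap it in the xmlpipe2 envelope that Sphinx understands, to be hooked up via an xmlpipe2 source in sphinx.conf. This is a minimal sketch; the field name content and the document texts are illustrative placeholders.

# -*- coding: utf-8 -*-
from xml.sax.saxutils import escape

# Placeholder texts; in practice these would come from your own
# .doc/.pdf text extraction step.
documents = {
    1: 'extracted text of the first file',
    2: 'extracted text of the second file',
}

# Emit the xmlpipe2 envelope on stdout for indexer to consume.
print('<?xml version="1.0" encoding="utf-8"?>')
print('<sphinx:docset>')
print('<sphinx:schema><sphinx:field name="content"/></sphinx:schema>')
for doc_id, text in documents.items():
    print('<sphinx:document id="%d">' % doc_id)
    print('<content>%s</content>' % escape(text))
    print('</sphinx:document>')
print('</sphinx:docset>')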



Elasticsearch

  • Main Web: http://www.elasticsearch.org/
  • Development URL: https://github.com/elasticsearch/elasticsearch
  • License: Apache 2
  • Environment: Java

Elasticsearch was created in 2010 by Shay Banon after forgoing work on another search solution, Compass, also built on Lucene and created in 2004.

Marketing Points

  • Real time data, analytics
  • Distributed, scaled horizontally. Add nodes for capacity.
  • High availability, reorganizing clusters of nodes.
  • Multi-tenancy. Multiple indices in a cluster, added on the fly.
  • Full text search via Lucene. The most powerful full-text search capabilities of any open source product.
  • Document oriented. Store structured JSON docs.
  • Conflict management
  • Schema free with the ability to assign specific knowledge at a later time
  • RESTful API
  • Document changes are recorded in transaction logs in multiple nodes.
  • Elasticsearch provides a RESTful API endpoint for all requests from all languages.
  • Own Zen Discovery module for cluster management.

Technical Info

  • Built on Lucene
  • Data is stored with PUT and POST requests and retrieved with GET requests. You can check for the existence of a document with HEAD requests, and JSON documents can be deleted with DELETE requests (see the sketch after this list).
  • Requests can be made with JSON query language rather than a query string.
  • Full-text docs are stored in memory. A new option in 1.0 allows for doc values, which are stored on disk.
  • Suggesters are built in to suggest corrections or completions.
  • Plugin system available for custom functionality.
  • Possible admin interface via Elastic-HQ
  • The mapper attachments plugin lets Elasticsearch index file attachments in over a thousand formats (such as PPT, XLS, PDF) using the Apache text extraction library Tika.
  • It is easier to scale with ElasticSearch
  • ElasticSearch has the ability to store and retrieve data, so we did not need a separate data store. The hidden gem behind ES is that it really is a NoSQL-style DB with lots of momentum in the industry. The primary reason for ES was the ability to scale easily: adding servers, rebalancing, ease of use, and cluster management.
  • Elasticsearch was built to be real time from the beginning.
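
A hedged sketch of that document API, assuming a local node on the default port 9200; the index name docs, type note, and field names are illustrative placeholders, not Elasticsearch defaults.

import requests

# Store a JSON document with PUT ("docs", "note", and id 1 are
# illustrative names chosen for this sketch).
requests.put('http://localhost:9200/docs/note/1',
             json={'title': 'hello', 'body': 'full text search'})

# Retrieve the same document with GET.
print(requests.get('http://localhost:9200/docs/note/1').json())

# Query with the JSON query language rather than a query string.
r = requests.get('http://localhost:9200/docs/_search',
                 json={'query': {'match': {'body': 'search'}}})
print(r.json())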

Sphinx

  • Main Web: http://sphinxsearch.com/
  • Development URL: http://sphinxsearch.com/bugs/my_view_page.php
  • License: GPLv2
  • Environment: C++

Sphinx was created in 2001 by Andrew Aksyonoff to solve a personal need for a search solution and has remained a standalone project.

Marketing Points

  • Supports on the fly (real time) and offline batch index creation.
  • Arbitrary attributes can be stored in the index.
  • Can index SQL DBs
  • Can batch index xmlpipe2 and tsvpipe documents
  • Three different APIs (SphinxAPI, SphinxQL, SphinxSE); native libraries are provided for SphinxAPI
  • DB-like querying features.
  • Can scale horizontally with Percona

Technical Info

  • Real-time indexes can only be populated using SphinxQL (see the sketch after this list).
  • Disk based indexes can be built from SQL DBs, TSV, or custom XML format.
  • Example PHP API file to be included in projects communicating with Sphinx.
  • Uses fsockopen in PHP to make a connection with the Sphinx service similar to how a MySQL connection would be made.
  • Sphinx integrates more tightly with RDBMSs, especially MySQL, and has also been used for big data.
  • Sphinx is designed to only retrieve document ids.
  • The data to be indexed can generally come from very different sources: SQL databases, plain text files, HTML files, mailboxes, and so on. From Sphinx point of view, the data it indexes is a set of structured documents, each of which has the same set of fields and attributes. This is similar to SQL, where each row would correspond to a document, and each column to either a field or an attribute.
  • Depending on what source Sphinx should get the data from, different code is required to fetch the data and prepare it for indexing.
  • Note that the original contents of the fields are not stored in the Sphinx index. The text that you send to Sphinx gets processed, and a full-text index (a special data structure that enables quick searches for a keyword) gets built from that text. But the original text contents are then simply discarded. Sphinx assumes that you store those contents elsewhere anyway.
  • Moreover, it is impossible to fully reconstruct the original text, because the specific whitespace, capitalization, punctuation, etc will all be lost during indexing.
  • Sphinx is C++-based and very fast, but it required a second database to store and retrieve the data. So essentially, we used Sphinx for our search, and it would return the document IDs corresponding to the search results. We then queried a key-value store to retrieve the actual content. We didn’t like needing multiple technologies and wanted a single NoSQL-style DB with a free-text search index combined. If I recall correctly, the licensing for our commercial use was also a little restrictive.
  • Use Sphinx if you want to search through tons of documents/files quickly; it indexes fast, too. I would recommend not using it in an app that involves JSON or parsing XML to get the search results; use it for direct DB searches. It works great on MySQL.
  • Sphinx can’t index document types such as pdf, ppt, doc directly. You’ll need to either import the textual contents into a database, or into an XML format that Sphinx can understand.
  • Sphinx provides language specific wrappers for the API to communicate with the service.
  • Sphinx is definitely designed around a SQL type structure, though it has been modified over time to support other data stores.
  • Decisions like implementing xmlpipe2 and tsvpipe by Sphinx as data sources are somewhat confusing. I think the standard formats offered with Solr and Elasticsearch make more sense.
  • Sphinx started as a batch indexer and moved (rightly) to real time over time. See Sphinx real time caveats.
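
Because SphinxQL speaks the MySQL wire protocol, any MySQL client can populate and query a real-time index. A minimal sketch, assuming a SphinxQL listener on port 9306 and an RT index named rt_docs declared in sphinx.conf (both names are assumptions for this illustration).

import pymysql

# Connect to searchd's SphinxQL listener, not to MySQL itself.
conn = pymysql.connect(host='127.0.0.1', port=9306, user='', charset='utf8')
cur = conn.cursor()

# Populate the RT index; rt_docs and its content field are assumed
# to be declared in sphinx.conf.
cur.execute("INSERT INTO rt_docs (id, content) VALUES (1, 'hello sphinx')")

# Full-text search with MATCH().
cur.execute("SELECT id FROM rt_docs WHERE MATCH('hello')")
print(cur.fetchall())
conn.close()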

Solr

  • Main Web: http://lucene.apache.org/solr/
  • Development URL: https://issues.apache.org/jira/browse/SOLR
  • License: Apache 2
  • Environment: Java

Solr was created in 2004 at CNet by Yonik Seeley and granted to the Apache Software Foundation in 2006 to become part of the Lucene project.

Marketing Points

  • REST-like API
  • Documents added via XML, JSON, CSV, or binary over HTTP.
  • Query with GET and receive XML, JSON, CSV, or binary results (see the sketch after this list).
  • XML configuration
  • Extensible plugin architecture
  • AJAX based admin interface
  • Uses Apache ZooKeeper for cluster management
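
A hedged sketch of that HTTP workflow, assuming a local Solr on the default port 8983 with a core named demo (the core name and field names are placeholders for this illustration).

import requests

# Add a document as JSON over HTTP and commit immediately.
requests.post('http://localhost:8983/solr/demo/update?commit=true',
              json=[{'id': '1', 'title': 'hello solr'}])

# Query with GET and ask for JSON results.
r = requests.get('http://localhost:8983/solr/demo/select',
                 params={'q': 'title:hello', 'wt': 'json'})
print(r.json())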

Technical Info

  • Solr is a web service that is built on top of the Lucene library. You can talk to it over HTTP from any programming language – so you can take advantage of the power of Lucene without having to write any Java code at all. Solr also adds a number of features that Lucene leaves out such as sharding and replication.
  • Solr is near real-time, but it shouldn’t be used to solve hard real-time problems. For search engine use cases, Solr works flawlessly.
  • Solr works fine in high-traffic web applications (I read somewhere that it is not suited for this, but I am backing up that statement). It utilizes RAM, not the CPU.
  • Solr is highly scalable. Have a look at SolrCloud.
  • Solr can be integrated with Hadoop to build distributed applications
  • Solr can index proprietary formats like Microsoft Word, PDF, etc.
  • In Solr you can directly retrieve whole documents containing pretty much any kind of data, making it more independent of any external data store and saving the extra round trip.
  • Use Solr if you intend to use it in your web app (for example, a site search engine). It will definitely turn out to be great, thanks to its API. You will definitely need that power for a web app.

