Elasticsearch

  • Main Web: http://www.elasticsearch.org/
  • Development URL: https://github.com/elasticsearch/elasticsearch
  • License: Apache 2
  • Environment: Java

Elasticsearch was created in 2010 by Shay Banon after he set aside work on his earlier search solution, Compass, which was also built on Lucene and which he had created in 2004.

Marketing Points

  • Real time data, analytics
  • Distributed, scaled horizontally. Add nodes for capacity.
  • High availability; clusters automatically detect failed nodes and reorganize around them.
  • Multi-tenancy. Multiple indices in a cluster, added on the fly.
  • Full text search via Lucene, with some of the most powerful full-text search capabilities available in any open source product.
  • Document oriented. Store structured JSON docs.
  • Conflict management
  • Schema free, with the ability to define explicit mappings later.
  • RESTful API, exposed over HTTP for all requests and callable from any language.
  • Document changes are recorded in transaction logs on multiple nodes.
  • Own Zen Discovery module for cluster management.

Technical Info

  • Built on Lucene
  • Data is stored with PUT and POST requests and retrieved with GET requests; a HEAD request checks whether a document exists, and JSON documents are deleted with DELETE requests (see the sketch after this list).
  • Queries can be written in the JSON query DSL rather than as a query string.
  • Field data used for sorting and aggregations is held in memory by default; a new option in 1.0, doc values, allows it to be stored on disk instead.
  • Suggesters are built in to suggest corrections or completions.
  • Plugin system available for custom functionality.
  • An admin interface is available via third-party tools such as ElasticHQ.
  • The mapper attachments plugin lets Elasticsearch index file attachments in over a thousand formats (such as PPT, XLS, PDF) using the Apache text extraction library Tika.
  • It is easier to scale with Elasticsearch.
  • Elasticsearch can store and retrieve the documents themselves, so we did not need a separate data store. The hidden gem behind it is that it really is a NoSQL-style database with lots of momentum in the industry. The primary reason for choosing it was the ability to scale easily: adding servers, rebalancing, ease of use, and cluster management.
  • Elasticsearch was built to be real time from the beginning.
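
A minimal sketch of the REST operations described above, using Python's requests library. It assumes a local single-node cluster on http://localhost:9200 and a hypothetical "articles" index with an "article" type; the host, index, type, and field names are illustrative, not taken from any particular deployment.

    import requests

    DOC = "http://localhost:9200/articles/article/1"

    # PUT stores (or replaces) a JSON document under an explicit id;
    # POSTing to the type URL would let Elasticsearch generate the id.
    requests.put(DOC, json={"title": "Hello", "body": "Full text search with Lucene"})

    # HEAD checks for existence without returning the document body.
    exists = requests.head(DOC).status_code == 200

    # GET retrieves the stored JSON document.
    doc = requests.get(DOC).json()

    # Searches are expressed in the JSON query DSL rather than a query string.
    hits = requests.post(
        "http://localhost:9200/articles/_search",
        json={"query": {"match": {"body": "lucene"}}},
    ).json()["hits"]["hits"]

    # DELETE removes the document.
    requests.delete(DOC)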

Sphinx

  • Main Web: http://sphinxsearch.com/
  • Development URL: http://sphinxsearch.com/bugs/my_view_page.php
  • License: GPLv2
  • Environment: C++

Sphinx was created in 2001 by Andrew Aksyonoff to solve a personal need for a search solution, and it has remained a standalone project.

Marketing Points

  • Supports on the fly (real time) and offline batch index creation.
  • Arbitrary attributes can be stored in the index.
  • Can index SQL DBs
  • Can batch index documents via the xmlpipe2 and tsvpipe data sources.
  • Three different access APIs (SphinxAPI, SphinxQL, and SphinxSE), with native client libraries provided for SphinxAPI.
  • DB like querying features.
  • Can be scaled horizontally in combination with MySQL (see the Percona write-up in the references).

Technical Info

  • Real time indexes can only be populated using SphinxQL (see the sketch after this list).
  • Disk based indexes can be built from SQL DBs, TSV, or custom XML format.
  • An example PHP API file is shipped with Sphinx for inclusion in projects that communicate with the service.
  • It uses PHP's fsockopen to open a connection to the Sphinx daemon, much as a MySQL connection would be made.
  • Sphinx integrates more tightly with RDBMSs, especially MySQL, and has also been used for big data workloads.
  • Sphinx is designed to return only document IDs.
  • The data to be indexed can generally come from very different sources: SQL databases, plain text files, HTML files, mailboxes, and so on. From Sphinx's point of view, the data it indexes is a set of structured documents, each of which has the same set of fields and attributes. This is similar to SQL, where each row would correspond to a document, and each column to either a field or an attribute.
  • Depending on what source Sphinx should get the data from, different code is required to fetch the data and prepare it for indexing.
  • Note that the original contents of the fields are not stored in the Sphinx index. The text that you send to Sphinx gets processed, and a full-text index (a special data structure that enables quick searches for a keyword) gets built from that text. But the original text contents are then simply discarded. Sphinx assumes that you store those contents elsewhere anyway.
  • Moreover, it is impossible to fully reconstruct the original text, because the specific whitespace, capitalization, punctuation, etc. will all be lost during indexing.
  • Sphinx is C++ based and very fast, but it required a second database to store and retrieve the data. Essentially, we used Sphinx for search, and it returned the document IDs corresponding to the results; we then queried a key-value store to retrieve the actual content. We did not like needing multiple technologies and wanted a single NoSQL-style DB with a built-in full-text search index. If I recall correctly, the licensing for our commercial use was also a little restrictive.
  • Use Sphinx if you want to search through large numbers of documents/files quickly; it indexes very fast too. It is not recommended for apps that involve JSON or XML parsing to get the search results; use it for direct database searches. It works great with MySQL.
  • Sphinx can’t index document types such as pdf, ppt, doc directly. You’ll need to either import the textual contents into a database, or into an XML format that Sphinx can understand.
  • Sphinx provides language specific wrappers for the API to communicate with the service.
  • Sphinx is definitely designed around a SQL type structure, though it has been modified over time to support other data stores.
  • Sphinx's decision to implement xmlpipe2 and tsvpipe as data sources is somewhat confusing; the standard formats offered by Solr and Elasticsearch make more sense.
  • Sphinx started as a batch indexer and moved (rightly) to real time over time. See Sphinx real time caveats.
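
A minimal sketch of the SphinxQL path described above. It assumes searchd is listening for SphinxQL on the default port 9306 and that sphinx.conf defines a hypothetical real-time index named rt_docs with title and content full-text fields; pymysql stands in here for any MySQL-protocol client (the mysql CLI works the same way).

    import pymysql

    # SphinxQL speaks the MySQL wire protocol; searchd does not authenticate,
    # so any user works and no database needs to be selected.
    conn = pymysql.connect(host="127.0.0.1", port=9306, user="sphinx")
    cur = conn.cursor()

    # Real-time indexes are populated with ordinary SphinxQL INSERT statements.
    cur.execute(
        "INSERT INTO rt_docs (id, title, content) "
        "VALUES (1, 'Hello', 'full text search with Sphinx')"
    )

    # Queries return document IDs (and attributes), never the original text;
    # the application fetches the actual content from its own data store.
    cur.execute("SELECT id FROM rt_docs WHERE MATCH('sphinx')")
    ids = [row[0] for row in cur.fetchall()]

    conn.close()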

Solr

  • Main Web: http://lucene.apache.org/solr/
  • Development URL: https://issues.apache.org/jira/browse/SOLR
  • License: Apache 2
  • Environment: Java

Solr was created in 2004 at CNET by Yonik Seeley and donated to the Apache Software Foundation in 2006, where it became part of the Lucene project.

Marketing Points

  • REST-like API
  • Documents added via XML, JSON, CSV, or binary over HTTP.
  • Query with GET and receive XML, JSON, CSV, or binary results (see the sketch after this list).
  • XML configuration
  • Extensible plugin architecture
  • AJAX based admin interface
  • Uses Apache ZooKeeper for cluster management.
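
A minimal sketch of adding and querying documents over HTTP with Python's requests library. It assumes a local Solr instance at http://localhost:8983 with a hypothetical core named "articles" whose schema (or schemaless mode) accepts id, title, and body fields.

    import requests

    SOLR = "http://localhost:8983/solr/articles"

    # Documents can be posted as JSON (XML and CSV are also accepted);
    # commit=true makes them searchable immediately, which is fine for a
    # demo but too aggressive for high-volume indexing.
    requests.post(
        SOLR + "/update?commit=true",
        json=[{"id": "1", "title": "Hello", "body": "Full text search with Solr"}],
    )

    # Queries are plain GET requests; wt selects the response format, and
    # whole stored documents come back rather than just IDs.
    resp = requests.get(SOLR + "/select", params={"q": "body:solr", "wt": "json"}).json()
    docs = resp["response"]["docs"]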

Technical Info

  • Solr is a web service that is built on top of the Lucene library. You can talk to it over HTTP from any programming language – so you can take advantage of the power of Lucene without having to write any Java code at all. Solr also adds a number of features that Lucene leaves out such as sharding and replication.
  • Solr is near real-time, but it should not be used to solve hard real-time problems. For search-engine use cases, Solr is right at home and works flawlessly.
  • Solr works fine in high-traffic web applications (despite occasional claims to the contrary); it leans on RAM rather than CPU.
  • Solr is highly scalable; have a look at SolrCloud.
  • Solr can be integrated with Hadoop to build distributed applications
  • Solr can index proprietary formats like Microsoft Word, PDF, etc.
  • In Solr you can directly retrieve whole documents containing almost any kind of data, which makes it more independent of an external data store and saves an extra round trip.
  • Use Solr if you intend to use it in a web app (for example, a site search engine). It will definitely turn out to be great thanks to its API, and you will need that power for a web app.

References

Choosing a stand-alone full-text search server: Sphinx or SOLR?
Comparison of full text search engine – Lucene, Sphinx, Postgresql, MySQL?
Open Source Search Comparison
ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits for which usage?
How do Solr, Lucene, Sphinx and Searchify compare?
Which one do you think is better for a big data website: Solr, ElasticSearch, or Sphinx? Why?
Solr and Elasticsearch, a performance study
Building 50TB-scale search engine with MySQL and Sphinx
