Monday, 6 June 2011

Using solr for .net - please stop using your database as a search engine

Google has spoiled us with fast, relevant search, and users have come to expect this from every site they visit. There really are no more excuses for using database full-text functionality for web-site search (admittedly, if your solution is tightly coupled to your rdbms's full-text api, migrating to a different solution will not be trivial).

There are many reasons why using solr for search is a good idea; here are a few:

  • Anything that reduces the load on your database is a good thing - I would guess that search functionality has the potential to bring many databases to their knees.
  • Solr (and lucene on which it's built) is designed for searching - that's pretty much all it does and it's really, really good at it.
  • .net has an excellent API for solr which makes integrating solr with .net incredibly easy.
  • The solr server is written in java and can run pretty much anywhere you can run a jvm.
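
To give a feel for that last point about the .net api, here is a minimal sketch using the SolrNet library - the document type, field names and url are purely illustrative, and the details may differ depending on your SolrNet version:

```csharp
using SolrNet;
using SolrNet.Attributes;
using Microsoft.Practices.ServiceLocation;

// Illustrative document type, mapped onto fields in the solr schema:
public class Product
{
    [SolrUniqueKey("id")]
    public string Id { get; set; }

    [SolrField("name")]
    public string Name { get; set; }
}

// One-time initialisation at application startup:
Startup.Init<Product>("http://localhost:8983/solr");

// Resolve the operations instance, index a document and search:
var solr = ServiceLocator.Current.GetInstance<ISolrOperations<Product>>();
solr.Add(new Product { Id = "1", Name = "ipod nano" });
solr.Commit();
var results = solr.Query(new SolrQuery("ipod"));
```

That really is about all the plumbing required to get documents in and results out.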

So why not use Lucene directly? Do I need to use solr?

Having delivered projects using solr and lucene, I would whole-heartedly recommend using solr for the following reasons:

  • solr takes care of modifying and querying your search index remotely, which is not trivial to build yourself.
  • The .net api for Lucene is several versions behind the official java version (.net: 2.9.2, java: 3.2) for various technical and non-technical reasons - the lucene.net mailing lists chronicle the ups and downs of apache incubator status if you are interested. By using solr, you bypass this issue entirely (the solr api itself could change, but it is significantly simpler than lucene's and much less likely to).
  • Running the server in a jvm allows you to use linux for the search functionality of your application - which is going to work out easier and cheaper if you are hosting in the cloud.

What about elasticsearch?

This project looks promising and, although I was able to index a few hundred thousand documents with a trivial amount of code, I found the absence of a schema slightly confusing. I also wasn't able to get any results out of the index using the NEST api for .net at the time of writing. Since both projects use Lucene under the hood, I would suggest that skills are transferable between the two and a migration would be fairly easy.

Tips

I don't want to regurgitate one of the many useful startup guides, but rather share a few tips that I have discovered along the way.

  • Prepare to re-index often. Make sure your indexing process is repeatable and easily runnable - every time you change the schema, you need to re-index to see the changes (and this will happen fairly often during development).
  • Indexing speed is obviously hugely dependent on many different factors (system hardware, index complexity, etc.), but anecdotally I can build solr indexes at approximately 850 items per second (average-spec notebook, with solr running on the same machine). Again, YMMV, but there is a number to compare against if you like that sort of thing.
  • Remove everything from the solr example files that you don't need - they are verbose and make it harder to know what pieces are actually being used (this includes the schema and configuration).
  • Don't try to be too clever with the query input string (stripping characters etc.) - for the most part, solr does a good job of parsing the query.
  • If you are searching across multiple fields, the best place to define this is within your solrconfig.xml file:
 <requestHandler name="search" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="defType">edismax</str>
      <str name="qf">myImportantField^20.0 myOtherField yetAnotherField</str>
    </lst>
  </requestHandler>
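
Because the handler above is registered with default="true", edismax applies the qf fields and boosts server-side, so client code just sends the raw user input. A hedged SolrNet sketch (Product stands in for whatever mapped document type you are using):

```csharp
using SolrNet;
using Microsoft.Practices.ServiceLocation;

// No per-field query syntax needed - edismax expands the raw input
// across the qf fields defined in solrconfig.xml:
var solr = ServiceLocator.Current.GetInstance<ISolrOperations<Product>>();
var results = solr.Query(new SolrQuery("wireless headphones"),
                         new QueryOptions { Rows = 10 });
```

Keeping the field list and boosts in solrconfig.xml also means you can tune relevance without redeploying your application.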
  • Externalize the properties that are different for each platform using the solrcore.properties file (e.g. the location of the solr data directory). This makes it easier to deploy schema and configuration changes to production. You can do this by changing your solrconfig.xml to the following:
<dataDir>${data.dir}</dataDir>
and creating a solrcore.properties file for each environment with something similar to:
data.dir=/data/solr
  • Don't be afraid to augment your results with data from other sources. Just because you need to show a particular field in your search results doesn't mean you need to store it in your search index (there is obviously a careful performance trade-off to be made here).

  • If you are implementing "auto suggest", use an NGramFilterFactory in your schema similar to the following:

    <fieldType name="wildcard" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="25"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
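
To actually use the field type, you also need a field of that type in your schema, typically populated via a copyField so the n-grams are built from an existing display field (field names here are illustrative):

```xml
<!-- Copy the display field into an n-grammed suggest field so that
     partial user input matches whole indexed terms at query time. -->
<field name="name" type="text" indexed="true" stored="true"/>
<field name="suggest" type="wildcard" indexed="true" stored="false"/>
<copyField source="name" dest="suggest"/>
```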
    
    I hope this gives you the incentive you need to give your database a holiday and improve the search on your website.
