Monday 6 June 2011

Polling your database queues less often - using udp via sql server clr

We've all done it before: we have a database table somewhere that is being populated with records and we need to know when something new has arrived (let's forget for the moment that a real queuing infrastructure is probably more suited for this). The way it's usually implemented is:

You decide on the maximum age of the items in the queue before they need to be processed - this forms the basis of your polling interval. For example, if a change to the table must be processed within 5 minutes, your polling interval might be 4 minutes (to allow for the actual processing time). Of course, this is a sliding window, so most requests will not reach the maximum age (e.g. a record inserted just before the table is polled is picked up almost immediately).

There are obviously some inefficiencies here:

  • What happens if there are no new records? You are going to poll anyway.
  • What happens if a record is inserted directly after a poll? It will have to wait until the next interval.

It would be preferable if you were notified of a change to the table rather than having to poll.

The solution (if you are using sql server 2005 or above) is to create a clr stored procedure that notifies the interested parties that there are new records to process. Think of it as "preempting the poll" - it's best to leave the polling in place as a worst-case fallback in case the udp broadcast is not received.
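
To make the preempting idea concrete, here is a minimal sketch (my own illustration, not part of the sql pieces below) of what the consuming side might look like: a worker that polls on its normal schedule but can be woken early when a notification arrives. WakeUp would be called by the udp listener shown at the end of this post; ProcessNewRecords is a placeholder for your own logic.

using System;
using System.Threading;

public class QueueWorker
{
    // Signalled by the udp listener when a notification arrives.
    private readonly AutoResetEvent _wakeUp = new AutoResetEvent(false);

    // The worst-case polling interval - the udp broadcast merely preempts it.
    private static readonly TimeSpan PollingInterval = TimeSpan.FromMinutes(4);

    public void WakeUp()
    {
        _wakeUp.Set();
    }

    public void Run()
    {
        while (true)
        {
            // Returns early if WakeUp is called, otherwise times out and
            // we poll anyway - the safety net in case a broadcast is lost.
            _wakeUp.WaitOne(PollingInterval);
            ProcessNewRecords();
        }
    }

    private void ProcessNewRecords()
    {
        // Query the table and process anything new here.
    }
}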

The c# code for the sql class library may look similar to the following:

using System;
using System.Net;
using System.Net.Sockets;
using System.Text;
using Microsoft.SqlServer.Server;

namespace MyApp.Sql
{
    public class Utilities
    {
        [SqlProcedure]
        public static void Sync(string sourceServer, string sourceDatabase, string notificationIp, int notificationPort)
        {
            // Default to a broadcast on port 53101 if no explicit target is supplied.
            var ipAddress = !String.IsNullOrEmpty(notificationIp) ? IPAddress.Parse(notificationIp) : IPAddress.Broadcast;
            var remoteEp = new IPEndPoint(ipAddress, notificationPort == 0 ? 53101 : notificationPort);

            using (var udpClient = new UdpClient())
            {
                // Identify the source server and database so that a single
                // listener can serve multiple databases.
                byte[] input = Encoding.ASCII.GetBytes(String.Format("{0}:{1}", sourceServer, sourceDatabase));
                udpClient.Send(input, input.Length, remoteEp);
            }
        }
    }
}

And to make this component available to your sql installation (assuming you have the necessary permissions, and that clr integration has been enabled on the server via sp_configure 'clr enabled'):
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[NotifyChanges]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[NotifyChanges]
GO

IF EXISTS (SELECT * FROM sys.assemblies asms WHERE asms.name = N'MyApp_Notifier')
DROP ASSEMBLY [MyApp_Notifier]
GO

CREATE ASSEMBLY MyApp_Notifier FROM 'MyApp.Sql.dll'
WITH PERMISSION_SET = UNSAFE
GO
CREATE PROCEDURE [dbo].[NotifyChanges]
@sourceServer [nvarchar](255),
@sourceDatabase [nvarchar](255),
@notificationIp [nvarchar](255),
@notificationPort [int]
WITH EXECUTE AS CALLER
AS
EXTERNAL NAME [MyApp_Notifier].[MyApp.Sql.Utilities].[Sync]
GO

The next step is to create a trigger on your polling table and call this new stored procedure when a new record is inserted. Note that the server name and database name are passed in the udp message in case you have multiple notification tables broadcasting to a single listener.

CREATE TRIGGER [dbo].[MyPollingTrigger] ON [dbo].[MyPollingTable]
AFTER INSERT
AS
DECLARE @DB_NAME VARCHAR(128)
SET @DB_NAME = DB_NAME()
EXEC [NotifyChanges]
  @@SERVERNAME,
  @DB_NAME,
  NULL,
  0

In a future post I will show the udp server that you can use to respond to these events, but it's fairly trivial using the built-in .net socket libraries.
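
Until then, here is a minimal sketch of such a listener, assuming the default port from the Sync procedure above. It blocks on UdpClient.Receive and decodes the "server:database" message; in practice you would run it on a background thread and trigger your processing when a message arrives.

using System;
using System.Net;
using System.Net.Sockets;
using System.Text;

public class NotificationListener
{
    public static void Listen(int port)
    {
        using (var udpClient = new UdpClient(port))
        {
            var remoteEp = new IPEndPoint(IPAddress.Any, 0);
            while (true)
            {
                // Blocks until a datagram arrives.
                byte[] data = udpClient.Receive(ref remoteEp);
                string message = Encoding.ASCII.GetString(data);

                // message is in the "server:database" format sent by Sync -
                // use it to decide which queue to poll immediately, e.g. by
                // calling WakeUp on the worker sketched earlier.
                Console.WriteLine("Change notification received: {0}", message);
            }
        }
    }
}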

Using solr for .net - please stop using your database as a search engine

Google has spoiled us with fast, relevant search, and users have come to expect this from every site they visit. There really are no more excuses for using database full-text functionality for website search (of course, if you have tightly coupled your solution to your rdbms full-text api, then migrating to a different solution will not be trivial).

There are many reasons why using solr for search is a good idea; here are some of them:

  • Anything that reduces the load on your database is a good thing. I would guess that search functionality has the potential to bring many databases to their knees.
  • Solr (and lucene, on which it's built) is designed for searching - that's pretty much all it does, and it's really, really good at it.
  • .net has an excellent API for solr which makes integrating solr with .net incredibly easy (see the sketch after this list).
  • The solr server is written in java and can run pretty much anywhere you can run a jvm.
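
As a taste of how little code this takes, here is a minimal sketch using SolrNet, one .net client for solr - the Product class, field names and url here are invented for the example:

using System.Collections.Generic;
using Microsoft.Practices.ServiceLocation;
using SolrNet;
using SolrNet.Attributes;

// A hypothetical document type - the attribute names must match your solr schema.
public class Product
{
    [SolrUniqueKey("id")]
    public string Id { get; set; }

    [SolrField("name")]
    public string Name { get; set; }
}

public class SearchExample
{
    public static void Run()
    {
        // Wire up SolrNet against your solr core (done once at startup).
        Startup.Init<Product>("http://localhost:8983/solr");
        var solr = ServiceLocator.Current.GetInstance<ISolrOperations<Product>>();

        // Index a document and make it visible to searches.
        solr.Add(new Product { Id = "1", Name = "My first product" });
        solr.Commit();

        // Query the index.
        ICollection<Product> results = solr.Query(new SolrQuery("product"));
    }
}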

So why not use Lucene directly? Do I need to use solr?

Having delivered projects using solr and lucene, I would whole-heartedly recommend using solr for the following reasons:

  • solr takes care of modifying and querying your search index remotely, which is not trivial.
  • The .net api for Lucene is several versions behind the official java version for various technical and non-technical reasons (.net: 2.9.2, java: 3.2). You can read more about this on the lucene.net mailing lists if you want the ups-and-downs of apache incubator status. By using solr, you effectively bypass this issue (unless the solr api itself changes, but this api is significantly simpler than lucene's and is much less likely to change).
  • Running the server in a jvm allows you to use linux for the search functionality of your application - which is going to work out easier and cheaper if you are hosting in the cloud.

What about elasticsearch?

This project looks promising and, although I was able to index a few hundred thousand documents with a trivial amount of code, I found the absence of a schema slightly confusing. I also wasn't able to get any results out of the index using the NEST api for .net at the time of writing. Since both projects use Lucene under the hood, I would suggest that skills are transferable between the two and a migration would be fairly easy.

Tips

I don't want to regurgitate one of the many useful startup guides, but rather share a few tips that I have discovered along the way.

  • Prepare to re-index often. Make sure your indexing process is repeatable and easily runnable - every time you change the schema, you need to re-index to see the changes (and this will happen fairly often during development).
  • It is obviously hugely dependent on many different factors (system hardware, index complexity etc.), but anecdotally I can build solr indexes at approximately 850 items per second (average spec. notebook, with solr running on the same machine). Again, YMMV, but there is a number for you to compare against if you like that sort of thing.
  • Remove everything from the solr example files that you don't need - they are verbose and make it harder to know what pieces are actually being used (this includes the schema and configuration).
  • Don't try and be too clever with the query input string (stripping characters etc.) - for the most part, solr does a good job of parsing the query.
  • If you are searching across multiple fields, the best place to define this is within your solrconfig.xml file:
  <requestHandler name="search" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="defType">edismax</str>
      <str name="qf">myImportantField^20.0 myOtherField yetAnotherField</str>
    </lst>
  </requestHandler>
  • Externalize the properties that are different for each platform using the solrcore.properties file (e.g. the location of the solr data directory). This makes it easier to deploy schema and configuration changes to production. You can do this by changing your solrconfig.xml to the following:
<dataDir>${data.dir}</dataDir>
and creating a solrcore.properties file for each environment with something similar to:
data.dir=/data/solr
  • Don't be afraid to augment your results with data from other sources. Just because you need to show a particular field in your search results doesn't mean you need to store it in your search index (there is obviously a careful performance trade-off to be made here).

  • If you are implementing "auto suggest", use an NGramFilterFactory in your schema similar to the following (a small query sketch follows the snippet below):

  <fieldType name="wildcard" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="25" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
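
To complete the picture, querying such a field from .net might look like the following. This is a minimal sketch that reuses the Product type and SolrNet setup from the earlier example, and assumes a hypothetical field named "suggest" declared with the wildcard type above:

using System.Collections.Generic;
using SolrNet;

public static class AutoSuggest
{
    // Assumes SolrNet was initialized as in the earlier sketch, and that the
    // schema declares a (hypothetical) field named "suggest" of type "wildcard".
    public static ICollection<Product> Suggest(ISolrOperations<Product> solr, string prefix)
    {
        // The ngram analysis happens at index time, so a simple term query
        // against the field matches partial user input.
        return solr.Query(new SolrQuery("suggest:" + prefix.ToLowerInvariant()));
    }
}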
    
I hope this gives you the incentive you need to give your database a holiday and improve the search on your website.

Introducing synoptic - a console application framework for .net

If you want to go directly to the app, you can find us on github or you can view the wiki.

There needs to be more love for console applications. As a standard, our company produces a console application for every web site we develop. It doesn't contain all the functionality of the website, but does serve the following purposes:

  • Allows us to automate various parts of the site by giving us something that can be easily invoked via the build process (e.g. running a nightly map-reduce to update our data aggregates).
  • Allows us to debug the installation without firing up a web browser (e.g. showing the audit trail of a particular transaction if the client has any queries).
  • It proves to us that the front-end is sufficiently decoupled from the rest of the application to allow us to build a different user-interface for it later if necessary (e.g. mobile etc).

In our minds, the console application is not just a throw-away application - it is a first-class citizen of our solution and should be treated as such. I am going to presume that most (if not all) readers have implemented a console application at some point in their lives. These are some of the issues you probably had to deal with (or chose to ignore):

  • How do I parse the parameters input from the command line, and what format do I support (--param, -param, param, /param)?
  • How do I show the command line usage?
  • How do I map these parameters to methods that I want to call? How do I make sure that the parameters are valid for this particular method?
  • What about stderr and stdout? How do I prevent exception messages from being piped to other commands?
  • How do I output text neatly on the command line? Given that the console is 80 characters wide by default (but can be resized by the user), how do I make sure that the text wraps and indents correctly? For example, I often see this:

    This is my long text. I wanted it to be indented. What happens when it wraps to the next line?

or (note how the right edges of the first and second cells are not aligned):

    This is the first cell.    This is text for the first cell.
    This is a much longer cell.    This is text for the second cell.

Wouldn't it be better if it looked like this?

    This is the first cell.        This is text for the first cell
                                   and it wraps with the correct indentation.
    This is a much longer cell.    This is text for the second cell.
Maybe this doesn't seem like a big deal, but it's not terribly difficult to fix (hint: the answer is not to manually insert line-breaks in your text, because you don't know beforehand how wide the console is going to be).
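
For the curious, the core of the fix is straightforward to sketch: measure the console at runtime and wrap whole words into the available width. This is only an illustration - synoptic's ConsoleFormatter builds tables, cells and styles on top of this basic idea.

using System;
using System.Text;

public static class TextWrapper
{
    // Wraps text to the current console width, indenting every line by the
    // given number of columns (words longer than a line are left unbroken).
    public static string Wrap(string text, int indent)
    {
        int width = Console.WindowWidth - indent;
        var result = new StringBuilder();
        var line = new StringBuilder();

        foreach (string word in text.Split(' '))
        {
            // Flush the current line if adding the next word would overflow it.
            if (line.Length > 0 && line.Length + 1 + word.Length > width)
            {
                result.AppendLine(new string(' ', indent) + line);
                line.Length = 0;
            }

            if (line.Length > 0)
                line.Append(' ');
            line.Append(word);
        }

        if (line.Length > 0)
            result.AppendLine(new string(' ', indent) + line);

        return result.ToString();
    }
}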

These problems are already solved for web applications. The url and form parameters represent the input (whatever web framework you use determines which method to run and how those parameters are parsed). The layout is obviously taken care of with html.

So what can synoptic do for you?

If you have used any reasonable web mvc framework, you have defined classes and methods that map to user input (the url). Synoptic does a similar thing, but for command line applications. It's best explained by using an example:

Suppose you were using solr for search on your website and you were attempting to tune or maintain your search index. You might want to be able to do this from the command line so you can measure the performance or relevance regularly and maintain some kind of log. In synoptic, you would define this using the following syntax:

[Command(Name="search", Description="Allows you to perform various operations on the site search engine.")]
public class SearchCommand
{
    [CommandAction]
    public void Query(string term)
    {
        // Search logic goes here.
    }

    [CommandAction]
    public void RebuildIndex()
    {
        // Logic to rebuild your index goes here.
    }
}
    
Then in your application entry point, you feed the arguments to synoptic:

public class MyProgram
{
    public static void Main(string[] args)
    {
        new CommandRunner().Run(args);
    }
}
    
You could now invoke the query method from the command line using the following:

myapp.exe search query --term=mysearchterm

... or you could rebuild the search index with the following command:

myapp.exe search rebuild-index

If you run your application without specifying a command, you will see the usage pattern that is automatically generated (this is largely modeled on the git command line client's behavior, if it looks familiar).

That's all you need to get your first synoptic application up and running. There is much more information on the wiki that covers more advanced features such as:

  • Using dependency injection with your commands
  • Supporting "global" options (e.g. allowing the user to specify logging verbosity)
  • Customizing and validating parameters
  • Using the ConsoleFormatter to format text (including ConsoleTable, ConsoleRow, ConsoleCell and ConsoleStyle, which can be used to create perfectly formatted, wrapped and indented content on the command line).

There are many new features we have in mind that we will hopefully be adding shortly (feel free to contribute ideas or code):

  • Model-binding so that methods support more than just primitive types
  • More console widgets (e.g. download progress etc.)
  • Additional customizations around global options
  • Internationalization

A big thank you to Mono.Options for providing such a versatile command-parsing library (which synoptic uses internally).