Search Engine Performance: How to Avoid Suffix Queries


Go to Frontpage Next Article

The reason why Lucene's QueryParser doesnt allow suffix queries ("*keyword") is the inefficient nature of such query types. In order to find a word with a suffix query, it must iterate through every term in the index and check if it matches. Depending on the index size, this can potentially kill performance and slow down your application. There is, however, a simple trick which lets you perform suffix queries without the performance disadvantage.

Simply store the reverse of the string and transform suffix queries to prefix queries against the reverse string.

Here's an example. With two strings in your index

A) "Tee House" B) "Waffle House"

you're going to store the reverse string too

A) "esuoH eeT" B) "esuoH elffaW"

and transform each suffix query like "*House" to "esuoH*". The results will be the same as with a real suffix query - just faster!

The disadvantage of this method? denormalization. Since each string is going to be stored twice, index size will become bigger.


Published February 15, 2008 • Tagged as lucene, performance


Comments

blog comments powered by Disqus