Full-text search in Geary
Updated 8 December 2013 to add the Search Corpus section detailing how we turn emails into searchable text.
It’s been a few weeks since we pushed out a new release of Geary, the email client I’ve been working on at Yorba. One of the big features in this version is fast-as-you-type local full-text search. I spent quite a bit of time this cycle implementing the search feature, and now that I’ve had some breathing room, I’d like to write about some of the technical problems I ran into, how I solved them, and what I learned in the process. Bear with me: there’s a lot of complexity hidden by the simple search interface.
For the lazy, here’s a TL;DR:
- SQLite’s full-text search capabilities are great.
- SQLite’s built-in search tokenizers are not great.
- We wanted Geary’s search algorithm to incorporate stemming.
- There’s a great open source stemming tokenizer we use called unicodesn.
- What you type in the search box gets preprocessed before being handed off to SQLite, to fix a number of potential issues.
- We break down emails into a number of plain text fields in the search table.
- The SQL query to run the full-text search is more complicated than you might expect.
- Lots of work went into making sure the search runs quickly and correctly.
- The end result is great, though there’s of course room for improvement.
Let’s dive in. First, some background. Geary uses SQLite as its data store. Geary is meant to be relatively light-weight, so we don’t pull all of an account’s email down immediately, but we grab at least a couple weeks’ worth of messages to have enough context for things like local search to work as expected. On top of that, we don’t currently have any code in Geary to trim the database as it grows, so especially if you’ve been running it regularly for a few months, your mail database can become quite large.
After briefly considering some alternatives like Lucene and Xapian, we settled on simply making use of SQLite’s built-in full-text search capabilities. The potential gain from a different search library that might be slightly more optimized or have a few extra features seemed small; SQLite’s FTS4 engine is pretty fast even for large data sets, and it’s easier to adopt from an existing SQLite application than a separate project would be. See these two blog posts for some great summaries of the technologies that were up for consideration.
FTS Overview
In SQLite, the FTS system works through a virtual table you create similarly to any other table, and then you simply insert the corpus of text data you want to search over. Behind the scenes, SQLite breaks the text down into individual words, a process called tokenization, and stores those tokens in an index optimized for fast lookup. To run a full-text search, you simply SELECT data back out of the table using a MATCH operator.
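To make that concrete, here’s a minimal sketch of the workflow in the sqlite3 shell. The table and column names are made up for illustration; this isn’t Geary’s actual schema.

    -- Create a virtual table; its text columns are tokenized and indexed.
    CREATE VIRTUAL TABLE DocumentSearchTable USING fts4(subject, body);

    -- Insert the corpus as ordinary rows.
    INSERT INTO DocumentSearchTable (subject, body)
        VALUES ('Meeting notes', 'Following up on what we discussed.');

    -- Query the tokenized index with the MATCH operator.
    SELECT docid FROM DocumentSearchTable
        WHERE DocumentSearchTable MATCH 'discussed';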
There are different ways to break down text into tokens, so SQLite comes with a few different tokenizers. Unfortunately, most of them are inadequate for use in modern, nontrivial applications.
- simple: outside the ASCII range, all characters are included as literal word characters. This means you lose case-insensitivity for almost everything but English, and some punctuation marks like angled quotes are considered part of the adjacent word. No good for the soup of languages and Unicode characters encountered in email.
- porter: similar to simple, but additionally stems English tokens using the Porter stemming algorithm, which doesn’t help for non-English words. I’ll discuss stemmers in more detail below.
- icu: splits tokens at locale-dependent word boundaries, but then includes punctuation as tokens. The docs say this works in some locales, but breaks down in others. Also, this tokenizer is only optionally included at compile time, and for example the Ubuntu SQLite package doesn’t include it, which is a deal-breaker for Geary.
- unicode61: close to ideal: uses full Unicode 6.1 rules to separate tokens, and even optionally removes diacritics, a form of transliteration that can make searching for non-English words more convenient. Unfortunately, like icu, it’s not built by default in Ubuntu’s SQLite package, so we can’t simply turn it on in Geary.
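For reference, the tokenizer is chosen once, at table creation time, through the tokenize option. A sketch, again with invented table names:

    -- The tokenizer is fixed when the table is created.
    CREATE VIRTUAL TABLE PorterSearch USING fts4(body, tokenize=porter);

    -- unicode61, when your SQLite build includes it, accepts options:
    CREATE VIRTUAL TABLE UnicodeSearch
        USING fts4(body, tokenize=unicode61 "remove_diacritics=1");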
Stemming
Stemming involves searching based on the stem or root of words instead of the literal words themselves. For example, if you search for follow, you probably want to find tokens like follows, followed, following, and maybe even follower. Stemming accomplishes this, in this example, by cutting the suffixes off, storing something like follow in place of the original words.
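You can see this in action with SQLite’s built-in porter tokenizer. A toy example with an invented table:

    CREATE VIRTUAL TABLE StemDemo USING fts4(body, tokenize=porter);
    INSERT INTO StemDemo (body) VALUES ('she followed him home');
    INSERT INTO StemDemo (body) VALUES ('his follower count grew');

    -- Both rows match, because followed, follower, and the query term
    -- follows all stem to follow in the index.
    SELECT docid FROM StemDemo WHERE body MATCH 'follows';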
There are many different stemming algorithms, each necessarily tailored to a particular language. See the Wikipedia article for a suitably dry discussion of the topic.
Search stemming can be tricky to get right. Too much stemming and you feel like you’re finding all kinds of words you didn’t want. We got comments from a user searching for eventful and finding the much more common word event, which, the user argued, they would have typed had that been what they wanted to find. Too little stemming, however, and you’re unable to find documents you know contain a certain word when you can’t remember its exact tense or phrasing.
For this reason, it would be handy to be able to disable stemming entirely in certain cases, for example if the user surrounds the word in quotes. Unfortunately, this is impossible in SQLite. Stemming in SQLite happens during tokenization, before the word is committed to the lookup table. It must happen at this phase, because queries against the lookup table must be exact (or exact prefixes, as I’ll discuss later). The trick SQLite pulls here is that the search query is also run through the same tokenization and thus stemming process, so if you search for follows, it transforms that into a lookup on follow in the index and you find the above examples just the same. It works when you want stemming, but it means there’s no way to disable it without rebuilding the entire search index.
Stemming is a natural choice for Geary’s search feature, even if it means potentially finding some matches you don’t really want. Geary presents a list of search results ordered chronologically, so if you know approximately when you received a message, it’s easy to find among some noise. Also, if you find you have too many results, there’s very little inconvenience in simply refining your search to include the sender, or any other bit of information about the message you’re looking for. In other words, in Geary’s case, it’s usually much better to err on the side of showing too many results than too few. Therefore, even if the unicode61 tokenizer were always available, we wanted instead to use something with stemming capability.
unicodesn
Luckily, a good multi-lingual, Unicode-ready stemming tokenizer for SQLite is a common enough need that there are a few existing open source projects out there scratching the itch, so we didn’t have to roll our own. The one that stood out as perfect for Geary is unicodesn.
unicodesn is based on the unicode61 sometimes-built-in tokenizer, and in fact includes its basic transliteration feature, but it also stems using the Snowball stemming library. Snowball is an open source collection of modern stemming algorithms for a number of languages. In other words, unicodesn is fantastic, and it solves all the problems we needed a tokenizer to solve.
SQLite makes it fairly straightforward to drop in a custom tokenizer. From an application’s point of view, you have to compile in some symbols that implement the simple tokenizer interface, then tell SQLite where to find them. The API for telling SQLite where your symbols are is a little scary (you encode the pointer address as a hex blob and send it in an overloaded SELECT statement), but it works. See the custom tokenizer docs for more information.
unicodesn straight from its repo will build as a shared object with a magic function to register itself as a tokenizer when loaded by SQLite. We had to add a little glue code to instead export a registration function callable from Vala as Geary initializes its database layer, plus the CMake incantations to build it statically as part of Geary.
Note that once an FTS table is created with a particular tokenizer, that tokenizer must be loaded every time the table is queried, because, as I mentioned earlier, search queries must get passed through the same tokenizer as the corpus. This makes examining your Geary database on the command line a bit trickier. If you run sqlite3 ~/.local/share/geary/YOU/geary.db and then try to SELECT anything from MessageSearchTable, you’ll get Error: unknown tokenizer: unicodesn. To fix this, you can install unicodesn in its original shared object form, which you can do from the Geary source; it’s in src/sqlite3-unicodesn. After running make && make install, the somewhat cumbersome way to fire up your database on the command line becomes:
sqlite3 -cmd '.load unicodesn.sqlext' ~/.local/share/geary/YOU/geary.db
As I mentioned before, Snowball comes with many stemming algorithms: a few English ones, and one each for several other languages. Because Geary has no way of determining ahead of time which language the user’s email will be in, we have to pick for it. We try to match the user’s locale to a language for which we have a stemming algorithm, falling back on the more advanced English stemmer if we can’t. Even people who set their computer to one language might send and receive emails in another language or many languages, so an approximation is the best we can hope for. Since the same algorithm is applied to both emails and search queries, and words in other languages aren’t too likely to trigger one language’s stemming rules, getting it wrong isn’t usually that bad. English may be a selfish default, but it seems like a good guess when you don’t know what language an email might be in.
In practice, choosing the stemming algorithm this way seems to work pretty well. The main issue with our stemmer, aside from potentially over-stemming, is that the language is set once on first run, and can’t be changed subsequently. We have a ticket to allow the user to set the stemming algorithm, maybe including disabling it entirely, but changing it means rebuilding the entire search index, so it’s not a task to undertake lightly.
Query Preprocessing
Even with stemming in place, displaying the search results as you type can be jarring. As I said before, tokens must usually match the search tokens exactly. If you’re typing even a short word like hello, the results for he, hell, and hello will likely all be very different. It’s not a great experience to have these results pop in and out when you haven’t even typed the full word you actually want to search for.
SQLite has a feature called token prefix queries that allows us to fix this problem. When you append an asterisk to a word in the query, SQLite will match any token starting with that string. Because the results for hell* are a subset of the results for he*, the search results will then appear to slim down as you finish typing. There’s a nice symmetry to this where the results become more exact as you hone your search query.
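In MATCH terms, that looks like this (same invented table as before):

    -- Matches any token starting with 'he': he, hell, hello, heard, ...
    SELECT docid FROM DocumentSearchTable
        WHERE DocumentSearchTable MATCH 'he*';

    -- A subset of the above: hell, hello, hellish, ...
    SELECT docid FROM DocumentSearchTable
        WHERE DocumentSearchTable MATCH 'hell*';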
Of course, we can’t put the onus on the user to end every word with an asterisk as they type. Instead, a bit of rudimentary preprocessing code runs each time the user types a character in the search box, modifying the query before we pass it off to SQLite. The preprocessor finds the end of what the user probably meant as a word and adds an asterisk so SQLite will match it as a prefix. The code may not be formal or particularly precise, but it works well enough.
With this preprocessor in place, seeing the search results as you type is much smoother. Like stemming, though, you don’t always want your words turned into prefixes. Imagine if you wanted to find your one email about hell, but Geary kept showing you lots of emails containing hello. Unlike stemming, the preprocessing happens just before we send the string to SQLite, so we can turn it off easily: the code checks for words and phrases surrounded by double-quotes and passes them through unmolested.
Normally, having SQLite do a prefix match takes some extra cycles, because it has to check a greater number of index entries instead of simply doing an exact comparison. However, at table creation you can tell SQLite to additionally index prefixes of each token, so that prefix matches are again exact comparisons or very close to it. We make use of this feature to speed up our common prefix match use case. Unfortunately, it comes at the cost of a larger table on disk, since each word now has a number of entries in the index. The bloated size is problematic for the obvious reasons, plus one I’ll discuss later that might not be obvious.
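The feature in question is FTS4’s prefix option. A sketch; the prefix lengths here are arbitrary, not necessarily what Geary uses:

    -- Additionally index 2- and 4-character prefixes of every token,
    -- trading disk space for faster prefix matching.
    CREATE VIRTUAL TABLE PrefixedSearch
        USING fts4(subject, body, prefix="2,4");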
We also use the preprocessor as a hook to fix a few other user experience nits and keep things running smoothly. It makes sure all quoted strings are terminated, because unbalanced quotes seem to cause SQLite problems when parsing the MATCH operand. It strips out problematic characters like %, whose placement in the query can also trigger SQLite errors. We use it to strip out a few characters and words whose significance to SQLite we want to nullify, like parentheses and boolean operators. Because we eventually want to support server search from the same search box, we disallow the parts of the full MATCH syntax that we couldn’t easily duplicate on the server. Perhaps they’ll make an appearance later, but for now we want to support only simple search phrases, not boolean logic. Basically, the preprocessor acts as a buffer between what’s typed and what SQLite sees, making sure the query is valid and only uses the syntax we allow.
Finally, we use the preprocessor to make searching for things like email addresses work like you’d expect, something of absolute necessity inside an email client. Dots and at-signs are considered punctuation characters in the Unicode standard, so unicodesn won’t treat them as part of any token. Searching for john.doe@example.com looks the same to SQLite as searching for john doe example com, which isn’t exactly what you meant. Fortunately, SQLite allows double-quotes in MATCH operands to signify that the surrounded tokens must appear adjacent to one another in order, which, as I said above, we piggy-back on to disable prefix matching. We take further advantage of this to surround any unquoted, whitespace-delimited word in quotes. After the preprocessor spits it out, your search would end up looking like "john doe example com*" (the asterisk appears because you didn’t surround the phrase in quotes) to SQLite, which is about as close as we can get to what you meant. An alternative would be to build our own tokenizer that treats email addresses (and URLs, and anything else we might want to special-case) as separate tokens, but our simple solution fixes 90% of the problems with about 1% of the effort and isn’t as error-prone as a custom tokenizer would be.
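So the final string handed to SQLite for that search ends up looking something like this (invented table name again):

    -- The quoted phrase requires the tokens to appear adjacent and in
    -- order; the trailing * makes the final token a prefix match.
    SELECT docid FROM DocumentSearchTable
        WHERE DocumentSearchTable MATCH '"john doe example com*"';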
Search Corpus
Emails are sent around the internet in a structured text format defined by RFC 5322, which used to be RFC 2822 and, even further back, RFC 822. We don’t want to simply insert the full email text to be searched against, though: many headers, for example, contain information the user doesn’t usually see or care about, and thus wouldn’t expect to see matches from.
SQLite also lets us insert text into multiple searchable columns. When you run a search, you can specify column:string inside the MATCH argument, and SQLite searches for that string in the given column only. We don’t take advantage of this yet, but we wanted to split up the emails into sensible columns anyway, so we can easily implement some Gmail-like search operators like from:sender in the next release.
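Column-restricted matching looks like this (using the invented table from earlier):

    -- Search for 'invoice' in the subject column only.
    SELECT docid FROM DocumentSearchTable
        WHERE DocumentSearchTable MATCH 'subject:invoice';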
To solve these problems, we define a few logical columns to hold information we may want to search separately, and break each part down into a purely textual representation before inserting it. The bit that handles the email body is particularly complicated. Many emails are sent with both plain text and HTML body parts, where the client is supposed to display the “best” representation, usually HTML. We want the search corpus to consist of the parts of the email visible to the user, which means we want to pick the HTML part for search, too. We can’t insert raw HTML, though: we don’t want the tags themselves to be matched, so we have to strip them out. Additionally, Geary displays attached emails inline after the original message’s body, so we try to include them too, recursively.
Matching
With everything I’ve discussed so far, we’re getting pretty close to a working search solution. The only piece left to discuss is specifically how we find matches and pull them out of the database.
Luckily, Geary doesn’t need to do any ranking of search results. Either an email matches what you typed or it doesn’t, and we sort matches chronologically, the same as if you were looking at a real folder. This makes life a little bit easier for us, as ranking isn’t built into SQLite.
The problem is that we need to index all of your mail, including messages in, for example, your spam and trash folders. In the future, we want to allow searching in specific folders, including ones you normally wouldn’t look in. When you run a regular search, though, you probably don’t want to find spammy or discarded messages, so we have to exclude messages in those folders from the results.
This wouldn’t be so bad, but Geary allows messages to exist in multiple folders simultaneously to take advantage of Gmail’s “unique” flavor of IMAP. Traditionally, a message exists in only one folder, but Gmail allows any number of labels to be attached to a message, each label showing up in IMAP as a folder containing that message. This makes our SQL query more complicated: we have to blacklist every message that appears in any undesirable folder instead of simply joining on the bad folders and adding a WHERE condition against their ids.
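The rough shape of that blacklisting looks something like the following. This is a simplified sketch with invented table and column names, not Geary’s actual query:

    -- Exclude any message that appears in a blacklisted folder, rather
    -- than joining on the folders we do want to search.
    SELECT docid FROM DocumentSearchTable
    WHERE DocumentSearchTable MATCH 'follow*'
      AND docid NOT IN (
          SELECT message_id FROM MessageFolderTable
          WHERE folder_id IN (2, 7)  -- e.g. the Spam and Trash folders
      );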
There are two other complicating factors worth mentioning. First, because Geary won’t yet pick up moves instantly, we also have to blacklist messages that aren’t in any folder, in case a message has been moved to the trash and Geary hasn’t caught up yet. Also, because we want Geary’s search results to update with any new mail you receive that matches your query, we need to be able to run the search against a small subset of mail instead of the full database, and add or subtract results from an existing list.
The final product of all this is a complicated SQL query that involves the use of a couple helper functions. This complication, necessary as it may be to get the results you’d expect, comes at the cost of some speed. Just doing a MATCH is super quick in SQLite, but when you have to combine that with all the complications of folder blacklisting, we found performance taking a serious hit. In my test runs with a database on the order of 80K emails, it was taking SQLite over half a second just to run our SELECT when matching something extremely common like a*. That’s unacceptable in an application showing results as you type.
Tuning
SQLite gives us a few tools to help combat slow queries. After running a bunch of permutations of equivalent queries through EXPLAIN QUERY PLAN, it became apparent that SQLite was somehow not choosing the right index to pull out and order results, resulting in an unnecessary full table scan. I’m not sure why SQLite seemingly dropped the ball here, since the right index seems obvious just by glancing at the query, but we found that adding an INDEXED BY clause naming the appropriate index sped things up dramatically. In my same test with the new query, our SELECT running time was cut in half, to around a quarter of a second. That’s still not as fast as we’d like, but it’s in the ballpark, and it was the best we could find without significant re-architecting.
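INDEXED BY goes right in the FROM clause. A sketch with invented table, column, and index names:

    -- Force the query planner to use a specific index; SQLite raises an
    -- error if the named index can't satisfy the query at all.
    SELECT id FROM MessageTable INDEXED BY MessageTableDateIndex
    WHERE internaldate IS NOT NULL
    ORDER BY internaldate DESC;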
There’s lots of code between the SELECT and the user seeing search results in a folder in Geary. We put a lot of work into tuning how we keep track of the list of matches and load relevant data only as it’s necessary for display. The final result works pretty well, but these parts too can occasionally be slower than ideal. There’s a tenth of a second delay after pressing a key before we run the updated search, so we don’t needlessly thrash quite as much while you’re typing. Including that delay, in the same testing environment I mentioned above, it can take upwards of three quarters of a second after you type a until you’re looking at all your mail in the results folder.
We’d obviously like to cut that time down, but again, given the time constraints for this milestone, it’s the best we could do in our initial implementation. It just means there’s some room for improvement in future versions. I’ll also mention that searching for something saner that doesn’t occur in every single one of your emails will run much more quickly. After you type a full word in the search box, you can see results being updated almost as fast as you can type, so it’s not noticeably slow in practice.
There’s one more performance issue we had to work through. I mentioned before that the size of our database was problematic for a reason not immediately obvious. We discovered that with the database bloated by a large number of prefix-indexed emails, starting Geary cold after a reboot could be slow. On our development machines we noticed delays of up to 30 seconds when starting Geary before you could see your email.
After doing some strace debugging to see what happens during the delay, it seemed SQLite tries to read the entire search index, the largest part of the database by far, into memory the first time it’s accessed. Due to the internal structure of a table, this involves a huge number of seeks and reads from disk, an especially slow operation when non-sequential. SQLite lets you control the alignment of seeks and the size of reads with the page_size pragma. By changing that value from its default of 1K to 4K, the most common modern filesystem block size, the delay went away almost entirely. I didn’t have time to look too deeply into what was going on, but I suspect a huge number of non-sequential seeks and small reads was thrashing the Linux IO subsystem and buffer cache, which operate on filesystem block-sized chunks.
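The change itself is a one-liner, with the caveat that page_size only takes effect on a brand-new database, or after a VACUUM rebuilds an existing one:

    -- Match the page size to the filesystem block size. Set this before
    -- the database has any content, or follow it with VACUUM.
    PRAGMA page_size = 4096;
    VACUUM;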
I Should Wrap This Up
For such a seemingly simple feature, implementing as-you-type local search was a much longer haul than I expected. It takes a surprising amount of work to get exactly the right emails to show up quickly in every case. In the end, I’m quite proud of what we accomplished in Geary this release, especially with the first iteration of our search feature. After using it regularly for a couple months, I’ve been pleased with its performance. The only times I’ve had to turn to Gmail proper to find something were because the email I was looking for was so old it wasn’t in my Geary database.
As I’ve mentioned, there are still a number of improvements we’d like to make to Geary’s search feature in future versions. In addition to more speed improvements, we’d like to offer Gmail-like search operators such as in:folder or from:sender. We’d like to bring back boolean and set operators so you can more precisely specify what you want to find. Probably the biggest piece of planned work is running a server-side search when the local results are incomplete.
Until then, I encourage you to take Geary out for a spin. This new version has a number of added features that add up to a surprisingly usable, if young, email client. Here’s to many more versions with many improvements.