Full-text search in Geary
Updated 8 December 2013 to add the Search Corpus section detailing how we turn emails into searchable text.
It’s been a few weeks since we pushed out a new release of Geary, the email client I’ve been working on at Yorba. One of the big features in this version is fast-as-you-type local full-text search. I spent quite a bit of time this cycle implementing the search feature, and now that I’ve had some breathing room, I’d like to write about some of the technical problems I ran into, how I solved them, and what I learned in the process. Bear with me: there’s a lot of complexity hidden by the simple search interface.
For the lazy, here’s a TL;DR:
- SQLite’s full-text search capabilities are great.
- SQLite’s built-in search tokenizers are not great.
- We wanted Geary’s search algorithm to incorporate stemming.
- There’s a great open source stemming tokenizer we use called unicodesn.
- What you type in the search box gets preprocessed before being handed off to SQLite, to fix a number of potential issues.
- We break down emails into a number of plain text fields in the search table.
- The SQL query to run the full-text search is more complicated than you might expect.
- Lots of work went into making sure the search runs quickly and correctly.
- The end result is great, though there’s of course room for improvement.
Let’s dive in. First, some background. Geary uses SQLite as its data store. Geary is meant to be relatively light-weight, so we don’t pull all of an account’s email down immediately, but we grab at least a couple weeks’ worth of messages to have enough context for things like local search to work as expected. On top of that, we don’t currently have any code in Geary to trim the database as it grows, so especially if you’ve been running it regularly for a few months, your mail database can become quite large.
After briefly considering some alternatives like Lucene and Xapian, we settled on simply making use of SQLite’s built-in full-text search capabilities. The potential gain from a different search library that might be slightly more optimized or have a few extra features seemed small; SQLite’s FTS4 engine is pretty fast even for large data sets, and it’s easier to adopt from an existing SQLite application than a separate project would be. See these two blog posts for some great summaries of the technologies that were up for consideration.
FTS Overview
In SQLite, the FTS system works through a virtual table you create similarly to any other table, and then you simply insert the corpus of text data you want to search over. Behind the scenes, SQLite breaks the text down into individual words, a process called tokenization, and stores those tokens in an index optimized for fast lookup. To run a full-text search, you simply SELECT data back out of the table using a MATCH operator.
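To make that concrete, here’s a minimal sketch of the workflow in the sqlite3 shell. The table and column names are made up for illustration; this isn’t Geary’s actual schema.

    -- Create a virtual table; its text columns are tokenized and indexed.
    CREATE VIRTUAL TABLE DocumentSearchTable USING fts4(subject, body);

    -- Insert the corpus as ordinary rows.
    INSERT INTO DocumentSearchTable (subject, body)
        VALUES ('Meeting notes', 'Following up on what we discussed.');

    -- Query the tokenized index with the MATCH operator.
    SELECT docid FROM DocumentSearchTable
        WHERE DocumentSearchTable MATCH 'discussed';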
There are different ways to break down text into tokens, so SQLite comes with a few different tokenizers. Unfortunately, most of them are inadequate for use in modern, nontrivial applications.
- simple: outside the ASCII range, all characters are included as literal word characters. This means you lose case-insensitivity for almost everything but English, and some punctuation marks like angled quotes are considered part of the adjacent word. No good for the soup of languages and Unicode characters encountered in email.
- porter: similar to simple, but additionally stems English tokens using the Porter stemming algorithm, which doesn’t help for non-English words. I’ll discuss stemmers in more detail below.
- icu: splits tokens at locale-dependent word boundaries, but then includes punctuation as tokens. The docs say this works in some locales, but breaks down in others. Also, this tokenizer is only optionally included at compile time, and for example the Ubuntu SQLite package doesn’t include it, which is a deal-breaker for Geary.
- unicode61: close to ideal: uses full Unicode 6.1 rules to separate tokens, and even optionally removes diacritics, a form of transliteration that can make searching for non-English words more convenient. Unfortunately, like icu, it’s not built by default in Ubuntu’s SQLite package, so we can’t simply turn it on in Geary.
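For reference, the tokenizer is chosen once, at table creation time, through the tokenize option. A sketch, again with invented table names:

    -- The tokenizer is fixed when the table is created.
    CREATE VIRTUAL TABLE PorterSearch USING fts4(body, tokenize=porter);

    -- unicode61, when your SQLite build includes it, accepts options:
    CREATE VIRTUAL TABLE UnicodeSearch
        USING fts4(body, tokenize=unicode61 "remove_diacritics=1");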
Stemming
Stemming involves searching based on the stem or root of words instead of the literal words themselves. For example, if you search for follow, you probably want to find tokens like follows, followed, following, and maybe even follower. Stemming accomplishes this, in this example, by cutting the suffixes off, storing something like follow in place of the original words.
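You can see this in action with SQLite’s built-in porter tokenizer. A toy example with an invented table:

    CREATE VIRTUAL TABLE StemDemo USING fts4(body, tokenize=porter);
    INSERT INTO StemDemo (body) VALUES ('she followed him home');
    INSERT INTO StemDemo (body) VALUES ('his follower count grew');

    -- Both rows match, because followed, follower, and the query term
    -- follows all stem to follow in the index.
    SELECT docid FROM StemDemo WHERE body MATCH 'follows';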
There are many different stemming algorithms, each necessarily tailored to a particular language. See the Wikipedia article for a suitably dry discussion of the topic.
Search stemming can be tricky to get right. Too much stemming and you feel like you’re finding all kinds of words you didn’t want. We got comments from a user searching for eventful and finding the much more common word event, which, the user argued, they would have typed had that been what they wanted to find. Too little stemming, however, and you’re unable to find documents you know contain a certain word when you can’t remember its exact tense or phrasing.
For this reason, it would be handy to be able to disable stemming entirely in certain cases, for example if the user surrounds the word in quotes. Unfortunately, this is impossible in SQLite. Stemming in SQLite happens during tokenization, before the word is committed to the lookup table. It must happen at this phase, because queries against the lookup table must be exact (or exact prefixes, as I’ll discuss later). The trick SQLite pulls here is that the search query is also run through the same tokenization and thus stemming process, so if you search for follows, it transforms that into a lookup on follow in the index and you find the above examples just the same. It works when you want stemming, but it means there’s no way to disable it without rebuilding the entire search index.
Stemming is a natural choice for Geary’s search feature, even if it means potentially finding some matches you don’t really want. Geary presents a list of search results ordered chronologically, so if you know approximately when you received a message, it’s easy to find among some noise. Also, if you find you have too many results, there’s very little inconvenience in simply refining your search to include the sender, or any other bit of information about the message you’re looking for. In other words, in Geary’s case, it’s usually much better to err on the side of showing too many results than too few. Therefore, even if the unicode61 tokenizer were always available, we wanted instead to use something with stemming capability.
unicodesn
Luckily, a good multi-lingual, Unicode-ready stemming tokenizer for SQLite is a common enough need that there are a few existing open source projects out there scratching the itch, so we didn’t have to roll our own. The one that stood out as perfect for Geary is unicodesn.
unicodesn is based on the unicode61 sometimes-built-in tokenizer, and in fact includes its basic transliteration feature, but it also stems using the Snowball stemming library. Snowball is an open source collection of modern stemming algorithms for a number of languages. In other words, unicodesn is fantastic, and it solves all the problems we needed a tokenizer to solve.
SQLite makes it fairly straightforward to drop in a custom tokenizer. From an application’s point of view, you have to compile in some symbols that implement the simple tokenizer interface, then tell SQLite where to find them. The API for telling SQLite where your symbols are is a little scary (you encode the pointer address as a hex blob and send it in an overloaded SELECT statement), but it works. See the custom tokenizer docs for more information.
unicodesn straight from its repo will build as a shared object with a magic function to register itself as a tokenizer when loaded by SQLite. We had to add a little glue code to instead export a registration function callable from Vala as Geary initializes its database layer, plus the CMake incantations to build it statically as part of Geary.
Note that once an FTS table is created with a particular tokenizer, that tokenizer must be loaded every time the table is queried, because, as I mentioned earlier, search queries must get passed through the same tokenizer as the corpus. This makes examining your Geary database on the command line a bit trickier. If you run sqlite3 ~/.local/share/geary/YOU/geary.db and then try to SELECT anything from MessageSearchTable, you’ll get Error: unknown tokenizer: unicodesn. To fix this, you can install unicodesn in its original shared object form, which you can do from the Geary source; it’s in src/sqlite3-unicodesn. After running make && make install, the somewhat cumbersome way to fire up your database on the command line becomes:
sqlite3 -cmd '.load unicodesn.sqlext' ~/.local/share/geary/YOU/geary.db
As I mentioned before, Snowball comes with many stemming algorithms: a few English ones, and one each for several other languages. Because Geary has no way of determining ahead of time which language the user’s email will be in, we have to pick for it. We try to match the user’s locale to a language for which we have a stemming algorithm, falling back on the more advanced English stemmer if we can’t. Even people who set their computer to one language might send and receive emails in another language or many languages, so an approximation is the best we can hope for. Since the same algorithm is applied to both emails and search queries, and words in other languages aren’t too likely to trigger one language’s stemming rules, getting it wrong isn’t usually that bad. English may be a selfish default, but it seems like a good guess when you don’t know what language an email might be in.
In practice, choosing the stemming algorithm this way seems to work pretty well. The main issue with our stemmer, aside from potentially over-stemming, is that the language is set once on first run, and can’t be changed subsequently. We have a ticket to allow the user to set the stemming algorithm, maybe including disabling it entirely, but changing it means rebuilding the entire search index, so it’s not a task to undertake lightly.
Query Preprocessing
Even with stemming in place, displaying the search results as you type can be jarring. As I said before, tokens must usually match the search tokens exactly. If you’re typing even a short word like hello, the results for he, hell, and hello will likely all be very different. It’s not a great experience to have these results pop in and out when you haven’t even typed the full word you actually want to search for.
SQLite has a feature called token prefix queries that allows us to fix this problem. When you append an asterisk to a word in the query, SQLite will match any token starting with that string. Because the results for hell* are a subset of the results for he*, the search results will then appear to slim down as you finish typing. There’s a nice symmetry to this where the results become more exact as you hone your search query.
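In MATCH terms, that looks like this (same invented table as before):

    -- Matches any token starting with 'he': he, hell, hello, heard, ...
    SELECT docid FROM DocumentSearchTable
        WHERE DocumentSearchTable MATCH 'he*';

    -- A subset of the above: hell, hello, hellish, ...
    SELECT docid FROM DocumentSearchTable
        WHERE DocumentSearchTable MATCH 'hell*';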
Of course, we can’t put the onus on the user to end every word with an asterisk as they type. Instead, a bit of rudimentary preprocessing code runs each time the user types a character in the search box, modifying the query before we pass it off to SQLite. The preprocessor finds the end of what the user probably meant as a word and adds an asterisk so SQLite will match it as a prefix. The code may not be formal or particularly precise, but it works well enough.
With this preprocessor in place, seeing the search results as you type is much smoother. Like stemming, though, you don’t always want your words turned into prefixes. Imagine if you wanted to find your one email about hell, but Geary kept showing you lots of emails containing hello. Unlike stemming, the preprocessing happens just before we send the string to SQLite, so we can turn it off easily: the code checks for words and phrases surrounded by double-quotes and passes them through unmolested.
Normally, having SQLite do a prefix match takes some extra cycles, because it has to check a greater number of index entries instead of simply doing an exact comparison. However, at table creation you can tell SQLite to additionally index prefixes of each token, so that prefix matches are again exact comparisons or very close to it. We make use of this feature to speed up our common prefix match use case. Unfortunately, it comes at the cost of a larger table on disk, since each word now has a number of entries in the index. The bloated size is problematic for the obvious reasons, plus one I’ll discuss later that might not be obvious.
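The feature in question is FTS4’s prefix option. A sketch; the prefix lengths here are arbitrary, not necessarily what Geary uses:

    -- Additionally index 2- and 4-character prefixes of every token,
    -- trading disk space for faster prefix matching.
    CREATE VIRTUAL TABLE PrefixedSearch
        USING fts4(subject, body, prefix="2,4");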
We also use the preprocessor as a hook to fix a few other user experience nits and keep things running smoothly. It makes sure all quoted strings are terminated, because unbalanced quotes seem to cause SQLite problems when parsing the MATCH operand. It strips out problematic characters like %, whose placement in the query can also trigger SQLite errors. We use it to strip out a few characters and words whose significance to SQLite we want to nullify, like parentheses and boolean operators. Because we eventually want to support server search from the same search box, we disallow the parts of the full MATCH syntax that we couldn’t easily duplicate on the server. Perhaps they’ll make an appearance later, but for now we want to support only simple search phrases, not boolean logic. Basically, the preprocessor acts as a buffer between what’s typed and what SQLite sees, making sure the query is valid and only uses the syntax we allow.
Finally, we use the preprocessor to make searching for things like email addresses work like you’d expect, something of absolute necessity inside an email client. Dots and at-signs are considered punctuation characters in the Unicode standard, so unicodesn won’t treat them as part of any token. Searching for john.doe@example.com looks the same to SQLite as searching for john doe example com, which isn’t exactly what you meant. Fortunately, SQLite allows double-quotes in MATCH operands to signify that the surrounded tokens must appear adjacent to one another in order, which, as I said above, we piggy-back on to disable prefix matching. We take further advantage of this to surround any unquoted, whitespace-delimited word in quotes. After the preprocessor spits it out, your search would end up looking like "john doe example com*" (the asterisk appears because you didn’t surround the phrase in quotes) to SQLite, which is about as close as we can get to what you meant. An alternative would be to build our own tokenizer that treats email addresses (and URLs, and anything else we might want to special-case) as separate tokens, but our simple solution fixes 90% of the problems with about 1% of the effort and isn’t as error-prone as a custom tokenizer would be.
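So the final string handed to SQLite for that search ends up looking something like this (invented table name again):

    -- The quoted phrase requires the tokens to appear adjacent and in
    -- order; the trailing * makes the final token a prefix match.
    SELECT docid FROM DocumentSearchTable
        WHERE DocumentSearchTable MATCH '"john doe example com*"';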
Search Corpus
Emails are sent around the internet in a structured text format defined by RFC 5322, which used to be RFC 2822 and, even further back, RFC 822. We don’t want to simply insert the full email text to be searched against, though: many headers, for example, contain information the user doesn’t usually see or care about, and thus wouldn’t expect to see matches from.
SQLite also lets us insert text into multiple searchable columns. When you run a search, you can specify column:string inside the MATCH argument, and SQLite searches for that string in the given column only. We don’t take advantage of this yet, but we wanted to split up the emails into sensible columns anyway, so we can easily implement some Gmail-like search operators like from:sender in the next release.
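Column-restricted matching looks like this (using the invented table from earlier):

    -- Search for 'invoice' in the subject column only.
    SELECT docid FROM DocumentSearchTable
        WHERE DocumentSearchTable MATCH 'subject:invoice';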
To solve these problems, we define a few logical columns to hold information we may want to search separately, and break each part down into a purely textual representation before inserting it. The bit that handles the email body is particularly complicated. Many emails are sent with both plain text and HTML body parts, where the client is supposed to display the “best” representation, usually HTML. We want the search corpus to consist of the parts of the email visible to the user, which means we want to pick the HTML part for search, too. We can’t insert raw HTML, though: we don’t want the tags themselves to be matched, so we have to strip them out. Additionally, Geary displays attached emails inline after the original message’s body, so we try to include them too, recursively.
Matching
With everything I’ve discussed so far, we’re getting pretty close to a working search solution. The only piece left to discuss is specifically how we find matches and pull them out of the database.
Luckily, Geary doesn’t need to do any ranking of search results. Either an email matches what you typed or it doesn’t, and we sort matches chronologically, the same as if you were looking at a real folder. This makes life a little bit easier for us, as ranking isn’t built into SQLite.
The problem is that we need to index all of your mail, including messages in, for example, your spam and trash folders. In the future, we want to allow searching in specific folders, including ones you normally wouldn’t look in. When you run a regular search, though, you probably don’t want to find spammy or discarded messages, so we have to exclude messages in those folders from the results.
This wouldn’t be so bad, but Geary allows messages to exist in multiple folders simultaneously to take advantage of Gmail’s “unique” flavor of IMAP. Traditionally, a message exists in only one folder, but Gmail allows any number of labels to be attached to a message, each label showing up in IMAP as a folder containing that message. This makes our SQL query more complicated: we have to blacklist every message that appears in any undesirable folder instead of simply joining on the bad folders and adding a WHERE condition against their ids.
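The rough shape of that blacklisting looks something like the following. This is a simplified sketch with invented table and column names, not Geary’s actual query:

    -- Exclude any message that appears in a blacklisted folder, rather
    -- than joining on the folders we do want to search.
    SELECT docid FROM DocumentSearchTable
    WHERE DocumentSearchTable MATCH 'follow*'
      AND docid NOT IN (
          SELECT message_id FROM MessageFolderTable
          WHERE folder_id IN (2, 7)  -- e.g. the Spam and Trash folders
      );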
There are two other complicating factors worth mentioning. First, because Geary won’t yet pick up moves instantly, we also have to blacklist messages that aren’t in any folder, in case a message has been moved to the trash and Geary hasn’t caught up yet. Also, because we want Geary’s search results to update with any new mail you receive that matches your query, we need to be able to run the search against a small subset of mail instead of the full database, and add or subtract results from an existing list.
The final product of all this is a complicated SQL query that involves the use of a couple helper functions. This complication, necessary as it may be to get the results you’d expect, comes at the cost of some speed. Just doing a MATCH is super quick in SQLite, but when you have to combine that with all the complications of folder blacklisting, we found performance taking a serious hit. In my test runs with a database on the order of 80K emails, it was taking SQLite over half a second just to run our SELECT when matching something extremely common like a*. That’s unacceptable in an application showing results as you type.
Tuning
SQLite gives us a few tools to help combat slow queries. After running a bunch of permutations of equivalent queries through EXPLAIN QUERY PLAN, it became apparent that SQLite was somehow not choosing the right index to pull out and order results, resulting in an unnecessary full table scan. I’m not sure why SQLite seemingly dropped the ball here, since the right index seems obvious just by glancing at the query, but we found that adding an INDEXED BY clause naming the appropriate index sped things up dramatically. In my same test with the new query, our SELECT running time was cut in half, to around a quarter of a second. That’s still not as fast as we’d like, but it’s in the ballpark, and it was the best we could find without significant re-architecting.
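INDEXED BY goes right in the FROM clause. A sketch with invented table, column, and index names:

    -- Force the query planner to use a specific index; SQLite raises an
    -- error if the named index can't satisfy the query at all.
    SELECT id FROM MessageTable INDEXED BY MessageTableDateIndex
    WHERE internaldate IS NOT NULL
    ORDER BY internaldate DESC;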
There’s lots of code between the SELECT and the user seeing search results in a folder in Geary. We put a lot of work into tuning how we keep track of the list of matches and load relevant data only as it’s necessary for display. The final result works pretty well, but these parts too can occasionally be slower than ideal. There’s a tenth of a second delay after pressing a key before we run the updated search, so we don’t needlessly thrash quite as much while you’re typing. Including that delay, in the same testing environment I mentioned above, it can take upwards of three quarters of a second after you type a until you’re looking at all your mail in the results folder.
We’d obviously like to cut that time down, but again, given the time constraints for this milestone, it’s the best we could do in our initial implementation. It just means there’s some room for improvement in future versions. I’ll also mention that searching for something saner that doesn’t occur in every single one of your emails will run much more quickly. After you type a full word in the search box, you can see results being updated almost as fast as you can type, so it’s not noticeably slow in practice.
There’s one more performance issue we had to work through. I mentioned before that the size of our database was problematic for a reason not immediately obvious. We discovered that with the database bloated by a large number of prefix-indexed emails, starting Geary cold after a reboot could be slow. On our development machines we noticed delays of up to 30 seconds when starting Geary before you could see your email.
After doing some strace debugging to see what happens during the delay, it seemed SQLite tries to read the entire search index, the largest part of the database by far, into memory the first time it’s accessed. Due to the internal structure of a table, this involves a huge number of seeks and reads from disk, an especially slow operation when non-sequential. SQLite lets you control the alignment of seeks and the size of reads with the page_size pragma. By changing that value from its default of 1K to 4K, the most common modern filesystem block size, the delay went away almost entirely. I didn’t have time to look too deeply into what was going on, but I suspect a huge number of non-sequential seeks and small reads was thrashing the Linux IO subsystem and buffer cache, which operate on filesystem block-sized chunks.
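The change itself is a one-liner, with the caveat that page_size only takes effect on a brand-new database, or after a VACUUM rebuilds an existing one:

    -- Match the page size to the filesystem block size. Set this before
    -- the database has any content, or follow it with VACUUM.
    PRAGMA page_size = 4096;
    VACUUM;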
I Should Wrap This Up
For such a seemingly simple feature, implementing as-you-type local search was a much longer haul than I expected. It takes a surprising amount of work to get exactly the right emails to show up quickly in every case. In the end, I’m quite proud of what we accomplished in Geary this release, especially with the first iteration of our search feature. After using it regularly for a couple months, I’ve been pleased with its performance. The only times I’ve had to turn to Gmail proper to find something were because the email I was looking for was so old it wasn’t in my Geary database.
As I’ve mentioned, there are still a number of improvements we’d like to make to Geary’s search feature in future versions. In addition to more speed improvements, we’d like to offer Gmail-like search operators such as in:folder or from:sender. We’d like to bring back boolean and set operators so you can more precisely specify what you want to find. Probably the biggest piece of planned work is running a server-side search when the local results are incomplete.
Until then, I encourage you to take Geary out for a spin. This new version has a number of added features that add up to a surprisingly usable, if young, email client. Here’s to many more versions with many improvements.