In ZeroDB, it’s easy to find encrypted data by the value of one field simply by traversing an index from the client. But what about more complex queries? In particular, what about full text search, where a user finds documents matching one or more keywords in a query, ordered by a relevance score?
Originally we started with the standard Okapi BM25 scoring function, which ZODB used for full text search, and tried optimizing the bucket sizes of the trees used in the indexes. We quickly discovered that this approach doesn’t scale. It worked well enough when searching over an encrypted copy of the Linux Kernel Mailing List, but with larger datasets we ended up downloading far too much data to the client.
We’ve managed to fix that. And our fix may be equally applicable to making Elasticsearch more scalable. Briefly, we’ve reduced the average bandwidth required for a multi-keyword search (with a limit) from O(n_query_terms) * sum(O(n_docs_per_term) for all query_terms) down to O(n_query_terms) * constant, in an incremental manner. Feel free to test zerodb >= 0.97 / zerodb-server >= 0.1.16 with this new full text search and give us your thoughts on Slack. It works with the new version of
Here’s a more detailed description for those who are curious. Or you can read a less wordy version in the comments in the code.
What’s wrong with Okapi? Let’s first look at how it works. The document score is determined as:

score(D, q) = IDF(q) * f(q, D) * (k1 + 1) / (f(q, D) + k1 * (1 − b + b * |D| / avgdl))

where f(q, D) is the number of times we see term q in document D, |D| is the length of document D in words, avgdl is the average document length in words, b and k1 are constants, and IDF is the inverse document frequency. For simplicity, let’s assume that there is only one keyword q, so IDF is just a constant multiplier.
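As a quick illustration, here is that scoring written as a small Python function (the function name and the default values for k1 and b are illustrative, not ZeroDB’s actual code):

```python
def bm25_score(f_qD, doc_len, avgdl, idf, k1=1.2, b=0.75):
    """Okapi BM25 score of a document D for a single term q.

    f_qD    -- number of occurrences of term q in D
    doc_len -- |D|, length of D in words
    avgdl   -- average document length in the collection, in words
    idf     -- inverse document frequency of q (a constant multiplier)
    """
    return idf * f_qD * (k1 + 1) / (f_qD + k1 * (1 - b + b * doc_len / avgdl))
```

Note that avgdl sits in the denominator, which is exactly why every new document shifts the scores of all existing documents.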
What’s the problem with this formula? The way we work with large datasets in ZeroDB requires references to documents to be pre-ordered in the index, so that limit queries are cheap (e.g. just return the first 10 docids in the index). But the score above depends on both the document length and the average document length (avgdl). This avgdl changes every time we add a new document to the dataset, which changes the scores unpredictably, so we would have to re-calculate them on every query.
That means we have to download document lengths and f(q, D) for all documents which contain term q. That is not only unscalable (especially when there are many documents) but also very insecure: the size of the data we’ve just downloaded is proportional to the number of documents containing q, which gives the server a good clue as to what the keyword q probably is, given natural language statistics.
So for our purposes, Okapi BM25 is not good. We need to try something else.
Lucene’s practical scoring function
Fortunately, there is a more practical way of calculating document scores, used by Lucene/Elasticsearch. It turns out that in many cases it is enough to introduce some non-linearity into the scoring function.
In Lucene’s scoring, for the simple case of one search term q, the score is just proportional to sqrt(f(q, D)) / sqrt(|D|_unique): the square root of the number of occurrences of q in the document D, divided by the square root of the total number of unique terms in document D.
In this case, the document score can be pre-calculated, and the docids for each term can be ordered by that score. So, if we do a limit query, we just take the first documents from the set pre-ordered by -score. The server can guess how many documents we’ve downloaded, but not how many documents contain the term. This is secure enough!
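Here is a minimal sketch of such a pre-ordered per-term index, using the sqrt-based score described above (all names here are hypothetical, not ZeroDB’s API):

```python
import math

def term_score(term_freq, n_unique_terms):
    # sqrt(occurrences of the term in D) / sqrt(unique terms in D);
    # this depends only on the document itself, so it never changes as
    # the collection grows and can be computed once, at insertion time
    return math.sqrt(term_freq) / math.sqrt(n_unique_terms)

def build_term_index(docs):
    """docs: {docid: list of words}.
    Returns {term: [(score, docid), ...]}, each posting list
    pre-ordered by descending score."""
    index = {}
    for docid, words in docs.items():
        n_unique = len(set(words))
        for term in set(words):
            index.setdefault(term, []).append(
                (term_score(words.count(term), n_unique), docid))
    for postings in index.values():
        postings.sort(key=lambda sd: (-sd[0], sd[1]))
    return index

def search(index, term, limit=10):
    # a limit query just reads the first `limit` postings from the
    # pre-ordered list; no re-scoring is needed at query time
    return [docid for _, docid in index.get(term, [])[:limit]]
```

The key property is that `term_score` has no avgdl-like global term, so insertions never invalidate the stored ordering.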
Multi-keyword search and incremental weighted union
Ok, that was easy enough. But what happens when there are multiple keywords in a search query? In this case, for each term q_i we effectively already have a list of docids and scores:
scores(q_i) = [(score_i(docid), docid), …]
ordered by -score_i. Documents in the result set will effectively have a total score:
score(docid) = sum(w_i * score_i(docid) for i in query_terms),
and these docids should be ordered by -score. Most queries are limit queries (e.g. the top 20 docs with the highest score). Can we do these without downloading all the score_i values?
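In code, the total score is just a weighted sum of the per-term scores (a sketch with hypothetical names):

```python
def total_score(per_term_scores, weights):
    """Total score of one document.

    per_term_scores -- {term: score_i(docid)} for the query terms
                       present in the document
    weights         -- {term: w_i}, per-query-term weights
    """
    return sum(weights[t] * s for t, s in per_term_scores.items())
```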
In Lucene, it’s easiest to just download all the score_i values into memory when a query is performed. But when we want everything to stay encrypted on the server, we don’t have that luxury! So, how do we produce this list of docids ordered by score without downloading score_i for all terms?
In order to do that, we first download a small number of document scores and docids for every keyword q_i. After that, based on the scores and weights we know, and on the fact that we read higher scores first, we estimate a possible range (min_score to max_score) for the total score of every document we’ve seen so far. That is, if for a docid we’ve already downloaded score_i for i in i_known, with score_i for i in i_unknown still to be downloaded, the possible range of scores for this docid would be:
min_score(docid) = sum(score_i for i in i_known),
max_score(docid) = min_score(docid) + sum(min(score_i read so far) for i in i_unknown).
We calculate these bounds for all documents we’ve seen so far, determine their order (sorting by min_score) and check whether any order violations are still possible (e.g. min_score(docid_1) > min_score(docid_2) but min_score(docid_1) < max_score(docid_2)). We also check that min_score(last_document) > max_possible_score(documents_left_unread). These tests determine whether we need to read more scores and docids from the server, or whether we already have enough information to determine the relative order of docids.
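The bound computation and the stopping check above can be sketched as follows (names are hypothetical, and per-term scores are assumed to already include the weights w_i):

```python
def score_bounds(known_scores, frontier):
    """Possible range for one document's total score.

    known_scores -- {term: score_i} for terms whose posting for this
                    docid has already been read
    frontier     -- {term: lowest score read so far from that term's
                    posting list}; since lists are read in descending
                    order, any still-unread score is bounded above by this
    """
    min_score = sum(known_scores.values())
    max_score = min_score + sum(
        s for term, s in frontier.items() if term not in known_scores)
    return min_score, max_score

def order_is_settled(bounds):
    """bounds: list of (min_score, max_score) for candidate documents.
    Rank candidates by descending min_score; the ranking is final only
    if no lower-ranked candidate's max_score can exceed a higher-ranked
    candidate's min_score -- otherwise more postings must be read."""
    ranked = sorted(bounds, key=lambda mm: -mm[0])
    for (mn, _), (_, next_mx) in zip(ranked, ranked[1:]):
        if next_mx > mn:
            return False
    return True
```

Each round, the client reads a few more postings, tightens the frontier, and re-runs the check — so the amount downloaded stays roughly constant per query term instead of growing with the number of matching documents.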