Combining search expressions

When regular expression syntax is not enough #

Regular expressions are very powerful. However, sometimes their syntax becomes too complex and cumbersome. This is the case when you want to search for “this, but not that”. Writing single regular expression for such pattern requires some look-ahead magic. Things become a lot easier if search pattern can be combined from several simple regular expressions using boolean logic operators and, or and not. Compare

(?=.*word1)(?=.*word2)(?!.*word3)

with

word1 and word2 and not(word3)

Matching boolean combination of regular expressions has several steps. Firstly, a pattern has to be split into sub-patterns and transformed into symbolic expression that can be repeatedly evaluated. Pattern that looks like this:

"pattern 1" and ("pattern 2" or "pattern 3") and not("pattern 4")

is translated into the following symbolic boolean expression

p1 and (p2 or p3) and not(p4)

where p1, p2, p3 and p4 are variables that correspond to sub-patterns.

Then for each line of text all sub-patterns are run through regex engine to determine which patterns match the input string. This can be expressed as a table:

p1 = true
p2 = false
p3 = true
p4 = false

After that boolean expression is evaluated to get final verdict for the line.

First two steps were easy to implement both for Hyperscan and Qt regular expressions engines. Hyperscan has a big advantage here as it compiles a set of regular expressions into a single database and matches them all at once, so it does not take too much time to match all sub-patterns.

Although that could be a fun experiment, I decided not to write general boolean expressions parser and evaluator from scratch. For the first prototype of the whole idea I used LibBoolEE. It is very easy to integrate and use. Code looks something like this:

    LibBoolEE::Vals vals = { { "A", true }, { "B", false } };
    return LibBoolEE::resolve("A|B&B", vals); // returns 1

This was good enough for the proof of concept. However, LibBoolEE::resolve parses expression each time, that causes big performance degradation. Expression is evaluated for each line from the opened file, so it has to be very fast. To achieve best results an expression has to be parsed only once and transformed into some form that can be repeatedly executed with different arguments.

Next thing to try was a scripting language that can be embedded into C++ application. First one was javascript engine provided by Qt library. The results were better but still not very impressive. Then I tried LuaJit with the help of sol2 and Jinx that claimed to be suitable for realtime applications like games. Both of them performed better than QJsEngine but still not great.

After some more research I found The Great C++ Mathematical Expression Parser Benchmark. This is a benchmark of different implementations of open source math expression parsers and evaluators written in C++. Exprtk library seemed to be a good choice. And indeed for my task it performed much better than general purpose embedded scripting engines.

However, for some reason it felt that search speed could be improved. I enjoy making Klogg as fast as possible, so I looked at reports of perf tool. First thing that caught my attention was a lot of calls to tolower function. Exprtk is case-insensitive by default and uses several maps and sets for its internal state. Switching to case-sensitive mode provided visible speedup. After that top functions in perf report were std::map insert and lookup functions. I tried switching from std::map to std::unordered_map. That did make things better, but still considerable amount of time was being spent in std::unordered_map implementation. So next natural thing to try was switching to some faster hash table implementation. I tried robin_hood unordered map and was quite impressed by it. That was the final step of performance tuning.

So now Klogg has quite fast boolean expression combination mode!