Regex is a mini-language and my library compiles it to Rust. Writing it was surprisingly difficult. While working on one bug, I had a moment of doubt about my ability to solve it. I've had such experiences only a few times in my life. Luckily, I woke up the next morning with the solution in my mind.
The work was satisfying. I also wrote a safe Rust async library, like a mini-tokio without all the scary bits: https://crates.io/crates/safina . And I started writing a safe Rust TLS 1.3 implementation and a safe Rust HTTP library. Working on a challenging side project can bring you satisfaction, knowledge, and confidence, even if you never finish the project.
I'm not the author, but it looks like he's using the same general approach as other similar libraries: Rust has a fairly advanced macro system that can be used to parse input strings at compile time and output hygienic Rust code. There are safe database libraries for example that check that the SQL query and the supplied parameters match.
safe-regex uses a Rust Procedural Macro [0] to convert expressions like `regex!(br"ab?c+")` into a Rust code block that defines a new type and returns an instance of it. The type implements a `Matcher` interface with a function that takes a byte array, executes the regular expression against it, and returns true/false and the capture groups. Examples of the generated code are in [1].
The generated function stores capture-group info on the stack. It is not recursive. It makes one pass through the input, using an NFA algorithm. It is plain Rust code that starts instantly. This instant startup makes `safe-regex` faster on small inputs than `regex` [3], which must compile the expression at runtime or retrieve it from a cache.
The `regex` library has highly optimized code which implements Aho-Corasick with SIMD instructions to find substrings and then matches the regex forward or backward from each substring match, while also tracking UTF-8 character boundaries. It's amazing. On large inputs, it is 20-500 times faster than `safe-regex` in benchmarks.
I expect that the simpler and shorter code from `safe-regex` puts less pressure on the processor instruction caches, improving performance of the rest of the program and other processes running on the machine. I believe Rust's benchmark library does not measure this.
The Rust compiler optimizes the generated Rust code. This means that `safe-regex` gets faster as the Rust compiler gets better. It also exercises the Rust compiler like few hand-written programs do. A previous version of `safe-regex` compiled expressions like `a?a?a?a?a?` into Rust code that rustc could not compile in reasonable time. I let it compile for an hour once and it didn't finish. I think the rustc optimizer was traversing the potential execution paths with an inefficient algorithm. The workaround was to add extra useless intermediate function calls, so the optimizer would hit some internal limit earlier and give up, allowing compilation to complete. I saved the code that showed this problem [2]. A later refactoring eliminated the problem entirely.
The `regex!(br"ab?c+")` syntax is intuitive for Rust users. An invalid expression fails at compile time with a decent error message. The library does not allocate on the heap. The matcher object occupies a single byte when not matching. This is all very idiomatic. Macros are difficult to fuzz-test because the Rust compiler must run for each test case. Writing tests by hand is tedious, so I wrote a test helper that generates input strings. It uncovered some subtle bugs.
Here's how safe-regex works: every matcher has a variable (b0, b1, etc.) for each byte of the regular expression, representing states in the NFA. We iterate through the input bytes. In each iteration, we get the next input byte `b` and compute each state's transition for consuming that byte.
A matcher with no capturing groups uses the type `Option<()>` to represent each state. Each state is either `Option::Some(())`, meaning "the input matched up to this point", or `Option::None`, meaning "no match". When the loop finishes, if the `accept` state is `Option::Some`, then the input matched. We use `Option::filter` to copy the preceding state only if the current byte matches.
The core of the matcher for "a" [1]:
    let mut start = Some(());
    let mut b0: Option<()> = None;
    let mut data_iter = data.iter();
    loop {
        let prev_b0 = b0.clone();
        if let Some(b) = data_iter.next() {
            b0 = start.clone().filter(|_| *b == 97u8);
            start = None;
        } else {
            return prev_b0;
        }
    }
When matching with capturing groups, we must keep track of every capturing group for every state. A capturing-group match is two integers marking the start and end of the matching portion of the input; we store them in a `Range<usize>` struct. There can be multiple capturing groups, so we store the ranges in a tuple. The state variables become type `Option<(Range<usize>,)>`.
When the input enters a capturing group, we set the `Range` for that group to an empty range at the current position: `.map(|(r0,)| (n..n,))`. Technically, it's the zero-length position between the current byte and the previous byte.
Whenever a byte matches inside a group, we expand the range for that group to include the byte: `.map(|(r0,)| (r0.start..n + 1,))`.
Here's the core of the matcher for "(a)" [1]:
    let mut start = Some((usize::MAX..usize::MAX,));
    let mut b0: Option<(core::ops::Range<usize>,)> = None;
    let mut accept: Option<(core::ops::Range<usize>,)> = None;
    let mut data_iter = data.iter();
    let mut n = 0;
    loop {
        let prev_b0 = b0.clone();
        accept = prev_b0.clone();
        if let Some(b) = data_iter.next() {
            b0 = start
                .clone()
                .map(|(r0,)| (n..n,))
                .clone()
                .filter(|_| *b == 97u8)
                .map(|(r0,)| (r0.start..n + 1,));
            start = None;
        } else {
            break;
        }
        n = n + 1;
    }
The matcher for "(abc)d" [2] is a more thorough example.
I don't get it? In a DFA there's no way to store "the value of a capture group" because each state may have multiple values for any capture group, depending on how it was reached.
WAIT, I think what you've built is NOT a DFA, but an NDFA! Your code has multiple states active at once, and it just tests them all! Ok, now it all makes sense.
In a DFA, it's common for a state to have multiple values for capture groups. For example in /(a*)a*/, state 1 is "in the first loop, or in the second loop with \1={0,0}" and state 2 is "in the first loop, in the second loop with \1={0,0}, or in the second loop with \1={0,1}". So the second state has multiple simultaneous values for group 1; this is why tagged-DFAs usually need more "tags" than there are capture groups and that's why I was asking.
I never filed the rustc bug report. I suspect that the problem is actually in LLVM.
Byte strings are much easier to use than UTF-8. Also, I only needed to execute regular expressions against ASCII HTTP headers. I think `safe-regex` will get String support eventually, or somebody will write another safe regular expression library.
Safe Rust has a bright future. Eventually, modern OSes will simply refuse to run software built from unsafe code, unless it has attached correctness proofs. This will mostly solve the IT security dumpster fire. To get us there, we need to write more useful things in safe Rust or other memory-safe languages.
https://crates.io/crates/safe-regex