Regex is a mini-language and my library compiles it to Rust. Writing it was surprisingly difficult. While working on one bug, I had a moment of doubt about my ability to solve it. I've had such experiences only a few times in my life. Luckily, I woke up the next morning with the solution in my mind.
The work was satisfying. I also wrote a safe Rust async library, like a mini-tokio without all the scary bits: https://crates.io/crates/safina . And I started writing a safe Rust TLS 1.3 implementation and a safe Rust HTTP library. Working on a challenging side project can bring you satisfaction, knowledge, and confidence, even if you never finish the project.
I'm not the author, but it looks like he's using the same general approach as other similar libraries: Rust has a fairly advanced macro system that can be used to parse input strings at compile time and output hygienic Rust code. There are safe database libraries for example that check that the SQL query and the supplied parameters match.
safe-regex uses a Rust Procedural Macro [0] to convert expressions like `regex!(br"ab?c+")` into a Rust code block that defines a new type and returns an instance of it. The type implements a `Matcher` interface with a function that takes a byte array, executes the regular expression against it, and returns true/false and the capture groups. Examples of the generated code are in [1].
The generated function stores capture-group info on the stack. It is not recursive. It makes one pass through the input, using an NFA algorithm. It is plain Rust code that starts instantly. This instant startup makes `safe-regex` faster on small inputs than `regex` [3], which must compile the expression at runtime or retrieve it from a cache.
The `regex` library has highly optimized code which implements Aho-Corasick with SIMD instructions to find substrings and then matches the regex forward or backward from each substring match, while also tracking UTF-8 character boundaries. It's amazing. On large inputs, it is 20-500 times faster than `safe-regex` in benchmarks.
I expect that the simpler and shorter code from `safe-regex` puts less pressure on the processor instruction caches, improving performance of the rest of the program and other processes running on the machine. I believe Rust's benchmark library does not measure this.
The Rust compiler optimizes the generated Rust code. This means that `safe-regex` gets faster as the Rust compiler gets better. It also exercises the Rust compiler like few hand-written programs do. A previous version of `safe-regex` compiled expressions like `a?a?a?a?a?` into Rust code that rustc could not compile in reasonable time. I let it compile for an hour once and it didn't finish. I think the rustc optimizer was traversing the potential execution paths with an inefficient algorithm. The workaround was to add extra useless intermediate function calls, so the optimizer would hit some internal limit earlier and give up, allowing compilation to complete. I saved the code that showed this problem [2]. A later refactoring eliminated the problem entirely.
The `regex!(br"ab?c+")` syntax is intuitive for Rust users. An invalid expression fails at compile time with a decent error message. The library does not allocate on the heap. The matcher object occupies a single byte when not matching. This is all very idiomatic. Macros are difficult to fuzz-test because the Rust compiler must run for each test case. Writing tests by hand is tedious, so I wrote a test helper that generates input strings. It uncovered some subtle bugs.
Here's how safe-regex works: every matcher has a variable (b0, b1, etc.) for each byte of the regular expression, representing states in the NFA. We iterate through the input bytes. In each iteration, we get the next input byte `b` and compute each state's transition for consuming that byte.
A matcher with no capturing groups uses the type `Option<()>` to represent each state. Each state is either `Option::Some(())`, meaning "the input matched up to this point", or `Option::None`, meaning "no match". When the loop finishes, if the `accept` state is `Option::Some`, then the input matched. We use `Option::filter` to copy the preceding state only if the current byte matches.
The core of the matcher for "a" [1]:
    let mut start = Some(());
    let mut b0: Option<()> = None;
    let mut data_iter = data.iter();
    loop {
        let prev_b0 = b0.clone();
        if let Some(b) = data_iter.next() {
            b0 = start.clone().filter(|_| *b == 97u8);
            start = None;
        } else {
            return prev_b0;
        }
    }
When matching with capturing groups, we must keep track of every capturing group for every state. A capturing-group match is two integers marking the start and end of the matching portion of the input; we store them in a `Range<usize>` struct. There can be multiple capturing groups, so we store the ranges in a tuple. The state variables become type `Option<(Range<usize>,)>`.
When the input enters a capturing group, we set the `Range` for that group to an empty range at the current position: `.map(|(r0,)| (n..n,))`. Technically, it's the zero-length position between the current byte and the previous byte.
Whenever a byte matches inside a group, we expand the range for that group to include the byte: `.map(|(r0,)| (r0.start..n + 1,))`.
Here's the core of the matcher for "(a)" [1]:
    let mut start = Some((usize::MAX..usize::MAX,));
    let mut b0: Option<(core::ops::Range<usize>,)> = None;
    let mut accept: Option<(core::ops::Range<usize>,)> = None;
    let mut data_iter = data.iter();
    let mut n = 0;
    loop {
        let prev_b0 = b0.clone();
        accept = prev_b0.clone();
        if let Some(b) = data_iter.next() {
            b0 = start
                .clone()
                .map(|(r0,)| (n..n,))
                .clone()
                .filter(|_| *b == 97u8)
                .map(|(r0,)| (r0.start..n + 1,));
            start = None;
        } else {
            break;
        }
        n = n + 1;
    }
The matcher for "(abc)d" [2] is a more thorough example.
I don't get it? In a DFA there's no way to store "the value of a capture group" because each state may have multiple values for any capture group, depending on how it was reached.
WAIT, I think what you've built is NOT a DFA, but an NDFA! Your code has multiple states active at once, and it just tests them all! Ok, now it all makes sense.
In a DFA, it's common for a state to have multiple values for capture groups. For example in /(a*)a*/, state 1 is "in the first loop, or in the second loop with \1={0,0}" and state 2 is "in the first loop, in the second loop with \1={0,0}, or in the second loop with \1={0,1}". So the second state has multiple simultaneous values for group 1; this is why tagged-DFAs usually need more "tags" than there are capture groups and that's why I was asking.
I never filed the rustc bug report. I suspect that the problem is actually in LLVM.
Byte strings are much easier to use than UTF-8. Also, I only needed to execute regular expressions against ASCII HTTP headers. I think `safe-regex` will get String support eventually, or somebody will write another safe regular expression library.
Safe Rust has a bright future. Eventually, modern OSes will simply refuse to run software built from unsafe code, unless it has attached correctness proofs. This will mostly solve the IT security dumpster fire. To get us there, we need to write more useful things in safe Rust or other memory-safe languages.
https://crates.io/crates/safe-regex