83 comments
  • mananaysiempre1d

    Worth mentioning (haven’t checked if the paper talks about this) that while the industry mostly forgot about derivatives and extended REs (i.e. REs with intersection and negation), academia did not. Unfortunately, there have been some pretty discouraging results: the DFA for an extended RE (including a lazy DFA implemented using derivatives, as here) is worst-case doubly exponential in the length of the expression[1], not just exponential as for normal REs. So there is a potential reason not to support intersections in one’s RE matcher, even if they are enticingly easy to implement in terms of derivatives (and even if I personally like to see experimentation in this direction).

    [1] https://www.sciencedirect.com/science/article/pii/S030439751...

    • someplaceguy1d

      > the DFA for an extended RE (including a lazy DFA implemented using derivatives, as here) is worst-case doubly exponential in the length of the expression

      The authors seem to claim linear complexity:

      > the result is RE#, the first general-purpose regex engine to support intersection and complement with linear-time guarantees, and also the overall fastest regex engine on a large set of benchmarks

      • ieviev1d

        We refer to this in the paper as well,

        The standard way to do intersection / complementation of regexes with NFAs requires determinization, which causes a huge blowup, whereas for us this is the cost of a derivative.

        It is true that we cannot avoid enormous DFA sizes, a simple case would be (.*a.*)&(.*b.*)&(.*c.*)&(.*d.*)... which has 2^4 states and every intersection adds +1 to the exponent.

        How we get around this in the real world is that we create at most one state per input character, so even if the full DFA size is 1 million states, you need an input at least 1 million characters long to reach it.

        The real complexity question is how expensive taking a lazy derivative can get. The first time you use the engine with a unique input and states, it is not linear - the worst case is creating a new state for each character. The second time the same (or a similar) input is used, these states already exist and matching is linear. So, as said in the article, it is a bit foggy - lazy DFAs are not linear in the worst case but appear linear for practical cases
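        As a toy illustration of the mechanism described above (this is not RE#'s implementation - the representation and names are invented for the sketch), here is a Brzozowski-derivative matcher in Python where intersection is a single line and the memo table of (state, character) derivatives is exactly the lazily built DFA:

```python
# Regexes as hashable tuples: ('eps',), ('dot',), ('chr', c),
# ('cat', r, s), ('alt', r, s), ('and', r, s), ('star', r).
EMPTY, EPS, DOT = ('empty',), ('eps',), ('dot',)

def nullable(r):
    t = r[0]
    if t in ('eps', 'star'): return True
    if t in ('empty', 'chr', 'dot'): return False
    if t == 'alt': return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])       # cat, and

# Smart constructors: the simplification needed to keep derivatives small.
def cat(r, s):
    if EMPTY in (r, s): return EMPTY
    if r == EPS: return s
    if s == EPS: return r
    return ('cat', r, s)

def alt(r, s):
    if r == EMPTY: return s
    if s == EMPTY or r == s: return r
    return ('alt', r, s)

def both(r, s):                                    # intersection is this easy
    if EMPTY in (r, s): return EMPTY
    return r if r == s else ('and', r, s)

MEMO = {}   # the lazily built "DFA": one transition per (state, char) seen

def deriv(r, c):
    key = (r, c)
    if key in MEMO:
        return MEMO[key]
    t = r[0]
    if t in ('empty', 'eps'): d = EMPTY
    elif t == 'dot': d = EPS
    elif t == 'chr': d = EPS if r[1] == c else EMPTY
    elif t == 'cat':
        d = alt(cat(deriv(r[1], c), r[2]),
                deriv(r[2], c) if nullable(r[1]) else EMPTY)
    elif t == 'alt': d = alt(deriv(r[1], c), deriv(r[2], c))
    elif t == 'and': d = both(deriv(r[1], c), deriv(r[2], c))
    else: d = cat(deriv(r[1], c), r)               # star
    MEMO[key] = d
    return d

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)

DOTSTAR = ('star', DOT)
def contains(c):                                   # .*c.*
    return cat(DOTSTAR, cat(('chr', c), DOTSTAR))

# (.*a.*)&(.*b.*): DFA states only materialize as input characters demand.
AB = both(contains('a'), contains('b'))
```

        Note the smart constructors: without some simplification of this kind, the derivative terms grow without bound and the finite-state argument breaks down.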

        • btown1d

          > The second time the same (or similar) input is used these states are already created and it is linear.

          Does this imply that the DFA for a regex, as an internal cache, is mutable and persisted between inputs? Could this lead to subtle denial-of-service attacks, where inputs are chosen by an attacker to steadily increase the cached complexity - are there eviction techniques to guard against this? And how might this work in a multi-threaded environment?

          • ieviev1d

            Yes, most (i think all) lazy DFA engines have a mutable DFA behind a lock internally that grows during matching.

            Multithreading is generally a non-issue: you just wrap the function that creates the state behind a lock/mutex; this is usually the default.

            The subtle denial-of-service part is interesting; i hadn't thought of it before. Yes, this is possible. For security-critical uses i would compile the full DFA ahead of time - the memory cost may be painful, but this completely removes the chance of anything going wrong.

            There are valid arguments to switch from DFA to NFA with large state spaces, but RE# intentionally does not switch to an NFA and instead capitalizes on reducing the DFA memory costs (e.g. minterm compression in the post, algebraic simplifications in the paper).

            The problem with going from DFA to NFA for large state spaces is that this makes the match time performance fall off a cliff - something like going from 1GB/s to 1KB/s as we also show in the benchmarks in the paper.

            As for eviction techniques i have not researched this, the simplest thing to do is just completely reset the instance and rebuild past a certain size, but likely there is a better way.

            • layer81d

              > Multithreading is generally a non-issue, you just wrap the function that creates the state behind a lock/mutex, this is usually the default.

              But you also have to lock when reading the state, not just when writing/creating it. Wouldn’t that cause lock contention with sufficiently concurrent use?

              • ieviev1d

                No, we do not lock when reading the state; we only lock the creation side, and the transition table reference stays valid during matching even if it is outdated.

                Only when a nonexistent state is encountered during matching does it enter the locked region.

                • layer81d

                  Ah, I see, so it’s basically the Racy Single-Check Idiom.
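                  A minimal sketch of that idiom applied to a lazy DFA cache (illustrative Python with invented names, not RE#'s code): reads of the transition table are unlocked, and only a miss enters the locked region, with a re-check to avoid duplicating work.

```python
import threading

class LazyDfa:
    """Toy lock-on-miss lazy DFA: matching reads the transition table
    without locking; only an unseen (state, char) pair takes the lock."""

    def __init__(self, derive, start):
        self.derive = derive          # (state, char) -> next state
        self.start = start
        self.table = {}               # published transitions, read unlocked
        self.lock = threading.Lock()  # guards state creation only

    def step(self, state, ch):
        nxt = self.table.get((state, ch))       # racy single-check: no lock
        if nxt is None:
            with self.lock:
                nxt = self.table.get((state, ch))  # re-check under the lock
                if nxt is None:
                    nxt = self.derive(state, ch)   # build the missing state
                    self.table[(state, ch)] = nxt  # publish
        return nxt

    def run(self, text):
        state = self.start
        for ch in text:
            state = self.step(state, ch)
        return state

# Example: a 3-state automaton counting 'a's mod 3, built lazily.
dfa = LazyDfa(lambda st, ch: (st + 1) % 3 if ch == 'a' else st, start=0)
```

                  On a second run over similar input, `step` never takes the lock at all, which is the "linear the second time" behaviour described upthread.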

      • mananaysiempre1d

        These claims are compatible. For instance, lex, re2c, ragel, etc. are exponential in the length of the automaton description, but the resulting lexers work in linear time[1] in the length of the string. Here the situation is even better, because the DFA is constructed lazily, at most one state per input character, so to observe its full enormity relative to the size of the needle you need an equally enormous haystack. “One state per input character” somewhat understates things, because producing each state requires a non-constant[2] amount of work, and storing the new derivative’s syntax tree a non-constant amount of space; but with hash-consing and memoization it’s not bad. Either way, derivative-based matching with a lazy DFA takes something like O(needle + f(needle) × haystack) time where I’m guessing f(n) has to be O(n log n) at least (to bring large ORs into normal form by sorting) but in practice is closer to constant. Space consumption is less of an issue because if at any point your cache of derivatives (= DFA) gets too bloated you can flush it and restart from scratch.

        [1] Kind of, unless you hit ambiguities that need to be resolved with the maximal munch rule; anyways that’s irrelevant to a single-RE matcher.

        [2] In particular, introductions to Brzozowski’s approach usually omit—but his original paper does mention—that you need to do some degree of syntax-tree simplification for the derivatives to stay bounded in size (thus finite in number) and the matcher to stay linear in the haystack.

      • 1d
        [deleted]
  • noelwelsh1d

    I love regular expression derivatives. One neat thing about them is that they are continuation-passing style for regular expressions: the derivative is "what to do next" after seeing a character, which is the continuation of the RE. It's a nice conceptual connection if you're into programming language theory.

    Low-key hate the lack of capitalization on the blog, which made me stumble over every sentence start. Great blog post a bit marred by unnecessary divergence from standard written English.

    • spankalee1d

      It's so uncomfortable to read.

      Why do people do this? They capitalize names, so clearly their shift key works. Do they do it to feel special, or like some sort of rebel?

      • trashface1d

        Maybe they drafted it on a phone where capitalization is harder. My guess is the all-lowercase world is mostly people who do most of their text creation on phones and similar, not keyboards.

        • layer81d

          I don’t really see how capitalization is harder on phones, I do it all the time.

          • chii8h

            if you turn on autocorrect, spell check, etc., it will automatically capitalize your sentences. Even when you don't want it to!

      • gfody16h

        some folks prefer the aesthetic of lowercase and find capitalization unnecessary and ugly

    • u_sama1d

      In what way is it different?

      • murkt1d

        Starts of sentences are not capitalized, which makes it a bit harder to read. The English language prescribes capitalization after a period.

    • ieviev1d

      While i completely understand it, the lack of capitalization is just an indication that a human wrote this - it has to be imperfect

      i see enough slop and Look At Me on a daily basis. i don't want it to look like an ad or a LinkedIn post in 2026.

      • voxl6h

        It's your personal style. Researchers have their quirks; don't listen to the industry suits saying dumb shit like "it's unprofessional". You can mask if you're looking for a job at Google in the future, but for now enjoy being yourself and say fuck you to the lazy, socially imposed dogma of this particular community

      • noelwelsh1d

        No one will mistake your posts for LinkedIn slop. You actually have something to say, with coherent arguments presented in paragraphs containing multiple sentences.

        If you want sentences without capitalization to be your thing, then go for it. It's just a weird hill to die on, taking away from the readability of your posts for no real reason.

        • ieviev1d

          In all honesty it's just never bothered me before and i haven't met many people bothered by it either

          It's the same thing with dark mode as default, i chose it because it's my own preference and i'd love it everywhere, but i'm constantly being flashbanged by phone apps because someone decided #FFFFFF is a good background color while the app is loading.

      • UltraSane6h

        Refusing to use standard capitalization just looks unprofessional.

      • layer81d

        I’m sorry, but omitting capitalization is slop as well, just not AI slop. I can’t read text like that for any length of time, it’s just super crappy.

  • balakk1d

    Finally, an article about humans programming some computers. Thank you!

  • agnishom1d

    The author mentions that they found Mamouras et al. (POPL 2024), but not the associated implementation. While the Rust implementation is not public, a Haskell implementation can be found here: https://github.com/Agnishom/lregex

  • sourcegrift1d

    I've had nothing but great experiences with F#. If it weren't associated with Microsoft, it'd be more popular than Haskell

    • raincole1d

      I think if it weren't a 'first class' member of the .NET ecosystem[0], no one would know F#. After all, Haskell and OCaml already exist.

      [0]: my very charitable take, as MS obviously cares about C# much, much more than F#.

      • 1d
        [deleted]
      • pjmlp1d

        Management has always behaved as if they regret having added F# to VS 2010. At least it hasn't yet suffered the same stagnation as VB, and even C++/CLI was updated to C++20 (minus modules).

        In any case, those of us who don't have issues with .NET or Java (also cool to hate these days) get to play with F# and Scala, and feel no need to be amazed by Rust's type system inherited from ML languages.

        It is yet another "Rust but with GC" that pops up in some forum every couple of months.

        • 1d
          [deleted]
      • Copyrightest1d

        [dead]

      • sourcegrift1d

        > But they all just skip the press releases and go straight to the not using it part

        Lol

      • brabel1d

        Did they ever get the full extra person who gives a shit?

    • lynx971d

      You realize that Microsoft Research employed Simon for many many years?

      • 1d
        [deleted]
  • gbacon1d

    That’s beautiful work. Check out other examples in the interactive web app:

    https://ieviev.github.io/resharp-webapp/

    Back in the Usenet days, questions came up all the time about matching substrings that do not contain whatever. It’s technically possible without an explicit NOT operator because regular languages are closed under complement — along with union, intersection, Kleene star, etc. — but a bear to get right by hand for even simple cases.
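    To illustrate how much of a bear it is: even for a tiny case like "strings that do not contain ab", the complement is regular, so a plain regex for it exists (obtainable by complementing the DFA for .*ab.*), but it's already awkward to derive and trust by hand. The pattern below is one such hand-derived complement, checked by brute force:

```python
import itertools
import re

# Hand-derived complement of "contains 'ab'": runs of non-'a' characters,
# where every maximal run of 'a's is followed by something other than 'b'
# (or by the end of the string).
no_ab = re.compile(r"[^a]*(?:a+[^ab][^a]*)*a*")

# Brute-force check over all short strings on the alphabet {a, b, x}.
for n in range(6):
    for tup in itertools.product("abx", repeat=n):
        s = "".join(tup)
        assert (no_ab.fullmatch(s) is not None) == ("ab" not in s)
```

    With an explicit NOT operator this is just ~(.*ab.*), and the engine does the complementation for you.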

    Unbounded lookarounds without performance penalty at search time are an exciting feature too.

  • andriamanitra1d

    This is very interesting. I'm a bit skeptical about the benchmarks / performance claims because they seem almost too good to be true but even just the extended operators alone are a nice improvement over existing regex engines.

    The post mentions they also have a native library implemented in Rust without dependencies but I couldn't find a link to it. Is that available somewhere? I would love to try it out in some of my projects but I don't use .NET so the NuGET package is of no use to me.

    • ieviev1d

      There's currently only a string solver with the same core library, but not a full regex engine https://github.com/ieviev/cav25-resharp-smt

      I will open source the rust engine soon as well, some time this month.

      As for the benchmarks, it's the fastest for large patterns and lookarounds, where leftmost-longest lets you get away with less memory usage so we don't need to transition from DFA to NFA.

      In the GitHub readme benchmarks it's faster than the exponential .NET Compiled implementation, so the 35,000x is a fairly arbitrary multiplier - you can keep adding alternatives and make it 1,000,000x.

      for a small set of string literals it will definitely lose to Hyperscan and Rust regex, since they have a high-effort left-to-right SIMD algorithm that we cannot easily use.

      • burntsushi1d

        > for simple string literals it will definitely lose to Hyperscan and Rust regex since they have a high effort left-to-right SIMD algorithm that we cannot easily use

        I think "simple string literals" undersells it. I think that description works for engines like RE2 or Go's regex engine, but not Hyperscan or Rust regex. (And I would put Hyperscan in another category than even Rust regex.) Granted, it is arguably difficult to be succinct here since it's a heuristic with difficult-to-predict failure points. But something like: "patterns from which a small number of string literals can be extracted."

        • ieviev1d

          yes, that is correct. also Rust's engine matches the full Unicode spec as individual characters, whereas .NET's will sometimes chop emojis in two, so Rust is at a disadvantage here.

          something i've been also wondering is how does Harry (https://ieeexplore.ieee.org/document/10229022) compare to the Teddy algorithm, it's written by some of the same authors - i wonder if it's used in any engines outside of Hyperscan today.

      • feldrim1d

        Would SearchValues<char> help there for a fallback to a SIMD optimized simple string literal search rather than the happy path?

        • ieviev1d

          Yes, that's exactly what we did to be competitive in the benchmarks.

          There's a lot of simple cases where you don't really need a regex engine at all.

          integrating SearchValues as a multi-string prefix search is a bit harder since it doesn't expose which branch matched so we would be taking unnecessary steps.

          Also, the .NET implementation of Hyperscan's Teddy algorithm only goes left to right... if it went right to left, it would make RE# much faster for these cases.

          • feldrim1d

            So, there's still room for significant improvement.

            • ieviev24h

              There is plenty still to do.

              One part of this is SIMD algorithms to better compete with Hyperscan/Rust, another is the decades of optimizations that backtracking engines have for short anchored matches for validation.

              There's analysis to do for specific patterns so we can opt for specialized algorithms, eg. for fixed length patterns we skip the left-to-right pass entirely since we already know the match start + match length.

              Lots of opportunistic things like this which we haven't done. Also there are no statistical optimizations in the engine right now. Most engines will immediately start looking for a 'z' if there is one in the pattern since it is rare.

    • sgc1d

      I second this request. It would be wonderful to be able to test the Rust implementation, since it is easier to call from other languages in my typical setup. I have a couple of use cases I have never fully resolved - I just implemented partial workarounds and accepted a restricted feature set. This would probably allow me to deal with them correctly.

  • masfuerte1d

    This is very impressive.

    > how does RE# find the leftmost-longest match efficiently? remember the bidirectional scanning we mentioned earlier - run the DFA right to left to find all possible match starts, then run a reversed DFA left to right to find the ends. the leftmost start paired with the rightmost end gives you leftmost-longest. two linear DFA scans, no backtracking, no ambiguity.

    I'm pretty sure that should say "the leftmost start paired with the leftmost end".

    This also implies that the algorithm has to scan the entire input to find the first match, and the article goes on to confirm this. So the algorithm is a poor choice if you just want the first match in a very long text. But if you want all matches it is very good.

    • mananaysiempre1d

      > I'm pretty sure that should say "the leftmost start paired with the leftmost end".

      I’m pretty sure it shouldn’t, that would give you the leftmost shortest match instead of leftmost longest.

      • masfuerte1d

        As originally written, doesn't it go from the start of the first match to the end of the last match? I feel like I'm missing something.

        • ieviev1d

          It goes from the start of the first match to the longest "alive" end; in practice it will hit a dead state and return after finding the match end.

          there's an implicit `.*` in front of the first pass but i felt it would've been a long tangent so i didn't want to get into it.

          so given input 'aabbcc' and pattern `b+`,

          first reverse pass (using `.*b+`) marks 'aa|b|bcc'<-

          the forward pass starts from the first match:

          'aa->b|b|cc' marking 2 ends

          then enters a dead state after the first 'c' and returns the longest end: aa|bb|cc

          i hope this explains it better
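          the two passes in this example can be sketched generically (illustrative Python, not RE#'s actual AllEnds algorithm; the DFAs for `b+` and the reversed `.*b+` are written out by hand):

```python
DEAD = 'dead'

def leftmost_longest(rev_dfa, fwd_dfa, s):
    """Two linear DFA scans, no backtracking. Each DFA is a triple
    (transitions, start, accepting); missing transitions are dead."""
    rtrans, rstart, raccept = rev_dfa
    ftrans, fstart, faccept = fwd_dfa
    # Pass 1: right to left with the reversed pattern (implicit .* in
    # front), marking every position where a match could start.
    starts, st = [], rstart
    for i in range(len(s) - 1, -1, -1):
        st = rtrans.get((st, s[i]), DEAD)
        if st in raccept:
            starts.append(i)
    starts.reverse()
    # Pass 2: left to right from each surviving start, keeping the
    # longest accepting end seen before the DFA enters a dead state.
    matches, pos = [], 0
    for i in starts:
        if i < pos:               # start lies inside the previous match
            continue
        st, best = fstart, None
        for j in range(i, len(s)):
            st = ftrans.get((st, s[j]), DEAD)
            if st in faccept:
                best = j + 1
            if st == DEAD:
                break
        if best is not None:
            matches.append((i, best))
            pos = best
    return matches

# b+ forward DFA, and the DFA for the reversed .*b+ (alphabet {a, b, c}):
# a position is a match start exactly when the reverse scan accepts there.
FWD = ({(0, 'b'): 1, (1, 'b'): 1}, 0, {1})
REV = ({(q, c): (1 if c == 'b' else 0) for q in (0, 1) for c in 'abc'}, 0, {1})
```

          on 'aabbcc' this marks starts {2, 3}, takes the leftmost one, and the forward pass keeps extending until the 'c' kills it, yielding the single match 'bb'.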

          • masfuerte1d

            Cheers. I was more confused by how you were doing multiple matches. So I read the paper, which describes the AllEnds algorithm. If I understand correctly, the reverse pass captures all of the match starts and these need to be remembered for the forward pass. Which is what you were showing above, but I didn't follow it.

            So, once it gets going, a traditional engine can produce matches iteratively with no further allocation, but RE# requires allocation proportional to the total number of matches. And in return, it's very much faster and much easier to use (with intersection and complement).

            • ieviev1d

              Yes, exactly correct

              It's also beneficial to merge some of the matching locations into ranges where possible, so when `a*` matches a long sequence of '|a|a|a|a|a|', it can be represented as a range of (0,5), we do this to keep the lookaround internal states smaller in the engine.

        • mananaysiempre1d

          Right, the explanation seems to be a bit oversimplified, but I don’t think it’s difficult to fix it up: you need to collect non-overlapping starts (with an RTL scan) and ends (with an LTR scan) and zip them together. The non-overlapping matches are the last ones you see before you need to reset the matcher (traverse a failing edge). This feels like it should work.

          (I tried to write some pseudocode here but got annoyed dealing with edge cases like zero-length matches at EOF, sorry.)

  • klibertp1d

    If you claim it's the fastest, how does it compare to one-more-re-nightmare?

    - https://github.com/telekons/one-more-re-nightmare

    - https://applied-langua.ge/posts/omrn-compiler.html

    OMRN is a regex compiler that leverages Common Lisp's compiler to produce optimized assembly to match the given regex. It's incredibly fast. It does omit some features to achieve that, though.

    • mananaysiempre1d

      As a potential user (not the author), what jumps out at me about the two is:

      OMRN: No lookaround, eager compilation, can output first match

      RE#: No submatches, lazy compilation, must accumulate all matches

      Both lookaround and submatch extraction are hard problems, but for practical purposes the lack of lazy compilation feels like it would be the most consequential, as it essentially disqualifies the engine from potentially adversarial REs (or I guess not with the state limit, but then it’s questionable if it actually counts as a full RE engine in such an application).

  • sieep1d

    Cool stuff. Reminds me of the content you used to see on here all the time before AI took over

    • andix1d

      I'm sometimes wondering if the AI content here really starts trending organically, or if it is somehow pushed by AI companies.

      Not necessarily with bots, just posting a few links in a company Slack with the request for everyone to upvote it from their personal account could be enough.

  • anentropic1d

    Fantastic stuff!

    FYI some code snippets are unreadable in 'light mode' ("what substrings does the regex (a|ab)+ match in the following input?")

    • ieviev1d

      ah thank you for letting me know, fixed it now!

  • feldrim1d

    You got me at TalTech. Great job and the paper is high quality. I'll have to learn F# but I believe it is worth it.

  • systems1d

    I am a bit worried about the state of F# though.

    Don Syme seems to no longer be acting as the project lead, and I haven't heard of any successor.

    Compared to most actively developed languages, F# looks very stale currently

  • cognisent1d

    One thing I don't understand is what _* means. It seems like the paper refers to .* (which I understand) and _* (which I don't), sometimes in the same context? Normally _* would mean "an underscore zero or more times".

    • shmolyneaux1d

      That's noted further down the page:

      - `_*` = any string

      • cognisent1d

        I guess _ is trying to be like, "No, really, anything," while . has some limitations?

        • viega11h

          .* does NOT match newlines, so it will always stop a match at the end of a line.
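          A quick check in Python's re (most engines behave the same way, modulo flag names):

```python
import re

# By default '.' excludes '\n', so '.*' stops at the end of a line...
assert re.fullmatch(r".*", "a\nb") is None
# ...unless the "dot matches all" flag is set.
assert re.fullmatch(r".*", "a\nb", re.DOTALL) is not None
```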

  • andix1d

    It should be possible to directly compile this library to native code and use it in any other language. Maybe adding some C-style wrappers will be needed.

  • dejongh1d

    Cool article. I wonder why they decided to start sentences with lower case? Free association!?

  • FrustratedMonky1d

    F# is one of the biggest 'what could have beens'. Great language that just didn't hit the right time or reach critical mass in its community.

    • nobleach1d

      I convinced one boss to let me spike out a project with it. I was in love with OCaml at the time. OCaml's docs are... I'm just going to say it, they're terrible. F# on the other hand, has fantastic docs. In the end, the only real gripe I had was the significant whitespace. I'm just not a fan.

    • nbevans1d

      It uses far fewer tokens than C#, so watch this space...

      • delta_p_delta_x1d

        ML-family languages (and frankly, all natively functional languages) are incredibly terse and information-dense, second only to stuff like APL. And yet, when written idiomatically and with good object and type naming, they are surprisingly readable and writable.

        'Twas a bad idea to train LLMs on the corpus of leaky, verbose C and C++ first instead of on these strict, strongly-typed, highly structural languages.

      • nodra1d

        Care to explain? Pattern matching, type inference, etc.?

  • meindnoch1d

    @burntsushi is that true?

    • keybored1d

      Tentative doubt until/if he confirms. (But specifically in F#, though? Edit: no, comparisons to Rust etc. are made in TFA.)

  • sourcegrift1d

    [flagged]

  • nanoxide1d

    Calling this RE# (resharp) when there is a much more popular and established product already named R# (ReSharper, by JetBrains) in the .NET world will probably hurt your SEO and/or could potentially cause some legal grief.