antirez
Apparently I reinvented almost the same code several years after Pike:

https://github.com/redis/redis/blob/99ebbee2b260b3466672b4fa...

jll29
I like the article that this post refers to about efficient implementation (https://swtch.com/~rsc/regexp/regexp1.html). It mentions a debate between J. Friedl and comp. sci. purists about whether NFAs and regular expressions should really be called what they are in the face of backreferences and other notation. Friedl suggests another name should be used, because de facto "NFAs" are no longer equivalent to DFAs.

Since Xerox Research Europe's invention of xfst, the term "extended regular expression" is already taken for a specific superclass of expressions that describe regular relations.

So I propose two alternatives, semi-regular expressions and super-regular expressions, for strings that look like regular expressions but that use non-regular extensions like backreferences that push the complexity beyond what can be done with NDAs/DFAs.

compsciphd
So fun story (to me in retrospect).

I had an interview at google years ago where I basically I was asked to come up with this on my own. In my opinion, a terrible interview Q (per Kernighan's blog, linked in original post, it took pike an hour or two alone in his office), and I bombed it. It was either my just before or just after lunch interview and it stuck with me for the rest of the day and I couldn't give my entire focus to the rest of the sessions.

anyways, after it was done, I found kernighan's blog post and decided I should try to implement this in java as it will allow me to even get some practice with things like junit and more experience with object oriented programming as I had been living in the C world at that time). so I did, but I then found this regular expression page ( https://www.regular-expressions.info/) and learned many more things about "regular expressions" that i hadn't learned in school (because they aren't regular) and wondered if I could extend pike's simplicity to them in a relatively simple manner. So I did.

which I created this https://github.com/sjpotter/regex.

Not meant to be "performant" but meant to be educational to me and perhaps others.

Then when I was learning go, I rewrote it in Go. I did things that are not "normal" go patterns (i.e. using panic and recover to make error handling cleaner (in my opinion, if an error path is simply if (err) { return err } all the way up, I personally think panics/recover at the top can be cleaner, but it seems most disagree), but it was educational on how to rewrite my java code to an object oriented style in Go and to explore the limitations of interfaces and how one might get around them (also perhaps in ways that go against what might be considered proper go programming)

https://github.com/sjpotter/regex-go

though now that I look at that repo, wondering if all the work was done elsewhere and then I just pushed a single commit there, might need to go back and look at it

mkovach
I love this bit of code. It saved my butt in '99 when I had to implement searches for potential "Y2K" coding issues. Some where, collecting dust in CVS repository, probably saved on tape is some storage bunker, is a modified version of this that searched millions of lines of code for various potential "double digit year" issues.

And this also taught me an elegant way to implement recursion in C, that I shamelessly used in interviews to look way smarter than I actually am.

hddqsb
> horrible run times on craftily-constructed regexes

Here is a simple patch to make the runtime O(len(pattern) * len(text)) instead of exponential. It adds 5 lines and a new parameter. The idea is to memoize (cache) the results, specifically the failures (there is no need to cache the successes, because all the functions return immediately on success).

     // Match reports whether regexp matches anywhere in text.
     func Match(regexp, text string) bool {
    +    seen := make(map[[2]int]bool)
         if regexp != "" && regexp[0] == '^' {
    -        return matchHere(regexp[1:], text)
    +        return matchHere(regexp[1:], text, seen)
         }
         for {
    -        if matchHere(regexp, text) {
    +        if matchHere(regexp, text, seen) {
                 return true
             }
             if text == "" {
                 return false
             }
             text = text[1:]
         }
     }
     
     // matchHere reports whether regexp matches at beginning of text.
    -func matchHere(regexp, text string) bool {
    +func matchHere(regexp, text string, seen map[[2]int]bool) bool {
         switch {
         case regexp == "":
             return true
         case regexp == "$":
             return text == ""
         case len(regexp) >= 2 && regexp[1] == '*':
    -        return matchStar(regexp[0], regexp[2:], text)
    +        return matchStar(regexp[0], regexp[2:], text, seen)
         case text != "" && (regexp[0] == '.' || regexp[0] == text[0]):
    -        return matchHere(regexp[1:], text[1:])
    +        return matchHere(regexp[1:], text[1:], seen)
         }
         return false
     }
     
     // matchStar reports whether c*regexp matches at beginning of text.
    -func matchStar(c byte, regexp, text string) bool {
    +func matchStar(c byte, regexp, text string, seen map[[2]int]bool) bool {
         for {
    -        if matchHere(regexp, text) {
    +        if seen[[2]int{len(regexp), len(text)}] {
    +            return false
    +        }
    +        if matchHere(regexp, text, seen) {
                 return true
             }
             if text == "" || (text[0] != c && c != '.') {
                 return false
             }
    +        seen[[2]int{len(regexp), len(text)}] = true
             text = text[1:]
         }
     }

Demo: https://go.dev/play/p/aD9vzXwTHGE
rurban
I also revived my old dynamic string matchers recently, which don't need any costly compilation step: https://github.com/rurban/tiny-matcher (in-work)

And with compilation https://github.com/rurban/tiny-regex-c (done)

returningfory2
> I ran benchmarks [...] matching the regex Ben.*H over 100 concatenated repeats of the King James Bible.

Is there a concern with these kinds of micro-benchmarks, where you repeatedly do the same small operation (matching the same regex), that your results will be skewed by the branch predictor, CPU caches, etc.?

commandersaki
Excellent post. I remember reading the exegesis years ago but completely forgot about it. Really impressed how well the Go version turned out.

Btw always love your writing Ben, it’s very engaging.

tgv
It's a left-shortest matching back-tracker. Not to be used in practice. I'm sure this implementation features in various text books.
djbusby
Ok. This was educational and fun. Thanks!
adrianmsmith
Good to see the term "regexp" (with a "p") in use. I always used to call it that, then at some point realized everyone was writing "regex", and wondered if I'd spelled the abbreviation wrong the whole time. I guess it's just a function of time, it used to be spelled with a "p" and now mostly isn't.
nsonha
I don't see anything "beautiful" here, it works and is performant but seems pretty average in term of readability
metadat
I wonder what the non-recursive redesign would look like, without using the simulate-recursion-via-callstack translation technique.

I slept on it and the solution still isn't clear to me.

As an aside: It'd be nice if the go unit tests logged an indication of whether or not the ./matchC exec tests ran.

unixbane
okay now post the standard academic version of elegant simple regex in algol, lisp, ml whatever and see how many upvotes it gets
user5678
kubb
It's a a nicely written function... but it has the difficulty of a warmup interview question or a CS101 homework.
sr.ht