Since Xerox Research Centre Europe's invention of xfst, the term "extended regular expression" is already taken for a specific superclass of expressions that describe regular relations.
So I propose two alternatives, "semi-regular expressions" and "super-regular expressions", for strings that look like regular expressions but use non-regular extensions like backreferences, which push the complexity beyond what NFAs/DFAs can handle.
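(To make that concrete, here's a minimal illustration of my own in Go, whose stdlib regexp deliberately sticks to true regular languages: a backreference pattern like (a*)b\1 describes a^n b a^n, the textbook non-regular language, and an RE2-based engine won't even compile it.)

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// (a*)b\1 would match a^n b a^n, which needs a backreference and
	// cannot be recognized by any DFA/NFA. Go's regexp package
	// (RE2-based, regular languages only) refuses to compile it.
	_, err := regexp.Compile(`(a*)b\1`)
	fmt.Println(err != nil) // true: backreferences aren't supported

	// The same structure without the backreference is perfectly regular.
	re := regexp.MustCompile(`(a*)b(a*)`)
	fmt.Println(re.MatchString("aaabaaa")) // true
}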
I had an interview at Google years ago where I was basically asked to come up with this on my own. In my opinion it's a terrible interview question (per Kernighan's blog, linked in the original post, it took Pike an hour or two alone in his office), and I bombed it. It was either the interview just before or just after lunch, and it stuck with me for the rest of the day, so I couldn't give my full focus to the remaining sessions.
Anyway, after it was done, I found Kernighan's blog post and decided I should try to implement this in Java, since it would let me get some practice with things like JUnit and more experience with object-oriented programming (I had been living in the C world at that time). So I did. But then I found this regular-expressions page (https://www.regular-expressions.info/) and learned many more things about "regular expressions" that I hadn't learned in school (because they aren't regular), and wondered if I could extend Pike's simplicity to them in a relatively simple manner. So I did.
Which is how I created this: https://github.com/sjpotter/regex.
It's not meant to be "performant", but to be educational for me and perhaps others.
Then, when I was learning Go, I rewrote it in Go. I did things that are not "normal" Go patterns, e.g. using panic and recover to make error handling cleaner (in my opinion, if an error path is simply if err != nil { return err } all the way up, panic/recover at the top can be cleaner, though it seems most disagree; a minimal sketch of what I mean is below). But it was educational to see how to rewrite my Java code in an object-oriented style in Go, and to explore the limitations of interfaces and how one might get around them (also perhaps in ways that go against what might be considered proper Go programming).
https://github.com/sjpotter/regex-go
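Roughly the shape I mean, as a minimal sketch (names like matchError and fail are just illustrative, not what the repo actually uses):

package main

import "fmt"

// matchError wraps failures raised deep inside the matcher, so recover()
// can tell them apart from genuine runtime panics.
type matchError struct{ err error }

// fail aborts the current match attempt without threading an error
// return value through every level of recursion.
func fail(format string, args ...any) {
	panic(matchError{fmt.Errorf(format, args...)})
}

// Match is the public entry point: it converts a panic raised by fail
// back into an ordinary error at the API boundary.
func Match(pattern, text string) (matched bool, err error) {
	defer func() {
		if r := recover(); r != nil {
			me, ok := r.(matchError)
			if !ok {
				panic(r) // not ours, re-raise
			}
			err = me.err
		}
	}()
	return matchHere(pattern, text), nil
}

// matchHere stands in for the recursive matcher; the point is only that
// it can report malformed patterns via fail instead of returning errors.
func matchHere(pattern, text string) bool {
	if len(pattern) > 0 && pattern[0] == '*' {
		fail("pattern may not start with '*': %q", pattern)
	}
	return pattern == text // placeholder logic
}

func main() {
	fmt.Println(Match("abc", "abc"))  // true <nil>
	fmt.Println(Match("*abc", "abc")) // false pattern may not start with '*': "*abc"
}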
Though now that I look at that repo, I wonder whether all the work was done elsewhere and I just pushed a single commit there; I might need to go back and look at it.
And this also taught me an elegant way to implement recursion in C, which I shamelessly used in interviews to look way smarter than I actually am.
Here is a simple patch to make the runtime O(len(pattern) * len(text)) instead of exponential. It adds five lines and a new parameter. The idea is to memoize (cache) the results, specifically the failures; there is no need to cache the successes, because all the functions return immediately on success. The key [2]int{len(regexp), len(text)} is enough to identify a state, since both strings are always suffixes of the originals and a suffix is determined by its length.
// Match reports whether regexp matches anywhere in text.
func Match(regexp, text string) bool {
+   seen := make(map[[2]int]bool)
    if regexp != "" && regexp[0] == '^' {
-       return matchHere(regexp[1:], text)
+       return matchHere(regexp[1:], text, seen)
    }
    for {
-       if matchHere(regexp, text) {
+       if matchHere(regexp, text, seen) {
            return true
        }
        if text == "" {
            return false
        }
        text = text[1:]
    }
}

// matchHere reports whether regexp matches at beginning of text.
-func matchHere(regexp, text string) bool {
+func matchHere(regexp, text string, seen map[[2]int]bool) bool {
    switch {
    case regexp == "":
        return true
    case regexp == "$":
        return text == ""
    case len(regexp) >= 2 && regexp[1] == '*':
-       return matchStar(regexp[0], regexp[2:], text)
+       return matchStar(regexp[0], regexp[2:], text, seen)
    case text != "" && (regexp[0] == '.' || regexp[0] == text[0]):
-       return matchHere(regexp[1:], text[1:])
+       return matchHere(regexp[1:], text[1:], seen)
    }
    return false
}

// matchStar reports whether c*regexp matches at beginning of text.
-func matchStar(c byte, regexp, text string) bool {
+func matchStar(c byte, regexp, text string, seen map[[2]int]bool) bool {
    for {
-       if matchHere(regexp, text) {
+       if seen[[2]int{len(regexp), len(text)}] {
+           return false
+       }
+       if matchHere(regexp, text, seen) {
            return true
        }
        if text == "" || (text[0] != c && c != '.') {
            return false
        }
+       seen[[2]int{len(regexp), len(text)}] = true
        text = text[1:]
    }
}
Demo: https://go.dev/play/p/aD9vzXwTHGE
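If you want to see the difference the cache makes, here's a small driver of my own (not part of the patch; it assumes the patched Match above is in the same package) using the kind of stacked-star pattern that blows up the unmemoized matcher:

package main

import (
	"fmt"
	"strings"
	"time"
)

func main() {
	// Stacked stars followed by a literal that never matches force the
	// matcher to retry every way of splitting the run of 'a's between
	// the stars. With memoization, each (pattern suffix, text suffix)
	// pair fails at most once, so this returns almost instantly.
	pattern := strings.Repeat("a*", 10) + "c"
	text := strings.Repeat("a", 30)

	start := time.Now()
	fmt.Println(Match(pattern, text), time.Since(start)) // false, plus the elapsed time
}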
And with compilation: https://github.com/rurban/tiny-regex-c (done).
Is there a concern with these kinds of micro-benchmarks, where you repeatedly do the same small operation (matching the same regex), that your results will be skewed by the branch predictor, CPU caches, etc.?
Btw always love your writing Ben, it’s very engaging.
I slept on it and the solution still isn't clear to me.
As an aside: it'd be nice if the Go unit tests logged an indication of whether or not the ./matchC exec tests ran.
https://github.com/redis/redis/blob/99ebbee2b260b3466672b4fa...