17 cycles still seems pretty slow compared to the methods explored in this article, which got the result down to 0.75ns (~3 cycles?) (for integer strings of a known length):

Most of the overhead is found in dealing with variable length input, minus signs, and overflow checking.

This has been floating around in my random stuff directory for ages, while I think it's technically somewhere in the repo of a former employer if that recollection is correct it was copied without attribution from some random website so, fair game. I've always found it useful for thinking about this family of algorithm:

Does this handle the case where you try to parse a short string which is abutting the end of a page, and the next page is unmapped? The problem is you over-read and crash when you try to read the unmapped page. Looking at the code it doesn't seem to take account of that case. (On the other hand, maybe the AVX instructions don't do an actual read if the field is masked? In which case it would be fine.)

We had a similar case with some AVX-optimized string operation in glibc.

If you're parsing lots of numbers and they are often of similar length, then could you go for GPU-style SIMT with AVX-512 and parse 64 numbers at once?
Considered as a naieve problem, inescapably you want to perform a series of multiply and addition phases. For short strings, "just do it" might be slower than "look it up in a table". For longer strings, divide-and-conquer would be to chop it into short strings and do the table lookup.

I don't know this is faster than simply doing the "mutiply by ten and add" sequence. I'd have to understand modern ALU behaviour, parallelism, shortcuts, chip cache dynamics you-name-it.

It would all have been so much simpler if we'd made decimal be hex instead. Octal doesn't work as well for me, no idea why. That said, it's much easier to ignore thumbs and count on 4 fingers of each hand than grow another 6

(I read the article. It descended into arcanum of instruction sets and assumptions which post date my DEC-10 ISA lessons, although I recall Digital had BCD instruction handling and a 6 bit byte model in a 36 bit word accordingly. DEC-10 instructions could take many, many clock cycles and have 5 components of from, to, via, because, maybe attached to them)

> I am going to assume that you know the beginning and the end of sequence of digits.

First, that's a strong assumption. But...

> We check whether some value exceeds 9, in which case we had a non-digit character.

Huh? You already know where the end of the digits is. At least, I thought you did? So why would this be necessary?

Do library atoi() actually do SIMD?
Readers can try their hand at a similar problem using AVX2 here:
I know this was done mostly for the sake of it, but is there a reasonable scenario where parsing integers be the bottleneck with the current method?