The big claim turns out to be a little overstated. The claim:

> AlphaTensor’s algorithm improves on Strassen’s two-level algorithm for the first time, to our knowledge, since its discovery 50 years ago.

reduces to:

> AlphaTensor discovers algorithms that outperform the Strassen-square algorithm, which is a fast algorithm for large square matrices31,32. Although the discovered algorithm has the same theoretical complexity as Strassen-square, it outperforms it in practice, as it is optimized for the considered hardware. Interestingly, AlphaTensor finds algorithms with a larger number of additions compared with Strassen-square (or equivalently, denser decompositions), but the discovered algorithms generate individual operations that can be efficiently fused by the specific XLA33 grouping procedure and thus are more tailored towards the compiler stack we use. The algorithms found by AlphaTensor also provide gains on matrix sizes larger than what they were optimized for. Finally, Fig. 5c shows the importance of tailoring to particular hardware, as algorithms optimized for one hardware do not perform as well on other hardware.

So they improved on Strassen’s 50-year-old algorithm by optimizing it for hardware that has only existed for a few years and de-optimizing it for human use? There is so much cool stuff here without claiming to do something you really didn’t.

The paper is a great read -- an interesting approach, fun theoretical comparisons, plus benchmarks for optimized algorithms for specific hardware (CPU and GPU, allowing for differing instruction sets).

But it doesn't stop there; from the "Discussion" section:

> One important strength of AlphaTensor is its flexibility to support complex stochastic and non-differentiable rewards (from the tensor rank to practical efficiency on specific hardware), in addition to finding algorithms for custom operations in a wide variety of spaces (such as finite fields). We believe this will spur applications of AlphaTensor towards designing algorithms that optimize metrics that we did not consider here, such as numerical stability or energy usage.

I have a gut feeling that there is a faster way to compute logarithms going from least to most significant bit. How would I go about using ML to find it?

[Edit] I think Feynman's algorithm might do it:

"Consider the problem of finding the logarithm of a fractional number between 1 and 2. (The algorithm can be generalized without too much difficulty.) Feynman observed that any such number can be uniquely represented as a product of numbers of the form 1 + 2^(-k), where k is an integer. Testing for the presence of each of these factors in a binary representation is simply a matter of a shift and a subtraction. Once the factors are determined, the logarithm can be computed by adding together the precomputed logarithms of the factors. The algorithm fit the Connection Machine especially well because the small table of the logarithms of 1 + 2^(-k) could be shared by all the processors. The entire computation took less time than doing a division."

Quoting myself on twitter:

I'm quite suspicious about their hardware benchmarks. They're not writing custom kernels, they're relying on a graph compiler like XLA to automatically fuse their decomposed matmuls (and my guess is that XLA will not be very good at this).

Moreover, as far as I can tell, they don't report absolute performance numbers anywhere. In other words, I suspect that a naive N^3 matrix multiplication would absolutely smoke them in performance.

Now, were these algorithms discovered, or invented?

I.e., have they always been there, just sitting in Platonic space waiting for a conscious mind to stumble across them, or have they just now popped into existence?

> Leveraging this diversity, we adapted AlphaTensor to specifically find algorithms that are fast on a given hardware, such as Nvidia V100 GPU, and Google TPU v2. These algorithms multiply large matrices 10-20% faster than the commonly used algorithms on the same hardware, which showcases AlphaTensor’s flexibility in optimising arbitrary objectives.

10-20% performance improvement in matrix multiplications is pretty amazing[0]!


this is cool. i suppose it's only a matter of time until we see optimizing compilers that use transformer-rl style searches for subsets of their optimization and codegen.

> ... AlphaTensor finds an algorithm for multiplying 4×4 matrices using 47 multiplications in Z_2 , thereby outperforming Strassen’s two-level algorithm, which involves 7^2 = 49 multiplications. By applying this algorithm recursively, one obtains a practical matrix multiplication algorithm in Z_2 with complexity O(N^2.778).

> Moreover, AlphaTensor discovers efficient algorithms for multiplying matrices in standard arithmetic; for example, AlphaTensor finds a rank-76 decomposition of T_{4,5,5}, improving over the previous state-of-the-art complexity of 80 multiplications.

Can someone explain the approach here?

I understand that they transform the problem into a gamified 3-D matrix decomposition, but what exactly is the motivation for using RL to beat this decomposition game ? Why not just use, for example, an evolution algorithm to find incrementally better decompositions ?

There's commentary here which seems to be talking about matrix multiplication in general and ignoring real implementations, which are dominated by considerations of the memory hierarchy. They actually introduce operations, like packing for sufficiently large dimensions but not for smaller, and prefetch. Some of that seems only to be tunable empirically for the micro-architecture too. (I don't know how the considerations vary between CPU and GPU.) Fairly recent work on Strassen implementation, talks about that and the normal Goto-type algorithm:
Always caution someone who wants to follow in this research:

When somebody promotes a fast algorithm there is often a catch. In this case the issue is numerical stability. There is a theorem that states that any algorithm for n-by-n matrix-matrix multiplication that is componentwise forward stable (as good as it gets in this situation) much necessarily use n^3 scalar multiplications. The authors will therefore waste their time if they carry out their plans and try to optimize for stability. The standard algorithm has the nice property and no faster algorithm can have this property. The question of fast matrix multiplication was raised recently on, see and the answers given there.

In general to speed up matrix operations how much is some sort of result caching applied?

For example if you profile calculations done when training a model, is there any significant repetition happening that would allow some kind of benefit from table lookups of certain solutions?

Machine learning for finding matrix multiplication algorithms is not so new. The problem was solved before with other methods. For example:

Can someone knowledgeable tell me if it discovered an algorithm we can understand and re-implement ourself, like is the pseudo-code for it known? Or is it kind of stuck in the infered function of the ML model?
Why the heck is a math paper in Nature? I'd put Nature somewhere just below vixra in terms of math credibility.
It's begun. AI is improving itself.
The new 4x4 matrix multiplication over F_2 has practical applications as many matrix operations over F_2 can be reduced to matrix multiplication.

For anyone looking for the algorithm itself, it is actually given in in one of the extended data sections at the very end of the paper.

Reinforcement learning is usually introduced as a technique to train an agent in simulation that would then be let loose in a real environment. Interesting that they choose to treat it is a search strategy.
Never heard of the "matrix multiplication tensor" before. Is there some digestible blog post or article somewhere explaining what it is and why it is useful?
Wasn't this based on some other paper? I have read something like that a year+ (?) ago also about deep learning based matrix multiplication boost.
maybe they can use this thing to figure out universal basic income before we all become homeless
50% fewer operations for skewsymmetric matrix vector product is pretty big, IMO.
Will this be efficient for sparse matrices ?
This is completely besides the matter, but reading "provably" in the abstract is a frank reminder of how terrible English spelling/pronounciation is. I can't imagine I'm the only well-read native English speaker who read this as "prov-ably" on first take. I don't know about most languages, but you just don't get nonsense like this in French, at least.