The conclusion that a low-complexity statistical ensemble is almost as good as a (computationally) complex Deep Learning model should not come as a surprise, given the data.

The dataset[1] used here consists of 3,003 time series from the M3 competition run by the International Journal of Forecasting. Almost all of these are sampled at a yearly, quarterly, or monthly frequency, each with typically 40 to 120 observations ("samples" in Machine Learning lingo), and the task is to forecast a few months/quarters/years out of sample. Most experienced Machine Learners will realize that there is probably limited value in fitting a high-complexity n-layer Deep Learning model to 120 data points to try to predict the next 12. If you have daily or intraday (hourly/minutely/secondly) time series, more complex models might become more worthwhile, but such series are barely represented in the dataset.

To me the most surprising result was just how badly AutoARIMA performed. Seasonal ARIMA was one of the traditional go-to methods for this kind of data.

[1] https://forecasters.org/resources/time-series-data/m3-compet...

Something that bothers me about the ML literature is that papers frequently present a large number of evaluation results, such as precision and AUC, but these are not qualified by error bars. Typically there is a table with different algorithms on one side and different problems on the other, and the highest score for a given problem gets bolded.

I know that if you did the experiment over and over again with different splits you'd get slightly different scores, so I'd like to see some guidance as to significance in terms of ① statistical significance, and ② significance on a business level. Would customers notice the difference? Would it lead to better decisions that move the needle on revenue or other business metrics?
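The statistical half of that question is cheap to answer; here's a minimal sketch (made-up accuracy numbers, plain Python) that bootstraps an error bar for a single score-table entry:

```python
import random

def bootstrap_ci(per_example_correct, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for a mean metric (e.g. accuracy)."""
    rng = random.Random(seed)
    n = len(per_example_correct)
    means = []
    for _ in range(n_boot):
        # Resample the test set with replacement and recompute the mean.
        sample = [per_example_correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(per_example_correct) / n, (lo, hi)

# Hypothetical: model A is right on 870/1000 test examples, model B on 882/1000.
a = [1] * 870 + [0] * 130
b = [1] * 882 + [0] * 118
acc_a, ci_a = bootstrap_ci(a)
acc_b, ci_b = bootstrap_ci(b)
# If the intervals overlap heavily, bolding B's score is not very meaningful.
print(acc_a, ci_a)
print(acc_b, ci_b)
```

This only covers test-set resampling noise; variation across train/test splits would widen the intervals further.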

This study is an example where a drastically more expensive algorithm seems to produce a practically insignificant improvement.

I've done some work in this area and have indeed found that simpler statistical models often outperform ML/DL.

This is true for a single time series, where we are predicting P(x_{t+1} | x_{0..t}).
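To make that single-series baseline concrete, here's a sketch of about the simplest statistical model in this family, simple exponential smoothing, on made-up data:

```python
def ses_forecast(series, alpha=0.3):
    """Simple exponential smoothing: one-step-ahead point forecast.

    level_t = alpha * x_t + (1 - alpha) * level_{t-1}
    """
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level  # the forecast for x_{t+1}

# Hypothetical monthly series with a level shift partway through.
series = [10, 12, 11, 13, 30, 32, 31, 33]
print(ses_forecast(series))
```

A few lines like this, with alpha tuned per series, is the kind of baseline that DL has to beat on short univariate series.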

DL has advantages when you

a) have additional context at each time step, or

b) have multiple related time series.

For example, consider Amazon, which predicts demand for all of its products. At each time step, they know about inventory and marketing efforts, and could even model higher-dimensional attributes like the persuasiveness of an item's description with NLP.

It's also true they have items that are highly correlated. Skis, Snowboards, and Ski jackets all likely have similar sales patterns. Leveraging this correlation can increase accuracy, and is especially useful when you have items with limited history.

Including all of that context is hard with a statistical model, and whatever equation a human can come up with to combine them is probably worse than a learned, embedding-based DL model.

Statistical models are a great starting point & baseline for most problems, but as you add real-world complexity beyond the general-case time series, that's less true.

I might not be aware of it, but I wish there were more benchmarks/research on higher complexity problems.

Timeseries data can sometimes be deceptive, depending on what you are trying to model.

I have been hacking on a personal research project on hurricane track forecasting using deep learning. Given only track and intensity data at different points in time (every 6 hours) and some simple feature engineering, you will not get results anywhere close to the official NHC forecast, no matter what model you use.

In hindsight, this is a little obvious. Hurricane forecasting models depend more on other factors than on time itself. A sales forecast can depend on seasonal trends and key events in time, but a hurricane forecast is much more dependent on long-range spatial data, like the state of the atmosphere and ocean, which is very non-trivial to model using just track data.

However, deep learning models and techniques are helpful in this scenario because they allow you to integrate multiple modalities, like images, graphs, and volumetric data, into one model, which may not be possible with statistical models alone.

I'm heavily involved in this area of research (getting deep learning competitive with computationally efficient statistical methods), and I'd like to note a couple of things I've found:

1. Deep learning doesn't require a thorough understanding of priors or statistical techniques. This opens the door to more programmers, in the same way high-level languages empower far more people than pure assembly. The tradeoffs are analogous: high human efficiency, lower compute efficiency.

2. Near-CPU deep learning accelerators are making certain classes of models far easier to run efficiently. For example, an M1 chip can run matrix multiplies (DL primitive composed of floating point operations) 1000x faster than individual instructions (2TFlops vs 2GHz). This really changes the game, since we're now able to compare 1000 floating point multiplications with a single if statement.

This readme lands to me like this: "People say deep learning killed stats, but that's not true; in fact, DL can be a huge mistake."

Ok, I fully agree with their foundational premise: Start simple.

But: They've overstated their case a bit. Saying that deep learning will cost $11,000 and need 14 days on this data set is not reasonable. I believe you can find some code that will cost that much. The readme suggests that this is typical of deep learning, which is not true. DL models have enormous variety. You can train a useful, high-performance model on a laptop CPU in a seconds-to-minutes timeframe; examples include multilayer perceptrons for simple classification, a smaller-scale CNN, or a collaborative filtering model.
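As an existence proof for the seconds-on-a-laptop claim, here's a toy sketch (plain NumPy, a hypothetical XOR task, nothing to do with the repo's models): a tiny multilayer perceptron trained by full-batch gradient descent, which finishes in well under a second on a laptop CPU.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Tiny MLP: 2 -> 8 -> 1 with sigmoid activations.
W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):  # 5000 full-batch steps, still sub-second on a CPU
    h = sig(X @ W1 + b1)
    out = sig(h @ W2 + b2)
    # Backprop for squared-error loss.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= h.T @ d_out; b2 -= d_out.sum(axis=0)
    W1 -= X.T @ d_h;  b1 -= d_h.sum(axis=0)

preds = (out > 0.5).astype(int).ravel()
print(preds)
```

Obviously XOR is not a real workload, but the point stands: "deep learning" spans everything from this to a 14-day training run.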

While I don't endorse all details of their argument, I do think the culture of applied ML/data science has shifted too far toward default-DL. The truth is that many problems faced by real companies can be solved with simple techniques or pre-trained models.

Another perspective: A DL model is a spacecraft (expensive, sophisticated, powerful). Simple models like logistic regression are bikes and cars (affordable, efficient, less powerful). Using heuristics is like walking. Often your goal is just a few blocks away, in which case it would be inefficient to use a spacecraft.

Nice article and an interesting comparison. Yet, I have a minor issue with the title: deep learning models are also statistical methods ... "univariate models vs. " would be a better title.
I wish we could start moving to better approaches for evaluating time series forecasts. Ideally, the forecaster reports a probability distribution over time series, then we evaluate the predictive density with regard to an error function that is optimal for the intended application of the forecast at hand.
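One concrete option along those lines: have the forecaster report quantiles and score them with the pinball (quantile) loss, whose asymmetry can encode what matters for the application. A minimal sketch with made-up numbers:

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss: minimized in expectation when
    y_pred is the q-quantile of the predictive distribution."""
    diff = y_true - y_pred
    return q * diff if diff >= 0 else (q - 1) * diff

# Hypothetical: a forecaster reports the 0.9 quantile of next month's demand.
# Under-forecasting (a stockout) is penalized 9x more than over-forecasting.
print(pinball_loss(100, 80, 0.9))   # under-forecast by 20, loss = 0.9 * 20
print(pinball_loss(100, 120, 0.9))  # over-forecast by 20, loss = 0.1 * 20
```

Averaging this over many quantile levels approximates scoring the whole predictive density (CRPS), rather than just a point forecast's MAPE.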
This isn’t surprising for those of us who grew up with “Elements of Statistical Learning” (book).

Similar in vein:

Extremely simple PCA-based defect detection significantly beats an orders-of-magnitude more complex segmentation network.
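For anyone curious what "extremely simple PCA-based defect detection" can look like, here's a hedged sketch (synthetic data, not the referenced work's code): fit principal components on normal samples, then flag anything whose reconstruction error is large.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "normal" data living near a 2-D subspace of 10-D space.
basis = rng.normal(size=(2, 10))
normal = rng.normal(size=(200, 2)) @ basis + 0.05 * rng.normal(size=(200, 10))

# Fit PCA via SVD on the centered normal data; keep k components.
mean = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mean, full_matrices=False)
k = 2
components = Vt[:k]

def recon_error(x):
    """Distance between x and its projection onto the principal subspace."""
    centered = x - mean
    proj = centered @ components.T @ components
    return float(np.linalg.norm(centered - proj))

ok = normal[0]
defect = ok + 3.0 * rng.normal(size=10)  # perturbation pushes it off-subspace
print(recon_error(ok), recon_error(defect))
```

Thresholding the reconstruction error gives a detector with no training loop, no GPU, and a handful of lines of linear algebra.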

Comparison of several Deep Learning models and ensembles to classical statistical models for the 3,003 series of the M3 competition.
What is the point of this kind of comparison? It is completely dependent on the 3,003 datasets they chose to use. You're not going to find that one method is better than another in general, or find some type of time series for which you can make a specific methodological recommendation (unless that series is specifically constructed with a mathematical feature, like stationarity).

What matters is "which method is better for MY data?", but that's not something an academic can study. You just have to test a few different things.

What deep learning could instead be used for in this case is to incorporate more data, like text describing events that affect macroeconomics when doing macroeconomic predictions.
Hmmm. Not sure why they used M3 data when there is already M4, where a deep learning model won. I know because I reimplemented it as a toy version in Python here: https://github.com/leanderloew/ES-RNN-Pytorch

It was actually very cool because the model was a blend of exponential smoothing and DL.

I can have some interest in, and hope for, machine learning. One reason is that, for the curve-fitting methods of classic statistics, i.e., versions of regression, the math assumptions that give some hope of good results are essentially impossible to verify and look like they will hold closely only rarely. So, even when using such statistics, good advice is to have two steps: (1) apply the statistics, i.e., fit, using half the data, and then (2) verify, test, and check using the other half. But, gee, those two steps are also common in machine learning. So, if we can't find much in classic math theorems and proofs to support machine learning, we are just put back into the two steps statistics has had to use anyway.
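Those two steps are short enough to write down; here's a minimal sketch with made-up data (ordinary least squares, fit on one half, verify on the other):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: y = 2*x + 1 plus noise with standard deviation 0.5.
x = rng.uniform(0, 10, 200)
y = 2 * x + 1 + rng.normal(0, 0.5, 200)

# Step (1): fit on half the data.
X = np.column_stack([x, np.ones_like(x)])
train, test = slice(0, 100), slice(100, 200)
coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# Step (2): verify, test, check on the held-out half.
resid = y[test] - X[test] @ coef
rmse = float(np.sqrt((resid ** 2).mean()))
print(coef, rmse)  # held-out RMSE should land near the noise level
```

The procedure is identical whether the fitted object is a two-parameter line or a deep net; only the cost of step (1) changes.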

So, if we have to use the two steps anyway, then the possible advantages of non-linear fitting have some promise.

So, to me, a larger concern comes to the top: In my experience in such things, call it statistics, optimization, data analysis, whatever, a huge advantage is bringing to the work some understanding that doesn't come with the data and/or really needs a human. The understanding might be about the real problem or about some mathematical methods.

E.g., once some guys had a problem in optimal allocation of some resources. They had tried simulated annealing, run for days, and quit without knowing much about the quality of the results.

I took the problem as 0-1 integer linear programming, a bit large, 600,000 variables, 40,000 constraints, and in 900 seconds on a slow computer, with Lagrangian relaxation, got a feasible solution guaranteed, from the bounding, to be within 0.025% of optimality. The big advantages were understanding the 0-1 program, seeing a fast way to do the primal-dual iterations, and seeing how to use Lagrangian relaxation. My guess is that it would be tough for some very general machine learning to compete much short of artificial general intelligence.

One way to describe the problem with the simulated annealing was that it was just too general, didn't exploit what a human might understand about the real problem and possible solution methods selected for that real problem.

I have a nice collection of such successes where the keys were some insight into the specific problems and some math techniques, that is, some human abilities that would seem to need machine learning to have artificial general intelligence to compete. With lots of data, lots of computing, and the advantages of non-linear operations, at times machine learning might be the best approach even now.

Net, still, in many cases, human intelligence is tough to beat.

Seems like these guys just wasted $11k to erroneously claim, “deep learning bad! Simple is better!”

There’s definitely use for these classical, model-based methods, for sure. But a contrived comparison claiming they’re king is just misinformation.

Eg, here are a number of issues with classical techniques where dl succeeds (‘they’ here refers to classical techniques):

- they often don’t support missing/corrupt data

- they focus on linear relationships and not complex joint distributions

- they focus on fixed temporal dependence that must be diagnosed and specified a priori

- they take as input univariate, not multiple interval, data

- they focus on one-step forecasts, not long time horizons

- they’re highly parameterized and rigid to assumptions

- they fail for cold start problems

A more nuanced comparison would do well to mention these.

why are middle-ground (but SOTA) techniques like guassian processes and GBM regression not in this comparo