The dataset used here consists of 3003 time series from the M3 competition, run by the International Journal of Forecasting. Almost all of these are sampled at yearly, quarterly or monthly frequency, each with typically 40 to 120 observations ("samples" in Machine Learning lingo), and the task is to forecast a few months/quarters/years out of sample. Most experienced Machine Learners will realize that there is probably limited value in fitting a high-complexity n-layer Deep Learning model to 120 data points to try to predict the next 12. If you have daily or intraday (hourly/minutely/secondly) time series, more complex models might become more worthwhile, but such series are barely represented in the dataset.
To me the most surprising result was just how badly AutoARIMA performed. Seasonal ARIMA was one of the traditional go-to methods for this kind of data.
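For scale, the simple statistical baselines in play here are tiny. A minimal numpy sketch of the seasonal-naive forecast (repeat the last observed season), one of the stubbornly hard-to-beat baselines on M3-style monthly data; the synthetic series and 18-month horizon are illustrative assumptions, not the article's setup:

```python
import numpy as np

def seasonal_naive(y, season_length, horizon):
    """Forecast by repeating the last full season of observations."""
    y = np.asarray(y, dtype=float)
    last_season = y[-season_length:]
    # Tile the last season and cut it down to the requested horizon
    reps = int(np.ceil(horizon / season_length))
    return np.tile(last_season, reps)[:horizon]

# Toy monthly series with annual seasonality, 120 observations
t = np.arange(120)
y = 10 + 5 * np.sin(2 * np.pi * t / 12)
print(seasonal_naive(y, season_length=12, horizon=18))
```

Anything more expensive has to justify itself against a baseline this cheap.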
I know that if you ran the experiment over and over again with different splits you'd get slightly different scores, so I'd like to see some guidance on significance in two senses: ① statistical significance, and ② significance at a business level. Would customers notice the difference? Would it lead to better decisions that move the needle for revenue or other business metrics?
This study is an example where a drastically more expensive algorithm seems to produce a practically insignificant improvement.
This is true for single time series, where we are predicting P(x_{t+1} | x_{0..t}).
DL has advantages when you:
a) have additional context at each time step, or
b) have multiple related time series.
For example, consider Amazon, which predicts demand for all of its products. At each time step, they know about inventory and marketing efforts, and could even model higher-dimensional attributes like the persuasiveness of an item's description with NLP.
It's also true they have items that are highly correlated. Skis, Snowboards, and Ski jackets all likely have similar sales patterns. Leveraging this correlation can increase accuracy, and is especially useful when you have items with limited history.
Including all of that context is hard with a statistical model, and whatever equation a human can come up with to combine them is probably worse than a learned, embedding-based DL model.
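To make the "related series" point concrete, here is a toy sketch of a global model pooled across correlated series, with item identity as an explicit feature; in a real DL model a learned embedding plays that role. The three synthetic series, the lag features, and the Ridge stand-in are all my assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n_items, T = 3, 60
# Three correlated seasonal series (think skis, snowboards, ski jackets)
base = np.sin(2 * np.pi * np.arange(T) / 12)
series = np.stack([base * s + rng.normal(0, 0.1, T) for s in (1.0, 1.2, 0.8)])

# Pooled training set: (lag-1, same-month-last-year, one-hot item id) -> next value
X, y = [], []
for i in range(n_items):
    for t in range(12, T - 1):
        onehot = np.eye(n_items)[i]
        X.append(np.concatenate(([series[i, t], series[i, t - 11]], onehot)))
        y.append(series[i, t + 1])

# One model shared across all items, instead of one model per series
model = Ridge(alpha=1e-3).fit(np.array(X), np.array(y))
print(round(model.score(np.array(X), np.array(y)), 2))
```

An item with only a few months of history still benefits, because the shared weights were fit on its neighbors' data too.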
Statistical models are a great starting point & baseline for most problems, but as you add real-world complexity beyond the generic time-series case, that's less true.
I might not be aware of it, but I wish there were more benchmarks/research on higher complexity problems.
I have been hacking on a personal research project to forecast hurricane tracks using deep learning. Given only track and intensity data at different points in time (every 6 hours) and some simple feature engineering, you will not get results anywhere close to the official NHC forecast, no matter what model you use.
In hindsight, this is a little obvious. Hurricane forecasts depend more on factors other than time itself. A sales forecast can depend on seasonal trends and key events in time, but a hurricane forecast is much more dependent on long-range spatial data, like the state of the atmosphere and ocean, which is very non-trivial to model using just track data.
However, deep learning models and techniques are helpful in this scenario because they allow you to integrate multiple modalities like images, graphs, and volumetric data into one model, which may not be possible with statistical models alone.
1. Deep learning doesn't require thorough understanding of priors or statistical techniques. This opens the door to more programmers in the same way high level languages empower far more people than pure assembly. The tradeoffs are analogous - high human efficiency, loss of compute efficiency.
2. Near-CPU deep learning accelerators are making certain classes of models far easier to run efficiently. For example, an M1 chip can run matrix multiplies (the floating-point primitive DL is composed of) roughly 1000x faster than individual scalar instructions (2 TFLOPS vs 2 GHz). This really changes the game, since we're now able to compare 1000 floating-point multiplications with a single if statement.
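You can feel the same order-of-magnitude effect even without a dedicated accelerator: the sketch below times one matrix multiply dispatched to numpy's vectorized BLAS kernels against the same 2·n³ floating-point operations as an interpreted scalar loop. Exact numbers vary by machine; this only illustrates the "bulk FLOPs are nearly free" point:

```python
import time
import numpy as np

n = 64
a = np.random.rand(n, n)
b = np.random.rand(n, n)
_ = a @ b  # warm-up so one-time BLAS initialization isn't timed

t0 = time.perf_counter()
c = a @ b  # single call into an optimized kernel
vec = time.perf_counter() - t0

t0 = time.perf_counter()
c2 = np.zeros((n, n))
for i in range(n):          # same arithmetic, one scalar op at a time
    for j in range(n):
        s = 0.0
        for k in range(n):
            s += a[i, k] * b[k, j]
        c2[i, j] = s
loop = time.perf_counter() - t0

print(f"vectorized: {vec:.6f}s, loop: {loop:.3f}s, speedup: {loop / vec:.0f}x")
```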
Ok, I fully agree with their foundational premise: Start simple.
But: They've overstated their case a bit. Saying that deep learning will cost $11,000 and need 14 days on this data set is not reasonable. I believe you can find some code that will cost that much. The readme suggests that this is typical of deep learning, which is not true. DL models have enormous variety. You can train a useful, high-performance model on a laptop CPU in a seconds-to-minutes timeframe; examples include multilayer perceptrons for simple classification, a smaller-scale CNN, or a collaborative filtering model.
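As one concrete instance of the seconds-on-a-laptop claim, a minimal sketch using scikit-learn's small MLP on its bundled digits dataset; the architecture and dataset are arbitrary choices for illustration, not a claim about the repo's benchmark:

```python
import time
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)  # 1797 8x8 grayscale digit images
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

t0 = time.perf_counter()
# One hidden layer of 64 units: a genuinely tiny deep-learning model
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
clf.fit(Xtr, ytr)
elapsed = time.perf_counter() - t0

print(f"accuracy={clf.score(Xte, yte):.2f}, train seconds={elapsed:.1f}")
```

Useful accuracy, single-digit seconds, no GPU; "deep learning" covers this just as much as it covers a $11,000 training run.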
While I don't endorse all details of their argument, I do think the culture of applied ML/data science has shifted too far toward default-DL. The truth is that many problems faced by real companies can be solved with simple techniques or pre-trained models.
Another perspective: A DL model is a spacecraft (expensive, sophisticated, powerful). Simple models like logistic regression are bikes and cars (affordable, efficient, less powerful). Using heuristics is like walking. Often your goal is just a few blocks away, in which case it would be inefficient to use a spacecraft.
Similar in vein:
Extremely simple PCA-based defect detection significantly beats a segmentation network that is orders of magnitude more complex.
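A hedged sketch of what PCA-based defect detection can look like: fit a low-dimensional subspace to normal samples, then flag anything whose reconstruction error off that subspace is large. The synthetic data here is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# "Normal" samples live near a 3-dimensional subspace of a 20-dim feature space
W = rng.normal(size=(20, 3))
normal = rng.normal(size=(200, 3)) @ W.T + rng.normal(0, 0.05, (200, 20))
defect = rng.normal(size=(5, 20))  # off-subspace "defective" samples

# PCA via SVD on centered normal data; keep the top 3 principal directions
mu = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mu, full_matrices=False)
P = Vt[:3]

def recon_error(x):
    """Distance from x to its projection onto the normal subspace."""
    z = (x - mu) @ P.T
    return np.linalg.norm((x - mu) - z @ P, axis=-1)

print(recon_error(normal).max(), recon_error(defect).min())
```

Normal samples reconstruct almost perfectly; defects don't, so a single threshold on the error separates them.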
What matters is "which method is better for MY data?", but that's not something an academic can study. You just have to test a few different things.
It was actually very cool because the model was a blend of exponential smoothing and DL.
So, if we have to use the two steps anyway, then the possible advantages of non-linear fitting show some promise.
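A toy version of that two-step hybrid, assuming nothing about the actual model referenced above: simple exponential smoothing tracks the level (the statistical step), then a small neural net fits the level-normalized series (the non-linear step):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t = np.arange(200)
y = (10 + 0.05 * t) * (1 + 0.3 * np.sin(2 * np.pi * t / 12)) + rng.normal(0, 0.2, 200)

# Step 1 (statistical): simple exponential smoothing tracks the local level
alpha, level = 0.3, y[0]
levels = np.empty_like(y)
for i, v in enumerate(y):
    level = alpha * v + (1 - alpha) * level
    levels[i] = level

# Step 2 (learned): a small neural net on the level-normalized series
r = y / levels
X = np.stack([r[i : i + 12] for i in range(len(r) - 12)])
target = r[12:]
nn = MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000, random_state=0)
nn.fit(X, target)
print(round(nn.score(X, target), 2))
```

The normalization hands the net a stationary-ish series, which is exactly the division of labor the hybrid exploits.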
So, to me, a larger concern comes to the top: In my experience in such things, call it statistics, optimization, data analysis, whatever, a huge advantage is bringing to the work some understanding that doesn't come with the data and/or really needs a human. The understanding might be about the real problem or about some mathematical methods.
E.g., once some guys had a problem in optimal allocation of some resources. They had tried simulated annealing, run for days, and quit without knowing much about the quality of the results.
I took the problem as 0-1 integer linear programming, a bit large, 600,000 variables, 40,000 constraints, and in 900 seconds on a slow computer, with Lagrangian relaxation, got a feasible solution guaranteed, from the bounding, to be within 0.025% of optimality. The big advantages were understanding the 0-1 program, seeing a fast way to do the primal-dual iterations, and seeing how to use Lagrangian relaxation. My guess is that it would be tough for some very general machine learning to compete much short of artificial general intelligence.
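For readers unfamiliar with the technique, a toy sketch of Lagrangian relaxation on a 0-1 knapsack (not the commenter's actual problem): relax the capacity constraint with a multiplier, take crude normalized subgradient steps, and measure the gap between the resulting bound and a greedy feasible solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
v = rng.uniform(1, 10, n)  # item values
w = rng.uniform(1, 10, n)  # item weights
C = 0.4 * w.sum()          # knapsack capacity

# Relax the capacity constraint with multiplier lam >= 0:
#   L(lam) = max_x (v - lam*w) @ x + lam*C   is an upper bound on the optimum
lam, best_bound = 0.0, np.inf
for it in range(300):
    x = (v > lam * w).astype(float)  # relaxed problem decomposes per item
    best_bound = min(best_bound, float((v - lam * w) @ x + lam * C))
    g = C - w @ x                    # subgradient of L at lam
    lam = max(0.0, lam - 2.0 / (it + 2) * np.sign(g))  # crude diminishing step

# A greedy value/weight feasible solution gives the other side of the gap
cap, feasible = C, 0.0
for i in np.argsort(-v / w):
    if w[i] <= cap:
        cap, feasible = cap - w[i], feasible + v[i]

gap = (best_bound - feasible) / best_bound
print(f"bound={best_bound:.1f} feasible={feasible:.1f} gap={gap:.2%}")
```

The bound certifies how far the feasible answer can possibly be from optimal; that certificate is exactly what the simulated annealing runs never produced.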
One way to describe the problem with the simulated annealing was that it was just too general, didn't exploit what a human might understand about the real problem and possible solution methods selected for that real problem.
I have a nice collection of such successes where the keys were some insight into the specific problems and some math techniques, that is, some human abilities that would seem to need machine learning to have artificial general intelligence to compete. With lots of data, lots of computing, and the advantages of non-linear operations, at times machine learning might be the best approach even now.
Net, still, in many cases, human intelligence is tough to beat.
There’s definitely a use for these classical, model-based methods. But a contrived comparison claiming they’re king is just misinformation.
E.g., here are a number of issues with classical techniques where DL succeeds (‘they’ here refers to classical techniques):
- they often don’t support missing/corrupt data
- they focus on linear relationships and not complex joint distributions
- they focus on fixed temporal dependence that must be diagnosed and specified a priori
- they take univariate, not multivariate, input data
- they focus on one-step forecasts, not long time horizons
- they’re highly parameterized and rigid to assumptions
- they fail for cold start problems
A more nuanced comparison would do well to mention these.