Foundation Models for Time Series
Generative AI, and in particular foundation models for language and vision (LLMs, large vision models, and so on), has made an enormous contribution to NLP and Computer Vision over the last few years. This progress comes from our ability to scale these models, together with a dependable relationship between scale and performance: simplifying somewhat, as training data and model size increase, error rates on a range of tasks decrease. Moreover, researchers have succeeded in combining different modalities, such as text and images, within the same foundation model.
The technology underlying foundation models, the Transformer architecture, has been shown to apply successfully to machine learning tasks across a number of modalities beyond language and vision. This raises the question: do similar neural scaling laws hold for Transformers trained on big data in other modalities, and will those models exhibit similar "emergent behaviors" such as zero-shot prediction?
Time series data is widely available in business, finance, engineering, science, and manufacturing, to name a few domains, and has long been a valuable source of data for traditional Machine Learning tasks like forecasting with regression models. Think of applications like forecasting monthly revenue, future trading prices, seismic activity, or when an important machine part in a factory is about to fail. Can we learn a foundation model for time series, interrogate it with a chatbot, reason over it with intelligent agents, and perform other useful Generative AI applications on top of it?
Foundation models are a core component when building RAG or agent applications with a vector database like Milvus, whether for generating a response or calculating an intermediate value. In this post and the accompanying notebook, we examine recent work on foundation models for time series, focusing on one model in particular: TimesFM (Das et al., 2024). The details are very similar for related work that appeared around the same time: see Lag-Llama (Rasul et al., 2024), TimeGPT-1 (Garza et al., 2024), Tiny Time Mixers (Ekambaram et al., 2024), Moirai (Woo et al., 2024), and MOMENT (Goswami et al., 2024).
We explain how this model adapts the standard LLM architecture to time series. We also discuss one of its most important ingredients, large time series datasets, and how they are assembled. Finally, we illustrate applications of these models and discuss their limitations.
Applications
Why exactly would we train a time series foundation model? The most immediate answer is to perform effective zero-shot and few-shot prediction tasks, such as forecasting, imputing missing data, and classification. Zero- and few-shot prediction is useful for a number of reasons:
First, it is useful when we have little data on which to train a task-specific time series model. Also, as discussed in the TimesFM paper (Das et al., 2024), and quite surprisingly, foundation models used zero-shot can in some cases outperform non-foundation models trained for the specific task. Moreover, I posit that a foundation model fine-tuned to a specific task will outperform a model trained from scratch for that task.
Second, zero-shot prediction gives agents a quick and robust tool for performing time series tasks. We could of course give our agents access to model training as a tool, but that is inadequate for real-time agents. We explore the idea of zero-shot prediction as an agent tool, along with some of its challenges, in the accompanying notebook, and sketch the idea below.
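To make the tool idea concrete, here is a minimal sketch of the kind of function an agent framework could register as a callable tool. The function name, signature, and the seasonal-naive stand-in body are my own illustrations; the accompanying notebook uses a pretrained zero-shot model (TimesFM) in place of the stand-in logic.

```python
# Sketch of exposing zero-shot forecasting to an agent as a plain function tool.
# The seasonal-naive body below is only a stand-in so the example runs; in the
# notebook, the body would call a pretrained model such as TimesFM instead.
from typing import List


def forecast_tool(history: List[float], horizon: int, season: int = 1) -> List[float]:
    """Forecast `horizon` future values of a univariate series.

    Registered with an agent framework as a tool, so the agent can answer
    forecasting questions without training a model at request time.
    """
    # Stand-in logic: repeat the last observed seasonal cycle.
    cycle = history[-season:] if season <= len(history) else history
    return [cycle[i % len(cycle)] for i in range(horizon)]


# Example agent-side call (the agent would supply these arguments itself).
print(forecast_tool([112, 118, 132, 129, 121, 135], horizon=4, season=3))
```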
Model
Architecture
TimesFM adapts the standard decoder-only Transformer commonly used in language models, for instance OpenAI's GPT, Meta's Llama, and Google's Gemma. The figure above compares the (simplified) TimesFM architecture with that of a decoder-only language model, drawing out the analogy component by component.
TimesFM works on consecutive time series patches rather than discrete tokens. The input series is split into patches of fixed length, each of which is embedded via a small feed-forward network. A language model, in contrast, simply looks up a learnable embedding from a table for each token. A positional encoding is then added to each patch embedding, using the same method as in language models. The combined embedding for each patch is then fed through a stack of causal self-attention layers to produce a context-dependent embedding per patch. Finally, each context-dependent embedding is passed through a second small feed-forward network to produce an array of point predictions covering a fixed forecast horizon beyond the patch in question. A language model, in contrast, passes its context-dependent embeddings through a linear layer that maps to the logits of the next token.
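To make the patching-and-projection idea concrete, here is a minimal PyTorch sketch of a TimesFM-style patched decoder. The module name, layer sizes, patch and horizon lengths, and the omission of positional encodings and residual blocks are all simplifications of my own, not the authors' implementation.

```python
# Minimal, illustrative sketch of a TimesFM-style patched decoder.
# Hyperparameters and module structure are placeholders, not the official model.
import torch
import torch.nn as nn


class PatchedDecoder(nn.Module):
    def __init__(self, patch_len=32, horizon_len=128, d_model=256,
                 n_layers=4, n_heads=4):
        super().__init__()
        self.patch_len = patch_len
        # Patches are embedded with a small feed-forward network,
        # not a token-embedding lookup as in a language model.
        self.input_proj = nn.Sequential(
            nn.Linear(patch_len, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Each context-dependent patch embedding maps to a whole
        # horizon of point forecasts.
        self.output_proj = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, horizon_len)
        )

    def forward(self, series):
        # series: (batch, context_len), context_len divisible by patch_len
        batch, context_len = series.shape
        patches = series.reshape(batch, context_len // self.patch_len, self.patch_len)
        x = self.input_proj(patches)
        # A positional encoding would be added here; omitted for brevity.
        causal_mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.decoder(x, mask=causal_mask)
        return self.output_proj(h)  # (batch, num_patches, horizon_len)


# Example: forecast 128 future points from a context of 512 points.
model = PatchedDecoder()
context = torch.randn(8, 512)
forecast = model(context)[:, -1, :]  # use the prediction from the last patch
print(forecast.shape)  # torch.Size([8, 128])
```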
Data
The authors of TimesFM construct a new dataset of time series intended to represent a range of domains, trend and seasonality patterns, and granularities. They draw on traffic for Google searches on specific trends, traffic from Wikimedia page views, synthetic data constructed from classical regression models, and a small amount of pre-existing open-source time series data. The dataset contains 300B time points, with around 250B coming from Wikimedia traffic, 6B from synthetic data, 5B from Google Trends traffic, and the remainder from pre-existing real-world datasets.
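As a rough illustration of what such synthetic series can look like, the snippet below composes a trend, a seasonal component, and autoregressive noise with NumPy. The specific components and parameter ranges are my own assumptions for demonstration, not the recipe used to build the TimesFM corpus.

```python
# Illustrative synthetic series generator: trend + seasonality + AR(1) noise.
# Components and parameter ranges are assumptions for demonstration only.
import numpy as np


def synthetic_series(length=1024, rng=None):
    rng = rng or np.random.default_rng()
    t = np.arange(length)
    trend = rng.uniform(-0.01, 0.01) * t
    period = rng.integers(12, 365)
    seasonality = rng.uniform(0.5, 2.0) * np.sin(2 * np.pi * t / period)
    # AR(1) noise gives the series some realistic autocorrelation.
    noise = np.zeros(length)
    phi, sigma = rng.uniform(0.5, 0.95), rng.uniform(0.05, 0.5)
    for i in range(1, length):
        noise[i] = phi * noise[i - 1] + rng.normal(0, sigma)
    return trend + seasonality + noise


series = synthetic_series()
print(series.shape)  # (1024,)
```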
Training
TimesFM is trained with a different loss than language models. The model is trained to minimize the mean squared error between the point forecasts (given the context patches) and the true future values. Generative pretraining of language models, on the other hand, minimizes the cross entropy between the predicted next-token logits and the true next tokens.
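The difference between the two objectives is easy to state in code. Below is a brief sketch contrasting them; the tensor shapes and vocabulary size are illustrative assumptions, not values from either paper.

```python
# Contrast of training objectives (shapes and sizes are illustrative).
import torch
import torch.nn.functional as F

# Time series foundation model: mean squared error between the predicted
# horizon (given the context patches) and the true future values.
predicted_horizon = torch.randn(8, 128)        # model output for the last patch
true_horizon = torch.randn(8, 128)             # actual future values
ts_loss = F.mse_loss(predicted_horizon, true_horizon)

# Language model: cross entropy between next-token logits and the true tokens.
logits = torch.randn(8, 512, 32000)            # (batch, seq_len, vocab_size)
targets = torch.randint(0, 32000, (8, 512))    # true next-token ids
lm_loss = F.cross_entropy(logits.reshape(-1, 32000), targets.reshape(-1))

print(ts_loss.item(), lm_loss.item())
```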
TimesFM has been trained at sizes of 17M, 70M, and 200M parameters on 300B time-series points. As a point of comparison, GPT-1 has 120M parameters and GPT-2 has 1.5B; GPT-1 was trained on around 1B tokens and GPT-2 on around 10B. It is difficult to compare model and data sizes across modalities; nonetheless, we get the sense that time series foundation models are at an early stage of development.
Discussion
TimesFM (and related work such as Lag-Llama and TimeGPT-1) successfully demonstrates that it is possible to train a foundation model for time series from scratch, analogous to a large language model, that is capable of zero-shot prediction and exhibits neural scaling laws. See the papers for more details on their experimental setups and results. This contrasts with previous attempts that fine-tune a pretrained language model for time series (Tan et al., 2024 provide evidence that such fine-tuned language models do not exhibit any transfer learning to time series).
Challenges remain, of course. The most important ingredient of a foundation model is big data, which is difficult to collect for time series. As we have seen, the authors of TimesFM and Lag-Llama used some ingenuity to construct new time series datasets for this purpose. Still, we suspect these models would benefit from being scaled up a few orders of magnitude; recall the leap from GPT-2 (1.5B parameters) to GPT-3 (175B). Obtaining time series datasets large enough to support that will be a challenge.
There are many extensions to the basic forecasting models presented here that would increase their usefulness for business applications: forecasting multivariate rather than only univariate series, producing uncertainty estimates, conditioning on exogenous variables, and exploiting other recent advances in Deep Learning. A truly useful foundation model for time series would also be multimodal, able to take text as an additional input, for example. On the architectural side, future models will likely incorporate ideas from structured state-space models and xLSTM for longer, more accurate forecasts with reliable uncertainty estimates.
In summary, it is early days for foundation models for time series, and this is an exciting space to keep an eye on! You can experiment with these open-source models right now: see the accompanying notebook, where I show you how to build a proof-of-concept agent workflow with zero-shot time series prediction as a tool. One thing is certain: the explosion of foundation models across diverse modalities will only increase the utility of RAG and agentic systems, and thereby the importance of vector databases. Milvus for the win!
To see a live demo of this notebook, watch the on-demand replay here.
Resources
- Notebook: “Zero-shot Time Series Prediction as an Agentic Tool”
- TimesFM paper: “A decoder-only foundation model for time-series forecasting”
About the Author:
Stefan Webb is a Developer Advocate at Zilliz, where he advocates for the open-source vector database, Milvus. Prior to this, he spent three years in industry as an Applied ML Researcher at Twitter and Meta, collaborating with product teams to tackle their most complex challenges.
Stefan holds a PhD from the University of Oxford and has published papers at prestigious machine learning conferences such as NeurIPS, ICLR, and ICML. He is passionate about generative AI and is eager to leverage his deep technical expertise to contribute to the open-source community.