Showcasing the Future of Time Series Forecasting with Foundation Models
Foundation models have revolutionized natural language processing and computer vision by enabling powerful generalization across tasks with minimal fine-tuning. Now, a similar transformation is underway in the time series domain. In a recent session, Stefan Webb, Developer Advocate at Zilliz, spotlighted the growing potential of foundation models for time series forecasting. Zilliz, the company behind the open-source vector database Milvus, is closely following this evolution as it intersects with cutting-edge AI infrastructure.
The promise is clear: By adapting the large language model (LLM) paradigm to time series data, we could unlock new forecasting capabilities with zero-shot learning, scalable architectures, and cross-domain generalization.
Why Time Series Data Matters
Time series data underpins critical operations across sectors. It captures how values evolve over time — an essential factor in trend analysis, anomaly detection, and forecasting.
From financial tick data and industrial IoT sensors to search trends and staffing metrics, time series data powers decision-making in:
- Finance: trading algorithms and risk forecasting
- Manufacturing: predictive maintenance and process optimization
- Marketing and Sales: demand forecasting and churn modeling
- Scientific Research: climate trends, epidemiology, and more
- Business Intelligence: operational planning, revenue prediction
Regardless of your domain, chances are you interact with time series data regularly.
Why Foundation Models for Time Series?
The traditional approach to time series forecasting often requires training a new model for each use case — an inefficient and resource-intensive process. Foundation models present a game-changing alternative through:
- Zero-shot learning: Predict outcomes without retraining on new tasks
- Transfer learning: Leverage patterns across varied datasets
- Reduced computational load: Skip daily retraining cycles
- Few-shot capabilities: Learn from minimal examples in context
This represents a significant leap from classical models like ARIMA or Gaussian Processes, which must be customized per task and rarely generalize across domains.
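To make that contrast concrete, here is a minimal sketch of the classical per-series workflow using statsmodels' ARIMA; the ARIMA order, the toy data, and the `pretrained_model` call in the closing comment are illustrative assumptions, not an actual API.

```python
# Classical workflow: fit and store one ARIMA model per series.
# A foundation model would replace the per-series fit with a single
# pretrained model queried zero-shot.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series_collection = [rng.normal(size=200).cumsum() for _ in range(3)]  # toy random walks

forecasts = []
for series in series_collection:
    model = ARIMA(series, order=(1, 1, 1))    # order tuned per series in practice
    fitted = model.fit()                      # retraining happens for every series
    forecasts.append(fitted.forecast(steps=24))

# Foundation-model alternative (hypothetical interface):
# forecasts = pretrained_model.forecast(series_collection, horizon=24)
```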
Adapting the LLM Paradigm to Time Series
The core idea is simple yet profound: use a decoder-only transformer architecture, similar to models like GPT, but train it from scratch on time series data. This differs from retrofitting pre-trained LLMs, which have shown limited effectiveness on time series benchmarks.
The architecture mimics LLM design but processes sequential numerical data rather than text (a code sketch follows below):
- Uses residual blocks to embed time series patches
- Applies positional encodings to retain temporal order
- Feeds embeddings into transformer layers
- Outputs multi-point forecasts from the decoder head
Importantly, these models are trained directly on time series data and do not inherit any language-specific structure.
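Below is a minimal PyTorch sketch of such an architecture, assuming illustrative hyperparameters rather than the exact TimesFM configuration: a residual block embeds fixed-length patches, learned positional encodings preserve order, causally masked transformer layers process the patch sequence, and a linear head emits a multi-step forecast.

```python
# Minimal PyTorch sketch of a decoder-only time series forecaster.
# All sizes below are illustrative, not the TimesFM configuration.
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Embeds a raw-value patch with a skip connection."""

    def __init__(self, patch_len: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(patch_len, d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, x):                      # x: (batch, n_patches, patch_len)
        h = self.proj(x)
        return h + self.mlp(h)                 # residual connection


class TinyTimeSeriesDecoder(nn.Module):
    def __init__(self, patch_len=32, horizon=128, d_model=128,
                 n_layers=2, n_heads=4, max_patches=64):
        super().__init__()
        self.patch_len = patch_len
        self.embed = ResidualBlock(patch_len, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_patches, d_model))   # positional encodings
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, horizon)                         # multi-step forecast head

    def forward(self, series):                 # series: (batch, context_len)
        patches = series.unfold(-1, self.patch_len, self.patch_len)     # tokenize into patches
        h = self.embed(patches) + self.pos[:, : patches.size(1)]
        causal = nn.Transformer.generate_square_subsequent_mask(patches.size(1))
        h = self.blocks(h, mask=causal)        # causal masking makes the stack decoder-only
        return self.head(h[:, -1])             # forecast from the final patch position


model = TinyTimeSeriesDecoder()
context = torch.randn(8, 512)                  # 8 series, 512 past points each
print(model(context).shape)                    # -> torch.Size([8, 128])
```

Real systems add input normalization and more careful training objectives; the point here is only the shape of the pipeline.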
Challenges and Open Questions
Early skepticism centered on key issues:
- No tokens or grammar: Can models learn without linguistic structure?
- Data availability: Unlike text, time series data is fragmented and often proprietary
- Cross-domain relevance: Can models trained on weather patterns predict stock behavior?
Initial results give grounds for cautious optimism. Shared temporal dynamics and the inclusion of exogenous variables improve generalization, and performance scaling akin to that of LLMs points to emergent capabilities with larger datasets and model sizes.
Case Study: TimesFM by Google Research
Google Research’s TimesFM exemplifies this new frontier. Built on a decoder-only transformer, the model showcases strong performance across a wide range of forecasting benchmarks.
Architecture Highlights:
- Tokenization: Time series are divided into “patches,” analogous to image or text tokens (see the sketch after this list)
- Embedding: Patches transformed using residual blocks, with positional encodings
- Prediction: Decoder head forecasts multiple time steps ahead
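As a quick illustration of the patching step, the sketch below cuts a synthetic series into 32-point patches; the series itself and the patch length are assumptions chosen to mirror the input patch size reported for TimesFM.

```python
# Tokenization sketch: one context window is cut into fixed-length patches
# that play the role tokens play in an LLM.
import numpy as np

series = np.sin(np.linspace(0, 20, 512))   # one context window of 512 points
patch_len = 32
patches = series.reshape(-1, patch_len)    # (16, 32): 16 patch "tokens" of 32 points
print(patches.shape)
```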
Dataset:
- 300 billion time points, including data from Google Trends, Wikipedia traffic, weather, and more
- Synthetic data created via ARIMA-based models (sketched after this list)
- Limited representation of financial data — an area for future expansion
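A rough sketch of how ARIMA-style synthetic series can be generated with statsmodels is shown below; the ARMA coefficients, series length, and batch size are illustrative assumptions.

```python
# Generating synthetic training series from an ARMA process with statsmodels.
# Note the sign convention: the AR/MA lists are lag-polynomial coefficients
# including the zero-lag term.
import numpy as np
from statsmodels.tsa.arima_process import arma_generate_sample

ar = [1.0, -0.75, 0.25]   # 1 - 0.75L + 0.25L^2 (stationary)
ma = [1.0, 0.4]           # 1 + 0.4L
synthetic = [arma_generate_sample(ar, ma, nsample=512) for _ in range(4)]
print(np.shape(synthetic))  # (4, 512): a small batch of synthetic series
```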
Training and Results:
- 200M-parameter model trained with a mean squared error (MSE) objective (metrics sketched below)
- Surpassed task-specific models in zero-shot mean absolute error (MAE)
- Validated scaling laws and optimal input patch size (e.g., 32 points)
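For reference, the sketch below computes the two metrics mentioned above on a toy forecast: MSE as the training objective and MAE as the zero-shot evaluation metric. The values are made up.

```python
# Toy illustration of the training and evaluation metrics.
import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 13.0])   # held-out future values
y_pred = np.array([9.5, 12.5, 10.0, 13.5])    # model forecast

mse = np.mean((y_true - y_pred) ** 2)          # training loss
mae = np.mean(np.abs(y_true - y_pred))         # zero-shot evaluation metric
print(f"MSE={mse:.3f}  MAE={mae:.3f}")
```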
Beyond Forecasting: Versatility of Time Series FMs
The applications extend well beyond prediction:
- Classification: Generate embeddings of time series histories for downstream ML tasks
- Regression: Feed model outputs into specialized models for tailored predictions
By converting time series into high-quality representations, foundation models open the door to broader AI use cases, including anomaly detection and signal classification.
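A minimal sketch of that pattern follows, assuming a hypothetical `embed_series` stand-in for the foundation model's encoder; the embeddings are fed to a scikit-learn classifier as a downstream task.

```python
# Embeddings as features for a downstream classifier. `embed_series` is a
# hypothetical stand-in for the foundation model's pooled hidden state;
# here it returns simple summary statistics so the sketch runs on its own.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed_series(series: np.ndarray) -> np.ndarray:
    # Placeholder embedding; a real pipeline would call the foundation model.
    return np.array([series.mean(), series.std(), series.min(), series.max()])

rng = np.random.default_rng(0)
X = np.stack([embed_series(rng.normal(loc=i % 2, size=128)) for i in range(100)])
y = np.array([i % 2 for i in range(100)])          # toy labels: two regimes

clf = LogisticRegression().fit(X, y)               # downstream ML task on embeddings
print(clf.score(X, y))
```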
The Road Ahead
As promising as this field is, several challenges remain:
- Exogenous data integration: Enhancing accuracy via cross-attention, LoRA, or residual regression
- Multimodal models: Combining time series with video, text, or sensor data
- Probabilistic outputs: Improving uncertainty quantification for risk-sensitive domains (e.g., via quantile losses; see the sketch below)
- New architectures: Exploring alternatives like Mamba or xLSTM for long-range dependencies
- Data scale and tooling: Growing datasets and simplifying access through frameworks like Hugging Face Transformers
These next steps will be essential in moving from research to production-grade systems.
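As one concrete example of the probabilistic-output direction, the sketch below implements a pinball (quantile) loss, a common objective for quantile forecasts; the quantile level and toy values are assumptions.

```python
# Pinball (quantile) loss: a workhorse objective for quantile forecasts.
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Penalizes under- and over-prediction asymmetrically around quantile q."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

y_true = np.array([10.0, 12.0, 11.0])
print(pinball_loss(y_true, y_pred=np.array([9.0, 12.5, 11.0]), q=0.9))
```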
Time Series FMs in Broader AI Pipelines
As time series FMs mature, their integration into broader AI systems becomes increasingly feasible. In particular, techniques like Retrieval-Augmented Generation (RAG) can leverage time series embeddings for enhanced decision-making. Milvus, the open-source vector database developed by Zilliz, offers a powerful platform for building these systems, making it easier to index, retrieve, and serve time series representations in real-time applications.
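A minimal sketch of that pipeline with pymilvus is shown below, using Milvus Lite for local storage. The `embed_series` function is a hypothetical placeholder for a time series foundation model's encoder, and the collection name and dimension are assumptions.

```python
# Index time series embeddings in Milvus (via Milvus Lite) and retrieve
# the stored windows nearest to a new query window.
import numpy as np
from pymilvus import MilvusClient

DIM = 128
client = MilvusClient("ts_demo.db")                              # local Milvus Lite file
client.create_collection(collection_name="ts_embeddings", dimension=DIM)

def embed_series(series: np.ndarray) -> list:
    # Placeholder: a real pipeline would return the foundation model's embedding.
    rng = np.random.default_rng(abs(int(series.sum() * 1000)) % 2**32)
    return rng.normal(size=DIM).tolist()

windows = [np.sin(np.linspace(0, 10, 256)) + i for i in range(5)]   # toy history windows
client.insert(
    collection_name="ts_embeddings",
    data=[{"id": i, "vector": embed_series(w)} for i, w in enumerate(windows)],
)

# Similarity search: find stored windows closest to a query embedding.
hits = client.search(
    collection_name="ts_embeddings",
    data=[embed_series(windows[0])],
    limit=3,
)
print(hits)
```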
Conclusion
Foundation models are poised to redefine time series forecasting. Inspired by the success of LLMs, decoder-only transformers show that time-based data can benefit from large-scale generalization, zero-shot learning, and scaling laws. While still early in development, tools like TimesFM demonstrate that these models can already outperform task-specific alternatives.
As datasets grow and architectures improve, expect foundation models for time series to play a pivotal role in future AI systems, offering new capabilities for data scientists and engineers ready to experiment today.