Building the Future of Generative AI: Compound AI Systems

ODSC - Open Data Science
3 min read · Nov 1, 2024

Single, monolithic models are out; compound AI systems are in. The next generation of generative AI will be dynamic, agentic workflows: systems of many models, modalities, and external knowledge sources that work together to solve business tasks.

LLMs have impressive capabilities, but they are limited by their training data and access to real-time information. Compound AI systems don’t have to rely on the technical capabilities of a single model. Instead, they can:

  • Leverage the unique strengths of each system component
  • Incorporate real-time information from external databases and knowledge sources
  • Function at faster inference speeds and with lower latency
  • Produce higher-quality results for end users

This is a more functional, stable, and interactive solution for enterprise production-scale AI, overcoming the probabilistic limitations of a single generative AI model.

However, transitioning the entire AI industry toward compound AI systems requires radically new tools and design approaches. Compound AI systems need to be steerable to fit the unique workload patterns of individual use cases. Automatically customizing compound AI systems is the way to scale to a large number of production deployments.

Here are a few examples of automatic customization we invented at Fireworks:

Adaptive speculative execution. This approach improves model inference by customizing a technique called “speculative decoding” for specific workloads. Rather than having one LLM generate tokens one by one, speculative decoding brings in a smaller “draft” model. The draft model predicts possible token sequences while the main LLM runs as usual. The tokens predicted by the draft model are then verified by the main LLM for accuracy. Adaptive speculative execution takes speculative decoding one step further for compound systems, automating optimization across every layer of the deployment stack (from hardware, to software, to models).
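The draft-then-verify loop can be sketched in a few lines. The `draft_next` and `target_next` functions below are toy stand-ins for real models (and the whole sketch is illustrative, not Fireworks' implementation): the draft model is cheap but occasionally wrong, and the target model accepts the longest correct prefix of each draft, so several tokens can be committed per expensive verification pass.

```python
def draft_next(token: int) -> int:
    """Cheap 'draft' model: guesses the next token, but is wrong on multiples of 5."""
    return token + 2 if token % 5 == 0 else token + 1

def target_next(token: int) -> int:
    """Expensive 'target' model: the ground-truth next token (always +1 here)."""
    return token + 1

def speculative_decode(prompt_token: int, n_tokens: int, k: int = 4) -> list[int]:
    """Generate n_tokens after prompt_token, verifying k-token drafts per pass."""
    out = [prompt_token]
    while len(out) - 1 < n_tokens:
        # Draft model speculates k tokens ahead of the current context.
        draft, cur = [], out[-1]
        for _ in range(k):
            cur = draft_next(cur)
            draft.append(cur)
        # Target model verifies the draft: accept the longest correct prefix,
        # then append one corrected token so progress is always made.
        prev = out[-1]
        for t in draft:
            expected = target_next(prev)
            if t != expected:
                out.append(expected)  # corrected token from the target model
                break
            out.append(t)
            prev = t
        out = out[: n_tokens + 1]  # trim any overshoot past n_tokens
    return out[1:]
```

When the draft model is accurate, each verification pass commits up to `k` tokens; when it diverges, the loop still emits one correct token, so output quality matches the target model exactly while latency drops.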

Multi-LoRA techniques. Personalized AI product experiences are critical for customer satisfaction and retention. However, deploying hundreds of fine-tuned models to thousands of users is expensive. Low-rank adaptation (LoRA) is a popular solution to this scaling cost problem: it fine-tunes a model by training only small low-rank adapter matrices rather than the full set of model parameters. The Multi-LoRA approach takes this one step further, allowing developers to serve hundreds of personalized AI models at the same inference cost as a single base model.
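The core idea can be sketched with tiny matrices: one shared base weight `W` is loaded once, and each user only adds a low-rank correction `B @ A`. All names, shapes, and the pure-Python matrix helpers below are illustrative assumptions, not a real serving stack.

```python
def matmul(X, Y):
    """Minimal dense matrix multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def matadd(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Shared base weight (d_in x d_out, here 2x2), loaded once for all users.
W = [[1.0, 0.0],
     [0.0, 1.0]]

# Per-user LoRA adapters of rank r=1: B is d_in x r, A is r x d_out.
# These few numbers per user replace a full fine-tuned copy of W.
adapters = {
    "user_a": ([[1.0], [0.0]], [[0.0, 2.0]]),  # (B, A)
    "user_b": ([[0.0], [1.0]], [[3.0, 0.0]]),
}

def lora_forward(x, user):
    """Apply the user's personalized weight W + B @ A to input row x."""
    B, A = adapters[user]
    W_user = matadd(W, matmul(B, A))
    return matmul([x], W_user)[0]
```

Because the adapters are tiny relative to `W`, a server can keep hundreds of them in memory alongside one base model and batch requests from different users through the same weights.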

Quantization techniques. In order for compound systems to work well, they need to be fast. Developers can use various quantization techniques like SmoothQuant, GPTQ, Hadamard, and SpinQuant to maximize model performance and speed. But quantization also introduces performance risk. Developers must monitor the impact of quantization using task-based accuracy metrics.
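The quantize-then-measure loop the paragraph describes can be illustrated with simple symmetric int8 quantization. This is a deliberately minimal sketch; techniques like SmoothQuant or GPTQ are far more sophisticated, and the weights and error check below are made-up examples.

```python
def quantize_int8(weights):
    """Map floats to int8 range [-127, 127] using one symmetric scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the quantized values."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Proxy for a task-based accuracy check: worst-case round-trip error
# should stay within half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

In practice the round-trip error here is a weak proxy; as the paragraph notes, developers should measure quantization impact with task-based accuracy metrics on their actual workload.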

About the Author/ODSC West 2024 Speaker on Compound AI Systems:

Lin Qiao is the CEO and co-founder of inference platform Fireworks AI, which makes it simple, fast, and cost-effective for enterprises like Uber, Quora, and DoorDash to build and scale genAI products. Prior to founding Fireworks, Lin led Meta’s PyTorch team.
