Transformer neural networks have become a focal point of deep learning research, demonstrating remarkable efficacy in natural language processing and in emerging applications such as computer vision, robotics, and autonomous driving. However, the increasing scale of these models poses challenges, primarily in compute cost and inference latency.
This has created demand for innovative solutions that enhance scalability without imposing impractical computational burdens. Enter Google AI’s AltUp, a new method designed to augment token representation without amplifying computational overhead.
While models like Switch Transformer, Expert Choice, and V-MoE have made strides in efficiently scaling network parameters, a research gap remained around scaling up the token representation dimension itself. This is where AltUp shines.
What makes AltUp stand out is that it partitions a widened representation vector into equal-sized blocks and processes only one block at each layer. Its effectiveness lies in a prediction-correction mechanism that infers the outputs of the non-processed blocks.
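The partitioning idea can be illustrated with a minimal sketch. The function and variable names here are hypothetical, not from Google's implementation; it only shows how a widened vector splits into equal blocks, with one "activated" block selected per layer:

```python
import numpy as np

def split_blocks(x, num_blocks):
    """Partition a widened representation vector into equal-sized blocks.

    x: widened token representation of size num_blocks * d_model.
    Returns a list of num_blocks vectors, each of the original width.
    """
    return np.split(x, num_blocks)

# A widened 8-dim vector split into K = 2 blocks of the original width 4.
x_wide = np.arange(8.0)
blocks = split_blocks(x_wide, num_blocks=2)

# At each layer, only one "activated" block is fed through the full
# transformer layer; here we simply round-robin the choice by depth.
layer_index = 3
activated = blocks[layer_index % len(blocks)]
```

Because only one block of the original model width passes through the transformer layer, the per-layer compute stays roughly constant no matter how wide the overall representation grows.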
According to Google AI’s blog, by maintaining model dimensions and avoiding a quadratic increase in computation, AltUp emerges as a promising solution to the challenges posed by larger transformer networks.
AltUp operates on token embeddings, widening them without triggering a surge in computational complexity. The method invokes a 1x-width transformer layer on a single block, termed the “activated” block, while concurrently employing a lightweight predictor.
This predictor computes a weighted combination of all input blocks, and a lightweight corrector then updates the inactivated blocks based on the activated one. Both the prediction and correction steps involve only a small number of vector additions and multiplications, making them significantly faster than a conventional transformer layer.
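The predict-then-correct loop can be sketched as follows. This is a simplified illustration under assumed names (`altup_layer`, `p_weights`, `g` are hypothetical), not the exact formulation from the paper:

```python
import numpy as np

def altup_layer(blocks, layer_fn, activated_idx, p_weights, g):
    """One simplified AltUp-style update step.

    blocks: list of K sub-block vectors, each of model width d.
    layer_fn: the 1x-width transformer layer (any vector function here).
    p_weights: K x K mixing weights for the lightweight predictor.
    g: per-block scalar gains for the lightweight corrector.
    """
    K = len(blocks)
    # Prediction: each block's estimate is a weighted combination of all
    # input blocks -- only cheap vector adds and scalar multiplies.
    predicted = [sum(p_weights[i][j] * blocks[j] for j in range(K))
                 for i in range(K)]
    # The real transformer layer runs only on the activated block.
    computed = layer_fn(blocks[activated_idx])
    # Correction: nudge every predicted block using the activated
    # block's observed prediction error.
    error = computed - predicted[activated_idx]
    return [predicted[i] + g[i] * error for i in range(K)]
```

Note that the expensive call, `layer_fn`, happens exactly once per layer regardless of the number of blocks K; everything else is linear-cost vector arithmetic.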
AltUp’s evaluation on T5 models across benchmark language tasks shows that it consistently delivers faster inference than dense models at the same accuracy. A T5 Large model augmented with AltUp achieves notable speedups of 27%, 39%, 87%, and 29% on the GLUE, SuperGLUE, SQuAD, and Trivia-QA benchmarks, respectively.
Notably, AltUp’s relative performance improvements become more pronounced with larger models, underscoring its scalability and enhanced efficacy as model size increases. The researchers’ extension of AltUp, known as Recycled-AltUp, further showcases the adaptability of the proposed method.
Recycled-AltUp, which replicates the initial token embeddings instead of widening them, demonstrates strict improvements in pre-training performance without introducing any perceptible slowdown.
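The embedding-recycling idea can be sketched in a few lines. The function name and shapes here are illustrative assumptions, not the released implementation:

```python
import numpy as np

def recycled_altup_embed(embedding_table, token_ids, num_blocks):
    """Recycled-AltUp-style input widening (simplified sketch).

    Instead of learning a wider embedding table, look up the original
    d-dimensional embedding and replicate it num_blocks times, so the
    embedding parameter count and lookup cost stay unchanged.
    """
    base = embedding_table[token_ids]           # (seq_len, d)
    return np.tile(base, (1, num_blocks))       # (seq_len, num_blocks * d)
```

Since the widened input is just K copies of the same lookup, no new embedding parameters are added, which is why the widening comes at essentially no extra cost.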
Overall, this paper and the team’s contribution go a long way toward making large-scale Transformer models more practical and accessible to a wider range of applications.