PyTorch Lightning: From Research to Production, Minus the Boilerplate
The following post introduces PyTorch Lightning, outlines its core design philosophy, and provides inline examples of how this philosophy enables more reproducible and production-capable deep learning code.
What is PyTorch Lightning?
PyTorch Lightning is a lightweight PyTorch wrapper for high-performance AI research.
Simply put, PyTorch Lightning is just organized PyTorch code.
Organizing PyTorch code with Lightning enables seamless training on multiple GPUs, TPUs, and CPUs, as well as the use of difficult-to-implement best practices such as model sharding and mixed precision.
Losing the Boilerplate: PyTorch Lightning's Design Philosophy Explained
1. Self-Contained Models and Data
One of the traditional bottlenecks to reproducibility in deep learning is that models are often thought of as just a graph of computations and weights.
Example PyTorch Computation Graph from the PyTorch AutoGrad Docs
In reality, reproducing deep learning results requires mechanisms to keep track of components such as initializations, optimizers, loss functions, data transforms, and augmentations.
A core design philosophy of PyTorch Lightning is that all the components and code related to reproducibility should be self-contained. A good test of how self-contained your model is: ask yourself, “Can someone drop this file into a Trainer without knowing anything about the internals?”
The LightningModule contains all the default initialization parameters needed for reproducibility.
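As a rough sketch of what that looks like in practice (the model below is a minimal illustrative classifier, not code from the original post), a self-contained LightningModule keeps the architecture, loss, optimizer, and hyperparameters together in one file:

```python
import torch
from torch import nn
import torch.nn.functional as F
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    def __init__(self, hidden_dim: int = 128, learning_rate: float = 1e-3):
        super().__init__()
        # save_hyperparameters() records the init arguments so the run is reproducible
        self.save_hyperparameters()
        self.layer_1 = nn.Linear(28 * 28, hidden_dim)
        self.layer_2 = nn.Linear(hidden_dim, 10)

    def forward(self, x):
        # inference logic lives here
        x = x.view(x.size(0), -1)
        x = F.relu(self.layer_1(x))
        return self.layer_2(x)

    def training_step(self, batch, batch_idx):
        # training logic lives here
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # the optimizer is part of the module, not scattered across a training script
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)
```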
2. Modular Code
PyTorch Lightning provides a modular framework to decouple research and data code resulting in faster iteration and more reproducible code.
Visualized Modularization of Deep Learning Code with Lightning
The modular nature of this code increases readability. For example, if I want to understand how a module trains or runs inference, instead of guessing where in the code this is implemented, I can look at the module's training_step and forward functions.
Similarly, if I want to know how data was preprocessed, transformed, or split, I can look at the DataModule's __init__, prepare_data, and setup functions.
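For illustration, a minimal LightningDataModule (MNIST is just an assumed dataset for this sketch) keeps downloading, splitting, and transforming the data in one predictable place:

```python
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import MNIST
import pytorch_lightning as pl


class MNISTDataModule(pl.LightningDataModule):
    def __init__(self, data_dir: str = "./data", batch_size: int = 32):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.transform = transforms.ToTensor()

    def prepare_data(self):
        # download once, on a single process
        MNIST(self.data_dir, train=True, download=True)

    def setup(self, stage=None):
        # split the data; runs on every process
        full = MNIST(self.data_dir, train=True, transform=self.transform)
        self.train_set, self.val_set = random_split(full, [55000, 5000])

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size)
```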
3. Reduce the Boilerplate
After decoupling your research and data code, the remaining boilerplate is managed by Lightning, which provides implementations of proven deep learning best practices to reduce errors and training time.
The Lightning Trainer standardizes the boilerplate with best practices to reduce ~80% of the most common Deep Learning errors.
Without this standardization, state-of-the-art methods are often abandoned because of boilerplate mistakes, such as accidentally leaving a model in evaluation mode when fine-tuning.
Lightning manages boilerplate code such as device placement, logging, process rank management, and more, so that researchers can focus on building the best models possible. If users have special use cases requiring additional abstraction, they can create and share their own callbacks.
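As a hedged sketch of that workflow, reusing the hypothetical LitClassifier and MNISTDataModule from above, a custom callback plugs straight into the Trainer alongside the built-in boilerplate (the callback name and behavior are made up for this example, and the Trainer flags follow the Lightning 1.x API):

```python
import pytorch_lightning as pl


class PrintEpochCallback(pl.Callback):
    """A hypothetical callback that announces the start of every training epoch."""

    def on_train_epoch_start(self, trainer, pl_module):
        print(f"Starting epoch {trainer.current_epoch}")


# The Trainer owns the boilerplate: device placement, logging, checkpointing, precision, ...
trainer = pl.Trainer(
    max_epochs=5,
    gpus=1,            # set to 0 to run on CPU
    precision=16,      # mixed precision, handled by Lightning (requires GPU/TPU)
    callbacks=[PrintEpochCallback()],
)
trainer.fit(LitClassifier(), datamodule=MNISTDataModule())
```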
4. Maximum Flexibility
For research to flourish, tools must be flexible. Lightning’s standardized best practices are accessible to the end-user as overridable hooks enabling maximum flexibility for those who want to experiment with crazy ideas that stray off the standard path.
An example autoencoder training_step override that demonstrates full access to the underlying boilerplate when needed.
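That code figure isn't reproduced here, but a minimal sketch of such an override might look like the following (the architecture and the MSE reconstruction loss are illustrative choices):

```python
import torch
from torch import nn
import torch.nn.functional as F
import pytorch_lightning as pl


class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

    def training_step(self, batch, batch_idx):
        # full control over the training logic: define the loss however the research requires
        x, _ = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```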
Since hooks for processes such as loss configuration are standardized, Lightning makes it much easier to experiment with custom losses for domain-specific applications and to combine different models for more complex multi-input modeling scenarios.
From Reproducible Research to Production Deep Learning
Now that we have a better understanding of the core design philosophy underlying PyTorch Lightning, let’s look at some of the cool features this enables out of the box, from multi-GPU and TPU training to one-line ONNX and TorchScript export.
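To make that concrete, here is a hedged sketch, again reusing the hypothetical LitClassifier and MNISTDataModule from above, of scaling the Trainer across GPUs and exporting the trained module in one line each (file names and the input sample shape are assumptions, and the flags follow the Lightning 1.x API):

```python
import torch
import pytorch_lightning as pl

model = LitClassifier()

# the same module scales across hardware by changing Trainer flags, not model code
trainer = pl.Trainer(gpus=2, accelerator="ddp", precision=16, max_epochs=10)
trainer.fit(model, datamodule=MNISTDataModule())

# one-line exports for production serving
model.to_onnx("model.onnx", input_sample=torch.randn(1, 1, 28, 28))
script = model.to_torchscript(file_path="model_scripted.pt")
```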
If you want to get started quickly, Lightning also provides example implementations of common deep learning tasks, from text summarization to object detection, as part of the PyTorch Lightning Flash repo.
PyTorchLightning/lightning-flash
Production Scale Training with Grid
While Lightning helps keep your PyTorch code organized, reproducible, and scalable, there is one remaining barrier to production that we have yet to discuss: managing infrastructure. Orchestrating compute and data pipelines to train and serve models at scale often requires extensive configuration or code modification. That is where Grid comes in: Grid manages this overhead for you, enabling PyTorch Lightning code to scale from a laptop to the cloud without changing a single line of code.
With Grid Train, you can take a PyTorch Lightning script and scale and track hundreds of experiments without touching your code.
If you don’t have a powerful enough laptop, interactive nodes provide optimized development environments that make it easier to hit the ground running, enabling true production-scale training of state-of-the-art deep learning models.
Conclusions on PyTorch Lightning
This post shows how Lightning’s core design principles enable more reproducible and production-ready deep learning code.
- Lightning code is clearer to read because engineering code is abstracted away, and common functions such as training_step and prepare_data are standardized. Lightning handles the tricky engineering, preventing common mistakes while keeping all the flexibility of PyTorch accessible when needed.
- Lightning modules are hardware agnostic; if your code runs on a CPU, it will run on GPUs, TPUs, and clusters without requiring you to manage gradient accumulation or process ranks yourself. You can even implement your own custom accelerators.
- Each release is rigorously tested with every new PR on every supported version of PyTorch and Python, across operating systems, multiple GPUs, and even TPUs.
- Lightning has dozens of integrations with popular machine learning tools such as TensorBoard, CometML, and Neptune.
- Grid enables you to scale production training of your PyTorch Lightning code from your laptop to the cloud without having to modify a single line of code.
Article written by Ari Bornstein & Sean Narenthiran.
Editor’s note: Learn more about PyTorch Lightning in William Falcon’s ODSC East 2021 talk, “From Research to Production, Minus the Boilerplate.”