How to Optimize ML Models Serving in Production

ODSC - Open Data Science

Today, the use of AI for image classification has become ubiquitous. Millions of images are processed daily, with ever-rising quality standards. Beyond the quality of classification, however, optimizing other aspects such as model speed is crucial. Here, we delve into optimization techniques for high-performance ML image processing, focusing on five key methods.

Tensor and Pipeline Parallelism

While GPU parallelization is pivotal in model training, it’s essential to consider tensor and pipeline parallelism for efficient model usage post-training.

Tensor parallelism involves splitting tensors across multiple devices, while pipeline parallelism partitions the model layers across different processing units. These methods enhance inference speed and scalability, ensuring optimal performance in production environments.

When the system processes many requests in parallel, there is data transfer overhead from moving tensors from one GPU to another. Tensor parallelism may incur greater total overhead than pipeline parallelism, whereas pipeline parallelism may result in higher latency for each individual request. The two methods are not mutually exclusive if you have enough GPU devices, so the trade-off between latency and throughput can be balanced by combining them.
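To make this concrete, here is a minimal sketch, assuming two CUDA devices, of splitting a model's layers across GPUs in PyTorch. True pipeline parallelism additionally splits each batch into micro-batches so the stages work concurrently; this toy example only shows the layer partitioning and the inter-device transfer that causes the overhead discussed above.

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """A toy network split into two stages hosted on different GPUs."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Linear(1024, 10).to("cuda:1")

    def forward(self, x):
        h = self.stage1(x.to("cuda:0"))
        return self.stage2(h.to("cuda:1"))  # device-to-device copy: the transfer overhead

model = TwoStageModel().eval()
with torch.no_grad():
    out = model(torch.randn(32, 1024))
print(out.shape)  # torch.Size([32, 10])
```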


Graphics processors are specialized processors designed for parallel operation. These devices can provide significant advantages over traditional processors, including speedups of up to 10×. It is common to have multiple graphics processors built into a system in addition to the CPU. While CPUs can perform more complex or general-purpose tasks, GPUs excel at specific, often repetitive processing tasks.

GPUs are often clustered, with each node containing one or more GPUs. When using GPUs with Kubernetes, you can deploy heterogeneous clusters and specify resource requirements, such as memory. You can also monitor these clusters to ensure reliable performance and to optimize GPU usage.

Of course, parallel computation is used in model training too, with model weight synchronization added on top of the data flow. In training we prioritize throughput: tensor parallelism is used when a full tensor cannot fit on a single device, while pipeline parallelism and model replication are used to distribute model weights across devices.

Data parallelism, a technique allowing the use of replicated models on different GPUs, proves advantageous in scenarios where datasets exceed the capacity of a single server or when expediting the training process is necessary. Each model copy is concurrently trained on a subset of the dataset, and the results are consolidated, facilitating seamless training continuity.

There are several platforms that allow you to achieve GPU parallelism when training your models. These include:

  • TensorFlow is one of the most popular platforms for machine learning and deep learning.
  • Keras is a deep learning API that enables fast distributed learning using multiple GPUs.
  • PyTorch is a Python-based deep learning platform for scientific computing tasks.
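
As an illustration of data parallelism with PyTorch, here is a minimal DistributedDataParallel sketch; the model and data are toy placeholders, each process trains on its own shard, and gradients are synchronized automatically.

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = torch.nn.Linear(512, 10).to(device)      # toy model
    model = DDP(model, device_ids=[local_rank])      # replicate and sync gradients
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        x = torch.randn(32, 512, device=device)      # each rank sees its own data shard
        y = torch.randint(0, 10, (32,), device=device)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()                               # gradients are all-reduced across replicas
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```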

Model Replication

Model replication means scaling the service so that multiple copies of the same model process requests in parallel.

Model replicas can be hosted on the same device or on different devices within the same machine. Hosting multiple replicas on one device can be harder to manage, but it can save a lot of GPU memory if a single model does not use all of the memory available on the device. However, hosting multiple models on one GPU can hinder performance due to device locks and context switches.

Model-serving software can be used to manage model replication across multiple devices, whether CPU- or GPU-based. Modern serving frameworks provide many useful features, such as model load/unload management, support for multiple ML frameworks, dynamic batching, model priority management, and metrics for service monitoring.

You can also use distributed computing frameworks such as Ray or Dask, which are widely used for AI/ML applications. Ray helps you to:

  • Schedule tasks across multiple machines
  • Transfer data efficiently
  • Recover from machine failures
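
Ray's serving layer, Ray Serve, exposes replication directly through a deployment decorator. Here is a minimal sketch, with a trivial Echo deployment standing in for a real model; in practice you would load the model in __init__ and pass ray_actor_options={"num_gpus": 1} to pin each replica to a GPU.

```python
from ray import serve

@serve.deployment(num_replicas=2)                # two replicas share incoming traffic
class Echo:
    async def __call__(self, request):
        payload = await request.body()
        return {"bytes_received": len(payload)}  # stand-in for real model inference

serve.run(Echo.bind())                           # Ray schedules the replicas across the cluster
```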

Quantization

Quantization is the most widely used model compression method; it reduces the size of a model by using fewer bits to represent its parameters. Typically, software packages use a 32-bit floating-point representation for numbers. For a model with 100 million parameters, this requires 400 MB. Switching to a 16-bit (half-precision) representation cuts the memory requirement in half. Models can also use integers (8 bits) or even binary weights (1 bit), as seen in BinaryConnect and XNOR-Net, the latter of which led to the creation of Xnor.ai, which was acquired by Apple in 2020 for $200 million.

Interestingly, the first developments of the quantization method appeared back in the 1990s, in the works of A. Choudry.

Quantization not only reduces memory consumption, but also improves computational speed by allowing larger batch sizes and faster arithmetic operations. For example, adding two 16-bit numbers is faster than adding two 32-bit numbers. However, fewer bits mean a smaller range of representable values, leading to rounding errors and potential performance degradation. Efficient rounding and scaling are complex, but are built into major frameworks.
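
For example, here is a minimal sketch of half-precision inference in PyTorch, assuming a CUDA GPU; casting the weights and inputs to float16 halves their memory footprint and enables faster fp16 arithmetic on hardware that supports it.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2048, 1024), nn.ReLU(), nn.Linear(1024, 10))
model = model.cuda().half().eval()                 # fp16 weights: roughly half the memory

batch = torch.randn(256, 2048, device="cuda", dtype=torch.float16)
with torch.no_grad():
    logits = model(batch)                          # fp16 matmuls use Tensor Cores where available
print(logits.dtype)                                # torch.float16
```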

Quantization can occur during training (quantization-aware training) or after training. Low-precision training reduces memory requirements per parameter, allowing larger models on the same hardware. Low-precision training is gaining popularity and is supported by modern hardware such as NVIDIA’s Tensor Cores and Google TPUs with Bfloat16. Fixed-point training shows promising results, but is less common. Fixed-point inference is now standard in the industry, especially for edge devices. Major frameworks such as TensorFlow Lite, PyTorch Mobile, and TensorRT offer post-training quantization with minimal code changes, making it easier to deploy efficient models on resource-constrained devices.
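
For instance, PyTorch offers post-training dynamic quantization in a couple of lines. The sketch below converts the linear layers of a toy model to int8 weights; newer PyTorch versions also expose the same function as torch.ao.quantization.quantize_dynamic.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
).eval()

# Replace Linear weights with int8 representations; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)   # same output shape, smaller weights, faster CPU matmuls
```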

Image Decoding Optimization

Optimizing image decoding starts with choosing the right libraries and tools for the job. Libraries such as OpenCV and Pillow provide various functionalities for image manipulation and processing. However, their performance characteristics can vary significantly depending on the specific task and hardware environment. Be sure to conduct benchmark tests with different libraries and configurations to help determine the most efficient solution for your use case.

For example, optimized libraries such as TurboJPEG can provide a significant image-decoding speed boost compared to the more widely used Pillow and OpenCV. The good news is that some of these high-performance features can be integrated into newer versions of the Pillow or OpenCV libraries, so it is worth checking whether your environment uses the right underlying libraries and recompiling your dependencies if it does not.
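
A rough benchmark sketch along these lines might look as follows, assuming libjpeg-turbo with the PyTurboJPEG wrapper is installed and sample.jpg is a local test image.

```python
import time
from io import BytesIO

from PIL import Image
from turbojpeg import TurboJPEG

with open("sample.jpg", "rb") as f:            # hypothetical test image
    buf = f.read()

jpeg = TurboJPEG()

start = time.perf_counter()
for _ in range(100):
    Image.open(BytesIO(buf)).convert("RGB")    # Pillow decode
print("Pillow:   ", time.perf_counter() - start)

start = time.perf_counter()
for _ in range(100):
    jpeg.decode(buf)                           # TurboJPEG decode (returns a BGR numpy array)
print("TurboJPEG:", time.perf_counter() - start)
```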

In addition, using hardware acceleration with GPUs or specialized hardware such as Tensor Processing Units (TPUs) can further improve decoding performance.

For example, nvJPEG in NVIDIA's DALI framework lets you decode and process far more images on the GPU. However, if GPU resources are limited, you may want to reserve them for the ML models themselves and focus on CPU-based distributed frameworks and CPU decoding with optimized libraries.

Implementing caching mechanisms to optimize image decoding in ML-based services can be beneficial in scenarios where the input image space is small and repetitive.

Caching decoded images can effectively mitigate the computational overhead associated with repetitive decoding during pipeline processing. By storing decoded images in memory or on disk, subsequent requests for the same image can be processed much faster, thereby enhancing overall system performance and scalability.

However, it’s crucial to manage cache eviction policies meticulously to prevent excessive memory utilization or stability issues, ensuring smooth operation in production environments where input image repetition is prevalent.
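
One simple way to get such a cache with a bounded eviction policy is Python's built-in LRU cache. The sketch below uses OpenCV for decoding and assumes images are addressed by file path; the maxsize bound evicts the least-recently-used entries so memory use stays predictable.

```python
from functools import lru_cache

import cv2
import numpy as np

@lru_cache(maxsize=1024)                         # least-recently-used entries are evicted
def get_decoded_image(path: str) -> np.ndarray:
    data = np.fromfile(path, dtype=np.uint8)     # read raw bytes from disk
    return cv2.imdecode(data, cv2.IMREAD_COLOR)  # decode only on a cache miss

img = get_decoded_image("sample.jpg")            # first call decodes the image
img = get_decoded_image("sample.jpg")            # repeated call is served from memory
```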

Pre-processing techniques play a key role not only during model training but also in real application scenarios.

Before decoding, applying preprocessing steps such as resizing, normalization, and cropping can significantly reduce the amount of data to be processed. For example, cropping large images and selecting only the desired portions before decoding speeds up the process and reduces additional cost.

Image-format-specific libraries such as PyTurboJPEG or simplejpeg can be used to perform such manipulations around decoding. Conventional libraries may not offer this functionality and will first load the image at full resolution before resizing it, which can be avoided in many cases.
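
For example, PyTurboJPEG can decode a JPEG directly at a fraction of its original resolution, which is much cheaper than decoding at full size and resizing afterwards. A minimal sketch, assuming a scaling factor supported by your libjpeg-turbo build:

```python
from turbojpeg import TurboJPEG

jpeg = TurboJPEG()
with open("large_photo.jpg", "rb") as f:          # hypothetical large JPEG
    buf = f.read()

small = jpeg.decode(buf, scaling_factor=(1, 4))   # decode at 1/4 of the width and height
print(small.shape)
```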

Moreover, augmentation techniques such as random cropping and rotation diversify the training data without incurring much additional decoding cost. These preprocessing strategies are essential for optimizing image decoding in production environments, providing efficient and scalable performance when deploying ML models for real-world applications.


Parallel Image Fetching

Parallel image loading is emerging as a key strategy to accelerate processing time and improve system efficiency. It involves fetching multiple images simultaneously, utilizing the capabilities of modern computing architectures to reduce latency and increase throughput. One of the main approaches is multithreading or asynchronous programming: by distributing image-loading tasks among multiple threads or asynchronous tasks, the system can use the available CPU cores and fetch several images from storage at the same time.
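
A minimal sketch of this approach is a thread pool that downloads several images concurrently (the URLs below are hypothetical placeholders). Because the work is network-bound, total wall-clock time approaches that of the slowest request rather than the sum of all requests.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

urls = [
    "https://example.com/images/0001.jpg",   # hypothetical image URLs
    "https://example.com/images/0002.jpg",
    "https://example.com/images/0003.jpg",
]

def fetch(url: str) -> bytes:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.content

with ThreadPoolExecutor(max_workers=8) as pool:
    images = list(pool.map(fetch, urls))      # downloads run concurrently
```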

Moreover, parallel image loading can be easily integrated with distributed computing systems to scale across multiple nodes or clusters. Technologies such as Apache Spark, Flink, and Hadoop enable distributed image loading, where images are loaded in parallel on multiple machines or nodes in a cluster. This distributed approach not only speeds up image retrieval, but also facilitates easy scalability, allowing ML-based image processing services to easily handle large workloads and datasets.

Where you load an image also matters. If images are stored in object storage such as Amazon S3, you can pass the image URI from service to service until you actually need the image data. But if you load images on GPU machines, you add extra load to those machines: memory and CPU usage can skyrocket due to decoding and to downloading images at full resolution. You also don't want to load images into a machine's RAM too early, because then you have to pass more data from service to service, increasing the network load and the intermediate machines' memory usage; even if you preprocess and resize images down, it is still far more data than passing a URI string. A middle-ground solution is to use distributed stream processing with a framework like Flink to load and preprocess the images on CPU machines and forward the results to the serving service. That way, the image preprocessing logic is contained on CPU-based machines, the serving service can be made more general, preprocessing resources are separated from inference resources and can be monitored and scaled independently, and the network load stays minimal.

Guest article provided by Iaroslav Geraskin.

Originally posted on OpenDataScience.com

Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday. You can also get data science training on-demand wherever you are with our Ai+ Training platform. Interested in attending an ODSC event? Learn more about our upcoming events here.
