Supercharge Your LLMs: Fine-Tune and Serve SLMs with Predibase

ODSC - Open Data Science
5 min read · Jan 15, 2025

Editor’s note: Devvret Rishi and Chloe Leung of Predibase are speaking at the month-long AI Builders Summit starting on January 15th! Be sure to check out their talk, “Fine-tune Your Own Open-Source SLMs,” on January 15th!

Predibase is a low-code/no-code end-to-end platform that simplifies the fine-tuning, serving, and deployment of large language models (LLMs) with advanced techniques like LoRA eXchange and Turbo LoRA for efficient model optimization.

In this tutorial, we provide a detailed walkthrough of fine-tuning and serving the Llama 3.1 8B Instruct model with the CoNLLpp dataset for a Named Entity Recognition (NER) use case using Predibase’s efficient fine-tuning stack. You’ll learn how to:

  1. Fine-tune task-specific models that are on par with commercial LLMs such as GPT-4
  2. Dynamically serve multiple fine-tuned adapters on a single GPU with LoRA eXchange, and
  3. Supercharge inference speed using Turbo LoRA, Predibase’s proprietary optimization for 3x faster and more cost-effective model serving

Dataset Preparation

The CoNLL-2003 (CoNLLpp) dataset is a benchmark dataset for named entity recognition (NER). It contains labeled entities in English text across four categories: persons, organizations, locations, and miscellaneous. The dataset is widely used for training and evaluating models on sequence labeling tasks, especially for fine-tuning LLMs to improve their ability to identify and classify named entities in unstructured text. This makes it valuable for applications like information extraction, question answering, and document indexing where precise entity recognition is critical.

Predibase supports various fine-tuning tasks, including instruction, completion, chat, and the recently added VLM fine-tuning. This tutorial focuses on instruction-based fine-tuning, which requires datasets with prompt-completion pairs like the example below.

Prompt:

Your task is a Named Entity Recognition (NER) task. Predict the category of each entity, then place the entity into the list associated with the category in an output JSON payload. Below is an example:

Input: EU rejects German call to boycott British lamb . Output: {"person": [], "organization": ["EU"], "location": [], "miscellaneous": ["German", "British"]}

Now, complete the task.

Input: Fischler proposed EU-wide measures after reports from Britain and France that under laboratory conditions sheep could contract Bovine Spongiform Encephalopathy ( BSE ) -- mad cow disease . Output:

Completion:

{"person": ["Fischler"], "organization": [], "location": ["Britain", "France"], "miscellaneous": ["EU-wide", "Bovine Spongiform Encephalopathy", "BSE"]}
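If you want to build these prompt-completion pairs yourself, the sketch below shows one way it could be done from the Hugging Face conllpp dataset. It is not part of the original post: the tag order, the prompt wording, and the helper names (build_prompt, to_pair) are assumptions based on the example above.

import json
from datasets import load_dataset

# Standard CoNLL-2003 tag order, which the conllpp dataset also uses (assumption).
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
          "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
CATEGORY = {"PER": "person", "ORG": "organization",
            "LOC": "location", "MISC": "miscellaneous"}

def build_prompt(text):
    # Mirrors the example prompt shown above; the exact wording used in the
    # workshop dataset may differ.
    return (
        "Your task is a Named Entity Recognition (NER) task. Predict the category of "
        "each entity, then place the entity into the list associated with the category "
        "in an output JSON payload. Below is an example:\n"
        'Input: EU rejects German call to boycott British lamb . Output: {"person": [], '
        '"organization": ["EU"], "location": [], "miscellaneous": ["German", "British"]}\n'
        "Now, complete the task.\n"
        f"Input: {text} Output:"
    )

def to_pair(row):
    """Convert one CoNLLpp row (tokens + ner_tags) into a prompt/completion pair."""
    entities = {"person": [], "organization": [], "location": [], "miscellaneous": []}
    span, cat = [], None

    def flush():
        # Close the currently open entity span, if any.
        if span:
            entities[CATEGORY[cat]].append(" ".join(span))

    for token, tag_id in zip(row["tokens"], row["ner_tags"]):
        tag = LABELS[tag_id]
        if tag == "O":
            flush()
            span, cat = [], None
        elif tag.startswith("B-") or cat != tag[2:]:
            flush()
            span, cat = [token], tag[2:]
        else:  # I- tag continuing the current entity
            span.append(token)
    flush()

    text = " ".join(row["tokens"])
    return {"prompt": build_prompt(text), "completion": json.dumps(entities)}

train = load_dataset("conllpp", split="train")
pairs = [to_pair(row) for row in train]

The resulting pairs can then be written to a CSV or JSONL file with "prompt" and "completion" columns and connected to Predibase as shown in the next section.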

Predibase Installation

!pip install -U predibase --quiet

from predibase import Predibase, FinetuningConfig, DeploymentConfig

pb = Predibase(api_token="<PREDIBASE API TOKEN>")

Connect Dataset & Create a Repo

pb.datasets.from_file("{Path to local file}", name="conllpp_demo")
pb.repos.create(name="conllpp-demo", description="conllpp fine-tuning experiments repo", exists_ok=True)

Fine-tuning

You can access Predibase (sign up for our free trial) to connect the dataset and kick off your fine-tuning jobs in a few different ways: the no-code UI, the low-code Python SDK, or the CLI. We made fine-tuning simple by default while keeping it flexible for advanced users who want to configure different settings. With the Predibase fine-tuning stack, fine-tuning takes just a few lines of code, or no code at all using the UI:

# Start a fine-tuning job; blocks until training is finished
adapter = pb.adapters.create(
    config=FinetuningConfig(
        base_model="llama-3-1-8b-instruct"
    ),
    dataset="conllpp_demo",  # Also accepts the connected Dataset object
    repo="conllpp-demo",
    description="initial model with defaults"
)

Once the adapter is trained, we’ll evaluate the model using a custom evaluation function to compute the average multiset Jaccard similarity between two lists of JSON strings.
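The post doesn’t include that evaluation function, but a minimal sketch of what “average multiset Jaccard similarity between two lists of JSON strings” could look like is shown below; the function names are hypothetical and the workshop’s actual implementation may differ.

import json
from collections import Counter

def multiset_jaccard(pred_json: str, ref_json: str) -> float:
    """Jaccard similarity over (category, entity) pairs, counted with multiplicity."""
    def to_multiset(payload: str) -> Counter:
        try:
            data = json.loads(payload)
        except json.JSONDecodeError:
            return Counter()  # treat malformed model output as an empty prediction
        return Counter((cat, ent) for cat, ents in data.items() for ent in ents)

    pred, ref = to_multiset(pred_json), to_multiset(ref_json)
    if not pred and not ref:
        return 1.0  # two empty payloads agree perfectly
    intersection = sum((pred & ref).values())
    union = sum((pred | ref).values())
    return intersection / union

def average_multiset_jaccard(predictions, references):
    # Average the per-example similarity between model outputs and ground truth.
    return sum(multiset_jaccard(p, r) for p, r in zip(predictions, references)) / len(predictions)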

Multi-LoRA Inference via LoRAX

Once the adapter is ready, you can immediately deploy the model via our shared endpoints or create a private serverless deployment for your production traffic.

Shared Endpoints

# Example prompt
prompt = """
Prompt: Your task is a Named Entity Recognition (NER) task. Predict the category of
each entity, then place the entity into the list associated with the
category in an output JSON payload. Below is an example:
Input: EU rejects German call to boycott British lamb . Output: {"person":
[], "organization": ["EU"], "location": [], "miscellaneous": ["German",
"British"]}
Now, complete the task.
Input: Fischler proposed EU-wide measures after reports from Britain and France that under laboratory conditions sheep could contract Bovine Spongiform Encephalopathy ( BSE ) -- mad cow disease . Output:"""

# Specify the shared endpoint by name
lorax_client = pb.deployments.client("llama-3-1-8b-instruct")
print(lorax_client.generate(prompt, adapter_id="conllpp-demo/1", max_new_tokens=100).generated_text)
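Because LoRA eXchange serves many adapters on a single base deployment, swapping adapters is just a matter of passing a different adapter_id to the same client. The second adapter name below is hypothetical, for illustration only.

# Query two different fine-tuned adapters on the same shared endpoint
print(lorax_client.generate(prompt, adapter_id="conllpp-demo/1", max_new_tokens=100).generated_text)
print(lorax_client.generate(prompt, adapter_id="another-repo/3", max_new_tokens=100).generated_text)  # hypothetical second adapter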

Private Serverless Deployments

You can create a private deployment with a few lines of code:

pb.deployments.create(
    name="llama-3-1-8b-instruct-conllpp",
    config=DeploymentConfig(
        base_model="llama-3-1-8b-instruct",
        # cooldown_time=3600,  # Value in seconds, defaults to 3600 (1hr)
        min_replicas=0,  # Auto-scales to 0 replicas when not in use
        max_replicas=1
    )
    # description="",  # Optional
)

Then run inference the same way as with the shared endpoint, replacing the shared endpoint name with your deployment name when creating the lorax_client:

# Grab the private deployment
lorax_client = pb.deployments.client("llama-3-1-8b-instruct-conllpp", force_bare_client=True)

# Warm up the deployment and generate
generated_text = lorax_client.generate(prompt, adapter_id="conllpp-demo/1", max_new_tokens=128, temperature=1).generated_text
print(generated_text)

Turbo LoRA

Turbo LoRA is a proprietary fine-tuning method that combines LoRA for quality and speculative decoding for speed, achieving up to 3.5x faster inference for single requests and up to 2x for high-query-batch workloads.
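The post doesn’t show the training call for this step. Based on the SDK calls above, kicking off a Turbo LoRA run should look much like the earlier LoRA run with a different adapter setting; the adapter="turbo_lora" field below is an assumption, so check the Predibase documentation for the exact parameter name.

# Hedged sketch: adapter="turbo_lora" is an assumed FinetuningConfig field
turbo_adapter = pb.adapters.create(
    config=FinetuningConfig(
        base_model="llama-3-1-8b-instruct",
        adapter="turbo_lora"  # assumed setting to enable Turbo LoRA training
    ),
    dataset="conllpp_demo",
    repo="conllpp-demo",
    description="turbo lora for faster inference"
)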

During the workshop, we’ll walk you through a live demo on this use case comparing LoRA and Turbo LoRA in terms of latency and model performance. You’ll see how easy it is to train a Turbo LoRA adapter and observe the speedups while maintaining the original LoRA’s quality.

Please join me at my ODSC workshop on Jan 15th for a deeper dive into Turbo LoRA, as well as a few other innovative features of Predibase’s next-gen inference engine that collectively enhance the deployment of SLMs.

About the Authors/AI Builders Summit Speakers

Chloe is a Machine Learning Solutions Architect at Predibase with deep expertise in developing business-focused LLM solutions. She previously worked as a senior data scientist at Deloitte and Accenture. Chloe is passionate about making small language models accessible to businesses for their specific use cases. She holds a master’s degree in computer science with a focus on ML from UC Berkeley.

Devvret is the CEO and Cofounder of Predibase. Prior to that, he was an ML product leader at Google, working across products like Firebase, Google Research, and the Google Assistant, as well as Vertex AI. While there, Dev was also the first product lead for Kaggle, a data science and machine learning community with over 8 million users worldwide. Dev’s academic background is in computer science and statistics, and he holds a master’s in computer science from Harvard University with a focus on ML.
