Fine-tuning LLMs on Slack Messages

ODSC - Open Data Science
3 min read · Oct 12, 2023

Editor’s note: Eli Chen is a speaker for ODSC West this October 30th to November 2nd. Be sure to check out his talk, “Fine-tuning LLMs on Slack Messages,” there!

Fine-tuning LLMs is super easy, thanks to HuggingFace’s libraries. This tutorial walks you through adapting a pre-trained model to generate text that emulates chat conversations from a Slack channel. You should be comfortable with Python to get the most out of this guide.

Getting the data

First, obtain the data from Slack using the API. Before diving in, make sure you have your bot token, user ID, and channel ID handy. If you’re unsure how to acquire these, here’s a quick guide.

Initialize the Slack WebClient and define a function, fetch_messages, that pulls the channel's message history and keeps only the messages posted by a specific user ID.

from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

token = "YOUR_SLACK_BOT_TOKEN"
channel_id = "CHANNEL_ID"
user_id = "USER_ID"

client = WebClient(token=token)

def fetch_messages(channel_id):
    """Collect every message the target user posted in the channel."""
    messages = []
    cursor = None
    while True:
        try:
            response = client.conversations_history(channel=channel_id, cursor=cursor)
            assert response["ok"]
            for message in response["messages"]:
                # Keep only messages authored by the target user
                if "user" in message and message["user"] == user_id:
                    messages.append(message["text"])
            # Paginate until Slack stops returning a next_cursor
            cursor = response.get("response_metadata", {}).get("next_cursor")
            if not cursor:
                break
        except SlackApiError as e:
            print(f"Error: {e.response['error']}")
            break
    return messages

all_messages = fetch_messages(channel_id)

The function fetch_messages returns a list, all_messages, containing every message the specified user posted in the given channel; this is the text we'll use for fine-tuning.
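Raw Slack text often carries markup such as user mentions (<@U12345>) and formatted links, which you may or may not want in your training data. Here is a minimal, optional cleaning sketch; the clean_message helper and its regexes are illustrative assumptions rather than part of the original pipeline.

import re

def clean_message(text):
    # Hypothetical helper: strip Slack user mentions like <@U12345>
    text = re.sub(r"<@[A-Z0-9]+>", "", text)
    # Reduce Slack-formatted links like <https://example.com|label> to the visible label
    text = re.sub(r"<(https?://[^|>]+)\|([^>]+)>", r"\2", text)
    return text.strip()

cleaned_messages = [clean_message(m) for m in all_messages if m.strip()]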

Fine-Tuning the Model

After collecting the messages, the next step is to fine-tune a pre-trained language model so it mimics this user's language patterns. The code below uses HuggingFace's transformers library to streamline the process.

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

# Initialize the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token, so reuse eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tokenize the messages and wrap them in a Dataset object
tokens = tokenizer(all_messages, padding=True, truncation=True)
dataset = Dataset.from_dict(tokens)

# Create a data collator; mlm=False builds causal-LM labels from the input_ids
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./output",
    learning_rate=2e-4,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

# Fine-tune the model
trainer.train()

By running this code, you fine-tune a GPT-2 model to generate text mimicking the user’s messages. Feel free to experiment with different models and learning rates to better fit your needs.
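For example, you might train for more epochs, set an explicit batch size, or save the result for reuse. The values below are illustrative assumptions rather than tuned recommendations; pass the arguments in before constructing the Trainer.

# Hypothetical alternative settings to experiment with
training_args = TrainingArguments(
    output_dir="./output",
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
)

# After training, persist the fine-tuned weights and tokenizer
trainer.save_model("./slack-gpt2")
tokenizer.save_pretrained("./slack-gpt2")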

Testing Your Model

After fine-tuning, you’ll want to test the model’s ability to mimic your user’s messages. The code below shows how to generate text based on the prompt “Hello “.

# Encode the prompt
input_text = "Hello "
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate a text sample (temperature only has an effect when sampling)
output = model.generate(
    input_ids,
    max_length=50,
    num_return_sequences=1,
    do_sample=True,
    temperature=1.0,
    pad_token_id=tokenizer.eos_token_id,
)

# Decode and print the text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

For more rigorous evaluations, consider adding performance metrics such as BLEU score or perplexity.
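As a rough illustration, perplexity can be computed directly from the model's loss on held-out text; the held_out string below is a stand-in for real validation data you would reserve from all_messages.

import torch

# Hypothetical held-out message; in practice, reserve a slice of all_messages
held_out = "Can someone review my PR before standup?"

inputs = tokenizer(held_out, return_tensors="pt")
with torch.no_grad():
    # Passing labels equal to input_ids returns the average causal-LM loss
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss)
print(f"Perplexity: {perplexity.item():.2f}")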

Conclusion

To conclude, you've walked through the basic steps for fine-tuning a pre-trained language model on a user's Slack messages. This is only an introduction, and there are numerous paths for enhancement: incremental message downloading, hyperparameter tuning, per-user conversational models, and more comprehensive evaluation, including checks for bias.
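As one example, incremental downloading can be as simple as recording the timestamp of the newest message you have already stored and passing it back to the Slack API on the next run; the fetch_new_messages helper below is an illustrative sketch, not part of the code above.

def fetch_new_messages(channel_id, last_seen_ts):
    """Fetch only messages posted after a stored timestamp (illustrative sketch)."""
    messages = []
    cursor = None
    while True:
        response = client.conversations_history(
            channel=channel_id,
            cursor=cursor,
            oldest=last_seen_ts,  # Slack only returns messages newer than this timestamp
        )
        messages.extend(
            m["text"] for m in response["messages"]
            if m.get("user") == user_id
        )
        cursor = response.get("response_metadata", {}).get("next_cursor")
        if not cursor:
            break
    return messages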

For a deeper dive, join me at my ODSC West talk. I’ll discuss real-world training experiences, interesting and weird behaviors we observed over a year, and share insights on associated risks and mitigation strategies.

About the author

Eli is the CTO and Co-Founder at Credo AI. He has led teams building secure and scalable software at companies like Netflix and Twitter. Eli has a passion for unraveling how things work and debugging hard problems. Whether it’s using cryptography to secure software systems or designing distributed system architecture, he is always excited to learn and tackle new challenges. Eli graduated with an Electrical Engineering and Computer Science degree from U.C. Berkeley.

Originally posted on OpenDataScience.com

Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday. You can also get data science training on-demand wherever you are with our Ai+ Training platform. Interested in attending an ODSC event? Learn more about our upcoming events here.
