Databricks Introduces Dolly 2.0: The World’s First Open Instruction-Tuned LLM
Databricks CEO and co-founder Ali Ghodsi took to LinkedIn to introduce Dolly 2.0, the world's first open-source, instruction-following LLM fine-tuned on a human-generated instruction dataset licensed for commercial use.
In a blog post, Databricks detailed Dolly 2.0. According to the post, Dolly 2.0 can follow instructions, enabling organizations to build, own, and customize LLMs for their specific needs. For example, a company that wants to use an LLM for sentiment analysis of customer reviews doesn't have to start from scratch: with Dolly, it can begin with a pre-trained LLM and fine-tune it on a dataset of customer reviews.
Dolly 2.0 is a 12-billion-parameter model based on EleutherAI's Pythia and has been fine-tuned exclusively on a new, high-quality, human-generated instruction-following dataset called databricks-dolly-15k. This is the first open-source, human-generated instruction dataset specifically designed to make LLMs exhibit the human-like interactivity of ChatGPT. Databricks made the dataset, the training code, and the model weights available to anyone for commercial use, with the dataset released under the Creative Commons Attribution-ShareAlike 3.0 Unported License.
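Instruction-following behavior of this kind comes from fine-tuning on prompt/response pairs rendered into a fixed prompt template. As a rough illustration only, the sketch below assembles such a prompt; the section markers ("### Instruction:" and so on) and the helper name are assumptions for illustration, not the official Databricks template.

```python
# Sketch of an instruction-style prompt template for an
# instruction-tuned model such as Dolly 2.0. The exact markers
# ("### Instruction:", "### Response:", etc.) are an assumption
# for illustration, not quoted from Databricks' release.

def build_prompt(instruction: str, context: str = "") -> str:
    """Assemble a single instruction-following prompt string."""
    parts = ["Below is an instruction that describes a task."]
    if context:
        # Optional supporting text (e.g. a passage to summarize).
        parts.append(f"### Context:\n{context}")
    parts.append(f"### Instruction:\n{instruction}")
    # The model is trained to continue the text after this marker.
    parts.append("### Response:\n")
    return "\n\n".join(parts)

prompt = build_prompt(
    "Classify the sentiment of this review.",
    "The product arrived broken and support never replied.",
)
print(prompt)
```

During fine-tuning, each dataset record is rendered through a template like this, and the loss is computed on the response portion, which is what teaches a base model to answer instructions rather than merely continue text.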
Databricks received several requests to use its LLMs commercially after releasing Dolly 1.0, which was trained on a dataset the Stanford Alpaca team had generated with the OpenAI API. Because that dataset contained ChatGPT output, and OpenAI's terms of service prohibit using its output to build models that compete with OpenAI, Dolly 1.0 was limited to non-commercial use. To overcome this limitation, Databricks created its own dataset, crowdsourcing it among its employees during March and April 2023.
Databricks set up a contest to create a high-quality dataset, offering a big award to the top 20 labelers. Databricks employees completed seven specific tasks: Open Q&A, Closed Q&A, Extract information from Wikipedia, Summarize information from Wikipedia, Brainstorming, Classification, and Creative writing. Each task helped Databricks build an original, high-quality dataset free of contamination from pre-existing content, such as text generated by other LLMs.
The databricks-dolly-15k dataset contains 15,000 human-generated prompt/response pairs specifically designed for instruction-following, ranging from brainstorming and content generation to information extraction and summarization. By making Dolly 2.0 open-source, Databricks aims to democratize access to LLMs, enabling organizations to build customized models without paying for API access or sharing data with third parties.
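To make the dataset's shape concrete, the sketch below parses a few records in the style of databricks-dolly-15k and tallies them by task category. The field names (instruction, context, response, category) follow the public dataset card on Hugging Face, but treat them as an assumption here; the record contents are invented for illustration.

```python
import json
from collections import Counter

# Three records in the style of databricks-dolly-15k (JSON Lines).
# Field names follow the public dataset card; the contents are
# invented here for illustration.
jsonl = "\n".join([
    json.dumps({"instruction": "Name three uses of Apache Spark.",
                "context": "",
                "response": "ETL, streaming, and machine learning.",
                "category": "brainstorming"}),
    json.dumps({"instruction": "Summarize the passage.",
                "context": "Databricks released Dolly 2.0, an open LLM.",
                "response": "Databricks open-sourced an instruction-tuned LLM.",
                "category": "summarization"}),
    json.dumps({"instruction": "Is this review positive?",
                "context": "Loved it!",
                "response": "Yes.",
                "category": "classification"}),
])

# Parse one JSON object per line, then count records per task category.
records = [json.loads(line) for line in jsonl.splitlines()]
by_category = Counter(r["category"] for r in records)
print(by_category)
```

Because every pair is plain prompt/response text under a permissive license, an organization can filter by category (say, keeping only classification examples) or mix in its own domain-specific pairs before fine-tuning.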
Originally posted on OpenDataScience.com