Structuring the Unstructured: Advanced Document Parsing for AI Workflows

ODSC - Open Data Science

Editor’s note: Cedric Clyburn is a speaker for ODSC East 2025, this May 13th-15th in Boston! Be sure to check out his talk, “Structuring the Unstructured: Advanced Document Parsing for AI Workflows,” to learn more about Docling and other tools!

As I’ve been building AI-enabled applications, there’s one recurring idea I keep coming back to: much of the data that’s useful to us is also useful to an AI model. This could include documents, PDFs, web content, and really anything that gives your model specific knowledge about your organization or helps the end user. But, particularly for enterprises, you’ll most likely be working with large amounts of data, in formats ranging from PDFs to Word documents to PowerPoints, that need to be cleaned and formatted before Large Language Models (LLMs) can use them in business applications.

Since data cleanup isn’t the most enjoyable activity, it’s common to use an online service, perhaps through an API, to send, process, and receive parsed data from these documents. But why use a third-party service when you could do everything locally? Just as the local AI revolution has driven the growth of tools like Ollama, a rising open-source project called Docling now offers advanced document processing and integration with common AI developer frameworks for RAG (Retrieval-Augmented Generation) and agentic applications.

Specifically, in this guide you’ll learn how to get started with the Docling project and use it from the command line to process both PDFs and tables. I also encourage you to join us at ODSC East 2025 in Boston, where we’ll do a deeper dive into building Q&A systems, agentic workflows, and more to extract incredible value from business data in AI applications!

Getting Started with Document Processing using Docling

Before we get hands-on with the project, it’s important to understand the steps necessary to prepare data for LLM-powered applications. These include mitigating document duplication, removing excess markup, context-aware text extraction, removing PII (personally identifiable information), and, for example, tokenizing/chunking documents into a vector store. Fortunately, the folks at IBM Research understood the goal of unlocking the insights trapped in valuable business documents, and built Docling around two specialized vision models (one for layout analysis and one for table structure recognition), achieving processing speeds up to 30 times faster than traditional OCR methods (I highly recommend checking out the research paper on arXiv).

Let’s get started! I’ll be running these commands from my local machine, but if you’d also like to follow along via Google Colab, simply click here. We’ll quickly install the Docling Python CLI & API using `pip install docling`, then perform a quick conversion from the command line, ingesting the research paper above as a PDF document and producing a resulting Markdown document (or JSON, HTML, etc.) in the folder where the command was run.

# Install the Docling CLI and Python API
pip install docling
# Convert the research paper PDF to Markdown (the default output format)
docling https://arxiv.org/pdf/2206.01062

As shown above, we can start to see the representation of the PDF, parsed into an LLM-readable format with headers, paragraphs, and more. Note that the first time you run Docling, it may take longer to process as it downloads the extraction models locally; measured with the `time` command, the whole process took 18.39 seconds on my M3 Mac. What’s critical in this conversion process is keeping the structure of the original source documents, which can be trivial for plain text, but what about more complex data representations, such as tables?
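By the way, the same conversion can be done in a few lines of Python. Here’s a minimal sketch using Docling’s `DocumentConverter` class; the output should match what the CLI produced:

from docling.document_converter import DocumentConverter

# Docling accepts local file paths or URLs as conversion sources
converter = DocumentConverter()
result = converter.convert("https://arxiv.org/pdf/2206.01062")

# Export the parsed document to Markdown, just like the CLI default
print(result.document.export_to_markdown())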

# Download a sample PDF that contains tables about the solar system
curl -o solar-system-overview.pdf 'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/solar-system-overview.pdf'
# Convert it, explicitly requesting Markdown output with the --to flag
docling --to md solar-system-overview.pdf

Here, we’re parsing an overview of the solar system, including the distances between the planets and the Sun, and other fun details. Fortunately, Docling specifically excels at recovering the logical structure of tables and the relationships between cells. While the command above shows the basic conversion, there’s also a DoclingDocument representation, a Pydantic datatype, for advanced users looking to get more specialized results from processed documents (for example, working with only specific columns of a table, modifying conversion results, etc.). And while we’re just looking at the CLI in these examples, there’s also a library for integrating Docling into your own applications.
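As a sketch of what that can look like, the snippet below converts the same PDF in Python and exports each recovered table to a pandas DataFrame via the table item’s `export_to_dataframe()` method; exactly how you filter columns from there will depend on your document:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("solar-system-overview.pdf")

# Each table recovered from the PDF is an item on the DoclingDocument;
# export_to_dataframe() turns one into a pandas DataFrame
for table in result.document.tables:
    df = table.export_to_dataframe()
    print(df.head())     # inspect the recovered rows
    print(df.columns)    # or work with only specific columns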

Wrapping Up

Data processing & cleanup isn’t an easy task, but fortunately, the open-source community has been working hard to make things a bit easier. As a developer myself, I’ve been able to start using Docling to process my business data from PDFs, export it to Markdown, and use it as training data for developing a domain-specific small language model. If you’re curious how this fits into a larger pipeline, I encourage you to check out the integrations the project provides, where you can get an idea of how to parse your documents into LangChain, LlamaIndex, Haystack, and many other frameworks in the AI ecosystem; one small example is sketched below.
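For instance, Docling ships a chunker you can use to split a converted document into pieces ready for embedding into a vector store for RAG. This is a minimal sketch using the `HybridChunker` from Docling’s chunking module, assuming default settings throughout; check the integration docs for framework-specific loaders:

from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

# Convert a document, then split it into tokenizer-aware chunks
doc = DocumentConverter().convert("solar-system-overview.pdf").document
chunker = HybridChunker()

for chunk in chunker.chunk(dl_doc=doc):
    print(chunk.text[:80])  # embed these chunks into your vector store

Many thanks for your time, and best of luck in your journey!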

About the Author & ODSC East 2025 Speaker

Cedric Clyburn (@cedricclyburn), Senior Developer Advocate at Red Hat, is an enthusiastic software technologist with a background in Kubernetes, AI, and container tools. He has experience speaking at and organizing conferences including DevNexus, WeAreDevelopers, The Linux Foundation, KCD NYC, and more. Cedric loves all things open source and works to make developers’ lives easier! He is based out of New York.

Website: www.cedricclyburn.com

LinkedIn: https://www.linkedin.com/in/cedricclyburn
