An Intro to Building Knowledge Graphs
Editor’s note: Sumit Pal is a speaker for ODSC East this April 23–25. Be sure to check out his talk, “Building Knowledge Graphs,” there!
Graphs and Knowledge Graphs (KGs) are all around us, and we use them every day without realizing it. GPS leverages graph data structures and databases to plot routes from point to point. Social networks are modeled as graphs. Cell networks use graphs and triangulation algorithms to determine which towers should route a call as a user moves from place to place.
KGs, built on top of graph databases, are omnipresent too. The moment you use a search engine like Google, Bing, or Baidu, a KG jumps into action to provide semantic, contextual search: search based not on "strings" and keywords but on "things" and concepts.
Emerging data management products such as data catalogs and data fabrics leverage KGs as their core linking and semantic engine. eBay, LinkedIn, the BBC, Thomson Reuters, JPMC, NASA, and many other Fortune 500 companies routinely leverage KGs.
What is a Knowledge Graph?
Before we discuss KGs, let us take a small detour to understand graph models. There are two main graph models: the Labeled Property Graph (LPG) and the Resource Description Framework (RDF).
Labeled Property Graph (LPG)
LPGs use labels on nodes and edges to characterize entities and relationships. Nodes are linked to other nodes through directed or undirected edges. Both nodes and edges carry properties modeled as single-valued key-value pairs with primitive data types. LPGs support "index-free adjacency," which makes them ideal for the traversals underlying graph algorithms such as shortest path between nodes, clustering, and centrality.
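As a minimal sketch, an LPG can be modeled with the networkx library (an assumption for illustration; any property-graph database such as Neo4j plays the same role), with key-value properties on nodes and edges and a shortest-path traversal:

```python
# A toy labeled property graph using networkx.
# Nodes and edges carry single-valued key-value properties.
import networkx as nx

g = nx.DiGraph()
g.add_node("alice", label="Person", age=34)
g.add_node("acme", label="Company", founded=1999)
g.add_node("bob", label="Person", age=41)
g.add_edge("alice", "acme", label="WORKS_AT", since=2015)
g.add_edge("bob", "alice", label="KNOWS")

# Index-free adjacency makes traversals like shortest path cheap.
path = nx.shortest_path(g, "bob", "acme")
print(path)  # ['bob', 'alice', 'acme']
```

The same structure supports the clustering and centrality algorithms mentioned above, which networkx also implements.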
Resource Description Framework (RDF) Model
RDF encodes semantic relationships between data items as triples composed of a Subject, a Predicate, and an Object. The Predicate is the graph edge connecting the Subject and Object endpoints. RDF uses Uniform Resource Identifiers (URIs) to identify the logical or physical resources in a triple.
The value of RDF lies in making statements and connecting concepts through relationships. It contextualizes data with ontologies, taxonomies, and vocabularies. RDF is used for data publishing and data interchange and is based on W3C standards. It supports schema evolution, and the formalism of RDF Schema (RDFS) allows semantics to emerge. Adherence to standards promotes alignment of meaning, unambiguous interpretation, interoperability, and semantic data integration.
Knowledge Graphs (KGs)
Think of a KG as a graph database with a knowledge toolkit. A KG models the knowledge of a domain as a graph: a network of entities and the relationships connecting them. It captures the facts of the domain and includes domain rules.
The knowledge model is a collection of interlinked descriptions of concepts, entities, relationships, and events. Concepts describe the data, and the connections between them provide the context that enables comprehension. KGs put data in context via linking and semantic metadata and provide a framework for data integration, unification, analytics, and sharing.
A KG modeled with RDF supports inferencing and reasoning, i.e. deriving new facts from existing ones. This enables entity resolution and relation extraction from structured and unstructured data.
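To make "deriving new facts from existing ones" concrete, here is a hand-rolled sketch of transitive inference in plain Python; a real RDF reasoner automates this from ontology rules, and the place names are illustrative:

```python
# Toy transitive inference: if A is located_in B and B is located_in C,
# derive the new fact that A is located_in C.
facts = {("Boston", "located_in", "Massachusetts"),
         ("Massachusetts", "located_in", "USA")}

def infer_transitive(facts, predicate="located_in"):
    derived = set(facts)
    changed = True
    while changed:  # repeat until no new facts appear (a fixpoint)
        changed = False
        for (a, p1, b) in list(derived):
            for (c, p2, d) in list(derived):
                if p1 == p2 == predicate and b == c:
                    new = (a, predicate, d)
                    if new not in derived:
                        derived.add(new)
                        changed = True
    return derived

inferred = infer_transitive(facts)
print(("Boston", "located_in", "USA") in inferred)  # True
```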
Not every graph is a KG. The figure below shows how overlaying an ontology (a shoe ontology) enhances and enriches the original graph, enabling automated reasoning (shown on the right-hand side).
A KG is a representation of an organization's knowledge, domain, and artifacts that both humans and machines can understand. KGs help organizations create a knowledge model representing the business and the entities in its domain. This semantic network of facts is used for data integration, knowledge discovery, and analysis.
Why Knowledge Graphs — Use Cases
KGs can be used in multiple ways: as a database that can be queried, as a graph that can be analyzed as a network, and as a knowledge base from which new facts can be inferred. They can surface previously unknown connections and enable inferencing and rule-based reasoning to automate the generation of new knowledge through relationship discovery and exploration.
Uses and applications of knowledge graphs include data and information-heavy services like contextually aware content recommendation, drug discovery, semantic search, investment market intelligence, information discovery in regulatory documents, advanced drug safety analytics, and much more.
The mind maps below show the range of capabilities of KGs.
How to Build Knowledge Graphs
A KG is not a one-off engineering project. Building a KG requires collaboration between functional domain experts, data engineers, data modelers, and key sponsors. It requires an ontology, a taxonomy, a vocabulary, graph databases, semantic mapping tools, a data mapping framework, and the ability to extract data from heterogeneous sources.
A taxonomy is a classification scheme, a knowledge map: the information model that describes and structures information in a hierarchy. It is effective for organizing content and data, and it captures context and meaning, making data easy to find and understand. Examples include the Dewey Decimal System for books and the classification of living things (Kingdom, Phylum, Class, Order, Family, Genus). A taxonomy provides consistent metadata and tagging, helping to improve precision and recall, and is the foundation for building smart search and discovery applications.
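As a sketch, a taxonomy is just a hierarchy; modeling the biological ranks above as a child-to-parent mapping lets us recover the full context of any term by walking up the chain:

```python
# A taxonomy as a child -> parent mapping over the biological ranks.
taxonomy = {
    "Genus": "Family",
    "Family": "Order",
    "Order": "Class",
    "Class": "Phylum",
    "Phylum": "Kingdom",
}

def ancestors(term):
    """Walk up the hierarchy, collecting every broader rank."""
    chain = []
    while term in taxonomy:
        term = taxonomy[term]
        chain.append(term)
    return chain

print(ancestors("Genus"))  # ['Family', 'Order', 'Class', 'Phylum', 'Kingdom']
```

In practice a taxonomy would be published in a standard vocabulary format such as SKOS rather than a Python dict, but the broader/narrower structure is the same.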
An ontology is the schema for graph data: it identifies and distinguishes concepts and relationships. It is a shared vocabulary that describes the semantics of domain data. A lack of ontology creates ambiguity.
Well-known ontologies and taxonomies are publicly available and can be reused and adapted for domain-specific applications.
The figure below shows the 10 steps to building KGs.
KGs and LLMs
LLMs and KGs cross-pollinate, and their convergence enables synergistic solutions.
LLMs can enrich KGs through relation and event extraction from text. They can aid KG construction with ontology prompting and generate textual descriptions for entities. LLMs can classify entities in a KG and assist knowledge retrieval by generating graph search queries, summarizing graph query results, and explaining complex queries and schemas. Using the RAG pattern, LLMs can be enriched by KGs to leverage proprietary documents and metadata, and they can accelerate KG development by bootstrapping from a given ontology or taxonomy.
KGs can improve accuracy and reduce hallucinations in LLMs by providing a factual foundation that anchors and validates responses, allowing LLM output to be supported by reasoning. Their structured domain representation enhances generative AI performance by providing context that furthers understanding. KGs facilitate knowledge retrieval and integration, enriching and unifying diverse structured and unstructured data and incorporating relevant information into LLM responses. KGs also provide explainability, transparency, and provenance, helping users understand and validate LLM responses.
What’s next?
Please join the session "Building Knowledge Graphs" on Day 2 (04/24/2024) at 3:40 PM to learn more about knowledge graphs and how to build them.
About the Author:
Sumit Pal is an ex-Gartner VP Analyst in the data management and analytics space, where he advised CTOs, CDOs, CDAOs, enterprise architects, and data architects on data strategy, data architecture, and data engineering for building data platforms. Sumit spans the spectrum, from formulating data strategy with CDO/CTO teams to architecting, designing, and building data platforms and solutions to writing, deploying, and debugging code. He has more than 25 years of experience in data and software roles at companies ranging from startups to large enterprises, building, managing, and guiding teams that deliver scalable software systems across the stack: the middle tier, data layer, analytics, ML, data engineering, DataOps, data architectures, data lakes and lakehouses, NoSQL, database internals, data warehousing, dimensional modeling, data science, and Java/J2EE. He is the published author of a book on SQL engines, developed a MOOC course on Big Data, and hiked to Mt. Everest Base Camp in October 2016. He blogs at https://sumitpal.wordpress.com.
Originally posted on OpenDataScience.com
Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday. You can also get data science training on-demand wherever you are with our Ai+ Training platform. Interested in attending an ODSC event? Learn more about our upcoming events here.