Why Provenance is the Key to AI Success: Knowledge Graph Ontology Design
Editor’s note: Henri is a speaker for ODSC East 2022. Be sure to check out this talk, “A Global Knowledge Graph of People, Skills, And Companies: How Ontology Design is Key to Enabling AI Solutions in HR,” there!
I’m not going to start this blog by using lots of superlatives describing how much data there is in the world. Take it as a given — there is a near limitless amount of data in the world.
In my role at Beamery, we are centralizing our understanding of the world in a knowledge graph, by aggregating data from a large variety of sources. This includes everything from Human Capital Management systems to Wikipedia pages. We are a full-talent lifecycle company — so our domain is people, companies, skills, and experiences.
We have billions and billions of facts stored as knowledge — and it’s growing all the time. All of this knowledge is stored in a graph using RDF semantic web technology. Whilst we work in the talent technology space, what follows could be applied to literally any domain — scientific, people, sales, inventory — anything that has even an ounce of business value.
Before we dive in on the provenance conundrum, let’s briefly explain what a knowledge graph is. In my own words, we are referring to:
“A highly flexible NoSQL database which represents data as ‘knowledge’ through a graph-like structure of nodes and edges. Information is represented much like someone might draw a mind map, or creatively relate ideas together on a piece of paper.
The nodes that refer to the knowledge are often defined in an ontology, that is, the concepts that describe the domain. They can be traversed semantically using domain knowledge.”
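To make the "nodes and edges" idea concrete, here is a minimal sketch in plain Python of facts stored as subject, predicate, object triples. The `ex:` identifiers and the helper names are hypothetical, chosen only for illustration. A real RDF store would use full IRIs and a query language such as SPARQL.

```python
from collections import defaultdict

# Hypothetical facts, each written as a (subject, predicate, object) triple.
TRIPLES = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:hasSkill", "ex:python"),
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:acme", "rdf:type", "ex:Company"),
    ("ex:python", "rdf:type", "ex:Skill"),
]


def build_index(triples):
    """Index outgoing edges by subject so a node's neighbours are one lookup away."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


def objects(index, subject, predicate):
    """Return every object linked from `subject` by `predicate`."""
    return [o for p, o in index.get(subject, []) if p == predicate]


index = build_index(TRIPLES)
alice_skills = objects(index, "ex:alice", "ex:hasSkill")
alice_employers = objects(index, "ex:alice", "ex:worksFor")
```

Adding new knowledge is just appending another triple. No schema migration is needed, which is the flexibility referred to above.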
People often think about visualizing knowledge graphs, using diagrams like this:
Fig 1. Knowledge graph visualization
Whilst this can be quite sexy for marketing material, due to the sheer amount of data it’s often impractical. Realistically, the only benefit is to understand the classes that comprise an ontology, rather than the instances of these classes.
Now, not all knowledge graphs use the same underlying technology. In my career, I’ve almost always used Resource Description Framework (RDF), an open W3C standard that underpins what is often called the semantic web. We chose to adopt this because:
- The technology is widely adopted in open data circles — meaning we can make use of publicly available linked data.
- There is a strong emphasis put on ontology design, meaning we can control the concepts that describe our domain. It also means we can semantically traverse the graph.
- The nature of graph databases makes it extremely easy to add new data as knowledge.
- An open standard means we can remain database vendor agnostic.
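As a rough illustration of what "semantically traversing the graph" can mean, the sketch below walks a small class hierarchy in the spirit of `rdfs:subClassOf`, so that asking for every engineer also returns software engineers. The classes and people are hypothetical, and the sketch assumes each class has at most one parent, which real ontologies do not require.

```python
# Hypothetical class hierarchy: child class -> parent class.
SUBCLASS_OF = {
    "ex:SoftwareEngineer": "ex:Engineer",
    "ex:Engineer": "ex:Occupation",
    "ex:Nurse": "ex:Occupation",
}

# Hypothetical instances: entity -> its most specific class.
INSTANCE_OF = {
    "ex:alice": "ex:SoftwareEngineer",
    "ex:bob": "ex:Nurse",
    "ex:carol": "ex:Engineer",
}


def ancestors(cls):
    """Walk the subclass links upwards and return every superclass of `cls`."""
    found = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        if cls in found:  # guard against an accidental cycle
            break
        found.append(cls)
    return found


def instances_of(target):
    """Return entities typed as `target` or as any of its subclasses."""
    return sorted(
        entity
        for entity, cls in INSTANCE_OF.items()
        if cls == target or target in ancestors(cls)
    )


engineers = instances_of("ex:Engineer")
everyone_with_occupation = instances_of("ex:Occupation")
```

The query names only the general concept. The ontology supplies the knowledge that a software engineer is also an engineer.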
Here is a very professional diagram I have previously created to demonstrate where RDF exists in the wider database ecosystem. Note that proprietary systems like TigerGraph and Neo4j are not RDF databases.
Fig 2. RDF in the existing database ecosystem
A discussion of why we chose RDF and OpenLink Virtuoso was previously written by my great colleague Kasper. A full list of RDF databases can be found here.
What is provenance?
Now for some more definitions. When we discuss data provenance (often referred to as lineage), we are referring to metadata that describes the origin of data. My semantic web friends who authored the PROV ontology have provided a more concrete definition:
“a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.”
Crucially, data being ingested into a knowledge graph can be derived from almost any source, open or proprietary. The idea is that we aggregate disparate datasets into a single unified source — knowledge. This is one of the reasons provenance is so crucial.
The flexibility of RDF comes into its own when we consider provenance. We can easily add new entities that describe the provenance of a given core entity. The working group for PROV originally authored an ontology to describe the provenance of data in any domain using RDF:
Fig 3. PROV core concepts
I’m conscious that this can all seem a little abstract. In reality, we adapt the above abstract concepts into our wider ontology to show the lineage of a particular asset or entity in our own domain. For instance, a snapshot of a company’s webpage captured at a monthly cadence provides assets for a given entity. This is why having a semi-rigid ontology is key. It allows us to ensure that concepts in our ontology are effectively attributed using provenance principles.
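The webpage example can be sketched with a handful of triples. The property names (`prov:wasDerivedFrom`, `prov:wasGeneratedBy`, `prov:wasAssociatedWith`, `prov:generatedAtTime`) are terms from the PROV ontology. Everything in the `ex:` namespace, the date, and the `lineage` helper are assumptions made for illustration, not our production model.

```python
# Hypothetical provenance for one fact extracted from a monthly webpage snapshot.
PROV_TRIPLES = [
    ("ex:acme-sector-fact", "rdf:type", "prov:Entity"),
    ("ex:acme-sector-fact", "prov:wasDerivedFrom", "ex:acme-homepage-2022-03"),
    ("ex:acme-homepage-2022-03", "rdf:type", "prov:Entity"),
    ("ex:acme-homepage-2022-03", "prov:wasGeneratedBy", "ex:crawl-2022-03"),
    ("ex:acme-homepage-2022-03", "prov:generatedAtTime", "2022-03-01T00:00:00Z"),
    ("ex:crawl-2022-03", "rdf:type", "prov:Activity"),
    ("ex:crawl-2022-03", "prov:wasAssociatedWith", "ex:monthly-crawler"),
    ("ex:monthly-crawler", "rdf:type", "prov:Agent"),
]

# Predicates that lead from a resource back towards its origin.
LINEAGE_PREDICATES = (
    "prov:wasDerivedFrom",
    "prov:wasGeneratedBy",
    "prov:wasAssociatedWith",
)


def lineage(triples, start):
    """Follow lineage predicates breadth-first from `start`; return the steps taken."""
    steps, frontier, visited = [], [start], {start}
    while frontier:
        node = frontier.pop(0)
        for subject, predicate, obj in triples:
            if subject == node and predicate in LINEAGE_PREDICATES:
                steps.append((subject, predicate, obj))
                if obj not in visited:
                    visited.add(obj)
                    frontier.append(obj)
    return steps


trail = lineage(PROV_TRIPLES, "ex:acme-sector-fact")
```

Here `trail` runs from the fact to the snapshot it was derived from, then to the crawl that produced the snapshot, then to the agent responsible for the crawl.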
In my current role, we use a flavor of the ADMS ontology, which was originally based on PROV. The core concepts are described below:
Fig 4. ADMS core concepts
What does maintaining provenance in ontology design enable?
- A perfect playground for data science: the beauty of RDF knowledge is that data is held in a highly flexible manner. It can be extracted at any granularity for machine learning tasks, including subgraphs for graph learning problems, all whilst maintaining the lineage of where the data came from.
- Ensures data quality: in data science, there is a common catchphrase, “garbage in, garbage out”. In a knowledge graph holding such a huge amount of data from disparate sources, knowing where the data came from is crucial.
- Maintains context: even if the data is of high quality, it is important that we understand the context behind it. For instance, the sector classifications used by two different company intelligence websites are not the same, even if the labels might be identical in some places.
- Entity reconciliation: one of the biggest problems in the digital world is understanding when two separate pieces of information ultimately refer to the same instance of the same concept. This is known as entity reconciliation and can be easily enabled using provenance modeling techniques.
- Compliance and security: understanding the origin of data that could potentially end up in the hands of a customer is crucial to ensuring compliance.
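A few of the points above can be sketched together in plain Python. Each statement carries the source it came from, so facts can be filtered by source, candidate duplicates can be grouped, and conflicting values can be kept side by side with their context. The sources, identifiers, and the "same website means same company" rule are hypothetical simplifications. Real entity reconciliation needs far more evidence than a single shared attribute.

```python
from collections import defaultdict
from typing import NamedTuple


class Statement(NamedTuple):
    subject: str
    predicate: str
    obj: str
    source: str  # provenance: which dataset asserted this fact


# Hypothetical statements about companies from three hypothetical sources.
STATEMENTS = [
    Statement("src-a:acme", "ex:website", "acme.example", "ex:source-a"),
    Statement("src-a:acme", "ex:sector", "Technology", "ex:source-a"),
    Statement("src-b:acme-ltd", "ex:website", "acme.example", "ex:source-b"),
    Statement("src-b:acme-ltd", "ex:sector", "Software", "ex:source-b"),
    Statement("src-c:acme", "ex:sector", "Retail", "ex:source-c"),
]


def from_sources(statements, trusted):
    """Keep only statements asserted by a trusted source."""
    return [s for s in statements if s.source in trusted]


def reconcile_on(statements, predicate):
    """Group subjects that share a value for `predicate` (a naive matching rule)."""
    groups = defaultdict(set)
    for s in statements:
        if s.predicate == predicate:
            groups[s.obj].add(s.subject)
    return {value: sorted(subs) for value, subs in groups.items() if len(subs) > 1}


def values_by_source(statements, subjects, predicate):
    """Show what each source says about a set of reconciled subjects."""
    return {
        s.source: s.obj
        for s in statements
        if s.subject in subjects and s.predicate == predicate
    }


trusted = from_sources(STATEMENTS, {"ex:source-a", "ex:source-b"})
matches = reconcile_on(trusted, "ex:website")
sectors = values_by_source(trusted, matches["acme.example"], "ex:sector")
```

Because the source travels with each statement, the two sector labels for the same company are not silently merged. They stay attributable to the classification scheme they came from.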