Dressing in Layers
by
Paco Nathan
KGC Editor

When we talk about working with knowledge graphs, we have many detailed standards and protocols for specific parts – such as SHACL constraints or SPARQL endpoints. However, we don’t have an overarching strategy for how to build and use a KG. Nor do we have much sense of team process: how all the people involved fit into the big picture.

Consequently, we find different teams moving in many different directions, often with little common ground in which to share their experiences. We find lots of vendors making lots of interesting product claims, and while perhaps there are related benchmarks to use for testing some of those claims, we don’t have good ways to compare and contrast the different tools used in KG work.

This dilemma came up recently in Alan Morrison’s “Personal Knowledge Graph” discussion group. People in the discussion had experience with a variety of tools, although we struggled with describing those experiences to each other. We struggled even more with objective comparisons of KG tools that offer diverse features.

One outcome was a small GitHub public repo that shows and briefly describes a layered model. Let’s explore that model here and open up the floor for discussion. Comments, critiques, suggestions, and so on are highly welcomed.

Functional Models

First, some important background. In the 1970s, back when internetworking was a very young field and evolving quite rapidly, the source code used for computer networking was often described as “ad hoc,” “proprietary,” “difficult to understand,” and so on – with other, less polite descriptions besides. The early ARPANET used a protocol called NCP for some of the “middle layers” of networking, although the upper and lower parts differed wildly, depending on which vendors you were using and where you were located. I can tell you that it was a big hot mess!

On January 1, 1983, Vint Cerf coordinated from Stanford the “flag day” transition that cut over from NCP to the new TCP/IP layered suite of protocols, which became the modern internet. Even though servers across the entire network had to be shut down and restarted, the long-term results appear to have turned out rather swimmingly.

That previous jumble of ad hoc or proprietary code gave way to a conceptual model of a “layered stack” in which each layer provides abstractions for some necessary work, while higher layers don’t need to fuss with the details of lower layers. In network engineering, we call this a functional model, and the OSI model from the late 1970s – developed by ISO (International Organization for Standardization) – is a good example: it splits networking functions into seven layers, with protocol standards such as TCP and IP mapping onto them. That functional model helps people understand where products could fit together, and how they could fit together – in other words, how to avoid comparing “apples and oranges.” In terms of designing large and complex distributed systems, this layered model represents manna for architects and implementers alike.

To have more common ground for comparing our experiences, it seemed like a reasonable idea to apply similar thinking to our domain to help describe where all the various parts fit in knowledge graphs.

KG Layers

Next, let’s take a tour through each of the KG layers. Please remember that this is a work in progress, subject to much discussion and iteration. To that point, what do you think? Does the following fit with your experiences? Are there parts that could be simplified or made more general?



Layer 1: Remote Storage

If you’re going to collaborate with other people, then you’ll need to share some kind of data repository. The lowest, most foundational layer is storage.

It’s no surprise that when Amazon went out to build cloud computing, one of the first – and most popular – services offered was their S3 layer for a storage grid. It's also where so much of their technology investment has continued to perform miracles over the years.

We find popular storage grids such as Amazon S3, Azure Storage, Google GCS, etc., at Layer 1. These are amazingly robust and cost-effective, although relatively “raw” in the sense that they are neither file systems nor databases. Also, they are mostly designed for programmers (or applications) to use. Building on them enables capabilities for collaboration, publishing, disaster recovery, and so on.

Layer 2: Versioning

Services such as GitHub, GitLab, etc., are at Layer 2. These typically bundle the versioning semantics of a tool such as Git along with a storage grid, then provide ways to publish (e.g., jumping all the way up to Layer 9). Graph-based data and metadata can be difficult to version – or rather, there are specialized methods that tools such as Git do not necessarily understand. Even so, Git understands how to version the text and other components that go into a KG instance.
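To see what “versioning semantics” means at this layer, consider how Git stores the text components that go into a KG instance: every file becomes a content-addressed blob, whose ID is the SHA-1 of a small header plus the raw bytes. A minimal sketch in Python (the Turtle snippet is a made-up example):

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Compute the ID Git assigns to a file's content (a "blob" object).

    Git hashes the header b"blob <size>\\0" followed by the raw bytes,
    so identical content always gets the same ID -- the basis of Git's
    deduplicating, content-addressed storage.
    """
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# The same Turtle snippet always hashes to the same blob ID,
# no matter who commits it or when:
ttl = b"<urn:x> a <urn:Thing> .\n"
print(git_blob_hash(ttl))
```

This is why Git versions the textual parts of a KG so well: any change to the bytes produces a new ID, while unchanged components are stored exactly once.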

Layer 3: Semi-Structured Data

The overall “narrative arc” here is that we’re moving from data and metadata components, which are relatively scattered and disconnected, into something that’s relatively well organized and structured in well understood ways. With storage and versioning details handled by the lower layers, we can begin to structure the content.

Let’s start with two simple, common forms: CSV for data and text files. The former is simple, at least for anyone who’s worked with spreadsheets. For the text, use of markdown at Layer 3 is one among many popular formats. It has the benefit of being relatively human-readable, even in its raw form, while also simple to parse by machines using popular libraries. It’s also native in Jupyter notebooks, as well as one of the most popular formats for documenting open source projects, and increasingly gets used among technical publishers as well.
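Both forms parse with a few lines of standard-library Python. The heading regex below is a deliberate simplification for illustration – enough to recover a document outline, not a full markdown parser – and the CSV row is a made-up example:

```python
import csv, io, re

# CSV: rows of semi-structured data, parsed by the standard library
rows = list(csv.DictReader(io.StringIO("title,year\nWeaving the Web,1999\n")))

# Markdown: extract ATX headings ("# Title", "## Section", ...) with a
# simple regex, recovering the outline of a document
md = "# Notes\n\n## Knowledge Graphs\nSome text.\n"
headings = [(len(m.group(1)), m.group(2))
            for m in re.finditer(r"^(#{1,6})\s+(.+)$", md, re.MULTILINE)]

print(rows)      # [{'title': 'Weaving the Web', 'year': '1999'}]
print(headings)  # [(1, 'Notes'), (2, 'Knowledge Graphs')]
```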

Layer 4: Semantic Markup

The term markdown is a proper noun, and a play on words: the term markup comes from way back in print editing. The latter term got reused for the world wide web to describe how to annotate text content into HTML web pages. In other words, simple ways to add semantics.

The semantic markup at Layer 4 begins to add some semantic properties to our semi-structured content from the lower layer – for example, ways that we can describe links and other metadata.

Services such as Obsidian and Roam are largely at this layer. They provide services based on markdown content that can be versioned and shared, and in some cases bundle that with cloud-based storage.
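Much of the semantic layering in these tools comes from `[[wiki-link]]` syntax atop plain markdown. As a sketch of the idea (the note text is hypothetical, and the regex ignores aliases after `|` and anchors after `#`):

```python
import re

def extract_wikilinks(text: str) -> list[str]:
    """Return the targets of [[wiki-links]] in a markdown note,
    stripping alias (|) and section anchor (#) suffixes."""
    return re.findall(r"\[\[([^\]|#]+)", text)

note = "Compare [[RDF]] with [[Property Graphs|LPG]]; see also [[SPARQL#Basics]]."
print(extract_wikilinks(note))  # ['RDF', 'Property Graphs', 'SPARQL']
```

Collect these link targets across a vault of notes and you have an adjacency list – the skeleton of a personal knowledge graph.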

Layer 5: Collaborative Editing

Layer 5 is what people commonly associate with online services such as Google Docs or Box. A programming technique called the append-only log makes collaborative editing feasible to manage online. This work is by definition transactional in nature – and almost magical when an entire Zoom party of people watches a document being edited together in near real time.
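The append-only log idea is simple: edits are recorded as immutable operations, and the current document is derived by replaying the log in order. A toy sketch (not a real OT/CRDT implementation, which must also handle concurrent edits):

```python
log = []  # append-only: operations are only ever added, never modified

def apply_edit(pos: int, insert: str = "", delete: int = 0):
    """Record an edit as an operation; the log itself is the source of truth."""
    log.append({"pos": pos, "insert": insert, "delete": delete})

def replay() -> str:
    """Derive the current document by replaying every operation in order."""
    doc = ""
    for op in log:
        p = op["pos"]
        doc = doc[:p] + op["insert"] + doc[p + op["delete"]:]
    return doc

apply_edit(0, insert="Hello world")
apply_edit(5, insert=",")                 # -> "Hello, world"
apply_edit(7, delete=5, insert="there")   # -> "Hello, there"
print(replay())
```

Because the log only grows, it doubles as a version history – which is why these services can bundle aspects of Layer 2 versioning almost for free.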

These services typically bundle Layer 1 storage along with some aspects of Layer 2 versioning. Generally, these services lack much awareness about markdown formats specifically and tend to be MS Word lookalikes.

Layer 6: Shared Vocabulary

The shared vocabulary in Layer 6 is where a project attempts to harmonize the semantic markup with more commonly used controlled vocabularies. Why? To make sure the metadata is referencing shared definitions that others can use for multiple purposes: queries, auditing, indexing, reporting, training machine learning models, inference, and so on. This is the stuff that databases excel at performing, with the not-so-small addition of the 800-pound gorilla in the room called inference.

Examples include DCMI and Schema.org among many others. In aggregate, this is where ontology gets described.
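In practice, harmonizing with a shared vocabulary often means expressing metadata as JSON-LD whose keys resolve to Schema.org or Dublin Core terms, so every consumer interprets the fields the same way. A hand-built sketch – the vocabularies are real, the book is hypothetical:

```python
import json

# JSON-LD: the @context maps local keys onto shared vocabulary IRIs,
# so "name" unambiguously means http://schema.org/name to any consumer.
record = {
    "@context": {
        "@vocab": "http://schema.org/",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "Book",
    "name": "Example Title",
    "author": {"@type": "Person", "name": "A. Author"},
    "dct:issued": "2021",
}
print(json.dumps(record, indent=2))
```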

Layer 7: Reification

The word reification is perhaps a loaded term within the knowledge graph community, given all the arguments concerning semantic graphs and property graphs. Even so, it’s probably the most accurate term to use for Layer 7. The mix of vocabularies in Layer 6 provides structure that is sometimes called TBox, though that terminology is somewhat dated. ABox provides the flip side of this coin, where the structure gets populated with facts. That’s the definition of reification. This is where a KG instance goes beyond being a general ontology and begins to represent specifics.

An example is using persistent identifiers to “populate” the semantic markup, so that content can be referenced globally through unique identifiers. Perhaps you have metadata to describe a book: title, authors, publisher, publication year, etc. A human could search on Google, Amazon, library software, etc., and probably find it.

However, if you use an ISBN to identify the book, then machines can begin to identify it unambiguously as well. There is ISSN for periodicals, ORCID for researchers, ROR for organizations, DOI for articles, etc. Using a URL is perhaps the simplest case.
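A nice property of persistent identifiers is that they are machine-checkable, not just globally unique. For instance, the final digit of an ISBN-13 is a checksum, so software can validate an identifier before trusting it (the sample ISBN below is the standard documentation example, not a real book):

```python
def isbn13_valid(isbn: str) -> bool:
    """Validate an ISBN-13: its digits, weighted 1,3,1,3,...,
    must sum to 0 modulo 10."""
    digits = [int(c) for c in isbn.replace("-", "") if c.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0

print(isbn13_valid("978-3-16-148410-0"))  # True
print(isbn13_valid("978-3-16-148410-1"))  # False -- bad check digit
```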

Another example at Layer 7 is taxonomy. While Layer 6 provides for vocabularies such as SKOS to organize classifications, it cannot magically organize the classifications required within your KG instance. For example, suppose you need to represent a graph about technical documents where artificial intelligence and knowledge graph are two of the popular content categories. Does one of those belong as a subcategory of the other? Where does natural language or machine learning fit?
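Once you make those decisions, they can be recorded explicitly as SKOS-style broader relations, so the hierarchy becomes queryable rather than implicit. A toy sketch – the arrangement chosen below is one debatable answer to the questions above, not a recommendation:

```python
# SKOS-style "broader" relations: each concept points to its parent
# category, or None if it is top-level. This particular arrangement
# is an assumption for illustration.
broader = {
    "machine learning": "artificial intelligence",
    "natural language": "artificial intelligence",
    "artificial intelligence": None,
    "knowledge graph": None,   # modeled here as its own top-level category
}

def ancestors(concept: str) -> list[str]:
    """Walk broader links from a concept up to its top-level category."""
    chain = []
    while (parent := broader.get(concept)) is not None:
        chain.append(parent)
        concept = parent
    return chain

print(ancestors("machine learning"))  # ['artificial intelligence']
```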

In the book example, using title, authors and year might seem good enough. In the taxonomy example, making artificial intelligence and natural language both top-level categories might seem good enough. However, think about the impact for downstream usage. A “good enough” representation does not guarantee effective inference down the road in the KG use cases. In other words, if you’re building a KG to support a recommender system, how good do you want the results to be?

Layer 8: Knowledge Graph

Within Layer 8 is where the knowledge graph work happens. In other words, the lower layers provide all of the elements for building a KG, although not the means for manipulating one. Layer 8 is where SPARQL queries happen, SHACL rules can be applied, OWL closures can “fill in the gaps” in a graph, etc.

Keep in mind that Layer 8 has its own internal layering: RDF for triples/quads, RDFS for defining classes and properties, OWL for machine interpretability, and so on. Both the Layer 6 ontology and the Layer 7 identifiers must exist as metadata “overlays” atop the content (in other words, as semantic annotation) for the Layer 8 use cases to make any sense.
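To make the querying at this layer concrete without depending on a SPARQL engine, here is a toy triple store with pattern matching, where None plays the role of a SPARQL variable. (A real stack would use a library such as rdflib; this is only a sketch of the idea, with made-up triples.)

```python
# A knowledge graph at its simplest: a set of
# (subject, predicate, object) triples.
triples = {
    ("ex:Book1", "rdf:type", "schema:Book"),
    ("ex:Book1", "schema:author", "ex:Person1"),
    ("ex:Person1", "rdf:type", "schema:Person"),
    ("ex:Person1", "schema:name", "A. Author"),
}

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None behaves like a variable."""
    return sorted(t for t in triples
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))

# Roughly: SELECT ?s WHERE { ?s rdf:type schema:Book }
print(match(p="rdf:type", o="schema:Book"))
```

SPARQL engines generalize exactly this: joins over triple patterns, plus indexing, inference, and federation on top.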

Layer 9: Publishing

Finally, there’s a presentation layer at the top, roughly similar to the presentation layer in network models. Layer 9 is the proverbial “last mile” where publishing and accessing the KGs occur. Since the W3C standards emerged from the world wide web, many of their abstractions (Solid, LDP, etc.) seem to fixate on this layer, although they’re not very articulate about details of some of the underlying foundations. Layer 9 is where many KG use cases provide features for public access.

Publishing may be a matter of:

  • web-based rendering
  • search and query capabilities
  • API access

Layers of Layers

Speaking of layers, the idea of a KG itself as a kind of layer (overarching all of the above) has been in the news recently.

The notion of a data mesh is generating buzz. For the past several years, much of the big data community has been pushing data lakes: put all of your data into one place. In practice, however, that approach tends to have problems at scale. The data mesh architecture moves beyond the “monolithic” data lake to draw from distributed architectures. As Zhamak Dehghani describes, the term data mesh follows from the popular practice of using a service mesh to organize a microservice architecture: the two ideas mirror each other, one layer for the data and one for the services that use it.

However, there’s a growing recognition that another layer should go between these two. This entails using graph-based approaches to provide context for the data by organizing its metadata – in other words, using a KG in between to help make the data components “cohere” across an organization. This topic was explored at Metadata Day 2020.

KGC 2021 is coming up on May 3–6, and some of the people mentioned above will be presenting.
  • Zhamak Dehghani will be KGC Keynote Speaker.
  • Alan Morrison will lead a workshop about personal knowledge graphs.

Again, all feedback is welcomed. An especially great place for those discussions is on the KGC Slack board.

********

Many thanks to the people who helped shape this discussion: Chu Nnodu, Phil Taylor and Joaquin Melara.

Knowledge Graph Industry Survey
Tell us what's on your mind, and you may win a free pass to KGC 2021.

KGC and the Enterprise Knowledge Graph Foundation are partnering on a survey to learn about your use cases for knowledge graphs, your concerns, and your aspirations for KG technology.

This survey takes only 6-8 minutes to complete and we plan to unveil the results at Knowledge Graph Conference 2021 in early May. Your participation can help make KGC 2021 a more informative and meaningful experience for all. Plus you'll be entered in a raffle for free passes to the event.
Take Survey
KGC 2021 Volunteer Opportunities
We're seeking volunteers to help stage our multi-day virtual event from May 3-6. Multiple shifts are available and you can volunteer for a day and time that works for you.
Apply Here
APRIL EVENTS

*** The first two Knowledge Espresso sessions will
discuss the upcoming KGC 2021 event. ***

  • Apr 13: Knowledge Espresso: 3 PM ET. Juan Sequeda, principal scientist at data.world and Chair of the KGC 2021 Program Committee, will join KGC co-founders Thomas Deely and Francois Scharffe to give an overview of the final KGC 2021 program. Register to join here.
  • Apr 15: Knowledge Espresso: 11 AM ET. We’ll be joined by Semantic Web Company’s Andreas Blumauer, Oracle’s Melliyal Annamalai, Fluree’s Brian Platz, Stardog’s Mike Grove, data.world’s Bryon Jacob, and RelationalAI’s Ben Humphries and Steve Herskovitz, who will give a short description of their talks. Register to join here.

  • Apr 16: Office Hours with Paco Nathan

  • Apr 29: Knowledge Espresso: KGC's Francois Scharffe with TBD

Stay up to date with the KGC Events Calendar

Mark Your Calendars for
KGC 2021 on May 3-6!
 
Join us for our third annual Knowledge Graph Conference. We have exciting talks planned on such topics as KG applications in enterprise data architecture, natural language processing, retail, biomedical research and clinical data management, and more!
 
Early Bird Tickets on Sale Now
Buy Tickets
KGC Q&A BOARD
Have a question or knowledge to share about Knowledge Graphs? Check out our Q&A Board and see the latest hot topics in the KG Community.
 
Recently there was an interesting discussion of Data Mesh and Knowledge Graphs between Phil Taylor and KGC's Paco Nathan.
Read It Here
Knowledge Graphs in Mental Therapy
Nariman Ammar gave a fascinating presentation at Knowledge Connexions 2020 on how knowledge graphs can be used to develop treatment plans for the effects of adverse childhood experiences. Her article in JMIR Medical Informatics covers the topic in depth.
 
Read it Here

JOIN THE COMMUNITY
Let us know what you're working on in Slack.
Twitter
Facebook
Website
Copyright © 2021 Knowledge Graphs Conference, All rights reserved.

