Code-N is developing business applications based on the Concept Web, an approach to knowledge discovery evolved by forward-thinking academics and industry leaders in the Concept Web Alliance. This group has observed that the utility of the modern Internet is limited by fundamental differences in the way computers and humans process information. Keyword searches require humans to specify in queries exactly what they want to find—and even then, differing vocabularies or query syntax retrieve redundant, ambiguous, and irrelevant results. Entire workflows have sprung up to facilitate modern data searching because formulating queries that actually generate usable information takes time—and extracting knowledge from retrieved results takes even more.
The Concept Web builds on the “triple” semantic web standard: assertions made up of a subject, predicate, and object that lend structure to unstructured data. The Concept Web’s innovation is the Nano-Publication: triples that aggregate collected industry knowledge without redundancy or ambiguity.
Triples can be extracted from any source, such as this journal article about malaria in Africa and Asia:
Triple #1 (TR1) is “Mosquitoes transmit Plasmodium falciparum.” The promise of the semantic web lay in compiling these assertions into a triple store that could be searched by software. Yet the semantic web hasn’t delivered game-changing software applications because of an inherent limitation in the formation of triples.
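As a rough illustration, a triple and a searchable triple store could be sketched in Python as follows. The `Triple` type, the `find` helper, and the in-memory list are assumptions made for this example; they are not part of any semantic web standard or Code-N product.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One assertion: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# A toy in-memory "triple store" (hypothetical): just a list of assertions.
store: List[Triple] = [
    Triple("Mosquitoes", "transmit", "Plasmodium falciparum"),  # TR1
    Triple("Culicidae", "spread", "malaria sporozoites"),       # same idea, other vocabulary
]


def find(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that match every field given (exact string match)."""
    return [
        t
        for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


matches = find(store, subject="Mosquitoes")
for t in matches:
    print(t.subject, t.predicate, t.obj)
```

The exact-match query for "Mosquitoes" finds TR1 but misses the Culicidae assertion, which is the differing-vocabulary problem in miniature.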
The Trouble with Triples
Triples lack provenance and context. Take TR1 above. This triple can’t really be used for reasoning because…
- We can’t tell if the assertion is true. The triple provides no context about the author. Is Osamor a noted expert in malaria or a graduate student making a hypothesis?
- The assertion provides no caveats. Do all mosquitoes transmit malaria, or just certain mosquitoes, such as those in a particular region, of a particular sex, or infected with malaria?
- We don’t know the relationship between this triple and other triples in the store making similar assertions.
- We can’t trace the triple back to the source to gather more context about the assertion.
Nano-Publications: Extracting Intelligence from Triples
Concept Web Nano-Publications do three things to put a triple in context.
- Apply a universally unique identifier (UUID) to each term, using an open-source identity resolution service to assign the same UUID to every term that represents the same “concept” across all industry databases.
In the example above, “mosquitoes” might get assigned UUID #1001, while Plasmodium falciparum might get assigned #1003. If a synonym of “mosquitoes,” such as “Culicidae,” is used in another triple, it would also be assigned UUID #1001. UUIDs are not assigned to words or terms, but to unique concepts. So the concept of an animal called a “jaguar” receives one UUID while the concept of an automobile called “jaguar” gets a different UUID.
- Append all the metadata associated with a triple to capture its context and provenance.
The Nano-Publication encompassing the triple in our example would be
NP101 = TR1 + PR1 + EV1 + CV1
This Nano-Publication includes the original triple (TR1) along with provenance (PR1) information such as
- The source journal name or database
- The title of the article or database entry and its page numbers
- The date the source was published or updated
- The author
- Any special coding used by the industry to search and find the article/entry
The Nano-Publication also includes context such as
- The Evidence Score (EV) of the author and triple, based on the author’s impact factor and whether the triple is a known fact or weak association.
- Caveats (CV) associated with the statement. In this case, the article points out that while not all mosquitoes transmit the malarial parasite, the genus “Anopheles” does.
- Fold all redundant triples into the same Nano-Publication to eliminate redundancy and ambiguity.
For instance, these three triples assert similar concepts:
- TR1 = Mosquitoes transmit Plasmodium falciparum
- TR2 = Culicidae spread malaria sporozoites
- TR10001 = Anopheles cause malaria
Instead of creating a new Nano-Publication for each of the tens of thousands of triples stating that mosquitoes cause malaria, the Concept Web maintains one “cardinal” assertion plus associated metadata for all the underlying triples. So a complete Nano-Publication for our example would be
NP101 = TR1 + PR1 + PR2 + PR10001 + EV1 + EV2 + EV10001 + CV1 + CV2 + CV3
In this case, the caveats surrounding TR2 point out that in order to transmit malarial sporozoites, a mosquito must be female (CV2) and must be infected with malaria (CV3), so these get added to the cardinal assertion along with TR2’s provenance (PR2) and evidence score (EV2). Triple #10001 has no relevant caveats, but we record its provenance (PR10001) and evidence score (EV10001) in NP101 so that scientists can drill down to this information while exploring the cardinal assertion.
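The three steps above can be sketched in a few lines of Python. Everything here is illustrative: the `CONCEPT_IDS` table stands in for the identity resolution service, the `NanoPublication` class and its `fold` and `formula` methods are hypothetical names, and the provenance and evidence values are placeholders rather than real data. Deciding which triples are redundant is assumed to happen elsewhere.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Step 1 (assumed stand-in): a dict plays the role of the identity resolution
# service. The IDs reuse the illustrative numbers from the text.
CONCEPT_IDS: Dict[str, int] = {
    "mosquitoes": 1001,
    "culicidae": 1001,  # synonym, so it maps to the same concept
    "plasmodium falciparum": 1003,
}


def resolve(term: str) -> int:
    """Return the concept ID for a term, ignoring letter case."""
    return CONCEPT_IDS[term.lower()]


@dataclass
class NanoPublication:
    """One cardinal triple plus the metadata of every triple folded into it."""
    name: str
    cardinal_label: str
    cardinal: Tuple[int, str, int]  # (subject ID, predicate, object ID)
    provenance: Dict[str, str] = field(default_factory=dict)  # PR entries
    evidence: Dict[str, str] = field(default_factory=dict)    # EV entries
    caveats: Dict[str, str] = field(default_factory=dict)     # CV entries

    def fold(
        self,
        provenance: Dict[str, str],
        evidence: Dict[str, str],
        caveats: Optional[Dict[str, str]] = None,
    ) -> None:
        """Steps 2 and 3: attach a triple's metadata to the cardinal assertion."""
        self.provenance.update(provenance)
        self.evidence.update(evidence)
        self.caveats.update(caveats or {})

    def formula(self) -> str:
        """Render the publication in the NP = TR + PR + EV + CV notation."""
        labels = [self.cardinal_label, *self.provenance, *self.evidence, *self.caveats]
        return f"{self.name} = " + " + ".join(labels)


np101 = NanoPublication(
    "NP101", "TR1", (resolve("Mosquitoes"), "transmit", resolve("Plasmodium falciparum"))
)
# Placeholder strings below; real entries would hold journal, author, score, etc.
np101.fold({"PR1": "source of TR1"}, {"EV1": "score for TR1"},
           {"CV1": "only the genus Anopheles transmits the parasite"})
np101.fold({"PR2": "source of TR2"}, {"EV2": "score for TR2"},
           {"CV2": "mosquito must be female", "CV3": "mosquito must be infected"})
np101.fold({"PR10001": "source of TR10001"}, {"EV10001": "score for TR10001"})

print(np101.formula())
# NP101 = TR1 + PR1 + PR2 + PR10001 + EV1 + EV2 + EV10001 + CV1 + CV2 + CV3
```

Keying the metadata by label keeps each entry traceable to the triple it came from. Telling apart homonyms such as the two kinds of “jaguar” would need context that this lookup table does not model.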
Code-N Concept Clouds(TM)
Nano-Publications ultimately become a dense mini-file of an industry’s knowledge on a particular topic. These machine-readable concepts extend across language, vocabulary, and database syntax barriers to bridge the gap between the way humans and computers process information. The resulting network of connected concepts can be pulled together into Concept Clouds that can be searched without the massive redundancy and ambiguity that is inherent in triple stores or traditional keyword-based searches. At Code-N, we’re creating ways to interactively mine these collaboratively created Concept Clouds—targeted business applications that uncover expected information faster and make the serendipitous discoveries that drive innovation more likely.