In Novo Nordisk R&ED, we are implementing a “FAIR@source and scale” approach to data management. The FAIR@source strategy aims to shorten the response time to scientific queries while ensuring factual context and reducing time-to-value. Simultaneously, our emphasis on FAIR@scale prioritizes scalability and the development of enterprise-level solutions. This emphasis is crucial for the practical utilization of solutions within a large organization, and is particularly pertinent to global organizations that span multiple geographies: ensuring the scalability of our data management solutions is integral to their effective implementation and utility across diverse parts of our organization.
To operationalise our “FAIR@source and scale” ambitions, we employ an OBDM strategy. In this strategy, centralized ontologies serve as the Single Source of Truth (SSOT), ensuring consistency and accuracy in data representation. Employing ontology-based structures aids in the alignment and integration of data across various domains, thus streamlining adherence to the FAIR principles from the point of data creation. This approach also ensures scalability within the operational framework of Novo Nordisk. To operationalise our OBDM strategy, we use a combination of ontologies, taxonomies, and controlled vocabularies, each aligned with the others, in order to deliver achievable vocabulary management, even to teams with limited or no semantic background, thereby reducing integration overhead (Fig. 1 shows an overview; more details of each step can be found in later sections). Enforcing the use of ontologies is notoriously difficult. To aid in this process, we embed ontologies in source registration systems where we can, by providing drop-downs with preferred labels while the system registers ontology URIs in the backend (see “Delivery of controlled vocabularies” in our methods section).
Fig. 1 General workflow diagram representing the ontology-based data management system. This diagram shows how we utilise ontologies, converting them to taxonomies in our Ontology Management Systems (OMS), and serving them up as controlled vocabularies (CVs) through Application Programming Interfaces (APIs). Our conversion process from ontologies to taxonomies also converts public ontology URIs to in-house Novo Nordisk (NN) URIs and maps them using the Simple Standard for Sharing Ontological Mappings (SSSOM)
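To make the drop-down mechanism described above concrete, the following is a minimal sketch (in Python, with illustrative NCBITaxon terms and hypothetical field and function names, not our actual payloads or endpoints) of a controlled vocabulary in which only preferred labels are shown to the user, while the registration system stores the corresponding ontology URI.

```python
# Minimal sketch: a CV entry pairs a human-readable preferred label with an ontology URI.
# The UI drop-down exposes only the labels; the backend records the URI of the selection.
SPECIES_CV = [
    {"prefLabel": "Homo sapiens",      "uri": "http://purl.obolibrary.org/obo/NCBITaxon_9606"},
    {"prefLabel": "Mus musculus",      "uri": "http://purl.obolibrary.org/obo/NCBITaxon_10090"},
    {"prefLabel": "Rattus norvegicus", "uri": "http://purl.obolibrary.org/obo/NCBITaxon_10116"},
]

def dropdown_options(cv):
    """Labels presented to the user in the drop-down."""
    return [entry["prefLabel"] for entry in cv]

def register_selection(cv, chosen_label):
    """Resolve the user's choice back to the ontology URI stored in the backend."""
    for entry in cv:
        if entry["prefLabel"] == chosen_label:
            return entry["uri"]
    raise ValueError(f"{chosen_label!r} is not part of the controlled vocabulary")

print(dropdown_options(SPECIES_CV))
print(register_selection(SPECIES_CV, "Mus musculus"))
# -> http://purl.obolibrary.org/obo/NCBITaxon_10090
```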
Building an ontology-based data management ecosystem

Having a good approach for ontology consumption is crucial for the development of an OBDM ecosystem. As of January 2024, BioPortal [9] contains 1065 published models, while the Ontology Lookup Service [10] by EMBL-EBI and Ontobee [11] contain 246 and 263 ontologies, respectively. Given the large corpus of work that already exists, one should, as far as possible, utilise existing ontologies instead of creating new reference ontologies. In the spirit of the FAIR principles, reusing ontologies leads to greater interoperability. The choice of ontologies has been extensively described in the literature [12, 13]. Regardless of the choice of public ontology, it is likely that none of them will be fully fit for purpose for the organisation. This is not surprising, as most public ontologies are built as general-purpose reference ontologies, while organisational requirements tend to be specific. To cater to our specific requirements, our OBDM strategy advocates for the development of organisation-specific ontologies derived from public ontologies. This approach allows for the flexibility required to meet the needs of the organisation while remaining interoperable with external data. We ensure that concepts specific to our organisation are parented by a concept from a public ontology, which allows for easier interoperability, a strategy also used by ontology extensions [14]. To avoid conflicts and issues arising from redundancies, our domain models are built on selected ontologies (or subsets of ontologies). Other ontologies that are needed are either mapped in, or individual terms from them are brought in as needed in a fashion similar to enrichments. In our strategy we do not allow duplication of concepts, to avoid difficulties in integrating data down the road.
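As a hedged illustration of parenting organisation-specific concepts under public ontology terms, the sketch below uses rdflib to mint an in-house URI for a hypothetical enrichment and attach it as a subclass of OBI’s ‘assay’ (OBI:0000070); the namespace and identifiers are placeholders, not our actual URI scheme.

```python
# Sketch (illustrative namespaces/IDs): an enrichment gets an in-house URI but is
# parented by a public ontology class, keeping it interoperable with external data.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

NN = Namespace("https://example.novonordisk.com/ontology/")   # placeholder NN namespace
OBO = Namespace("http://purl.obolibrary.org/obo/")

g = Graph()
g.bind("nn", NN)
g.bind("obo", OBO)

# Hypothetical internal assay concept, parented by OBI:0000070 ('assay')
g.add((NN.NN_0000123, RDF.type, OWL.Class))
g.add((NN.NN_0000123, RDFS.label, Literal("Novo Nordisk proprietary cell-based assay")))
g.add((NN.NN_0000123, RDFS.subClassOf, OBO.OBI_0000070))

print(g.serialize(format="turtle"))
```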
While various strategies exist for organizations to bridge the gap between the scope of public ontologies and organisational needs, our approach was shaped by several key considerations and refined through multiple iterations. Despite the multitude of ontologies and associated resources, it became apparent that no single representation could fully satisfy our needs across all domains of interest. With this in mind, we developed what we term “domain models”, in which we defined the domains of interest, identified relevant ontologies, and created our own internal taxonomies based on them (described below). Another key decision was based on the OBO Foundry principle [12, 15] of orthogonality, which asserts that for each domain there should be convergence upon a single ontology. Based on this, we decided against bringing in multiple full ontologies with the same scope, since having multiple concepts with the same definitions would lead to difficulties in integration.
Our domain models are based either directly on a public ontology or on a composite of multiple ontologies. The latter are amalgamated to form a cohesive model through a process that involves extracting subgraphs and merging them where appropriate. This process also includes the harmonization and merging of concepts from multiple ontologies within the same domain. To ensure flexibility, we mint new URIs for our domain models. This allows us to modify the logical axiomatization of ontologies or append ontology terms to other ontologies. Where it is sensible (e.g. identifying duplicate terms, refining hierarchies, or incorporating non-proprietary terms such as anatomical parts), we prioritize pushing changes/fixes at source. Compared to fixing in-house, this has a few strategic benefits, including reducing the burden of maintenance and ensuring that our data remains interoperable with data annotated with those ontologies. We use the Simple Standard for Sharing Ontological Mappings (SSSOM) [16] to maintain interoperability with external sources, allowing us to update our internal ontologies and keep in sync with public ontologies, avoiding drift. This is important to us because ontologies are not static artefacts but models that evolve alongside knowledge. Additional benefits of using shared standards include easier utilization of community efforts like biomappings [17] and the availability of open-source tooling (e.g. sssom-py). From here, tools like the aforementioned biomappings and OXO [18] can help naturalise annotated external data to our ontologies. Additionally, semiautomated systems using named entity recognition (NER), both commercial and open-source, can aid in the annotation of unannotated data. The specific NER tooling used depends on use case, team capabilities and preferences, and performance, among other things.
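As a hedged sketch of what keeping such mappings in SSSOM form might look like (illustrative CURIEs and file name, not our real mapping sets; real SSSOM files also carry a YAML metadata header and can be handled with the open-source sssom-py toolkit):

```python
# Core SSSOM columns linking internal concepts to public ontology terms.
import pandas as pd

mappings = pd.DataFrame(
    [
        # internal concept, label, mapping predicate, public concept, label
        ("NN:0000123", "NN assay X", "skos:exactMatch", "OBI:0000070", "assay"),
        ("NN:0000456", "glucose",    "skos:exactMatch", "CHEBI:17234", "glucose"),
    ],
    columns=["subject_id", "subject_label", "predicate_id", "object_id", "object_label"],
)
mappings["mapping_justification"] = "semapv:ManualMappingCuration"

# Written as a TSV; downstream this lets us update internal terms when public ones change.
mappings.to_csv("nn_to_public.sssom.tsv", sep="\t", index=False)
```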
As our upper ontology, we use BFO [19], which will allow us to use reasoner-based coherency checks in the future. Deciding on an implementation for a middle/unifying ontology is more complicated. For example, in the biomedical area, the OBO Foundry has created an experimental unifying middle ontology, the Core Ontology for Biology and Biomedicine (COB) (https://github.com/OBOFoundry/COB), that aims to bring together key terms from OBO ontologies. Work is also underway by the Pistoia Alliance to develop a similar middle/upper ontology to unify high-level concepts in the pharma space (termed the Pharma General Ontology (PGO)), and we are actively contributing to the thought leadership underlying its construction. Federated solutions such as the mapping of terms are an alternative to unifying ontologies, and community efforts like biomappings [17] are already ongoing. However, since these unifying solutions are in their infancy, we decided to take an approach interoperable with any of them. As of April 2024, we do not map to any middle ontology, but instead directly to BFO, with the knowledge that mappings can be made in the future.
Securing interoperable scientific metadata

Our FAIR@source and scale strategy relies on metadata in all applications being interoperable and served from an SSOT. We use taxonomies derived from ontologies to act as the SSOT for scientific metadata. Since ontologies are complex and difficult to maintain, we build SKOS taxonomies based on public ontologies to maintain enrichments (terms specific to Novo Nordisk) to domain models. These taxonomies only maintain annotations, hierarchical structures, and minimal relationships between concepts (as opposed to ontologies, which contain richer and more expressive relationships). This allows us the quick turnaround required in an R&ED environment. As SKOS is not as expressive as OWL, we enforce conventions in conversion whereby we treat skos:narrower as equivalent to rdfs:subClassOf (an agreement among the team rather than a logical assertion). This allows back-conversion to OWL/RDFS when needed for use cases that require it, such as building the semantically controlled knowledge graphs described in the next section. In cases where modelling has to be done using individuals rather than classes, we treat rdf:type as equivalent to skos:narrower. Relationships like part_of that could be conceptualised as narrower in a taxonomy are instead left as associative relationships if they need to be brought in. More details and links to snippets can be found in the methods section “Conversion to SKOS taxonomies”. Given that our enrichments are always parented by a concept derived from a public ontology, we retain flexibility in our system. For example, if we decide to maintain OWL ontologies at a later date, our enrichments are already parented by concepts modelled as such, and the conversion of enrichments to OWL objects can be as rich or as shallow as we choose. We do, however, acknowledge that utilising CVs comes with a risk of drift that may lead to issues down the road. For example, Roche utilised an internal CV mapped to a set of terms from the Gene Ontology (GO) for gene enrichment analysis; when converting them to OWL class expressions, incomplete mappings of CV terms and CV terms unmappable to GO were found [20]. To mitigate this, we ensure that conversions are done in automated pipelines, which allows us to easily update to new versions of ontologies (something we do on a regular basis), and, as mentioned above, we ensure enrichments are parented by concepts from public ontologies. Delivery of these taxonomies to stakeholders takes the form of ontology-governed controlled vocabularies (a structured list of concepts derived from the taxonomies) delivered in whatever form the stakeholders prefer, mostly APIs. Where changes in public ontologies will affect users (e.g. deprecation of terms), we follow the conventions of OBO ontologies (e.g. bringing in ‘term replaced by’ annotations) and inform our downstream users appropriately.
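A minimal, hedged sketch of this conversion convention is shown below (using rdflib; part_of is identified via its BFO identifier, and the actual pipeline referenced in the methods section is more involved).

```python
# Convert an OWL class hierarchy into a SKOS taxonomy under the conventions above:
# rdfs:subClassOf (and rdf:type for individuals) becomes skos:narrower/broader, while
# part_of is kept as an associative skos:related link rather than a hierarchical one.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import OWL, RDF, RDFS, SKOS

OBO = Namespace("http://purl.obolibrary.org/obo/")
PART_OF = OBO.BFO_0000050  # BFO 'part of'

def owl_to_skos(owl_graph: Graph) -> Graph:
    skos_graph = Graph()
    skos_graph.bind("skos", SKOS)

    # Convention: child rdfs:subClassOf parent  ->  parent skos:narrower child
    for child, parent in owl_graph.subject_objects(RDFS.subClassOf):
        skos_graph.add((parent, SKOS.narrower, child))
        skos_graph.add((child, SKOS.broader, parent))

    # Convention: individual rdf:type class  ->  class skos:narrower individual,
    # skipping OWL bookkeeping triples such as "x rdf:type owl:Class"
    for individual, cls in owl_graph.subject_objects(RDF.type):
        if isinstance(cls, URIRef) and not str(cls).startswith(str(OWL)):
            skos_graph.add((cls, SKOS.narrower, individual))

    # part_of stays associative rather than hierarchical
    for part, whole in owl_graph.subject_objects(PART_OF):
        skos_graph.add((part, SKOS.related, whole))

    return skos_graph
```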
Integration across data silos using knowledge graphs

Given the legacy of a century-old organisation, we have diverse data sources originating from legacy and federated systems. Ensuring semantic interoperability across these silos gives us the opportunity to better leverage insights across the data landscape. One way to enable semantic interoperability is through the use of a Knowledge Graph (KG). However, integrating disparate data sources into a KG can present a formidable challenge. An effective solution to this issue is the implementation of a Virtual Knowledge Graph (VKG), or Ontology-Based Data Access (OBDA), approach, in which data sources such as databases are mapped to an ontology, thereby presenting a unified KG [21, 22]. Compared to materialisation, this methodology offers significant advantages, including the ability to leverage existing security measures and access controls, since the data remains in its original location. This eliminates the need to duplicate and store large volumes of data, which in turn reduces storage costs and minimizes data redundancy. Additionally, the approach provides a real-time, on-demand view of the underlying data, which better supports the compliance and governance frameworks that can be crucial in pharma. A VKG approach also provides a more scalable method for data ingestion: maintaining scalable mappings, as opposed to the resource-intensive processes of data materialization and constant reindexing, results in a more efficient and sustainable system. This is particularly important because the R&ED environment is decidedly dynamic, with frequent updates and revisions, and a materialised graph can quickly become outdated or inconsistent. In essence, virtualization in the context of a semantic KG offers a flexible approach to data integration, allowing for the addition or removal of data sources with minimal impact on the overall structure of the KG, and has benefits over a materialised graph in compliance, maintaining data consistency, and managing security. A challenge of the VKG approach is that it can struggle with high query complexity and reasoning overhead at scale. There are, however, several techniques to mitigate this, such as using Large Language Model (LLM)-enabled advanced query rewriting to transform high-level semantic queries into optimized database queries [23], caching inference results to minimize redundant computations, and partitioning complex reasoning tasks across multiple nodes. While we acknowledge that scalability remains a key challenge in handling highly complex queries, we believe that VKGs are well suited to our primary focus: establishing a robust conceptual framework that can integrate siloed data with fine-grained access control. This is also why this method is increasingly considered a viable alternative to traditional data integration techniques [21, 22].
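To illustrate the virtualisation idea in miniature (a toy, hedged sketch: table, column, and term names are invented, and production setups would typically use declarative R2RML/OBDA mappings with a VKG engine rather than hand-rolled Python), the data below stays in a relational store and triples are generated on demand from a mapping at query time.

```python
# Toy OBDA-style virtualisation: the rows live in SQL; a mapping turns them into triples
# only when asked, so nothing is materialised or duplicated.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assay_results (id INTEGER, compound TEXT, chebi_uri TEXT)")
conn.execute(
    "INSERT INTO assay_results VALUES (1, 'caffeine', 'http://purl.obolibrary.org/obo/CHEBI_27732')"
)

# Declarative mapping: a subject URI template plus (predicate, source column) pairs
MAPPING = {
    "subject_template": "https://example.novonordisk.com/assay_result/{id}",
    "predicate_object": [
        ("http://www.w3.org/2000/01/rdf-schema#label", "compound"),
        ("https://example.novonordisk.com/vocab/involvesChemicalEntity", "chebi_uri"),
    ],
}

def virtual_triples():
    """Generate triples on demand from the live table; no copy of the data is stored."""
    for id_, compound, chebi_uri in conn.execute("SELECT id, compound, chebi_uri FROM assay_results"):
        row = {"id": id_, "compound": compound, "chebi_uri": chebi_uri}
        subject = MAPPING["subject_template"].format(id=row["id"])
        for predicate, column in MAPPING["predicate_object"]:
            yield (subject, predicate, row[column])

for triple in virtual_triples():
    print(triple)
```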
Developing a KG in the biomedical domain takes a lot of time, resources, and commitment due to the variety and heterogeneity of biomedical data sources [24, 25]. Legacy and disparate sources of data, for example, require a huge curation effort. We have therefore adopted a strategy of incremental improvements based on concrete use cases with stakeholders who can champion them. Our KG is built with scalability, usability, and flexibility in mind. Given that we link our data through our own internal URIs, any rewiring needed can be done easily. Changing ontologies can be done simply by mapping our internal URIs to the new ontology concepts’ URIs. Furthermore, if we decide to eventually maintain our own internal OWL ontologies, switching can be done in very similar ways. Public ontologies are also brought in as imports in a modular fashion, which means that if we require slices of the ontology/KG for specific future applications, this can easily be done. All this points to a generalisable, flexible, and scalable system that fits our strategy of incremental improvements.
We take a semantic approach to our KG construction with public ontologies as its underlying structure. Our semantic KG utilises our domain models’ links to public ontologies to establish a bridge between internal URIs and those of public ontologies using equivalence assertions. Figure 2 shows a diagrammatic representation of how we built our application-specific ontology. In this example, the aim was to query the graph on any “experiment” that “involves” a “chemical entity” that has_role “anti-obesity agent”. The solid lines show explicit relationships between entities in our graph, while the dotted lines show inferred relationships. The inferred response to the query above is highlighted in red.
Fig. 2 Diagrammatic representation of our knowledge graph approach, which illustrates the response path to the query: any “experiment” that “involves” a “chemical entity” that has_role “anti-obesity agent”. The left side represents public ontologies with BFO (yellow) as the upper ontology, and CHEBI (green) and AFO (red) as ontologies of interest. Novo Nordisk-specific nodes (blue) are linked either by owl:equivalentClass or rdfs:subClassOf. The solid lines show explicit relationships between entities in our graph, while the dotted lines show inferred relationships. The inferred response to the query is highlighted in red
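A hedged sketch of how the query pattern in Fig. 2 could be expressed is shown below (SPARQL over rdflib, with a property path standing in for the reasoning step; all URIs are illustrative internal identifiers except CHEBI:24431, ‘chemical entity’, and in practice the equivalence/subclass links come from our domain models while inference is handled by a reasoner or VKG engine).

```python
# Toy graph mirroring Fig. 2: an experiment subtype involves a compound carrying a role;
# the property path a/rdfs:subClassOf* lets the query find it as an "experiment".
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix nn:   <https://example.novonordisk.com/ontology/> .
@prefix obo:  <http://purl.obolibrary.org/obo/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

nn:Compound    owl:equivalentClass obo:CHEBI_24431 .   # CHEBI 'chemical entity'
nn:CellAssay   rdfs:subClassOf nn:Experiment .

nn:compound42  a nn:Compound ;
               nn:hasRole nn:AntiObesityAgent .         # illustrative stand-in for the role
nn:experiment7 a nn:CellAssay ;
               nn:involves nn:compound42 .
""", format="turtle")

query = """
PREFIX nn:   <https://example.novonordisk.com/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?experiment ?chemical WHERE {
  ?experiment a/rdfs:subClassOf* nn:Experiment ;
              nn:involves ?chemical .
  ?chemical   nn:hasRole nn:AntiObesityAgent .
}
"""
for row in g.query(query):
    print(row.experiment, row.chemical)   # expects nn:experiment7 and nn:compound42
```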
An additional benefit of using common public ontologies to underlie our semantic KG is the ability to easily align with public initiatives. For example, the Monarch Initiative graph integrates phenotypes, genes, and diseases [26], something that is highly useful to work done in R&ED. Bringing in such a graph would be relatively simple given that we already utilize the ontologies underlying it, and this is something we are already scoping. In order to optimally harness such alignment, we utilize public data models where possible. For example, for our single-cell RNA sequencing data, we utilize CellXGene standards [27] and have harmonized our legacy data accordingly.
Metrics and evaluation

In order to measure the impact of the OBDM strategy on enabling FAIR@source in applications, and to understand the landscape of that impact, we measured uptake of CVs by the spread of departments (mapped to phases of the pharma value chain) that use our APIs (Fig. 3A) and the number of fields (headers) and values under them that we provide to applications (Fig. 3B). The spread of areas shows diversity in stakeholders, and the increase from 2023 (when we first developed domain models) to 2024 shows an increase in uptake. In order to understand how our models have developed, we analyzed the number of concepts present in each of our models, broken down into what came from public ontologies and what was enriched (Fig. 3C). We note that these metrics are a snapshot at a single point in time (when this paper was written), and our models and stakeholders continually evolve. These metrics should hence be taken only as a glimpse of our current landscape.
Fig. 3 Snapshot of metrics on the use of controlled vocabularies and the number of concepts in domain models in Novo Nordisk R&ED. (A) shows the spread of applications across phases of the pharma value chain. (B) shows the number of fields (headers) and values under them by year. (C) shows the number of concepts present in each of our models, broken down into what came from public ontologies (base concepts) and what was enriched
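As a hedged sketch of how the breakdown in Fig. 3C could be computed (illustrative function and inputs, not our production pipeline): concepts carrying a mapping back to a public ontology count as base concepts, while the remainder are enrichments.

```python
# Count base concepts (mapped to a public ontology) vs enrichments (internal-only)
# for one domain model.
def concept_breakdown(concept_uris, public_mappings):
    """concept_uris: internal concept URIs in one domain model.
    public_mappings: dict of internal URI -> public ontology URI (e.g. read from SSSOM)."""
    uris = list(concept_uris)
    base = sum(1 for uri in uris if uri in public_mappings)
    return {"base_concepts": base, "enrichments": len(uris) - base}

# Illustrative example
model = ["nn:0001", "nn:0002", "nn:0003"]
mappings = {"nn:0001": "obo:OBI_0000070", "nn:0002": "obo:CHEBI_24431"}
print(concept_breakdown(model, mappings))   # {'base_concepts': 2, 'enrichments': 1}
```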