Creating a Robust Architecture to Deliver Consistent Environmental Data to the Community.
The PREMIER project will deliver a novel database and digital assessment system (DAS) for characterising the environmental risks of medicines and making environmental data more visible and accessible to industry, academia, regulators and the public. By making this valuable knowledge available to the community, we will contribute to the responsible use of medicines. The PREMIER database and DAS are being developed by Simomics.
The novel database architecture, designed and built by Simomics, is driven by the need to bring together existing pharmaceutical and ecotoxicological data from different sources with new data generated by consortium members during the PREMIER project, while giving stakeholders transparency about both the origins of the data and its relative quality.
The data we integrate in PREMIER will come from different sources (e.g. industry and academic studies or literature reviews) and will have been generated using different approaches. Data quality assessment will therefore play an important role: by communicating the quality of each data point, it helps users of the database to understand which processes can be reliably performed using which data.
To enable this, the PREMIER database is architected around four concepts: data sources, substances, properties, and tracked values.
Data sources provide the different data points available for substances that are relevant to the project, and these data points are linked as properties of the substance using tracked values. An example of a property is “Fish NOEC”, which may be available from a data source for a number of substances. An example of a tracked value is a statement to the effect of: “data source y gives property z of substance x to be 12.3 ng/L”. We represent this in the database as the value “12.3 ng/L”, with obligatory coupled identifiers for property z (Fish NOEC), substance x and data source y, plus optional additional data such as test design and reliability assessment. This additional data tells anyone accessing the value where it came from, e.g. through study summaries, links to papers and CRED assessments.
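To make the four concepts concrete, here is a minimal sketch of how a tracked value could be modelled. It is an illustration only, not the PREMIER schema: the class names, field names and the `statement` helper are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataSource:
    name: str


@dataclass(frozen=True)
class Substance:
    name: str


@dataclass(frozen=True)
class Property:
    name: str


@dataclass(frozen=True)
class TrackedValue:
    # Obligatory: the value itself plus its three coupled identifiers.
    value: float
    unit: str
    prop: Property
    substance: Substance
    source: DataSource
    # Optional supporting data (hypothetical field names).
    test_design: Optional[str] = None
    reliability: Optional[str] = None
    study_summary: Optional[str] = None
    paper_link: Optional[str] = None

    def statement(self) -> str:
        """Render the tracked value as a plain-language statement."""
        return (f"{self.source.name} gives {self.prop.name} of "
                f"{self.substance.name} to be {self.value} {self.unit}")


# The example from the text: source y, property z (Fish NOEC), substance x.
tv = TrackedValue(
    value=12.3,
    unit="ng/L",
    prop=Property("Fish NOEC"),
    substance=Substance("substance x"),
    source=DataSource("data source y"),
)
print(tv.statement())
# prints: data source y gives Fish NOEC of substance x to be 12.3 ng/L
```

Because the identifiers are required fields, a value cannot be created in this sketch without stating which property, substance and data source it belongs to; the optional fields can be filled in where a source provides them.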
This architecture allows us to perform a range of activities, and gives us advantages over other databases that do not use this kind of provenance tracking:
- Store low-quality or non-standard data alongside high-quality regulatory data. We will mitigate concerns about cross-contamination of data or accidentally using data for the wrong application by giving users the ability to run any process on all available data or on only a subset of the data, e.g. data that has been quality assessed to a particular standard.
- Combine data from different data sources that refer to the same property by a different name (e.g. “Fish NOEC”, “NOEC (fish)”, “NOEC”) while maintaining a record of the raw data and the combining process. As a result, users see one consistent view of the data across all data sources, and can run queries across all the data or only specific data sources, using a consistent query system.
- Allow auditing of the data through the data provenance tracking, with visualisations in the DAS to ensure transparency when data is accessed. When users see any number on the screen, they can view its original data sources, and how it was obtained, even if it has come through calculations, decision flowcharts or in-silico models.
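The three capabilities above can be sketched in a few lines of code. This is an illustrative model under stated assumptions, not the PREMIER implementation: the synonym table, the reliability label "assessed", the source names and the functions `query`, `lowest` and `original_sources` are all hypothetical. In particular, mapping a bare “NOEC” to “Fish NOEC” is assumed here for simplicity; in practice it would need context from the data source.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical synonym table: raw property names -> one canonical name.
CANONICAL_NAMES = {
    "Fish NOEC": "Fish NOEC",
    "NOEC (fish)": "Fish NOEC",
    "NOEC": "Fish NOEC",  # assumption: this source only reports fish data
}


@dataclass(frozen=True)
class TrackedValue:
    value: float
    unit: str
    raw_property: str               # name exactly as the source gave it
    substance: str
    source: str
    reliability: Optional[str] = None
    derived_from: tuple = ()        # input values, if this one was calculated

    @property
    def canonical_property(self) -> str:
        # The raw name is kept; the canonical name is looked up on demand.
        return CANONICAL_NAMES.get(self.raw_property, self.raw_property)


def query(values, substance, prop, sources=None, reliability=None):
    """Select values for one substance and canonical property, optionally
    restricted to given data sources and/or reliability labels."""
    return [
        v for v in values
        if v.substance == substance
        and v.canonical_property == prop
        and (sources is None or v.source in sources)
        and (reliability is None or v.reliability in reliability)
    ]


def lowest(values):
    """Example calculation: pick the lowest value (assumes identical units)
    and record every input so the result stays auditable."""
    chosen = min(values, key=lambda v: v.value)
    return TrackedValue(chosen.value, chosen.unit, chosen.raw_property,
                        chosen.substance, "calculated: minimum",
                        derived_from=tuple(values))


def original_sources(value):
    """Walk back through calculations to the original data sources."""
    if not value.derived_from:
        return {value.source}
    found = set()
    for parent in value.derived_from:
        found |= original_sources(parent)
    return found


values = [
    TrackedValue(12.3, "ng/L", "Fish NOEC", "substance x", "source A", "assessed"),
    TrackedValue(20.0, "ng/L", "NOEC (fish)", "substance x", "source B", "assessed"),
    TrackedValue(5.0, "ng/L", "NOEC", "substance x", "source C"),  # not assessed
]

everything = query(values, "substance x", "Fish NOEC")
assessed = query(values, "substance x", "Fish NOEC", reliability={"assessed"})
result = lowest(assessed)

print(len(everything), len(assessed))    # prints: 3 2
print(result.value)                      # prints: 12.3
print(sorted(original_sources(result)))  # prints: ['source A', 'source B']
```

The same query function serves both the "all available data" and the "quality-assessed subset" cases, and the calculated result still carries references to the values it was derived from, so the unassessed value from source C never silently enters the assessed result.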
This architecture will ensure consistency in the database and DAS during and after the six-year PREMIER project. It brings together measured, calculated and modelled data from various sources, with transparent and traceable data provenance, to give data users confidence in the PREMIER project deliverables and to support their adoption in the future.