Objectives:
- To define the functionality and design the structure of the HEALS GeoDatabase platform
- To define the technical framework and system architecture
- To develop, implement and populate the HEALS platform through efficient integration and assimilation of all the datasets collected/developed for HEALS
Description of work and role of partners:
The main objective of WP12 is to develop and operate the publicly available HEALS GeoDatabase platform, which will systematically support the collection of and access to all datasets collected/developed for HEALS environment-wide association studies. The platform will enable users to manage and explore spatial data (when applicable), to process these data and to effectively visualize the results of spatially resolved models. It will be linked to a number of external database modules giving access to environmental, molecular biology/biochemistry and clinical data, to support the performance of Environment-Wide Association Studies (EWAS) on the populations studied in Stream 5. It will also be linked to the Biomonitoring/omics/PBPK/Bioinformatics data infrastructures developed in Stream 2. The HEALS database platform will effectively support the HEALS methodology for the construction of the individual exposome and the derivation of environment-wide association studies linking human exposure to chemical and physical stressors over an individual’s lifetime with observed health outcomes. It will be connected to the ToxHub database of the HEROIC project to support European data integration.
The following specific tasks have been identified:
Task 12.1 Definition of functional specifications (VTT, UPMC, USTUTT, UOWM, CERETOX, IDMEC-FEUP, FMUP, NCSRD, URV)
The database will be designed to accommodate both geo-referenced and non-spatial data. Geo-referenced data (i.e. environmental, exposure, population, satellite and GPS sensor-based data) will be included to capture the spatial variability of exposure information and to support its spatial analysis using a Geographical Information System (GIS). Spatially differentiated analysis may support more refined exposure and risk assessment and thus contribute to more refined risk management measures. This is particularly important when policy-relevant conclusions need to be drawn. Non-spatial datasets will also be included through linkage to a number of publicly available databases, in order to retrieve molecular biology/biochemistry and clinical data that already exist or that are produced during the project, as needed to perform EWAS on the population surveys addressed in Stream 5.
First steps include:
(a) the definition of the Database functionalities;
(b) the identification of the main information sources; and
(c) the incorporation of the datasets that the GeoDatabase platform will hold in support of the implementation of the HEALS approach.
The results of this process will be analysed and discussed within the HEALS consortium during technical meetings, in close collaboration with all the other Streams. Based on these deliberations, the technical team of WP12 will define the HEALS platform functional specifications, which will guide the development of the overall HEALS database design. Special attention will be paid to compatibility with ToxHub (HEROIC) and IPCheM (JRC). A key functionality will be the ability to readily deliver relevant HEALS data to the IPCheM database.
Task 12.2 Definition of the technical framework and system architecture (VTT, AUTH, URV, IDMEC-FEUP, USTUTT, UOWM, UPMC, UC)
The technical framework of the HEALS platform will be defined according to the conceptual framework and the information collected in Task 12.1. The platform will be web-based, publicly available, flexible and interactive. Structurally it will include i) a Library containing significant documents and guidelines, as well as links to a number of external databases giving access to chemical, molecular biology/biochemistry and clinical datasets, and ii) a GeoDatabase, which will systematically support the collection of and access to all datasets collected/developed for HEALS case studies and population surveys. Geo-referencing and clustering of data will follow the technical specifications of the EC INSPIRE initiative, and in particular the Environment and Health cluster specifications. In addition, the platform will be designed to report and display uncertainty across all computation stages and datasets composing the HEALS system.
The platform will be operationally linked to database modules incorporating internal HEALS and external datasets, such as the Human Metabolome Database (HMDB), which contains 40278 metabolite entries including both water-soluble and lipid-soluble metabolites. Additionally, 7761 protein (and DNA) sequences are linked to these metabolite entries. Information on pathways involved in both primary and secondary metabolism will be accessed through the MetaCyc database; data on genome sequencing will be retrieved through the KEGG and GenBank databases; other databases include bioactivity screens of chemical substances (PubChem) and protein sequence databases (PDB, Swiss-Prot). Further, in order to link to cohort-specific omics data generated within the HEALS project, linkage to the bioinformatics data infrastructures developed under WP7 (dbNP and DIAMONDS) is foreseen. In addition, automated links to libraries such as EpiSuite and QSAR models will be developed to support the parameterization of the generic PBPK model developed in WP6 for known and new chemicals with limited information. Clearance/elimination kinetics will be retrieved from PopGen and the publicly available data from Simcyp, while plasma protein binding will be obtained from the ToxCast Phase I chemical library. Functional on-line links with SES and EHES will allow the integration of high-quality national health data at European scale into the HEALS platform in support of the EWAS developed in WP13. The HEALS database will support extensive text, sequence, chemical structure and relational query searches. In order to compile the individual exposome, additional information on other exogenous factors, such as the potential use of pharmaceutical drugs, will have to be included. Interactions arising from the use of these types of compounds will need to be properly interpreted; thus links to databases such as DrugBank will be established. Distinguishing metabolites that result from drug use from those that result from environmental toxicants will be facilitated by reference to the Toxin and Toxin-Target Database (T3DB); their interpretation will be facilitated by the tools developed in WP7, utilizing data from the Small Molecule Pathway Database (SMPDB), an interactive, visual database containing more than 350 small molecule pathways found in humans.
The whole system will undergo tests for operational resilience in order to ensure that continuous service is available to the HEALS community and to the users of the HEALS methodology and datasets.
Task 12.3 Development and implementation of the HEALS platform (VTT, AUTH, NCSRD, URV, UC)
The HEALS platform will be developed and implemented based on the analysis and design work performed in WP12 and in WP13. At first, all available spatial datasets will be geo-referenced to capture spatial variability and will be imported into the GeoDatabase. This will cover both the input data to the modelling tools and the modelling results. Close collaboration with WP8 is foreseen to include the Environmental Management System developed in WP8 in the GeoDatabase. For each dataset a dedicated adapter will be developed that maps the existing data to the commonly agreed model. The system will be prepared to accommodate future adapters, for either locally or remotely stored datasets, and to integrate them in the HEALS platform. Close collaboration with Stream 5 partners will be ensured at this stage, so that all information collected in the various population studies is properly imported into the system. Molecular biology/biochemistry and clinical data will be retrieved from the existing databases identified in Task 12.2 and accessed through hyperlinks and query scripts.
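The adapter idea described above can be illustrated with a minimal sketch. It is not the HEALS data model: the `Observation` record, the `AirQualityAdapter` class and the source column names are hypothetical and serve only to show how one dataset-specific adapter could map its rows to a commonly agreed model.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, Mapping, Optional


@dataclass(frozen=True)
class Observation:
    """Hypothetical common record that every adapter maps to."""
    dataset: str
    variable: str
    value: float
    unit: str
    lat: Optional[float] = None  # None for non-spatial records
    lon: Optional[float] = None


class AirQualityAdapter:
    """Maps rows of an invented air-quality table to the common model."""

    def records(self, rows: Iterable[Mapping[str, Any]]) -> Iterator[Observation]:
        for row in rows:
            yield Observation(
                dataset="air_quality_demo",
                variable=row["pollutant"],
                value=float(row["conc_ugm3"]),
                unit="ug/m3",
                lat=float(row["latitude"]),
                lon=float(row["longitude"]),
            )


# Invented example row, as it might arrive from a CSV reader.
rows = [{"pollutant": "PM2.5", "conc_ugm3": "12.5",
         "latitude": "40.64", "longitude": "22.94"}]
obs = list(AirQualityAdapter().records(rows))
```

A future adapter for another source would only need to provide its own `records` method yielding the same record type, which is what keeps downstream tools independent of the original data source.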
In parallel, a development track will focus on the data integration patterns needed for the Environment-Wide Association Studies on the populations identified in Stream 5. We will primarily develop or use open source libraries that allow merging, splitting, filtering and mapping of information, as well as basic computational functionality for on-line data handling and analysis.
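As a rough illustration of the merging, splitting and mapping operations mentioned above, the following sketch uses only the Python standard library. The subject identifiers, variable names, values and the 25.0 threshold are invented for the example; the actual libraries and data structures are to be chosen in the task.

```python
# Two invented tables keyed by a shared subject identifier.
exposure = [
    {"subject_id": "S1", "pm25": 14.0},
    {"subject_id": "S2", "pm25": 31.5},
    {"subject_id": "S3", "pm25": 9.2},
]
biomarkers = [
    {"subject_id": "S1", "cotinine": 1.1},
    {"subject_id": "S2", "cotinine": 4.8},
]


def merge_on(key, left, right):
    """Inner join of two lists of dicts on a shared key."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def split_by(rows, predicate):
    """Split rows into (matching, non-matching) lists."""
    matching = [r for r in rows if predicate(r)]
    rest = [r for r in rows if not predicate(r)]
    return matching, rest


# Merge: keep only subjects present in both tables.
merged = merge_on("subject_id", exposure, biomarkers)

# Split/filter: separate subjects above an arbitrary example threshold.
high, low = split_by(merged, lambda r: r["pm25"] > 25.0)

# Map: derive a new field (ug/m3 converted to mg/m3) for every record.
mapped = [{**r, "pm25_mg_m3": r["pm25"] / 1000.0} for r in merged]
```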
At the end of the task, basic tools will allow operations to be performed on the complete set of available datasets, irrespective of their original source. The outcome of this integration effort will be sets of spatially resolved data used by the underlying GIS engine, together with other non-spatial information. Linking the two is a key component of the overall database design, since geo-referenced data (e.g. environmental, exposure, epidemiological, cohort, biomonitoring) will have to be connected with molecular biology/biochemistry data to unravel the individual exposome.
For the assessment and management of spatially resolved data, the HEALS platform will make use of a GIS. The GeoDatabase platform will enable the user to manage and explore spatial data, to process these data (e.g. spatial statistical analysis such as Exploratory Spatial Data Analysis, Principal Component Analysis, Cluster Analysis, Hot Spot Analysis), and to effectively visualize the results of spatially resolved models. Several query interfaces will be developed in order to integrate the database with tools for: a) automatic updating (update query), b) importing from other data formats or software (import query), c) exporting into other data formats (export query), d) selecting specific subsets of data (selection query), e) grouping records by means of aggregation functions (group by query).
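The five query types listed above can be sketched against an in-memory SQLite database using Python's standard `sqlite3` and `csv` modules. The table name, columns and values are invented for illustration, and the choice of SQLite is an assumption made for the example rather than the database engine of the platform.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurement (site TEXT, pollutant TEXT, value REAL)")

# (b) Import query: load rows from another format (here, CSV text).
csv_text = "site,pollutant,value\nA,NO2,40.0\nA,NO2,20.0\nB,NO2,10.0\nB,O3,55.0\n"
rows = [(r["site"], r["pollutant"], float(r["value"]))
        for r in csv.DictReader(io.StringIO(csv_text))]
conn.executemany("INSERT INTO measurement VALUES (?, ?, ?)", rows)

# (a) Update query: correct a stored value.
conn.execute(
    "UPDATE measurement SET value = ? WHERE site = ? AND pollutant = ?",
    (12.0, "B", "NO2"),
)

# (d) Selection query: a specific subset of the data.
no2 = conn.execute(
    "SELECT site, value FROM measurement WHERE pollutant = ? ORDER BY site, value",
    ("NO2",),
).fetchall()

# (e) Group-by query: aggregate records per site.
means = conn.execute(
    "SELECT site, AVG(value) FROM measurement "
    "WHERE pollutant = 'NO2' GROUP BY site ORDER BY site"
).fetchall()

# (c) Export query: write the aggregated result to another format (CSV).
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["site", "mean_no2"])
writer.writerows(means)
exported = out.getvalue()
```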
Capturing, qualifying and quantifying uncertainty is key to the development of a robust data management system able to effectively support the association between exposure and health outcomes. To this end, the platform will support advanced mapping methods to quantitatively display the uncertainty associated with the datasets stored in the GeoDatabase.
Furthermore, meta-information allowing the user to identify the origin of the data, their uncertainty levels, the spatial and temporal scales of reference, and the format requirements for communication with the other data sources available in the system will be included in the GeoDatabase platform.
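A metadata record of this kind could look like the sketch below. The field names and example values are hypothetical. They mirror the items listed in the paragraph above (origin, uncertainty, spatial and temporal scale, format) and do not represent the INSPIRE metadata schema or any agreed HEALS schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DatasetMetadata:
    """Hypothetical meta-information attached to one stored dataset."""
    name: str
    origin: str
    uncertainty: str      # free-text or coded uncertainty level
    spatial_scale: str
    temporal_scale: str
    data_format: str


meta = DatasetMetadata(
    name="demo_no2_grid",
    origin="hypothetical dispersion model run",
    uncertainty="relative standard deviation 0.25",
    spatial_scale="1 km grid",
    temporal_scale="annual mean",
    data_format="GeoTIFF",
)

# Serialize for exchange with other data sources, then read it back.
as_json = json.dumps(asdict(meta), sort_keys=True)
restored = DatasetMetadata(**json.loads(as_json))
```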
Particular attention will be paid to the analysis of the spatial and temporal relations between various types of environment and health data: the definition of possible spatial relations (e.g. overlay, inclusion, proximity) and temporal relations (e.g. definition of the most suitable time interval for integrating information from different sources) will support the fusion/integration of different environment and health data. Given the importance of communicating results effectively to different end-users and stakeholders, it is proposed to implement the HEALS platform on a WebGIS system (commercial or open source) enabling easy access to data and visualization of results.
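To make the spatial and temporal relations concrete, the following standard-library sketch shows one simple test for each of inclusion (point in polygon by ray casting), proximity (great-circle distance by the haversine formula, assuming a spherical Earth of radius 6371 km) and temporal overlap of two date intervals. The function names are illustrative only; a production GIS would provide these operations natively.

```python
import math
from datetime import date


def point_in_polygon(x, y, polygon):
    """Inclusion: ray-casting test for a point against a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def haversine_km(lat1, lon1, lat2, lon2):
    """Proximity: great-circle distance in km on a sphere of radius 6371 km."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def overlap_days(start_a, end_a, start_b, end_b):
    """Temporal relation: days shared by two closed date intervals (0 if disjoint)."""
    start = max(start_a, start_b)
    end = min(end_a, end_b)
    return max((end - start).days + 1, 0)


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
home_inside = point_in_polygon(2, 2, square)
shared = overlap_days(date(2014, 1, 1), date(2014, 1, 31),
                      date(2014, 1, 20), date(2014, 2, 10))
```

The overlap count is one possible way to judge whether two sources share a long enough time window to be integrated; the most suitable interval remains a design decision for the task.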