Objectives:
- To understand the biological functions of toxicity pathway interactions in relation to external/internal exposure
- To confirm the causal relationship between exposure and disease endpoint through theoretical (computational) analysis and data mining of results from supportive in vitro models
- To combine heterogeneous data from various sources, using advanced data mining techniques, for biomarker identification
- To provide the methodological tools for integrating multiple biomarkers into a mechanistic description of biological pathways relevant to environment-wide health association studies
- To support the derivation of systems biology models for the internal exposome
- To provide data infrastructure support for storage of data, metadata and analysis pipelines for omics data emerging from cohorts (DIAMONDS and GSCF/dbNP) and integrate this into the HEALS database platform developed in WP12
Description of work and role of partners:
WP4-WP6 results (biomonitoring, omics and PBBK modeling) from cohort studies, together with environmental exposures (Stream 3), establish the exposome. Deriving predictive biomarkers from these data together with cohort-specific omics data requires an infrastructure for storage, documentation and pre-processing of the large amount of data produced, the discovery of specific data patterns and/or clusters, the creation of data models based on a training dataset and, finally, evaluation of those models with regard to their validity and prediction capacity on the basis of test data.
In WP7, contemporary bioinformatics techniques will be applied and enhanced in order to select the most relevant omics data and derive specific data profiles for given exposure/disease pathways. WP7 mainly aims at determining predictive biomarkers based on heterogeneous datasets resulting from human biomonitoring, omics and epigenetics analyses and PBBK modeling. This requires first the pre-processing of the large amount of data produced, then the discovery of specific data patterns and/or clusters, the creation of data models based on training sets and, finally, the evaluation of the models with regard to their validity and prediction capacity on the basis of the test data. Several approaches from the fields of descriptive and predictive data mining will therefore be implemented. Their results will be systematically assessed, and the model that best describes our exposome data will be applied in the subsequent population surveys.
WP7 will also provide the methodological tools for integration of multiple omics biomarkers into a mechanistic description of toxicity pathway interactions, in relation to external/internal exposure. This will be achieved by developing systems biology pathway models for the endpoints identified in Stream 5 using the predictive bioinformatics approaches outlined above. Systems biology aims to understand how biological function, absent from isolated biomarkers, arises when they act as components of their system. To accomplish the functional integration of bioinformatics data and the biological pathway modelling outlined above, the following tasks are foreseen:
Task 7.1 Descriptive data mining – preprocessing, data clustering and pattern discovery (AUTH, TNO, URV)
Before unsupervised learning functions are applied to the HEALS datasets to identify intrinsic relations in them, a pre-processing step is necessary. It will consist of five essential modules:
i. technology-specific data pre-processing (e.g. spectra de-convolution),
ii. noise removal, to ensure the consistency and high quality of our data by handling possible outliers or discrepancies in the measurements,
iii. data transformation, to normalize the values in our dataset and also increase their generalization,
iv. data reduction, to decrease both the apparent complexity in our data, through subset representations, and the dimensionality of the derived models, and
v. discretization, to scale the data and prepare them for further analysis by means of clustering and pattern extraction.
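As an illustration only, modules ii, iii and v above can be sketched in a few lines of Python. The specific choices here (a median/MAD outlier rule, z-score scaling, equal-width binning) and all function names and sample values are assumptions made for the example, not methods prescribed by this work package.

```python
from statistics import mean, median, pstdev

def remove_outliers(values, k=3.0):
    """Noise removal: drop values more than k scaled MADs from the median."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return list(values)
    return [v for v in values if abs(v - med) / (1.4826 * mad) <= k]

def z_score(values):
    """Data transformation: centre to zero mean and scale to unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def discretize(values, n_bins=3):
    """Discretization: equal-width binning into labels 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical measurements of one analyte; 25.0 is an obvious outlier.
raw = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 25.0]
clean = remove_outliers(raw)
scaled = z_score(clean)
bins = discretize(scaled)
```

In practice each module would be tuned per technology platform; the point of the sketch is only the order of operations: clean, transform, then discretize.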
Clustering will be implemented with a number of available tools. Although certain schemes, such as self-organizing maps (SOM), are considered to be more efficient, we will also use other well-known approaches, such as K-means and graph-based clustering, since our goal is to study the exposome comprehensively. In this manner, several high-quality clusters will be produced, characterized by low inter-cluster and high intra-cluster similarity. The proper identification of homogeneous groups in our data will greatly support the efficient design of the predictive models in Task 7.2.
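To make the clustering step concrete, the following is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. Seeding the centroids with the first k points is a simplification chosen to keep the example deterministic, and the two-dimensional sample points are hypothetical.

```python
import math

def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm; the first k points seed the centroids (a simplification)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Hypothetical pre-processed samples forming two well-separated groups.
points = [(1.0, 1.1), (5.0, 5.2), (0.9, 1.0), (5.1, 4.9), (1.2, 0.8), (4.8, 5.0)]
labels, centroids = kmeans(points, k=2)
```

Cluster quality in the sense used above could then be checked by comparing the mean distance within clusters to the distance between centroids.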
Pattern discovery will be achieved through association rule mining, a commonly used methodology for detecting local patterns in unsupervised learning systems. The derived associations will be expressed in the form of rules that represent feature-value conditions among the data in a rather straightforward manner. Following the support-confidence framework, we will employ different algorithms for pattern extraction, such as Apriori, FPGrowth and LPMiner, and combine their results. Moreover, we will investigate the possibility of finding rare-event associations in our data, which in many cases are of greater importance than the frequently occurring patterns. For that purpose we will use both boosting and emerging-pattern approaches.
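The support-confidence framework can be illustrated with a brute-force sketch restricted to rules with one item on each side. This is not Apriori, FPGrowth or LPMiner, which prune the search space far more efficiently; the item names, thresholds and transactions below are hypothetical.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def mine_pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Enumerate rules lhs -> rhs over item pairs that pass both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
            confidence = pair_support / support({lhs}, transactions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules

# Hypothetical discretized records, one set of feature-value items per subject.
transactions = [
    {"exposure=high", "marker_A=up", "marker_B=up"},
    {"exposure=high", "marker_A=up"},
    {"exposure=low", "marker_B=up"},
    {"exposure=high", "marker_A=up", "marker_B=up"},
    {"exposure=low"},
]
rules = mine_pair_rules(transactions)
```

Lowering `min_support` while keeping `min_confidence` high is one simple way to surface rarer associations, at the cost of more candidate rules to assess.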
Task 7.2 Predictive data mining – data models design and analysis (AUTH, TNO, CERETOX, UPD)
For predictive data mining we will develop a set of models to perform inference on the available (combinations of) multidisciplinary datasets. Several techniques can be used for that purpose, ranging from typical approaches based on decision trees or k-nearest neighbors to more sophisticated ones that employ artificial neural networks (ANNs), support vector machines (SVMs) or Bayesian networks (BNs). We will implement and test all of the aforementioned approaches, since our goal is not only to perform classification but also to study and unravel the feature attributes concealed in the exposome data. To estimate the reliability of the proposed predictive models we will employ a k-fold cross-validation scheme and measure standard performance indices such as sensitivity, specificity and accuracy. ROC analysis will also be carried out to test the robustness of each model. Based on the obtained results we will consider the design of a mixture of models, in which different approaches are used to model the diverse data in our dataset, with a view to increasing the predictive capacity of the biomarkers identified in WP4 and WP5. This will not only increase the prediction performance in the subsequent environment-wide association health surveys but also incorporate more efficiently the associations and patterns derived from the previous task.
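The evaluation scheme described above can be sketched as follows, using a small k-nearest-neighbors classifier as a stand-in for any of the candidate models. The interleaved fold assignment, the function names and the toy data are assumptions for the example; the sketch also assumes binary labels (1 = case, 0 = control) with both classes present.

```python
import math

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training samples (labels are 0 or 1)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = sum(train_y[i] for i in order[:k])
    return 1 if 2 * votes > k else 0

def cross_validate(X, y, n_folds=5, k=3):
    """k-fold cross-validation returning sensitivity, specificity and accuracy."""
    tp = tn = fp = fn = 0
    for fold in range(n_folds):
        # Interleaved folds keep the example deterministic (no shuffling).
        test_idx = set(range(fold, len(X), n_folds))
        train_X = [X[i] for i in range(len(X)) if i not in test_idx]
        train_y = [y[i] for i in range(len(X)) if i not in test_idx]
        for i in sorted(test_idx):
            pred = knn_predict(train_X, train_y, X[i], k)
            if y[i] == 1 and pred == 1:
                tp += 1
            elif y[i] == 1:
                fn += 1
            elif pred == 0:
                tn += 1
            else:
                fp += 1
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "accuracy": (tp + tn) / len(X),
    }

# Hypothetical two-feature samples; cases cluster near (5, 5), controls near (1, 1).
X = [(1.0, 1.2), (5.1, 4.8), (0.8, 1.1), (4.9, 5.2), (1.3, 0.9),
     (5.3, 5.0), (1.1, 0.7), (4.7, 5.1), (0.9, 1.4), (5.0, 4.6)]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
scores = cross_validate(X, y)
```

Because every sample is predicted exactly once while held out of training, the pooled counts give one confusion matrix per model, which makes the indices directly comparable across the candidate techniques.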
Additionally, the classification models will be further analyzed by visualizing their output and interpreting their deduction mechanism. Visualization mainly refers to mapping the underlying decision hyperplanes of the models as well as depicting the differences in the individual performance indices. Interpretation will be a relatively direct process for decision trees and k-nearest neighbors, whereas for models based on ANNs, SVMs or BNs the deduction is more complex and harder to interpret. This analysis will allow the architecture of the proposed models to be tuned better and their predictions to be optimized.
Task 7.3 Model integration – biomarkers identification and prediction validation (AUTH, TNO, UPD)
This task will provide the methodological tools for integration of multiple omics biomarkers into a mechanistic description of toxicity pathway interactions, in relation to external/internal exposure. This will be achieved by developing biological pathway models for the endpoints identified in Stream 5 using the bioinformatics approaches described above. In particular, all the findings generated from descriptive and predictive data mining, in the form of clusters, patterns, associations and classifications, will be assessed and incorporated into a meta-modeling framework. By post-processing the meta-model as well as the prediction models proposed above, multivariate decision profiles can be determined that are specifically tailored to revealing the diagnostic biomarkers. The prediction accuracy of these biomarkers will be tested against an independent test set, distinct from the data used for training in the previous two tasks. Related to the idea of mixture models, introduced in Task 7.2, is that of biomarker fusion, which rests on the same principles of multivariate analysis. When the problem under study is characterized by high dimensionality or complexity, as is the case with exposome data, it is advantageous to consider as many parameters as possible in order to gain more insight, instead of focusing on a couple of them. Moreover, critical aspects of the interplay between biomarkers can be revealed, which can lead to even better diagnostic procedures. Biomarker fusion can be realized efficiently through an inference system based on fuzzy logic. Since no prior knowledge exists about the normal and pathological levels of the derived biomarkers, fuzzy logic rule sets are considered a constructive approach to designing robust clinical decision support systems.
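A fuzzy inference system for fusing two biomarkers might look like the sketch below, which uses a zero-order Sugeno scheme (crisp rule outputs combined by a weighted average). The membership breakpoints (0.2 and 0.8 on a normalized scale), the four rules and their output levels are purely hypothetical placeholders, since the actual rule sets would have to be derived from the data.

```python
def ramp_up(x, lo, hi):
    """Membership in 'high': 0 below lo, 1 above hi, linear in between."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def fuse(marker_a, marker_b):
    """Fuse two normalized biomarker levels into one risk score in [0, 1]."""
    a_high = ramp_up(marker_a, 0.2, 0.8)  # hypothetical breakpoints
    b_high = ramp_up(marker_b, 0.2, 0.8)
    a_low, b_low = 1.0 - a_high, 1.0 - b_high
    # Each rule: (firing strength via min as fuzzy AND, crisp risk output).
    rules = [
        (min(a_high, b_high), 1.0),  # both high         -> high risk
        (min(a_high, b_low), 0.5),   # only A high       -> intermediate risk
        (min(a_low, b_high), 0.5),   # only B high       -> intermediate risk
        (min(a_low, b_low), 0.0),    # both low          -> low risk
    ]
    total = sum(strength for strength, _ in rules)
    # Weighted average of rule outputs (total > 0 for complementary memberships).
    return sum(strength * out for strength, out in rules) / total

risk_both_high = fuse(1.0, 1.0)
risk_mixed = fuse(1.0, 0.0)
risk_both_low = fuse(0.0, 0.0)
```

The graded memberships are what make this approach attractive when no sharp normal/pathological thresholds are known: a marker slightly above a breakpoint contributes only partially to the high-risk rules.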
Task 7.4 Bioinformatics data infrastructure for storage of human cohort study specific metadata in relation to omics and (bio)assay data (TNO, AUTH)
Within HEALS, data infrastructure support is needed to store and analyse omics data obtained from human cohorts. The Phenotype Database (dbNP) (http://phenotypefoundation.org/) is a bioinformatics application that can store any biological study. It contains templates that make it customizable. The main module of dbNP is the Generic Study Capture Framework (GSCF). To allow the flexibility to capture all information required within a study, and to make it possible to compare studies or study data, the system uses customizable study design templates and ontologies. It is especially designed to store complex human study designs, including cross-over designs and challenges. In addition, it contains a Transcriptomics module, a Metagenomics module, a Metabolomics module and a Simple assay module, which allow for analysis and for data integration with cohort-specific metadata. The Phenotype Database facilitates sharing of data within a research group or consortium, as the study owner can decide who can view or access the data. New studies can be based on study data within the database, as the system encourages standardized storage. As such, it represents an excellent data infrastructure component for omics studies applied in human cohort settings as foreseen in HEALS. Within this Task, the dbNP GSCF will be customized to accept HEALS cohort data and study designs, together with associated omics and assay data. In addition, connections for data export will be established with the larger HEALS infrastructure database platform developed in Stream 4. TNO has also developed the ‘DIAMONDS’ (Datawarehouse Infrastructure for Applications, Models and Ontologies towards Novel Design and Safety) data and metadata infrastructure to allow for the comparison of chemical data, in vitro toxicogenomics data, and public in vivo toxicogenomics and metabolomics data. Integrated into DIAMONDS are tools for quick omics-based pathway enrichment analysis (e.g. ToxProfiler) across multiple datasets from different studies. The DIAMONDS infrastructure will be used to support the descriptive and predictive data mining outlined in the above tasks, on data streams from cohort studies captured in dbNP, and to provide confirmatory support from public in vitro and in vivo animal genomics data related to chemical stressors.