Work Package 7: Novel bioinformatics for predictive biomarker discovery

LEADER: AUTH
PARTNERS: AUTH, UPD, TNO, CERETOX, URV
START MONTH: 1
END MONTH: 30

Objectives:

  1. To understand the biological functions of toxicity pathway interactions in relation to external/internal exposure
  2. To confirm the causal relationship between exposure and disease endpoint through theoretical (computational) approaches and data mining of supportive in vitro models
  3. To combine mixed data from various sources using advanced data mining techniques for biomarker identification
  4. To provide the methodological tools for integrating multiple biomarkers into a mechanistic description of biological pathways relevant to environment-wide health association studies
  5. To support the derivation of systems biology models for the internal exposome
  6. To provide data infrastructure support for the storage of data, metadata and analysis pipelines for omics data emerging from cohorts (DIAMONDS and GSCF/dbNP), and to integrate this into the HEALS database platform developed in WP12

 


Description of work and role of partners:

WP4-WP6 results (biomonitoring, omics and PBBK modeling) from cohort studies, together with environmental exposures (Stream 3), establish the exposome. Deriving predictive biomarkers from these data together with cohort-specific omics data requires an infrastructure for storage, documentation and pre-processing of the large amount of data produced, the discovery of specific data patterns and/or clusters, the creation of data models based on a training dataset and, finally, the evaluation of those models with regard to their validity and predictive capacity on the basis of test data.

In WP7, contemporary bioinformatics techniques will be applied and enhanced in order to select the most relevant omics data and derive specific data profiles for given exposure/disease pathways. WP7 mainly aims at determining predictive biomarkers based on heterogeneous datasets resulting from human biomonitoring, omics and epigenetics analyses and PBBK modeling. This requires first the pre-processing of the large amount of data produced, then the discovery of specific data patterns and/or clusters, the creation of data models based on training sets and, finally, the evaluation of the models with regard to their validity and predictive capacity on the basis of test data. Several approaches from the fields of descriptive and predictive data mining will therefore be implemented. Their results will be systematically assessed, and the model that best describes the exposome data will be applied in the subsequent population surveys.

WP7 will also provide the methodological tools for integrating multiple omics biomarkers into a mechanistic description of toxicity pathway interactions in relation to external/internal exposure. This will be achieved by developing systems biology pathway models for the endpoints identified in Stream 5 using the predictive bioinformatics approaches outlined above. Systems biology aims to understand how biological function, which is absent from isolated biomarkers, arises when they act as components of their system. To accomplish the functional integration of bioinformatics data and the biological pathway modelling outlined above, the following tasks are foreseen:

 


Task 7.1 Descriptive data mining – preprocessing, data clustering and pattern discovery (AUTH, TNO, URV)

Before unsupervised learning functions are applied to the HEALS datasets to identify their intrinsic relations, a pre-processing step is necessary. It will consist of five essential modules:

i. technology-specific data pre-processing (e.g. spectra deconvolution),

ii. noise removal, to ensure the consistency and high quality of our data by handling possible outliers or discrepancies in the measurements,

iii. data transformation, to normalize the values in our dataset and improve their generalization,

iv. data reduction, to decrease both the apparent complexity of our data, through subset representations, and the dimensionality of the derived models, and

v. discretization, to scale the data and prepare them for further analysis by means of clustering and pattern extraction.
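Modules ii to v above can be illustrated with a minimal sketch. It is not the WP7 pipeline: the specific choices (median/MAD clipping for noise removal, z-scoring for transformation, a variance filter for reduction, equal-width binning for discretization) and all function names are assumptions made for illustration only. Module i is technology specific and is left out. The variance filter is applied before z-scoring here because z-scoring equalizes the spread of all features.

```python
from statistics import mean, median, pstdev


def clip_outliers(values, k=3.0):
    """Noise removal (assumed method): clip values lying more than
    k scaled MADs away from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)
    bound = k * 1.4826 * mad  # 1.4826 scales the MAD to a std under normality
    return [min(max(v, med - bound), med + bound) for v in values]


def zscore(values):
    """Data transformation: centre to mean 0 and scale to unit std."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def drop_low_variance(columns, min_std=1e-9):
    """Data reduction (assumed method): discard near-constant features."""
    return {name: vals for name, vals in columns.items() if pstdev(vals) > min_std}


def discretize(values, n_bins=3):
    """Discretization: equal-width binning into labels 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def preprocess(columns, n_bins=3):
    """Chain the steps on a dict mapping feature name -> list of measurements."""
    cleaned = {name: clip_outliers(vals) for name, vals in columns.items()}
    reduced = drop_low_variance(cleaned)
    return {name: discretize(zscore(vals), n_bins) for name, vals in reduced.items()}
```

Called on a dict of feature columns, `preprocess` returns integer bin labels per feature, which is the kind of input the clustering and rule-mining steps of this task expect.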

Clustering will be implemented with a number of available tools. Although certain schemes, such as self-organizing maps (SOM), are considered more efficient, we will also use other well-known approaches, such as K-means and graph-based clustering, since our goal is to study the exposome comprehensively. In this manner, several high-quality clusters will be produced, characterized by low inter-cluster and high intra-cluster similarity. Proper identification of homogeneous groups in our data will greatly support the efficient design of the predictive models in Task 7.2.
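As an illustration of the K-means option and of the cluster quality criterion mentioned above, the following is a minimal sketch using only the Python standard library. The function names, the random initialization and the use of Euclidean distance are assumptions; a production analysis would more likely rely on an established clustering library.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd-style K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization: k random points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


def mean_intra_distance(points, labels, centroids):
    """Average distance of points to their own centroid (lower = tighter clusters)."""
    total = sum(math.dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return total / len(points)
```

A small mean intra-cluster distance relative to the distances between centroids corresponds to the "high intra-cluster, low inter-cluster similarity" criterion.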

Pattern discovery will be attained through association rule mining, a commonly used methodology for detecting local patterns in unsupervised learning. The derived associations will be expressed as rules that represent feature-value conditions among the data in a straightforward manner. Following the support-confidence framework, we will employ different algorithms for pattern extraction, such as Apriori, FP-Growth and LPMiner, and combine their results. Moreover, we will investigate the possibility of finding associations among rare events in our data, which in many cases are of greater importance than frequently occurring patterns. For that purpose we will use both boosting and emerging-pattern approaches.
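The support-confidence framework can be sketched with a small Apriori-style miner. This is an illustrative, unoptimized version written for this text; the function names are assumptions, and any feature-value strings passed to it (for example "exposure=high") would be hypothetical labels rather than HEALS variables.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    size = 1
    while current:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        size += 1
        # Candidate generation: join frequent itemsets of the previous size ...
        current = {a | b for a in level for b in level if len(a | b) == size}
        # ... and prune candidates that have an infrequent subset (Apriori property).
        current = {
            c for c in current
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), r):
                antecedent = frozenset(subset)
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, support, confidence))
    return rules
```

Support is the fraction of records containing an itemset, and confidence is the support of the whole rule divided by the support of its antecedent. Rare-event associations, as noted above, fall below a global support threshold and need the separate treatment described in the text.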

 


Task 7.2 Predictive data mining – data models design and analysis (AUTH, TNO, CERETOX, UPD)

For predictive data mining we will develop a set of models to perform inference on the available multidisciplinary datasets and their combinations. Several techniques can be used for that purpose, ranging from typical approaches based on decision trees or k-nearest neighbors to more sophisticated ones that employ artificial neural networks (ANNs), support vector machines (SVMs) or Bayesian networks (BNs). We will implement and test all of the aforementioned approaches, since our goal is not only to perform classification but also to study and unravel the feature attributes concealed in the exposome data. To estimate the reliability of the proposed predictive models we will employ k-fold cross-validation and measure standard performance indices such as sensitivity, specificity and accuracy. ROC analysis will also be carried out to test the robustness of each model. Based on the obtained results, we will consider the design of a mixture of models, in which different approaches are used to model the diverse data in our dataset, with a view to increasing the predictive capacity of the biomarkers identified in WP4 and WP5. This will not only increase prediction performance in the subsequent environment-wide association health surveys but also incorporate more efficiently the associations and patterns derived from the previous task.
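The evaluation scheme described above (k-fold cross-validation with sensitivity, specificity and accuracy) can be sketched as follows. A k-nearest-neighbors classifier is used only because it is the simplest of the listed model families to write out. The function names, the binary 0/1 label coding and the unstratified random folds are assumptions, and ROC analysis is not covered by this sketch.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points nearest to x (Euclidean)."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k disjoint folds (not stratified)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, k_folds=5, k_neighbors=3, seed=0):
    """Pool predictions over all folds; labels are assumed to be 0 or 1."""
    tp = tn = fp = fn = 0
    for fold in k_fold_indices(len(X), k_folds, seed):
        held_out = set(fold)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        for i in fold:
            pred = knn_predict(train_X, train_y, X[i], k_neighbors)
            if pred == 1 and y[i] == 1:
                tp += 1
            elif pred == 0 and y[i] == 0:
                tn += 1
            elif pred == 1 and y[i] == 0:
                fp += 1
            else:
                fn += 1
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,  # true positive rate
        "specificity": tn / (tn + fp) if tn + fp else nan,  # true negative rate
        "accuracy": (tp + tn) / len(X),
    }
```

Each sample is predicted exactly once, by a model that never saw it during training, so the pooled indices estimate performance on unseen data.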

Additionally, the classification models will be further analyzed by visualizing their output and interpreting their deduction mechanism. Visualization mainly refers to mapping the underlying decision hyperplanes of the models and depicting the differences between the individual performance indices. Interpretation will be relatively direct for decision trees and k-nearest neighbors, whereas in models based on ANNs, SVMs or BNs deduction is more complex and harder to interpret. This analysis will support better tuning of the architecture of the proposed models and the optimization of their predictions.

 


Task 7.3 Model integration – biomarkers identification and prediction validation (AUTH, TNO, UPD)

This task will provide the methodological tools for integrating multiple omics biomarkers into a mechanistic description of toxicity pathway interactions in relation to external/internal exposure. This will be achieved by developing biological pathway models for the endpoints identified in Stream 5 using the bioinformatics approaches described above. In particular, all the findings generated from descriptive and predictive data mining, in the form of clusters, patterns, associations and classifications, will be assessed and incorporated into a meta-modeling framework. By post-processing the meta-model and the prediction models proposed above, multivariate decision profiles can be determined that are specifically tailored to revealing the diagnostic biomarkers. The prediction accuracy of these biomarkers will be tested against an independent test set, different from the data used for training in the previous two tasks.

Similar to the idea of mixture models introduced in Task 7.2 is that of biomarker fusion, which rests on the same principles of multivariate analysis. When the problem under study is characterized by high dimensionality or complexity, as is the case with exposome data, it is advantageous to consider as many parameters as possible in order to gain more insight, instead of focusing on a couple of them. Moreover, critical aspects of biomarker interoperability can be revealed, which can lead to even better diagnostic procedures. Biomarker fusion can be realized efficiently through an inference system based on fuzzy logic. Since no prior knowledge exists about the normal and pathological levels of the derived biomarkers, fuzzy logic rule sets are considered a constructive approach to designing robust clinical decision support systems.
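To make the idea of fuzzy-logic biomarker fusion concrete, the following is a minimal sketch of a zero-order Takagi-Sugeno style inference over two biomarkers. Everything specific in it is an assumption for illustration: the marker names, the rescaling of levels to the range 0 to 1, the linear "low"/"high" membership functions, the four rules and their consequent scores. The text does not specify which type of fuzzy inference system will be used.

```python
def clamp01(x):
    """Restrict a value to the interval [0, 1]."""
    return max(0.0, min(1.0, x))


# Assumed linguistic terms for a biomarker level rescaled to [0, 1].
MEMBERSHIP = {
    "low": lambda x: clamp01(1.0 - x),
    "high": lambda x: clamp01(x),
}

# Hypothetical rule base: (antecedent terms per marker, crisp risk score).
RULES = [
    ({"marker_a": "high", "marker_b": "high"}, 1.0),
    ({"marker_a": "high", "marker_b": "low"}, 0.5),
    ({"marker_a": "low", "marker_b": "high"}, 0.5),
    ({"marker_a": "low", "marker_b": "low"}, 0.0),
]


def fuse(levels, rules=RULES):
    """Combine several biomarker levels into one score.

    Each rule fires with the minimum membership of its antecedents (fuzzy AND);
    the output is the firing-strength-weighted average of the rule scores.
    """
    numerator = denominator = 0.0
    for antecedent, score in rules:
        strength = min(
            MEMBERSHIP[term](levels[name]) for name, term in antecedent.items()
        )
        numerator += strength * score
        denominator += strength
    return numerator / denominator if denominator else float("nan")
```

Because memberships are graded rather than thresholded, the fused score changes smoothly with the biomarker levels, which suits a setting where no sharp normal/pathological cut-offs are known in advance.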

 


Task 7.4 Bioinformatics data infrastructure for storage of human cohort study specific metadata in relation to omics and (bio)assay data (TNO, AUTH)

Within HEALS, data infrastructure support is needed to store and analyse omics data obtained from human cohorts. The Phenotype Database (dbNP) (http://phenotypefoundation.org/) is a bioinformatics application that can store any biological study. It contains templates that make it customizable. The main module of dbNP is the Generic Study Capture Framework (GSCF). To allow the flexibility to capture all information required within a study, and to make it possible to compare studies or study data, the system uses customizable study design templates and ontologies. It is especially designed to store complex human study designs, including cross-over designs and challenges. In addition, it contains a Transcriptomics module, a Metagenomics module, a Metabolomics module and a Simple assay module, which allow for analysis and data integration with cohort-specific metadata. The Phenotype Database facilitates sharing of data within a research group or consortium, as the study owner can decide who can view or access the data. New studies can be based on study data within the database, as standardized storage is encouraged by the system. As such, it represents an excellent data infrastructure component for omics studies applied in human cohort settings, as foreseen in HEALS. Within this Task, the dbNP GSCF will be customized to accept HEALS cohort data and study designs, together with associated omics and assay data. In addition, connections for data export will be established with the larger HEALS infrastructure Database platform developed in Stream 4.

TNO has also developed the 'DIAMONDS' (Datawarehouse Infrastructure for Applications, Models and Ontologies towards Novel Design and Safety) data and metadata infrastructure to allow for the comparison of chemical data, in vitro toxicogenomics, public in vivo toxicogenomics and metabolomics data. Integrated into DIAMONDS are tools for quick omics-based pathway enrichment analysis (e.g. ToxProfiler) across multiple datasets from different studies. The DIAMONDS infrastructure will be used to support the descriptive and predictive data mining outlined in the above tasks, on data streams from cohort studies captured in dbNP, and to provide confirmatory support from public in vitro and in vivo animal genomics data related to chemical stressors.