Data Science

We harness global biomedical data to predict novel therapies by inferring the molecular drivers of disease to discover new therapeutic targets, and by identifying molecular classifiers that match individual patients to the therapies most likely to benefit them. We pursue these goals by 1) developing and applying advanced data analysis methods; 2) building large-scale data resources; and 3) developing software tools to facilitate data integration and analysis.

A recent revolution in molecular profiling technology has made it so cost-effective that full genome sequencing, along with comprehensive profiling of RNA, proteins, metabolites, and other molecular moieties, will soon become part of routine clinical care. Such data will soon be generated on hundreds of millions of patients across the world, who will be molecularly profiled both for diseases such as cancer and in pre-disease healthy states. In the coming years, the genomic data generated on patients alone will be larger than the data currently used by Google, Amazon, and Facebook combined. To leverage these data we:

  1. Build massive data resources integrating clinical and molecular data from patients across the Mount Sinai health system, as well as from large-scale data generation efforts around the world.
  2. Build multiscale models of disease that integrate diverse types of molecular, imaging, sensor, and clinical data to infer complex molecular networks that mediate key processes within each cell – such as cell growth and death – and across cells (e.g., between tumor cells and the immune system). By understanding how such functional networks become altered to cause disease, we can predict targeted, immune-modulating, and combination therapy strategies designed to counteract each individual disease.
  3. Develop and apply advanced machine and deep learning methods to better classify disease at the molecular level. Such methods are used to develop new molecular taxonomies of disease that are far more accurate than traditional clinical criteria in identifying the specific therapies most likely to benefit an individual patient.

If we are able to harness these data, coupled with breakthrough technologies that now allow us to rapidly test new therapies experimentally, we can revolutionize our understanding of disease and translate this understanding to benefit patients faster.

Below are some approaches we are currently taking towards this goal.


Network Models

(Left) Weighted Gene Co-expression Network Analysis (WGCNA) identifies co-regulated gene modules. (Right) Multiscale Embedded Gene Co-expression Network Analysis (MEGENA) reveals multi-scale organization of co-expressed gene modules, and identifies key regulator genes of each module that can be used to inform potential therapeutic targets.

Multiscale network modeling aims to construct and characterize multiscale biological networks from diverse biomedical data. Specifically, it integrates multi-omics data such as genomic, epigenetic, transcriptomic, and proteomic data into correlation, interaction, and causality networks whose global and local topological structures (e.g., modules, hubs, or key drivers) are then identified and linked to phenotypic outcomes such as disease severity. Different types of complementary networks (e.g., correlation and causal networks) can also be integrated to more precisely reconstruct signaling pathways and determine key regulators. Multiscale molecular networks enable not only the identification of novel pathways and driver genes for complex human diseases but also the development of biologically plausible mechanistic disease models for subsequent experimental and clinical validation. Approaches developed at the Icahn Institute have been used extensively to identify novel pathways and gene targets, and to develop drugs, for complex human diseases such as cancer, atherosclerosis, influenza infection, depression, Alzheimer's disease, obesity, and diabetes.
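As a minimal illustration of the co-expression step in such workflows – not the WGCNA or MEGENA implementations themselves, which add soft thresholding, planar filtering, and multiscale clustering – the following Python sketch builds a gene-gene correlation network from a (synthetic) expression matrix and cuts a hierarchical clustering of the correlation distances into candidate modules. The data and cut threshold are placeholders.

```python
# Minimal sketch of co-expression module detection (illustrative only).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
n_genes, n_samples = 200, 50
expr = rng.normal(size=(n_genes, n_samples))          # genes x samples (synthetic)

corr = np.corrcoef(expr)                              # gene-gene Pearson correlation
dist = 1.0 - np.abs(corr)                             # co-expression distance
np.fill_diagonal(dist, 0.0)

# Average-linkage hierarchical clustering on the condensed distance matrix
tree = linkage(squareform(dist, checks=False), method="average")
modules = fcluster(tree, t=0.9, criterion="distance") # module label per gene

for m in np.unique(modules):
    print(f"module {m}: {np.sum(modules == m)} genes")
```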

Machine Learning

Machine learning methods have advanced large-scale cancer omics studies at every stage, from low-level data preprocessing, such as imputation of missing values in proteomics data, to downstream integrative omics analyses that reveal important molecular mechanisms of disease initiation and progression. Specifically, the latest advances in regularized high-dimensional regression (e.g., Lasso, fused Lasso) have been applied not only to properly model each individual type of omics data but also to efficiently characterize interactions among different biological molecules.
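A hedged sketch of the regularized-regression idea described above: scikit-learn's cross-validated Lasso applied to synthetic omics-like features (many more features than samples) to recover a sparse set of predictors. Feature counts and effect sizes are arbitrary choices for illustration.

```python
# Sketch: sparse (Lasso) regression on high-dimensional omics-like data.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n_samples, n_features = 100, 1000                 # p >> n, typical of omics studies
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [2.0, -1.5, 1.0, 0.8, -0.6]       # only 5 features truly matter
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

model = LassoCV(cv=5).fit(X, y)                   # penalty chosen by cross-validation
selected = np.flatnonzero(model.coef_)
print(f"selected {selected.size} features; first hits: {selected[:10]}")
```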

Other novel uses of machine learning in cancer omics studies include Random Forest-based network inference to achieve a systems-level understanding of cell activities and disease initiation and progression. Icahn Institute scientists apply state-of-the-art machine learning methods, and develop novel methodologies, to integrate large-scale omics data, infer the molecular drivers of disease, and stratify patient subpopulations to guide molecularly targeted therapies.
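The sketch below illustrates the Random Forest network-inference idea in the spirit of GENIE3-style methods (a named stand-in, not necessarily the method used here): regress each gene on all others and treat feature importances as directed edge weights. The data and the single planted edge are synthetic.

```python
# Sketch: Random Forest-based network inference. For each target gene, fit a
# forest on all other genes and read edge weights off the feature importances.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n_samples, n_genes = 80, 10
expr = rng.normal(size=(n_samples, n_genes))
expr[:, 1] += 0.8 * expr[:, 0]                    # plant one regulatory edge: gene0 -> gene1

importance = np.zeros((n_genes, n_genes))         # importance[i, j]: gene i predicts gene j
for j in range(n_genes):
    X = np.delete(expr, j, axis=1)                # all genes except the target
    rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, expr[:, j])
    importance[np.arange(n_genes) != j, j] = rf.feature_importances_

i, j = np.unravel_index(importance.argmax(), importance.shape)
print(f"strongest inferred edge: gene{i} -> gene{j} (weight {importance[i, j]:.2f})")
```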

Featured Scientist: Pei Wang

Machine learning methods are powerful tools to reveal important disease-relevant molecular regulatory relationships.

Computational Cancer Immunology

Icahn Institute researchers are studying the role of the immune system in several cancers, particularly the interaction of host tumoral RNA with the innate immune system, the role of neoantigens in the evolution of tumors both generally and in response to therapy, and the role of endogenous and exogenous viruses in cancer. In addition to shedding light on these scientific issues, we are working to translate our work into clinically impactful results in cancer immunotherapy, particularly the creation of predictive models of response.

Clones are inferred from the genealogical tree of each tumor. We predict the future effective size of the cancer cell population, relative to its size at the start of therapy, by evolving clones under the model over a fixed timescale. We have shown that therapy can decrease the fitness of clones depending on their neoantigens, and that clones with strongly negative fitness lose more of their population size than fitter clones.
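A toy version of this kind of clone fitness calculation, for intuition only (the published model is more detailed): each clone grows or shrinks exponentially according to its fitness, and the future effective population size is reported relative to its size at the start of therapy. The frequencies, fitness values, and timescale below are made up.

```python
# Toy clone-fitness model: n(tau) = sum_i x_i * exp(f_i * tau), relative to n(0).
import numpy as np

sizes = np.array([0.5, 0.3, 0.2])        # clone frequencies at start of therapy
fitness = np.array([-1.2, -0.1, 0.4])    # per-clone fitness; strongly negative =
                                         # neoantigen-driven loss under therapy
tau = 1.0                                # fixed evolution timescale

future = sizes * np.exp(fitness * tau)
relative_size = future.sum() / sizes.sum()
print(f"predicted relative population size n(tau)/n(0) = {relative_size:.2f}")
print("per-clone fold change:", np.round(np.exp(fitness * tau), 2))
```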

Featured Scientist: Benjamin Greenbaum


Genotype to Phenotype Models

The idea of genotype-to-phenotype models is to jointly analyze multiple types of omics data, using genetic variants as the causal anchor, to identify the genetic basis of human diseases and uncover the mechanism: genetic variants → molecular/cellular alterations → disease. Genotype-to-phenotype models capitalize on the fact that large-scale omics studies have become economically feasible and that datasets are accumulating rapidly. The genotype-to-phenotype approach has been successfully applied to diseases such as Alzheimer's disease (AD), atherosclerosis, and cancers.

The Icahn Institute's latest research develops novel approaches to accurately call structural variants (SVs) in large populations, enabling SV-based GWAS and QTL studies. Large omics datasets (transcriptome, proteome, and methylome) are generated by Icahn Institute researchers on many tissue types. Our research demonstrates the power of the genotype-to-phenotype approach in dissecting the molecular etiology of complex diseases and has led to many widely used analysis methods for inferring genetic causal models of disease from omics data.
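The causal-anchor logic can be illustrated with a two-stage regression on synthetic data: stage one tests whether a variant drives a molecular trait (an eQTL effect), and stage two tests whether that trait associates with the phenotype. Real analyses use formal QTL mapping and causal inference (e.g., Mendelian randomization); this is only a sketch of the reasoning chain.

```python
# Sketch of the causal-anchor logic: genetic variant -> molecular trait -> disease.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 500
genotype = rng.integers(0, 3, size=n).astype(float)     # 0/1/2 allele counts
expression = 0.5 * genotype + rng.normal(size=n)        # variant drives expression
phenotype = 0.7 * expression + rng.normal(size=n)       # expression drives phenotype

qtl = stats.linregress(genotype, expression)            # stage 1: eQTL effect
assoc = stats.linregress(expression, phenotype)         # stage 2: expression-phenotype
print(f"eQTL beta = {qtl.slope:.2f} (p = {qtl.pvalue:.1e})")
print(f"expression->phenotype beta = {assoc.slope:.2f} (p = {assoc.pvalue:.1e})")
```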

Featured Scientist: Judy Cho

Integrative genomics pinpoints the causal role of genetic variants on disease phenotypes, and reveals the etiological mechanisms at the ‘omics level

Psychiatric Genetics

The Pamela Sklar Division of Psychiatric Genomics: An Engine for Progress

Common neuropsychiatric disorders, such as schizophrenia, bipolar disorder, and Alzheimer's disease, carry considerable morbidity, mortality, and personal and societal cost. While recent large-scale genetic association studies have identified numerous risk loci, the mechanisms through which they contribute to disease remain largely unknown. Icahn Institute scientists lead several of the largest consortium-based efforts to generate and analyze massive amounts of molecular data from patients with psychiatric disease, including the PsychENCODE Project, the CommonMind Consortium, and the Accelerating Medicines Partnership for Alzheimer's Disease. Cell type-specific molecular studies provide a reference atlas for characterizing the effect of genetic variation on the three-dimensional configuration of the genome and on the complex mechanisms that regulate gene expression in the cell types relevant to disease. In parallel, applying single-cell molecular approaches to the affected tissue (in this case, the human brain) reveals novel cellular subpopulations associated with disease. Icahn Institute scientists lead several integrative analyses combining these big omics data with large-scale genetic and biobank datasets, including the Psychiatric Genomics Consortium and the Million Veteran Program, to facilitate the clinical translation of genetic findings and advance Precision Psychiatry.

Clinical Data Science

Schematic of the Mount Sinai Clinical Data Science pipeline showing key components and data flow

The Icahn Institute has created a unified data pipeline and computational cluster that supports a system-wide streaming clinical data platform. We ingest data from clinical and administrative systems, normalize and transform the data, and then build predictive models using modern machine learning methods. Our vision is to improve patient care using this platform to run validated algorithms that provide real-time decision support at the point of care and in the community. Examples of models that we are currently deploying include:

  • MEWS++ An enhanced predictive model for unexpected care escalation, inspired by the classic Modified Early Warning Score (MEWS). The MEWS++ model incorporates laboratory, pharmacologic, assessment, and physiologic data and was created and validated using 5 years of data from Mount Sinai Health System patients. Its performance is better than any published model of its kind, and it is planned to go live in fall of 2018 (a generic sketch of this kind of classifier follows this list).
  • Malnutrition An automated screening tool to identify malnourished patients based on laboratory and assessment data. This tool will allow our Registered Dietitians to screen efficiently for malnutrition and work with the unit-based team on a treatment plan.
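The sketch below shows the general shape of training a care-escalation classifier on tabular clinical features. It is not the MEWS++ implementation: the features, labels, and model choice are illustrative placeholders, whereas the real model uses curated EHR features and rigorous validation.

```python
# Generic sketch: train and evaluate an escalation classifier on synthetic
# vital-sign and lab features (illustrative only).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
n = 2000
X = np.column_stack([
    rng.normal(80, 15, n),    # heart rate
    rng.normal(18, 4, n),     # respiratory rate
    rng.normal(1.0, 0.4, n),  # lactate
])
risk = 0.04 * (X[:, 0] - 80) + 0.2 * (X[:, 1] - 18) + 1.5 * (X[:, 2] - 1.0)
y = (risk + rng.normal(scale=1.0, size=n)) > 1.0      # synthetic escalation label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
print(f"held-out AUROC: {roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]):.2f}")
```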

DREAM Challenges

The structure of a typical DREAM Challenge

The Dialogue on Reverse Engineering Assessment and Methods (DREAM) – better known as the DREAM Challenges – is an open-science, collaborative competition framework for hosting biomedical Challenges, led by Icahn Institute Professor Gustavo Stolovitzky. Since its founding in 2006, DREAM has hosted computational Challenges across systems biology and translational medicine. The DREAM mission is to promote rigorous benchmarking of computational biology methods, data sharing, and collaborative, open science. Today, DREAM serves as a central clearinghouse for evaluating and improving fundamental algorithms used in basic and translational research. DREAM has successfully run 50 Challenges, including prediction of drug sensitivity, drug synergy, somatic variant calling, gene essentiality, RNA isoform and fusion detection, tumor subclonal reconstruction, transcription factor binding, and molecular network inference. A series of cancer prognosis Challenges has led to patient outcome models in breast cancer, prostate cancer, and acute myeloid leukemia, with further prognostic Challenges in multiple myeloma and colorectal cancer.

In a typical Challenge, an unpublished dataset relevant to a biomedical problem is shared with the computational biology community. For a subset of this dataset, only the input data are provided; the answers to the Challenge questions are withheld. Participants develop their algorithms and, after a period of refining their methods, submit their proposed solutions to the Challenge organizers, who evaluate the solutions against the withheld answers. This allows a rigorous, blind evaluation and comparison of the different algorithms designed to solve the proposed Challenge. In the process, valuable unpublished data are shared with the community and collaborations between participants are established. All DREAM outputs (data, algorithms, predictions, etc.) are open to the public with no intellectual property assignment, and Challenge results are published in top-tier biomedical journals.
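A minimal sketch of the blind-evaluation step: score each submitted prediction against the withheld gold standard and rank the teams. The teams, data, and the choice of Spearman correlation as the metric are all hypothetical; each Challenge defines its own scoring scheme.

```python
# Sketch: score submissions against withheld answers (hypothetical data/metric).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(5)
gold = rng.normal(size=100)                       # withheld answers
submissions = {
    "team_a": gold + rng.normal(scale=0.5, size=100),
    "team_b": gold + rng.normal(scale=1.5, size=100),
    "team_c": rng.normal(size=100),               # uninformative baseline
}

ranking = sorted(submissions.items(),
                 key=lambda kv: spearmanr(gold, kv[1])[0], reverse=True)
for team, pred in ranking:
    print(f"{team}: Spearman rho = {spearmanr(gold, pred)[0]:.2f}")
```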

Pathogen Surveillance

The Pathogen Surveillance project utilizes an integrated workflow that combines collection of clinical isolates and data from the electronic medical record (EMR) system (green) with the latest genomic profiling and lab information system technologies (blue) and bioinformatics analysis of sequencing and genome data (red) for large-scale profiling of bacterial and viral infections (PMID: 26251049).

The Icahn Institute has established a Mount Sinai Health System (MSHS)-wide Pathogen Surveillance Program (PSP) led by a multidisciplinary investigative team with backgrounds in infectious disease, epidemiology, statistical modeling, and the analysis of pathogen genomes and microbiomes. The PSP uses whole-genome sequencing to understand the molecular basis of the evolution and transmission of infectious diseases and of host-pathogen interactions, and to identify novel pathogens. As an international hub, New York City provides a snapshot of worldwide pathogen diversity, including emerging and re-emerging pathogens of pandemic potential. It represents a confluence of diverse human populations with a broad range of underlying medical conditions, and it is one of the major ports of entry for infectious pathogens, due to extensive travel by its resident population and millions of incoming visitors (~61 million in 2016: 48 million domestic and 13 million international). As such, our PSP team is uniquely positioned to explore the genomic diversity of human infectious diseases directly, using samples collected from patients seen at the MSHS – the largest health care provider in the New York metropolitan area. The primary focus of the PSP is on healthcare-associated infections (HAIs) and respiratory virus infections.

Healthcare-associated infections (HAIs) pose a ubiquitous, insidious, and potentially fatal threat to patients worldwide. The Centers for Disease Control and Prevention estimates that HAIs account for roughly 1.7 million serious infections every year in the United States and cause or contribute to 99,000 deaths annually, with a greater burden in immunocompromised hosts. Important bacterial pathogens in HAIs include multi-drug resistant organisms such as Staphylococcus aureus (including methicillin-resistant S. aureus, or MRSA), vancomycin-resistant enterococci (VRE), and Clostridium difficile. We anticipate that a better understanding of the role of genetic diversity in bacterial infections will lead to improved patient care and outcomes.

Respiratory viruses place a great burden on our health systems. They infect all age groups; however, the young, the elderly, and those with chronic medical conditions are the most affected. Respiratory viruses not only cause acute pneumonia and exacerbate chronic conditions such as asthma, but also increase the risk of secondary bacterial infections. In addition, newly emerging or re-emerging respiratory viruses (e.g., avian influenza, pandemic H1N1, SARS) pose a constant threat to world health. Antiviral treatment options are limited, and vaccines, where available, do not provide long-lasting immunity.

The goal of our program is to apply advanced whole genome sequencing to clinical isolates of pathogens to:

  • Build a baseline census or “background of diversity” for clinical pathogens of major importance.
  • Identify potential increases in infection rates to impact Infection Control Unit resource utilization.
  • Identify new and emerging bacterial and viral pathogen threats, including emerging resistance to antibiotic and antiviral treatments.
  • Combine big data and electronic health records (EHR) mining with whole genome sequencing (WGS) analysis to accurately depict epidemiological trends and improve the understanding of infection risk and outcome.
  • Partner with others throughout the hospital system to enable real time access to an integrated synopsis of relevant data to share with patients and providers and have a positive impact on patient health.
  • Provide personalized diagnosis and, ultimately, treatment options.

Whole Cell Modeling

To predict phenotype from genotype, whole-cell models must represent all of the processes inside cells including their gene expression, growth, and division.

Whole-cell (WC) dynamical models aim to predict cell phenotypes from genotype by representing all of the biochemical activity inside individual cells. WC models could help physicians personalize medicine and help bioengineers rationally design microorganisms.

Engineers routinely use computer-aided design based on mechanistic models to quickly, reliably, and efficiently design complex systems such as airplanes. In comparison, medicine is ad hoc, unreliable, and expensive because we do not have mechanistic models that predict phenotype from genotype. The goal of WC modeling is to develop a predictive foundation for medicine and bioengineering, analogous to the Newtonian foundation for mechanical engineering. This entails developing comprehensive dynamical models that predict the phenotypes of individual cells by representing all of the biochemical activity inside cells, including the function of each gene product.
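To give a flavor of what "mechanistic dynamical model of a cell" means, here is a deliberately tiny toy: a single protein species whose expression drives growth, with division triggered by a mass threshold. Real WC models track thousands of species across multiple simulation algorithms; every parameter below is invented for illustration.

```python
# Toy dynamical model coupling gene expression to growth and division
# (illustrative only; not a whole-cell model).
dt, t_end = 0.01, 10.0
protein, mass = 1.0, 1.0
k_syn, k_deg, k_growth, division_mass = 2.0, 0.5, 0.3, 2.0

t = 0.0
while t < t_end:
    protein += dt * (k_syn - k_deg * protein)        # simple expression kinetics
    mass += dt * k_growth * protein                  # growth driven by protein level
    if mass >= division_mass:                        # division: halve mass and protein
        mass /= 2.0
        protein /= 2.0
        print(f"division at t = {t:.2f}")
    t += dt
```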

The Icahn Institute is leading WC modeling, having pioneered the first WC models and developed the first tools for WC modeling. In 2012, Icahn Institute scientists demonstrated the feasibility of WC models with a report in Cell of the first model to represent every characterized gene of an organism. Over the past few years, Icahn Institute scientists have reported the first applications of WC models to basic science, medicine, and bioengineering in Nature Methods, Computational Biology & Chemistry, and Chaos. To enable more comprehensive and more predictive models, Icahn Institute scientists are currently piloting the next generation of WC models, including the first WC model of a human cell, and developing the computational tools needed to build, simulate, and apply WC models in medicine and bioengineering. Recently, Icahn Institute scientists published a three-phase plan for achieving WC models of human cells in Current Opinion in Systems Biology and outlined the experimental and computational technologies needed to enable WC modeling in Current Opinion in Biotechnology. The Icahn Institute also leads the WC modeling community by organizing annual meetings and courses.

Minerva Supercomputer

Minerva Supercomputer

In May 2012, the Icahn Institute unveiled Minerva, Mount Sinai's first supercomputer. In February 2014, Minerva was expanded with state-of-the-art Intel processors to over 400 servers, comprising 12,672 cores, over 60 TB of RAM, and 11 PB of raw storage capacity. Minerva has been an important advance for Mount Sinai's researchers, meeting the need for significant computational and data resources; these resources support the broad range of NIH-funded science performed here by our users and their external collaborators. All nodes are connected with InfiniBand, a high-bandwidth, low-latency switching fabric. The initial Minerva nodes (spring 2012) consist of 120 Dell C6145 nodes (7,680 AMD 2.3 GHz Interlagos cores with 256 GB of memory per node) and 1.5 petabytes of DDN SFA10K storage. The expansion nodes (spring 2014) consist of 209 IBM NeXtScale Intel Xeon IvyBridge (E5-2643 v2) nodes (2,508 3.5 GHz cores with 64 GB per node), 3 petabytes of IBM GSS storage, and 160 terabytes of IBM FlashSystem flash. Special-purpose nodes are available with 20 NVIDIA GPUs (4 M2090s and 16 K20Xs), 1 TB of shared system memory, MATLAB Distributed Computing Server, and web and database services. The storage subsystem contains 4.5 petabytes of high-performance parallel file system storage (IBM's GPFS). An archival storage system (IBM's Tivoli Storage Manager) encrypts and saves copies of data on tape at two geographically disparate locations. Over 900 applications are supported.

The Icahn Institute has also received NIH funding to build and deploy the Big Omics Data Engine (BODE) to support NIH-funded genomics researchers. This machine, deployed on scope, schedule, and budget, delivers over 95 teraflops from 2,484 2.4 GHz Intel Haswell cores in 207 Cray CS300-AC nodes, with 13 terabytes of memory (64 GB per node) and 5 petabytes of DDN SFA12KE storage connected via InfiniBand. The machine entered production on February 1, 2015.

Mount Sinai Data Warehouse

Mount Sinai Data Warehouse

The Mount Sinai Data Warehouse (MSDW) is a collection of clinical, financial, and operational data sourced from over 25 systems used at the Mount Sinai Hospital. MSDW provides researchers access to over 10 million patient medical records, including 10 million clinical documents (progress notes, discharge summaries, and operative reports), 36 million ICD-9-coded diagnoses (>18,000 unique ICD-9 diagnoses), 90 million medication records (prescriptions and medication administrations), and 175 million test results (lab measurements, radiology, and pathology reports) across 15 million patient visits. MSDW supports the research community by: providing de-identified and identified data access for IRB-approved studies; identifying cohorts and providing daily, weekly, or monthly reports on these study populations; providing upcoming appointment details for potential study recruits to aid enrollment; providing customized data marts for complex multi-dimensional studies; and providing a self-service Cohort Query Tool (CQT) to build cohorts and group patients by diagnosis, medication, lab results, surgeries, and more.

Imaging Research Warehouse

Imaging Research Warehouse (IRW) – Images to Knowledge! Today, the daily imaging exams archived in PACS are de-identified. Current development is targeted at also de-identifying the associated radiology reports and clinical information. Together, these robustly curated exams, soon to number in the millions, are available for big-data investigation and the Precision Medicine Initiative.

The Imaging Research Warehouse (IRW), a joint project of the Department of Radiology and the Translational and Molecular Imaging Institute (TMII), is a unique source of imaging big data developed with sponsorship from our CTSA program. It is growing to include the roughly 1 million radiology exams we generate yearly, prepared for the research community. Most importantly, it will provide a source of phenotypic data that can be correlated with omics data, providing a bridge between our traditional view of disease and the new paradigm exemplified by Mount Sinai's Precision Medicine Initiative. The exams are de-identified and available for perusal within the enterprise without IRB approval, making the IRW an ideal dataset for collaborative investigations with other enterprises, including the NIH (IRB approval is required for outside collaboration).

Current development work is focused on de-identifying the free-text narrative radiology reports and extracting medical concepts to associate with the images. The combination of imaging exams, report data, and clinical data, all with PHI masked, will provide an extremely powerful dataset and toolset for knowledge discovery alongside the current work on omics data.

BioJupies

Automated RNA-seq Analysis Reports with BioJupies

RNA-sequencing (RNA-seq) is a widely applied experimental method to study the biological molecular mechanisms of cells and tissues in human and model organisms. The analysis of RNA-seq data typically requires the chaining of several bioinformatics tools into a computational pipeline, which often involves specialized hardware and bioinformatics expertise.

In order to lower the barrier of entry to RNA-seq data analysis, Icahn Institute researchers have developed BioJupies, a web server that enables automated creation, storage, and deployment of Jupyter Notebooks containing complete RNA-seq data analysis reports.

Through an intuitive interface, novice users can rapidly generate tailored reports to analyze and visualize their own raw sequencing files, their gene expression tables, or fetch data from >8,000 published studies containing >250,000 preprocessed RNA-seq samples from the Gene Expression Omnibus (GEO) repository and the Genotype-Tissue Expression (GTEx) project.

Generated notebooks contain executable code for the entire pipeline, interactive data visualizations, differential expression and enrichment analyses, and rich annotations that provide detailed explanations about the results and methods. The notebooks are permanently stored in the cloud and made available online through a persistent URL. By providing an intuitive user interface for notebook generation for RNA-seq data analysis, starting from the raw reads all the way to a complete interactive and reproducible report, BioJupies is a useful resource for experimental and computational biologists.
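To illustrate the kind of differential-expression step a generated notebook contains, here is a sketch reduced to a per-gene t-test on synthetic log-expression values. This is not BioJupies' actual code: the real pipeline uses established differential expression methods and proper normalization.

```python
# Sketch of a per-gene differential expression test (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n_genes = 1000
control = rng.normal(5, 1, size=(n_genes, 4))        # log-expression, 4 replicates
treated = rng.normal(5, 1, size=(n_genes, 4))
treated[:50] += 2.0                                  # spike in 50 up-regulated genes

t, p = stats.ttest_ind(treated, control, axis=1)
log_fc = treated.mean(axis=1) - control.mean(axis=1)
hits = np.flatnonzero(p < 0.05 / n_genes)            # Bonferroni-corrected hits
print(f"{hits.size} significant genes; mean logFC of hits = {log_fc[hits].mean():.2f}")
```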

Featured Scientist: Avi Ma’ayan

BioJupies Workflow

Enrichr

Enrichr gene-set search engine

A Search Engine for Gene Sets and Signatures

Most genomic, transcriptomic, and epigenomic studies produce sets of genes that must be placed in the context of prior biological knowledge. These gene sets can represent, for example, mRNAs identified as differentially expressed, or genes harboring mutations correlated with a phenotype. Enrichr, developed by Icahn Institute researchers, is a web-based gene-set search engine that performs comprehensive gene-set enrichment analyses, querying hundreds of thousands of annotated gene sets. Enrichr also provides gene-centric landing pages that aggregate all the knowledge the resource contains. The Enrichr project thus provides a framework for integrating data from many publicly available resources into a useful tool, and for the systematic, automated discovery of new biological, biomedical, and pharmacological knowledge. Beyond providing a high-quality service to the community, Enrichr holds potential for integrative analyses that extend beyond a single user: the lists submitted to Enrichr every day may hold global knowledge about the relationships between single genes and gene modules.
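A minimal version of the overlap statistic at the heart of this kind of enrichment analysis is a Fisher's exact test on the 2x2 table of query genes versus an annotated library set. The gene universe size and the two sets below are hypothetical, and this sketch omits the additional ranking corrections a production tool applies.

```python
# Sketch: gene-set enrichment via Fisher's exact test (hypothetical sets).
from scipy.stats import fisher_exact

universe = 20000                                   # assumed background gene count
query = {"TP53", "BRCA1", "ATM", "CHEK2", "MDM2", "CDKN1A"}
library_set = {"TP53", "ATM", "CHEK2", "MDM2", "CDKN1A", "CDK2", "RB1", "E2F1"}

overlap = len(query & library_set)
table = [
    [overlap, len(query) - overlap],
    [len(library_set) - overlap, universe - len(query | library_set)],
]
odds, p = fisher_exact(table, alternative="greater")
print(f"overlap = {overlap}, odds ratio = {odds:.1f}, p = {p:.2e}")
```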

Enrichr currently contains the largest collection of annotated gene sets, organized into >100 gene-set libraries, many of which are unique to Enrichr. Since its initial publication, Enrichr has been heavily used by the community, with millions of gene-set queries submitted by >100,000 users. Overall, Enrichr is becoming an important resource for the genomics research community. Enrichr for human, mouse, and other species is freely available online.

ARCHS4

Three dimensional visualization of 130,000 human RNA-Seq samples in the ARCHS4 dataset. Samples derived from the human cell line K562, brain tissue, and macrophages are highlighted showing distinct clusters in gene expression space.

Uniform Alignment of All Publicly Available RNA-seq Data

RNA sequencing (RNA-seq) is now the leading technology for genome-wide transcript quantification, and RNA-seq data from thousands of experimental studies are made publicly available. However, publicly available RNA-seq data are currently provided mostly in raw form, a significant barrier to global and integrative retrospective analyses. ARCHS4 is a web resource that makes the majority of published RNA-seq data from human and mouse available at the gene and transcript levels. To build ARCHS4, available FASTQ files from RNA-seq experiments in the Gene Expression Omnibus (GEO) were aligned using a cloud-based infrastructure.

ARCHS4 has a web interface that supports intuitive exploration of the processed data through querying tools, interactive visualization, and gene pages that provide average expression across cell lines and tissues, top co-expressed genes for each gene, and predicted biological functions and protein-protein interactions for each gene based on prior knowledge combined with co-expression.
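ARCHS4 also distributes its processed data as downloadable HDF5 files, which can be sliced locally with h5py. The file name and internal dataset paths in this sketch are assumptions based on one release of ARCHS4 and may differ in current versions; check the download page for the actual layout.

```python
# Sketch: load expression for selected samples from a local ARCHS4 HDF5 file.
# NOTE: file name and dataset paths below are assumed and may differ by release.
import h5py
import numpy as np

with h5py.File("human_matrix.h5", "r") as f:          # downloaded from the ARCHS4 site
    tissue = f["meta/samples/source_name_ch1"][...].astype(str)  # assumed path
    idx = np.flatnonzero(np.char.find(np.char.lower(tissue), "brain") >= 0)
    expr = f["data/expression"][:, idx]               # assumed genes x samples layout

print(f"loaded {expr.shape[1]} brain samples x {expr.shape[0]} genes")
```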

In the process of building the capability to process data at this scale, Icahn Institute researchers developed a cost-efficient cloud computing infrastructure, which enables us to offer free RNA-seq data processing services. The uniform processing provides a new level of interoperability for high-throughput gene expression analysis and enables integrative reuse of these data by researchers around the world.

Multiscale Modeling Tools

A differential correlation subnetwork, identified with DGCA, that is specific to estrogen receptor-positive breast cancer and absent in triple-negative breast cancer. McKenzie AT, Katsyv I, Song W-M, Wang M, Zhang B (2016) DGCA: A Comprehensive R Package for Differential Gene Correlation Analysis. BMC Systems Biology 10:106. PMID: 27846853.

Multiscale Embedded Gene Co-expression Network Analysis (MEGENA) efficiently constructs and analyzes large-scale planar filtered co-expression networks. Two key components of MEGENA are the parallelization of embedded network construction and the identification of multiscale clustering structures. MEGENA complements established co-expression and causal network approaches such as WGCNA and ARACNE through its ability to sparsify densely connected co-expression networks, determine multiscale modular structures, and identify key network drivers in a seamless manner.

The R package, Differential Gene Correlation Analysis (DGCA), offers a suite of tools for computing and analyzing differential correlations between gene pairs across multiple conditions. To minimize parametric assumptions, DGCA computes empirical p-values via permutation testing. To understand differential correlations at a systems level, DGCA performs higher-order analyses such as measuring the average difference in correlation and multiscale clustering analysis of differential correlation networks.
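The core differential-correlation computation can be sketched in a few lines: correlate a gene pair in each condition, Fisher z-transform the two correlations, and test their difference. DGCA itself goes further (empirical permutation p-values, higher-order module analyses); the data below are synthetic.

```python
# Sketch of a differential correlation test between two conditions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 100
# Gene pair correlated in condition A but not in condition B (synthetic)
x_a = rng.normal(size=n); y_a = 0.8 * x_a + 0.6 * rng.normal(size=n)
x_b = rng.normal(size=n); y_b = rng.normal(size=n)

r_a = np.corrcoef(x_a, y_a)[0, 1]
r_b = np.corrcoef(x_b, y_b)[0, 1]

# Fisher z-transform each correlation, then test the difference
z_a, z_b = np.arctanh(r_a), np.arctanh(r_b)
se = np.sqrt(1 / (n - 3) + 1 / (n - 3))
z_diff = (z_a - z_b) / se
p = 2 * stats.norm.sf(abs(z_diff))
print(f"r_A = {r_a:.2f}, r_B = {r_b:.2f}, difference p = {p:.2e}")
```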

We developed a theoretical framework for computing the statistical distributions of multi-set intersections based on combinatorial theory, and designed a procedure to efficiently calculate the exact probabilities of multi-set intersections. We further developed efficient, scalable techniques to visualize multi-set intersections and the corresponding intersection statistics. Both the theoretical framework and the visualization techniques are implemented in a unified R package, SuperExactTest.
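For intuition, the question SuperExactTest answers exactly can be approximated by Monte Carlo: how surprising is the observed overlap of several sets drawn from a common background? The set sizes and observed overlap below are invented, and the package computes this probability analytically rather than by simulation.

```python
# Monte Carlo illustration of multi-set intersection significance.
import numpy as np

rng = np.random.default_rng(8)
background, set_sizes, observed_overlap = 20000, [500, 400, 300], 20

def random_overlap():
    sets = [set(rng.choice(background, size=s, replace=False)) for s in set_sizes]
    return len(set.intersection(*sets))

draws = np.array([random_overlap() for _ in range(2000)])
p = (np.sum(draws >= observed_overlap) + 1) / (len(draws) + 1)
print(f"mean null overlap = {draws.mean():.2f}, empirical p = {p:.4f}")
```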

Network Visualization

iCAVE: an open source tool for visualizing biomolecular networks in 3D, stereoscopic 3D, and immersive 3D. Liluashvili V, Kalayci S, Fluder E, Wilson M, Gabow A, Gümüş ZH (2017) Gigascience 6(8):1-13. Kalayci S, Gümüş ZH (2018) Current Protocols in Bioinformatics 61(1):8.27.1-8.27.26.

Visualizations of biomolecular networks assist in systems-level data exploration in myriad cellular processes. Data generated from high-throughput experiments increasingly inform these networks, yet current tools do not adequately scale with the concomitant increase in their size and complexity. Interactome-CAVE (iCAVE) is an open source software platform for visualizing large and complex biomolecular interaction networks in 3D. Users can explore networks (i) in 3D using a desktop, (ii) in stereoscopic 3D using 3D-vision glasses and a desktop, or (iii) in immersive 3D within a CAVE environment. iCAVE introduces 3D extensions of known 2D network layout, clustering, and edge-bundling algorithms, as well as new 3D network layout algorithms. Furthermore, users can simultaneously query several built-in databases within iCAVE for network generation or visualize their own networks (e.g., disease, drug, protein, metabolite). iCAVE has a modular structure that allows rapid development by addition of algorithms, datasets, or features without affecting other parts of the code. Overall, iCAVE is the first freely available open source tool that enables 3D (optionally stereoscopic or immersive) visualizations of complex, dense, or multi-layered biomolecular networks. While primarily designed for researchers utilizing biomolecular networks, iCAVE can assist researchers in any field.
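The basic idea behind 3D network layout can be sketched with off-the-shelf tools (this is not iCAVE itself, which adds stereoscopic and immersive rendering plus 3D clustering and edge bundling): compute a 3D force-directed layout with networkx and render it with matplotlib.

```python
# Sketch: 3D force-directed layout of a toy network (not iCAVE).
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt

G = nx.barabasi_albert_graph(100, 2, seed=0)          # scale-free toy network
pos = nx.spring_layout(G, dim=3, seed=0)              # 3D force-directed layout
xyz = np.array([pos[v] for v in G])

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for u, v in G.edges():                                # draw edges, then nodes
    ax.plot(*zip(pos[u], pos[v]), color="gray", linewidth=0.4)
ax.scatter(xyz[:, 0], xyz[:, 1], xyz[:, 2], s=15)
plt.show()
```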

REVEL

Performance of REVEL and other meta-predictors for discriminating disease and neutral exome sequencing variants, stratified by the neutral allele frequency.

The vast majority of variants discovered by high-throughput sequencing are rare. Recent exome and genome sequencing studies have found that approximately 85% of nonsynonymous variants have alternate allele frequencies (AFs) below 0.5%, and that roughly 100-400 rare nonsynonymous variants are discovered per sequenced individual. However, most nonsynonymous variants discovered by sequencing studies are of unknown significance, because experimental validation of large numbers of rare variants is currently infeasible, and association studies require prohibitively large sample sizes to detect rare variants with modest effect sizes at high statistical power. Computational tools that can accurately predict the pathogenicity of rare variants are therefore needed to help identify the variants most likely to cause disease.

Icahn Institute scientists developed REVEL (Rare Exome Variant Ensemble Learner), an ensemble method for predicting the pathogenicity of missense variants by combining functional prediction and conservation scores from individual tools: MutPred, VEST, PolyPhen, SIFT, PROVEAN, FATHMM, MutationAssessor, MutationTaster, LRT, GERP, SiPhy, phyloP, and phastCons. REVEL has been independently shown to outperform existing methods, especially for rare variants. REVEL can be used to help prioritize the most likely clinically or functionally relevant variants among the sea of rare variants discovered as sequencing studies expand in scale. Pre-computed scores for all human missense variants are available online.
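The ensemble idea behind REVEL – a random forest trained on the component tools' scores – can be sketched as follows. The score matrix and pathogenicity labels here are synthetic stand-ins; the real model was trained on curated disease and neutral variants.

```python
# Sketch: an ensemble meta-predictor combining component tool scores
# (synthetic data; illustrates the approach, not the trained REVEL model).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(9)
n_variants, n_tools = 1000, 13                   # 13 component scores, as in REVEL
scores = rng.uniform(size=(n_variants, n_tools))
pathogenic = (scores.mean(axis=1) + rng.normal(scale=0.1, size=n_variants)) > 0.5

rf = RandomForestClassifier(n_estimators=300, random_state=0)
auc = cross_val_score(rf, scores, pathogenic, cv=5, scoring="roc_auc")
print(f"cross-validated AUROC: {auc.mean():.2f}")
```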

Scalable Single-cell RNA-seq Analysis

Scaling single-cell RNA-seq analysis to billions of cells will require new computational architectures that leverage distributed computing frameworks and cloud storage. This requires defining new storage formats for large expression matrices along with implementing common analytical pipelines with big data frameworks like Apache Spark or Dask.

The number of single-cell transcriptomes (and other -omes) is poised to explode as advances in experimental protocols couple with continuing improvements in genomic sequencing throughput. We expect that within several years, public data repositories will contain billions of single-cell transcriptomes. Today, most single-cell analysis tools rely on a traditional single-node, in-memory computing model in environments such as R and Python, and the availability of million-cell datasets is already proving challenging for common analytical pipelines and exploratory data analysis on single nodes.

At the Icahn Institute, we are applying big-data best practices to implement scalable data storage and distributed processing for single-cell RNA-seq applications. Specifically, we are prototyping new storage formats for large gene-cell expression matrices that are more compatible with distributed cloud object storage, including support for parallel reads and writes. We are also experimenting with multi-node distributed execution engines to parallelize common operations on gene-cell count matrices, as implemented in frameworks like Scanpy and Seurat. For example, we are developing prototype backend implementations for Scanpy that demonstrate common workflows using both Dask and Apache Spark. Ultimately, we hope to contribute backend implementations for Scanpy and Seurat that let users scale their pipelines across clusters of machines and achieve linear scalability of their computations. By contributing directly to popular R and Python frameworks for scRNA-seq, we hope to enable other groups to build on our infrastructure and easily achieve horizontal scalability.

Our group is also passionate about building high-quality software using software engineering best practices; several members of our group have experience in the tech industry. We will contribute publicly on GitHub, implement continuous integration testing, publish software artifacts to standard repositories (e.g., PyPI, Maven Central, CRAN), and document public APIs.
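As a small taste of the distributed-execution idea, the sketch below uses Dask to chunk a synthetic cell-by-gene count matrix and run a common scRNA-seq operation (per-cell count normalization) in parallel. The matrix size and chunking are placeholders, and this is independent of our actual Scanpy backend prototypes.

```python
# Sketch: chunked, parallel per-cell normalization of a count matrix with Dask.
import dask.array as da
import numpy as np

# Synthetic cell x gene count matrix, chunked so work parallelizes across cores
counts = da.random.poisson(1.0, size=(100_000, 2_000),
                           chunks=(10_000, 2_000)).astype(np.float32)

per_cell = counts.sum(axis=1, keepdims=True)          # library size per cell
normalized = counts / per_cell * 1e4                  # counts-per-10k scaling
log_norm = da.log1p(normalized)

gene_means = log_norm.mean(axis=0).compute()          # triggers parallel execution
print(gene_means[:5])
```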

Microbiome Analysis

Identifying temporally similar microbiomes. Clustering of subjects after an intervention identifies three types of responses based on microbiome changes over time.

Research in the human microbiome has produced an unprecedented amount of data linking imbalances in our bacterial communities to immune, metabolic, and even neurological disorders. Translating these data into actionable knowledge in the clinic and the laboratory is critical for developing novel diagnostic tools and therapeutic approaches based on the microbiome.

The Icahn Institute is at the forefront of microbiome data analysis, having developed and contributed algorithms and software tools widely used in the field. We are developing novel methods to characterize bacterial species and strains with high accuracy, which will allow us to determine precisely which microbes are transferred from mothers to infants at birth, or from donors to inflammatory bowel disease patients through fecal transplantation. We are also developing new algorithms to measure the temporal similarity of different microbiota and applying them to identify early responses to treatments and interventions. Our work is supported by Mount Sinai's state-of-the-art Scientific Computing facilities and infrastructure, which allow us to analyze large-scale microbiome datasets and better understand the role of bacteria in human health and disease.
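A simplified stand-in for the temporal-similarity idea, echoing the figure above: treat each subject's sequence of community profiles as a trajectory, compute pairwise dissimilarities between trajectories, and cluster subjects into response groups. The data, Bray-Curtis metric, and number of groups are illustrative choices, not our published algorithms.

```python
# Sketch: cluster subjects by microbiome trajectory similarity (synthetic data).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(10)
n_subjects, n_timepoints, n_taxa = 30, 8, 50
# Synthetic relative-abundance trajectories, flattened per subject
traj = rng.dirichlet(np.ones(n_taxa), size=(n_subjects, n_timepoints))
flat = traj.reshape(n_subjects, -1)

dist = pdist(flat, metric="braycurtis")          # trajectory-level dissimilarity
groups = fcluster(linkage(dist, method="average"), t=3, criterion="maxclust")
print("subjects per response group:", np.bincount(groups)[1:])
```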

Whole-Cell Modeling Toolkit

The whole-cell (WC) modeling toolkit will enable researchers to build, simulate, and apply comprehensive whole-cell computational models of individual cells.

The whole-cell (WC) modeling toolkit will be a suite of software tools for building, simulating, and analyzing comprehensive WC models of individual cells. The toolkit will enable researchers to build unprecedented models of bacteria and human cells for basic science, precision medicine, and regenerative medicine.

The Icahn Institute is leading WC modeling by developing the first tools for the field. In 2012, Icahn Institute scientists demonstrated the feasibility of WC modeling and provided a roadmap for a WC modeling toolkit in Cell. Over the past few years, Icahn Institute scientists have been developing the first WC modeling toolkit, beginning with tools for organizing the data needed for WC modeling, for organizing and sharing WC simulation results, and for visualizing WC simulation results. To enable more comprehensive and more predictive models, Icahn Institute scientists are currently developing tools for building, representing, and simulating composite, multi-algorithmic WC models. Together, these tools will enable researchers to build unprecedented models of bacteria and human cells, which could help clinicians personalize medicine and help engineers design novel microorganisms.