About PGRR

Environment and Resources

 

The Pittsburgh Genome Resource Repository (PGRR):

 

The Pittsburgh Genome Resource Repository provides data management and computing infrastructure to support biomedical investigation using Big Data. PGRR is funded by the Institute for Personalized Medicine (IPM) and the University of Pittsburgh Cancer Institute (UPCI) and involves collaboration among faculty and staff from IPM, UPCI, the Department of Biomedical Informatics (DBMI), the University of Pittsburgh Center for Simulation and Modeling (SaM), the Pittsburgh Supercomputing Center (PSC), and the University of Pittsburgh Medical Center (UPMC). Figure 1 provides a conceptual overview of the PGRR infrastructure as it is used to manage TCGA data.

 

 

Background:

 

Publicly available datasets such as The Cancer Genome Atlas (TCGA) are extremely valuable and are used for numerous purposes, including discovery of new biomarkers, validation of new methods, and education and training. TCGA currently contains mutation, CNV, miRNA, gene expression, and microarray data on 11,054 participant cases across 34 tumor types and is expected to grow to 20,000 or more participant cases. The total size of this dataset is currently 1,083 TB (as of 06/02/2015). The University of Pittsburgh/UPMC is the single largest contributing institution to the TCGA study, contributing approximately 7% of all current TCGA data.

 

For most current users of the large TCGA dataset (and many similar datasets), utility of the dataset is constrained by three important barriers: (1) the data is used under a Data Use Certificate (DUC) from dbGaP, which is most often limited to a small number of investigators. PGRR utilizes an existing, collaborative PGRR Data Use Certificate (currently including 75 investigators) to provide data to individuals whose research is encompassed by our dbGaP research description, without the need for separate DUCs; (2) downloads of TCGA data (especially BAM files) are limited by slow transfer rates from distribution sources (e.g., CGHub), as well as by slow movement of files between Pitt and UPMC. PGRR reduces this transfer burden by centralizing TCGA data in a unique distributed file system shared between SaM and PSC, thus minimizing the time needed to transfer files to analysis sites; (3) TCGA data requires hundreds of terabytes to petabytes of storage, with the resulting expenditure. PGRR limits duplication of files, which reduces storage requirements and simplifies control over provenance.

 

Regulatory Compliance:

 

Development and use of the PGRR has been approved by the University of Pittsburgh Institutional Review Board (PRO12090374). The use of this data at UPMC is covered by a separate Data Use Agreement between the University of Pittsburgh and UPMC. The hosting of this data by the Pittsburgh Supercomputing Center is covered under a contracted services agreement with CMU.

 

As described in the PGRR IRB protocol (PRO12090374), when individual consents permit, data from UPMC patients are re-identified by UPMC honest brokers, and rich clinical data is linked through the Oracle Translational Research Center Data Warehouse and associated tooling. Data provided back to researchers by UPMC is limited to de-identified data only. To date, this process has been performed for 139 consented breast cancer patients from UPMC whose tumor samples and de-identified clinical data were contributed to TCGA.

 

TCGA data management:

 

We have developed an application to fully manage the download of all TCGA data and metadata, both open and protected, from all relevant NCI repositories. The PGRR software, developed by DBMI with funding from the Institute for Personalized Medicine, provides the following functionality:

 

(1)  Runs weekly to scrape essential metadata from various public sources; this metadata is not available in any other format at the present time. The scripts update the graphs in the RDF store for Disease Study, Tissue Source Site, Center, Sample Type, and Portion Analyte. They also identify new, modified, and deleted files for all data types and are therefore able to manage the versioning process.

(2)  Because of the massive size of the BAM files, we have written scripts to download these files on demand. BAM files are stored as separate analysis types for the corresponding sample. All data transmission is secured.

(3)  Manages generation of the Pitt/UPMC version. The software compares any existing file version in PGRR storage with the newly downloaded version and manages the provenance and metadata in the PostgreSQL database. We update the PGRR version of the data only when there is a meaningful change in the data or metadata (a minimal sketch of this check appears after this list).
 

(4)  Creates PGRR data stores and archives using standard directory structure, naming, and versioning conventions. The PGRR directory has a sample-centric structure: analyses performed for a particular sample are stored in a sample/dataType/center-platform directory (Figure 2).

 

 

Figure 2. PGRR File Directory Structure

 

VCF data harmonization is accomplished by processing VCF headers and data using programs we have written in Java. In TCGA, VCF file structure varies from one analysis center to another: files differ in header information, number of samples per file, number of files per sample, and number of variant callers used. We created a standard VCF file header that contains fileformat, filedate, center, platform, genome reference name, genome reference URL, patient_id, and specimen_id, followed by any additional information from the original data file. We also ensure that the reference positions (POS) are sorted numerically within each CHROM sequence (a minimal sketch of this normalization appears after this list).

(5)  We use a PostgreSQL database to store and query TCGA data and metadata.
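
The version check described in items (3) and (5) can be sketched as follows. This Python fragment is illustrative only: the pgrr_file_version table, its columns, and the use of a file checksum as the "meaningful change" criterion are assumptions for demonstration, not the actual PGRR schema or logic.

    import hashlib
    import psycopg2  # PostgreSQL driver assumed for this sketch

    # NOTE: pgrr_file_version and its columns are a hypothetical example schema.

    def file_checksum(path):
        """SHA-256 of a file, read in chunks so that large TCGA files fit in memory."""
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def register_if_changed(conn, sample_id, data_type, path):
        """Insert a new version row only when the file content has actually changed."""
        checksum = file_checksum(path)
        with conn.cursor() as cur:
            cur.execute(
                """SELECT checksum FROM pgrr_file_version
                   WHERE sample_id = %s AND data_type = %s
                   ORDER BY version DESC LIMIT 1""",
                (sample_id, data_type),
            )
            row = cur.fetchone()
            if row and row[0] == checksum:
                return False  # no meaningful change; keep the existing PGRR version
            cur.execute(
                """INSERT INTO pgrr_file_version (sample_id, data_type, checksum, path, version)
                   VALUES (%s, %s, %s, %s,
                           COALESCE((SELECT MAX(version) + 1 FROM pgrr_file_version
                                     WHERE sample_id = %s AND data_type = %s), 1))""",
                (sample_id, data_type, checksum, path, sample_id, data_type),
            )
        conn.commit()
        return True

Any comparable criterion (for example, file size together with repository modification metadata) could substitute for the checksum; the point is that a new PGRR version is created only when this check detects a change.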
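
The VCF harmonization described above can be outlined in the same way. The production PGRR tools are written in Java; the Python fragment below is only a minimal sketch of the header standardization and per-CHROM POS sorting, and the standard header values shown are placeholders rather than real PGRR metadata.

    from collections import defaultdict

    # Standard header fields follow the list given above; all values are placeholders.
    STANDARD_HEADER = [
        "##fileformat=VCFv4.1",
        "##filedate=20150602",
        "##center=EXAMPLE_CENTER",
        "##platform=EXAMPLE_PLATFORM",
        "##reference.name=GRCh37",
        "##reference.url=http://example.org/GRCh37.fa",
        "##patient_id=EXAMPLE_PATIENT",
        "##specimen_id=EXAMPLE_SPECIMEN",
    ]

    def normalize_vcf(in_path, out_path):
        """Rewrite a VCF with a standardized header and POS sorted within each CHROM."""
        extra_header, column_line, records = [], None, defaultdict(list)
        with open(in_path) as fh:
            for line in fh:
                line = line.rstrip("\n")
                if line.startswith("##"):
                    extra_header.append(line)          # keep original metadata after the standard block
                elif line.startswith("#CHROM"):
                    column_line = line
                elif line:
                    fields = line.split("\t")
                    records[fields[0]].append(fields)  # group data lines by CHROM
        with open(out_path, "w") as out:
            out.write("\n".join(STANDARD_HEADER + extra_header) + "\n")
            out.write(column_line + "\n")
            for chrom in records:                      # CHROM sequences keep their original order
                for fields in sorted(records[chrom], key=lambda f: int(f[1])):  # numeric sort on POS
                    out.write("\t".join(fields) + "\n")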

 

Institute for Personalized Medicine Portal:

 

IPM provides an intranet portal for investigators to access files and to request notification of file changes. The IPM intranet portal is developed using the Drupal content management system, a REST-based API, and SQL queries to the PGRR database. The portal provides access to TCGA files stored on the Data Supercell, where data can be easily accessed with relevant tools deployed at the University of Pittsburgh Center for Simulation and Modeling (e.g., the GenomOncology analytic platform, CLC Bio, etc.). The IPM portal also provides notification mechanisms based on our RDF metadata graph (see Figure 3). Investigators are able to identify when new samples are available and to request notification by email when a new version of a dataset of interest has been created at IPM. Thus, if a patient is deleted from the TCGA dataset at NCI, if a new set of samples appears, or if a file has been updated with different data, the resulting changes to the PGRR metadata repository are immediately communicated, and the investigator can take appropriate action, including re-running the analysis.

 

Figure 3: Notification matrix of IPM portal showing selection of datatypes and tumor types.
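
As an illustration of how such a notification check might consult the RDF metadata graph, the sketch below issues a SPARQL query for recently updated file versions. The endpoint URL, the pgrr: vocabulary, and the property names are hypothetical placeholders; the actual PGRR graph schema may differ.

    from SPARQLWrapper import SPARQLWrapper, JSON

    ENDPOINT = "http://pgrr.example.org/sparql"  # hypothetical SPARQL endpoint

    QUERY = """
    PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
    PREFIX pgrr: <http://pgrr.example.org/schema#>   # hypothetical vocabulary
    SELECT ?sample ?dataType ?version ?updated
    WHERE {
      ?file pgrr:sample   ?sample ;
            pgrr:dataType ?dataType ;
            pgrr:version  ?version ;
            pgrr:updated  ?updated .
      FILTER (?updated > "2015-06-01T00:00:00Z"^^xsd:dateTime)
    }
    ORDER BY DESC(?updated)
    """

    def changed_files_since_last_check():
        """Return metadata for file versions updated since the date in the FILTER clause."""
        sparql = SPARQLWrapper(ENDPOINT)
        sparql.setQuery(QUERY)
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        return [
            {name: binding[name]["value"] for name in binding}
            for binding in results["results"]["bindings"]
        ]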

 

UUIDs and Metadata:

We are currently moving to UUID identification across the PGRR archive. At present, some data types, such as VCF files, DNA methylation, and gene expression, contain TCGA barcodes but may or may not contain sample UUIDs. We have written the software necessary to assign UUIDs to all samples from their respective clinical data. Thus, we are creating a single uniform metadata annotation that can be leveraged to analyze data across cancer types, which is not possible with the existing TCGA data.
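
A minimal sketch of this barcode-to-UUID assignment is shown below. The clinical metadata file and its column names are assumptions used for illustration; the actual PGRR inputs may be structured differently.

    import csv

    def load_barcode_to_uuid(biospecimen_tsv):
        """Build a barcode -> UUID map from a clinical/biospecimen metadata table.

        The column names below are illustrative assumptions.
        """
        mapping = {}
        with open(biospecimen_tsv, newline="") as fh:
            for row in csv.DictReader(fh, delimiter="\t"):
                mapping[row["bcr_sample_barcode"]] = row["bcr_sample_uuid"]
        return mapping

    def sample_uuid_for(barcode, mapping):
        """Return the UUID recorded in the clinical data for a TCGA sample barcode."""
        try:
            return mapping[barcode]
        except KeyError:
            raise KeyError("No UUID recorded in clinical data for " + barcode)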

 

GenomOncology software:

 

PGRR provides an access point for scripted uploads of TCGA files into the GenomOncology (GO) analysis and visualization software. Intended largely for translational research end users, the GO software, to be deployed at SaM this month, provides an easy-to-use analytics portal that we can leverage as a platform to visualize results of any analytic process, including those created as part of this BD2K initiative. The GO software enables translational researchers to explore annotation-based filtering of NGS-based datasets, including DNA sequencing (SNP/indel) data, expression data (RNA-seq), and methylation data. The software allows comparison of results among several different individuals within a project and allows implementation of filtering schemes against common databases (such as dbSNP or the 1000 Genomes data) as well as against custom databases.

 

The current GenomOncology software provides the following functionality:

 

  • Interactively view and analyze all types of next generation sequencing data including variation data from SNPs, structural variants, copy number variants, loss of heterozygosity regions, relative expression (from RNA-Seq) and/or epigenetic data, and analyze their combined impact on genes and pathways.
  • Interactively examine variations in one genome, in a group of genomes (hundreds), or between sets of genomes, in seconds. For example, users can find the intersection of non-synonymous variations in COSMIC genes with gene expression quickly.
  • Users can "filter" variations dynamically to focus on features of biological interest (Pathways, COSMIC, OMIM, TCGA etc.) Examples include filtering out SNPs by allele frequency, focusing on those found in genes represented in the COSMIC database, in a subset of COSMIC by cancer type, in a subset of pathways, genes with a copy number change or expression change, etc.
  • View data at the level of base pairs, genes and/or pathways, both in tables and in maps.  The read alignment from the associated BAM file can be edited to remove spurious variants from further analysis.
  • Analyze 1 or many genomes and save analysis information including the ability to add notes / annotations as well as save a set of filters as a template for further analysis in the future. 
  • Use genomic analyses to create a report, based on a template tailored to the user’s institution, or department.  The report contains findings, notes and recommendations that can be delivered directly.
  • Export data tables or analysis sets in a variety of formats (e.g., R) for further analysis or publication.
  • Review Responder / Non-Responder groups (including grouping genomes into sets by phenotypic data) and examine variations shared within the group but not between them, using user-defined percentages of each group.   
  • Use two side by side windows to compare different analyses or parameters simultaneously.
  • Find genomes by tumor type, patient ID, or analysis information.
  • End-to-end integration with the Galaxy open source platform enables various tasks/workflows (variant calling, etc.) to be performed and their end results loaded into the GenomOncology platform.
  • Compare various data sets (e.g., examine different variant-caller outputs to determine the best set for each lab based on the type of data available).

 

GenomOncology TCGA Key Features:

  • Analyze data against a reference database such as TCGA or the 1000 Genomes data.
  • The platform supports all available TCGA data types, fully integrated for analysis (CNV, SNP, mutation, RNA-seq, gene expression, etc.).
  • Analyze protected TCGA data and partition data access based on DUC and user roles.
  • Associate phenotypic or clinical data with reference database samples for use in analyses (e.g., all patients with a given variant who are also estrogen receptor positive).
  • Annotate variation data in a specific dataset (e.g., all Pitt data) with information from analogous variation data in reference databases (TCGA).
  • Interact with data in parallel or in combination with TCGA samples using all platform functionality.

 

Investigators may also use other centrally deployed tools (e.g. CLC Bio) to process this data.

 

The UPMC Enterprise Data Warehouse

 

When patient consents permit (as determined by CARe and the IRB), PGRR orchestrates the copying and movement of TCGA files derived from UPMC patients (~9% of total TCGA participants) into the UPMC Enterprise Analytics Data Warehouse. Files obtained from Next Generation Sequencing (NGS) platforms can be loaded by script into the Oracle Translational Research Center (TRC) Omics Data Bank (ODB), a rich relational model for NGS data. Phenotype data derived from health system data is stored in the Oracle Healthcare Data Warehouse Framework (HDWF) and associated with omics data on the subset of patients from UPMC within the Cohort Datamart (CDM). TRC also includes associated tooling: the Clinical Development Center (CDC), Oracle Cohort Explorer, and Oracle R.

 

The Enterprise Analytics Data Warehouse is positioned behind the UPMC firewall, is subject to UPMC security scans, and meets all dbGaP security requirements. Data available to researchers will be completely de-identified; only honest brokers and those with IRB approval will have access to identified data. In addition to building partnerships with Oracle, Informatica, and IBM, UPMC has established a Data Governance Program whose mission is to collect, change, store, move, consume, and release UPMC data assets efficiently, accurately, and legally.

 

User Management

 

Requests to collaborate as part of an IPM Data Use Certificate are considered on a quarterly basis. Access is granted if and when changes to our DUC are approved by dbGaP.

 

Security and Access Control

 

All cluster nodes at SaM are secured behind the University firewall, which permits only encrypted communication from the external network (using SSH). All extraneous services are controlled at the firewall, including email, file sharing, printing, etc., so that the cluster is as secure as possible. All user accounts are password-restricted, with strong password policies dictating the content of passwords and requiring password rotation, and all data on the cluster is restricted using access-control lists so that only appropriate project members have access to a particular project's data.
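
As a simple illustration of per-project access control of the kind described above, the sketch below restricts a project directory to a single POSIX group using standard chmod/setfacl commands driven from Python. The group name, directory, and specific policy are illustrative assumptions; actual SaM account and ACL management may differ.

    import subprocess

    def restrict_to_project_group(project_dir, project_group):
        """Limit a project directory to members of one group (illustrative policy)."""
        # Remove access for users outside the owning user and group.
        subprocess.run(["chmod", "-R", "o-rwx", project_dir], check=True)
        # Grant the project group read (and directory-traverse) access via a POSIX ACL.
        subprocess.run(["setfacl", "-R", "-m", "g:%s:rX" % project_group, project_dir], check=True)
        # Set a default ACL on the top-level directory so newly created files inherit it.
        subprocess.run(["setfacl", "-d", "-m", "g:%s:rX" % project_group, project_dir], check=True)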

 

Hardware:

 

Simulation and Modeling Center

 

The University of Pittsburgh’s Center for Simulation and Modeling (SAM) is the premier shared high-performance computing (HPC) facility in the University community, and represents investments of hardware and human capital from several University Schools/Departments. SAM operations are overseen by three directors (senior faculty members from the Schools of Engineering, Arts and Sciences, and Health Sciences), and are facilitated through the activities of five consultants (faculty members from various departments). The Center also serves as a collaboration portal, having assembled a group of more than 50 collaborators from across the University who are engaged in computational research in Chemistry, Biology, Physics, Astronomy, Mathematics, Computer Science, Economics and several of the departments in the Swanson School of Engineering, as well as faculty from the Schools of Public Health, Medicine, and the Graduate School of Public and International Affairs. SAM team members are responsible for preparing training and educational material, teaching, cluster user support and consulting, and focused software development and research support for various projects at Pitt. They have extensive experience in administering high-performance computing clusters and in providing support to a large community of users, and further represent expertise in molecular modeling/dynamics, open source programming, parallel processing/programming, GPU-based processing/programming, grid computing, high-throughput data-intensive computing, and various areas of theoretical and computational science and engineering. SAM provides user support, training, and project management services on a continual basis through web 2.0 based platforms (http://core.sam.pitt.edu and http://collab.sam.pitt.edu), as well as organizing year round workshops and training sessions on cluster usage, parallel programming, and various topics in HPC based research. The Center also acts as liaison for national computational resources, through partnerships with the Pittsburgh Supercomputing Center and the NSF/XSEDE Campus Champions program.

 

SAM provides in-house HPC resources allocated for shared usage free of charge for campus researchers. Computational resources consist of a heterogeneous grid/cluster comprising 200 8-core Intel Nehalem and Harpertown, 45 12-core Intel Westmere, 23 48-core AMD Magny-Cours, 82 16-core Intel Sandy Bridge, and 110 32-core AMD Interlagos compute nodes, adding up to a total of 8,076 computation-only CPU cores, with a maximum of 128 GB of shared memory per node (1.5-8 GB per core). Several Nehalem nodes also have general-purpose NVIDIA GPU accelerator cards, for a total of 16 GPU cards comprising 5,504 GPU cores. Most nodes are connected via fast, low-latency InfiniBand network fabrics. Process and resource allocation is managed using the PBS/Moab suite. Local (temporary) storage on the compute nodes is typically 1-3 TB. Users' home directories are maintained on an 80 TB RAID5 NAS unit, with a redundant array providing online snapshots/backup. A high-performance 120 TB Panasas storage array is also available for processes requiring fast distributed disk access.

 

Pittsburgh Supercomputing Center

 

For applications requiring very large shared memory, high-productivity programming models, and/or moderate parallelism with a high-performance system-wide interconnect, PSC operates Blacklight, an SGI UV 1000 cc-NUMA shared-memory system comprising 256 blades. Each blade shares 128 GB of local memory and holds two Intel Xeon X7560 (Nehalem) eight-core processors, for a total of 4,096 cores and 32 TB of memory across the whole system. Each core has a clock rate of 2.27 GHz, supports two hardware threads, and can perform 9 Gflop/s, for a total system floating-point capability of 37 Tflop/s. Up to 16 TB of this memory is accessible as a single memory space to a shared-memory program. Message-passing and PGAS programs can access all 32 TB on the system. Blacklight is part of the National Science Foundation XSEDE integrated national system of cyberinfrastructure and also serves as the computational component for advanced analytics in the Data Exacell pilot.

 

Sherlock is a YarcData Urika™ (Universal RDF Integration Knowledge Appliance) data appliance with PSC enhancements. It enables large-scale, rapid graph analytics through massive multithreading, a shared address space, sophisticated memory optimizations, a productive user environment, and support for heterogeneous applications. Sherlock consists of both YarcData Graph Analytics Platform (formerly known as next-generation Cray XMT™) nodes and Cray XT5 nodes with standard x86 processors. Sherlock contains 32 YarcData Graph Analytics Platform nodes, each containing 2 Threadstorm 4.0 (TS4) processors, a SeaStar 2 (SS2) interconnect ASIC, and 32 GB of RAM. Aggregate shared memory is 1 TB, which can accommodate a graph of approximately 10 billion edges. The TS4 processors and SS2 interconnect contain complementary hardware advances specifically for working with graph data. These include support for 128 hardware threads per processor (to mask latency), extended memory semantics, a system-wide shared address space, and sophisticated optimizations to prevent “hotspots” involving contention for data. Sherlock supports complex graph analytics without programming using RDF and SPARQL. Sherlock’s Graph Analytics Platform nodes are also fully programmable using C++ and C to implement custom applications, and its XT5 nodes provide full Linux, including Java, Python, Fortran, and other languages, to support heterogeneous applications, graphical user interfaces, and interaction with other systems. Funded as an NSF Strategic Technologies for Cyberinfrastructure (STCI) project, Sherlock is also being incorporated into the Data Exacell pilot.

 

For persistent storage of information, PSC’s Data Supercell is a disk-only file repository that is less costly than a disk-tape archive system and provides much faster file access. Each building block in the repository has one petabyte of usable disk storage, which is managed by the ZFS file system and the PSC-developed SLASH2 replicating distributed file system. ZFS and SLASH2 provide multiple layers of robust data integrity checking to protect user data against corruption. This building-block architecture will enable the repository to scale well beyond its initial deployment of four petabytes.