
Seminars and Events

Past Biostatistics Seminars

2023-11-01, Kip Handwerker MSc, A Clustering Approach to Non-Equal Length Joint Pattern Genetic and Epigenetics Factors Weighted by Covariates
University of Memphis Biostatistics PhD Student. Clustering analysis is a popular approach to gaining insight into the structure of data, especially on a large scale. Some of the most popular approaches are the K-means and K-prototype algorithms, which are partitioning methods that use distance measures to assign groups. While these methods work well, especially for large datasets, with genetics data they fail to consider potential joint effects and require the same dimensionality across variables. The Vector in Partition (VIP) algorithm fills this gap with a distance measure designed to partition genetic and epigenetic data with non-equal-length dimensions; specifically, gene expression (GE), DNA methylation (CpG), and single nucleotide polymorphisms (SNP). The VIP extension builds on this framework by adding another layer of complex joint effects between genetic and epigenetic data and other potential health-related variables to dictate clustering. The extension algorithm performs well on simulated data when the clustering of the covariates follows the same clustering scheme as the genetics data. As with other distance measures, when the data do not follow a clear clustering scheme the algorithm tends to underperform, especially on numeric data. The results highlight many aspects of the algorithm's performance capabilities, as well as multiple areas for future improvement.
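The VIP distance itself is not reproduced here, but the general idea described above (a K-means-style partitioning whose distance combines data blocks of unequal dimension) can be illustrated with a minimal Python sketch; the per-block normalization is a hypothetical choice, not the measure used in the talk.

```python
import numpy as np

def block_distance(x_blocks, c_blocks):
    # Sum of per-block squared Euclidean distances, each scaled by its block length,
    # so blocks of different dimension (e.g. GE, CpG, SNP) contribute comparably.
    return sum(np.sum((x - c) ** 2) / x.size for x, c in zip(x_blocks, c_blocks))

def cluster_blocks(data, k, n_iter=50, seed=0):
    # data: list of subjects, each a list of numeric arrays (one per data type).
    rng = np.random.default_rng(seed)
    centers = [data[i] for i in rng.choice(len(data), size=k, replace=False)]
    labels = np.zeros(len(data), dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest center under the block-wise distance.
        labels = np.array([np.argmin([block_distance(x, c) for c in centers])
                           for x in data])
        # Update step: block-wise means within each cluster.
        for j in range(k):
            members = [data[i] for i in range(len(data)) if labels[i] == j]
            if members:
                centers[j] = [np.mean([m[b] for m in members], axis=0)
                              for b in range(len(members[0]))]
    return labels

# Toy example: three data types with unequal dimensions per subject.
rng = np.random.default_rng(1)
subjects = [[rng.normal(size=20), rng.normal(size=50), rng.normal(size=5)]
            for _ in range(40)]
print(cluster_blocks(subjects, k=2))
```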
2023-05-22, Dr. Hua Zhou, Inferring Within-Subject Variances From Intensive Longitudinal Data
University of California Los Angeles, The availability of vast amounts of longitudinal data from electronic health records (EHR) and personal wearable devices opens the door to numerous new research questions. In many studies, individual variability of a longitudinal outcome is as important as the mean. Blood pressure fluctuations, glycemic variations, and mood swings are prime examples where it is critical to identify factors that affect the within-individual variability. We propose a scalable method, within-subject variance estimator by robust regression (WiSER), for the estimation and inference of the effects of both time-varying and time-invariant predictors on within-subject variance. It is robust against misspecification of the conditional distribution of responses or the distribution of random effects. It shows performance similar to correctly specified likelihood methods but is orders of magnitude faster. The estimation algorithm scales linearly in the total number of observations, making it applicable to massive longitudinal data sets. The effectiveness of WiSER is illustrated using the accelerometry data from the Women's Health Study and a clinical trial for longitudinal diabetes care.
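WiSER's robust estimating equations are not reproduced here; as a rough illustration of the quantity being modeled, the following Python sketch uses a naive two-stage regression on simulated data (fit a mean model, then regress the log of each subject's residual variance on a subject-level predictor). The variable names and simulation settings are illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_subj, n_obs = 200, 30
subj = np.repeat(np.arange(n_subj), n_obs)
w = rng.binomial(1, 0.5, n_subj)            # subject-level predictor of variability
sigma = np.exp(-0.5 + 0.8 * w)              # true within-subject SD depends on w
x = rng.normal(size=n_subj * n_obs)         # time-varying predictor of the mean
y = 1.0 + 0.5 * x + rng.normal(scale=sigma[subj])

# Stage 1: fit the mean model and take residuals.
mean_fit = sm.OLS(y, sm.add_constant(x)).fit()
resid = mean_fit.resid

# Stage 2: regress log within-subject residual variance on the subject-level predictor.
log_var = np.log(pd.Series(resid ** 2).groupby(subj).mean())
var_fit = sm.OLS(log_var, sm.add_constant(w)).fit(cov_type="HC1")
print(var_fit.params)  # slope should be roughly 2 * 0.8 = 1.6 on the log-variance scale
```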
2023-04-24, Dr. Qian Li, Statistical Frameworks for Longitudinal Metagenomic and Transcriptomic Data
St. Jude Children's Research Hospital, Longitudinal sampling has become popular in omics studies, such as microbiome and transcriptome studies. To identify operational taxonomy units (OTUs) signaling disease onset, a powerful and cost-efficient strategy is to select participants in matched sets and profile their temporal metagenomes, followed by trajectory analysis. We proposed a joint model with matching and regularization (JMR) to detect OTU-specific trajectories predictive of host disease status. The between- and within-matched-set heterogeneity in OTU relative abundance and disease risk is linked by nested random effects. The inherent negative correlation in microbiota composition is adjusted for by incorporating and regularizing the top-correlated taxa as a longitudinal covariate, pre-selected by Bray-Curtis distance and elastic net regression. For longitudinal bulk transcriptomes, we propose a statistical framework, ISLET, to infer individual-specific and cell-type-specific transcriptome reference panels. ISLET models the repeatedly measured bulk gene expression data to optimize the usage of shared information within each subject. ISLET is the first available method to achieve individual-specific reference estimation in repeated samples. In a simulation study and an application to a large-scale metagenomic study, JMR outperformed the competing methods and identified important taxa in infants' fecal samples with dynamics preceding host autoimmune status. We also show the outstanding performance of ISLET in reference estimation and downstream cell-type-specific differential expression testing in simulations. An application of ISLET to the longitudinal PBMC transcriptomes in the same study confirms the cell-type-specific gene signatures for early-life autoimmunity.
2023-03-27, Dr. Patrick Breheny, False Discovery Rates for Penalized Regression Models
University of Iowa, Penalized regression is an attractive methodology for dealing with high-dimensional data where classical likelihood approaches to modeling break down. However, its widespread adoption has been hindered by a lack of inferential tools. In particular, penalized regression is very useful for variable selection, but how confident should one be about those selections? How many of those selections would likely have occurred by chance alone? In this talk, I will review recent developments in this area, with an emphasis on my work and that of my recent graduate students.
2023-02-20, Dr. Cécile Ané, Methods for Phylogenetic Networks
University of Wisconsin-Madison, Phylogenetic networks and admixture graphs can represent the past history of a group of species or populations and how they diversified. Unlike trees, networks can represent events such as migration between populations, admixture, hybridization between species, or recombination between viral strains. I will give an overview of how phylogenetic networks are used and the difficulties of estimating phylogenetic networks from genome-wide data. Then I will focus on the characterization of what is (or is not) knowable about the network based on genetic distance data.
2022-10-31, Chenhao Zhao, MatrixLM: a flexible, interpretable framework for high-throughput data
Geisel School of Medicine, Dartmouth College, The Matrix Linear Model (MLM) is an efficient and computationally feasible solution to association analysis for biomedical high-throughput data. Sen and Liang (2018) developed the MatrixLM.jl package in the Julia programming language, providing core functions to estimate matrix linear models. The project's main goal was to collaborate on user-friendly documentation, expand testing, and improve certain coding functionalities. We used simulated data to demonstrate how to use the package and used a case study based on an actual disease metabolomics study to showcase MLM's benefits. Nonalcoholic fatty liver disease (NAFLD) is a progressive liver disease that is strongly associated with type II diabetes. Using Matrix Linear Models, our analysis investigated the association between metabolite characteristics (e.g., pathways) and patient characteristics such as type II diabetes.
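The MatrixLM.jl interface is not shown here; as a language-agnostic illustration of what a matrix linear model estimates, the following NumPy sketch computes the closed-form least-squares solution for a bilinear model Y ≈ X B Z', where X holds patient characteristics and Z holds metabolite characteristics (all data simulated).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p, q = 100, 50, 4, 3       # samples, metabolites, patient covariates, metabolite annotations
X = rng.normal(size=(n, p))      # patient characteristics (e.g. type II diabetes status)
Z = rng.normal(size=(m, q))      # metabolite characteristics (e.g. pathway membership)
B = rng.normal(size=(p, q))      # interaction coefficients to be estimated
Y = X @ B @ Z.T + rng.normal(scale=0.5, size=(n, m))

# Closed-form least-squares estimate for the matrix linear model Y ~ X B Z'.
B_hat = np.linalg.solve(X.T @ X, X.T @ Y) @ Z @ np.linalg.inv(Z.T @ Z)
print(np.round(B_hat - B, 2))    # estimation error should be small
```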
2022-10-17, Dr. Heather M. Highland, A multi-omics approach to understanding the role of APOL1 in CKD amongst African Ancestry individuals
The University of North Carolina at Chapel Hill, APOL1 is an integral part of the complement system, a component of innate immunity that serves as the first line of defense against pathogens. The trypanosome that causes African sleeping sickness developed mechanisms to evade this innate immune defense after the migration out of Africa. This created a selective pressure on non-synonymous variants in the APOL1 gene. These variants are only observed in people with recent African ancestry. People carrying two variants in APOL1 are at increased risk of developing a variety of kidney diseases. The mechanism by which APOL1 alters kidney function is currently unclear. To investigate potential mechanisms, we have looked at differences in DNA methylation across the genome and metabolomic profiles. Using 1740 African Americans (AA) in the ARIC study and 3886 Hispanic/Latinos in SOL, we did not observe any statistically significant differences in metabolism in preliminary analyses after adjusting for multiple comparisons. In 947 AAs in ARIC, 949 AAs in JHS, and 332 AAs in MESA, we identified extensive differences in methylation near the APOL gene family region on chromosome 22 (p = 2.7×10⁻⁸⁰). Additional methylation differences were seen near FEZF2, FAM20A, and KIAA0556. The role of these loci in the development of kidney disease in people with two APOL1 risk alleles remains under investigation.
2022-09-19, Dr. Luis FS Castro-de-Araujo, Bidirectional Causal Modeling With Instrumental Variables and Data From Relatives
The University of Melbourne, Establishing (or falsifying) causal associations is an essential step towards developing effective interventions for psychiatric and substance use problems. While randomized controlled trials (RCTs) are considered the gold standard for causal inference in health research, they are impossible or unethical in many common scenarios. Mendelian randomization (MR) can be used where RCTs are not feasible, but it requires stringent assumptions that can be fundamentally flawed when applied to complex traits. Some assumptions of MR can be avoided by using structural equation modeling. In this work, we developed an extension of the Direction of Causation twin model (Neale 1994) that includes two polygenic risk scores in the specification, as an approach to avoid some inherent restrictions of both MR and RCTs. We hypothesize that adding a second PRS will generate a more flexible model in terms of identification, whilst maintaining reasonable power and allowing for bidirectional causation. OpenMx software is used to explore the power of such a model and its identification. We arrive at an extension of the Direction of Causation model that can be used in either a twin design or an extended family design while relaxing some of MR's assumptions. We further report that the model is adequately powered for current data set sizes (around 13,000 observations or fewer, depending on the variance of the instruments) and across a range of additive, shared, and environmental variances found in common clinical scenarios.
2022-08-29, Dr. Arash Shaban-Nejad, Using Ontologies for Knowledge Engineering and Management in Medicine and Healthcare
University of Tennessee Health Science Center, Health intelligence relies on the systematic collection and integration of data from diverse distributed and heterogeneous sources at various levels of granularity. These sources include data from multiple disciplines represented in different formats, languages, and structures posing significant integration and analytics challenges. Using a series of clinical and population health applications and use cases, this seminar highlights the contribution made by emerging semantic technologies that offer enhanced interoperability, interpretability, and explainability through the adoption of ontologies (a computational artifact capturing domain knowledge using concepts, relations, and complex logical rules and axioms), and knowledge graphs.
2022-08-22, Dr. Abraham Palmer, Using Outbred HS Rats to Study the Genetic Basis of Almost Everything You Can Think Of
University of California San Diego, Whereas there are well-established methods for translating findings about single genes from humans to non-humans, there is an urgent need for methods to translate the polygenic signals obtained from GWAS across species. This is difficult because GWAS produces information about SNPs rather than genes; SNPs, however, are inherently species-specific. My lab is helping to develop two complementary methods to address this problem. Both methods depend on translation of GWAS signals from SNPs to genes. In one method, this is done by choosing the gene that is nearest to an implicated SNP. The lists of orthologous genes from two or more species are then projected into a previously defined gene network and a random walk is used to diffuse the signal to neighboring genes. The overlap between the networks defined by each species is then assessed for significance relative to permuted gene sets. In the second method, SNPs are used to predict gene expression and these predictions are used to estimate the effect of each gene's expression on phenotype, creating what we term a polygenic transcriptomic risk score (PTRS). A PTRS can then be used in conjunction with orthologous genes such that a PTRS defined in one species can be used to estimate an analogous trait in individuals from another species. In preliminary work we found that both methods identify highly statistically significant overlap in the signals associated with both BMI and body length. We are extending these methods to behavioral traits, including those relevant for substance use disorders.
2022-04-11, Dr. Daniel Roden, A Single-Cell and Spatially Resolved Atlas of Human Breast Cancers
University of New South Wales, Breast cancers are complex cellular ecosystems where heterotypic interactions play central roles in disease progression and response to therapy. However, our knowledge of their cellular composition and organization remains limited. Recently we published an integrated cellular and spatial atlas of 26 primary human breast cancers spanning all major molecular subtypes. This provided a systematic, high-resolution characterization of the cellular diversity of the epithelial, immune and stromal cellular landscape. To investigate neoplastic cell heterogeneity, we developed a single cell classifier of intrinsic subtype (scSubtype) and revealed recurrent transcriptional gene modules that define the neoplastic cells. This detailed cellular taxonomy was then used to deconvolute large breast cancer cohorts, allowing their stratification into nine clusters, termed 'ecotypes', with unique cellular compositions and associations with clinical outcome. Further, Visium spatial profiling provided an initial view of how stromal, immune and neoplastic cells are spatially organized in tumours, offering insights into tumour regulation. This work is now being expanded to the generation and integration of hundreds of cellular profiles and a matching dataset of spatially resolved tumour transcriptomes. We propose that this will identify cellular niches that are spatially organized in breast tumours, offering insights into anti-tumour immune regulation and neoplastic heterogeneity. In particular, I'll discuss recent work, using whole-transcriptome Nanostring spatial profiling, that analyses T-cell and cancer-rich regions of the tumour micro-environment in triple-negative breast cancers, showing how integration with our cellular taxonomy allows for deconvolution of the cell-type abundances in these specific tissue regions. Our work highlights the potential of large-scale, integrated, cellular and spatial genomics to unravel the complex cellular heterogeneity within tumours and identify novel cell types, niches, and regulatory states that will inform treatment response.
2022-03-28, Dr. Li Hsu, A Mixed-Effects Model for Powerful Association Tests in Integrative Functional Genomic Data
Fred Hutchinson Cancer Research Center, Genome-wide association studies (GWAS) have successfully identified thousands of genetic variants for many complex diseases; however, these variants explain only a small fraction of the heritability. Recent developments in genotype-omics studies have shown promise for discovering novel loci by leveraging genetically regulated molecular phenotypes (e.g., gene expression, methylation, proteomics) in GWAS. However, existing approaches have a limitation: some variants can individually influence disease risk through alternative functional mechanisms, so testing only the association of imputed molecular phenotypes will potentially lose power. To tackle this challenge, we consider a unified mixed effects model that formulates the association of intermediate phenotypes such as imputed gene expression through fixed effects, while allowing for residual effects of individual variants as random effects. We consider a set-based score testing framework, MiST (Mixed effects Score Test), and propose data-driven combination approaches to jointly test for the fixed and random effects. We also provide p-values for fixed and random effects separately to enhance interpretability of the association signals. Recently, we extended MiST to depend only on GWAS summary statistics instead of individual-level data, allowing for broad application of MiST to GWAS data. Extensive simulations demonstrate that MiST is more powerful than existing approaches, and summary statistics-based MiST (sMiST) agrees well with results obtained from individual-level data with substantially improved computational speed. We apply sMiST to a large-scale GWAS of colorectal cancer using summary statistics from >120,000 study participants and gene expression data from the Genotype-Tissue Expression (GTEx) project. We identify several novel and secondary independent genetic loci.
2022-03-21, Dr. Hongkai Ji, A statistical framework for differential pseudotime analysis with multiple single-cell RNA-seq samples
Johns Hopkins Bloomberg School of Public Health, Pseudotime analysis with single-cell RNA-sequencing (scRNA-seq) data has been widely used to study dynamic gene regulatory programs along continuous biological processes. While many computational methods have been developed to infer the pseudo-temporal trajectories of cells within a biological sample, methods that compare pseudo-temporal patterns with multiple samples (or replicates) across different experimental conditions are lacking. Lamian is a comprehensive and statistically-rigorous computational framework for differential multi-sample pseudotime analysis. It can be used to identify changes in a biological process associated with sample covariates, such as different biological conditions, and also to detect changes in gene expression, cell density, and topology of a pseudotemporal trajectory. Unlike existing methods that ignore sample variability, Lamian draws statistical inference after accounting for cross-sample variability and hence substantially reduces sample-specific false discoveries that are not generalizable to new samples. Using both simulations and real scRNA-seq data, including an analysis of differential immune response programs between COVID-19 patients with different disease severity levels, we demonstrate the advantages of Lamian in decoding cellular gene expression programs in continuous biological processes.
2022-03-14, Dr. Lin Hou, Inference of disease associated genomic segments in post-GWAS analysis
Tsinghua University, Identification and interpretation of disease-associated loci remain a paramount challenge in genome-wide association studies (GWAS) of complex disease. We develop post-GWAS analysis tools that leverage pleiotropy and functional annotations to dissect the genetic architecture of complex traits. In this talk, I will first introduce LOGODetect, a powerful and efficient statistical method to identify small genome segments harboring local genetic correlation signals. LOGODetect automatically identifies genetic regions showing consistent association with multiple phenotypes through a scan statistic approach. Applied to seven neuropsychiatric traits, we identify hub regions showing concordant effects on five or more traits. Next, I will introduce Openness Weighted Association Studies (OWAS), a computational approach that leverages and aggregates predictions of chromatin accessibility in personal genomes for prioritizing GWAS signals. In extensive simulations and real data analyses, OWAS identifies genes/segments that explain more heritability than existing methods and has a better replication rate in independent cohorts than GWAS. Moreover, the identified genes/segments show tissue-specific patterns and are enriched in disease-relevant pathways.
2022-02-14, Dr. Danyu Lin, Durability of Covid-19 Vaccines
University of North Carolina at Chapel Hill, Evaluating the durability of protection afforded by Covid-19 vaccines is a public health priority, with the results needed to inform policies around booster vaccinations as well as those around non-pharmaceutical interventions. In this talk, I will present a general framework for estimating the effects of Covid-19 vaccines over time in phase 3 clinical trials and observational studies. I will show some results on the duration of vaccine protection from the Moderna pivotal trial and from the North Carolina statewide surveillance data. The latter data, which were published in the New England Journal of Medicine in January, provided rich information about the effectiveness of the Pfizer, Moderna, and Johnson & Johnson vaccines in reducing the risks of Covid-19, hospitalization, and death over time. I will discuss the implications of these results for booster vaccinations.
2022-02-07, Dr. Arjun Krishnan, Democratizing data-driven biology: Tackling incomplete data, unstructured metadata, and hidden curricula
Michigan State University, There is much enthusiasm about using omics and biomedical data collections to fuel research on complex traits and diseases. However, there are still some well-known fundamental challenges in seamlessly and effectively using these data to drive research. For instance, there are >1.5 million human gene expression profiles that are publicly available, but, depending on the technology/platform used to record each profile, different subsets of genes in the genome are measured in these transcriptomes, leading to thousands of unmeasured genes in many of these profiles. These gaps in data are major hurdles for integrative analysis. Critical problems also exist with data descriptions: the majority of >2 million publicly available omics samples lack structured metadata, including information about tissue of origin, disease status, and environmental conditions. Thus, discovering samples and datasets of interest is not straightforward. In this seminar, I will present recent work from our group on developing machine learning approaches to address these fundamental challenges. In addition, I will discuss the need for improving advanced research training in biological data analysis by formalizing concepts in statistical procedures, study design, data/code management, critically consuming data-driven findings, and reproducible research.
2022-01-31, Dr. Yingying Wei, Meta-clustering of Genomic Data
Chinese University of Hong Kong, Like traditional meta-analysis that pools effect sizes across studies to improve statistical power, it is of increasing interest to conduct clustering jointly across datasets to identify disease subtypes for bulk genomic data and discover cell types for single-cell RNA-sequencing (scRNA-seq) data. Unfortunately, due to the prevalence of technical batch effects among high-throughput experiments, directly clustering samples from multiple datasets can lead to wrong results. Recent meta-clustering approaches require all datasets to contain all subtypes, which is not feasible for many experimental designs. In this talk, I will present our Batch-effects-correction-with-Unknown-Subtypes (BUS) framework. BUS is capable of correcting batch effects explicitly, grouping samples that share similar characteristics into subtypes, identifying features that distinguish subtypes, and enjoying a linear-order computational complexity. We prove the identifiability of BUS not only for bulk data but also for scRNA-seq data, whose dropout events are missing not at random. We mathematically show that under two very flexible and realistic experimental designs, the "reference panel" and the "chain-type" designs, true biological variability can also be separated from batch effects. Moreover, despite the active research on analysis methods for scRNA-seq data, rigorous statistical methods to estimate treatment effects for scRNA-seq data, that is, how an intervention or exposure alters the cellular composition and gene expression levels, are still lacking. Building upon our BUS framework, we further develop statistical methods to quantify treatment effects for scRNA-seq data.
2022-01-24, Dr. Xuexia Wang, Novel Genetic Association Test and Cardiomyopathy Risk Prediction in Cancer Survivors
University of North Texas, This talk includes two projects. Project 1: Gene-based association tests are widely used in GWAS. The power of a test is often limited by the sample size, the effect size, and the number of causal variants or their directions in a gene. In addition, access to individual-level data is often limited. To resolve these limitations, we proposed an optimally weighted combination (OWC) test based on summary statistics from GWAS. We analytically proved that aggregating the variants in one gene is the same as using a weighted combination of Z-scores for each variant based on the proposed score test. Several existing methods are special cases. We also numerically illustrated that our proposed test outperforms several existing methods via simulation studies. Furthermore, we utilized schizophrenia GWAS data and fasting glucose GWAS meta-analysis data to demonstrate that our method outperforms the existing methods in real data analyses. Project 2: We used a carefully curated list of 87 previously published genetic variants to determine whether the incorporation of genetic variants with non-genetic variables could improve the identification of cancer survivors at risk for anthracycline-related cardiomyopathy. We used anthracycline-exposed childhood cancer survivors from a Children's Oncology Group study (COG-ALTE03N1: 146 cases; 195 matched controls) as the discovery set. Replication was performed in two anthracycline-exposed survivor populations: i) childhood cancer survivors from the Childhood Cancer Survivor Study (CCSS: 126 cases; 250 controls); ii) autologous blood or marrow transplantation (BMT) survivors from the BMT Survivor Study (BMTSS: 80 cases; 78 controls). The Clinical+Genetic model performed better than the Clinical model in COG-ALTE03N1 (AUC 0.88 vs. 0.81) and BMTSS (AUC 0.72 vs. 0.64), but not in CCSS (AUC 0.88 vs. 0.89). However, the Clinical+Genetic model performed marginally better in CCSS patients without cardiovascular risk factors (CVRFs) in whom cardiomyopathy developed within 30 years of anthracycline exposure (AUC 0.90 vs. 0.85). Conclusions: Adding a comprehensively assembled genetic profile to clinical characteristics improves the identification of cancer survivors at risk for anthracycline-related cardiomyopathy.
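The OWC weights themselves are derived in the talk's underlying paper and are not reproduced here; the sketch below only illustrates the general form of a summary-statistic test that combines per-variant Z-scores with a weight vector while accounting for linkage disequilibrium. The weights and LD matrix in the example are hypothetical.

```python
import numpy as np
from scipy import stats

def weighted_z_test(z, w, R):
    # Gene-level test combining per-variant Z-scores with weights w,
    # accounting for the LD correlation matrix R among variants.
    z, w = np.asarray(z, float), np.asarray(w, float)
    stat = (w @ z) ** 2 / (w @ R @ w)   # ~ chi-square(1) under the null
    return stats.chi2.sf(stat, df=1)

# Toy example: five variants in a gene, weights from (hypothetical) minor allele frequencies.
z = np.array([1.8, 2.1, -0.3, 1.2, 0.9])
maf = np.array([0.05, 0.10, 0.30, 0.02, 0.15])
w = 1.0 / np.sqrt(maf * (1 - maf))
R = np.eye(5)                            # identity LD matrix assumes independent variants
print(weighted_z_test(z, w, R))
```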
2021-11-22, Dr. Zoltán Kutalik, Advances in Mendelian Randomization
University of Lausanne, First, I will motivate the need for causal inference in contrast to observational correlations. In particular, I'll describe the principle of an instrumental variable approach heavily applied in genetic research, termed Mendelian Randomization (MR). Next, I will show four extensions of this method to different settings/assumptions: (i) In-depth quantification and correction of the bias of the most popular MR method (IVW); (ii) Modelling genetic architecture simultaneously with bidirectional causal effects in the presence of a heritable confounder; (iii) Causal inference for composite trait type exposures; (iv) Estimation of non-linear causal effects. I'll finish with two applications to omics data: first, we link differential gene expression analyses with bi-directional causal effects, and finally, I'll touch on how causal effects of different omics layers are mediated.
2021-11-01, Anna Reisetter, Standardization and Penalty Parameter Selection in Penalized Linear Mixed Models
University of Iowa, Penalized linear mixed models (LMMs) have been developed to accurately identify genotype-phenotype associations in the presence of dependent samples. In spite of this, the statistical properties of these models are not well understood. In addition, there is a lack of available software for their implementation. In this talk, we provide an overview of penalized LMMs for the analysis of structured genetic data, while examining their statistical properties in the genetic association setting. We then focus on the statistical properties of penalized LMMs in a general setting, and provide recommendations for key components of their implementation, including appropriate standardization and penalty parameter selection. We demonstrate the benefits of our recommendations using both a general setting and one specific to genetic data. We conclude with a detailed analysis of a large, empirical GWAS data set which contains complex correlation among samples. We use this analysis to illustrate the benefits of penalized LMMs compared to traditional genome-wide association methods, and to demonstrate the utility of penalizedLMM, an R package we have developed for the flexible and user-friendly implementation of penalized LMMs.
2021-10-18, Jane Liang, PanelPRO: A General Framework for Multi-Gene, Multi-Cancer Mendelian Risk Prediction Models
Harvard University, Risk evaluation to identify individuals who are at greater risk of cancer as a result of heritable pathogenic variants is a valuable component of individualized clinical management. Using principles of Mendelian genetics, Bayesian probability theory, and variant-specific knowledge, Mendelian models derive the probability of carrying a pathogenic variant and developing cancer in the future, based on family history. Existing Mendelian models are widely employed, but are generally limited to specific genes and syndromes. However, the upsurge of multi-gene panel germline testing has spurred the discovery of many new gene-cancer associations that are not presently accounted for in these models. We have developed PanelPRO, a flexible, efficient Mendelian risk prediction framework that can incorporate an arbitrary number of genes and cancers, overcoming the computational challenges that arise because of the increased model complexity. Using simulations and a clinical cohort with germline panel testing data, we evaluate model performance, validate the reverse-compatibility of our approach with existing Mendelian models, and illustrate its usage.
2021-10-04, Dr. Rebecca Hubbard, Principled Approaches to the Practical Challenges of Real-World Data
University of Pennsylvania, Interest in conducting research using real-world data, data generated as a by-product of digital transactions, has exploded over the past decade and has been further spurred by the 21st Century Cures Act. Real-world data facilitate understanding of treatment utilization and outcomes as they occur in routine practice, and studies using these data sources can potentially proceed rapidly compared to trials and observational studies that rely on primary data collection. However, using data sources that were not collected for research purposes comes at a cost, and naïve use of such data without considering their complexity and imperfect quality can lead to bias and inferential error. Real-world data frequently violate the assumptions of standard statistical methods, but it is not practicable to develop new methods to address every possible complication arising in their analysis. The statistician is faced with a quandary: how to effectively utilize real-world data to advance research without compromising best practices for principled data analysis. In this talk I will use examples from my research on methods for the analysis of electronic health record (EHR)-derived data to illustrate approaches to understanding the data generating mechanism for real-world data. Drawing on this understanding, I will then discuss approaches to identify, use, and develop principled methods for incorporating EHR into research. The overarching goal of this presentation is to raise awareness of challenges associated with the analysis of real-world data and demonstrate how a principled approach can be grounded in an understanding of the scientific context and data generating process.
2021-09-27, Dr. Duyeol Lee, Generalized V-Learning Framework for Estimating Dynamic Treatment Regimes
Quantitative Analytics Specialist, Model Innovations Team, Wells Fargo Bank, Precision medicine is an approach that incorporates personalized information to efficiently determine which treatments are best for which types of patients. A key component of precision medicine is creating mathematical estimators for clinical decision-making. Dynamic treatment regimes formalize tailored treatment plans as sequences of decision rules. Recently, the V-learning method was introduced to estimate optimal dynamic treatment regimes. This method showed good performance compared to the existing reinforcement learning methods such as greedy gradient Q-learning. However, the complicated functional form of its loss function makes it difficult to apply modern machine learning methods for the estimation of value functions of treatment policies. We propose a generalized V-learning framework for estimating optimal treatment regimes. The proposed method adopts widely used loss functions and an iterative method to estimate value functions. Simulation studies show that the proposed method provides better performance compared with the original V-learning method.
2021-09-24, Dr. Karl Broman, OSGA Webinar - Organizing Data in Spreadsheets
University of Wisconsin-Madison, Spreadsheets are widely used software tools for data entry, storage, analysis, and visualization. Focusing on the data entry and storage aspects, this presentation will offer practical recommendations for organizing spreadsheet data to reduce errors and ease later analyses. The basic principles are: be consistent, write dates like YYYY-MM-DD, do not leave any cells empty, put just one thing in a cell, organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row), create a data dictionary, do not include calculations in the raw data files, do not use font color or highlighting as data, choose good names for things, make backups, use data validation to avoid data entry errors, and save the data in plaintext files.
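As a small illustration of the layout the talk recommends (a single rectangle with one header row, subjects as rows, variables as columns, ISO dates, and no empty cells), the following Python snippet builds and validates such a file; the column names are hypothetical.

```python
import io
import pandas as pd

# A single data rectangle: one header row, subjects as rows, variables as columns,
# ISO-8601 dates, one value per cell, and no empty cells.
csv = io.StringIO(
    "mouse_id,sex,collection_date,glucose_mg_dl,weight_g\n"
    "M001,F,2021-09-01,132,27.4\n"
    "M002,M,2021-09-01,145,31.0\n"
    "M003,F,2021-09-02,118,26.1\n"
)
df = pd.read_csv(csv, parse_dates=["collection_date"])

# Simple validation in the spirit of the talk: no empty cells, dates parse as dates.
assert not df.isna().any().any(), "empty cells found"
assert pd.api.types.is_datetime64_any_dtype(df["collection_date"])
print(df.dtypes)
```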
2021-09-13, Dr. Phillip Hunter Allman, Mendelian Randomization Analyses in a Multivariate Framework
University of Alabama at Birmingham School of Public Health, Mendelian randomization (MR) is an application of instrumental variable (IV) methods to observational data in which the IV is a genetic variant. An IV is a variable which functions similarly to the random treatment group assignment seen in clinical trials. Numerous statistical methods exist for subject-level MR or IV analysis, such as two-stage predictor substitution (2SPS) or the correlated errors model (CEM). These methods have some limitations depending on the distributions involved and the number of IVs, particularly when it comes to asymptotic variance estimation and hypothesis testing. Our research explores extensions of the CEM to scenarios with non-normal variates. This talk will provide a brief introduction to MR, describe the popular statistical methods for such analyses, describe our extensions to the correlated errors model, and compare these methods with simulations and real data analysis.
2021-08-24, Ye Eun Bae, Building tools for understanding the genetic architecture of allopolyploidy
Florida State University, Analysis of quantitative trait loci (QTL) is a useful approach to identify putatively causal genetic factors underlying quantitative trait variation in species. While many QTL analysis tools and methods have been proposed, it is challenging to extend them to allopolyploid species, which require polyploidy-aware genetic mapping that exploits the preferential pairing between pairs of homologous chromosomes. In this project, we developed a QTL analysis tool for allopolyploids to provide insights into their genetic architecture and genotype-phenotype associations. As a starting point, we applied our tool to switchgrass (Panicum virgatum) datasets and visualized the results to help in understanding the allopolyploid nature of switchgrass.
2021-08-23, Nadeesha Thewarapperuma, Analysis of DNA methylation data using a functional data approach
University of Kansas Medical Center, We will use a data series that examines oral and pharyngeal carcinoma and includes 154 cases and 72 controls. A functional data approach will be used to analyze CpG island data for each gene to determine if there is a difference between case and control groups. The package fdANOVA will be used for the hypothesis tests. The null hypothesis is that the case and control groups have the same mean function, and the alternative is that the mean functions differ.
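The talk uses the fdANOVA R package; as a language-agnostic illustration of the hypothesis being tested (equality of group mean functions along a CpG island), here is a simple permutation version in Python with simulated curves and group sizes matching the study.

```python
import numpy as np

def perm_test_mean_functions(curves, labels, n_perm=2000, seed=0):
    # Permutation test of equal mean functions between two groups.
    # curves: (n_subjects, n_points) array of methylation values along a CpG island.
    rng = np.random.default_rng(seed)

    def stat(lab):
        diff = curves[lab == 1].mean(axis=0) - curves[lab == 0].mean(axis=0)
        return np.sum(diff ** 2)               # integrated squared mean difference

    observed = stat(labels)
    perms = np.array([stat(rng.permutation(labels)) for _ in range(n_perm)])
    return (1 + np.sum(perms >= observed)) / (1 + n_perm)

# Toy example: 72 controls and 154 cases measured at 30 CpG positions.
rng = np.random.default_rng(1)
curves = np.vstack([rng.normal(0.0, 1, (72, 30)), rng.normal(0.3, 1, (154, 30))])
labels = np.r_[np.zeros(72, int), np.ones(154, int)]
print(perm_test_mean_functions(curves, labels))
```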
2021-06-21, Dr. Tianxiao Huan, Genome-Wide Identification of DNA Methylation QTLs in Whole Blood Highlights Pathways for Cardiovascular Disease
UMASS, Abstract: Identifying methylation quantitative trait loci (meQTLs) and integrating them with disease-associated variants from genome-wide association studies (GWAS) may illuminate functional mechanisms underlying genetic variant-disease associations. Here, we perform GWAS of >415 thousand CpG methylation sites in whole blood from 4170 individuals and map 4.7 million cis- and 630 thousand trans-meQTL variants targeting >120 thousand CpGs. Independent replication is performed in 1347 participants from two studies. By linking cis-meQTL variants with GWAS results for cardiovascular disease (CVD) traits, we identify 92 putatively causal CpGs for CVD traits by Mendelian randomization analysis. Further integrating gene expression data reveals evidence of cis CpG-transcript pairs causally linked to CVD. In addition, we identify 22 trans-meQTL hotspots each targeting more than 30 CpGs and find that trans-meQTL hotspots appear to act in cis on expression of nearby transcriptional regulatory genes. Our findings provide a powerful meQTL resource and shed light on DNA methylation involvement in human diseases.
2021-05-29, Dr. Andrew Lawson, Bayesian Spatio-Temporal SIR Modeling of Covid-19 in SC, USA
Medical University of South Carolina, The Covid-19 pandemic has focused awareness on the need for good modeling of infectious disease spread and the need for surveillance that can alert public health officials to developing adverse events such as clusters of unusual risk (hot spots). Bayesian models can provide a dynamically flexible framework for such modeling via recursive Bayesian learning. The use of Susceptible-Infected-Removed (SIR) compartment models is exemplified. In addition, monitoring of events can be facilitated by using posterior functionals of risk. This talk will address some infectious disease modeling basics and demonstrate the need for flexible models that can address transmission, both symptomatic and asymptomatic, and can address the spatial structure of the pandemic via neighborhood effects and correlated random effects. The addition of predictors is considered, and it has been found that the percentage below the poverty line at the county level makes a significant contribution to the transmission variation. Prediction from both daily count data and smoothed data is also explored.
2021-05-10, Dr. Jay Greene, A Machine Learning Compatible Method for Ordinal Propensity Score Stratification and Matching
GlaxoSmithKline, Abstract: Although machine learning techniques that estimate propensity scores for observational studies with multi-valued treatments have advanced rapidly in recent years, the development of propensity score adjustment techniques has not kept pace. While machine learning propensity models provide numerous benefits, they do not produce a single-variable balancing score that can be used for propensity score stratification and matching. This issue motivates the development of a flexible ordinal propensity scoring methodology that does not require parametric assumptions for the propensity model. The proposed method fits a one-parameter power function to the cumulative distribution function (CDF) of the generalized propensity score (GPS) vector resulting from any machine learning propensity model, and is called the GPS-CDF method. The estimated power parameter from the GPS-CDF method is a scalar balancing score that can be used to group similar subjects in outcome analyses. Specifically, subjects who received different levels of the treatment are stratified or matched based on their estimated parameter value to produce unbiased estimates of the average treatment effect (ATE). Simulation studies show improved covariate balance, minimal bias in ATE estimates, and maintained coverage probability. The proposed method is applied to the Mexican-American Tobacco use in Children (MATCh) study to determine whether ordinal exposure to smoking imagery in movies is causally associated with cigarette experimentation in Mexican-American adolescents (manuscript: https://onlinelibrary.wiley.com/doi/full/10.1002/sim.8846).
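The exact parameterization in the published GPS-CDF method may differ, but following the description above (fit a one-parameter power function to the CDF of each subject's GPS vector and use the estimated exponent as a scalar balancing score), a minimal sketch looks like this; the GPS values are made up.

```python
import numpy as np
from scipy.optimize import curve_fit

def gps_cdf_score(gps):
    # Fit a one-parameter power function to the CDF of a subject's GPS vector
    # and return the exponent as a scalar balancing score.
    K = len(gps)
    grid = np.arange(1, K + 1) / K              # evaluation points of the CDF
    cdf = np.cumsum(gps)                        # CDF of the generalized propensity score
    power = lambda x, a: x ** a
    a_hat, _ = curve_fit(power, grid, cdf, p0=[1.0], bounds=(1e-3, 50))
    return a_hat[0]

# Toy example: GPS vectors over four ordinal exposure levels, from any propensity model.
gps_matrix = np.array([[0.55, 0.25, 0.15, 0.05],
                       [0.10, 0.20, 0.30, 0.40],
                       [0.25, 0.25, 0.25, 0.25]])
scores = [gps_cdf_score(g) for g in gps_matrix]
print(np.round(scores, 2))   # subjects can then be stratified or matched on these values
```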
2021-04-26, Dr. Shanshan Zhao, Assessing the Effects of Multiple Exposures Subject to Limit of Detection
NIH/NIEHS, Studies on the health effects of environmental mixtures face the challenge of limit of detection (LOD) in multiple correlated exposure measurements. Conventional approaches to deal with covariates subject to LOD, including complete-case analysis, substitution methods and parametric modeling of covariate distribution, are feasible but may result in efficiency loss or bias. With a single covariate subject to LOD, a flexible semiparametric accelerated failure time (AFT) model to accommodate censored measurements has been proposed. We generalize this approach by considering a multivariate AFT model for the multiple correlated covariates subject to LOD and a generalized linear model for the outcome. A two-stage procedure based on semiparametric pseudo-likelihood is proposed for estimating the effects of these covariates on health outcome. Consistency and asymptotic normality of the estimators are derived for an arbitrary fixed dimension of covariates. Simulation studies demonstrate good large-sample performance of the proposed methods versus conventional methods in realistic scenarios. We illustrate the practical utility of the proposed method with the LIFECODES birth cohort data, where we compare our approach to existing approaches in an analysis of multiple urinary trace metals in association with oxidative stress in pregnant women.
2021-03-29, Dr. Omer Ozturk, Meta-analysis of quantile intervals from different studies with an application to a pulmonary tuberculosis data
Ohio State University, After the completion of many studies, experimental results are reported in terms of distribution-free confidence intervals that may involve pairs of order statistics. This article considers a meta-analysis procedure to combine these confidence intervals from independent studies to estimate or construct a confidence interval for the true quantile of the population distribution. Data synthesis is made under both fixed-effect and random-effect meta-analysis models. We show that the mean square error (MSE) of the combined quantile estimator is considerably smaller than that of the best individual quantile estimator. We also show that the coverage probability of the meta-analysis confidence interval is quite close to the nominal confidence level. The random-effect meta-analysis model yields a better coverage probability and smaller MSE than the fixed-effect meta-analysis model. The meta-analysis method is then used to synthesize medians of patient delays in pulmonary tuberculosis diagnosis in China to provide an illustration of the proposed methodology. (manuscript: https://onlinelibrary.wiley.com/doi/full/10.1002/sim.8738)
2021-03-15, Dr. Maiying Kong, Propensity score specification for optimal estimation of average treatment effect with binary response
University of Louisville, Propensity score methods are commonly used in statistical analyses of observational data to reduce the impact of confounding bias in estimations of average treatment effect. While the propensity score is defined as the conditional probability of a subject being in the treatment group given that subject's covariates, the most precise estimation of average treatment effect results from specifying the propensity score as a function of true confounders and predictors only. This property has been demonstrated via simulation in multiple prior research articles. However, we have seen no theoretical explanation as to why this should be so. This paper provides that theoretical proof. Furthermore, this paper presents a method for performing the necessary variable selection by means of elastic net regression, and then estimating the propensity scores so as to obtain optimal estimates of average treatment effect. The proposed method is compared against two other recently introduced methods, outcome-adaptive lasso and covariate balancing propensity score. Extensive simulation analyses are employed to determine the circumstances under which each method appears most effective. We applied the proposed methods to examine the effect of a pre-cardiac surgery coagulation indicator on mortality based on a linked dataset from a retrospective review of 1390 patient medical records at Jewish Hospital (Louisville, KY) with the Society of Thoracic Surgeons database. (manuscript: https://journals.sagepub.com/doi/full/10.1177/0962280220934847)
2021-03-08, Dr. Cen Wu, Sparse group variable selection for gene-environment interactions in the longitudinal study
Kansas State University, Regularized variable selection for high-dimensional longitudinal data has received much attention, as accounting for the correlation among repeated measurements can provide additional and essential information for improved identification and prediction performance. Despite this success, in longitudinal studies the potential of regularization methods is far from fully understood for accommodating structured sparsity. In this work, we have developed a sparse group penalization method to conduct the bi-level G-E interaction study under the repeatedly measured phenotype. Within the quadratic inference function (QIF) framework, the proposed method can achieve simultaneous identification of main and interaction effects on both the group and individual level. Simulation studies have shown that the proposed method outperforms major competitors. In a case study of asthma data from the Childhood Asthma Management Program (CAMP), our method leads to improved prediction and identifies main and interaction effects with important implications.
2021-02-22, Dr. Hyeonju Kim, A tutorial for a multivariate linear mixed model-based QTL (Quantitative Trait Loci) analysis tool: FlxQTL
UTHSC, FlxQTL is a Julia software package developed for faster genetic mapping of multivariate traits with greater flexibility than existing linear mixed model-based algorithms. It offers comprehensive functionality for QTL analysis: genetic kinship computation, genome scan, pair scan, permutation testing, visualization, and more. In this talk, a step-by-step demo from package installation to visualization will be presented using Arabidopsis thaliana data (Recombinant Inbred Lines).
2021-02-08, Dr. Aiyi Liu, Nonparametric estimation of distributions and diagnostic accuracy based on group-tested results with differential misclassification
NIH, This talk concerns the problem of estimating a continuous distribution in a diseased or nondiseased population when only group-based test results on the disease status are available. The problem is challenging in that individual disease statuses are not observed and testing results are often subject to misclassification, with further complication that the misclassification may be differential as the group size and the number of the diseased individuals in the group vary. We propose a method to construct nonparametric estimation of the distribution and obtain its asymptotic properties. The performance of the distribution estimator is evaluated under various design considerations concerning group sizes and classification errors. The method is exemplified with data from the National Health and Nutrition Examination Survey (NHANES) study to estimate the distribution and diagnostic accuracy of C-reactive protein in blood samples in predicting chlamydia incidence.
2020-11-16, Dr. Rima Izem, Parallelizing Research Efforts with Hosted Git and Modern Asynchronous Workflows
Children's National Research Institute and the George Washington University, Git is a modern distributed version control system that enables collaborators to explore various avenues of research without interfering with the common core of their project, while also keeping track of contributions and project versions over time. It has been the cornerstone of both the smallest and largest software development collaborations over the past decade, scaling from a handful of collaborators to thousands. Taking cues from the software engineering ecosystem of workflows and tools that have evolved around Git, we can increase reproducibility, assuage pain points in distributed collaboration, and pursue multiple research avenues within projects asynchronously.
2020-11-09, Dr. Lifeng Lin, Treatment ranking in Bayesian network meta-analysis and predictions
Florida State University, Network meta-analysis (NMA) is an important tool to provide high-quality evidence about available treatments' benefits and harms for comparative effectiveness research. Compared with conventional meta-analyses that synthesize related studies for pairs of treatments separately, an NMA uses both direct and indirect evidence to simultaneously compare all available treatments for a certain disease. It is of primary interest for clinicians to rank these treatments and select the optimal ones for patients. Various methods have been proposed to evaluate treatment ranking; among them, the mean rank, the so-called surface under the cumulative ranking curve (SUCRA), and the P-score are widely used in current practice of NMAs. However, these measures only summarize treatment ranks among the studies collected in the NMA. Due to heterogeneity between studies, they cannot predict treatment ranks in a future study and thus may not be directly applied to healthcare for new patients. We propose innovative measures to predict treatment ranks by accounting for the heterogeneity between the existing studies in an NMA and a new study. They are the counterparts of the mean rank, SUCRA, and P-score under the new study setting. We use illustrative examples and simulation studies to evaluate the performance of the proposed measures.
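The predictive counterparts proposed in the talk are not shown here; for reference, the sketch below just computes the standard SUCRA values that those new measures extend, starting from a hypothetical matrix of posterior rank probabilities.

```python
import numpy as np

def sucra(rank_probs):
    # SUCRA for each treatment from a (K treatments x K ranks) matrix of
    # posterior rank probabilities, with rank 1 = best.
    K = rank_probs.shape[0]
    cum = np.cumsum(rank_probs, axis=1)           # P(rank <= j) for j = 1..K
    return cum[:, : K - 1].sum(axis=1) / (K - 1)  # average height of the cumulative ranking curve

# Toy example: three treatments (rows) and their rank probabilities (columns).
rank_probs = np.array([[0.6, 0.3, 0.1],
                       [0.3, 0.5, 0.2],
                       [0.1, 0.2, 0.7]])
print(np.round(sucra(rank_probs), 2))   # higher SUCRA = more likely to rank near the top
```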
2020-10-26, Dr. Brian Egleston, Statistical inference for natural language processing algorithms with a demonstration using type 2 diabetes prediction from electronic health record notes
Fox Chase Cancer Center, The pointwise mutual information statistic (PMI), which measures how often two words occur together in a document corpus, is a cornerstone of recently proposed popular natural language processing algorithms such as word2vec. PMI and word2vec reveal semantic relationships between words and can be helpful in a range of applications such as document indexing, topic analysis, or document categorization. We use probability theory to demonstrate the relationship between PMI and word2vec. We use the theoretical results to demonstrate how the PMI can be modeled and estimated in a simple and straightforward manner. We further describe how one can obtain standard error estimates that account for within-patient clustering that arises from patterns of repeated words within a patient's health record due to a unique health history. We then demonstrate the usefulness of PMI on the problem of predictive identification of disease from free-text notes of electronic health records. Specifically, we use our methods to distinguish those with and without type 2 diabetes mellitus in electronic health record free-text data using over 400,000 clinical notes from an academic medical center.
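The clustered standard errors discussed in the talk are not reproduced here; as a reminder of the basic statistic, the snippet below computes document-level PMI, the log of the joint occurrence probability over the product of the marginal probabilities, on a few made-up notes.

```python
import math

def pmi(docs, word_a, word_b):
    # Pointwise mutual information of two words across a corpus of notes,
    # counting co-occurrence at the document level.
    n = len(docs)
    n_a = sum(word_a in d for d in docs)
    n_b = sum(word_b in d for d in docs)
    n_ab = sum(word_a in d and word_b in d for d in docs)
    if 0 in (n_a, n_b, n_ab):
        return float("-inf")
    return math.log((n_ab / n) / ((n_a / n) * (n_b / n)))

notes = [{"metformin", "a1c", "neuropathy"},
         {"metformin", "a1c"},
         {"fracture", "xray"},
         {"a1c", "insulin"}]
print(round(pmi(notes, "metformin", "a1c"), 3))   # positive PMI: co-occur more than chance
```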
2020-09-28, Dr. Bin Zhu, A hidden Markov modeling approach for identifying tumor subclones in next-generation sequencing studies
National Cancer Institute/NIH, Allele-specific copy number alteration (ASCNA) analysis is used to identify copy number abnormalities in tumor cells. Unlike normal cells, tumor cells are heterogeneous, comprising a combination of dominant and minor subclones with distinct copy number profiles. Estimating the clonal proportion and identifying mainclone and subclone genotypes across the genome are important for understanding tumor progression. Several ASCNA tools have recently been developed, but they have been limited to the identification of subclone regions, and not the genotype of subclones. In this article, we propose subHMM, a hidden Markov model-based approach that estimates both subclone regions and region-specific subclone genotypes and clonal proportions. We specify a hidden state variable representing the conglomeration of clonal genotype and subclone status. We propose a two-step algorithm for parameter estimation, where in the first step, a standard hidden Markov model with this conglomerated state variable is fit. Then, in the second step, region-specific estimates of the clonal proportions are obtained by maximizing region-specific pseudo-likelihoods. We apply subHMM to study renal cell carcinoma datasets in The Cancer Genome Atlas. In addition, we conduct simulation studies that show the good performance of the proposed approach. The R source code is available online at https://dceg.cancer.gov/tools/analysis/subhmm.
2020-09-21, Dr. Danh V. Nguyen, Profiling Dialysis Facilities for Adverse Recurrent Events
UC Irvine, Profiling analysis aims to evaluate health care providers, such as hospitals, nursing homes, or dialysis facilities, with respect to a patient outcome. Previous profiling methods have considered binary outcomes, such as 30-day hospital readmission or mortality. For the unique population of dialysis patients, regular blood work is required to evaluate effectiveness of treatment and avoid adverse events, including dialysis inadequacy, imbalanced mineral levels, and anemia, among others. For example, anemic events (when hemoglobin levels fall outside the normative range) are recurrent and common for patients on dialysis. Thus, we propose high-dimensional Poisson and negative binomial regression models for rate/count outcomes and introduce a standardized event ratio (SER) measure to compare the event rate at a specific facility relative to a chosen normative standard, typically defined as an "average" national rate across all facilities. Our proposed estimation and inference procedures overcome the challenge of high-dimensional parameters for thousands of dialysis facilities. Also, we investigate how overdispersion affects inference in the context of profiling analysis.
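The high-dimensional estimation and inference procedures from the talk are not reproduced here; as a toy illustration of the standardized event ratio itself (observed events over events expected under a normative case-mix model), here is a small Python sketch on simulated data with made-up covariates.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated patient-level data: event counts, person-years at risk, and one case-mix covariate.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "facility": rng.integers(0, 20, 5000),
    "age": rng.normal(60, 10, 5000),
    "time": rng.uniform(0.5, 2.0, 5000),   # person-years at risk
})
df["events"] = rng.poisson(df["time"] * np.exp(-3 + 0.03 * df["age"]))

# Normative model: national event rate as a function of case mix only (no facility effects).
norm = smf.glm("events ~ age", data=df, family=sm.families.Poisson(),
               offset=np.log(df["time"])).fit()
df["expected"] = norm.fittedvalues

# Standardized event ratio: observed over expected events within each facility.
totals = df.groupby("facility")[["events", "expected"]].sum()
ser = totals["events"] / totals["expected"]
print(ser.round(2).head())
```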
2020-07-27, Dr. Abolfazl Mollalo, Spatial Variations of the COVID-19 Incidence in the United States: A GIS-based Approach
Baldwin Wallace University, The outbreak of COVID-19 in the United States is posing an unprecedented socioeconomic burden to the country. Due to inadequate research on geographic modeling of COVID-19, we investigated county-level variations of disease incidence across the continental United States. We compiled a geodatabase of 35 environmental, socioeconomic, topographic, and demographic variables that could potentially explain the spatial variability of disease incidence. Further, we employed spatial lag and spatial error models to investigate spatial dependencies and geographically weighted regression (GWR) and multiscale GWR (MGWR) models to locally examine spatial non-stationarity. The results suggested that even though incorporating spatial autocorrelation could significantly improve the performance of the global ordinary least squares model, these models still performed substantially worse than the local models. Moreover, MGWR explained the highest variation (adj. R²: 68.1%) with the lowest AICc compared to the others. In a second study, we added mortality rates of several infectious and chronic diseases as explanatory variables and examined the performance of a multilayer perceptron (MLP) neural network in predicting cumulative COVID-19 incidence rates using 57 variables. Our results indicated that a single-hidden-layer MLP could explain almost 65% of the correlation with ground truth for the holdout samples. Sensitivity analysis conducted on this model showed that the age-adjusted mortality rates of ischemic heart disease, pancreatic cancer, and leukaemia, together with two socioeconomic and environmental factors (median household income and total precipitation), are among the most substantial factors for predicting COVID-19 incidence rates. The findings may provide useful insights for public health decision makers regarding the influence of potential risk factors associated with the COVID-19 incidence at the county level.
2020-07-13, Dr. Qi Yan, Deep-learning-based Prediction of Late Age-Related Macular Degeneration Progression
University of Pittsburgh
2020-06-29, Dr. Fatma Gunturkun , Artificial Intelligence for Prediction of Late Onset Cardiomyopathy among Childhood Cancer Survivors
UTHSC , Background: Early identification of childhood cancer survivors at high risk for treatment-related cardiomyopathy may improve outcomes by enabling timely intervention. We implemented deep learning and signal processing methods using the Children's Oncology Group (COG) guideline-recommended baseline electrocardiography (ECG) to predict future cardiomyopathy. Methods: Signal processing and deep learning tools were applied to 12-lead electrocardiograms (ECG) obtained on 1,217 adult survivors (≥18 years of age, ≥10 years from diagnosis) of childhood cancer, without evidence of cardiomyopathy, prospectively followed in the St. Jude Lifetime Cohort (SJLIFE) Study. Clinical and echocardiographic assessment of cardiac function was performed at baseline and follow-up evaluations and graded per a modified version of the Common Terminology Criteria for Adverse Events (CTCAE). Extreme gradient boosting (XGBoost) algorithms were applied, and model performance was evaluated by 5-fold stratified cross validation. Results: Median age at baseline evaluation was 31.7 years (range 18.4-66.4), and median age at cancer diagnosis was 8.4 years (range 0.01-22.7). The average length of follow-up time following baseline SJLIFE evaluation was 5 years (range 0.5-9). Among survivors, 67.1% were exposed to chest radiation (median dose of 1,200 cGy, range 4-6,200 cGy) and 76.6% were exposed to anthracyclines (mean dose of 168.7 mg/m2, range 35.1-734.2 mg/m2). A total of 117 (9.6%) survivors developed cardiomyopathy during follow-up. In the model based on ECG features, the cross-validation AUC was 0.87 (95% CI 0.83-0.90), with sensitivity 76% and specificity 79%, and in the model based on ECG and clinical features, the cross-validation AUC was 0.89 (95% CI 0.86-0.91), with sensitivity 78% and specificity 81%. Conclusion: Artificial intelligence using electrocardiographic data may assist in the early identification of childhood cancer survivors at high risk for cardiomyopathy.
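The study above reports 5-fold stratified cross-validated AUCs from XGBoost models. The sketch below shows the general shape of that evaluation loop in base R, but with simulated features and plain logistic regression standing in for the gradient-boosting model, so the data, fold scheme, and classifier here are assumptions for illustration only.

```r
# Generic 5-fold stratified cross-validated AUC with a stand-in classifier
# (logistic regression on simulated features; the study itself used XGBoost).
set.seed(4)
n <- 1000
X <- matrix(rnorm(n * 10), n, 10)
y <- rbinom(n, 1, plogis(X[, 1] - 0.8 * X[, 2]))

auc <- function(score, label) {               # rank-based (Mann-Whitney) AUC
  r <- rank(score); n1 <- sum(label == 1); n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

folds <- numeric(n)                           # stratified fold assignment
for (cl in 0:1) folds[y == cl] <- sample(rep(1:5, length.out = sum(y == cl)))

cv_auc <- sapply(1:5, function(k) {
  fit  <- glm(y[folds != k] ~ X[folds != k, ], family = binomial)
  pred <- cbind(1, X[folds == k, ]) %*% coef(fit)   # linear predictor on held-out fold
  auc(pred, y[folds == k])
})
mean(cv_auc)
```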
2020-06-22, Dr. Hansapani Rodrigo , HIV-Associated Neurocognitive Disorder (HAND) Biomarker Identification with Significance Analysis of Microarrays and Random Forest Analysis
The University of Texas Rio Grande Valley , Genome-wide screening of transcription regulation in brain tissue helps identify substantial abnormalities in patients' gene transcripts and discover possible biomarkers for HIV-Associated Neurological Disorders (HAND). This study explores the possibility of identifying differentially expressed (DE) genes, which can serve as potential biomarkers to detect HAND. In this regard, gene expression levels of three subject groups with different impairment levels of HAND, along with a control group, in three distinct brain regions (white matter, frontal cortex, and basal ganglia) have been investigated and compared using multiple statistical analysis methods, including Significance Analysis of Microarrays (SAM) and random forests (RF). Two aspects of gene expression have been investigated: single-gene and sub-gene-network effects, where the latter model accounts for the co-regulation and interrelations among genes. The sub-gene-network RF model takes prior biological information into account and hence has widened the path in the exploration of DE genes.
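For the single-gene side of the analysis, a common random forest workflow is to rank genes by a permutation-based variable importance measure. The sketch below does this with the randomForest package on simulated expression data; the data, group labels, and the five "true" differentially expressed genes are assumptions, and the sub-gene-network model described above additionally builds network structure into the forest.

```r
# Sketch: ranking genes by random forest variable importance on simulated
# expression data (the sub-gene-network RF model adds network structure on top).
library(randomForest)
set.seed(5)
n <- 60; p <- 200
expr  <- matrix(rnorm(n * p), n, p,
                dimnames = list(NULL, paste0("gene", 1:p)))
group <- factor(rep(c("control", "HAND"), each = n / 2))
expr[group == "HAND", 1:5] <- expr[group == "HAND", 1:5] + 1.5   # 5 true DE genes

rf <- randomForest(x = expr, y = group, importance = TRUE, ntree = 500)
head(sort(importance(rf)[, "MeanDecreaseAccuracy"], decreasing = TRUE), 10)
```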
2020-06-01, Dr. Sixia Chen , Nonparametric Mass Imputation for Data Integration
University of Oklahoma Health Sciences Center , This study has shed light on new sets of genes, including CIRBP, RBM3, GPNMB, ISG15, IFIT6, IFI6, and IFIT3, which were previously overlooked or not detected as significantly expressed among the subjects with HAND in either the frontal cortex or basal ganglia. The gene GADD45A, a protein-coding gene whose transcript levels tend to increase under stressful growth-arrest conditions, was consistently ranked among the top genes by the two RF models within the frontal cortex.
2020-05-11, Dr. Zonghui Hu , Assessment of collective genetic impact from twin study: a mixture distribution approach
NIH/NIAID , It is challenging to evaluate the genetic impacts on a biologic feature and separate them from the environmental impacts. We approach this through twin studies by assessing the collective genetic impact defined by the differential correlation in monozygotic twins versus dizygotic twins. Since the underlying order within a twin pair, determined by latent genetic factors, is unknown, the observed twin data are unordered. Conventional methods for correlation are not appropriate. To handle the missing order, we model twin data by a mixture bivariate distribution and estimate under two likelihood functions: the likelihood over the monozygotic and dizygotic twins separately, and the likelihood over the two twin types combined. Both likelihood estimators are consistent. More importantly, the combined likelihood overcomes the drawback of mixture distribution estimation, namely, the slow convergence. It yields a correlation coefficient estimator with root-n consistency and allows effective statistical inference on the collective genetic impact. The method is demonstrated by a twin study on immune traits. This is joint work with Pengfei Li, Dean Follmann and Jing Qin.
2020-05-04, Dr. Marco Geraci , Quantile contours and allometric modelling for risk classification of abnormal ratios with an application to asymmetric growth restriction in preterm infants
University of South Carolina , In this talk, I will present a novel approach to risk classification based on quantile contours and allometric modelling of multivariate anthropometric measurements. I propose the definition of allometric direction tangent to the directional quantile envelope, which divides ratios of measurements into half-spaces. This in turn provides an operational definition of directional quantile that can be used as cutoff for risk assessment. I will apply the proposed approach to a large dataset from the Vermont Oxford Network containing observations of birthweight (BW) and head circumference (HC) for more than 150,000 preterm infants. The analysis suggests that small preterm infants with large HC-to-BW ratio are at increased risk of mortality as compared to appropriate-for-gestational-age as well as proportionately-growth-restricted preterm infants. This study offers not only an approach to risk classification, but also large-sample estimated cutoffs that can be immediately used by practitioners (Geraci M et al, 2019, Statistical Methods in Medical Research, doi:10.1177/0962280219876963).
2020-04-20, Dr. Mehmet Kocak and Mr. Tristan Hayes , Tracking the Covid-19 Pandemic in the U.S.
UTHSC , Mehmet Kocak, PhD, Associate Professor of Biostatistics in the Department of Preventive Medicine and Tristan Hayes, MSc, Biostatistics Consulting Manager, will present on efforts to track the COVID-19 virus using publicly available data sources. Dr. Kocak's presentation, "A comparative look at the Covid-19 Pandemic in the U.S", will focus on comparing the virus' progression in the US to other nations and will also briefly touch on forecasting. Mr. Hayes's presentation, "Visualizing the Covid-19 Pandemic in the U.S." will cover several methods for depicting the epidemic visually, using R and R Shiny.
2020-03-09, Dr. Yunxi Zhang , Variable Selection and Imputation for High-Dimensional Incomplete Data
University of Mississippi Medical Center , Missing data are an inevitable problem in data with a large number of variables. The presence of missing data obstructs the implementation of existing variable selection methods. This is especially an issue when there is a limited number of observations. An applicable and efficient selection-with-imputation method is necessary to obtain valid results. In this talk, I will propose an approach to efficiently select important variables from high-dimensional data in the presence of missing data. We employ a shrinkage prior and multiple imputation for variable selection in the high-dimensional setting with missing values. Simulation study and data analysis results will be presented and compared with other possible approaches.
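The talk proposes a Bayesian shrinkage prior combined with multiple imputation; as a rough frequentist analogue of the overall workflow (impute several times, run a penalized selection in each completed data set, and keep variables selected in most imputations), here is a sketch using glmnet and a deliberately crude random-draw imputation. Everything in it, from the missingness mechanism to the majority-vote rule, is an assumption for illustration, not the speaker's method.

```r
# Crude analogue of "multiple imputation + penalized selection": impute m times,
# run the lasso in each completed data set, keep variables selected in a majority
# of imputations. (Illustrative workflow only, not the shrinkage-prior approach.)
library(glmnet)
set.seed(6)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - X[, 2] + rnorm(n)
X[sample(length(X), 0.1 * length(X))] <- NA      # 10% of values missing at random

m <- 5
selected <- replicate(m, {
  Xi <- apply(X, 2, function(col) {              # simple random-draw imputation
    col[is.na(col)] <- sample(col[!is.na(col)], sum(is.na(col)), replace = TRUE)
    col
  })
  fit <- cv.glmnet(Xi, y, alpha = 1)
  as.numeric(coef(fit, s = "lambda.min"))[-1] != 0
})
which(rowMeans(selected) > 0.5)                  # variables kept in most imputations
```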
2020-02-24, Dr. Dong Wang , New technology and statistical issues for toxicity testing, from in vitro assays to postmarket surveillance
National Center for Toxicological Research/FDA , The emergence of high-throughput in vitro assays has the potential to significantly improve toxicological evaluations and lead to more efficient, accurate, and less animal-intensive testing. However, effectively utilizing data from in vitro assays in a predictive model is still a challenging problem. One major difficulty is caused by the small sample size of the training data set in most toxicology problems. Thus, utilizing data most efficiently is at a premium in predictive modeling of toxicity, and it requires some creative techniques not commonly used in settings with copious amounts of data. In this presentation, several examples of how to perform statistical modeling and machine learning in toxicity testing within the constraints of small sample sizes will be discussed. In the first example, a robust learning method was applied to the prediction of the point of departure so that a small portion of outlier chemicals will not harm the overall performance of the model. In the second example, adverse outcome pathway networks were utilized to filter potential predictors from in vitro assays to construct a more parsimonious model with improved performance. Finally, we will discuss how to use postmarket surveillance databases to provide independent validation for predictive models based on in vitro assays.
2019-11-18, Sedigheh Mirzaei Salehabadi , Estimation of time-to-event distribution based on partially recalled data
Biostat, St. Jude , In some retrospective studies, participants are asked to recall the time of occurrence of a landmark event, if experienced. Some respondents who had experienced the event are able to recall the date exactly, some recall only the month or year of the event, and some are unable to recall it. These interval-censored data bear evidence of being informatively censored, and we build a special model for estimating the time-to-event distribution. We provide a set of regularity conditions on the distribution, subject to which the consistency and the asymptotic normality of the parametric maximum likelihood estimator are established. We also provide a computationally simple approximation to the nonparametric maximum likelihood estimator and establish its consistency under mild conditions. The small-sample performance of the two estimators is studied through Monte Carlo simulations. Moreover, we provide a graphical check of the assumption of the multinomial model for the recall probabilities, which appears to hold for a menarcheal data set. Its analysis shows that the use of the partially recalled part of the data indeed leads to smaller confidence intervals of the survival function.
2019-11-11, Yujiao Mai , Classic Mediation Analysis for Complex Surveys with Balanced Repeated Replications
Biostat, St. Jude , Mediation analysis investigates the role of a third variable as a transmitter in the relationship between the exposure and the outcome. The third variable is called the mediator. Although established methodology and computer tools exist for classic mediation analysis within the structural equation modeling (SEM) framework, applications to survey data with complex sampling designs have not been addressed. As complex sampling designs using balanced repeated replications are common in national surveys, this study introduces a classic mediation analysis algorithm adjusting for complex surveys with balanced repeated replications and develops software packages in R and SAS. The study then illustrates the application of the algorithm and packages to the Tobacco Use Supplement to the Current Population Survey. At the end of the talk, we will discuss the limitations of this study and future development.
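To show what the balanced-repeated-replication adjustment amounts to computationally, the sketch below estimates a simple indirect effect (the product of the exposure-to-mediator and mediator-to-outcome coefficients) with the full-sample weights and then recomputes it under each replicate weight set, using the usual BRR variance formula. The data, the 16 replicate weight sets, and the weight values are hypothetical; the R and SAS packages described in the talk wrap this logic inside the SEM framework.

```r
# Sketch: indirect (mediated) effect a*b with a balanced-repeated-replication
# variance, using hypothetical full-sample and replicate survey weights.
set.seed(7)
n <- 500; R <- 16                                  # R replicate weight sets
x <- rbinom(n, 1, 0.5)                             # exposure
m <- 0.5 * x + rnorm(n)                            # mediator
y <- 0.4 * m + 0.2 * x + rnorm(n)                  # outcome
w_full <- runif(n, 0.5, 1.5)                       # full-sample weights
w_rep  <- matrix(runif(n * R, 0.5, 1.5), n, R)     # replicate weights (BRR half-samples)

indirect <- function(w) {
  a <- coef(lm(m ~ x, weights = w))["x"]           # exposure -> mediator path
  b <- coef(lm(y ~ m + x, weights = w))["m"]       # mediator -> outcome path
  unname(a * b)
}
est  <- indirect(w_full)
reps <- apply(w_rep, 2, indirect)
se   <- sqrt(mean((reps - est)^2))                 # BRR variance: mean squared deviation
c(indirect = est, se = se)
```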
2019-09-16, Hyo Young Choi , SCISSOR: a novel framework for identifying changes in RNA transcript structures
Computational Biology, UTHSC , High-throughput sequencing protocols such as RNA-seq have made it possible to interrogate the sequence, structure and abundance of RNA transcripts at higher resolution than previous microarray and other molecular techniques. While many computational tools have been proposed for identifying mRNA variation through differential splicing/alternative exon usage, the promise of RNA-seq remains largely unrealized. In this talk, we propose a novel framework for unbiased and robust discovery of aberrant RNA transcript structures using short-read sequencing data, based on shape changes in an RNA-seq coverage profile. Shape Changes In Selecting Sample Outliers in RNA-seq (SCISSOR) is a series of procedures for transforming and normalizing base-level RNA sequencing coverage data in a transcript-independent manner, followed by a statistical framework for its analysis. The resulting high-dimensional object is amenable to unsupervised screening of structural alterations across RNA-seq cohorts with nearly no assumption on the forms of underlying abnormalities. This enables independent recapture of known variants (such as splice site mutations in tumor suppressor genes) as well as novel variants that are previously unrecognized or difficult to identify by any existing methods, including recurrent alternate start sites and recurrent complex deletions in 3' UTRs.
2019-08-12, Dr. Qian (Michelle) Zhou , Risk prediction with Cohort Studies
Statistics, Mississippi State U. , Accurate risk prediction is a key component of precision medicine. In this talk, I will present several research projects on risk prediction that I have worked on. I will introduce a novel threshold-free metric to evaluate the overall precision of a risk model. I will talk about how to identify subgroups for which new markers are most/least useful in improving risk prediction. In large cohort studies, it is often not feasible to measure expensive biomarkers on the full cohort, and efficient sampling designs, such as nested case-control (NCC), are employed. I will illustrate how to accurately evaluate the incremental values of the new markers under a three-phase NCC design. I will also present a flexible varying coefficient model for estimating the age-specific absolute risk with the marker effects varying with age.
2019-05-13, Lih-Yuan Deng , Survey of Current Random Number Generators used in R and Comparison with DX Generators
Statistics, University of Memphis , In this talk, we review the PRNGs available in RNGkind and compare them with our proposed DX (Deng-Xu) generator, a fast and efficient, huge-period multiple recursive generator (MRG) with equi-distribution in thousands of dimensions. Compared with these generators, DX generators are the fastest and have the longest period length. Furthermore, we show that DX generators consistently pass all the empirical tests, while most of the PRNGs used in base R are not able to pass all these tests, including the default generator MT19937.
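For readers unfamiliar with multiple recursive generators, the sketch below implements a generic order-k MRG of the same general flavor as the DX family (one common multiplier applied to a set of past states, reduced modulo a prime). The multiplier, order, modulus, and seed are illustrative assumptions chosen to stay within double precision; the actual DX generators use carefully selected parameters that yield huge periods and equi-distribution in thousands of dimensions.

```r
# Generic multiple recursive generator (MRG) of order k:
#   x_n = B * (x_{n-1} + ... + x_{n-k}) mod p,   u_n = x_n / p
# Illustrative B, k, p, and seed only; not the published DX parameters.
mrg <- function(n, k = 4, B = 1024, p = 2147483647, seed = 1:4) {
  x <- c(seed, numeric(n))
  for (i in (k + 1):(k + n)) {
    x[i] <- (B * sum(x[(i - k):(i - 1)])) %% p   # B kept small so the product
  }                                              # stays exact in double precision
  x[-(1:k)] / p                                  # map to (0, 1)
}
u <- mrg(10000)
hist(u, main = "Illustrative MRG output")        # should look roughly uniform
```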
2019-04-29, Yimei Li , Genome-wide Association Analysis on Functional Imaging Data
Biostat, St. Jude , Imaging genetics allows for the identification of how common and rare genetic polymorphisms influencing molecular processes (e.g., serotonin signaling) bias neural pathways (e.g., amygdala reactivity) and mediate individual differences in complex behavioral processes (e.g., trait anxiety) related to disease risk in response to environmental adversity. The challenges include how to integrate imaging data with genetic information, how to develop more advanced statistical methods to analyze those types of data, and how to produce reproducible analyses. Functional phenotypes (e.g., subcortical surface representation), which commonly arise in imaging genetic studies, have been used to detect putative genes for complexly inherited neuropsychiatric and neurodegenerative disorders. However, existing statistical methods largely ignore the functional features (e.g., functional smoothness and correlation). I will focus on introducing some published methods for analyzing functional medical imaging data with genetic information, such as functional structural equation models for twin imaging data, functional mixed effects models for candidate gene mapping, and functional genome-wide association analysis, and their application in real data analysis.
2019-04-15, Hongmei Zhang , Bayesian network selection with ordered data
Biostat, U of Memphis , We propose an approach for constructing Bayesian networks for data with unknown order and determining whether networks constructed under different conditions are statistically identical. A Bayesian method is developed for this purpose. A penalty-incorporated conditional posterior probability mass function, motivated by conditional posterior distributions under non-informative priors, is implemented to make a selection between identical networks and differential networks. Theoretical assessment, simulations, and real data applications indicate the efficacy and efficiency of the proposed method.
2019-04-08, Bernie J. Daigle, Jr. , Maximum Likelihood Parameter Estimation and Model Identification from Single-Cell Distribution Data
Biological Sciences and Computer Science, U of Memphis , Recent advances in single-cell experimental techniques have provided unprecedented access to the mechanisms underlying fundamental cellular processes. In particular, techniques assaying populations of single cells, including single-cell RNA sequencing (scRNA-seq), have highlighted the importance of cellular noise: stochastic fluctuations within, and heterogeneity between, genetically identical cells. Despite the growing availability of such single-cell distribution data, limitations in computational methods for parameter estimation and model identification remain a bottleneck for developing accurate mechanistic descriptions of cellular processes. Existing methods typically make simplifying assumptions about the underlying biochemical model, impose limits on model size/complexity, or require prior knowledge of model parameter values. I will present a novel maximum likelihood-based method for parameter estimation and model identification, distribution Monte Carlo Expectation-Maximization with Modified Cross-Entropy Method (dMCEM2), that does not have these limitations. Building upon a method developed for single-cell time series data, dMCEM2 enables automated, computationally efficient inference and identification of stochastic biochemical models from single-cell distribution data. Using both synthetic and real-world scRNA-seq data, I will demonstrate the ability of dMCEM2 to accurately construct mechanistic models of gene expression.
2019-03-18, Natasha Sahr , Multi-level variable screening and selection for survival data
Biostat, St. Jude , Variable selection for the marginal proportional hazards model is a relatively understudied research area in biostatistics. The limited available methods focus on the selection of non-zero individual variables for a single outcome. However, variable selection in the presence of grouped covariates is often required. Some methods are available for the selection of non-zero group and within-group variables for the Cox proportional hazards model, but there are no available methods to perform group variable selection in the clustered multivariate survival setting. In this context, the hierarchical adaptive group bridge penalty is proposed to select non-zero group and within-group variables for the independent or clustered marginal multivariate proportional hazards model. Simulation studies show the hierarchical adaptive group bridge method has superior performance compared to the extension of the adaptive group bridge in terms of variable selection accuracy. Survival data sometimes involve high-dimensional group variables. Most existing screening methods address the sure screening property for individual variable selection. The sure group joint variable screening method is proposed to screen independent and clustered multivariate survival data in the presence of group variables. Simulation studies show the sure group joint variable screening method performs better than existing screening procedures extended to the multivariate survival setting. The hierarchical adaptive group bridge and sure group joint variable screening methods can be effective tools, used in a two-step process, in identifying non-zero group and within-group variables for high-dimensional multivariate survival data.
2019-02-21, Mehmet Kocak , Statistical Modeling for Growth Curves
Preventive Medicine, UTHSC , In pediatric clinical trials and cohort studies, actual height and weight of children at a specific age may be required for certain developmental assessments such as energy expenditure. This necessitates the choice of a growth model with desired characteristics to predict height and weight accurately. In this talk, we introduce and compare two commonly used growth curve models, namely, Logistic and Gompertz models, with respect to the distribution of their residuals as well as the logistical challenges in model convergence using the US and Turkish Growth Curve Standards for the first 3 years of life.
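To give a feel for fitting the two curves being compared, the sketch below fits logistic and Gompertz growth models to simulated weight-for-age data using the self-starting nls models available in base R, then compares fits and residuals. The simulated trajectory and noise level are assumptions and are not the US or Turkish growth standards used in the talk.

```r
# Fitting the two growth models compared in the talk to simulated weight data,
# using base R self-starting nls models (illustrative data, not growth standards).
set.seed(9)
age <- seq(0, 36, by = 1)                              # age in months
wt  <- 15 * exp(-1.6 * 0.88^age) + rnorm(length(age), sd = 0.3)  # Gompertz-shaped truth

fit_logis <- nls(wt ~ SSlogis(age, Asym, xmid, scal))  # logistic growth model
fit_gomp  <- nls(wt ~ SSgompertz(age, Asym, b2, b3))   # Gompertz growth model

AIC(fit_logis, fit_gomp)                               # compare overall fit
par(mfrow = c(1, 2))
plot(age, resid(fit_logis), main = "Logistic residuals")
plot(age, resid(fit_gomp),  main = "Gompertz residuals")
```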
2018-10-22, Zheng Xu , Association Testing Based On Sequencing Data With Arbitrary Depth
Statistics, University of Nebraska-Lincoln , Association studies have been widely conducted to explain the relationship between phenotypes and genetic variants. Researchers have developed association testing methods for a single marker or a group of markers based on genotypes. The phenotypes, either continuous or binary traits, are studied in the regression framework with genotypes as explanatory variables. Some factors, such as genotype calling uncertainty and sequencing depth, may influence the performance of these genotype-based testing methods. We propose association testing methods based directly on next-generation sequencing data without genotype calling. The methods have been applied to the testing of a single marker or a group of markers, with applications to rare variants.
2018-04-09, Luhang Han , Identifying stable and dynamic CpG sites pre- and post-adolescence transition via a longitudinal genome-scale study
Statistics, University of Memphis , There is some evidence that DNA methylation (DNA-M) over time is stable at certain cytosine-phosphate-guanine (CpG) sites and varies at others (dynamic methylation). The adolescence transition (puberty) is considered to be associated with DNA-M change, and this change is gender specific. In adolescence, a gender reversal of asthma prevalence occurs, from male predominance of asthma prevalence in young childhood to female predominance after adolescence. Given that DNA methylation may play a central role in susceptibility to asthma, assessing the stability of DNA-M offers a way to understand the mechanisms of asthma transition during adolescence. The aim of this study was to identify dynamic and stable DNA-M at the genome scale and assess their gender specificity. Data from children at 10 and 18 years old from the Isle of Wight (IOW) birth cohort in the United Kingdom were included. Epigenome-scale DNA-M was assessed using the Illumina 450K and 850K EPIC platforms. Linear mixed models with repeated measures were implemented in our analysis. We identified 15,532 CpG sites that were dynamic during the adolescence transition in both genders; at 1,179 CpG sites, the level of DNA-M was not stable during the adolescence transition and the change was gender specific. The findings were further tested in an independent study, the Avon Longitudinal Study of Parents and Children (ALSPAC), and the results showed agreement with the findings from the IOW cohort.
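As a schematic of the per-CpG model family mentioned above (a linear mixed model with repeated measures at ages 10 and 18, a random intercept per child, and an age-by-gender interaction capturing gender-specific change), here is a small lme4 sketch. The data frame, variable names, and effect sizes are hypothetical; the study fits such models epigenome-wide.

```r
# Schematic per-CpG linear mixed model: repeated methylation measures at ages
# 10 and 18, a random intercept per child, and an age-by-gender interaction.
# (Hypothetical data and variable names, for illustration only.)
library(lme4)
set.seed(10)
dat <- data.frame(id     = rep(1:200, each = 2),
                  age    = rep(c(10, 18), 200),
                  gender = rep(sample(c("F", "M"), 200, replace = TRUE), each = 2))
dat$meth <- 0.5 + 0.01 * (dat$age - 10) * (dat$gender == "F") +
            rep(rnorm(200, sd = 0.05), each = 2) + rnorm(400, sd = 0.05)

fit <- lmer(meth ~ age * gender + (1 | id), data = dat)
summary(fit)$coefficients      # the age-by-gender term reflects gender-specific change
```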
2018-03-26, Chi-Yang Chiu , An additive hazards model for gene level association analysis of survival traits of complex disorders
Preventive Medicine, UTHSC , Based on counting processes, the Doob-Meyer decomposition of submartingales, and functional regression models, we propose an additive hazards model for gene-level association analysis of survival traits of complex disorders. The additive hazards model overcomes the proportional hazards assumption of Cox models and is more flexible for modeling association with time/age. Association between genetic markers and eye disease will be investigated.
2018-03-12, Hyeonju Kim , Efficient algorithms for detecting GxE in Multivariate Linear Model
Preventive Medicine, UTHSC , We develop a multivariate linear mixed model for detecting gene-environment interactions when there are many environments and we have information annotating the environments. Our prototype example datasets are on segregating plant populations grown in multiple sites in multiple years. We will have information on the weather in each year as well as site-specific information such as latitude. The goal is to find QTLs that depend on latitude, accounting for weather patterns that vary by year. We formulate a linear mixed model where traits can be correlated due to genomewide similarities (genetic kinship) and due to weather similarities ("climate kinship") between environments. We implement an efficient algorithm that uses an Expectation Conditional Maximization (ECM) algorithm in conjunction with an acceleration step. A simulation study will be presented.
2018-02-12, Saunak Sen , Three algorithms for statistical computing: the MM, EM, and proximal gradient algorithms
Preventive Medicine, UTHSC , Many problems in statistical estimation and machine learning boil down to optimization of a criterion such as the log likelihood, the residual sum of squares, or a penalized version. Commonly used algorithms include iteratively reweighted least squares (IWLS) and Newton-Raphson methods. If the objective function depends on a large number of parameters, is not smooth, or is difficult to compute, other methods are needed. In statistics the EM algorithm is a common choice; in machine learning proximal gradient algorithms are useful. It turns out that both are special cases of the MM (minorization-maximization) algorithm. This method is guaranteed to improve the objective function in each iteration under general conditions. We will provide an overview of the three algorithms, and examine their use in penalized regression models.
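Since the proximal gradient idea is the least familiar of the three for many statisticians, here is a compact, self-contained example: ISTA for the lasso, where each iteration takes a gradient step on the least-squares part and then applies the proximal operator of the L1 penalty (soft-thresholding). The simulated data, penalty level, and step-size rule are illustrative choices.

```r
# Proximal gradient (ISTA) for the lasso: gradient step on the smooth
# least-squares term, then the prox of the L1 penalty (soft-thresholding).
set.seed(11)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(3, -2, rep(0, p - 2))
y <- X %*% beta_true + rnorm(n)

soft   <- function(z, t) sign(z) * pmax(abs(z) - t, 0)  # prox of t * ||.||_1
lambda <- 20
step   <- 1 / max(eigen(crossprod(X))$values)           # 1 / Lipschitz constant
beta   <- rep(0, p)
for (it in 1:500) {
  grad <- crossprod(X, X %*% beta - y)                  # gradient of 0.5 * ||y - Xb||^2
  beta <- soft(beta - step * grad, step * lambda)       # proximal (soft-threshold) step
}
round(beta, 2)                                          # only the first two stay clearly non-zero
```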
2018-01-29, Chris Gignoux , The role of human genetic diversity in the architecture of traits: lessons from PAGE
Un. of Colorado , As we enter the second decade of genome-wide association studies, there remains a persistent bias towards focusing on populations of European descent. This has numerous drawbacks, both limiting new discoveries and impairing our translation of findings into precision medicine for people across the world. To counteract this, the Population Architecture using Genomics and Epidemiology (PAGE) Study was developed, currently leveraging genome-wide data from 52,000 individuals representing four multi-ethnic longitudinal cohorts. This large-scale work in diverse populations presents several computational challenges. I will present on our efforts to design new arrays that capture variation in diverse populations. Furthermore, I will demonstrate the use of both statistical and population genetics methods to uncover new insights in complex traits and the role of population structure in elucidating genetic architecture. I will conclude with some of our work on Mendelian genetic disease in the context of human genetic diversity and its implications for modern health systems and biobanks.
2017-11-13, Mina Sartipi , mStroke: Mobile Technology for Post-Stroke Recurrence Prevention and Recovery
The University of Tennessee at Chattanooga , In this talk, I will present mStroke, a real-time quantitative assessment of stroke rehabilitation using wearable sensors. The goal of mStroke is to explore mobile technology to improve stroke recovery and prevent stroke recurrence. I will focus on mStroke's current clinical functions/applications, such as the functional reach test (FRT), NIH Stroke Scale (NIHSS) motor arm/motor leg tests, gait speed, activity recognition, and fall detection. mStroke has been tested on more than 200 students emulating individuals post-stroke and also on 40 patients post-stroke. I will conclude my talk by discussing our vision for mStroke's next steps. We want to study a post-stroke management system through multi-modal big data analytics applied jointly to real-time data (from mStroke and other physiological sensors) and EMR data.
2017-10-23, Saunak Sen , Statistical learning methods and the bias-variance tradeoff
Preventive Medicine, UTHSC , The bias-variance tradeoff is a fundamental concept underlying statistical modeling or learning methods. As we increase the complexity of our models, we usually decrease the bias in its prediction. However, increasing model complexity also tends to make the predictions more variable or unstable. In this expository talk, we will examine how the bias-variance tradeoff is exhibited in the context of some common statistical learning methods such as k-nearest neighbors, multiple linear regression, gradient boosting machines, and regularized regression. We will examine their performance in the context of predicting plant fitness using genomewide genotype data.
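A tiny simulation along these lines, using k-nearest neighbors from the class package: with very small k the classifier chases noise in the training data (high variance), while with very large k it over-smooths the decision boundary (high bias), and the test error is typically lowest somewhere in between. The simulated two-dimensional data and the grid of k values are illustrative assumptions.

```r
# Bias-variance tradeoff with k-nearest neighbors: very small k overfits
# (high variance), very large k underfits (high bias).
library(class)
set.seed(12)
n <- 200
x_tr <- matrix(runif(n * 2), n, 2)
y_tr <- factor(rbinom(n, 1, plogis(6 * (x_tr[, 1] - 0.5))))
x_te <- matrix(runif(n * 2), n, 2)
y_te <- factor(rbinom(n, 1, plogis(6 * (x_te[, 1] - 0.5))))

ks <- c(1, 5, 15, 45, 101)
test_err <- sapply(ks, function(k) {
  mean(knn(train = x_tr, test = x_te, cl = y_tr, k = k) != y_te)
})
data.frame(k = ks, test_error = round(test_err, 3))
```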
2017-09-18, Fridtjof Thomas , Predictive Modeling: Can't See the Forest for the Trees?
Preventive Medicine, UTHSC , Predictive modeling is a very active research area constantly adding new approaches as well as refining existing ones. This seminar provides an overview of the approaches referred to as Random Forests, Elastic Nets, Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Neural Nets. R-code and packages implementing these approaches are presented.
2017-08-28, Dr. Gregory Farage , Feature Detection in PolSAR Images using the wavelet transform
Preventive Medicine, UTHSC , After a brief introduction to Geographical Information System (GIS) and Remote Sensing, I will present some of their applications with a focus on polarimetric radar data. I will present two main approaches for classification of remote sensing data. Next I will present a filtering technique that we developed based on wavelet multiresolution analysis as a feature detection approach for PolSAR data. Finally, I will open a discussion on the similarities between genomic data and remote sensing data.
2017-08-22, Dr. Laura Saba , Animal models and statistical strategies for describing the transcriptional connectome and its role in complex traits
University of Colorado at Denver , Rarely are the exact same genes associated with a complex phenotype in different populations, e.g., different recombinant inbred rodent panels, different human populations, or different species. In most cases, it is the biological pathway that is important in the etiology of disease rather than an individual gene. In our first analysis, information across several rat populations is combined to identify a set of co-expressed genes in brain that predispose to alcohol-related traits using weighted gene co-expression network analysis (WGCNA) and quantitative trait loci. At the heart of this co-expression module is an unannotated transcript that is likely to be a non-coding transcript. When this unannotated transcript is genetically manipulated, not only does the phenotype of interest, alcohol consumption, change, but the transcription levels of many other genes are also altered. This statistical model provides insight about relevant biological processes but may not give detail about the directed pathway from genotype to disease necessary to identify potential therapeutic targets. In an additional study, we performed a similar WGCNA analysis of liver RNA expression levels and associated modules and genetic variants with alcohol metabolism-related phenotypes. In this study, we identified a module that contains multiple isoforms of alcohol dehydrogenase, which is known to have a major role in the metabolism of alcohol to acetaldehyde. Using this co-expression module, we also explored the application of Bayesian networks to further hypothesize about the direction of relationships between transcripts using Mendelian randomization. This model further distinguished among transcripts.
2017-08-14, Dr. Ibrahim Abdelrazeq , Levy Driven CARMA(2,1) vs Realized Volatility
Rhodes College , The Lévy-driven CARMA(2,1) process is a popular model for stochastic volatility. However, there has been little development of statistical tools to verify this model assumption and assess the goodness-of-fit of real-world data (realized volatility). When a Lévy-driven CARMA(2,1) process is observed at high frequencies, the unobserved driving process can be approximated from the observed process. Since, under general conditions, the Lévy-driven CARMA(2,1) process can be written as a sum of two dependent Lévy-driven CAR(1) processes, the methods developed in Abdelrazeq, Ivanoff, and Kulik (2014) can be employed in order to use the approximated increments of the driving process to test the assumption that the process is Lévy-driven. The performance of the test is illustrated through simulation and real-world data.
2017-07-24, Demba Fofana , Gene Expression and Network Analysis
University of Texas Rio Grande Valley , Analyzing gene expression data rigorously requires taking assumptions into consideration, but it also relies on using information about network relations that exist among genes. Combining these different elements can not only improve statistical power but also provide a better framework through which gene expression can be properly analyzed. We propose a novel statistical model that combines assumptions and gene network information into the analysis. Assumptions are important since every test statistic is valid only when the required assumptions hold. We incorporate gene network information into the analysis because neighboring genes share biological functions. This correlation factor is taken into account via similar prior probabilities for neighboring genes. With a series of simulations, our approach is compared with other approaches. Our method, which combines assumptions and network information into the analysis, is shown to be more powerful.
2017-07-17, Dr. Rishi Kamaleswaran , Continuous Big Data and Analytics at the Point of Care
Department of Pediatrics, UTHSC , Critically ill patients are admitted to the intensive care unit (ICU) for complex, time-sensitive, and dynamic care. While traditional patient monitors in the ICU have been used to generate large volumes of continuous physiological data from sensors attached to the patient, analytics applied to those systems have largely been univariate and limited. Use of big data approaches, such as continuous and longitudinal event stream analytics through open-source software such as Apache Spark, allows us to analyze multiple channels of physiological data for prediction of potentially devastating conditions prior to their clinical manifestation. The use of novel discretization and machine learning methods allows us to identify salient 'physiomarkers', such as reduced heart rate variability or arrhythmias. This presentation will highlight recent work and new directions we are pursuing at the Pediatric Intensive Care Unit at Le Bonheur Children's Hospital.
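Reduced heart rate variability is mentioned above as one salient 'physiomarker'. Two standard HRV summaries are easy to state: SDNN, the standard deviation of the RR (beat-to-beat) intervals, and RMSSD, the root mean square of successive differences. The sketch below computes both from a simulated RR-interval stream; the autoregressive simulation and its parameters are assumptions for illustration.

```r
# Two standard heart-rate-variability summaries computed from a stream of
# RR intervals in milliseconds (simulated data for illustration).
set.seed(13)
rr <- 800 + arima.sim(list(ar = 0.8), n = 600, sd = 15)   # simulated RR intervals

sdnn  <- sd(rr)                      # overall variability of RR intervals
rmssd <- sqrt(mean(diff(rr)^2))      # short-term, beat-to-beat variability
c(SDNN = sdnn, RMSSD = rmssd)
```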
2017-03-27, Fridtjof Thomas , P-values: Too Big To Fail?
Preventive Medicine, UTHSC, This seminar outlines what we like about p-values and in which situations they are inherently meaningful. The participant will then get an understanding of why their use is somewhat less straightforward in situations arising in observational studies and when very many hypotheses are tested (the multiplicity-of-testing problem). Details are given why p-values cannot serve as measures of support for a specific hypothesis and why they do not measure "strength of scientific evidence" in a general sense. The American Statistical Association has recently issued recommendations on p-values and the reporting of statistical analyses, and the Journal of Basic and Applied Social Psychology has banned p-values altogether from its research articles. Are p-values still around because they are inherently beneficial, or have they simply become so prolific that they are "too big to fail"?
2016-10-03, Josh Callaway , The Next Generation of Data Analysis
CEO/Co-founder/Data Scientist, PendulumRock Analytics, LLC, According to Moore's Law, the number of transistors on a microchip doubles nearly every 2 years. As we are now approaching this theoretical limit, Ray Kurzweil points out that technological performance will enter a new paradigm of quantum computing, neuromorphic chips, and 3-D stacking. We can well extrapolate a similar paradigm shift in many other realms of information technology. Apps are replacing the middle man and rendering life cheaper in nearly any service imaginable, from the hotel industry to hailing a taxi or ordering food. It appears very likely, as well, that apps will make their way into medical diagnosis and rental services for autonomous self-driving cars. The boon of this app awakening manifests in the acceleration of declining costs and easy accessibility. In developing countries, such technology is providing business opportunity where it was previously too expensive. And this is not limited to consumer services. The proliferation of Big Data and high-performance computing has broken barriers in statistics. Ten years ago, an analytical model that today takes 10 minutes to compute might have taken 10 days. Furthermore, this is not merely limited to high-cost licensed commercial software. Breakthroughs have been realized that allow smaller institutions and individuals to spread the work of programming across multiple processes. Additionally, cheap production of statistical applications is now plausible through open-source platforms. Therefore, we may be witnessing an analytical singularity in which nearly-instantaneous output can be produced from low-cost input. Already underway, this is providing a mainspring in technical startups and will fuel a similar paradigm shift in statistics and analytics.
2016-08-22, Fridtjof Thomas , R for power users: Compiled R code and parallel computing techniques
Preventive Medicine, UTHSC, This seminar will demonstrate how to harness the power of R for large numerical computations, with a focus on simulation studies or other repetitive computational tasks. I will outline how to write R code with speedy execution in mind and how to investigate where computation time is spent in your R function by profiling your R execution. A live demonstration will show how R code can be compiled for faster execution as well as how to create and utilize the multiple cores that your hardware provides. We will specifically address (pseudo) random number generation and repeatability of parallel/distributed computing as a cornerstone of reproducible research. All techniques are demonstrated in a Windows 10 environment but are generally applicable to UNIX and Macintosh systems as well.
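The base-R building blocks for the topics listed above fit in a few lines: Rprof for profiling, compiler::cmpfun for byte-compiling a function, and the parallel package for a small cluster with a reproducible random-number stream. The toy function and the cluster size are assumptions for illustration.

```r
# Base-R tools along the lines discussed: profiling, byte-compiling a function,
# and a small parallel run with a reproducible random-number stream.
library(compiler)
library(parallel)

slow_mean <- function(x) { s <- 0; for (v in x) s <- s + v; s / length(x) }
fast_mean <- cmpfun(slow_mean)                  # byte-compiled version of the same function

Rprof("prof.out")                               # profile where computation time is spent
invisible(replicate(500, slow_mean(runif(1e4))))
Rprof(NULL)
head(summaryRprof("prof.out")$by.self)

cl <- makeCluster(2)                            # two worker processes
clusterSetRNGStream(cl, iseed = 2016)           # reproducible parallel RNG (L'Ecuyer-CMRG)
res <- parLapply(cl, 1:4, function(i) mean(rnorm(1e5)))
stopCluster(cl)
unlist(res)
```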
2016-05-23, Hyeonju Kim , Probabilities of Ruin in Economics and Insurance under Light- and Heavy-tailed Distributions
Preventive Medicine, UTHSC, This research is conducted on ruin problems in two fields. First, the ruin or survival of an economic agent over finite and infinite time horizons is explored for a one-good economy. A recursive relation derived for the intractable ruin distribution is used to compute its moments. A new system of Chebyshev inequalities, using an optimal allocation of different orders of moments over different ranges of the initial stock, provides good conservative estimates of the true ruin distribution. The second part of the research is devoted to the study of ruin probabilities in the general renewal model of insurance under both light- and heavy-tailed claim size distributions. Recent results on the dual problem of equilibrium of the Lindley-Spitzer Markov process provide clues to the orders of magnitude of finite-time ruin probabilities in insurance.
2016-04-25, Parichoy Pal Choudhury , Causal Effect Among The Treated: Multiple Data Sources and Censored Outcomes
Johns Hopkins Bloomberg School of Public Health , We develop an inferential framework for estimating the causal effect among "exposed" subjects on a time-to-event outcome, based on multiple data sources and censored outcome information. We conceptualize a hypothetical point exposure study where subjects are enrolled and allowed to select their own exposure. Using information from two data sources (one for exposed subjects and one for non-exposed subjects with multiple examination times), we describe a process of manufacturing a dataset that closely mimics this hypothetical study. The identification of the causal effect relies on a no unmeasured confounding assumption based on covariates available at exposure selection and a non-informative censoring assumption. Estimation proceeds by fitting separate proportional hazards regression models for exposed and non-exposed subjects using the manufactured dataset and using G-computation to estimate, for exposed subjects, the distributions of time-to-event under exposure and non-exposure. Using these estimated distributions, we compute a parsimonious measure of the causal effect of interest.
2016-04-18, Stanley Pounds , Profiling Dialysis Facilities for Adverse Recurrent Events
St. Jude Children's Research Hospital , Profiling analysis aims to evaluate health care providers, such as hospitals, nursing homes, or dialysis facilities, with respect to a patient outcome. Previous profiling methods have considered binary outcomes, such as 30-day hospital readmission or mortality. For the unique population of dialysis patients, regular blood work is required to evaluate effectiveness of treatment and avoid adverse events, including dialysis inadequacy, imbalanced mineral levels, and anemia, among others. For example, anemic events (when hemoglobin levels fall outside the normative range) are recurrent and common for patients on dialysis. Thus, we propose high-dimensional Poisson and negative binomial regression models for rate/count outcomes and introduce a standardized event ratio (SER) measure to compare the event rate at a specific facility relative to a chosen normative standard, typically defined as an "average" national rate across all facilities. Our proposed estimation and inference procedures overcome the challenge of high-dimensional parameters for thousands of dialysis facilities. Also, we investigate how overdispersion affects inference in the context of profiling analysis. (manuscript: https://onlinelibrary.wiley.com/doi/epdf/10.1002/sim.8482)
2016-04-04, Jonathan S. Schildcrout , Epidemiological sampling designs for longitudinal binary data with application to spirometry-based COPD diagnosis
Vanderbilt University, We discuss an epidemiological study design and analysis approach for longitudinal binary response data. Similar to other epidemiological study designs, we seek to gain efficiency and increase power compared to standard designs by over-sampling relatively informative subjects for inclusion into the sample. In particular, we will discuss a design that conducts a case-control sample (i.e., samples cases with high probability and controls with low probability); however, subjects are then followed longitudinally and case-control status is observed repeatedly for each subject. If the sampling variable (case-control status at baseline) is closely related to the binary response (case-control status over time), we are able to observe a sample that is highly enriched with case-visits compared to a standard random sampling design. We may therefore realize substantial improvements in power and efficiency. However, because the design over-samples case-visits, we must acknowledge the non-representativeness of the sample when conducting statistical analyses. We will describe a sequentially offsetted regression approach for valid inferences. Motivated by data provided by the Lung Health Study, we will show that targeted sampling designs can yield valid inferences and can be far more resource-efficient than standard random sampling designs for longitudinal data.
2016-03-16, Charisse Madlock-Brown , Exploring the UTHSC co-authorship Network with D3
Informatics and Information Management, UTHSC, The D3 JavaScript library is currently having a huge impact on academic data analysis. The D3 library allows researchers to build interactive data visualizations for exploration using a wide variety of charts, networks, and graphs. These visualizations can further help researchers communicate their findings in presentations and answer questions with on-the-fly interactive investigation of their data. In this presentation, I will go over the basics of the D3 library. I will also demonstrate its potential by exploring a collaborative network visualization of UTHSC researchers. I will explore the co-authorship patterns between basic and clinical scientists, investigate sub-structures, and explore the publication history of prominent co-authorship communities using the network as a guide for exploration.
2016-02-29, Arzu Onar-Thomas , A critical look at non-inferiority trials: benefits and pitfalls
St. Jude Children's Research Hospital, Non-inferiority trials have the potential to be extremely useful and are designs of choice when placebo-controlled trials are not ethical or when a new treatment is thought to be similar in primary outcome to the standard of care but may have advantages in secondary endpoints such as quality of life, cost, compliance, etc. Designing and conducting non-inferiority trials, however, can be a lot more challenging compared to a superiority trial, as the sources of bias are more abundant and lurk in unusual places, such as in intent-to-treat populations. Furthermore, full interpretation of results may rely on information outside of the trial itself, making non-inferiority trials vulnerable to the hazards of non-randomized trials. In this talk, I will try to outline some of the issues that need to be taken into account when designing and running a non-inferiority trial and provide guidance regarding best practices. I will also discuss potential biases that may be present and discuss proper ways of analyzing the data and interpreting the results. The talk will primarily focus on conceptual aspects rather than mathematical details and is intended for statisticians, physician scientists, as well as others involved in operational aspects of clinical trials.
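One way to see why the margin and its justification carry so much weight is the standard confidence-interval view of a non-inferiority comparison: non-inferiority is concluded only if the lower confidence bound for the treatment difference (new minus standard) lies above the pre-specified margin. The counts, margin, and normal-approximation interval below are toy assumptions for illustration.

```r
# Confidence-interval view of a non-inferiority comparison: conclude
# non-inferiority if the lower bound of the CI for the difference in success
# proportions (new minus standard) exceeds -delta. Toy numbers only.
delta <- 0.10                                   # pre-specified non-inferiority margin
x_new <- 166; n_new <- 200                      # successes / n on the new treatment
x_std <- 170; n_std <- 200                      # successes / n on the standard of care

p_new <- x_new / n_new; p_std <- x_std / n_std
se    <- sqrt(p_new * (1 - p_new) / n_new + p_std * (1 - p_std) / n_std)
ci    <- (p_new - p_std) + c(-1, 1) * qnorm(0.975) * se

ci                                              # two-sided 95% CI for the difference
ci[1] > -delta                                  # TRUE here: non-inferiority at this margin
```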
2016-01-25, Ethan Willis , Causal Inference in Rare Diseases, In Practice
Center for Biomedical Informatics, UTHSC, This presentation will give an overview of how researchers can measure the impact of medical interventions on patients with rare diseases using longitudinal data. In the United States, rare diseases affect 30 million patients in approximately 7,000 disease areas. Causal inference to inform therapy decisions in this area is challenged by several factors: the patient population is small, the genotype or phenotype is variable, and the disease pathways can be uncharacterized or lengthy. These challenges are not specific to the rare disease setting, as they apply more broadly to answering causal inference questions in the era of precision medicine, where public health decisions rely on subgroups with small sample sizes. In 2019, the Food and Drug Administration published documents guiding best practices on data collection, study design, and use of registry data in the rare disease setting. Over the past few years, a few algorithms were published to guide researchers to appropriate study designs. This presentation will give an overview of best design practices and innovative analysis methods and illustrate their use with case studies in rare diseases.
2015-11-04, Karl Broman, Reproducible Research
Biostatistics and Medical Informatics, University of Wisconsin Madison, A minimal standard for data analysis and other scientific computations is that they be reproducible: that the code and data are assembled in a way so that another group can re-create all of the results (e.g., the figures in a paper). I will discuss my personal struggles to make my work reproducible and will present a series of suggested steps on the path towards reproducibility (see http://kbroman.org/steps2rr).
2015-11-02, Karl Broman, Interactive graphics for genetic data
Biostatistics and Medical Informatics, University of Wisconsin Madison, The value of interactive graphics for making sense of high-dimensional data has long been appreciated but is still not in routine use. I will describe my efforts to develop interactive graphical tools for genetic data, using JavaScript and D3. (The tools are available as an R package: R/qtlcharts, http://kbroman.org/qtlcharts) I will focus on an expression genetics experiment in the mouse, with gene expression microarray data on each of six tissues, plus high-density genotype data, in each of 500 mice. I argue that in research with such data, precise statistical inference is not so important as data visualization.