Software - Danyu Lin, PhD

booster-interval

CNVstat

CNVstat is a command-line program written in C for the statistical association analysis of CNVs and SNPs. CNVstat allows the user to estimate or test the effects of CNVs and SNPs by maximizing the (observed-data) likelihood that properly accounts for differential measurement errors and calling uncertainties. It is versatile in several aspects: (1) it provides the integrated analysis of CNVs and SNPs as well as the analysis of total CNVs; (2) it can accommodate both Affymetrix and Illumina data, as well as all platforms that assay CNVs quantitatively, such as array CGH; (3) it accounts for the case-control sampling, differential measurement errors and calling uncertainties; (4) it can be readily extended to other study designs and traits; (5) it formulates the effects of CNVs and SNPs on the phenotype through flexible regression models, which can accommodate various genetic mechanisms and gene-environment interactions; and (6) it allows genetic and environmental variables to be correlated. The program is fast and scalable to genomewide association scans. For example, it took about 2 hrs on a 64-bit, 3.0-GHz Intel Xeon machine to perform the analysis on chromosome 1 of the schizophrenia data (Hu et al. Submitted for publication). We are working intensely to improve the capabilities of CNVstat, so please check back frequently for updates.

COVE

COVID

DOVE

DOVE2

GAS2

GAS2 provides a Fortran-77 program to evaluate statistical significance in two-stage genomewide association studies, based on the method proposed by Lin (The American Journal of Human Genetics, 2006).

GWASelect

In the analysis of genome-wide association study (GWAS) data, virtually all existing methods, such as the Armitage trend test, focus on one SNP at a time. These univariate methods do not take full advantage of genomewide data and often lack statistical power. Furthermore, they do not naturally lead to a model that can be readily used for disease prediction. A possible way to overcome these difficulties is to conduct variable selection at the genome-wide level. GWASelect implements a novel variable selection method for GWAS data and is able to handle more than half a million SNPs. Extensive simulation studies and real data analysis show that this method enjoys high power and a low false discovery rate compared to existing variable selection methods (Bioinformatics (2010) doi: 10.1093/bioinformatics/btq600, first published online: October 29, 2010). The variables selected by GWASelect can be readily placed into a logistic regression model for disease prediction. The current release is designed for binary outcomes under the additive mode of inheritance. Further developments are underway, so please check back frequently for updates.
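
For readers unfamiliar with the single-SNP baseline mentioned above, the following is a minimal sketch of the Cochran-Armitage trend test for one SNP in a case-control sample. It is not part of GWASelect; the function name and argument layout are hypothetical and chosen only for illustration.

```python
import math


def armitage_trend_test(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test for a single SNP (illustrative sketch).

    cases, controls: counts of subjects carrying 0, 1 and 2 copies of the
    minor allele. The default scores (0, 1, 2) correspond to the additive
    mode of inheritance. Returns (chi-square statistic on 1 df, p-value).
    """
    n_cases = sum(cases)
    n_controls = sum(controls)
    n_total = n_cases + n_controls
    col = [r + s for r, s in zip(cases, controls)]  # genotype totals

    # Trend statistic: weighted contrast between cases and controls.
    t = sum(w * (r * n_controls - s * n_cases)
            for w, r, s in zip(scores, cases, controls))

    # Variance of the trend statistic under the null hypothesis.
    k = len(scores)
    spread = sum(scores[i] ** 2 * col[i] * (n_total - col[i])
                 for i in range(k))
    cross = sum(scores[i] * scores[j] * col[i] * col[j]
                for i in range(k) for j in range(i + 1, k))
    variance = n_cases * n_controls / n_total * (spread - 2 * cross)
    if variance <= 0:
        raise ValueError("no genotype variation; the test is undefined")

    chi2 = t * t / variance
    # Upper tail of a chi-square variable with 1 df, via the normal tail.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    chi2, p = armitage_trend_test(cases=[10, 20, 30], controls=[30, 20, 10])
    print(f"chi-square = {chi2:.3f}, p-value = {p:.3g}")
```

A genome-wide scan of this kind repeats the test once per SNP, which is the one-SNP-at-a-time strategy that genome-wide variable selection is meant to improve upon.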

HAPSTAT

HAPSTAT is a user-friendly software interface for the statistical analysis of haplotype-disease association. HAPSTAT allows the user to estimate or test haplotype effects and haplotype-environment interactions by maximizing the (observed-data) likelihood that properly accounts for phase uncertainty and study design. Cross-sectional, longitudinal, case-control and cohort studies are considered. The underlying methodology and a subset of the numerical algorithms used in HAPSTAT are found in Lin and Zeng (JASA, 2006), Lin, Zeng and Millikan (Genetic Epidemiology, 2005), Zeng, Lin, Avery, North and Bray (Biostatistics, 2006) and Lin, Hu and Huang (The American Journal of Human Genetics, 2008). The current version allows haplotype analysis of multiple genes as well as single- and multi-SNP analysis with missing genotypes, as described in Lin, Hu and Huang (The American Journal of Human Genetics, 2008).

HazardRatio

Hazard_Ratio is a SAS macro to generate confidence intervals for the hazard ratio in randomized clinical trials. The standard practice is the Wald method, which is based on the maximum partial likelihood estimate under the proportional hazards model and the corresponding Fisher information matrix. The resulting confidence interval may not be consistent with the log-rank test. Peto et al. (1977) provided an estimator for the hazard ratio based on the log-rank statistic, and the corresponding confidence interval is consistent with the log-rank test. However, Peto’s estimator is not consistent, and the corresponding confidence interval does not have correct coverage probability. Lin et al. (2016) proposed to construct the confidence interval by inverting the score test under the (possibly stratified) proportional hazards model. The resulting confidence interval is consistent with the log-rank test and has correct coverage probability. An added benefit of this confidence interval is that it tends to be more accurate and narrower than the Wald confidence interval.
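
To make the notion of consistency with the log-rank test concrete, the following is a minimal sketch of Peto's one-step estimator and its confidence interval, computed from the log-rank observed-minus-expected count and its variance. It does not reproduce the score-inversion interval of Lin et al. (2016) or the interface of the SAS macro; the function and argument names are hypothetical.

```python
import math
from statistics import NormalDist


def peto_hazard_ratio(o_minus_e, variance, level=0.95):
    """Peto's one-step hazard ratio estimate with its confidence interval.

    o_minus_e: observed minus expected events in the treatment arm, as
               computed for the log-rank test.
    variance:  variance of o_minus_e under the null hypothesis.
    Returns (estimate, lower limit, upper limit).
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    log_hr = o_minus_e / variance
    half_width = z / math.sqrt(variance)
    return (math.exp(log_hr),
            math.exp(log_hr - half_width),
            math.exp(log_hr + half_width))


def log_rank_rejects(o_minus_e, variance, level=0.95):
    """Two-sided log-rank test at significance level 1 - level."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return abs(o_minus_e) / math.sqrt(variance) > z


if __name__ == "__main__":
    hr, lower, upper = peto_hazard_ratio(o_minus_e=-10.0, variance=25.0)
    print(f"HR = {hr:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
    print("log-rank test rejects:", log_rank_rejects(-10.0, 25.0))
```

In this sketch the interval excludes 1 exactly when the log-rank test rejects, because both decisions depend on whether the absolute value of o_minus_e divided by the square root of variance exceeds the same normal quantile. As noted above, this agreement does not by itself give the interval correct coverage.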

IBoost

IBoost is an R function for predicting survival time using multiple types of potentially high-dimensional genomics and clinical data. It implements the I-Boost method of Wong et al. (2017) to estimate a sparse model for the survival time.

iDOVE

iDOVE is an R package that implements the methods described in Lin et al. (2021) for estimating potentially waning vaccine efficacy against SARS-CoV-2 infection.

iMODA

In multi-omics studies, genotypes are typically available for all study subjects, whereas other data types may be measured only on a subset of subjects due to cost or other constraints. In addition, quantitative omics measurements, such as metabolite levels and protein expressions, are subject to detection limits in that the measurements below (or above) certain thresholds are not detectable. iMODA is a software program that performs integrative analysis of multi-omics data with missing values and detection limits. The quantitative omics variables are related to genetic variants and other variables through linear regression models, while phenotypes are related to quantitative omics variables and other variables through generalized linear models. An EM algorithm is used to perform maximum likelihood estimation for all model parameters. The current implementation covers only continuous phenotypes. We are actively improving the capabilities of iMODA, so please check back frequently for updates.

MAOS

Data from genome-wide association studies are often analyzed jointly for the purposes of combining information from multiple studies of the same disease or comparing results across different disorders. In many instances, the same subjects appear in multiple studies. Failure to account for overlapping subjects can greatly inflate type I error when combining results from multiple studies of the same disease and can drastically reduce power when comparing results across different disorders. MAOS implements valid and efficient statistical methods for meta-analysis of genomewide association studies with overlapping subjects, as described in Lin and Sullivan (submitted for publication, 2009). The current release performs logistic regression analysis of individual-level data under the additive mode of inheritance. (Meta-analysis of summary results is much simpler to implement.) We are working intensely to improve the capabilities of MAOS, so please check back frequently for updates.

MASS

MASS is a command-line program written in C to perform meta-analysis of sequencing studies by combining the score statistics from multiple studies. It implements three types of multivariate tests that encompass all commonly used association tests for rare variants, including the simple burden test, the CMC test (Li and Leal, 2008), the weighted sum statistic (Madsen and Browning, 2009), the variable-threshold (VT) test (Price et al., 2010; Lin and Tang, 2011), the C-alpha test (Neale et al., 2011) and SKAT (Wu et al., 2011). The input file can be generated from the accompanying software SCORE-Seq and SCORE-SeqTDS. This bundle of programs allows meta-analysis of sequencing studies in a statistically accurate, numerically stable and computationally efficient manner.
(Version 3 released on March 19, 2013)
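
The idea of combining score statistics across studies can be sketched as follows. This is a minimal illustration, assuming that each study supplies a score vector and its covariance matrix for the same ordered set of variants in a gene; it does not reflect the input format or internals of MASS, and all names are hypothetical.

```python
import math


def combine_scores(studies):
    """Sum score vectors and covariance matrices over studies.

    studies: list of (U, V) pairs, where U is a list of m score statistics
             and V is the m x m covariance matrix (list of lists) from one
             study. All studies are assumed to use the same variant order.
    """
    m = len(studies[0][0])
    u_total = [0.0] * m
    v_total = [[0.0] * m for _ in range(m)]
    for u_k, v_k in studies:
        for i in range(m):
            u_total[i] += u_k[i]
            for j in range(m):
                v_total[i][j] += v_k[i][j]
    return u_total, v_total


def burden_test(u, v, weights):
    """Burden-type test from combined scores: (w'U)^2 / (w'Vw) on 1 df."""
    m = len(u)
    numerator = sum(w * s for w, s in zip(weights, u))
    variance = sum(weights[i] * v[i][j] * weights[j]
                   for i in range(m) for j in range(m))
    if variance <= 0:
        raise ValueError("w'Vw must be positive")
    chi2 = numerator ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, 1 df
    return chi2, p_value


if __name__ == "__main__":
    study_a = ([1.0, 2.0], [[2.0, 0.5], [0.5, 1.0]])
    study_b = ([0.5, 1.0], [[1.0, 0.0], [0.0, 2.0]])
    u, v = combine_scores([study_a, study_b])
    chi2, p = burden_test(u, v, weights=[1.0, 1.0])
    print(f"chi-square = {chi2:.3f}, p-value = {p:.3g}")
```

Because only the summed scores and covariances are needed, the studies never have to share individual-level data, and different choices of weights yield different burden-type tests from the same summary inputs.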

MOST

Genetic studies often contain multiple outcomes (traits) that are clinically correlated. Identifying genetic variants underlying multiple traits may help to better understand the etiology of complex diseases. Conventional univariate association tests may miss variants that have weak or moderate effects on individual traits. We propose a general framework, MOST (Multivariate Outcome Score Test), that is able to analyze all the traits jointly. Our framework is flexible in that it can handle both continuous and binary traits, or a mixture of them. In addition, it can accommodate family data, can adjust for covariates (such as ancestry variables, age and gender), and can be combined efficiently across studies with different designs. Furthermore, it has a built-in Monte Carlo procedure that can determine genome-wide significance by taking into account the LD information among all the SNPs. Our framework establishes a flexible platform for the analysis of multivariate-outcome association studies, and provides a powerful tool for uncovering pleiotropic genetic variants.
(Version 1 released on January 4, 2013)

MultiTDS

MultiTDS is a command-line software program written in C++ to implement the methods described in Tao et al. (2014) for the analysis of sequence data under multivariate trait-dependent sampling.

(Version 1.0 released on December 8, 2013)

NPMLE

We have done a considerable amount of work on nonparametric maximum likelihood estimation (NPMLE) for semiparametric transformation models with censored data. Software for much of our work can be found at http://www.bios.unc.edu/~dzeng/Transform.html.

PreMeta

PreMeta is an R program that reformats summary statistics among four meta-analysis pipelines (MASS, RAREMETAL, MetaSKAT, and seqMeta). In addition, PreMeta normalizes the score statistics from RAREMETALWORKER by the estimated residual variance. We are working intensely to improve the capabilities of PreMeta, so please check back frequently for updates.
(Version 1 released on February 02, 2015)

SCORE-Seq

SCORE-Seq is a command-line program for detecting disease associations with rare variants in sequencing studies. The mutation information is aggregated across multiple variant sites of a gene through a weighted linear combination and then related to disease phenotypes through appropriate regression models. The weights can be constant or dependent on allele frequencies and phenotypes. The association testing is based on score-type statistics. The allele-frequency thresholds can be fixed or variable. Statistical significance can be assessed by using asymptotic normal approximation or resampling. The current release handles binary and continuous traits with arbitrary covariates under case-control or cross-sectional sampling. We are working intensely to improve the capabilities of SCORE-Seq, so please check back frequently for updates.
(Version 5 released on March 20, 2013)
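
The aggregation step described above can be sketched as follows for a continuous trait without covariates. This is a minimal illustration and not the SCORE-Seq implementation: the weight 1/sqrt(p(1-p)) is only one illustrative frequency-dependent choice, the threshold is fixed rather than variable, and all function and argument names are hypothetical.

```python
import math


def burden_scores(genotypes, maf_threshold=0.05, frequency_weights=False):
    """Collapse rare-variant genotypes into one burden score per subject.

    genotypes: list of rows, one per subject; each row holds minor-allele
               counts (0, 1 or 2) at the variant sites of one gene.
    Variants with sample allele frequency above maf_threshold get weight 0.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    freqs = [sum(row[j] for row in genotypes) / (2.0 * n) for j in range(m)]
    weights = []
    for p in freqs:
        if p <= 0.0 or p > maf_threshold:
            weights.append(0.0)  # monomorphic or too common: excluded
        elif frequency_weights:
            weights.append(1.0 / math.sqrt(p * (1.0 - p)))  # illustrative
        else:
            weights.append(1.0)  # constant weights
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


def score_test(scores, trait):
    """Score test of no association between burden score and trait (1 df)."""
    n = len(trait)
    y_bar = sum(trait) / n
    s_bar = sum(scores) / n
    u = sum((y - y_bar) * s for y, s in zip(trait, scores))
    sigma2 = sum((y - y_bar) ** 2 for y in trait) / n  # null variance
    v = sigma2 * sum((s - s_bar) ** 2 for s in scores)
    if v <= 0:
        raise ValueError("no variation in burden score or trait")
    chi2 = u * u / v
    return chi2, math.erfc(math.sqrt(chi2 / 2))  # asymptotic p-value


if __name__ == "__main__":
    geno = [[0, 0, 1], [1, 0, 0], [0, 0, 2], [0, 1, 1], [0, 0, 0], [0, 0, 1]]
    y = [1.0, 3.0, 1.0, 3.0, 1.0, 1.0]
    s = burden_scores(geno, maf_threshold=0.1)
    chi2, p = score_test(s, y)
    print(f"scores = {s}, chi-square = {chi2:.3f}, p-value = {p:.3g}")
```

With so few subjects the asymptotic p-value is only indicative, which is one reason a resampling option is useful in practice.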

SCORE-SeqTDS

SCORE-SeqTDS is a command-line program which implements the score tests described in Lin et al. (2013) for analyzing primary and secondary quantitative traits in sequencing studies with trait-dependent sampling. The primary trait is the quantitative trait that is used to select subjects for sequencing, and all other traits are treated as secondary. Both the maximum likelihood estimation (MLE) and standard least-squares (LS) methods are available. The MLE method properly accounts for trait-dependent sampling whereas the LS method does not. The LS method is the ideal choice for random sampling and is approximately correct for analyzing secondary quantitative traits in case-control or case-only studies with rare diseases. We are working intensely to improve the capabilities of SCORE-SeqTDS, so please check back frequently for updates.
(Version 3 released on March 20, 2013)

SNPMStat

SNPMStat is a command-line program for the statistical analysis of SNP-disease association in case-control/cohort/cross-sectional studies with potentially missing genotype data. SNPMStat allows the user to estimate or test SNP effects and SNP-environment interactions by maximizing the (observed-data) likelihood that properly accounts for phase uncertainty, study design and gene-environment dependence. For SNPs without missing data, the program performs the standard association analysis. For typed SNPs with missing data or untyped SNPs, the program performs the maximum-likelihood analysis described in Lin, Hu and Huang (2008) and Hu, Lin and Zeng (2010). We are working intensely to improve the capabilities of SNPMStat, so please check back frequently for updates.

SPREG

SPREG is a computer program for performing regression analysis of secondary phenotype data in case-control association studies. Secondary phenotypes are quantitative or qualitative traits other than the case-control status. Because the case-control sample is not a random sample of the general population, standard statistical analysis of secondary phenotype data can yield very misleading results. SPREG implements valid and efficient statistical methods, as described in Lin DY, Zeng D. 2009, Proper analysis of secondary phenotype data in case-control association studies, Genetic Epidemiology, 33:256-265.

SQTDT/SPDT

The efficient and reliable algorithms of Diao and Lin (Genetic Epidemiology, 2006) for the semiparametric family-based tests of association are available for the Linux platform.

SQTL

We implement the algorithm of Diao and Lin (The American Journal of Human Genetics, 2005) for the semiparametric QTL mapping method in general pedigrees in a console application for the Linux platform.

SUGEN

SUGEN is a command-line software program written in C++ to implement the weighted and unweighted approaches described by Lin et al. (2014) for various types of association analysis under complex survey sampling.

SVCC

The semiparametric variance-component models for linkage and association analysis of censored trait data, as described in Diao and Lin (Genetic Epidemiology, 2006), are implemented in a Linux console application.

tagIMPUTE

tagIMPUTE is a command-line program for the imputation of untyped SNPs. tagIMPUTE is based on a few flanking SNPs that can optimally predict the SNP under imputation. For more details, see Hu and Lin (2010).

THRESHOLD

THRESHOLD is a command-line program that implements a bootstrap approach to determining the genome-wide significance threshold for association tests in sequencing studies. It currently covers single-variant tests, burden tests, and SKAT. We are working intensely to improve the capabilities of THRESHOLD, so please check back frequently for updates.