SPREG
SPREG: Regression Analysis of Secondary Phenotype Data in Case-Control Association Studies
SPREG is a computer program for performing regression analysis of secondary phenotype data in case-control association studies. Secondary phenotypes are quantitative or qualitative traits other than the case-control status. Because the case-control sample is not a random sample of the general population, standard statistical analysis of secondary phenotype data can yield very misleading results. SPREG implements valid and efficient statistical methods, as described in Lin DY, Zeng D. 2009. Proper analysis of secondary phenotype data in case-control association studies. Genetic Epidemiology, 33:256-265.
GENERAL INFORMATION
The software is written in C and built under 64-bit x86-based Linux. The current release performs linear regression analysis of quantitative traits and logistic regression analysis of binary traits under the additive mode of inheritance, with or without environmental factors.
SYNOPSIS
spreg infile outfile rate nenv
All four program parameters are required and must be entered in the same order as specified above.
PROGRAM PARAMETERS
Parameter  Description 

infile  Name of the input file 
outfile  Name of the output file 
rate  Disease rate 
nenv  Number of environmental variables; 0 or a positive integer 
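The parameter order above can be captured in a small helper. The following is a minimal Python sketch, not part of SPREG itself: the function name `build_spreg_command` is hypothetical, and the check that the disease rate lies strictly between 0 and 1 is an assumption (the documentation only calls it a "disease rate").

```python
def build_spreg_command(infile, outfile, rate, nenv, program="spreg"):
    """Return the argument list for one SPREG run, in the documented order.

    Hypothetical helper; `program` is assumed to be the name of (or path to)
    the SPREG executable.
    """
    # ASSUMPTION: the disease rate is a proportion strictly between 0 and 1.
    if not (0.0 < rate < 1.0):
        raise ValueError("rate must be strictly between 0 and 1")
    # nenv is documented as 0 or a positive integer.
    if not isinstance(nenv, int) or nenv < 0:
        raise ValueError("nenv must be 0 or a positive integer")
    # All four parameters are required, in this exact order.
    return [program, infile, outfile, str(rate), str(nenv)]


cmd = build_spreg_command("demo.dat", "demo.out", 0.08, 2)
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the program from a script.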
INPUT
The program requires one input file. The input file contains text data in a matrix format. Suppose there are n study subjects, k genes, and d environmental variables. The input file is a (k + d + 2) by n matrix, with columns representing study subjects, and rows conforming to this format:
 Row 1 contains the primary phenotype of the study subjects. This should be the case-control status (1=case; 0=control). Missing values are represented by NA.
 Row 2 contains the secondary phenotype values of the study subjects. Their values must be numerical, and can be either binary or continuous. Binary values must be coded as 0 and 1; otherwise they will be treated as continuous. Missing values are represented by NA.
 If one or more environmental variables are to be included in the analysis, they must be row 3 through row (d + 2), with each row representing one environmental variable, where d is the total number of environmental variables. Values of environmental variables must be numerical. Missing values are represented by NA.
 All subsequent rows are genotypes of genes. Suppose there are k genes while the number of environmental variables is d, noting that d can be 0. Row (d + 3) corresponds to gene 1; row (d + 4) corresponds to gene 2; … ; and row (d + k + 2) corresponds to gene k. Values of genotypes must be numerical. Genotype values other than 0, 1, and 2 are accepted. Missing values are represented by NA.
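The row layout described above can be illustrated with a short Python sketch that assembles a (k + d + 2) by n matrix and renders it as text. The function names are hypothetical, the data are made up, and the single-space delimiter is an assumption, since the documentation does not state which separator the program expects.

```python
def build_rows(status, secondary, env_vars, genotypes):
    """Stack rows in the documented order: case-control status, secondary
    phenotype, d environmental variables, then k genes."""
    rows = [list(status), list(secondary)]
    rows += [list(r) for r in env_vars]   # rows 3 .. d + 2
    rows += [list(r) for r in genotypes]  # rows d + 3 .. d + k + 2
    n = len(rows[0])
    if any(len(r) != n for r in rows):
        raise ValueError("every row needs one value per study subject")
    if any(v not in (0, 1, None) for v in rows[0]):
        raise ValueError("case-control status must be 1, 0, or missing")
    return rows


def render_matrix(rows, sep=" "):
    """Render rows as text; None stands for a missing value and becomes NA.

    ASSUMPTION: values are separated by a single space.
    """
    lines = (sep.join("NA" if v is None else str(v) for v in r) for r in rows)
    return "\n".join(lines) + "\n"


# Made-up data: n = 4 subjects, d = 1 environmental variable, k = 2 genes.
status = [1, 0, 1, 0]
secondary = [2.3, None, 1.7, 2.9]
env_vars = [[34, 51, 47, 29]]
genotypes = [[0, 1, 2, 0], [1, None, 0, 2]]

rows = build_rows(status, secondary, env_vars, genotypes)
print(render_matrix(rows), end="")
```

With these inputs the matrix has 2 + 1 + 2 = 5 rows and 4 columns, and the run would use nenv = 1.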
If the disease is rare, enter any number less than 0.01 for the disease rate.
OUTPUT
Computational results are written to the output file specified by the user. For each gene, the output shows the maximum likelihood estimate of the genetic effect (i.e., slope parameter in the linear model or log odds ratio in the logistic model), its standard error, the standard-normal test statistic, and the (two-sided) p-value.
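The relationship among these four quantities can be sketched in Python. This assumes the test statistic is the estimate divided by its standard error (a Wald-type statistic), which the documentation does not state explicitly; the function name `wald_test` and the input values are hypothetical.

```python
import math


def wald_test(estimate, std_error):
    """Return (z, p): a standard-normal statistic and its two-sided p-value.

    ASSUMPTION: z = estimate / std_error.
    """
    z = estimate / std_error
    # Two-sided tail probability of N(0, 1): P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical values, not taken from any SPREG run.
z, p = wald_test(0.5, 0.25)
print(f"Z = {z:.3f}, p = {p:.4f}")
```

Using `math.erfc` avoids any dependency beyond the standard library and stays accurate for small p-values.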
EXAMPLE
The example files, which can be downloaded in the DOWNLOAD section below, include an input file demo.dat and an output file demo.out. The input file demo.dat contains the case-control status of 3000 individuals, a continuous secondary phenotype, two environmental variables, and genotypes of 10 genes. The disease rate is 0.08.
Enter the command
spreg demo.dat demo.out 0.08 2
to obtain the output file as given in demo.out. Its contents are
Gene_number Estimate Std_Error Z_stat p_value
1 5.203e-03 8.509e-03 6.114e-01 5.409e-01
2 1.275e-03 5.514e-03 2.312e-01 8.172e-01
3 3.067e-03 5.713e-03 5.368e-01 5.914e-01
4 8.081e-04 5.651e-03 1.430e-01 8.863e-01
5 1.198e-04 8.013e-03 1.495e-02 9.881e-01
6 2.148e-03 6.575e-03 3.267e-01 7.439e-01
7 2.517e-03 7.245e-03 3.474e-01 7.283e-01
8 2.308e-03 5.975e-03 3.864e-01 6.992e-01
9 2.384e-02 6.004e-03 3.970e+00 7.178e-05
10 1.665e-02 6.257e-03 2.661e+00 7.781e-03
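An output file in this layout can be read back for further processing. The Python sketch below assumes whitespace-separated columns with the header line shown above; the function name `parse_spreg_output` is hypothetical, and the sample text uses made-up values rather than real SPREG results.

```python
def parse_spreg_output(text):
    """Parse SPREG-style output into a list of dicts, one per gene.

    ASSUMPTION: the first non-empty line is the header and all columns
    are separated by whitespace.
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, body = lines[0], lines[1:]
    results = []
    for fields in body:
        if len(fields) != len(header):
            raise ValueError(f"unexpected number of columns: {fields}")
        record = {header[0]: int(fields[0])}  # gene number is an integer
        for name, value in zip(header[1:], fields[1:]):
            record[name] = float(value)       # remaining columns are numeric
        results.append(record)
    return results


# Made-up sample in the same layout as the output file.
sample = """Gene_number Estimate Std_Error Z_stat p_value
1 0.5 0.25 2.0 0.0455
2 0.1 0.2 0.5 0.6171
"""

records = parse_spreg_output(sample)
significant = [r["Gene_number"] for r in records if r["p_value"] < 0.05]
print(significant)
```

For a real run, the text would come from reading the output file, for example `open("demo.out").read()`.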
DOWNLOAD
SPREG 2.0 for 64-bit x86-based Linux [updated May 18, 2011]
Example files [updated May 18, 2011]
REFERENCE
Lin DY, Zeng D. 2009. Proper analysis of secondary phenotype data in case-control association studies. Genetic Epidemiology, 33:256-265.
VERSION HISTORY
Version  Date  Description 

1.0  April 22, 2008  First version released 
2.0  May 18, 2011 
