SNPMStat
SNPMStat v4.0 : Statistical Analysis of SNP-Disease Association with Missing Genotype Data
SNPMStat is a command-line program for the statistical analysis of SNP-disease association in case-control/cohort/cross-sectional studies with potentially missing genotype data. SNPMStat allows the user to estimate or test SNP effects and SNP-environment interactions by maximizing the (observed-data) likelihood that properly accounts for phase uncertainty, study design and gene-environment dependence. For SNPs without missing data, the program performs the standard association analysis. For typed SNPs with missing data or untyped SNPs, the program performs the maximum-likelihood analysis described in Lin, Hu and Huang (2008) and Hu, Lin and Zeng (2010). We are working intensely to improve the capabilities of SNPMStat, so please check back frequently for updates.
SYNOPSIS
SNPMStat [-sfile specfile] [-pfile phenofile] [-gfile genofile] [-efile exterfile] [-h] [-hfile haplofile] [-ofile outfile] [-no_fix] [-no_remove] [-speed] [-ne] [-tag #tags] [-window winsize] [-max #SNPs] [-min #SNPs]
OPTIONS
Option | Parameter | Default | Description |
---|---|---|---|
-sfile | {specfile} | specification.txt | Specify the specification file |
-pfile | {phenofile} | phenotype.dat | Specify the phenotype file |
-gfile | {genofile} | genotype.dat | Specify the genotype file |
-efile | {exterfile} | external.dat | Specify the external file |
-h | | No haplotype file | Use the haplotype file |
-hfile | {haplofile} | haplotype.dat | Specify the haplotype file |
-ofile | {outfile} | results | Specify the output file. Unlike the other file names above, this one should be given without any suffix. |
-no_fix | | Perform internal check | Turn off internal check |
-no_remove | | Remove SNPs that cannot be aligned | Turn off default removal of SNPs that cannot be aligned |
-speed | | | Speed up the selection of tag SNPs at the cost of skipping SNPs that are not phased |
-ne | | Use external panel | No external file |
-tag | {#tags} | 4 | Specify the number of tag SNPs used to impute the SNP of interest. |
-window | {win size} | 50,000 (bp) | Specify the maximum distance to the untyped SNP within which typed SNPs are identified as candidate tags. |
-max | {#SNPs} | 20 | Specify the maximum number of typed SNPs identified as candidate tags |
-min | {#SNPs} | 8 | Specify the minimum number of typed SNPs identified as candidate tags |
By default, SNPMStat analyzes both typed SNPs in a study with potentially missing data and untyped SNPs that are on an external panel. For each SNP of interest, SNPMStat first identifies candidate tags within a distance (-window) and then finds the predefined number of SNPs (-tag) that yields the largest MD measure (Nicolae, 2006, Genetic Epidemiology, 30, 703-717). If the number of candidate tags within that distance is less than the minimum (-min), the distance is enlarged until the minimum number is met. If the number of candidate tags within the distance exceeds the maximum (-max), only the closest SNPs, up to the maximum number, are considered as candidate tags.
SNPMStat performs an internal check to see whether the strand alignment between the study and the external panel can be determined from (a) the allele labels (at SNPs that are not A/T or G/C), and (b) the allele frequencies (at A/T and G/C SNPs). SNPs that cannot be aligned are removed from the data. The internal check and the removal can be turned off with -no_fix and -no_remove, respectively. Phased haplotypes can be supplied with -h to facilitate the selection of tag SNPs. The -speed flag can further speed up the selection process at the cost of skipping SNPs that are not phased. If only typed SNPs are of interest, the use of the external panel can be suppressed with -ne.
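The candidate-tag rule described above can be sketched in a few lines of Python. This is only an illustration of the window/minimum/maximum logic, not SNPMStat's actual implementation; the function name `candidate_tags` and its arguments are hypothetical, and the subsequent choice of the best tags by the MD measure is not shown.

```python
def candidate_tags(target_pos, typed_positions, window=50_000, min_snps=8, max_snps=20):
    """Return positions of typed SNPs chosen as candidate tags for one SNP.

    Hypothetical helper: the defaults mirror -window, -min and -max.
    """
    # Rank typed SNPs by distance to the SNP of interest (closest first).
    ranked = sorted(typed_positions, key=lambda pos: abs(pos - target_pos))
    in_window = [pos for pos in ranked if abs(pos - target_pos) <= window]

    if len(in_window) < min_snps:
        # Enlarging the distance until the minimum is met amounts to taking
        # the closest min_snps typed SNPs (or all of them if fewer exist).
        return ranked[:min_snps]
    # If the window holds more than max_snps, keep only the closest ones.
    return in_window[:max_snps]
```

For example, with typed SNPs at positions 100, 200, 300 and 400 and a SNP of interest at position 250, a window of 100 bp yields the two SNPs at 200 and 300 as candidates.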
INPUT FILES
specification file
DESIGN = cohort |
---|
CATEGORICAL = smk_status |
DEPENDENT = smk_status CPD |
PANEL = 30 0 0 |
MODE = additive |
EFFECT = G CPD G*CPD |
OUTPUT = G G*CPD |
The specification file describes the features of the study and variables, and specifies the disease risk model required for the analysis. The syntax is
KEYWORD = value1 [value2 …]
with spaces around “=”. KEYWORD with an empty value, i.e., “KEYWORD =”, is not allowed.
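As a rough illustration of this syntax, the following Python sketch reads specification lines into a dictionary and rejects lines that lack spaces around “=” or that have an empty value. The function `parse_specification` is a hypothetical helper written for this page, not part of SNPMStat, and it does not validate keyword names or values.

```python
def parse_specification(text):
    """Parse 'KEYWORD = value1 [value2 ...]' lines into {keyword: [values]}."""
    spec = {}
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    for number, line in enumerate(lines, start=1):
        tokens = line.split()
        # Spaces around "=" are required, so "=" must be its own second token.
        if len(tokens) < 2 or tokens[1] != "=":
            raise ValueError(f"line {number}: expected 'KEYWORD = value ...'")
        # "KEYWORD =" with nothing after it is not allowed.
        if len(tokens) == 2:
            raise ValueError(f"line {number}: {tokens[0]} has an empty value")
        spec[tokens[0]] = tokens[2:]
    return spec
```

Applied to the example above, the line `EFFECT = G CPD G*CPD` would be stored as the list `["G", "CPD", "G*CPD"]` under the key `EFFECT`.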
DESIGN =
case-control/cohort/cross-sectional
Specify the study design. Required on the first line of the specification file.
CATEGORICAL =
{covariate names in the phenotype file}
Specify covariates that are categorical (more than two levels). A categorical covariate is transformed into (number of levels - 1) indicators with the lowest level as the reference. For example, if smoke has values 1, 2, 3, it will be transformed into two indicators I(smoke=2) and I(smoke=3) with names “smoke(2)” and “smoke(3)”. Unspecified covariates are assumed to be continuous by default. Optional.
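The indicator coding can be illustrated with a short Python sketch. It only mirrors the transformation described here; the function `expand_categorical` is hypothetical, and the handling of missing values is left out.

```python
def expand_categorical(name, values):
    """Expand a categorical covariate into (number of levels - 1) indicators.

    The lowest level is the reference and therefore gets no indicator.
    """
    levels = sorted(set(values))
    return {
        f"{name}({level})": [1 if value == level else 0 for value in values]
        for level in levels[1:]  # skip the reference (lowest) level
    }
```

For a covariate smoke with observed values 1, 2, 3, 2, this returns the two columns “smoke(2)” = 0, 1, 0, 1 and “smoke(3)” = 0, 0, 1, 0.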
DEPENDENT =
{covariate names in the phenotype file}
Specify covariates that are potentially correlated with haplotypes. Unspecified covariates are assumed to be independent of haplotypes by default. Optional.
PANEL =
{#trios #duos #singletons}
Specify the numbers of trios, duos and singletons in the external panel, respectively. Required.
MODE =
additive/recessive/dominant/codominant
Specify the mode of inheritance. Default is additive mode. Optional.
EFFECT =
{main effects and interactions}
Specify the main effects and interactions considered in the disease risk model. In particular, the SNP effect is designated by ‘G’. Interactions between SNP and covariates are indicated by ‘*’ with no space on either side. Required.
OUTPUT =
{main effects and interactions}
Specify the main effects and interactions whose estimation and testing results are to be output. These effects should be a subset of those in EFFECT. Each effect is written to a separate file, whose name is the name specified in -ofile followed by “_effectname.out”. Note that the “*” sign in interactions is replaced by “$” so that the file name is valid. Specifying a categorical covariate produces multiple files, one for each of its derived indicators. Specifying a codominant SNP effect produces two files corresponding to the two genotypes. Required.
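The naming rule can be sketched as follows. The helper `output_files` is hypothetical and covers only the basic rule (prefix, underscore, effect name with “*” replaced by “$”, and the “.out” suffix); the extra files produced for categorical covariates and codominant SNP effects are not modeled.

```python
def output_files(prefix, effects, outputs):
    """Map each effect listed in OUTPUT to the name of its result file."""
    # Effects listed in OUTPUT must also appear in EFFECT.
    unknown = [effect for effect in outputs if effect not in effects]
    if unknown:
        raise ValueError(f"OUTPUT effects not listed in EFFECT: {unknown}")
    # "*" is replaced by "$" in the file name.
    return {effect: f"{prefix}_{effect.replace('*', '$')}.out" for effect in outputs}
```

With the default prefix `results` and the specification file shown earlier, this gives `results_G.out` and `results_G$CPD.out`.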
phenotype file
Y | del | age | smk_status | CPD |
---|---|---|---|---|
32 | 0 | 26 | 0 | -0.635 |
31 | 0 | 32 | 0 | -0.635 |
36 | 1 | 31 | 1 | -0.635 |
… | … | … | … | … |
The phenotype file provides information on the disease and covariates of the study subjects in a tabular (row-column) format. Each row contains space- or tab-delimited data specific to an individual. Variable names should be specified in the first line of the file. The disease variable should be listed first and can be followed by an arbitrary number of covariates (or no covariate). In a case-control study, the disease variable should be coded 0/1 to represent unaffected/affected. In a cohort study, which has two disease variables, the time variable should be listed first and the disease indicator second. Missing disease variables or covariates are denoted by ‘.’.
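A minimal Python sketch of reading this format is given below, assuming that all variables are numeric. The function `read_phenotypes` is a hypothetical helper, not part of SNPMStat; it stores missing entries (‘.’) as `None`.

```python
def read_phenotypes(text):
    """Read a phenotype table into {variable name: list of values}."""
    # str.split() with no argument handles both spaces and tabs.
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, records = rows[0], rows[1:]
    columns = {name: [] for name in header}
    for number, record in enumerate(records, start=2):
        if len(record) != len(header):
            raise ValueError(f"line {number}: expected {len(header)} fields")
        for name, field in zip(header, record):
            # "." denotes a missing disease variable or covariate.
            columns[name].append(None if field == "." else float(field))
    return columns
```

For the cohort example above, the first two columns (`Y` and `del`) would hold the time variable and the disease indicator.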
genotype file
rs16977020 | 54706569 | 0 | A C | 1 2 2 2 2 1 2 2 0 2 1 1 1 2 2 … |
---|---|---|---|---|
rs12903336 | 54715530 | 0 | A G | 0 1 2 2 1 1 1 1 0 2 1 1 1 0 2 … |
rs28678122 | 54743606 | 0 | A C | 1 1 2 2 1 2 1 1 2 2 2 2 2 0 2 … |
… | … | … | … | … |
The genotype file provides genotype information for the study subjects in a tabular (row-column) format. Each row contains space or tab delimited data specific to a SNP. The columns follow the format
SNP_id position strand_orientation nucleo1 nucleo2 geno_1 … geno_n
If the strand orientation information is not available, all strand_orientation fields should be set to 0. If this information is available, flag 1 in the field indicates that the strand orientation in the study data is different from that in the external panel (so the allele coding of the external panel will be switched by the program), and flag 0 indicates strand consistency. In particular, if all the genotypes in the external panel are on the forward strand, then flag 1 means that the SNP in the study was recorded on the reverse strand. The strand orientation information is only required for C/G and A/T SNPs. For all other types of SNPs, this field can be left as 0. The nucleo1 and nucleo2 fields give the nucleotides of the SNP and should be in alphabetical order. The genotypes are coded as 0, 1 and 2, referring to the count of nucleo1. Missing genotypes should be coded as 9.
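The row layout can be illustrated with a small Python sketch. All three functions are hypothetical helpers written for this page: one splits a row into its fields, one computes the nucleo1 frequency among non-missing genotypes, and one flags the C/G and A/T SNPs for which the strand_orientation field matters.

```python
def parse_genotype_line(line):
    """Split one genotype row into its documented fields."""
    fields = line.split()
    return {
        "snp_id": fields[0],
        "position": int(fields[1]),
        "strand_orientation": int(fields[2]),
        "nucleo1": fields[3],
        "nucleo2": fields[4],
        # 9 marks a missing genotype; 0, 1, 2 count copies of nucleo1.
        "genotypes": [None if g == "9" else int(g) for g in fields[5:]],
    }


def nucleo1_frequency(genotypes):
    """Frequency of nucleo1 among non-missing genotypes (None if all missing)."""
    observed = [g for g in genotypes if g is not None]
    if not observed:
        return None
    return sum(observed) / (2 * len(observed))


def needs_strand_flag(nucleo1, nucleo2):
    """True for A/T and C/G SNPs, whose labels look the same on both strands."""
    return {nucleo1, nucleo2} in ({"A", "T"}, {"C", "G"})
```

For instance, a row with genotypes `1 2 9 0` has three observed genotypes carrying three copies of nucleo1 out of six alleles, so the frequency is 0.5.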
external genotype file
rs4774891 | 54807077 | C T | 1 1 2 2 1 2 2 2 2 1 2 2 1 2 2 2 … |
---|---|---|---|
rs8025391 | 54808154 | A T | 1 1 2 0 1 1 2 1 1 0 2 1 1 1 1 2 … |
rs10518872 | 54809475 | G T | 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 … |
… | … | … | … |
The external genotype file follows the same format as the genotype file except that the strand_orientation column is absent. Positions should be in the same ascending or descending order as in the study genotype file. Trio data should be entered first, followed by duos and unrelated individuals. Trios and duos should be entered in family blocks. Within each trio, the child genotype is entered last.
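The column layout can be sketched in Python as below. The sketch assumes that the counts given in PANEL describe this file and that each trio occupies three consecutive genotype columns and each duo two; the function `split_panel` is a hypothetical helper, not part of SNPMStat.

```python
def split_panel(genotypes, n_trios, n_duos, n_singletons):
    """Split one SNP's external genotypes into trio, duo and singleton blocks."""
    expected = 3 * n_trios + 2 * n_duos + n_singletons
    if len(genotypes) != expected:
        raise ValueError(f"expected {expected} genotypes, got {len(genotypes)}")
    trio_end = 3 * n_trios
    duo_end = trio_end + 2 * n_duos
    # Trios come first, in family blocks of three (child entered last).
    trios = [genotypes[i:i + 3] for i in range(0, trio_end, 3)]
    # Duos follow, in family blocks of two.
    duos = [genotypes[i:i + 2] for i in range(trio_end, duo_end, 2)]
    # Unrelated individuals take the remaining columns.
    singletons = genotypes[duo_end:]
    return trios, duos, singletons
```

Under these assumptions, a panel declared as `PANEL = 30 0 0` would have 90 genotype columns per SNP.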
haplotype file
rs4774891 | 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 … |
---|---|
rs8025391 | 0 0 1 0 0 1 1 0 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 … |
rs10518872 | 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 … |
… | … |
The haplotype file for the external panel can be incorporated by using the flag -h. The file format is
SNP_id phase1_1 phase2_1 phase1_2 phase2_2 …phase1_n phase2_n
SNP_id should be in the same order as in the external genotype file. Each subject contributes two columns (phase1_i phase2_i, i=1, …, n) with 0/1 coding, referring to the count of nucleo1 as in the external genotype file. Typically, only phasing information on founders (mothers and fathers) is provided for trios.
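As an illustration of this layout, the following Python sketch pairs up the phase columns of one row and recovers each subject's count of nucleo1 by summing the two phases. Both functions are hypothetical helpers, not part of SNPMStat.

```python
def parse_haplotype_line(line):
    """Return (SNP_id, [(phase1_i, phase2_i), ...]) for one haplotype row."""
    fields = line.split()
    snp_id, alleles = fields[0], [int(a) for a in fields[1:]]
    if len(alleles) % 2 != 0:
        raise ValueError(f"{snp_id}: each subject needs two phase columns")
    # Consecutive columns belong to the same subject.
    pairs = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return snp_id, pairs


def genotypes_from_phases(pairs):
    """Each phase is a 0/1 count of nucleo1, so the genotype is their sum."""
    return [phase1 + phase2 for phase1, phase2 in pairs]
```

For the row `rs1 1 1 1 0 0 1`, the three subjects have phase pairs (1, 1), (1, 0) and (0, 1), which correspond to genotypes 2, 1 and 1.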
OUTPUT
Each output file has the format
Typed SNP-id Position 1/0 MD Freq Estimate StdErr Z-Stat p-Value
For each untyped SNP, Typed is set to be “no”, 1/0 indicates the nucleotides coded as 1 and 0, MD is the MD measure between the SNP and the set of typed SNPs with the best prediction, Freq is the frequency of 1-coded allele in the external panel. Any untyped SNP with allele frequency 0.0 or 1.0 in the external panel is excluded from the analysis.
For each genotyped SNP, Typed is the proportion of non-missing genotypes, MD is set to be ‘-’, Freq is the frequency of 1-coded allele in the study. Any SNP with allele frequency 0.0 or 1.0 in the study is excluded from the analysis. The results for alleles with very low minor-allele frequencies may not be stable and should be viewed with great caution, especially for untyped SNPs or typed SNPs with substantial missingness.
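The last four columns can be related to one another with a short Python sketch. It assumes that Z-Stat is the Wald statistic Estimate/StdErr and that p-Value is the two-sided p-value under a standard normal reference distribution; this page does not state how the program computes these columns, so the sketch is only a plausibility check, and the function `wald_test` is hypothetical.

```python
import math


def wald_test(estimate, std_err):
    """Return (z, two-sided p-value) under a standard normal reference."""
    z = estimate / std_err
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, an estimate of 0.5 with a standard error of 0.25 gives z = 2 and a two-sided p-value of about 0.0455.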
EXAMPLE
The example includes a specification file “GAWspec.txt”, a phenotype file “GAWpheno.dat”, a genotype file “GAWgeno.dat”, an external file “GAWexter.dat”, and a haplotype file “GAWhaplo.dat”.
Enter the command
$ SNPMStat -sfile GAWspec.txt -pfile GAWpheno.dat -gfile GAWgeno.dat -efile GAWexter.dat -h -hfile GAWhaplo.dat -speed -ofile GAW
to obtain the results given in “GAW_G.out” and “GAW_CPD$G.out”.
REFERENCE
Hu, Y. J., Lin, D. Y. and Zeng, D. (2010), “A General Framework for Studying Genetic Effects and Gene-Environment Interactions with Missing Data”, Biostatistics, in press.
Lin, D. Y., Hu, Y. and Huang, B. E. (2008), “Simple and Efficient Analysis of SNP-Disease Association with Missing Genotype Data”, American Journal of Human Genetics, 82, 444-452.
DOWNLOAD
SNPMStat for Linux [updated July 13 2010]
Example files [updated July 13 2010]
VERSION HISTORY
Version | Date | Description |
---|---|---|
1.0 | Oct. 2007 | First version released |
1.1 | May 8, 2008 | Bug fix |
2.0 | Jul. 9, 2008 | New features |
2.1 | Sep. 29, 2008 | Bug fix |
3.0 | Oct. 14, 2008 | |
3.1 | Dec. 17, 2008 | |
3.2 | Feb. 17, 2009 | |
4.0 | Jul. 13, 2010 | |