Requirements

This software was developed and tested on 64-bit Linux (CentOS 6.5).
Running it requires gSearch, gcc (as the C compiler), Python, and R.
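
As a quick pre-flight check, the dependencies listed above can be looked up on PATH before running anything. This is only a sketch: the executable names below are assumptions (the gSearch binary in particular may be installed under a different name), and `missing_tools` is a hypothetical helper, not part of LR Filter.

```python
import shutil


def missing_tools(names):
    """Return the entries of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumed executable names; adjust them to match your installation.
    required = ["gcc", "python", "Rscript", "gSearch"]
    missing = missing_tools(required)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```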

Program

Name | Description | Download
LR Filter | A set of shell scripts, C source code, and Python programs for logistic regression based filtering | Download

Annotation Database Files

All database files were originally downloaded from the UCSC Table Browser and converted to appropriate formats for LR Filter.

Name | Description | Download
RepeatMasker track | BED file for the RepeatMasker track | Download
dbSNP131 (SNV) | dbSNP (Build ID: 131) database for SNVs in Genome Variation Format (GVF) | Download
dbSNP131 (INS) | dbSNP (Build ID: 131) database for insertions in GVF | Download
dbSNP131 (DEL) | dbSNP (Build ID: 131) database for deletions in GVF | Download
RefSeq genes | A RefSeq gene model database in gSearch Format (GSF) | Download

Sample Files for trainingModule

We provide a sample training variant call file, the matching set of gold standard variants, and the logistic regression filter and summary built from them.

Name | Description | Download
12877_ill_hg19.gvf | An example training variant call file in GVF for NA12877 (an individual of CEPH/Utah Pedigree 1463) prepared using the Illumina CASAVA pipeline | Download
12877_ill_hg19_gs.gvf | An example set of gold standard variants for NA12877, consisting of the variants concordantly called by both Complete Genomics and Illumina platforms | Download
12877_ill.gvf.filter | The logistic regression filter built using the training and gold standard variant call files in GVF for NA12877 | Download
12877_ill.gvf.summary | Summary of the logistic regression filter built using the training and gold standard variant call files in GVF for NA12877 | Download

Training variant call file (12877_ill_hg19.gvf)

...
chr1    VCF     SNV     28863   28863   9       +       .       ID=36;Reference_seq=C;Variant_seq=A;Genotype=heterozygous;gene_component_detail=Ensembl|WASH7P|ENST00000423562|intron,Ensembl|WASH7P|ENST00000438504|intron,Ensembl|WASH7P|ENST00000488147|intron,Ensembl|WASH7P|ENST00000538476|intron,Ensembl|WASH7P|ENST00000430492|intron,UCSC|WASH7P|uc001aah.4|intron,UCSC|WASH7P|uc009vir.3|intron,UCSC|WASH7P|uc009viq.3|intron,UCSC|WASH7P|uc001aac.4|intron,UCSC|WASH7P|uc009viv.2|intron,UCSC|WASH7P|uc009viw.2|intron,UCSC|WASH7P|uc009vix.2|intron,UCSC|WASH7P|uc009viy.2|intron,UCSC|WASH7P|uc009viz.2|intron,UCSC|WASH7P|uc010nxs.1|intron,UCSC|WASH7P|uc009vjb.1|intron,UCSC|WASH7P|uc009vje.2|intron,UCSC|WASH7P|uc009vjf.2|intron,RefSeq|WASH7P|NR_024540|intron;gene_component_overlap=1;
chr1    VCF     SNV     30923   30923   56      +       .       ID=37;Reference_seq=G;Variant_seq=T;Genotype=homozygous;gene_component_detail=Ensembl|MIR1302-11|ENST00000473358|intron,Ensembl|MIR1302-11|ENST00000469289|intron;gene_component_overlap=1;
chr1    VCF     SNV     49298   49298   68      +       .       ID=39;Reference_seq=T;Variant_seq=C;Genotype=homozygous;
chr1    VCF     SNV     51459   51459   5       +       .       ID=40;Reference_seq=G;Variant_seq=A;Genotype=heterozygous;
chr1    VCF     SNV     51476   51476   5       +       .       ID=41;Reference_seq=T;Variant_seq=C;Genotype=heterozygous;
chr1    VCF     SNV     51928   51928   24      +       .       ID=42;Reference_seq=G;Variant_seq=A;Genotype=heterozygous;
chr1    VCF     SNV     52238   52238   106     +       .       ID=43;Reference_seq=T;Variant_seq=G;Genotype=homozygous;
chr1    VCF     SNV     54676   54676   19      +       .       ID=46;Reference_seq=C;Variant_seq=T;Genotype=heterozygous;
chr1    VCF     SNV     54708   54708   16      +       .       ID=47;Reference_seq=G;Variant_seq=C;Genotype=heterozygous;
chr1    VCF     SNV     54716   54716   18      +       .       ID=48;Reference_seq=C;Variant_seq=T;Genotype=heterozygous;
chr1    VCF     SNV     54844   54844   19      +       .       ID=49;Reference_seq=G;Variant_seq=A;Genotype=homozygous;
chr1    VCF     SNV     55164   55164   102     +       .       ID=50;Reference_seq=C;Variant_seq=A;Genotype=homozygous;
chr1    VCF     SNV     58211   58211   26      +       .       ID=51;Reference_seq=A;Variant_seq=G;Genotype=homozygous;
chr1    VCF     SNV     61442   61442   36      +       .       ID=53;Reference_seq=A;Variant_seq=G;Genotype=homozygous;
...
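
Each record in the excerpt above has nine whitespace-separated columns, the last of which is a `;`-separated list of `key=value` attributes. A minimal parsing sketch follows. `GvfRecord` and `parse_gvf_line` are hypothetical names, and the column meanings follow the general GVF/GFF3 layout rather than anything specific to LR Filter.

```python
from typing import Dict, NamedTuple, Optional


class GvfRecord(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gvf_line(line):
    """Parse one GVF record line into a GvfRecord."""
    # Split on any whitespace, keeping the attribute column in one piece.
    cols = line.rstrip("\n").split(None, 8)
    if len(cols) != 9:
        raise ValueError("expected 9 columns, got %d" % len(cols))
    attributes = {}
    for item in cols[8].split(";"):
        if item:  # skip the empty piece after the trailing ';'
            key, _, value = item.partition("=")
            attributes[key] = value
    score = None if cols[5] == "." else float(cols[5])
    return GvfRecord(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                     score, cols[6], cols[7], attributes)


if __name__ == "__main__":
    line = ("chr1    VCF     SNV     49298   49298   68      +       .       "
            "ID=39;Reference_seq=T;Variant_seq=C;Genotype=homozygous;")
    record = parse_gvf_line(line)
    print(record.seqid, record.start, record.score, record.attributes["Genotype"])
```

The simple `;` split suits input files like this one. Attribute values that themselves contain `;` or `=` would need a more careful parser.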
		

Gold standard variant set (12877_ill_hg19_gs.gvf)

...
chr1    VCF     SNV     52238   52238   106     +       .       ID=43;Reference_seq=T;Variant_seq=G;Genotype=homozygous;
chr1    VCF     SNV     55164   55164   102     +       .       ID=50;Reference_seq=C;Variant_seq=A;Genotype=homozygous;
chr1    VCF     SNV     58211   58211   26      +       .       ID=51;Reference_seq=A;Variant_seq=G;Genotype=homozygous;
chr1    VCF     SNV     61442   61442   36      +       .       ID=53;Reference_seq=A;Variant_seq=G;Genotype=homozygous;
chr1    VCF     SNV     69511   69511   45      +       .       ID=60;Reference_seq=A;Variant_seq=G;Genotype=homozygous;gene_component_detail=Ensembl|OR4F5|ENST00000335137|CDS,CCDS|OR4F5|CCDS30547.1|CDS,UCSC|OR4F5|uc001aal.1|CDS,RefSeq|OR4F5|NM_001005484|CDS;gene_component_overlap=1;
chr1    VCF     SNV     128798  128798  120     +       .       ID=112;Reference_seq=C;Variant_seq=T;Genotype=homozygous;gene_component_detail=Ensembl|RP11-34P13.7|ENST00000477740|intron,Ensembl|RP11-34P13.7|ENST00000471248|intron;gene_component_overlap=1;
chr1    VCF     SNV     548491  548491  38      +       .       ID=218;Reference_seq=C;Variant_seq=T;Genotype=homozygous;gene_component_detail=Ensembl|RP5-857K21.4|ENST00000440200|intron;gene_component_overlap=1.0000;
chr1    VCF     deletion        567240  567240  770     +       .       ID=24;Reference_seq=G;Variant_seq=-;Genotype=homozygous;gene_component_detail=Ensembl|RP5-857K21.4|ENST00000440200|intron,Ensembl|RP5-857K21.6|ENST00000414273|CDS;gene_component_overlap=1;
chr1    VCF     deletion        688055  688055  90      +       .       ID=27;Reference_seq=A;Variant_seq=-;Genotype=homozygous;
chr1    VCF     SNV     704367  704367  45      +       .       ID=326;Reference_seq=T;Variant_seq=C;Genotype=homozygous;gene_component_detail=Ensembl|RP11-206L10.2|ENST00000428504|intron,UCSC|LOC100288069|uc001abo.3|intron,RefSeq|LOC100288069|NR_033908|intron;gene_component_overlap=1;
...
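
One plausible way to turn the two files into training labels is to mark each training call as 1 if it also appears in the gold standard set and 0 otherwise. The sketch below assumes that calls are matched on chromosome, start, end, reference allele, and variant allele. That matching key, and the helper names, are assumptions for illustration and are not taken from the LR Filter source.

```python
import re


def variant_key(line):
    """Build a matching key from one GVF record line (assumed key fields)."""
    cols = line.split(None, 8)
    ref = re.search(r"Reference_seq=([^;]*)", cols[8]).group(1)
    alt = re.search(r"Variant_seq=([^;]*)", cols[8]).group(1)
    return (cols[0], cols[3], cols[4], ref, alt)


def label_variants(training_lines, gold_lines):
    """Return 1 for each training call found in the gold set, else 0."""
    gold = {variant_key(line) for line in gold_lines}
    return [1 if variant_key(line) in gold else 0 for line in training_lines]


if __name__ == "__main__":
    # Two training records and one gold standard record from the excerpts.
    training = [
        "chr1 VCF SNV 51928 51928 24 + . "
        "ID=42;Reference_seq=G;Variant_seq=A;Genotype=heterozygous;",
        "chr1 VCF SNV 52238 52238 106 + . "
        "ID=43;Reference_seq=T;Variant_seq=G;Genotype=homozygous;",
    ]
    gold = [
        "chr1 VCF SNV 52238 52238 106 + . "
        "ID=43;Reference_seq=T;Variant_seq=G;Genotype=homozygous;",
    ]
    print(label_variants(training, gold))
```

The labels here are relative to the one-record gold list used in the example, not to the full gold standard file.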
		

Logistic regression filter (12877_ill.gvf.filter)

.SNV.hetero
  (Intercept) quality_score          rmsk         dbsnp    gene_model
  -8.36093490    1.80831214   -0.07303558    1.11260619   10.84397095
 SNV_typeA->G  SNV_typeA->T  SNV_typeC->A  SNV_typeC->G  SNV_typeC->T
   0.26527300   -0.22657708    0.07260413    0.13455575    0.36617201
 SNV_typeG->A  SNV_typeG->C  SNV_typeG->T  SNV_typeT->A  SNV_typeT->C
   0.37280697    0.13363085    0.07604954   -0.26103353    0.24608697
 SNV_typeT->G
  -0.00743212
  ...
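
The filter file excerpt looks like printed R named vectors: a section header such as `.SNV.hetero`, followed by alternating lines of coefficient names and values. Assuming that layout holds for the whole file (it is inferred from the excerpt only), a small reader could look like this. `parse_filter` is a hypothetical helper.

```python
SAMPLE = """\
.SNV.hetero
  (Intercept) quality_score          rmsk         dbsnp    gene_model
  -8.36093490    1.80831214   -0.07303558    1.11260619   10.84397095
 SNV_typeA->G  SNV_typeA->T  SNV_typeC->A  SNV_typeC->G  SNV_typeC->T
   0.26527300   -0.22657708    0.07260413    0.13455575    0.36617201
 SNV_typeG->A  SNV_typeG->C  SNV_typeG->T  SNV_typeT->A  SNV_typeT->C
   0.37280697    0.13363085    0.07604954   -0.26103353    0.24608697
 SNV_typeT->G
  -0.00743212
"""


def parse_filter(text):
    """Parse the filter text into {section: {coefficient_name: value}}."""
    models = {}
    current = None
    pending_names = None
    for raw in text.splitlines():
        tokens = raw.split()
        if not tokens or tokens == ["..."]:
            continue  # blank line or elision marker
        if pending_names is None and len(tokens) == 1 and tokens[0].startswith("."):
            current = models.setdefault(tokens[0], {})  # new section header
            continue
        if current is None:
            raise ValueError("coefficients found before any section header")
        if pending_names is None:
            pending_names = tokens  # a line of coefficient names
        else:
            if len(tokens) != len(pending_names):
                raise ValueError("name/value count mismatch")
            for name, value in zip(pending_names, tokens):
                current[name] = float(value)
            pending_names = None
    return models


if __name__ == "__main__":
    coefficients = parse_filter(SAMPLE)[".SNV.hetero"]
    print(len(coefficients), coefficients["quality_score"])
```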
		

Summary of logistic regression filter (12877_ill.gvf.summary)

12877_ill_hg19.gvf.SNV.hetero

Call:
glm(formula = yval ~ ., family = binomial(), data = train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-5.0565   0.0016   0.1642   0.4581   3.6338

Coefficients:
               Estimate Std. Error  z value Pr(>|z|)
(Intercept)   -8.360935   0.022054 -379.105  < 2e-16 ***
quality_score  1.808312   0.003781  478.238  < 2e-16 ***
rmsk          -0.073036   0.005123  -14.255  < 2e-16 ***
dbsnp          1.112606   0.006768  164.390  < 2e-16 ***
gene_model    10.843971   0.375903   28.848  < 2e-16 ***
SNV_typeA->G   0.265273   0.012985   20.429  < 2e-16 ***
SNV_typeA->T  -0.226577   0.015798  -14.342  < 2e-16 ***
SNV_typeC->A   0.072604   0.015710    4.622 3.81e-06 ***
SNV_typeC->G   0.134556   0.016282    8.264  < 2e-16 ***
SNV_typeC->T   0.366172   0.012835   28.529  < 2e-16 ***
SNV_typeG->A   0.372807   0.012845   29.023  < 2e-16 ***
SNV_typeG->C   0.133631   0.016261    8.218  < 2e-16 ***
SNV_typeG->T   0.076050   0.015733    4.834 1.34e-06 ***
SNV_typeT->A  -0.261034   0.015725  -16.599  < 2e-16 ***
SNV_typeT->C   0.246087   0.012965   18.980  < 2e-16 ***
SNV_typeT->G  -0.007432   0.015995   -0.465    0.642
---
  ...
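
The `Call:` line in the summary fits a binomial GLM, and R's `binomial()` family uses the logit link by default, so the fitted model has the standard logistic form below. The `z value` column is the estimate divided by its standard error. How LR Filter encodes each predictor `x_j` (for example, whether `quality_score` is rescaled), and how the reported `lr_score` is derived from the fitted probability, is not stated here, so the `x_j` are placeholders only.

```latex
\[
\Pr(\mathrm{yval} = 1 \mid x)
  = \frac{1}{1 + \exp\!\left(-\left(\beta_0 + \sum_{j} \beta_j x_j\right)\right)}
\]

\[
z_j = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)},
\qquad \text{e.g. for dbsnp: } \frac{1.112606}{0.006768} \approx 164.39
\]
```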
		

Sample Files for predictionModule

We provide a sample variant call file for variant prioritization, together with the logistic regression filter built from the training variant call file and gold standard variants described above. The same variant call file annotated with filtering scores is also provided.

Name | Description | Download
12878_ill_hg19.gvf | An example variant call file in GVF for NA12878 (an individual of the CEPH/Utah pedigree) prepared using the Illumina CASAVA pipeline | Download
12877_ill.gvf.filter | The logistic regression filter built using the training and gold standard variant call files in GVF for NA12877 | Download
12878_ill_hg19.filtered_lr.gvf | The example variant call file in GVF for NA12878 with filtering score predicted by the logistic regression filter above. | Download

Variant call file (12878_ill_hg19.gvf)

...
chr1    VCF     SNV     20250   20250   11      +       .       ID=22;Reference_seq=T;Variant_seq=C;Genotype=heterozygous;
chr1    VCF     SNV     28376   28376   49      +       .       ID=24;Reference_seq=G;Variant_seq=A;Genotype=homozygous;
chr1    VCF     SNV     28563   28563   53      +       .       ID=25;Reference_seq=A;Variant_seq=G;Genotype=homozygous;
chr1    VCF     SNV     28835   28835   10      +       .       ID=26;Reference_seq=A;Variant_seq=G;Genotype=heterozygous;
chr1    VCF     SNV     30923   30923   39      +       .       ID=27;Reference_seq=G;Variant_seq=T;Genotype=homozygous;
chr1    VCF     SNV     31029   31029   5       +       .       ID=28;Reference_seq=G;Variant_seq=A;Genotype=heterozygous;
chr1    VCF     SNV     52238   52238   53      +       .       ID=31;Reference_seq=T;Variant_seq=G;Genotype=homozygous;
chr1    VCF     SNV     54586   54586   13      +       .       ID=32;Reference_seq=T;Variant_seq=C;Genotype=heterozygous;
chr1    VCF     SNV     54676   54676   129     +       .       ID=33;Reference_seq=C;Variant_seq=T;Genotype=heterozygous;
chr1    VCF     SNV     54708   54708   14      +       .       ID=34;Reference_seq=G;Variant_seq=C;Genotype=heterozygous;
...
		

Logistic regression filter (12877_ill.gvf.filter)

.SNV.hetero
  (Intercept) quality_score          rmsk         dbsnp    gene_model
  -8.36093490    1.80831214   -0.07303558    1.11260619   10.84397095
 SNV_typeA->G  SNV_typeA->T  SNV_typeC->A  SNV_typeC->G  SNV_typeC->T
   0.26527300   -0.22657708    0.07260413    0.13455575    0.36617201
 SNV_typeG->A  SNV_typeG->C  SNV_typeG->T  SNV_typeT->A  SNV_typeT->C
   0.37280697    0.13363085    0.07604954   -0.26103353    0.24608697
 SNV_typeT->G
  -0.00743212
  ...
		

Variant call file with filtering scores (12878_ill_hg19.filtered_lr.gvf)

...
chr1    VCF     SNV     20250   20250   11      +       .       ID=22;Reference_seq=T;Variant_seq=C;Genotype=heterozygous;repeat_with_rmsk_tag_detail=L3;dbsnp_tag_detail=ID=960;Reference_seq=T;Variant_seq=C;,ID=961;Reference_seq=T;Variant_seq=A;,ID=962;Reference_seq=T;Variant_seq=G;;T->C_SNV;lr_score=1.08205372613e-06;
chr1    VCF     SNV     28376   28376   49      +       .       ID=24;Reference_seq=G;Variant_seq=A;Genotype=homozygous;dbsnp_tag_detail=ID=1724;Reference_seq=G;Variant_seq=A;,ID=1725;Reference_seq=G;Variant_seq=C;,ID=1726;Reference_seq=G;Variant_seq=T;,ID=1727;Reference_seq=G;Variant_seq=A;;G->A_SNV;lr_score=5.07639376526e-12;
chr1    VCF     SNV     28563   28563   53      +       .       ID=25;Reference_seq=A;Variant_seq=G;Genotype=homozygous;dbsnp_tag_detail=ID=1751;Reference_seq=A;Variant_seq=G;,ID=1752;Reference_seq=A;Variant_seq=C;,ID=1753;Reference_seq=A;Variant_seq=T;;A->G_SNV;lr_score=4.09326257894e-12;
chr1    VCF     SNV     28835   28835   10      +       .       ID=26;Reference_seq=A;Variant_seq=G;Genotype=heterozygous;dbsnp_tag_detail=ID=1784;Reference_seq=A;Variant_seq=G;,ID=1785;Reference_seq=A;Variant_seq=G;,ID=1786;Reference_seq=A;Variant_seq=G;;A->G_SNV;lr_score=9.1658815874e-07;
chr1    VCF     SNV     30923   30923   39      +       .       ID=27;Reference_seq=G;Variant_seq=T;Genotype=homozygous;repeat_with_rmsk_tag_detail=(TC)n;dbsnp_tag_detail=ID=1965;Reference_seq=G;Variant_seq=A;,ID=1966;Reference_seq=G;Variant_seq=C;;G->T_SNV;lr_score=1.0523479336e-11;
chr1    VCF     SNV     31029   31029   5       +       .       ID=28;Reference_seq=G;Variant_seq=A;Genotype=heterozygous;repeat_with_rmsk_tag_detail=MLT1A;dbsnp_tag_detail=ID=1972;Reference_seq=G;Variant_seq=A;,ID=1973;Reference_seq=G;Variant_seq=C;,ID=1974;Reference_seq=G;Variant_seq=T;;G->A_SNV;lr_score=3.10132320581e-06;
chr1    VCF     SNV     52238   52238   53      +       .       ID=31;Reference_seq=T;Variant_seq=G;Genotype=homozygous;repeat_with_rmsk_tag_detail=AT_rich;dbsnp_tag_detail=ID=2967;Reference_seq=T;Variant_seq=G;;T->G_SNV;lr_score=3.92584308003e-12;
chr1    VCF     SNV     54586   54586   13      +       .       ID=32;Reference_seq=T;Variant_seq=C;Genotype=heterozygous;repeat_with_rmsk_tag_detail=L2;dbsnp_tag_detail=ID=3012;Reference_seq=T;Variant_seq=C;;T->C_SNV;lr_score=7.9993499285e-07;
chr1    VCF     SNV     54676   54676   129     +       .       ID=33;Reference_seq=C;Variant_seq=T;Genotype=heterozygous;repeat_with_rmsk_tag_detail=L2;dbsnp_tag_detail=ID=3013;Reference_seq=C;Variant_seq=T;;C->T_SNV;lr_score=8.74542353761e-09;
chr1    VCF     SNV     54708   54708   14      +       .       ID=34;Reference_seq=G;Variant_seq=C;Genotype=heterozygous;repeat_with_rmsk_tag_detail=L2;G->C_SNV;lr_score=1.86216375082e-06;
...
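
In the scored output above, each record carries an `lr_score=` attribute at the end of its attribute column. Because the `dbsnp_tag_detail` value itself contains `;` and `=`, a plain split on `;` is unreliable for these lines, so the sketch below extracts the score with a regular expression. `lr_score` is a hypothetical helper. Whether a larger or smaller score indicates a more reliable call is not stated here, so the example only sorts in ascending order.

```python
import re

# Match 'lr_score=<value>' at the start of the attributes or after a ';'.
_LR_SCORE = re.compile(r"(?:^|;)lr_score=([^;]+)")

LINE_A = ("chr1    VCF     SNV     52238   52238   53      +       .       "
          "ID=31;Reference_seq=T;Variant_seq=G;Genotype=homozygous;"
          "repeat_with_rmsk_tag_detail=AT_rich;"
          "dbsnp_tag_detail=ID=2967;Reference_seq=T;Variant_seq=G;;"
          "T->G_SNV;lr_score=3.92584308003e-12;")
LINE_B = ("chr1    VCF     SNV     54708   54708   14      +       .       "
          "ID=34;Reference_seq=G;Variant_seq=C;Genotype=heterozygous;"
          "repeat_with_rmsk_tag_detail=L2;G->C_SNV;lr_score=1.86216375082e-06;")


def lr_score(line):
    """Return the lr_score of a scored GVF record, or None if absent."""
    cols = line.split(None, 8)
    if len(cols) != 9:
        raise ValueError("expected 9 columns, got %d" % len(cols))
    match = _LR_SCORE.search(cols[8])
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    # Pair each score with the record's start position and sort ascending.
    scored = [(lr_score(line), line.split(None, 4)[3]) for line in (LINE_B, LINE_A)]
    for score, position in sorted(scored):
        print(position, score)
```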