Parameters

Name Usage Default Description
Input -i <input.gvf|vcf> A list of query variants in GVF or VCF format. See -if for supported input file formats.
Input file format -if GVF|VCF Default: GVF

-if GVF: Genome Variation Format

-if VCF: Variant Call Format

Output -o <output.gvf|vcf> A single GVF/VCF file (-s n) or two GVF/VCF files (-s y).
Separate matching and non-matching query variants -s y|n

-s y: Separate matching and non-matching queries into two output files.

-s n: Combine matching and non-matching queries into a single output file.

Reference -r <reference.txt> A list of reference variants or genomic regions with annotations. See -rf for supported reference file formats.
Reference file format -rf GVF|GFF3|GTF|VCF|GSF|BED Default: GVF

-rf GVF: Genome Variation Format

-rf GFF3: Generic Feature Format version 3

-rf GTF: Gene Transfer Format

-rf VCF: Variant Call Format

-rf BED: Browser Extensible Data Format

-rf GSF: gSearch Format. Tab-delimited text file with seven columns: chromosome, start, end, reference sequence (default '.'), variant sequence (default '.'), annotation (default '.'), numerical value (default 0).

Note: if the -v y option is used, each annotation feature in the reference file must have a score. For -rf GVF|GFF3|GTF|VCF, the score is read from column 6. For -rf BED, it is read from column 5 (if present). For -rf GSF, it is read from column 7.
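To make the GSF column layout concrete, here is a minimal Python sketch that reads one GSF line into a record. It is not gSearch's own parser. The names `GsfRecord` and `parse_gsf_line` are hypothetical, and treating a missing or empty optional column as its documented default is an assumption.

```python
from typing import NamedTuple


class GsfRecord(NamedTuple):
    chrom: str
    start: int
    end: int
    ref_seq: str
    var_seq: str
    annotation: str
    value: float


def parse_gsf_line(line: str) -> GsfRecord:
    """Parse one tab-delimited GSF line (hypothetical helper, not part of gSearch)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3:
        raise ValueError("GSF line needs at least chromosome, start, end")
    # Documented defaults for columns 4-7: '.', '.', '.', 0.
    # Assumption: a missing or empty optional column falls back to its default.
    defaults = [".", ".", ".", "0"]
    optional = fields[3:7]
    filled = [
        optional[i] if i < len(optional) and optional[i] != "" else defaults[i]
        for i in range(4)
    ]
    return GsfRecord(
        fields[0], int(fields[1]), int(fields[2]),
        filled[0], filled[1], filled[2], float(filled[3]),
    )


record = parse_gsf_line("chr1\t43055\t43069\t.\t.\tV$CDPCR3_01\t816")
print(record.annotation, record.value)
```

With -v y and -rf GSF, the score would come from the seventh column, which is the `value` field in this sketch.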

Annotation tag -tg <tag name> Default: tag. The following suffixes are added to the tag name: _detail (with -a y), _overlap (with -l y), _avg_value (with -v y).
Search mode -m e|o

-m e: Exact match only. A query variant matches a reference variant only if all five attributes (chr, start, end, ref_seq, var_seq) are identical.

-m o: Allow overlap. A match is found when the query variant overlaps a reference genomic region by any amount, based only on three attributes (chr, start, end).
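The two search modes can be sketched as a single predicate. This is an illustration rather than gSearch's implementation: `is_match` is a hypothetical name, and treating start and end as inclusive coordinates is an assumption. The sample tuples are taken from the example files later in this section.

```python
def is_match(query, ref, mode):
    """Return True if query matches ref under the given search mode.

    query and ref are (chr, start, end, ref_seq, var_seq) tuples.
    Hypothetical helper for illustration only.
    """
    if mode == "e":
        # Exact: all five attributes must be identical.
        return tuple(query) == tuple(ref)
    if mode == "o":
        # Overlap: only chr, start, end are compared.
        # Assumption: coordinates are inclusive on both ends.
        q_chr, q_start, q_end = query[:3]
        r_chr, r_start, r_end = ref[:3]
        return q_chr == r_chr and q_start <= r_end and r_start <= q_end
    raise ValueError("mode must be 'e' or 'o'")


snv = ("chr1", 43069, 43069, "G", "C")
tfbs = ("chr1", 43055, 43069, ".", ".")
print(is_match(snv, tfbs, "e"))  # sequences differ, so no exact match
print(is_match(snv, tfbs, "o"))  # position 43069 lies inside 43055-43069
```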

Annotation type -t s|c

-t s: Single annotation. The annotation column in the reference file contains a single annotation field. If a query variant matches multiple reference regions, their annotations are combined, separated by commas (",").

-t c: Composite annotation. The annotation column in the reference file contains several annotation fields. If a query variant matches multiple reference regions, their annotations are not combined; only the annotation from the last match is used, copied as is.
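The difference between the two annotation types shows up only when a query matches more than one reference region. A minimal sketch, assuming matches are collected in the order they are found (`combine_annotations` is a hypothetical name, and the composite strings below are made up for illustration):

```python
def combine_annotations(matched, annotation_type):
    """Combine annotations from all reference regions matched by one query.

    matched: annotation strings in the order the matches were found
             (assumption: "last match" means the last element of this list).
    """
    if not matched:
        return None
    if annotation_type == "s":
        # Single annotation: join all matches with commas.
        return ",".join(matched)
    if annotation_type == "c":
        # Composite annotation: keep only the last match, unchanged.
        return matched[-1]
    raise ValueError("annotation_type must be 's' or 'c'")


print(combine_annotations(["V$SOX9_B1", "V$SOX5_01"], "s"))
print(combine_annotations(["gene=A;type=exon", "gene=B;type=intron"], "c"))
```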

Report annotation -a y|n

-a y: Report annotation from the reference file. The annotation tag specified with -tg is suffixed with _detail.

-a n: Do not report annotation from the reference file.

Report overlap -l y|n

-l y: Report the extent of overlap (ranging from 0 to 1) between the query and the reference regions. The annotation tag specified with -tg is suffixed with _overlap.

-l n: Do not report overlap.
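The exact formula behind the overlap value is not spelled out here. One plausible reading, shown below purely as an assumption, is the fraction of the query interval covered by the reference region, with inclusive coordinates. This reading is consistent with the range search example in this section, where a one-base SNV inside a reference region is reported with an overlap of 1.0000. The name `overlap_fraction` is hypothetical.

```python
def overlap_fraction(q_start, q_end, r_start, r_end):
    """Fraction of the query interval covered by the reference interval.

    Assumption: this is only one possible definition of "extent of overlap";
    coordinates are treated as inclusive on both ends.
    """
    shared = min(q_end, r_end) - max(q_start, r_start) + 1
    if shared <= 0:
        return 0.0
    return shared / (q_end - q_start + 1)


# SNV at 43069 against the reference region 43055-43069.
print(f"{overlap_fraction(43069, 43069, 43055, 43069):.4f}")
```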

Report average value -v y|n

-v y: Report average value. The annotation tag specified with -tg is suffixed with _avg_value. See also: -rf option.

-v n: Do not report average value.
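The three report options each add one attribute whose name is built from the -tg tag and a suffix. The sketch below assembles such an attribute string. It assumes that the average value is the arithmetic mean of the scores of all matched reference regions and that numbers are written with four decimals, as in the range search output in this section. `build_attributes` is a hypothetical name, not a gSearch function.

```python
def build_attributes(tag, annotation=None, overlap=None, scores=None):
    """Build the attribute text appended to a matching query variant.

    annotation: reported with -a y (suffix _detail)
    overlap:    reported with -l y (suffix _overlap)
    scores:     reported with -v y (suffix _avg_value)
    Assumption: _avg_value is the plain mean of the matched regions' scores.
    """
    parts = []
    if annotation is not None:
        parts.append(f"{tag}_detail={annotation};")
    if overlap is not None:
        parts.append(f"{tag}_overlap={overlap:.4f};")
    if scores:
        parts.append(f"{tag}_avg_value={sum(scores) / len(scores):.4f};")
    return "".join(parts)


print(build_attributes("tfbs", "V$CDPCR3_01", 1.0, [816]))
```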

Number of threads -nt 1..25 Default: 1. Number of threads created by gSearch (1 to 25).
Big data -b Use this option if the input (-i) or reference (-r) file is larger than the available main memory. Both files must be sorted by chromosome.

Examples

Exact search

./gsearch -i GS12877.gvf -o GS12877.af_annotated.gvf -r hg18.ref.1000g2010_CEU.txt -rf GSF -tg allele_freq_1000g_ceu -m e -s n -t s -a y -l n -v n -nt 4

Input (GS12877.gvf)

##gvf-version 1.05
##file-version 1.0
##feature-ontology http://sourceforge.net/projects/song/files/Sequence%20Ontology/so_2_4_4/so_2_4_4.obo/download
##genome-build NCBI B36.3
##sequence-region
##technology-platform Complete Genomics WGS
...
chr1    CGI     SNV     729289  729289  73      +       .       ID=98;Reference_seq=A;Variant_seq=G;Genotype=homozygous;
chr1    CGI     SNV     742429  742429  267     +       .       ID=99;Reference_seq=G;Variant_seq=A;Genotype=heterozygous;
chr1    CGI     SNV     742584  742584  234     +       .       ID=100;Reference_seq=A;Variant_seq=G;Genotype=heterozygous;
chr1    CGI     SNV     743132  743132  28      +       .       ID=101;Reference_seq=C;Variant_seq=G;Genotype=homozygous;
chr1    CGI     deletion        743708  743709  76      +       .       ID=102;Reference_seq=CT;Variant_seq=-;Genotype=heterozygous;
chr1    CGI     SNV     743712  743712  90      +       .       ID=103;Reference_seq=G;Variant_seq=T;Genotype=heterozygous;
...

Output (GS12877.af_annotated.gvf)

##gvf-version 1.05
##file-version 1.0
##feature-ontology http://sourceforge.net/projects/song/files/Sequence%20Ontology/so_2_4_4/so_2_4_4.obo/download
##genome-build NCBI B36.3
##sequence-region
##technology-platform Complete Genomics WGS
...
chr1    CGI     SNV     729289  729289  73      +       .       ID=98;Reference_seq=A;Variant_seq=G;Genotype=homozygous;
chr1    CGI     SNV     742429  742429  267     +       .       ID=99;Reference_seq=G;Variant_seq=A;Genotype=heterozygous;allele_freq_1000g_ceu_detail=0.858;
chr1    CGI     SNV     742584  742584  234     +       .       ID=100;Reference_seq=A;Variant_seq=G;Genotype=heterozygous;allele_freq_1000g_ceu_detail=0.858;
chr1    CGI     SNV     743132  743132  28      +       .       ID=101;Reference_seq=C;Variant_seq=G;Genotype=homozygous;
chr1    CGI     deletion        743708  743709  76      +       .       ID=102;Reference_seq=CT;Variant_seq=-;Genotype=heterozygous;
chr1    CGI     SNV     743712  743712  90      +       .       ID=103;Reference_seq=G;Variant_seq=T;Genotype=heterozygous;allele_freq_1000g_ceu_detail=0.292;
...

Reference (hg18.ref.1000g2010_CEU.txt)

...
chr1    742429  742429  G       A       0.858   0
chr1    742456  742456  T       G       0.033   0
chr1    742584  742584  A       G       0.858   0
chr1    743268  743268  C       A       0.883   0
chr1    743288  743288  T       C       0.692   0
chr1    743404  743404  G       A       0.100   0
chr1    743712  743712  G       T       0.292   0
...

Range search

./gsearch -i GS12877.gvf -o GS12877.tf_annotated.gvf -r hg18.ref.tfbs.txt -rf GSF -tg tfbs -m o -s n -t s -a y -l y -v y -nt 4

Input (GS12877.gvf)

##gvf-version 1.05
##file-version 1.0
##feature-ontology http://sourceforge.net/projects/song/files/Sequence%20Ontology/so_2_4_4/so_2_4_4.obo/download
##genome-build NCBI B36.3
##sequence-region
##technology-platform Complete Genomics WGS
...
chr1    CGI     SNV     42590   42590   102     +       .       ID=6;Reference_seq=C;Variant_seq=G;Genotype=heterozygous;
chr1    CGI     SNV     43069   43069   119     +       .       ID=7;Reference_seq=G;Variant_seq=C;Genotype=heterozygous;
chr1    CGI     SNV     45027   45027   59      +       .       ID=8;Reference_seq=C;Variant_seq=A;Genotype=homozygous;
...

Output (GS12877.tf_annotated.gvf)

##gvf-version 1.05
##file-version 1.0
##feature-ontology http://sourceforge.net/projects/song/files/Sequence%20Ontology/so_2_4_4/so_2_4_4.obo/download
##genome-build NCBI B36.3
##sequence-region
##technology-platform Complete Genomics WGS
...
chr1    CGI     SNV     42590   42590   102     +       .       ID=6;Reference_seq=C;Variant_seq=G;Genotype=heterozygous;
chr1    CGI     SNV     43069   43069   119     +       .       ID=7;Reference_seq=G;Variant_seq=C;Genotype=heterozygous;tfbs_detail=V$CDPCR3_01;tfbs_overlap=1.0000;tfbs_avg_value=816.0000;
chr1    CGI     SNV     45027   45027   59      +       .       ID=8;Reference_seq=C;Variant_seq=A;Genotype=homozygous;
...

Reference (hg18.ref.tfbs.txt)

...
chr1    26297   26308   .       .       V$E4BP4_01      834
chr1    26600   26615   .       .       V$FREAC7_01     890
chr1    43055   43069   .       .       V$CDPCR3_01     816
chr1    43111   43124   .       .       V$SOX9_B1       865
chr1    43113   43122   .       .       V$SOX5_01       961
...


File Format Conversion