How To Use Python To Automatically Download Dna Sequencing Files

I have recently entered new territory and started working on analysing Dna sequencing data (every bit opposed to analysing RNA sequencing, i.e. transcriptomics). While it is still the assay of lots of sequencing reads, one of the goals is to identify disease causing mutations; this is in contrast to identifying differentially expressed genes or inferring cistron networks. The new playground includes an entirely new repertoire of bioinformatic tools, file formats, and genetics that I was unaware of. While information technology is always fun to acquire virtually new things, time is always a constraint. In this post, I write about some of the tools and file formats I learned virtually, and hopefully they will be useful for those who are getting started with analysing DNA sequencing data.

Ane of the tools I first got familiar with was the Genome Analysis Toolkit (GATK), which seems like one of the more popular toolkits for determining genetic variants in sequencing information. I spent a lot of time reading various GATK guides. Luckily at that place is a best practices guide that one tin follow to go started with the GATK. The guides are very well written and I institute the various GATK presentations to be as informative. I used Bpipe to build a pipeline with the GATK and I would highly recommend using Bpipe, non only with the GATK, but with various bioinformatic pipelines.

At the end of the GATK pipeline, a VCF file is produced that contains all of the inferred variants. The VCF file produced by the GATK contains A LOT of data. I spent some time going through each annotation, which are more often than not well documented (annoyingly some aren't documented nearly too equally the others). In addition, I created a GitHub repository to assistance me larn a bit more near VCF files; I will exist continually updating the repo as I acquire more nearly VCF files and aim to include more "recipes" for performing various tasks with VCF files.

Reading VCF files into R

I use R almost exclusively to produce all my plots, so naturally I wanted to discover a style to load VCF files into R (without writing my own terrible parser). I came beyond the VariantAnnotation package, which every bit the package name suggests, is a tool for annotating variants, and comes with functions that process VCF files into R. Below is a simple demonstration of producing a histogram of alternate allele frequencies from the sample file provided past the package.

#install parcel source("http://bioconductor.org/biocLite.R") biocLite("VariantAnnotation")  #load library library(VariantAnnotation)  #location of example file fl <- system.file("extdata", "chr22.vcf.gz", packet="VariantAnnotation") fl [ane] "/Users/dtang/Library/R/3.2/library/VariantAnnotation/extdata/chr22.vcf.gz"  #read into vcf object vcf <- readVcf(fl, "hg19") found header lines for 3 'stock-still' fields: ALT, QUAL, FILTER  establish header lines for 22 'info' fields: LDAF, AVGPOST, RSQ, ERATE, THETA, CIEND, CIPOS, Finish, HOMLEN, HOMSEQ, SVLEN, SVTYPE, Ac, AN, AA, AF, AMR_AF, ASN_AF, AFR_AF, EUR_AF, VT, SNPSOURCE  found header lines for 3 'geno' fields: GT, DS, GL

This VCF file contains a small subset of variants from the g Genomes projection. What does this VCF file look like?

gunzip -c /Users/dtang/Library/R/iii.ii/library/VariantAnnotation/extdata/chr22.vcf.gz | grep -v "^#" | head -2 22      50300078        rs7410291       A       Thou       100     Pass    AN=2184;AVGPOST=0.9890;THETA=0.0005;VT=SNP;RSQ=0.9856;ERATE=0.0020;SNPSOURCE=LOWCOV;AA=N;AC=751;LDAF=0.3431;AF=0.34;ASN_AF=0.19;AMR_AF=0.twenty;AFR_AF=0.83;EUR_AF=0.22  GT:DS:GL        0|0:0.000:-0.00,-2.77,-5.00     0|0:0.000:-0.12,-0.62,-2.40     1|0:1.000:-0.48,-0.48,-0.48     0|0:0.000:-0.17,-0.50,-2.79  0|0:0.000:-0.03,-1.13,-5.00 22      50300086        rs147922003     C       T       100     Laissez passer    ERATE=0.0005;AVGPOST=0.9963;AN=2184;Air conditioning=20;VT=SNP;THETA=0.0011;RSQ=0.8398;AA=c;SNPSOURCE=LOWCOV;LDAF=0.0091;AF=0.01;AMR_AF=0.01;AFR_AF=0.02;EUR_AF=0.01       GT:DS:GL        0|0:0.000:-0.00,-two.75,-five.00     0|0:0.000:-0.12,-0.62,-2.41     0|0:0.000:-0.48,-0.48,-0.48     0|0:0.000:-0.10,-0.lxx,-four.seventy  0|0:0.000:-0.03,-ane.xvi,-5.00

Have a look at this tutorial, if you're unfamiliar with the VCF. Using the info() office, we tin obtain all the values for various INFO central/value pairs in the VCF file. We will use this part to produce a simple histogram of the alternate allele frequency (alternate with respect to the reference genome).

AF <- info(vcf)$AF class(AF) [1] "numeric" summary(AF)    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.  0.00000 0.00090 0.00460 0.07347 0.05000 one.00000 hist(AF, breaks=50, xlab='Alternate Allele Frequency')

hist_af

Data analysis

I of the most popular and useful data analysis method is the Principal Component Analysis (PCA). The SNPRelate packet has functions to perform a PCA on variant data, as well as providing functions for calculating the fixation index, and for performing a Identity By State (IBS) and Identity By Descent analysis. Information technology besides has a brilliantly written tutorial. To install the packet, just perform the usual Bioconductor steps:

#install SNPRelate source("http://bioconductor.org/biocLite.R") biocLite("SNPRelate")

Processing a VCF file using SNPRelate

The SNPRelate packet uses the Genomic Information Structure (GDS) format, so we demand to convert a VCF file into a GDS file before we tin can analyse information technology. Every bit an example, I downloaded a VCF file (ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz) from the 1000 Genomes project, which is located at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20101123/interim_phase1_release/. The file is around 300M in size.

#load the package library(SNPRelate) Loading required package: gdsfmt SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2)  my_vcf <- "ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz"  #convert VCF to GDS snpgdsVCF2GDS(my_vcf, "ALL.chr22.phase1.projectConsensus.genotypes.gds", method="biallelic.only") VCF Format --> SNP GDS Format Method: exacting biallelic SNPs Number of samples: 1094 Parsing "ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz" ...         import 494975 variants. + genotype   { Bit2 1094x494975 } * Optimize the access efficiency ... Clean up the fragments of GDS file:         open the file "ALL.chr22.phase1.projectConsensus.genotypes.gds" (size: 138172578).         # of fragments in total: 143.         save information technology to "ALL.chr22.phase1.projectConsensus.genotypes.gds.tmp".         rename "ALL.chr22.phase1.projectConsensus.genotypes.gds.tmp" (size: 138171102).         # of fragments in total: 20.  # Open the GDS file (genofile <- snpgdsOpen('ALL.chr22.phase1.projectConsensus.genotypes.gds')) File: /GROUP/ANALYSIS/Tang/muse/snprelate/ALL.chr22.phase1.projectConsensus.genotypes.gds +    [  ] * |--+ sample.id   { VStr8 1094 ZIP_RA(26.22%) } |--+ snp.id   { Int32 494975 ZIP_RA(34.58%) } |--+ snp.rs.id   { VStr8 494975 ZIP_RA(35.07%) } |--+ snp.position   { Int32 494975 ZIP_RA(45.92%) } |--+ snp.chromosome   { VStr8 494975 ZIP_RA(0.x%) } |--+ snp.allele   { VStr8 494975 ZIP_RA(14.29%) } |--+ genotype   { Bit2 1094x494975 } * |--+ snp.annot   [  ] |  |--+ qual   { Float32 494975 ZIP_RA(0.10%) } |  |--+ filter   { VStr8 494975 ZIP_RA(0.fifteen%) }  #tally the genotypes tabular array(read.gdsn( alphabetize.gdsn(genofile, "genotype")))          0         1         2   19802443  38544219 483155988  19802443 + 38544219 + 483155988 [ane] 541502650  494975 * 1094 [1] 541502650

Data analysis using SNPRelate

I will use the sample file (hapmap_geno.gds) provided by the SNPRelate parcel to perform a uncomplicated analysis.

#load the package library(SNPRelate) Loading required package: gdsfmt SNPRelate -- supported past Streaming SIMD Extensions 2 (SSE2)  #summary of the sample file snpgdsSummary(snpgdsExampleFileName()) The file name: /Users/dtang/Library/R/3.ii/library/SNPRelate/extdata/hapmap_geno.gds  The total number of samples: 279  The total number of SNPs: 9088  SNP genotypes are stored in SNP-major manner (Sample 10 SNP).  # Open the sample GDS file (genofile <- snpgdsOpen(snpgdsExampleFileName())) File: /Users/dtang/Library/R/3.two/library/SNPRelate/extdata/hapmap_geno.gds +    [  ] * |--+ sample.id   { FStr8 279 Aught(23.ten%) } |--+ snp.id   { Int32 9088 ZIP(34.76%) } |--+ snp.rs.id   { FStr8 9088 Null(42.66%) } |--+ snp.position   { Int32 9088 ZIP(94.73%) } |--+ snp.chromosome   { UInt8 9088 ZIP(0.94%) } * |--+ snp.allele   { FStr8 9088 ZIP(14.45%) } |--+ genotype   { Bit2 279x9088 } * |--+ sample.annot   [ information.frame ] * |  |--+ sample.id   { FStr8 279 Zip(23.10%) } |  |--+ family.id   { FStr8 279 ZIP(28.37%) } |  |--+ father.id   { FStr8 279 ZIP(12.98%) } |  |--+ mother.id   { FStr8 279 ZIP(12.86%) } |  |--+ sex   { FStr8 279 Nada(28.32%) } |  |--+ pop.grouping   { FStr8 279 Nil(vii.89%) }

What exactly is the GDF format? From the gdsfmt vignette:

The GDS file can contain a hierarchical structure to store multiple GDS variables (or GDS nodes) in the file, and diverse data types are immune (see the document of add together.gdsn()) including integer, floating-point number and grapheme.

Let'due south take a look inside the genofile object:

# use the functions read.gdsn() and index.gdsn() # to access the information in genofile and store these # into a data.frame for viewing purposes df <- data.frame(                  snp_chromosome = read.gdsn(alphabetize.gdsn(genofile, "snp.chromosome")),                  snp_position = read.gdsn(alphabetize.gdsn(genofile, "snp.position")),                  snp_rs_id = read.gdsn(index.gdsn(genofile, "snp.rs.id")),                  snp_allele = read.gdsn(index.gdsn(genofile, "snp.allele"))                  )  head(df)   snp_chromosome snp_position  snp_rs_id snp_allele ane              i      1355433  rs1695824        Thousand/T 2              1      3114015 rs13328662        C/T 3              1      4124418  rs4654497        A/Yard iv              1      4221327 rs10915489        A/Yard 5              i      4271809 rs12132314        C/T 6              1      4358909 rs12042555        C/T  dim(read.gdsn( index.gdsn(genofile, "sample.annot") )) [1] 279   six  caput(read.gdsn( alphabetize.gdsn(genofile, "sample.annot") ))   sample.id family.id father.id mother.id sex popular.group ane   NA19152        72                       F       YRI 2   NA19139        43 200171931 200061729   Thou       YRI three   NA18912        28                       F       YRI 4   NA19160        56                       M       YRI 5   NA07034      1341                       M       CEU 6   NA07055      1341                       F       CEU  tabular array(read.gdsn(alphabetize.gdsn(genofile, path="sample.annot/pop.group")))  CEU HCB JPT YRI   92  47  47  93  dim(read.gdsn( alphabetize.gdsn(genofile, "genotype") )) [i]  279 9088  read.gdsn( index.gdsn(genofile, "genotype"))[1:5,1:v]      [,ane] [,2] [,3] [,4] [,5] [1,]    ii    1    0    one    2 [2,]    1    ane    0    1    ii [iii,]    two    1    ane    two    two [4,]    2    1    1    two    ii [5,]    0    0    0    2    2  table(read.gdsn(alphabetize.gdsn(genofile, "genotype")))       0      i      2      3  920248 657267 947899  10138

The SNP genotypic matrix is stored in a SNP-major mode, $n_{sample} \times n_{snp}$ , in contrast to an individual-major mode, $n_{snp} \times n_{sample}$ . From the table above we tin see that in that location are iv possible values stored for the genotypes: 0, 1, 2, and three. For bi-allelic SNP sites, "0" indicates ii B alleles, "1" indicates 1 A allele and ane B allele, "two" indicates two A alleles, and "3" is a missing genotype. For multi-allelic sites, information technology is a count of the reference allele (iii meaning no telephone call). The format of snp.allele is "A allele/B allele", like "T/Yard" where T is A allele and G is B allele.

Performing an Identity By Country analysis

The Identity By Land (IBS) analysis is a straightforward measure of genetic similarity; it merely tallies the number of alleles that are shared between 2 individuals beyond a ready of given loci. When comparison bi-allelic alleles, there are 3 possible scenarios at each site: no shared alleles, one shared allele, and two shared alleles; IBS assigns a score of 0, 1, and 2 for the three scenarios, respectively. To obtain a similarity score between two individuals, we just tally up the number of shared alleles and divide this by the maximum possible score, i.e. ii times the total number of loci.

#perform the IBS analysis ibs <- snpgdsIBS(genofile, num.thread=2) Identity-By-State (IBS) assay on SNP genotypes: Excluding 365 SNPs on not-autosomes Excluding 1 SNP (monomorphic: TRUE, < MAF: NaN, or > missing charge per unit: NaN) Working infinite: 279 samples, 8722 SNPs 	Using two (CPU) cores IBS:	the sum of all working genotypes (0, i and 2) = 2446510 IBS:	Mon Jul 20 12:56:45 2015	0% IBS:	Mon Jul 20 12:56:45 2015	100%  length(ibs$ibs) [1] 77841  #the ibs() office actually calculates all pairwise #instead of calculating merely the upper or lower matrix cull(279, 2) [1] 38781  38781*2 + 279 [1] 77841  summary(every bit.vector(ibs$ibs))    Min. 1st Qu.  Median    Mean third Qu.    Max.   0.6988  0.7150  0.7508  0.7472  0.7701  one.0000

Nosotros tin can perform hierarchical clustering on the IBS pairwise scores:

#for colour dendrograms install.packages("sparcl")  #load packet library(sparcl)  #catechumen to distance hc <- hclust(as.dist(1 - ibs$ibs))  #obtain the unlike populations pop <- read.gdsn(alphabetize.gdsn(genofile, path="sample.annot/pop.group")) pop <- unclass(cistron(pop))  #ascertain some colours my_colour <- rainbow(4)  #plot the dendrogram ColorDendrogram(hc,                 y=my_colour[pop],                 branchlength=0.06)  #add a legend fable(x = 260, y = 0.05, fill = my_colour, legend = levels(pop), cex = 0.5)

hclust_ibs By tallying the number of shared alleles, we form almost homogenous clusters that contain individuals from one ethnicity. YRI = Yoruba in Ibadan, Nigeria, JPT = Japanese in Tokyo, CHB = Han Chinese in Beijing, and CEU = Utah residents with ancestry from northern and western Europe.

PLINK file formats

One day I was trying to utilise a tool called The Exomiser (truthful story) and I couldn't get it to work considering it requires that a PED file must be exist provided for multi-sample VCF files. I later learned that the PED format was defined by some other popular toolkit called PLINK. To learn what PED files are, I had a look at some files in various PLINK formats that are provided with SNPRelate.

system.file("extdata", "plinkhapmap.bed.gz", package="SNPRelate") [1] "/Users/dtang/Library/R/three.2/library/SNPRelate/extdata/plinkhapmap.bed.gz" organisation.file("extdata", "plinkhapmap.fam.gz", packet="SNPRelate") [1] "/Users/dtang/Library/R/3.2/library/SNPRelate/extdata/plinkhapmap.fam.gz" system.file("extdata", "plinkhapmap.bim.gz", package="SNPRelate") [one] "/Users/dtang/Library/R/three.2/library/SNPRelate/extdata/plinkhapmap.bim.gz"

Every bit you may have diligently spotted, in that location aren't any PED files above. The BED file, plinkhapmap.bed.gz, is (confusingly for me, not the UCSC Genome Browser defined BED file) actually a binary PED file (I learned this after I tried to open up the file). Hither's how it looks:

gunzip -c /Users/dtang/Library/R/iii.2/library/SNPRelate/extdata/plinkhapmap.bed.gz | xxd -b | head -5 00000000: 01101100 00011011 00000001 10000010 11101011 10001011  50..... 00000006: 00101010 10111111 11101110 10000011 11101111 11101010  *..... 0000000c: 00100010 11111010 00101010 10001110 11101111 00100010  ".*.." 00000012: 11111111 10101110 10001010 10101111 00001010 10100000  ...... 00000018: 11101011 10001010 10101100 10111111 10001010 11101000  ......

If y'all are interested in what those binary numbers stand for, have a look at the clarification at the PLINK page. The other ii files, plinkhapmap.fam.gz and plinkhapmap.bim.gz, are not binary files and can exist viewed directly:

gunzip -c /Users/dtang/Library/R/three.2/library/SNPRelate/extdata/plinkhapmap.fam.gz | head -five 0 NA19152 0 0 0 -ix 0 NA19139 0 0 0 -nine 0 NA18912 0 0 0 -9 0 NA19160 0 0 0 -ix 0 NA07034 0 0 0 -nine

The six columns of the FAM file are:

Family ID

Individual ID

Paternal ID

Maternal ID

Sex (one=male; two=female; other=unknown)

Phenotype ('1' = control, '2' = instance, '-9'/'0'/not-numeric = missing data if case/control)

And here'south the BIM file:

gunzip -c /Users/dtang/Library/R/3.ii/library/SNPRelate/extdata/plinkhapmap.bim.gz | head -5 one       9       0       2444790 Thousand       C 1       eighteen      0       3314897 C       T 1       20      0       3644455 C       T ane       32      0       4221327 A       1000 1       48      0       4541602 A       Thousand

The BIM format contains extended variant data and the columns are:

Chromosome code (either an integer, or 'Ten'/'Y'/'XY'/'MT'; '0' indicates unknown) or name

Variant identifier

Position in morgans or centimorgans (prophylactic to use dummy value of '0')

Base of operations-pair coordinate (normally one-based, just 0 ok; limited to 231-2)

Allele ane (corresponding to clear bits in .bed; usually modest)

Allele 2 (respective to set bits in .bed; usually major)

To procedure PLINK files using SNPRelate, you would start accept to convert them into a GDS file, and then load the GDS file.

Creating a PED file from a VCF file

In order to use The Exomiser with my VCF file, I need a PED file. SNPRelate also provides a function called snpgdsGDS2PED that performs the conversion from VCF to PED:

# using the VCF file I downloaded from the thou Genomes project # which I converted into a GDS file (genofile <- snpgdsOpen('ALL.chr22.phase1.projectConsensus.genotypes.gds'))  #perform the conversion snpgdsGDS2PED(genofile, 'ALL.chr22.phase1.projectConsensus.genotypes') Converting from GDS to PLINK PED:         Output a MAP file Done.         Output a PED file ...                 Output:         Wed Jul 22 10:05:27 2015        0%                 Output:         Wed Jul 22 10:05:57 2015        37%                 Output:         Wed Jul 22 10:06:27 2015        75%                 Output:         Wed Jul 22 10:06:47 2015        100%

Two files are created in the conversion; the MAP files contain iii-4 columns, which are:

Chromosome code. PLINK 1.9 also permits contig names here, but most older programs do not.

Variant identifier

Position in morgans or centimorgans (optional; also safe to use dummy value of '0')

Base-pair coordinate

Here's how it looks:

cat ALL.chr22.phase1.projectConsensus.genotypes.map | wc -l 494975  true cat ALL.chr22.phase1.projectConsensus.genotypes.map | head -v 22              0       16050036 22              0       16050053 22              0       16050069 22              0       16050115 22              0       16050116

The PED format is like to a FAM file (see higher up), where the beginning six fields are the same. The remaining columns contain the genotype data and each row is for 1 individual. Hither's how it looks:

cat ALL.chr22.phase1.projectConsensus.genotypes.ped | wc -l 1094  cat ALL.chr22.phase1.projectConsensus.genotypes.ped | cutting -f1-8 | caput -5 0       HG00096 0       0       0       -9      C C     C C 0       HG00097 0       0       0       -9      A C     C C 0       HG00099 0       0       0       -nine      A C     C C 0       HG00100 0       0       0       -ix      C C     C C 0       HG00101 0       0       0       -nine      A C     C C

Finally if you want to convert a PED file into BED, FAM, and BIM files utilise PLINK.

Summary

Well, that was a fairly wordy post. It is likewise a rather patchy post, which is cogitating of my current knowhow in this area. The point of the post was to outline some of the tools that I institute useful while I was getting started with analysing Dna sequencing data, in the hopes that it will be useful to others getting started in this area. I have also provided various links through the posts, and then delight check those out when in doubt.

Finally, here are a handful of papers to be read:

The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation Deoxyribonucleic acid sequencing information
Constructive filtering strategies to improve data quality from population-based whole exome sequencing studies
Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing
Rare-affliction genetics in the era of next-generation sequencing: discovery to translation
Variant Callers for Adjacent-Generation Sequencing Information: A Comparison Study

And if you lot made information technology this far…

This is my 200th postal service! I gauge it'south safe to say that I bask blogging. Equally I have mentioned in this mail service, I am now working in the area of Dna sequencing, so I foresee more posts in that area. The traffic on this web log has actually started to dwindle, then either I'yard becoming less interesting or hopefully people accept get more than proficient in bioinformatics and no longer demand to visit. Either fashion, I'll keep the posts coming.