What is bicycle?

bicycle (bisulfite-based methylcytosine caller) is a next-generation sequencing bioinformatic pipeline aimed to analyze whole genome bisulfite sequencing data. It can process data from directional (Lister) and non-directional (Cokus) bisulfite sequencing protocols, and from single-end and paired-end sequencing, and performs methylation calls for cytosines in CG and non-CG contexts (CHG and CHH).

bicycle uses as input the bisulfite sequencing files from the different samples (FASTQ format) and a reference genome (FASTA format). It then performs: generation and indexing of Watson and Crick bisulfited versions of the reference genome, in-silico bisulfitation of sequenced reads, read alignment, error estimation in bisulfite conversion, identification of clonal and ambiguous reads, cytosine methylation detection in CG and non-CG contexts, with non-CG to CG context correction when appropriated, calculates methylation ratios, beta scores and weighted mean of cytosine methylation status, and performs genomic annotation of methylated regions, and differential methylation for cytosines (DMC) and genomic regions (DMR).



Features

  • Statistical methylcytosine calling. All reference cytosines with higher methylation level (methylated vs. unmethylated reads) than expected will be called as methylcytosines. We include two empirical error computation procedures:
    • From a control genome: an unmethylated genome that allows to calculate the bisulfite conversion error rate as the per-context detected methylation level in the control genome.
    • From barcodes: in case that the experiment includes barcodes with unmethylated cytosines, bicycle considers the ratio of unconverted cytosines in barcodes as the error rate.
    • Fixed error rate given by the user (not empirical).
  • Output in a custom VCF.
  • Several filters:
    • Removal of ambiguous reads, which are those which aligned to both the Watson and Crick reference genomes.
    • Removal of non-correctly converted reads (those with more than three cytosines in a non-CG context).
    • Trim reads to 'x' mismatch (default 4).
    • Removal of "clonal" reads (those reads that shared the same 5' alignment position, leaving the read at that position that had the highest sum quality score).
  • CH to CG context correction (possible SNPs).
  • Multithread support. Both the alignments (performed with Bowtie) and the methylcytosine calling (programmed with GATK) support multi-core processing.
  • Step by step execution, allowing the re-execution of parts with different parameters.

For impatient users

First of all, download and install bicycle.
You should also download the simulated sample data. The following example is based on this data.

Pipeline

Please note: adjust the paths in the example to your environment

  1. Create a project
    ./bicycle create-project --project-directory /home/user/myproject --reference-directory /home/user/sampledata/ref_genomes --reads-directory /home/user/sampledata/reads --bowtie-directory /opt/bowtie-0.12.7 --samtools-directory /opt/samtools-0.1.8
  2. Create the Watson and Crick in-silico bisulfited reference genomes
    ./bicycle reference-bisulfitation --project-directory /home/user/myproject
  3. Create the in-silico bisulfited reads
    ./bicycle reads-bisulfitation --project-directory /home/user/myproject
  4. Create bowtie references indexes
    ./bicycle reference-index --project-directory /home/user/myproject
  5. Align to both references
    ./bicycle align --project-directory /home/user/myproject
  6. Perform methylation analysis and methylcytosine calling
    ./bicycle analyze-methylation --project-directory /home/user/myproject --error-mode from_control_genome=Ecoli
  7. You are finished! You can explore the summary output file /home/user/myproject/output/sample-1_mm9_chr1_reduced_plus_Ecoli.fa.summary (in bold there are the detected errors in Ecoli and the methylation level, as expected from simulated data)
    ====METHYLATION RESULTS=======================================================
    File: sample-1_mm9_chr1_reduced_plus_Ecoli.fa.summary
    Date: Mon Jul 02 12:56:39 CEST 2012
    
    ====ANALYSIS PARAMETERS=======================================================
     Correct non-CG: true
     Filters:
      remove ambiguous reads: true
      remove non-correctly bisulfite-converted reads: true
      trim to 'x' mismatch: true x=4
      remove clonal reads: false
     FDR threshold: 0.01
    
    ====ERROR ESTIMATION AND SIGNIFICANCE ADJUSTMENTS=============================
     Error rates (from control genome: Ecoli):
      WATSON={CG=9.686509334272632E-4 (1265/1305940), CHG=0.0010630211982115276 (1179/1109103), CHH=0.0010214901652048762 (2076/2032325)}
      CRICK={CG=0.0010030059599348046 (1312/1308068), CHG=0.001042832963281255 (1154/1106601), CHH=9.824304531497866E-4 (1970/2005231)}
      
      p-value cutoffs: {WATSON={CHH=9.515226330187381E-5, CG=0.01494593925387917, CHG=9.853761678579196E-5}, CRICK={CHH=9.471328057717618E-5, CG=0.01481125092524056, CHG=9.836380454616681E-5}}
    
    ====METHYLATION ANALYSIS RESULTS==============================================
    ---- GLOBAL --------
    Called methylcytosines (pval<cutoff)
     total: 108033/3898156 (0.027713872918374738)
     per context called methylcytosines:  CG:0.6686382864495108 CHG:0.0720890838910333 CHH:0.2592726296594559
     CG called methylcytosines: 72235/120785 (0.5980461149977232)
     CHG called methylcytosines: 7788/798844 (0.009749087431338285)
     CHH called methylcytosines: 28010/2978527 (0.009403977200811004)
    Methylation Levels:
     CG: 107826/364342 (0.2959472144303978)
     CHG: 118475/2427700 (0.048801334596531694)
     CHH: 436883/9009617 (0.048490740505395515)
    non-CG corrections: 0
    
    
    
    ... (more)