September 13, 2020

RRBS Analysis 1: Preprocess

RRBS

Preprocess

I am working on a RRBS (Reduced representation bisulfite sequencing) project, which is generated from Zebrafish Retina tissue. It's my first time to work on this type of data, so I wil record a bit my actions here.

The RRBS file I recieved is in folder contains many folders, the structure is below.

$ tree -d 190822_D00623_0193_AH3Y2YBCX3/
190822_D00623_0193_AH3Y2YBCX3/
├── Data
│   ├── Intensities
│   │   ├── BaseCalls
│   │   │   ├── L001
│   │   │   │   ├── C100.1
│   │   │   │   ├── C10.1
│   │   │   │   ├── C101.1
│   │   │   │   ├── C102.1

If I navigate into the deepest folder, the file format is .bcl file. Binary Base Call (BCL) files are the raw data files generated by the Illumina sequencers. The FASTQ is a text-based sequence file format that is generated from the BCL file that stores both raw sequence data and quality scores.

/190822_D00623_0193_AH3Y2YBCX3/Data/Intensities/BaseCalls/L001/C100.1$ ls *.bcl
s_1_1101.bcl  s_1_1107.bcl  s_1_1113.bcl  s_1_1203.bcl  s_1_1209.bcl  s_1_1215.bcl  s_1_2105.bcl  s_1_2111.bcl  s_1_2201.bcl  s_1_2207.bcl  s_1_2213.bcl
s_1_1102.bcl  s_1_1108.bcl  s_1_1114.bcl  s_1_1204.bcl  s_1_1210.bcl  s_1_1216.bcl  s_1_2106.bcl  s_1_2112.bcl  s_1_2202.bcl  s_1_2208.bcl  s_1_2214.bcl
s_1_1103.bcl  s_1_1109.bcl  s_1_1115.bcl  s_1_1205.bcl  s_1_1211.bcl  s_1_2101.bcl  s_1_2107.bcl  s_1_2113.bcl  s_1_2203.bcl  s_1_2209.bcl  s_1_2215.bcl
s_1_1104.bcl  s_1_1110.bcl  s_1_1116.bcl  s_1_1206.bcl  s_1_1212.bcl  s_1_2102.bcl  s_1_2108.bcl  s_1_2114.bcl  s_1_2204.bcl  s_1_2210.bcl  s_1_2216.bcl
s_1_1105.bcl  s_1_1111.bcl  s_1_1201.bcl  s_1_1207.bcl  s_1_1213.bcl  s_1_2103.bcl  s_1_2109.bcl  s_1_2115.bcl  s_1_2205.bcl  s_1_2211.bcl
s_1_1106.bcl  s_1_1112.bcl  s_1_1202.bcl  s_1_1208.bcl  s_1_1214.bcl  s_1_2104.bcl  s_1_2110.bcl  s_1_2116.bcl  s_1_2206.bcl  s_1_2212.bcl

1. BCL file to fastq file

So firstly I need to try convert these files into fastq file. bcl2fastq is the tool will be used here. The code is here:

bcl2fastq  --runfolder-dir bclfilefolder -p 40 --output-dir ./BCL2FASTQ --no-lane-splitting  --use-bases-mask Y*,I6Y* --minimum-trimmed-read-length 0 --mask-  short-adapter-reads 0 &> ./BCL2FASTQ/bcl2fastq_log.txt

After transforming to fastq file, now we can start to follow the preprocess pipeline provided by NuGEN company. Compared with other RRBS data, NuGEN company provide a series of unique scripts and actions... like add some special adapter in their reads, which means, during the analysis, we should do some special action on their data, instead of following common pipeline. For example, they provided a Pipeline I can follow

2. General Triming

The firstly we need to do is general triming, which removed adaptor added in sequencing step, the code we use is:

trim_galore -a AGATCGGAAGAGC R1.FQ

Importantly, I find a perfect command to finish triming in parallel!

parallel --plus 'trim_galore -a AGATCGGAAGAGC {...}.fastq.gz' ::: *.fastq.gz

Above is the first step triming, which removes some general adaptor, as long as you did sequencing, you need to do below triming I believe.

3. NuGEND Diversity Triming

Then, their compnay provided a script on github to trim adaptor they added on reads:

parallel --plus 'python ../NuGENDiversityTrimming/trimRRBSdiversityAdaptCustomers.py -1 {...}.fastq.gz' ::: *.fastq.gz

Then we would have a new file, named as XXX_trimmed.fq_trimmed.fa.gz

After TWO-STEP triming, you get below result:

regmtyu@rds-gw-007:~/rd00qp/Zebrafish2/SecondRun/NuGEN_DIVERSITY_TRIMING$ ls
Bc-01_S1_R1_001_trimmed.fq_trimmed.fq.gz  Bc-03_S3_R1_001_trimmed.fq_trimmed.fq.gz  Bc-05_S5_R1_001_trimmed.fq_trimmed.fq.gz  nugen_diversity_trming.log
Bc-02_S2_R1_001_trimmed.fq_trimmed.fq.gz  Bc-04_S4_R1_001_trimmed.fq_trimmed.fq.gz  Bc-06_S6_R1_001_trimmed.fq_trimmed.fq.gz  Undetermined_S0_R1_001_trimmed.fq_trimmed.fq.gz

In above figure, the top panel is results from first triming, and the bottom panel are result from second triming.

4. Prepare Bismark Genome

Then I can start to do Alignment. According to Nugen's requirement, I will use Bismark. Firstly I created a folder called Bismark, to placing various files related to Alignment. The vital step is prepare genome for alignment, the genome was downloaded from ensmbl website. Since we are working on Zebrafish, the Genome version I am using is Danio_rerio.GRCz11.

wget ftp://ftp.ensembl.org/pub/release-98/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz

Then put this unziped file under folder called Genome, Then use commands to prepare genome:

bismark_genome_preparation ./Genome

It would take about 0.5 hour. Then two sub folder would emerge from Genome folder: CT_conversion and GA_conversion. I don't what exactly that are they for yet...

5. Bismark Alignment

After building alignment genome, we can now run Alignment, I created a small R script below to run Bismark on these trimmed.fq.gz.

Samples <- dir('../BCL2FASTQ/NuGENDiversityTrimming')
Samples <- Samples[sapply(Samples,function(x) "trimmed.fq.gz" %in% strsplit(x,split='_')[[1]])]

for(i in Samples) {
    message('\\n-----------------------------------------------')
    message('Runing ',i,' with Bismark alignment now...')

    cmd <- paste('bismark --parallel 8 --output_dir ./Results --bowtie2 ./Genome ../BCL2FASTQ/NuGENDiversityTrimming/',i,sep='')
    message(cmd)
    system(cmd,ignore.stdout=TRUE)
}

This step is actually quite fast. Note that -parallel parameter above is NOT simply cores/thread, but seems to be times of resources will be allocated to row Bowtie, normally 8 is good enough. And we have another issue here, after Bismark, Nugen will use Sam files for Nudup, but I can't believe Bismark seems only support bam file for multi-core. So I can only output as bam file, like below:

regmtyu@rds-gw-007:~/rd00qp/Zebrafish2/SecondRun/BISMARK$ ls
Bc-01_S1_R1_001_trimmed.fq_trimmed_bismark_bt2.bam            Bc-05_S5_R1_001_trimmed.fq_trimmed_bismark_bt2.bam
Bc-01_S1_R1_001_trimmed.fq_trimmed_bismark_bt2_SE_report.txt  Bc-05_S5_R1_001_trimmed.fq_trimmed_bismark_bt2_SE_report.txt
Bc-02_S2_R1_001_trimmed.fq_trimmed_bismark_bt2.bam            Bc-06_S6_R1_001_trimmed.fq_trimmed_bismark_bt2.bam
Bc-02_S2_R1_001_trimmed.fq_trimmed_bismark_bt2_SE_report.txt  Bc-06_S6_R1_001_trimmed.fq_trimmed_bismark_bt2_SE_report.txt
Bc-03_S3_R1_001_trimmed.fq_trimmed_bismark_bt2.bam            log
Bc-03_S3_R1_001_trimmed.fq_trimmed_bismark_bt2_SE_report.txt  Undetermined_S0_R1_001_trimmed.fq_trimmed_bismark_bt2.bam
Bc-04_S4_R1_001_trimmed.fq_trimmed_bismark_bt2.bam            Undetermined_S0_R1_001_trimmed.fq_trimmed_bismark_bt2_SE_report.txt
Bc-04_S4_R1_001_trimmed.fq_trimmed_bismark_bt2_SE_report.txt

6. Samtools Transfer

Then I use Samtools to transfer all bam files into sam file. Again I used parallel here:

parallel --plus 'samtools view -h -o {...}.sam {...}.fq_trimmed_bismark_bt2.bam' ::: *.bam

7. Strip Bismark Result

Again here it comes to Nugen-specific time. We useNudup here, but before that we need a small script. (Is that common for big companies use small scripts for jobs?)

After downloading, I use linux command to make it work:

chmod +x ./strip_bismark_sam.sh

Then again, I use parallel command to run it:

parallel --plus './strip_bismark_sam.sh {...}.sam' ::: *.sam

To be honest I still don't know what is this step for...

8. Run Nudup

Finally I reach the final step, to run Nudup, I wrote a small loop script to run it:


Samples <- dir('../Index')
Samples <- sapply(Samples,function(x) paste(strsplit(x,split="_")[[1]][1:2],collapse="_"))

for(i in Samples) {
    message('\\n-----------------------------------------------')
    message('Runing ',i,' with Nudup now...')

    cmd <- paste('python ./nudup.py -f ../Index/',i,'_I1_001.fastq.gz -o ',i,' ../Results/',i,'_R1_001_trimmed.sam_stripped.sam',sep='')
    message(cmd)
    system(cmd,ignore.stdout=TRUE)
}

It's really slow and no parallel... After a long time, I final get fully processed bam file.

NUDUP$ ls
Bc-01_S1_dup_log.txt         Bc-02_S2.sorted.markdup.bam  Bc-04_S4_dup_log.txt         Bc-05_S5.sorted.markdup.bam  Ready  TMP3  TMP6
Bc-01_S1.sorted.markdup.bam  Bc-03_S3_dup_log.txt         Bc-04_S4.sorted.markdup.bam  Bc-06_S6_dup_log.txt         TMP    TMP4
Bc-02_S2_dup_log.txt         Bc-03_S3.sorted.markdup.bam  Bc-05_S5_dup_log.txt         Bc-06_S6.sorted.markdup.bam  TMP2   TMP5

The 6 bam file in above folder is what I will use into future Downstream Analysis.