I am working on a RRBS (Reduced representation bisulfite sequencing) project, which is generated from Zebrafish Retina tissue. It's my first time to work on this type of data, so I wil record a bit my actions here.
The RRBS file I recieved is in folder contains many folders, the structure is below.
$ tree -d 190822_D00623_0193_AH3Y2YBCX3/
190822_D00623_0193_AH3Y2YBCX3/
├── Data
│ ├── Intensities
│ │ ├── BaseCalls
│ │ │ ├── L001
│ │ │ │ ├── C100.1
│ │ │ │ ├── C10.1
│ │ │ │ ├── C101.1
│ │ │ │ ├── C102.1
If I navigate into the deepest folder, the file format is .bcl file. Binary Base Call (BCL) files are the raw data files generated by the Illumina sequencers. The FASTQ is a text-based sequence file format that is generated from the BCL file that stores both raw sequence data and quality scores.
/190822_D00623_0193_AH3Y2YBCX3/Data/Intensities/BaseCalls/L001/C100.1$ ls *.bcl
s_1_1101.bcl s_1_1107.bcl s_1_1113.bcl s_1_1203.bcl s_1_1209.bcl s_1_1215.bcl s_1_2105.bcl s_1_2111.bcl s_1_2201.bcl s_1_2207.bcl s_1_2213.bcl
s_1_1102.bcl s_1_1108.bcl s_1_1114.bcl s_1_1204.bcl s_1_1210.bcl s_1_1216.bcl s_1_2106.bcl s_1_2112.bcl s_1_2202.bcl s_1_2208.bcl s_1_2214.bcl
s_1_1103.bcl s_1_1109.bcl s_1_1115.bcl s_1_1205.bcl s_1_1211.bcl s_1_2101.bcl s_1_2107.bcl s_1_2113.bcl s_1_2203.bcl s_1_2209.bcl s_1_2215.bcl
s_1_1104.bcl s_1_1110.bcl s_1_1116.bcl s_1_1206.bcl s_1_1212.bcl s_1_2102.bcl s_1_2108.bcl s_1_2114.bcl s_1_2204.bcl s_1_2210.bcl s_1_2216.bcl
s_1_1105.bcl s_1_1111.bcl s_1_1201.bcl s_1_1207.bcl s_1_1213.bcl s_1_2103.bcl s_1_2109.bcl s_1_2115.bcl s_1_2205.bcl s_1_2211.bcl
s_1_1106.bcl s_1_1112.bcl s_1_1202.bcl s_1_1208.bcl s_1_1214.bcl s_1_2104.bcl s_1_2110.bcl s_1_2116.bcl s_1_2206.bcl s_1_2212.bcl
1. BCL file to fastq file
So firstly I need to try convert these files into fastq file. bcl2fastq
is the tool will be used here. The code is here:
bcl2fastq --runfolder-dir bclfilefolder -p 40 --output-dir ./BCL2FASTQ --no-lane-splitting --use-bases-mask Y*,I6Y* --minimum-trimmed-read-length 0 --mask- short-adapter-reads 0 &> ./BCL2FASTQ/bcl2fastq_log.txt
After transforming to fastq file, now we can start to follow the preprocess pipeline provided by NuGEN company. Compared with other RRBS data, NuGEN company provide a series of unique scripts and actions... like add some special adapter in their reads, which means, during the analysis, we should do some special action on their data, instead of following common pipeline. For example, they provided a Pipeline I can follow
2. General Triming
The firstly we need to do is general triming, which removed adaptor added in sequencing step, the code we use is:
trim_galore -a AGATCGGAAGAGC R1.FQ
Importantly, I find a perfect command to finish triming in parallel!
parallel --plus 'trim_galore -a AGATCGGAAGAGC {...}.fastq.gz' ::: *.fastq.gz
Above is the first step triming, which removes some general adaptor, as long as you did sequencing, you need to do below triming I believe.
3. NuGEND Diversity Triming
Then, their compnay provided a script on github to trim adaptor they added on reads:
parallel --plus 'python ../NuGENDiversityTrimming/trimRRBSdiversityAdaptCustomers.py -1 {...}.fastq.gz' ::: *.fastq.gz
Then we would have a new file, named as XXX_trimmed.fq_trimmed.fa.gz
After TWO-STEP triming, you get below result:
regmtyu@rds-gw-007:~/rd00qp/Zebrafish2/SecondRun/NuGEN_DIVERSITY_TRIMING$ ls
Bc-01_S1_R1_001_trimmed.fq_trimmed.fq.gz Bc-03_S3_R1_001_trimmed.fq_trimmed.fq.gz Bc-05_S5_R1_001_trimmed.fq_trimmed.fq.gz nugen_diversity_trming.log
Bc-02_S2_R1_001_trimmed.fq_trimmed.fq.gz Bc-04_S4_R1_001_trimmed.fq_trimmed.fq.gz Bc-06_S6_R1_001_trimmed.fq_trimmed.fq.gz Undetermined_S0_R1_001_trimmed.fq_trimmed.fq.gz
In above figure, the top panel is results from first triming, and the bottom panel are result from second triming.
4. Prepare Bismark Genome
Then I can start to do Alignment. According to Nugen's requirement, I will use Bismark. Firstly I created a folder called Bismark, to placing various files related to Alignment. The vital step is prepare genome for alignment, the genome was downloaded from ensmbl website. Since we are working on Zebrafish, the Genome version I am using is Danio_rerio.GRCz11.
wget ftp://ftp.ensembl.org/pub/release-98/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz
Then put this unziped file under folder called Genome
, Then use commands to prepare genome:
bismark_genome_preparation ./Genome
It would take about 0.5 hour. Then two sub folder would emerge from Genome folder: CT_conversion
and GA_conversion
. I don't what exactly that are they for yet...
5. Bismark Alignment
After building alignment genome, we can now run Alignment, I created a small R script below to run Bismark on these trimmed.fq.gz
.
Samples <- dir('../BCL2FASTQ/NuGENDiversityTrimming')
Samples <- Samples[sapply(Samples,function(x) "trimmed.fq.gz" %in% strsplit(x,split='_')[[1]])]
for(i in Samples) {
message('\\n-----------------------------------------------')
message('Runing ',i,' with Bismark alignment now...')
cmd <- paste('bismark --parallel 8 --output_dir ./Results --bowtie2 ./Genome ../BCL2FASTQ/NuGENDiversityTrimming/',i,sep='')
message(cmd)
system(cmd,ignore.stdout=TRUE)
}
This step is actually quite fast. Note that -parallel
parameter above is NOT simply cores/thread, but seems to be times of resources will be allocated to row Bowtie, normally 8 is good enough. And we have another issue here, after Bismark, Nugen will use Sam files for Nudup, but I can't believe Bismark seems only support bam file for multi-core. So I can only output as bam file, like below:
regmtyu@rds-gw-007:~/rd00qp/Zebrafish2/SecondRun/BISMARK$ ls
Bc-01_S1_R1_001_trimmed.fq_trimmed_bismark_bt2.bam Bc-05_S5_R1_001_trimmed.fq_trimmed_bismark_bt2.bam
Bc-01_S1_R1_001_trimmed.fq_trimmed_bismark_bt2_SE_report.txt Bc-05_S5_R1_001_trimmed.fq_trimmed_bismark_bt2_SE_report.txt
Bc-02_S2_R1_001_trimmed.fq_trimmed_bismark_bt2.bam Bc-06_S6_R1_001_trimmed.fq_trimmed_bismark_bt2.bam
Bc-02_S2_R1_001_trimmed.fq_trimmed_bismark_bt2_SE_report.txt Bc-06_S6_R1_001_trimmed.fq_trimmed_bismark_bt2_SE_report.txt
Bc-03_S3_R1_001_trimmed.fq_trimmed_bismark_bt2.bam log
Bc-03_S3_R1_001_trimmed.fq_trimmed_bismark_bt2_SE_report.txt Undetermined_S0_R1_001_trimmed.fq_trimmed_bismark_bt2.bam
Bc-04_S4_R1_001_trimmed.fq_trimmed_bismark_bt2.bam Undetermined_S0_R1_001_trimmed.fq_trimmed_bismark_bt2_SE_report.txt
Bc-04_S4_R1_001_trimmed.fq_trimmed_bismark_bt2_SE_report.txt
6. Samtools Transfer
Then I use Samtools to transfer all bam files into sam file. Again I used parallel here:
parallel --plus 'samtools view -h -o {...}.sam {...}.fq_trimmed_bismark_bt2.bam' ::: *.bam
7. Strip Bismark Result
Again here it comes to Nugen-specific time. We useNudup here, but before that we need a small script. (Is that common for big companies use small scripts for jobs?)
After downloading, I use linux command to make it work:
chmod +x ./strip_bismark_sam.sh
Then again, I use parallel command to run it:
parallel --plus './strip_bismark_sam.sh {...}.sam' ::: *.sam
To be honest I still don't know what is this step for...
8. Run Nudup
Finally I reach the final step, to run Nudup, I wrote a small loop script to run it:
Samples <- dir('../Index')
Samples <- sapply(Samples,function(x) paste(strsplit(x,split="_")[[1]][1:2],collapse="_"))
for(i in Samples) {
message('\\n-----------------------------------------------')
message('Runing ',i,' with Nudup now...')
cmd <- paste('python ./nudup.py -f ../Index/',i,'_I1_001.fastq.gz -o ',i,' ../Results/',i,'_R1_001_trimmed.sam_stripped.sam',sep='')
message(cmd)
system(cmd,ignore.stdout=TRUE)
}
It's really slow and no parallel... After a long time, I final get fully processed bam file.
NUDUP$ ls
Bc-01_S1_dup_log.txt Bc-02_S2.sorted.markdup.bam Bc-04_S4_dup_log.txt Bc-05_S5.sorted.markdup.bam Ready TMP3 TMP6
Bc-01_S1.sorted.markdup.bam Bc-03_S3_dup_log.txt Bc-04_S4.sorted.markdup.bam Bc-06_S6_dup_log.txt TMP TMP4
Bc-02_S2_dup_log.txt Bc-03_S3.sorted.markdup.bam Bc-05_S5_dup_log.txt Bc-06_S6.sorted.markdup.bam TMP2 TMP5
The 6 bam file in above folder is what I will use into future Downstream Analysis.