November 18, 2021

Analysis RRBS with GemBS

RRBS
Pipeline

So I am working on a RRBS data set, I want to use GemBS to analysis it. First step is to install GemBS in the Ubuntu system. It looks easy but indeed have something worth recording.

1. Install GemBS3

Following the instruction, I started the installation.

First error I encountered is:

/usr/local/cuda/bin/nvcc -O3 -m64 -Xptxas="-dlcm=ca" -gencode arch=compute_20,code=\"sm_20,compute_20\" -gencode arch=compute_20,code=sm_21 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=\"sm_35,compute_35\" -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=\"sm_50,compute_50\" -gencode arch=compute_52,code=\"sm_52,compute_52\" -gencode arch=compute_60,code=\"sm_60,compute_60\" -gencode arch=compute_61,code=\"sm_61,compute_61\" -gencode arch=compute_62,code=\"sm_62,compute_62\" -c src/gpu_fmi_decode.cu -o build/gpu_fmi_decode.o
nvcc fatal : Unsupported gpu architecture 'compute_20'

It's a error related with CUDA platform, after google online, I found the the solution is to manually change Makefile: Enter folder tools/gem3-mapper/resources/gem-cutter, then I use vim to change Makefile. Comment the first two line of code.

## Force to test JIT compiler:
## export CUDA_FORCE_PTX_JIT=1
# CUDA_SASS_FLAG_20=-gencode arch=compute_20,code=\"sm_20,compute_20\" -gencode arch=compute_20,code=sm_21
# CUDA_SASS_FLAG_30=$(CUDA_SASS_FLAG_20) -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=\"sm_35,compute_35\"
CUDA_SASS_FLAG_37=$(CUDA_SASS_FLAG_30) -gencode arch=compute_37,code=sm_37
CUDA_SASS_FLAG_50=$(CUDA_SASS_FLAG_37) -gencode arch=compute_50,code=\"sm_50,compute_50\"
CUDA_SASS_FLAG_52=$(CUDA_SASS_FLAG_50) -gencode arch=compute_52,code=\"sm_52,compute_52\"
CUDA_SASS_FLAG_60=$(CUDA_SASS_FLAG_52) -gencode arch=compute_60,code=\"sm_60,compute_60\" -gencode arch=compute_61,code=\"sm_61,compute_61\" -gencode arch=compute_62,code=\"sm_62,compute_62\"

Then I can start installation:

python3 setup.py install --prefix=/my/install/path/ --exec-prefix=/my/install/path/

Above command could help me to install GemBS into any folder without root, and will not mess up whole system's software eco-system. Finally, add the path into .bashrc for future usage.

2. Install GemBS4

Another option is to install GemBS4.

git clone --recursive https://github.com/heathsc/gemBS-rs.git
cd gemBS-rs

Then I modified gemBS_config.mk.in file to my own install path. Also, I need to fix the CUDA error in .../gemBS-rs/c_tools/gem3-mapper/resources/gem-cutter/Makefile, commend the compute_20 and compute_30. Then run:

make gemBS_config.mk
make install

Also, I need to add the path into my .bashrc. However, after installation, I found actually GemBS-rs have some problem in Samtools, missing hts_test_feature, so for the analysis, I still use GemBS 3.

2. GemBS Prepare

The file I have are trimmed fastq files, so I can start to use GemBS since here. According to the guild, I should prepare two things below. Also, I need to prepare the genome for mapping.

  • Sample metadata (in the metadata CSV file)
  • Pipeline parameters (in the pipeline configuration file)

So I composed a SampleSheet.csv file in the folder, only listed file_id, fastq file information .etc since I currently don't know the phenotype. Note that phenotype can be added on the last column.

> knitr::kable(SampleSheet)

|Barcode   |file_id   |end_1              |end_2              |
|:---------|:---------|:------------------|:------------------|
|H78_0001  |H78_0001  |H78_0001_R1.fq.gz  |H78_0001_R2.fq.gz  |
|H78_0008  |H78_0008  |H78_0008_R1.fq.gz  |H78_0008_R2.fq.gz  |
|H78_0009  |H78_0009  |H78_0009_R1.fq.gz  |H78_0009_R2.fq.gz  |
|H78_0010  |H78_0010  |H78_0010_R1.fq.gz  |H78_0010_R2.fq.gz  |
|H78_0011  |H78_0011  |H78_0011_R1.fq.gz  |H78_0011_R2.fq.gz  |
|H78_0012  |H78_0012  |H78_0012_R1.fq.gz  |H78_0012_R2.fq.gz  |
|H78_0013  |H78_0013  |H78_0013_R1.fq.gz  |H78_0013_R2.fq.gz  |
|H78_0014  |H78_0014  |H78_0014_R1.fq.gz  |H78_0014_R2.fq.gz  |
|H78_0015  |H78_0015  |H78_0015_R1.fq.gz  |H78_0015_R2.fq.gz  |
|I18_1_001 |I18_1_001 |I18_1_001_R1.fq.gz |I18_1_001_R2.fq.gz |
|I18_1_002 |I18_1_002 |I18_1_002_R1.fq.gz |I18_1_002_R2.fq.gz |
...

Then I prepare config file. Firstly I downloaded original file from official github. Then modify it a bit, just add the path of my fq files, and reference file hg38.fa.gz I downloaded from USCS.

base = ./result/

reference = ./hg38.fa.gz
index_dir = indexes

sequence_dir = /scratch1/cansor/analysis_nov21/trimmed_fastq    # @SAMPLE and @BARCODE are special
bam_dir = ${base}/mapping/@BARCODE      # variables that are replaced with
bcf_dir = ${base}/calls/@BARCODE        # the sample name or barcode being
extract_dir = ${base}/extract/@BARCODE  # worked on during gemBS operation
report_dir = ${base}/report

# General project info
project = Cansor
species = Human

# Default parameters
threads = 40
jobs = 4

... # All the rest are unmodified.

3. Run GemBS

After above preperation, we can start to run GmeBS, it's super easy, first step is to prepare:

gemBS prepare -c IHEC_RRBS.conf -t SampleSheet.csv

Firstly I need to index the reference genome:

gemBS index

Then, start mapping. Mapping in gemBS is performed using GEM3 in bisulfite mode. After I run it, each bam file will be placed in the mapping folder.

gemBS map

Then start calling methylation status with bs_call. It take account of the sequence quality scores, under and over conversion probabilities and the observed bases, bs_call finds the most likely genotype (maximizing the likelihood over the unknown methylation parameter for each genotype) and then reports the methylation conditional on the called genotype. The output are *.bcf files in calls folder.

gemBS call

We can use bcftools to check those files, bcftools view I18_1_003.bcf. I am not sure what are all column means, need to read the tutorial better in the future.

chr8_KI270900v1_alt	250389	.	C	.	20	PASS	CX=GACTC	GT:FT:DP:MQ:GQ:QD:GL:MC8:AMQ:CS:CG:CX	0/0:PASS:8:60:20:2:-0.00380238:0,8,0,0,0,0,0,0:37:+:N:GACTC
chr8_KI270900v1_alt	250391	.	C	.	20	PASS	CX=CTCGG	GT:FT:DP:MQ:GQ:QD:GL:MC8:AMQ:CS:CG:CX	0/0:PASS:8:60:20:2:-0.00380238:0,8,0,0,0,0,0,0:37:+:C:CTCGG
chr8_KI270900v1_alt	250392	.	G	.	20	PASS	CX=TCGGG	GT:FT:DP:MQ:GQ:QD:GL:MC8:AMQ:CS:CG:CX	0/0:PASS:0:60:20:20:-0.00380249:0,0,0,0,0,0,8,0:37:-:C:TCGGG
chr8_KI270900v1_alt	250393	.	G	.	2	fail	CX=CGGGC	GT:FT:DP:MQ:GQ:QD:GL:MC8:AMQ:CS:CG:CX	0/0:q20:0:60:2:2:-0.367877:0,0,0,0,8,0,0,0:37:-:N:CGGGC
chr8_KI270900v1_alt	250394	.	G	.	2	fail	CX=GGGCA	GT:FT:DP:MQ:GQ:QD:GL:MC8:AMQ:CS:CG:CX	0/0:q20:0:60:2:2:-0.367877:0,0,0,0,8,0,0,0:37:-:N:GGGCA
chr8_KI270900v1_alt	250395	.	C	.	20	PASS	CX=GGCAT	GT:FT:DP:MQ:GQ:QD:GL:MC8:AMQ:CS:CG:CX	0/0:PASS:8:60:20:2:-0.00380238:0,8,0,0,0,0,0,0:37:+:N:GGCAT
chr8_KI270900v1_alt	250398	.	G	.	2	fail	CX=ATGAG	GT:FT:DP:MQ:GQ:QD:GL:MC8:AMQ:CS:CG:CX	0/0:q20:0:60:2:2:-0.367877:0,0,0,0,8,0,0,0:37:-:N:ATGAG
chr8_KI270900v1_alt	250400	.	G	.	2	fail	CX=GAGAC	GT:FT:DP:MQ:GQ:QD:GL:MC8:AMQ:CS:CG:CX	0/0:q20:0:60:2:2:-0.367877:0,0,0,0,8,0,0,0:37:-:N:GAGAC
chr8_KI270900v1_alt	250402	.	C	.	20	PASS	CX=GACCC	GT:FT:DP:MQ:GQ:QD:GL:MC8:AMQ:CS:CG:CX	0/0:PASS:8:60:20:2:-0.00380238:0,8,0,0,0,0,0,0:37:+:N:GACCC
chr8_KI270900v1_alt	250403	.	C	.	20	PASS	CX=ACCCC	GT:FT:DP:MQ:GQ:QD:GL:MC8:AMQ:CS:CG:CX	0/0:PASS:8:60:20:2:-0.00380238:0,8,0,0,0,0,0,0:37:+:N:ACCCC
chr8_KI270900v1_alt	250404	.	C	.	20	PASS	CX=CCCCA	GT:FT:DP:MQ:GQ:QD:GL:MC8:AMQ:CS:CG:CX	0/0:PASS:8:60:20:2:-0.0038029:0,8,0,0,0,0,0,0:36:+:N:CCCCA
chr8_KI270900v1_alt	250405	.	C	.	20	PASS	CX=CCCAT	GT:FT:DP:MQ:GQ:QD:GL:MC8:AMQ:CS:CG:CX	0/0:PASS:8:60:20:2:-0.00380238:0,8,0,0,0,0,0,0:37:+:N:CCCAT
chr8_KI270900v1_alt	250410	.	C	.	20	PASS	CX=TACGG	GT:FT:DP:MQ:GQ:QD:GL:MC8:AMQ:CS:CG:CX	0/0:PASS:8:60:20:2:-0.00380238:0,8,0,0,0,0,0,0:37:+:C:TACGG
chr8_KI270900v1_alt	250411	.	G	.	20	PASS	CX=ACGGG	GT:FT:DP:MQ:GQ:QD:GL:MC8:AMQ:CS:CG:CX	0/0:PASS:0:60:20:20:-0.00380249:0,0,0,0,0,0,8,0:37:-:C:ACGGG
chr8_KI270900v1_alt	250412	.	G	.	2	fail	CX=CGGGC	GT:FT:DP:MQ:GQ:QD:GL:MC8:AMQ:CS:CG:CX	0/0:q20:0:60:2:2:-0.367877:0,0,0,0,8,0,0,0:36:-:N:CGGGC
chr8_KI270900v1_alt	250413	.	G	.	2	fail	CX=GGGCC	GT:FT:DP:MQ:GQ:QD:GL:MC8:AMQ:CS:CG:CX	0/0:q20:0:60:2:2:-0.367877:0,0,0,0,8,0,0,0:37:-:N:GGGCC
chr8_KI270900v1_alt	250414	.	C	.	20	PASS	CX=GGCCT	GT:FT:DP:MQ:GQ:QD:GL:MC8:AMQ:CS:CG:CX	0/0:PASS:8:60:20:2:-0.00380238:0,8,0,0,0,0,0,0:37:+:N:GGCCT

Then extract methylation data.

gemBS extract

Copyright © Yuan Tian 2023.