De novo transcriptome analysis

De novo transcriptome analysis

Bioinformatics 2020. 1. 15. 10:03

리눅스 (Ubuntu 16.0.4) 운영체제에서 분석한 내용을 토대로 작성한 글입니다.

(Illumina short read RNA-seq, Paired-end reads)

0. De novo transcriptome analysis

De novo transcriptome anlalysis 는 Trinity sortware [1] 를 활용해 봅시다.

필요한 sortware

- Trinity (version 2.2.0 기준으로 작성했습니다.)

- Bowtie2

- Samtools

- TransDecoder (version 3.0.0)

- BLAST

- CD-hit

- RSEM

- R

- R packages (edgeR, ctc, Biobase, ape, gplots 등)

PATH 설정을 해놓아야 합니다.

1. Concatenation of reads

De novo RNA-seq 분석을 하게 된다면 일반적으로 여러 samples (주로 multiple tissues) 을 가지고 시작하게 됩니다.

또한 genome 이 없다보니 genome 과 유사한 역할을 할 수 있는 transcriptome assembly 를 만들어야 합니다.

이때 comprehensive 한 transcriptome assembly 를 만들기 위해 여러 samples 의 RNA-seq data 를 concatenation 해야합니다.

Forward 끼리 Reverse 끼리 (Preprocessing 된 reads)

ex)

만약 sample 이

Brain_1.fastq

Brain_2.fastq

Liver_1.fastq

Liver_2.fastq

Testis_1.fastq

Testis_2.fastq

이런식으로 있다면

Brain_1.fastq, Liver_1.fastq, Testis_1.fastq 끼리

Brain_2.fastq, Liver_2.fastq, Testis_2.fastq 끼리

concatenation 하는 것 입니다.

Command

cat Brain_1.fastq Liver_1.fastq Testis_1.fastq >> Merged_tissues_1.fastq

cat Brain_2.fastq Liver_2.fastq Testis_2.fastq >> Merged_tissues_2.fastq

2. De novo assembly

이제 Trinity 를 활용하여 assembly 를 진행합니다.

Command

Trinity --seqType fq --left  Merged_tissues_1.fastq --right  Merged_tissues_2.fastq --output trinity_out --max_memory 100G --CPU 8

--seqType: reads format 을 지정합니다. (fq: fastq, fa: fasta)

--left, right: foward, reverse reads 를 지정합니다.

--output: 생성될 output directory 입니다. (trinity 단어가 들어가야합니다.)

--max_memory: assembly 과정중 할당할 최대 memory 입니다.

--CPU: 사용할 cpu 수 입니다.

Assembly 과정은 서버 spec 에 따라 다를 수 있지만 오래 걸리는 작업입니다.

완료되면 여러 파일들이 생성되는데 이때  Trinity.fasta 라는 파일이 assembly 된 transcriptome 입니다.

이 과정 이후에 statistics 를 구하고 싶으면 Trinity tool 의 util directory 내의 TrinityStats.pl 을 통해 할 수 있다.

TrinityStats.pl Trinity.fasta

결과)

################################

## Counts of transcripts, etc.

################################

Total trinity 'genes': 288406

Total trinity transcripts: 362522

Percent GC: 44.33

########################################

Stats based on ALL transcript contigs:

########################################

Contig N10: 5508

Contig N20: 3716

Contig N30: 2666

Contig N40: 1876

Contig N50: 1222

Median contig length: 332

Average contig: 679.15

Total assembled bases: 246207382

#####################################################

## Stats based on ONLY LONGEST ISOFORM per 'GENE':

#####################################################

Contig N10: 4576

Contig N20: 2774

Contig N30: 1715

Contig N40: 993

Contig N50: 633

Median contig length: 306

Average contig: 530.14

Total assembled bases: 152896963

Statistics 에서 isoforms 과 gene 을 구분하여 보여주게 되는데, 저는 transcripts 전체인 isoforms 을 가지고 이후 분석을 진행합니다.

(gene 은 isoforms 으로 묶여있는 cluster 중 가장 긴 transcripts 를 대표합니다. 그러나 transcript 의 길이가 gene prediction 하는데에는 큰 상관이 없기 때문에 isoforms 전체를 대상으로 prediction 합니다.)

3. Gene prediction

Assembled transcripts 의 protein coding genes 을 찾기 위해 TransDecoder 를 이용할 것임

1) 100 amino acids 이상의 ORF 포함하는 transcripts 만 뽑아내기

Command

TransDecoder.LongOrfs -t Trinity.fasta

-t: assembly 를 통해 생성된 Trinity.fasta 를 넣어주면 됩니다.

결과로는

longest_orfs.pep (이후 분석에 사용)

longest_orfs.gff3

longest_orfs.cds 등등이 생성됩니다.

2) Homology-based search

기존에 잘알려진 protein database (Swissprot DB) 에 homology search 를 하여 gene prediction 과정에 활용할 수 있다.

NCBI 의 BLAST program 을 사용하면 된다.

(2-1) 먼저 swissprot.fasta 파일을 download 받은후 blast db 형식으로 만들어준다.

Command

ncbi-blast-2.x.x+/bin/makeblastdb -in swissprot.fasta -dbtype prot

-in: db 형식으로 만들고자 하는 fasta 파일

-dbtype: fasta 파일이 nucleotide 또는 protein sequence 인지 (nucleotide: nucl, protein: prot)

(2-2) 만들어진 db 파일에 transcript assembly 를 homology search 한다.

Command

ncbi-blast-2.x.x+/bin/blastp -query  logest_orfs.pep  -db swissprot.fasta -max_target_seqs 1 -outfmt 6 -evalue 1e-5 -num_threads 8 >  blastp.outfmt6

-query: db 에 search 할 sequence 파일

-db: database

-max_target_seqs: 해당 query 가 matching 되는 sequence 중 몇개를 보여줄지에 대한 parameter

-outfmt: output 의 format 의 대한 정보 (0~11 까지 있으며 6 은 tabular format 이다.)

-evalue: blast evalue cutoff

-num_threads: 사용할 cpu 수

(2-3) Gene prediction

이제 blast 결과를 토대로 gene prediction 을 진행할 차례이다.

transcripts.transdecoder_dir 가 있는 directory 에서 진행하면 된다.

Command

TransDecoder.Predict -t  Trinity.fasta --retain_blastp_hits blastp.outfmt6  --cpu 8 --single_best_orf

-t: Trinity assembly 파일

--retain_blastp_hits: (2) 과정에서 얻은 blastp 결과 파일

--cpu:  사용할 cpu 수

--single_best_orf: gene prediction 과정에서 하나의 transcripts 에서 6개의 ORFs 를 탐색했을때 그중 가장 best orf 만 출력 한다.

결과로는 Trinity.fasta.transdecoder.pep 등등이 생성된다.

(2-4) Removing redundant transcripts

앞서 (2-1)~(2-3) 과정을 통해 gene prediction 은 완료되었으나, transcriptome 이다보니 불가피하게 isoforms 이 존재 할 것이다.

Reference genome 의 역할을 하기 위해서는 representative protein coding genes 만 있는게 좋으므로, redundant 한 transcripts 를 제거하기 위해 CD-hit 프로그램을 사용한다.

Command

cdhit -i  Trinity.fasta.transdecoder.pep  -o  Trinity.fasta.transdecoder.pep.cdhit  -c 0.99 -T 8

-i: TransDecoder 를 통해 geneprediction 이 된 fasta 파일

-o: otuput

-c: identity cutoff (0.99: 99% 비슷한 transcripts 의 cluster 중 가장 긴 transcript 선택)

-T:  사용할 cpu 수

결과파일로 Trinity.fasta.transdecoder.pep.cdhit,  Trinity.fasta.transdecoder.pep.cdhit.clstr 파일이 생성되며 .cdhit 파일은 sequence, .cdhit.clstr 파일에는 cluster 가 어떻게 묶였는지에 대한 정보가 담겨있다.

이를 통해 얻어진 assembly 파일을 Non-redundant protein coding sequences (NRCDS) 로 명명하고 진행한다.

4. Gene expression level quantification

NRCDS 에 reads 를 mapping 하고 mapped reads count 를 기반으로 expression level 을 estimation 한다.

1) NRCDS_Trinity.fasta 파일 생성

그렇기 위해서는 NRCDS 파일에 대응되는 nucleotide sequence 가 필요하므로 NRCDS 파일의 sequence id 를  de novo assembly 파일인 Trinity.fasta 에 matching 하여 nucleotide sequence 를 얻는다.

파이썬 script 를 통해서 할 수도 있고, samtools faidx 를 통해 진행 할 수 있을것이다.

2) Read alignment & Abundance estimation

이제 NRCDS_Trinity.fasta 에 reads 를 mapping 할 차례이다. 이는 Trinity 의 util 내의 align_and_estimate_abundance.pl script 를 사용하면된다.

여기서의 reads 는 concatenation 된 reads 가 아닌, 각각의 sample 의 reads 이다.

Command

Trinity tool path/util/align_and_estimate_abundance.pl --transcripts  NRCDS_Trinity.fasta --seqType fq --left brain_1.fastq --right brain_2.fastq --est_method RSEM --aln_method bowtie2 --prep_reference --output_dir brain_rsem_out --output_prefix brain_rsem --thread_count 8

--transcripts: mapping 하고자 하는 assembly 파일 (NRCDS_Trinity.fasta)

--seqType: reads format 을 지정합니다. (fq: fastq, fa: fasta)

--left, right: foward, reverse reads 를 지정합니다.

--est_method: abundance estimation 하는 method 설정 (RSEM을 사용한다.)

--aln_method: read alignment 하는 method 설정 (bowtie2를 사용한다.)

--prep_reference: assembly 파일을 indexing 하는 과정 (처음에 한 sample 돌릴때 했다면 다음 sample 부터는 빼도 됨)

--output_dir: 생성될 output directory 입니다.

--output_prefix: 생성될 output 파일 앞에 붙는 이름입니다.

--thread_count: 사용할 cpu 수 입니다.

결과로는 .RSEM.isoforms.results 를 얻을 수 있는데 각 transcripts 의 read count, TPM, FPKM 정보가 담겨있다.

각 sample 별로 분석을 진행하면 된다.

5. Differentially expressed genes (DEGs)

각 sample 간의 DEGs 를 확인하고자 진행한다.

1) Gene expression matrix 생성

각 sample 로 부터 얻어진 RSEM.isoforms.results 를 하나의 matrix 로 합친다.

이를 위해서는 Trinity util directory 의 abundance_estimates_to_matrix.pl 을 활용한다.

Command

Trinity tool path/util/abundance_estimates_to_matrix.pl --est_method RSEM --out_prefix ABC A.RSEM.isoforms.results B.RSEM.isoforms.results C.RSEM.isoforms.results

--est_method: abundance estimation 할때 썻던 method

--out_prefix: 생성될 output 파일 앞에 붙는 이름입니다.

각 sample 의 RSEM.isoforms.results 를 나열하면 된다.

결과로는 (outprefix).counts.matrix 를 포함한 여러 파일들이 생성됩니다.

2) Differential expression analysis

각 sample 의 발현값을 토대로 DE 분석을 진행한다.

여기서는 Trinity 의 Analysis/DifferentialExpression/run_DE_analysis.pl script 를 사용한다.

Command

Trinity tool path/Analysis/DifferentialExpression/run_DE_analysis.pl --matrix  counts.matrix  --method edgeR --output  edgeR_dir --dispersion 0.1

--matrix: 앞서 얻었던 couts.matrix 파일

--method: DE 분석에 사용할 method (R package 인 edgeR)

--output: output directory 이름

--dispersion: 같은 종의 sample 인 경우 0.1 을 사용한다. (edgeR manual 에 자세한 설명이 있으니 참고 바랍니다.)

여러 sample 인 경우에 모든 1:1 조합으로 비교하기 때문에 n combination 2 의 경우의 수가 나온다.

그러므로 많은 파일들이 생성될것이기 때문에 놀라지 말기 바람.

3) TMM normalization

각 sample 에서 얻은 gene expression level 을 normalization 하는 과정입니다.

(3-1) trasncript length 정보 준비

Command

cut -f 1,3,4 A.RSEM.isoforms.results >  Trinity.trans_lengths.txt

여기서 -f 는 1,3,4 번의 field (column) 만 추려내는 것으로 id, length, effective_length 정보를 준비한다.

(3-2) Normalization

RNA-seq normalization 방법중 하나인 TMM normalization 방법을 사용한다. (TMM: Trimmed mean of M-values)

Trinity 의 Analysis/DifferentialExpression/run_TMM_normalization_write_FPKM_matrix.pl script 를 사용한다.

Command

Trinity tool path/Analysis/DifferentialExpression/run_TMM_normalization_write_FPKM_matrix.pl --matrix  counts.matrix --length  Trinity.trans_lengths.txt

--matrix: 앞서 얻었던 couts.matrix 파일

--length: 앞서 얻었던 Trinity.trans_lengths.txt 파일

결과로  .counts.matrix.TMM_normalized.FPKM 파일이 생성된다.

4) Identifying DEGs

위에서 얻은  .counts.matrix.TMM_normalized.FPKM 파일을  edgeR_dir 로 이동시킨 후,

Trinity 의 Analysis/DifferentialExpression/analyze_diff_expr.pl script 를 사용한다.

Command

Trinity tool path/Analysis/DifferentialExpression/analyze_diff_expr.pl --matrix  .counts.matrix.TMM_normalized.FPKM -C 2 -P 0.001

--matrix: 앞서 얻었던 couts.matrix.TMM_normalized.FPKM 파일

-C: Log2 foldchange 로 2는 발현값이 4배 높거나 4배 낮은 기준 (DEGs cutoff)

-P: 통계값 corrected p-value (FDR) 로 0.001 은 0.001 미만의 기준이다. (DEGs cutoff)

결과로는 diffExpr 의 prefix 가 붙은 파일들이 생성되고 위의 기준 (-C, -P) 에 충족하는 DEGs 들이 모여있다.

여기서 log2.centered.dat 가 붙은 파일들이 있는데 이는 절대적인 발현값이 아닌 상대적인 발현값으로 계산한 정보이다.

ex)  log2.centered 값을 구하는 법

Actin 유전자의 A, B, C sample 의 발현값이 각각 100, 400, 250 이라고 할때,

1. 먼저 평균을 구한다.

(100 + 400 + 250) / 3 = 250

2. 각 sample 의 발현값과 평균을 log2 transformation 한다.

A: log2(100 + 1) = 6.66

B: log2(400 + 1) = 8.65

C: log2(250 + 1) = 7.97

Average: log2(250 + 1) = 7.97

여기서 1을 더해주는 경우는 만약 발현값이 0일 수도 있기 때문이다.

3. log2 transformation 된 sample 의 값에서 평균 값을 뺀다.

그렇게 되면 A, B, C 각각 -1.31, 0.68, 0 으로 상대적인 발현값을 구할 수 있다.

이는 낮게 발현하는 유전자와 높게 발현하는 유전자들에 의한 bias 를 줄일 수 있다.

6. Annotation

얻어진 유전자들에 대해 기능이 잘 알려진 database 에 homology-based search 를 통해 유사한 기능을 할것이라고 추측 할 수 있다.

사용할 수 있는 database 는 근연종이 genome 이 있다면 해당 genome data

그렇지 않다면 swissprot, KEGG, UniRef 등에 BLAST 를 통해 해당 유전자들의 기능을 유추 할 수 있을 것이다.

------------------------------------------------------------------------------------------------------------

References

[1] Brian J Haas et al., De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocol 2013

'Bioinformatics' 카테고리의 다른 글

Reference based transcriptome analysis  (0) 2020.01.15

Raw data preprocessing  (0) 2020.01.15
관련글 관련글 더보기
- Reference based transcriptome analysis
- Raw data preprocessing

ABOUT ME

1st quadrant 1st quadrant

리눅스 (Ubuntu 16.0.4) 운영체제에서 분석한 내용을 토대로 작성한 글입니다.

0. De novo transcriptome analysis

1. Concatenation of reads

'Bioinformatics' 카테고리의 다른 글

티스토리툴바

Reference based transcriptome analysis (0)	2020.01.15
Raw data preprocessing (0)	2020.01.15

ABOUT ME

리눅스 (Ubuntu 16.0.4) 운영체제에서 분석한 내용을 토대로 작성한 글입니다.

0. De novo transcriptome analysis

1. Concatenation of reads

'Bioinformatics' 카테고리의 다른 글

관련글 관련글 더보기

티스토리툴바