Last updated: 2024.01.10
| Overview
What is DADA2?
An R package for microbiome analysis, comparable to QIIME 2 (which is Linux-based).
Goal of this tutorial?
Learn the basics of ITS analysis with DADA2!
(Previous posts: Installing DADA2, Amplicon analysis with DADA2)
| ITS sequence analysis
This tutorial is the ITS version of the dada2 1.18 tutorial (the current latest version, 1.26, also works).
The input files are demultiplexed paired-end fastq files from an Illumina sequencer, and the output is an amplicon sequence variant (ASV) table. In addition, we will assign taxonomy to each ASV using the UNITE database.
The key step that differs from the (16S) amplicon workflow is primer removal (make sure you understand why this step is needed).
| What is ITS1 (Internal Transcribed Spacer 1)?
ITS1 is Internal Transcribed Spacer 1, a region within the eukaryotic rRNA transcription unit. Variation in ITS1 makes it possible to tell species apart.
As shown in the figure above, the UNITE database holds sequence data from ITS1 through the 25S region, so it can be used for classification based not only on ITS1 but also on ITS2 and 25S.
Unlike the variable regions of 16S rRNA, which have fairly uniform lengths, ITS1 and ITS2 are roughly 200-600 bp long, and this length varies widely across organisms.
This length variability of the ITS region has important consequences for the filtering and trimming steps of the dada2 workflow.
First, reads are not truncated to a fixed length as in the 16S workflow (the truncLen option is not used).
Second, as shown in the image below, when the ITS region is shorter than the read length (R1, R2 in the image), removing the primer regions becomes a non-trivial problem.
In the standard 16S workflow, primers can be removed simply by trimming a fixed number of bases from the start of each read (filterAndTrim(..., trimLeft = c(FWD_PRIMER_LEN, REV_PRIMER_LEN))). In the ITS region, however, such simple trimming is not enough, so we will remove the primers with the cutadapt tool.
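For reference, a minimal sketch of that 16S-style call is shown below; it is only for contrast and is not used in this ITS workflow (FWD_PRIMER_LEN and REV_PRIMER_LEN are assumed to hold the primer lengths).
# 16S-style primer removal (NOT used here): trims a fixed number of bases from the 5' end only
# filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft = c(FWD_PRIMER_LEN, REV_PRIMER_LEN))
Because this only trims the 5' end, it cannot remove the reverse-complemented opposite primer that read-through leaves at the 3' end of short ITS amplicons, which is why cutadapt is used below.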
| Things to check before starting
- Samples must be demultiplexed, and for paired-end data the forward and reverse files must contain the same number of reads.
- Once everything is ready, let's start the DADA2 analysis below.
| 1. Setting up the analysis
Preparing the packages
library(dada2)
packageVersion("dada2")
library(ShortRead)
packageVersion("ShortRead")
library(Biostrings)
packageVersion("Biostrings")
# [1] '1.8.0'
# [1] '1.36.1'
# [1] '2.46.0'
Any versions at or above these should work without major problems.
Preparing the data
- Download link: https://www.ebi.ac.uk/ena/data/view/PRJNA377530
- Number of samples: 61 (122 files counting forward and reverse)
- The samples are an ITS mock community from the USDA Agricultural Research Service (CFMR) culture collection, sequenced on an Illumina MiSeq using the version 3 (600-cycle) reagent kit.
Adjust the path of the downloaded files to match your own computer.
path <- "./input/ENA_Project" # 본인의 분석 경로에 따라 변경
list.files(path)
Looking at the output, you can see the forward reads, named in the form SAMPLENAME_1.fastq.gz, and the reverse reads, named SAMPLENAME_2.fastq.gz.
fnFs <- sort(list.files(path, pattern = "_1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern = "_2.fastq.gz", full.names = TRUE))
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
write.table(sample.names, "./input/ENA_Project/name_list.txt", sep = "\t", row.names = F, col.names = F, quote = F)
Save the file names separately as name_list; it will be used later in the Cutadapt step.
| 2. Checking the primers
The BITS3 (forward) and B58S3 (reverse) primers were used to amplify these samples. The primer sequences are as follows.
FWD <- "ACCTGCGGARGGATCA"
REV <- "GAGATCCRTTGYTRAAAGTT"
Primer sequences differ between experiments, so change these when other primers were used.
If the primer sequences used in the experiment have been identified correctly, that is enough to proceed. Still, to guard against possible errors in the primer sequences, let's match them against the sequenced reads, which first requires generating the complementary sequences.
These can be created with complement() and reverseComplement() from the Biostrings package.
allOrients <- function(primer) {
# Create all orientations of the input sequence
require(Biostrings)
dna <- DNAString(primer) # Biostrings works with DNAString objects rather than character vectors
orients <- c(Forward = dna, Complement = complement(dna), Reverse = reverse(dna),
RevComp = reverseComplement(dna))
return(sapply(orients, toString)) # Convert back to character vector
}
# Generate the complement, reverse, and reverse-complement sequences
Using the function above, convert the forward and reverse primers into all orientations.
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
FWD.orients
# Forward Complement Reverse RevComp
# "ACCTGCGGARGGATCA" "TGGACGCCTYCCTAGT" "ACTAGGRAGGCGTCCA" "TGATCCYTCCGCAGGT"
REV.orients
# Forward Complement Reverse RevComp
# "GAGATCCRTTGYTRAAAGTT" "CTCTAGGYAACRAYTTTCAA" "TTGAAARTYGTTRCCTAGAG" "AACTTTYARCAAYGGATCTC"
First, perform a pre-filtering step to remove reads containing ambiguous bases (Ns). Only Ns are removed here; no other filtering options are applied.
fnFs.filtN <- file.path(path, "filtN", basename(fnFs)) # Put N-filtered files in the filtN/ subdirectory
fnRs.filtN <- file.path(path, "filtN", basename(fnRs))
# filterAndTrim
filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN,
maxN = 0,
multithread = TRUE # not supported on Windows
)
Now let's check how often the primer sequences are found in the filtered reads.
primerHits <- function(primer, fn) {
# Counts number of reads in which the primer is found
nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
return(sum(nhits > 0))
}
Apply this function to the forward and reverse reads of the first sample.
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[1]]),
FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.filtN[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.filtN[[1]]),
REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.filtN[[1]]))
# Forward Complement Reverse RevComp
# FWD.ForwardReads 4214 0 0 0
# FWD.ReverseReads 0 0 0 3743
# REV.ForwardReads 0 0 0 3590
# REV.ReverseReads 4200 0 0 0
As expected, the FWD primer (forward primer) is found only in the forward reads, and REV only in the reverse reads.
(If there are many hits in the RevComp column, replace REV with REV.orients[["RevComp"]].)
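A minimal sketch of that swap (only needed if the RevComp orientation had many hits, which is not the case for this dataset):
# REV <- REV.orients[["RevComp"]]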
| 3. Removing the primer sequences
Let's remove these primers with a tool dedicated to primer/adapter removal. We will use a program called Cutadapt.
- Cutadapt guide: https://cutadapt.readthedocs.io/en/stable/index.html
- Installing on Linux: https://bio-kcs.tistory.com/entry/Cutadapt-01Linux%EC%97%90-Cutadapt-42-ver-%EC%84%A4%EC%B9%98%ED%95%98%EA%B8%B0
If your R environment is a Linux-based server, it is best to install the Linux version and follow the original dada2 tutorial.
cutadapt <- "/Users/nbetrap/miniconda2/bin/cutadapt" # path to the installed cutadapt; change to match your own machine
system2(cutadapt, args = "--version") # Run shell commands from R
path.cut <- file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut)
fnFs.cut <- file.path(path.cut, basename(fnFs))
fnRs.cut <- file.path(path.cut, basename(fnRs))
FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste("-g", FWD, "-a", REV.RC)
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste("-G", REV, "-A", FWD.RC)
# Create an empty log file
system2("echo", args = " '' > ./output.txt")
# Run Cutadapt
for(i in seq_along(fnFs)) {
system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2, # -n 2 required to remove FWD and REV from reads
"-o", fnFs.cut[i], "-p", fnRs.cut[i],
fnFs.filtN[i], fnRs.filtN[i], # input files
">> ./output.txt")) # output files
}
# Summarize the results
pass <- system("grep 'passing filters' ./output.txt | cut -f3 -d '(' | tr -d ')'", intern = TRUE)
filt <- system("grep 'filtered' ./output.txt | cut -f3 -d '(' | tr -d ')'", intern = TRUE)
output_summary <- data.frame(`name` = sample.names,
`passing filters` = pass,
`filtered` = filt)
The above runs Cutadapt installed on Linux from R via the system2() function.
Below is adapter trimming with Cutadapt installed on Windows, run from the Terminal tab next to the R console. We will use the name_list.txt file created earlier.
# Cutadapt
for /F "tokens=*" %f in (name_list.txt) do py -m cutadapt -g ACCTGCGGARGGATCA -a AACTTTYARCAAYGGATCTC -G GAGATCCRTTGYTRAAAGTT -A TGATCCYTCCGCAGGT -o "cutadapt/%f_1.fastq.gz" -p "cutadapt/%f_2.fastq.gz" -n 2 "filtN/%f_1.fastq.gz" "filtN/%f_2.fastq.gz" >> cutadapt_result.txt
# Summarize the results
grep "passing filters" cutadapt_result.txt | cut -f3 -d "(" | tr -d ")" > file1.txt
grep "filtered" cutadapt_result.txt | cut -f3 -d "(" | tr -d ")" > file2.txt
paste file1.txt file2.txt > Output.txt
Now that the primer removal step is done, searching these samples again with the primerHits() function should return no hits.
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]),
FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[1]]),
REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))
# Forward Complement Reverse RevComp
# FWD.ForwardReads 0 0 0 0
# FWD.ReverseReads 0 0 0 0
# REV.ForwardReads 0 0 0 0
# REV.ReverseReads 0 0 0 0
Checking again, we can see that all of the primers have been removed.
The primer-free files are now ready to be analyzed with the DADA2 pipeline. The remaining steps are similar to the (16S) amplicon workflow.
| 4. Checking read quality
# Forward and reverse fastq filenames have the format:
cutFs <- sort(list.files(path.cut, pattern = "_1.fastq.gz", full.names = TRUE))
cutRs <- sort(list.files(path.cut, pattern = "_2.fastq.gz", full.names = TRUE))
# Extract sample names, assuming filenames have format:
get.sample.name <- function(fname) strsplit(basename(fname), "_")[[1]][1]
sample.names <- unname(sapply(cutFs, get.sample.name))
head(sample.names)
plotQualityProfile(cutFs[1:2])
The figure above is a quality profile plot: at each base position (x-axis) the mean quality score is drawn as a green line and the quartiles as orange lines.
The forward reads look very good. The red line shows that a sizable fraction of reads were truncated to 150 nt. In the later steps, unlike the 16S tutorial workflow, we do not truncate to a fixed length, so as to minimize the loss of ITS regions of varying length.
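The reverse reads can be checked the same way; a quick sketch:
plotQualityProfile(cutRs[1:2]) # quality profile of the first two reverse-read files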
| 5. Filter and trim
filtFs <- file.path(path.cut, "filtered", basename(cutFs))
filtRs <- file.path(path.cut, "filtered", basename(cutRs))
out <- filterAndTrim(cutFs, filtFs,
cutRs, filtRs,
maxN = 0, # discard reads containing any Ns
maxEE = c(2, 2),
truncQ = 2, # truncate each read at the first base with quality score <= 2
minLen = 50, # minimum read length in nt
rm.phix = TRUE, # remove phiX bacteriophage sequences
compress = TRUE, # write the output as compressed fastq.gz
multithread = FALSE) # on Windows, set multithread = FALSE
head(out)
# reads.in reads.out
# SRR5314314_1.fastq.gz 6202 5658
# SRR5314315_1.fastq.gz 12325 10985
# SRR5314316_1.fastq.gz 16006 14046
# SRR5314317_1.fastq.gz 11801 10423
# SRR5314331_1.fastq.gz 17399 15213
# SRR5314332_1.fastq.gz 42604 36742
- truncQ (truncation quality score): default 2; each read is truncated at the first base with a quality score less than or equal to this value.
- minLen (minimum length): reads shorter than this (after trimming and truncation) are removed; the default is 20, and here it is set to 50.
- maxEE (maximum expected errors): reads with more expected errors than maxEE after truncation are discarded; the default is Inf (no EE filtering), and here we use c(2, 2). The expected errors are calculated from the quality scores as:
$$ EE = \sum 10^{-Q/10} $$
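A small worked example of this calculation in R (with a made-up quality vector):
# Expected errors for a toy read with Phred quality scores 30, 30, 20, 10
q <- c(30, 30, 20, 10)
sum(10^(-q / 10))
# [1] 0.112  -> well under maxEE = 2, so this read would be kept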
| 6. Learn the Error Rates
errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)
plotErrors(errF, nominalQ = TRUE)
Based on the sampled reads, this step estimates the probability that each base is read as another base (the error rates). If the estimated rates follow a trend similar to the red line (the error rates expected from the quality scores alone), the error model is considered fine and we move on.
| 7. Dereplication
derepFs <- derepFastq(filtFs, verbose = TRUE)
derepRs <- derepFastq(filtRs, verbose = TRUE)
# Name the derep-class objects by the sample names
names(derepFs) <- sample.names
names(derepRs) <- sample.names
Identical reads are collapsed into unique sequences together with their abundances, which reduces the computation in the following steps; this is especially helpful on slower machines.
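As a quick check (a minimal sketch), you can see how strongly dereplication collapsed the first sample:
length(derepFs[[1]]$uniques) # number of unique sequences
sum(derepFs[[1]]$uniques)    # total number of reads they represent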
| 8. Sample Inference
- The denoising step (see the 16S tutorial for details).
dadaFs <- dada(derepFs, err = errF, multithread = TRUE)
# Sample 1 - 5658 reads in 775 unique sequences.
# Sample 2 - 10985 reads in 1839 unique sequences.
# Sample 3 - 14046 reads in 2327 unique sequences.
# Sample 4 - 10423 reads in 1963 unique sequences.
# Sample 5 - 15213 reads in 2287 unique sequences.
# Sample 6 - 36742 reads in 5874 unique sequences.
# Sample 7 - 25570 reads in 3370 unique sequences.
# Sample 8 - 27714 reads in 3657 unique sequences.
# Sample 9 - 32045 reads in 3970 unique sequences.
# Sample 10 - 39414 reads in 6216 unique sequences.
# Sample 11 - 30257 reads in 4577 unique sequences.
# Sample 12 - 27993 reads in 4159 unique sequences.
# Sample 13 - 22950 reads in 3419 unique sequences.
# Sample 14 - 55781 reads in 5782 unique sequences.
# Sample 15 - 55788 reads in 5510 unique sequences.
# Sample 16 - 21981 reads in 2619 unique sequences.
# Sample 17 - 41275 reads in 5400 unique sequences.
# Sample 18 - 40151 reads in 4664 unique sequences.
# Sample 19 - 46942 reads in 5222 unique sequences.
# Sample 20 - 23311 reads in 2694 unique sequences.
# Sample 21 - 18902 reads in 2759 unique sequences.
# Sample 22 - 20918 reads in 2539 unique sequences.
# Sample 23 - 80377 reads in 7416 unique sequences.
# Sample 24 - 37812 reads in 3707 unique sequences.
# Sample 25 - 35665 reads in 3596 unique sequences.
# Sample 26 - 31147 reads in 3461 unique sequences.
# Sample 27 - 37460 reads in 4373 unique sequences.
# Sample 28 - 22458 reads in 2876 unique sequences.
# Sample 29 - 16228 reads in 3109 unique sequences.
# Sample 30 - 44136 reads in 4547 unique sequences.
# Sample 31 - 30970 reads in 3558 unique sequences.
# Sample 32 - 47614 reads in 16535 unique sequences.
# Sample 33 - 39554 reads in 14747 unique sequences.
# Sample 34 - 42266 reads in 16695 unique sequences.
# Sample 35 - 15178 reads in 1195 unique sequences.
# Sample 36 - 30347 reads in 2048 unique sequences.
# Sample 37 - 28106 reads in 2392 unique sequences.
# Sample 38 - 48407 reads in 7520 unique sequences.
# Sample 39 - 68671 reads in 8049 unique sequences.
# Sample 40 - 60984 reads in 12437 unique sequences.
# Sample 41 - 55777 reads in 13145 unique sequences.
# Sample 42 - 27950 reads in 1859 unique sequences.
# Sample 43 - 40954 reads in 3041 unique sequences.
# Sample 44 - 60516 reads in 22591 unique sequences.
# Sample 45 - 39329 reads in 18182 unique sequences.
# Sample 46 - 50685 reads in 7728 unique sequences.
# Sample 47 - 72010 reads in 9758 unique sequences.
# Sample 48 - 21509 reads in 1798 unique sequences.
# Sample 49 - 68006 reads in 8628 unique sequences.
# Sample 50 - 60518 reads in 9291 unique sequences.
# Sample 51 - 52006 reads in 18077 unique sequences.
# Sample 52 - 1389 reads in 345 unique sequences.
# Sample 53 - 22 reads in 19 unique sequences.
# Sample 54 - 67123 reads in 30166 unique sequences.
# Sample 55 - 15 reads in 15 unique sequences.
# Sample 56 - 65743 reads in 28297 unique sequences.
# Sample 57 - 73133 reads in 30094 unique sequences.
# Sample 58 - 40745 reads in 5397 unique sequences.
# Sample 59 - 20698 reads in 3072 unique sequences.
# Sample 60 - 57878 reads in 16127 unique sequences.
# Sample 61 - 37287 reads in 6059 unique sequences.
dadaRs <- dada(derepRs, err = errR, multithread = TRUE)
# Sample 1 - 5658 reads in 1245 unique sequences.
# Sample 2 - 10985 reads in 3609 unique sequences.
# Sample 3 - 14046 reads in 5255 unique sequences.
# Sample 4 - 10423 reads in 3958 unique sequences.
# Sample 5 - 15213 reads in 2920 unique sequences.
# Sample 6 - 36742 reads in 5561 unique sequences.
# Sample 7 - 25570 reads in 4889 unique sequences.
# Sample 8 - 27714 reads in 4640 unique sequences.
# Sample 9 - 32045 reads in 5016 unique sequences.
# Sample 10 - 39414 reads in 5362 unique sequences.
# Sample 11 - 30257 reads in 5128 unique sequences.
# Sample 12 - 27993 reads in 4663 unique sequences.
# Sample 13 - 22950 reads in 4211 unique sequences.
# Sample 14 - 55781 reads in 7260 unique sequences.
# Sample 15 - 55788 reads in 6457 unique sequences.
# Sample 16 - 21981 reads in 3177 unique sequences.
# Sample 17 - 41275 reads in 4038 unique sequences.
# Sample 18 - 40151 reads in 4824 unique sequences.
# Sample 19 - 46942 reads in 5206 unique sequences.
# Sample 20 - 23311 reads in 2510 unique sequences.
# Sample 21 - 18902 reads in 1948 unique sequences.
# Sample 22 - 20918 reads in 2441 unique sequences.
# Sample 23 - 80377 reads in 9005 unique sequences.
# Sample 24 - 37812 reads in 4887 unique sequences.
# Sample 25 - 35665 reads in 4479 unique sequences.
# Sample 26 - 31147 reads in 4257 unique sequences.
# Sample 27 - 37460 reads in 4205 unique sequences.
# Sample 28 - 22458 reads in 2736 unique sequences.
# Sample 29 - 16228 reads in 1876 unique sequences.
# Sample 30 - 44136 reads in 5292 unique sequences.
# Sample 31 - 30970 reads in 3421 unique sequences.
# Sample 32 - 47614 reads in 21442 unique sequences.
# Sample 33 - 39554 reads in 17281 unique sequences.
# Sample 34 - 42266 reads in 20027 unique sequences.
# Sample 35 - 15178 reads in 3082 unique sequences.
# Sample 36 - 30347 reads in 5709 unique sequences.
# Sample 37 - 28106 reads in 4750 unique sequences.
# Sample 38 - 48407 reads in 10760 unique sequences.
# Sample 39 - 68671 reads in 11111 unique sequences.
# Sample 40 - 60984 reads in 17368 unique sequences.
# Sample 41 - 55777 reads in 17100 unique sequences.
# Sample 42 - 27950 reads in 4976 unique sequences.
# Sample 43 - 40954 reads in 8196 unique sequences.
# Sample 44 - 60516 reads in 26382 unique sequences.
# Sample 45 - 39329 reads in 22230 unique sequences.
# Sample 46 - 50685 reads in 11324 unique sequences.
# Sample 47 - 72010 reads in 14012 unique sequences.
# Sample 48 - 21509 reads in 3553 unique sequences.
# Sample 49 - 68006 reads in 13746 unique sequences.
# Sample 50 - 60518 reads in 13581 unique sequences.
# Sample 51 - 52006 reads in 21906 unique sequences.
# Sample 52 - 1389 reads in 760 unique sequences.
# Sample 53 - 22 reads in 19 unique sequences.
# Sample 54 - 67123 reads in 35506 unique sequences.
# Sample 55 - 15 reads in 15 unique sequences.
# Sample 56 - 65743 reads in 33305 unique sequences.
# Sample 57 - 73133 reads in 36194 unique sequences.
# Sample 58 - 40745 reads in 9883 unique sequences.
# Sample 59 - 20698 reads in 5411 unique sequences.
# Sample 60 - 57878 reads in 20965 unique sequences.
# Sample 61 - 37287 reads in 10724 unique sequences.
| 9. Merge paired reads
- For paired-end data, the forward and reverse reads are merged.
- However, many ITS1 analyses use only the higher-quality forward reads. In practice, analyzing only the forward reads often yields more ASVs than merging (see the sketch after the sequence table step below).
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
# 5548 paired-reads (in 58 unique pairings) successfully merged out of 5590 (in 69 pairings) input.
# 10462 paired-reads (in 204 unique pairings) successfully merged out of 10696 (in 247 pairings) input.
# 13806 paired-reads (in 334 unique pairings) successfully merged out of 13893 (in 369 pairings) input.
# 10217 paired-reads (in 299 unique pairings) successfully merged out of 10318 (in 327 pairings) input.
# 15085 paired-reads (in 17 unique pairings) successfully merged out of 15207 (in 22 pairings) input.
# 34971 paired-reads (in 18 unique pairings) successfully merged out of 36721 (in 26 pairings) input.
# 24877 paired-reads (in 19 unique pairings) successfully merged out of 25548 (in 25 pairings) input.
# 27017 paired-reads (in 22 unique pairings) successfully merged out of 27269 (in 35 pairings) input.
# 31699 paired-reads (in 22 unique pairings) successfully merged out of 32027 (in 32 pairings) input.
# 38850 paired-reads (in 27 unique pairings) successfully merged out of 39329 (in 54 pairings) input.
# 30014 paired-reads (in 32 unique pairings) successfully merged out of 30186 (in 65 pairings) input.
# 27758 paired-reads (in 32 unique pairings) successfully merged out of 27945 (in 63 pairings) input.
# 22777 paired-reads (in 37 unique pairings) successfully merged out of 22881 (in 60 pairings) input.
# 46644 paired-reads (in 20 unique pairings) successfully merged out of 55576 (in 40 pairings) input.
# 46726 paired-reads (in 16 unique pairings) successfully merged out of 55760 (in 28 pairings) input.
# 18967 paired-reads (in 13 unique pairings) successfully merged out of 21961 (in 20 pairings) input.
# 31868 paired-reads (in 16 unique pairings) successfully merged out of 41260 (in 29 pairings) input.
# 30925 paired-reads (in 16 unique pairings) successfully merged out of 40114 (in 25 pairings) input.
# 36434 paired-reads (in 18 unique pairings) successfully merged out of 46917 (in 44 pairings) input.
# 22605 paired-reads (in 23 unique pairings) successfully merged out of 23282 (in 44 pairings) input.
# 16660 paired-reads (in 22 unique pairings) successfully merged out of 18886 (in 43 pairings) input.
# 20356 paired-reads (in 24 unique pairings) successfully merged out of 20900 (in 38 pairings) input.
# 73620 paired-reads (in 17 unique pairings) successfully merged out of 80346 (in 39 pairings) input.
# 37090 paired-reads (in 14 unique pairings) successfully merged out of 37775 (in 26 pairings) input.
# 34331 paired-reads (in 13 unique pairings) successfully merged out of 35523 (in 18 pairings) input.
# 28448 paired-reads (in 15 unique pairings) successfully merged out of 31127 (in 21 pairings) input.
# 34007 paired-reads (in 18 unique pairings) successfully merged out of 37441 (in 29 pairings) input.
# 20393 paired-reads (in 16 unique pairings) successfully merged out of 22433 (in 28 pairings) input.
# 14616 paired-reads (in 17 unique pairings) successfully merged out of 16219 (in 33 pairings) input.
# 42812 paired-reads (in 23 unique pairings) successfully merged out of 44100 (in 35 pairings) input.
# 28157 paired-reads (in 20 unique pairings) successfully merged out of 30938 (in 41 pairings) input.
# 44707 paired-reads (in 297 unique pairings) successfully merged out of 47007 (in 359 pairings) input.
# 37570 paired-reads (in 316 unique pairings) successfully merged out of 39124 (in 345 pairings) input.
# 40650 paired-reads (in 324 unique pairings) successfully merged out of 41894 (in 347 pairings) input.
# 14705 paired-reads (in 93 unique pairings) successfully merged out of 15100 (in 106 pairings) input.
# 29441 paired-reads (in 185 unique pairings) successfully merged out of 30232 (in 211 pairings) input.
# 27079 paired-reads (in 148 unique pairings) successfully merged out of 27905 (in 171 pairings) input.
# 46694 paired-reads (in 379 unique pairings) successfully merged out of 48276 (in 402 pairings) input.
# 64173 paired-reads (in 354 unique pairings) successfully merged out of 67392 (in 435 pairings) input.
# 55292 paired-reads (in 329 unique pairings) successfully merged out of 60715 (in 349 pairings) input.
# 51849 paired-reads (in 501 unique pairings) successfully merged out of 55091 (in 527 pairings) input.
# 27182 paired-reads (in 117 unique pairings) successfully merged out of 27847 (in 140 pairings) input.
# 39829 paired-reads (in 348 unique pairings) successfully merged out of 40770 (in 400 pairings) input.
# 57134 paired-reads (in 360 unique pairings) successfully merged out of 59596 (in 419 pairings) input.
# 38939 paired-reads (in 350 unique pairings) successfully merged out of 39078 (in 359 pairings) input.
# 49387 paired-reads (in 400 unique pairings) successfully merged out of 50528 (in 427 pairings) input.
# 69114 paired-reads (in 391 unique pairings) successfully merged out of 71296 (in 450 pairings) input.
# 20815 paired-reads (in 80 unique pairings) successfully merged out of 21379 (in 98 pairings) input.
# 65245 paired-reads (in 363 unique pairings) successfully merged out of 67191 (in 419 pairings) input.
# 58129 paired-reads (in 440 unique pairings) successfully merged out of 60103 (in 475 pairings) input.
# 49835 paired-reads (in 324 unique pairings) successfully merged out of 51388 (in 349 pairings) input.
# 1234 paired-reads (in 42 unique pairings) successfully merged out of 1250 (in 45 pairings) input.
# 9 paired-reads (in 1 unique pairings) successfully merged out of 9 (in 1 pairings) input.
# 63045 paired-reads (in 676 unique pairings) successfully merged out of 66499 (in 718 pairings) input.
# No paired-reads (in ZERO unique pairings) successfully merged out of 15 pairings) input.
# 63323 paired-reads (in 612 unique pairings) successfully merged out of 65294 (in 655 pairings) input.
# 69257 paired-reads (in 579 unique pairings) successfully merged out of 72589 (in 624 pairings) input.
# 37587 paired-reads (in 290 unique pairings) successfully merged out of 40370 (in 362 pairings) input.
# 19331 paired-reads (in 118 unique pairings) successfully merged out of 20514 (in 146 pairings) input.
# 54622 paired-reads (in 874 unique pairings) successfully merged out of 57278 (in 914 pairings) input.
# 35001 paired-reads (in 499 unique pairings) successfully merged out of 36996 (in 568 pairings) input.
The main options of mergePairs() are as follows.
- justConcatenate: default FALSE. If TRUE, instead of merging, the forward and reverse reads are joined with 10 Ns (NNNNNNNNNN) inserted between them.
- maxMismatch: default 0; the maximum number of mismatching nucleotides allowed in the overlap region.
- trimOverhang: default FALSE. If TRUE, after merging the paired reads, the part of each read that overhangs past the start of its mate is trimmed off (as in the figure below).
seqtab <- makeSequenceTable(mergers)
dim(seqtab)
Build the ASV (sequence) table from the merged sequences.
(In a real analysis it is a good idea to also build a seqtab from the forward reads only and compare the two results, as sketched below.)
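A minimal sketch of that forward-only comparison, using the dadaFs object from step 8:
seqtab.fwd <- makeSequenceTable(dadaFs) # ASV table from the forward reads only
dim(seqtab)     # samples x ASVs from the merged reads
dim(seqtab.fwd) # samples x ASVs from the forward reads only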
| 10. Removing chimeric sequences
If the primer sequences have not been removed, it is recommended to turn this chimera removal step off (default: on).
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
table(nchar(getSequences(seqtab.nochim)))
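It is also worth checking what fraction of the total reads (not ASVs) survives chimera removal; a quick sketch:
sum(seqtab.nochim) / sum(seqtab) # fraction of reads kept after chimera removal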
Let's track how many reads made it through each step.
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers,
getN), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace
# sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged",
"nonchim")
rownames(track) <- sample.names
head(track)
# input filtered denoisedF denoisedR merged nonchim
# SRR5314314 6202 5658 5639 5594 5548 5016
# SRR5314315 12325 10985 10890 10722 10462 6076
# SRR5314316 16006 14046 14025 13901 13806 7264
# SRR5314317 11801 10423 10389 10341 10217 5485
# SRR5314331 17399 15213 15209 15209 15085 15085
# SRR5314332 42604 36742 36721 36734 34971 34971
You can see how many reads remain after each step.
| 11. Assign taxonomy
- UNITE database: https://unite.ut.ee/repository.php
Click the download link of the general FASTA release (https://doi.plutof.ut.ee/doi/10.15156/BIO/2938067). It takes you to the page below.
The current latest version was uploaded on 2023-07-18 and updated on 2023-07-25. Download the most recently updated file on that page, sh_general_release_25.07.2023.tgz, extract the tgz, and place the fasta file on your Linux server.
unite.ref <- "./tax/UNITE/sh_general_release_dynamic_25.07.2023.fasta" # adjust the path to your own analysis environment
taxa <- assignTaxonomy(seqtab.nochim, unite.ref, tryRC = TRUE, # multithread = F
)
taxa.print <- taxa # Removing sequence rownames for display only
rownames(taxa.print) <- NULL
head(taxa.print)
# Kingdom Phylum Class Order Family Genus Species
# [1,] "k__Fungi" "p__Basidiomycota" "c__Tremellomycetes" "o__Filobasidiales" "f__Filobasidiaceae" "g__Naganishia" "s__albida"
# [2,] "k__Fungi" "p__Ascomycota" "c__Sordariomycetes" "o__Hypocreales" "f__Nectriaceae" "g__Fusarium" NA
# [3,] "k__Fungi" "p__Ascomycota" "c__Taphrinomycetes" "o__Taphrinales" "f__Protomycetaceae" "g__Saitoella" "s__complicata"
# [4,] "k__Fungi" "p__Ascomycota" "c__Dothideomycetes" "o__Pleosporales" "f__Pleosporaceae" "g__Alternaria" "s__prunicola"
# [5,] "k__Fungi" "p__Ascomycota" "c__Sordariomycetes" "o__Hypocreales" "f__Nectriaceae" "g__Fusarium" "s__graminearum"
# [6,] "k__Fungi" "p__Mortierellomycota" "c__Mortierellomycetes" "o__Mortierellales" "f__Mortierellaceae" "g__Podila" "s__humilis"
Depending on the number of ASVs and your machine, this step takes longer than you might expect.
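Because this step is slow, it can be worth caching the result so assignTaxonomy() does not have to be re-run (a sketch; the file path is arbitrary):
saveRDS(taxa, "./taxa_unite.rds")     # save the taxonomy table
# taxa <- readRDS("./taxa_unite.rds") # reload it later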
| 12. To Phyloseq
To make the analysis and visualization easier, let's convert the data into a phyloseq object.
library(phyloseq); packageVersion("phyloseq")
library(Biostrings); packageVersion("Biostrings")
library(ggplot2); packageVersion("ggplot2")
We do not have real metadata for this example data, so let's create some dummy metadata for the analysis.
samdf <- data.frame(row.names = sample.names,
`Type` = c(rep("A", 20), rep("B", 20), rep("C", 21)),
`SampleID` = sample.names)
ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE),
sample_data(samdf),
tax_table(taxa))
ps
# phyloseq-class experiment-level object
# otu_table() OTU Table: [ 313 taxa and 61 samples ]
# sample_data() Sample Data: [ 61 samples by 1 sample variables ]
# tax_table() Taxonomy Table: [ 313 taxa by 7 taxonomic ranks ]
ps <- subset_samples(ps, SampleID != "SRR5838526")
ps
# phyloseq-class experiment-level object
# otu_table() OTU Table: [ 313 taxa and 60 samples ]
# sample_data() Sample Data: [ 60 samples by 2 sample variables ]
# tax_table() Taxonomy Table: [ 313 taxa by 7 taxonomic ranks ]
ps.rel <- transform_sample_counts(ps, function(x) x/sum(x))
To avoid errors, the SRR5838526 sample is removed from the phyloseq object.
Then, to look at relative abundances, transform_sample_counts() is used to divide each read count by the total read count of its sample.
Alpha diversity
plot_richness(ps.rel, x="Type", measures=c("Shannon", "Simpson"), color = "Type")
Beta diversity
ord.p.bray <- ordinate(ps.rel, method="PCoA", distance="bray")
plot_ordination(ps.rel, ord.p.bray, color="Type", title="Bray PCoA")
| Sources and references
- Homepage: https://benjjneb.github.io/dada2/ITS_workflow.html
- Example data: https://www.ebi.ac.uk/ena/browser/view/PRJNA377530
- Cutadapt: http://cutadapt.readthedocs.io/en/stable/index.html
- DADA2 ITS tutorial (unofficial): https://nephele.niaid.nih.gov/details_dada2_its/
- Image source (ITS1): Usyk M, Zolnik CP, Patel H, Levi MH, Burk RD. Novel ITS1 Fungal Primers for Characterization of the Mycobiome. mSphere. 2017 Dec 13;2(6):e00488-17. doi: 10.1128/mSphere.00488-17. PMID: 29242834; PMCID: PMC5729218.
- Image source (ITS1 sequence length variability): Motooka D, Fujimoto K, Tanaka R, Yaguchi T, Gotoh K, Maeda Y, Furuta Y, Kurakawa T, Goto N, Yasunaga T, Narazaki M, Kumanogoh A, Horii T, Iida T, Takeda K, Nakamura S. Fungal ITS1 Deep-Sequencing Strategies to Reconstruct the Composition of a 26-Species Community and Evaluation of the Gut Mycobiota of Healthy Japanese Individuals. Front Microbiol. 2017 Feb 15;8:238. doi: 10.3389/fmicb.2017.00238. PMID: 28261190; PMCID: PMC5309391.