Bioinformatics/Taxonomy
[ips::raxml] R에서 RAxML을 사용한 Maximum likelihood tree를 만들어 보자
김해김씨99대손
2024. 5. 9. 16:39
이전 글
Linux기반에서 수행될 때를 기준으로 작성되었습니다.
먼저 cd ~ 위치에 도구들을 위한 tools 폴더를 만들어 봅시다.
cd ~
mkdir tools/
1. Raxml 설치
- github: https://github.com/stamatak/standard-RAxML
서버에 파일 다운로드하기
cd tools/
wget https://github.com/stamatak/standard-RAxML/archive/master.zip
unzip master.zip
rm master.zip
컴파일하기
cd standard-RAxML-master/
make -f Makefile.gcc # basic non-parallized version
make -f Makefile.SSE3.gcc # x86 processor optimized version
make -f Makefile.PTHREADS.gcc # parallelized version
make -f Makefile.SSE3.PTHREADS.gcc # parallelized and x86 processor optimized version
ls raxmlHPC*
설치되었는지 확인
./tools/standard-RAxML-master/raxmlHPC-PTHREADS-SSE3 -v
접근 가능하게 만들려면, RAxML의 위치를 ~/.bachrc에 입력합니다.
export PATH=$HOME/tools/RAxML/standard-RAxML-master:$PATH
2. R에서 서열 정렬
먼저 필요한 패키지는 아래와 같습니다.
library(ips); packageVersion("ips") # ‘0.0.12’
library(DECIPHER); packageVersion("DECIPHER") # ‘2.26.0’
library(phangorn); packageVersion("phangorn") # ‘2.10.0’
library(ape); packageVersion("ape") # ‘2.10.0’
일반적인 DADA2 pipeline에 은 DECIPER로 정렬 후, phangorn으로 계통수를 제작합니다.
여기서는 ips 패키지의 raxml을 사용하여 계통수를 제작하겠습니다.
## 1.Input sequence
# 각 파일 타입에 따라 1) ~ 3) 을 선택하여 수행
# 1) phyloseq에서 불러올 때
ps.forward <- readRDS("~/phyloseq.rds")
sequences <- colnames(otu_table(ps.forward))
names(sequences) <- sequences
# 2) otu table에서 불러올 때
sequences <- colnames(seqtab.nochim)
names(sequences) <- sequences
# 3) fa 파일에서 불러올 떄
fas <- "seqs.fa"
sequences <- readDNAStringSet(fas)
sequences <- OrientNucleotides(sequences)
## 2. Run sequence alignment (MSA) using DECIPHER
alignment <- AlignSeqs(DNAStringSet(sequences), anchor = NA)
phang.align <- phangorn::phyDat(as(alignment, "matrix"), type="DNA")
write.phyDat(phang.align, file="alignment.fasta", format="fasta")
my_alignment <- ape::read.dna("alignment.fasta",format="fasta",as.matrix=TRUE)
3. R에서 RAxML 돌리기
## 3. run raxml
alignment.rax.gtr <- raxml(my_alignment,
m="GTRGAMMAIX", # model
f="a", # best tree and bootstrap
p=1234, # random number seed
x=2345, # random seed for rapid bootstrapping
N=100, # number of bootstrap replicates
file="alignment", # name of output files
exec="home/usr/tools/standard-RAxML-master/raxmlHPC-PTHREADS-SSE3", # 위치에 맞게 수정
threads=20
)
# You are using a proportion of Invariable sites estimate, although I don't
# like it. The likelihood epsilon "-f e" will be automatically lowered to 0.001
# to avoid unfavorable effects caused by simultaneous optimization of alpha and P-Invar
#
#
# Using BFGS method to optimize GTR rate parameters, to disable this specify "--no-bfgs"
#
#
# This is the RAxML Master Pthread
# This is RAxML Worker Pthread Number: 2
# This is RAxML Worker Pthread Number: 3
# This is RAxML Worker Pthread Number: 1
# This is RAxML Worker Pthread Number: 5
# This is RAxML Worker Pthread Number: 6
# This is RAxML Worker Pthread Number: 7
# This is RAxML Worker Pthread Number: 8
# This is RAxML Worker Pthread Number: 4
# This is RAxML Worker Pthread Number: 10
# This is RAxML Worker Pthread Number: 11
# This is RAxML Worker Pthread Number: 9
# This is RAxML Worker Pthread Number: 12
# This is RAxML Worker Pthread Number: 13
# This is RAxML Worker Pthread Number: 14
# This is RAxML Worker Pthread Number: 15
# This is RAxML Worker Pthread Number: 16
# This is RAxML Worker Pthread Number: 17
# This is RAxML Worker Pthread Number: 18
# This is RAxML Worker Pthread Number: 19
#
#
# This is RAxML version 8.2.12 released by Alexandros Stamatakis on May 2018.
# With greatly appreciated code contributions by:
# Andre Aberer (HITS)
# Simon Berger (HITS)
# Alexey Kozlov (HITS)
# Kassian Kobert (HITS)
# David Dao (KIT and HITS)
# Sarah Lutteropp (KIT and HITS)
# Nick Pattengale (Sandia)
# Wayne Pfeiffer (SDSC)
# Akifumi S. Tanabe (NRIFS)
# Charlie Taylor (UF)
#
#
# Alignment has 766 distinct alignment patterns
#
# Found 0 invariant alignment patterns that correspond to 0 columns
# Proportion of gaps and completely undetermined characters in this alignment: 72.79%
#
# RAxML rapid bootstrapping and subsequent ML search
#
# Using 1 distinct models/data partitions with joint branch length optimization
#
#
# Executing 100 rapid bootstrap inferences and thereafter a thorough ML search
#
# All free model parameters will be estimated by RAxML
# GAMMA+P-Invar model of rate heterogeneity, ML estimate of alpha-parameter
# GAMMA+P-Invar Model parameters will be estimated up to an accuracy of 0.0010000000 Log Likelihood units
#
# Partition: 0
# Alignment Patterns: 766
# Name: No Name Provided
# DataType: DNA
# Substitution Matrix: GTR
# Base frequencies: ML estimate
#
#
#
# RAxML was called as follows:
#
# /home/usr/tools/standard-RAxML-master/raxmlHPC-PTHREADS-SSE3 -T 20 -f a -p 1234 -x 2345 -m GTRGAMMAIX -N 100 -s alignment.phy -n alignment
#
#
#
# Time for BS model parameter optimization 3.820551
# Bootstrap[0]: Time 119.091122 seconds, bootstrap likelihood -133564.390079, best rearrangement setting 14
# Bootstrap[1]: Time 75.170364 seconds, bootstrap likelihood -119015.558564, best rearrangement setting 12
# Bootstrap[2]: Time 77.333730 seconds, bootstrap likelihood -136101.608342, best rearrangement setting 12
# Bootstrap[3]: Time 96.199627 seconds, bootstrap likelihood -128573.521899, best rearrangement setting 13
# Bootstrap[4]: Time 108.406754 seconds, bootstrap likelihood -123148.556330, best rearrangement setting 14
# Bootstrap[5]: Time 51.153752 seconds, bootstrap likelihood -125431.736714, best rearrangement setting 9
# Bootstrap[6]: Time 97.653022 seconds, bootstrap likelihood -121143.307189, best rearrangement setting 13
# Bootstrap[7]: Time 53.892158 seconds, bootstrap likelihood -132917.810564, best rearrangement setting 9
# Bootstrap[8]: Time 135.423945 seconds, bootstrap likelihood -122008.239925, best rearrangement setting 15
# Bootstrap[9]: Time 81.147537 seconds, bootstrap likelihood -130929.923241, best rearrangement setting 12
# Bootstrap[10]: Time 116.630507 seconds, bootstrap likelihood -132404.970655, best rearrangement setting 15
# ...
- 출처를 보면 5,000개의 ITS 서열로 제작 시 9 일이 걸린다고 한다 (실제 1,350 개 서열로 계통수를 제작해 본 결과 14시간이 걸렸다).
4. Phyloseq으로 저장하기
tree <- read_tree("RAxML_bipartitionsBranchLabels.alignment")
phy_tree(phyloseq) <-phy_tree(tree)
saveRDS(ps, file="ps.rds")
🚨 Taxon Name too long at taxon 1, adapt constant nmlngth in axml.h, current setting 256
말 그대로 taxon이 이름이 너무 길어서 생긴 에러이다. dada2의 경우 서열의 이름을 서열 그대로 쓰는 경우가 많다. 이때는 ASV를 붙여서 나타내주자.
# 예시코드
silva <- readRDS("./output_16Sfull/3.Phyloseq/phyloseq_silva_20240528.rds")
sequences <- colnames(otu_table(silva))
names(sequences) <- sequences
silva.rel <- transform_sample_counts(silva, function(x) x/sum(x))
taxa_ <- taxa_sums(silva.rel) %>% sort(decreasing = T)
sort_names <- names(taxa_)
sequences2 <- sequences[sort_names]
length(sequences2) # 980
names(sequences2) <- sprintf("ASV_%03d", 1:length(sequences2)) # 3자리 까지 나타냄
names(sequences2) %>% head
참고
- 설치: https://www.metagenomics.wiki/tools/phylogenetic-tree/construction/raxml/install
- DADA2 과정에서 수행: https://github.com/benjjneb/dada2/issues/88
반응형