🟦 샘플 필터링은 왜 필요한가?
◼ 마이크로 바이옴 분석에서 데이터는 보통 대용량 데이터이다.
◼ 분석 시 컴퓨터의 과부하를 줄이기 위하여 일부 데이터만 추출하여 비교하기도 한다.
- ex) major한 taxa에서 differencfial abundance test를 위해 read수가 많은 순으로 10%만 골라낸다
◼ 일정 depth를 충족하지 못한 샘플은 왜곡이 많을 것이라고 가정해 제거하고 분석을 수행한다.
🟦 Phyloseq 데이터를 이용한 필터링 해보기
library(phyloseq)
data(GlobalPatterns)
GlobalPatterns
# hyloseq-class experiment-level object
# otu_table() OTU Table: [ 19216 taxa and 26 samples ]
# sample_data() Sample Data: [ 26 samples by 7 sample variables ]
# tax_table() Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
# phy_tree() Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
| Taxonomy Filtering
1. Read Count 로 filtering 하기
- Read Count가 10 이 넘은 taxa만 남기기
GlobalPatterns.2 = filter_taxa(GlobalPatterns, function(x) sum(x) > 10, TRUE)
GlobalPatterns.2
# phyloseq-class experiment-level object
# otu_table() OTU Table: [ 12051 taxa and 26 samples ]
# sample_data() Sample Data: [ 26 samples by 7 sample variables ]
# tax_table() Taxonomy Table: [ 12051 taxa by 7 taxonomic ranks ]
# phy_tree() Phylogenetic Tree: [ 12051 tips and 12050 internal nodes ]
2. Relative abundance로 filtering 하기
- 전체 샘플에서 realative abundance가 1% 이상인 otu만 보고 싶을 때 사용한다
GlobalPatterns.rAB = transform_sample_counts(GlobalPatterns, function(x) x / sum(x) )
GlobalPatterns.rAB.2 = filter_taxa(GlobalPatterns.rAB, function(x) sum(x) > .01, TRUE)
GlobalPatterns.rAB.2
# phyloseq-class experiment-level object
# otu_table() OTU Table: [ 375 taxa and 26 samples ]
# sample_data() Sample Data: [ 26 samples by 7 sample variables ]
# tax_table() Taxonomy Table: [ 375 taxa by 7 taxonomic ranks ]
# phy_tree() Phylogenetic Tree: [ 375 tips and 374 internal nodes ]
- average realative abundance가 1% 이상인 otu만 보고 싶을 때 사용한다
GlobalPatterns.rAB = transform_sample_counts(GlobalPatterns, function(x) x / sum(x) )
GlobalPatterns.rAB.3 = filter_taxa(GlobalPatterns.rAB, function(x) mean(x) > .01, TRUE)
GlobalPatterns.rAB.3
# phyloseq-class experiment-level object
# otu_table() OTU Table: [ 12 taxa and 26 samples ]
# sample_data() Sample Data: [ 26 samples by 7 sample variables ]
# tax_table() Taxonomy Table: [ 12 taxa by 7 taxonomic ranks ]
# phy_tree() Phylogenetic Tree: [ 12 tips and 11 internal nodes ]
+) 그렇다면 1% 이하의 샘플만 관찰하려면?
under0.01 = filter_taxa(GlobalPatterns, function(x) mean(x) < .01,TRUE)
taxa_names_0.01 = taxa_names(under0.01)
all_taxa_names = taxa_names(GlobalPatterns)
Taxa_0.01 = all_taxa_names[!all_taxa_names %in% taxa_names_0.01]
GlobalPatterns_under0.01 <- prune_taxa(Taxa_0.01,GlobalPatterns)
GlobalPatterns_under0.01
# phyloseq-class experiment-level object
# otu_table() OTU Table: [ 18988 taxa and 26 samples ]
# sample_data() Sample Data: [ 26 samples by 7 sample variables ]
# tax_table() Taxonomy Table: [ 18988 taxa by 7 taxonomic ranks ]
# phy_tree() Phylogenetic Tree: [ 18988 tips and 18987 internal nodes ]
(출처 : https://github.com/joey711/phyloseq/issues/694)
3. 특정 taxa 제거
예를들어 피부 sample인데 mitochondria, Chloroplast, 혹은 이름 없는 taxa가 검출되었을 때, 삭제하기 위하여 수행한다
ps <- subset_taxa(ps, !is.na(Phylum) & !Phylum %in% c("", "uncharacterized"))
4. Prevalence Filtering
- 모든 샘플에서 50% 이상 검출된 taxa만 남겨보자.
일단 전체 ASV에 대한 prevelance를 계산해보자.
prevelancedf = apply(X = otu_table(GlobalPatterns),
MARGIN = 1,
FUN = function(x){sum(x > 0)})
# Add taxonomy and total read counts to this data.frame
prevelancedf = data.frame(Prevalence = prevelancedf,
TotalAbundance = taxa_sums(GlobalPatterns),
tax_table(GlobalPatterns))
prevelancedf1 = subset(prevelancedf, Phylum %in% get_taxa_unique(GlobalPatterns, taxonomic.rank = "Phylum"))
0.5 이상의 prevelance를 가진 asv를 추출하자
prevalenceThreshold = 0.50 * nsamples(GlobalPatterns)
keepTaxa = rownames(prevelancedf1)[(prevelancedf1$Prevalence >= prevalenceThreshold)]
이제 이 값을 phyloseq 개체에 적용하여 필터링하자
GlobalPatterns.2 = prune_taxa(keepTaxa, GlobalPatterns)
GlobalPatterns.2
# phyloseq-class experiment-level object
# otu_table() OTU Table: [ 1797 taxa and 26 samples ]
# sample_data() Sample Data: [ 26 samples by 7 sample variables ]
# tax_table() Taxonomy Table: [ 1797 taxa by 7 taxonomic ranks ]
# phy_tree() Phylogenetic Tree: [ 1797 tips and 1796 internal nodes ]
(출처 : http://web.stanford.edu/class/bios221/MicrobiomeWorkflowII.html#using_phyloseq)
5. Top N의 taxa만 보기
- Top 30을 관찰해 보자
most_abundant_taxa <- sort(taxa_sums(GlobalPatterns), TRUE)[1:30]
Global.30 <- prune_taxa(names(most_abundant_taxa), GlobalPatterns)
Global.30
# phyloseq-class experiment-level object
# otu_table() OTU Table: [ 30 taxa and 26 samples ]
# sample_data() Sample Data: [ 26 samples by 7 sample variables ]
# tax_table() Taxonomy Table: [ 30 taxa by 7 taxonomic ranks ]
# phy_tree() Phylogenetic Tree: [ 30 tips and 29 internal nodes ]
6. 테스트용 데이터 만들기 : 랜덤 한 taxa를 뽑아보자
randomSpecies100 <- sample(taxa_names(GlobalPatterns), 100, replace=FALSE)
random <- prune_taxa(randomSpecies100, GlobalPatterns)
random
# phyloseq-class experiment-level object
# otu_table() OTU Table: [ 100 taxa and 26 samples ]
# sample_data() Sample Data: [ 26 samples by 7 sample variables ]
# tax_table() Taxonomy Table: [ 100 taxa by 7 taxonomic ranks ]
# phy_tree() Phylogenetic Tree: [ 100 tips and 99 internal nodes ]
| Sample Filtering
1. Sample의 총 read수의 합이 특정 수를 넘는 샘플만 filtering
- GlobalPatterns 데이터의 샘플 수를 보자
sample_sums(GlobalPatterns)
# CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr M31Tong M11Tong LMEpi24M
# 864077 1135457 697509 1543451 2076476 718943 433894 186297 2000402 100187 2117592
# SLEpi20M AQC1cm AQC4cm AQC7cm NP2 NP3 NP5 TRRsed1 TRRsed2 TRRsed3 TS28
# 1217312 1167748 2357181 1699293 523634 1478965 1652754 58688 493126 279704 937466
# TS29 Even1 Even2 Even3
# 1211071 1216137 971073 1078241
대략 15만으로 잡고 필터링해보자
GlobalPatterns.3 = prune_samples(sample_sums(GlobalPatterns) >= 150000, GlobalPatterns)
GlobalPatterns.3
# phyloseq-class experiment-level object
# otu_table() OTU Table: [ 19216 taxa and 24 samples ]
# sample_data() Sample Data: [ 24 samples by 7 sample variables ]
# tax_table() Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
# phy_tree() Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
2. 원하는 Type의 데이터로 filtering 하기
일단 phyloseq의 sample_data()를 이용해서 메타데이터를 관찰하자.
library(dplyr)
meta <- sample_data(GlobalPatterns)
meta %>% .$SampleType %>% unique
# [1] Soil Feces Skin Tongue Freshwater
# [6] Freshwater (creek) Ocean Sediment (estuary) Mock
여기서 Soil에 해당하는 데이터를 추출해 보자
GlobalPatterns.Skin =subset_samples(GlobalPatterns, SampleType == "Skin")
GlobalPatterns.Skin
# phyloseq-class experiment-level object
# otu_table() OTU Table: [ 19216 taxa and 3 samples ]
# sample_data() Sample Data: [ 3 samples by 7 sample variables ]
# tax_table() Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
# phy_tree() Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
만약 여러 type을 보고 싶다면
SampleType %in% c("Skin", "Soil") 등으로 바꾸어서 입력하면 된다
3. 원하는 sam ple data의 rowname으로 filtering 하기
sub1 <- prune_samples("CL3", GlobalPatterns)
sub1
# phyloseq-class experiment-level object
# otu_table() OTU Table: [ 19216 taxa and 1 samples ]
# sample_data() Sample Data: [ 1 samples by 7 sample variables ]
# tax_table() Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
# phy_tree() Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
🟦 Reference
- https://joey711.github.io/phyloseq/preprocess.html#preprocessing