We filter out species whose relative abundance does not exceed 0.001 in any sample; in other words, a species is kept only if its maximum relative abundance across all samples is greater than 0.001. This criterion follows Zeller et al. (2014), the study that also provided the dataset.
After applying this abundance filter, 491 species remain. We will now proceed with model development using only these filtered species.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# dataset containing only the bacterial species' relative abundances
# convert every column to numeric; values that cannot be parsed become NaN
microbiome = zeller_db[bacteria_colnames].apply(pd.to_numeric, errors='coerce')

# names of the columns whose maximum abundance across samples exceeds 0.001
columns_to_fetch = microbiome.columns[microbiome.max(axis=0) > 0.001]

# filter the dataset
microbiome_filtered = microbiome[columns_to_fetch]

# compare the number of species before and after filtering
plt.figure()
plt.bar([1, 2], [len(microbiome.columns), len(columns_to_fetch)], alpha=.8)
plt.xticks([1, 2], ['before', 'after'])
plt.ylabel('Number of microbial species')
plt.title('Before and after species filtering')
plt.show()
Figure 1: Before and after species filtering
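To make the filtering rule concrete, here is a minimal sketch on a made-up table. The species names, sample names, abundance values, and the helper `filter_by_max_abundance` are all hypothetical and are not part of the Zeller dataset; only the 0.001 threshold and the "maximum across samples" rule come from the text above. Note that the comparison is strict, so a species whose maximum equals the threshold exactly is dropped.

```python
import pandas as pd

# Hypothetical toy table: rows are samples, columns are species,
# values are relative abundances (illustrative numbers only).
toy = pd.DataFrame(
    {
        "species_a": [0.0, 0.0005, 0.0009],  # never exceeds the threshold
        "species_b": [0.0, 0.02, 0.0],       # exceeds it in one sample
        "species_c": [0.001, 0.001, 0.0],    # reaches, but never exceeds, the threshold
    },
    index=["sample_1", "sample_2", "sample_3"],
)

THRESHOLD = 0.001  # threshold stated in the text


def filter_by_max_abundance(df, threshold=THRESHOLD):
    """Keep columns whose maximum across rows is strictly above `threshold`."""
    keep = df.columns[df.max(axis=0) > threshold]
    return df[keep]


toy_filtered = filter_by_max_abundance(toy)
print(list(toy_filtered.columns))
```

A single sample above the threshold is enough to retain a species, which is why `species_b` survives even though it is absent from the other two samples.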
References
Zeller, Georg, Julien Tap, Anita Y Voigt, Shinichi Sunagawa, Jens Roat Kultima, Paul I Costea, Aurélien Amiot, et al. 2014. “Potential of Fecal Microbiota for Early-Stage Detection of Colorectal Cancer.” Molecular Systems Biology 10 (11): 766.