We will begin with a subset of the dataset from Zeller et al. (2014), which consists of fecal samples collected from 156 French patients. In this section, we will load the dataset and explore its basic characteristics.
We will also extract the bacterial species column names. Columns whose names start with k__Bacteria represent bacterial species.
import pandas as pd

# Load the tab-separated data file and transpose it so that samples are rows
data = pd.read_csv('Nine_CRC_cohorts_taxon_profiles.tsv', sep='\t', header=None).T

# Use the first row as column names, then drop that row
data = data.rename(columns=data.loc[0]).drop(0, axis=0)

# Select the Zeller et al. (2014) dataset
zeller_db = data.loc[data['dataset_name'] == 'ZellerG_2014', :]

# Collect the columns that hold bacterial taxa
bacteria_colnames = [col for col in data.columns if 'k__Bacteria' in col]

# Metadata column names
metadata_colnames = ['dataset_name', 'sampleID', 'subjectID', 'body_site',
                     'study_condition', 'disease', 'age', 'age_category',
                     'gender', 'country', 'ajcc', 'alcohol',
                     'antibiotics_current_use', 'curator', 'disease_subtype',
                     'ever_smoke', 'fobt', 'hba1c', 'hdl', 'ldl', 'location',
                     'BMI']

print('Total features: ', zeller_db.shape[1])
Total features: 829
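The transpose-and-promote-header step can be easier to follow on a tiny table. The sketch below repeats the same steps on an in-memory stand-in for the real file. The `toy_tsv` contents, the `OtherStudy_2000` cohort name, and every value in it are made up for illustration and are not taken from the dataset.

```python
import io
import pandas as pd

# Toy stand-in for the real TSV: features in rows, samples in columns.
# All names and values here are invented for illustration only.
toy_tsv = (
    "dataset_name\tZellerG_2014\tZellerG_2014\tOtherStudy_2000\n"
    "sampleID\ts1\ts2\ts3\n"
    "age\t63\t58\t47\n"
    "k__Bacteria|p__Firmicutes\t0.61\t0.55\t0.70\n"
)

# Same steps as above: read without a header, transpose, promote row 0
raw = pd.read_csv(io.StringIO(toy_tsv), sep="\t", header=None).T
toy = raw.rename(columns=raw.loc[0]).drop(0, axis=0)

# Keep one cohort and collect the bacterial columns
toy_zeller = toy.loc[toy["dataset_name"] == "ZellerG_2014", :]
toy_bacteria = [col for col in toy.columns if "k__Bacteria" in col]

print(toy_zeller.shape)
print(toy_bacteria)
print(repr(toy_zeller["age"].iloc[0]))
```

Note that every value is still text after the transpose, because each original column mixed names and numbers. Numeric columns such as age therefore need converting before any arithmetic.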
For the rest of our analysis, we will focus on five metadata features along with OTUs. Those features are age, gender, BMI, study_condition and ajcc.
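One way to build this reduced table is sketched below. It is not necessarily how the rest of the analysis does it. The miniature `toy_db` frame is a hypothetical stand-in for zeller_db with invented values, and the use of the string "nan" for missing entries is an assumption. The values are stored as strings because that is how they come out of the transposed table, so the numeric columns are converted explicitly.

```python
import pandas as pd

# Hypothetical miniature of zeller_db; all values are invented and stored
# as strings, as they would be after transposing the raw table.
toy_db = pd.DataFrame({
    "sampleID": ["s1", "s2", "s3"],
    "age": ["63", "58", "71"],
    "gender": ["male", "female", "male"],
    "BMI": ["24.5", "nan", "27.1"],  # "nan" as missing marker is an assumption
    "study_condition": ["control", "adenoma", "CRC"],
    "ajcc": ["nan", "nan", "ii"],
    "country": ["FRA", "FRA", "FRA"],
    "k__Bacteria|p__Firmicutes": ["0.61", "0.55", "0.70"],
})

# The five metadata features named in the text, plus the bacterial columns
selected_metadata = ["age", "gender", "BMI", "study_condition", "ajcc"]
toy_bacteria = [col for col in toy_db.columns if "k__Bacteria" in col]
subset = toy_db.loc[:, selected_metadata + toy_bacteria].copy()

# Convert text to numbers; anything unparseable becomes NaN
for col in ["age", "BMI"] + toy_bacteria:
    subset[col] = pd.to_numeric(subset[col], errors="coerce")

print(subset.dtypes)
```

Working on a copy keeps the original table untouched, and `errors="coerce"` turns any non-numeric entry into a missing value instead of raising an error.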
Age, BMI and gender distribution
Figure 1 shows the distributions of age, BMI, and gender of patients across the study conditions: control, adenoma, and CRC.
CRC patients are slightly older than control cases.
BMI is higher in adenoma cases than in control cases.
Among CRC patients, there are more males than females.
(a) Age distribution
(b) BMI distribution
(c) Gender distribution
Figure 1: Distribution across study conditions
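Observations like those above can also be checked numerically. The sketch below shows one possible way, using per-group medians and a cross-tabulation. The `toy` frame and all of its values are invented to exercise the code; they do not reproduce the figure or the real data.

```python
import pandas as pd

# Invented values for illustration only; not taken from the Zeller dataset.
toy = pd.DataFrame({
    "study_condition": ["control", "control", "adenoma", "adenoma", "CRC", "CRC"],
    "age": [60, 62, 63, 65, 68, 70],
    "BMI": [24.0, 25.0, 26.0, 27.0, 25.0, 26.0],
    "gender": ["female", "male", "male", "female", "male", "male"],
})

# Median age and BMI within each study condition
summary = toy.groupby("study_condition")[["age", "BMI"]].median()

# Number of samples per gender within each study condition
gender_counts = pd.crosstab(toy["study_condition"], toy["gender"])

print(summary)
print(gender_counts)
```

Medians are used here because they are less sensitive to outliers than means. With the real table, age and BMI would first need to be numeric.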
Zeller, Georg, Julien Tap, Anita Y Voigt, Shinichi Sunagawa, Jens Roat Kultima, Paul I Costea, Aurélien Amiot, et al. 2014. “Potential of Fecal Microbiota for Early-Stage Detection of Colorectal Cancer.” Molecular Systems Biology 10 (11): 766.