Genome sequencing data processing pipelines
BioTech
Genome Sequencing
Python
R
Signal Processing
Developed upstream and downstream pipelines to process amplicon sequencing data
Project overview
The goal of this project was to establish genome sequencing data processing pipelines that turn raw sequencing reads (i.e., FASTQ reads) into microbial compositional data. It was part of a larger project aimed at building a machine learning classifier for colorectal cancer using fecal samples.
Key Features
- End-to-End Automation: Fully automated processing from raw sequencing data to analytical results
- Reproducible Workflows: Pipeline built with Snakemake for reproducible bioinformatics analysis
- Comprehensive Analysis: Produces taxonomic profiles and microbial composition data
- Integration Ready: Output formatted for downstream machine learning applications
- Machine Learning Modeling: Microbial compositional data analyzed using a LASSO model to predict colorectal cancer
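As a rough illustration of what "microbial composition data" formatted for machine learning can look like, the sketch below converts per-sample taxon read counts into relative abundances and then into a dense sample-by-taxon matrix. It is not the project's code: the function names, the toy counts, and the choice of plain proportions (rather than another normalization) are assumptions made for the example.

```python
def relative_abundance(counts):
    """Convert per-sample taxon read counts into proportions that sum to 1."""
    profiles = {}
    for sample, taxa in counts.items():
        total = sum(taxa.values())
        if total == 0:
            # A sample with no classified reads cannot be normalized.
            raise ValueError(f"sample {sample!r} has no reads")
        profiles[sample] = {taxon: n / total for taxon, n in taxa.items()}
    return profiles


def to_feature_matrix(profiles):
    """Build a dense sample-by-taxon matrix with a fixed column order."""
    taxa = sorted({t for p in profiles.values() for t in p})
    samples = sorted(profiles)
    # Taxa absent from a sample get an abundance of 0.0.
    matrix = [[profiles[s].get(t, 0.0) for t in taxa] for s in samples]
    return samples, taxa, matrix


# Hypothetical toy counts (not data from the project).
toy_counts = {
    "sample_A": {"Bacteroides": 60, "Fusobacterium": 30, "Prevotella": 10},
    "sample_B": {"Bacteroides": 20, "Prevotella": 80},
}
profiles = relative_abundance(toy_counts)
samples, taxa, matrix = to_feature_matrix(profiles)
```

Fixing the column order once, and filling missing taxa with zeros, keeps every sample's feature vector aligned, which is what a downstream model needs.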
⚡ Key Contributions
Developed fully automated upstream and downstream pipelines featuring:
- Automated quality control and adapter trimming
- Taxonomic classification using QIIME2
- Generation of microbial abundance profiles
- Exploratory analysis of microbial abundance profiles
- Building of a LASSO model for colorectal cancer detection
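The LASSO step can be pictured with a minimal sketch of how an L1 penalty selects features. The source does not say which library or loss was used; for a binary outcome, an L1-penalized logistic regression would be a common choice. The version below instead uses squared loss with coordinate descent, purely to show the sparsity effect, and all names and toy values are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1/(2n)) * ||y - b0 - Xw||^2 + alpha * ||w||_1."""
    n, p = len(X), len(X[0])
    # Center features and target so the intercept can be recovered afterwards.
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    residual = [v - y_mean for v in y]  # equals yc - Xc @ w while w is all zeros
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant feature carries no information
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(Xc[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= Xc[i][j] * delta
                w[j] = new_w
    intercept = y_mean - sum(w[j] * col_means[j] for j in range(p))
    return w, intercept


# Hypothetical toy data: two taxon abundances per sample, label 1 = case.
X_toy = [[0.1, 0.5], [0.2, 0.4], [0.6, 0.5], [0.7, 0.4]]
y_toy = [0, 0, 1, 1]
weights, intercept = lasso_coordinate_descent(X_toy, y_toy, alpha=0.01)
```

In this toy setup only the first feature tracks the label, so the penalty drives the second weight to exactly zero. That is the property that makes LASSO attractive when there are many taxa and few samples.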
Skills Applied
Python, R, Signal Processing
Tools/Libraries Used
pandas, matplotlib, ggplot2, qiime2, snakemake