Genome sequencing data processing pipelines
BioTech
Genome Sequencing
Python
R
Signal Processing
Developed upstream and downstream pipelines to process amplicon sequencing data
Project overview
The goal of this project was to establish genome sequencing data processing pipelines that turn raw sequencing reads (i.e., FASTQ reads) into microbial compositional data. It was part of a larger project aimed at building a machine learning classifier for colorectal cancer from fecal samples.
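As a minimal sketch of what "compositional data" means here, the snippet below converts per-sample taxon read counts into relative abundances that sum to 1. The taxon names and counts are hypothetical, and plain Python is used only to keep the example dependency-free; it does not reproduce the project's own table handling.

```python
def relative_abundance(counts_by_sample):
    """Convert per-sample taxon read counts into proportions that sum to 1."""
    profiles = {}
    for sample, counts in counts_by_sample.items():
        total = sum(counts.values())
        if total == 0:
            # A sample with no classified reads has no defined composition.
            profiles[sample] = {taxon: 0.0 for taxon in counts}
            continue
        profiles[sample] = {taxon: n / total for taxon, n in counts.items()}
    return profiles


# Hypothetical counts for illustration only (not project data).
counts = {
    "sample_A": {"Bacteroides": 600, "Fusobacterium": 300, "Escherichia": 100},
    "sample_B": {"Bacteroides": 50, "Fusobacterium": 0, "Escherichia": 150},
}
profiles = relative_abundance(counts)
```

Because each profile is normalized by that sample's total, samples sequenced at different depths become comparable, which is one common reason to work with proportions rather than raw counts.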
Key Features
- End-to-End Automation: Fully automated processing from raw sequencing data to analytical results
- Reproducible Workflows: Pipeline built with Snakemake for reproducible bioinformatics analysis
- Comprehensive Analysis: Produces taxonomic profiles and microbial composition data
- Integration Ready: Output formatted for downstream machine learning applications
- Machine Learning Modeling: Microbial compositional data analyzed with a LASSO model to predict colorectal cancer
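To illustrate the LASSO idea mentioned in the last feature, here is a small dependency-free sketch. It assumes that "LASSO model" for a binary outcome means L1-penalized logistic regression, and it fits the model by proximal gradient descent; the function names, hyperparameters, and toy abundance values are all hypothetical. A real analysis would more likely rely on an established library than on hand-written code like this.

```python
import math


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the proximal step for an L1 penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def predict_proba(row, weights, bias):
    """Predicted probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, row))
    return 1.0 / (1.0 + math.exp(-z))


def fit_l1_logistic(X, y, penalty=0.02, step=2.0, n_iter=3000):
    """Fit L1-penalized logistic regression by proximal gradient descent."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(n_iter):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            error = predict_proba(row, weights, bias) - label
            grad_b += error / n_samples
            for j, x in enumerate(row):
                grad_w[j] += error * x / n_samples
        # The bias is not penalized; the weights get a gradient step, then shrinkage.
        bias -= step * grad_b
        weights = [
            soft_threshold(w - step * g, step * penalty)
            for w, g in zip(weights, grad_w)
        ]
    return weights, bias


# Hypothetical relative abundances of two taxa (illustration only).
# Taxon 0 differs between groups; taxon 1 has the same values in both groups.
X = [
    [0.30, 0.10], [0.25, 0.20], [0.35, 0.10], [0.28, 0.20],  # label 1
    [0.05, 0.10], [0.02, 0.20], [0.08, 0.10], [0.04, 0.20],  # label 0
]
y = [1, 1, 1, 1, 0, 0, 0, 0]
weights, bias = fit_l1_logistic(X, y)
```

In this toy setup the penalty keeps the weight of the uninformative taxon at exactly zero while the informative taxon receives a positive weight, which is the feature-selection behavior that makes LASSO attractive for high-dimensional abundance tables.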
⚡ Key Contributions
Developed fully automated upstream and downstream pipelines featuring:
- Automated quality control and adapter trimming
- Taxonomic classification using QIIME2
- Generation of microbial abundance profiles
- Exploratory analysis of microbial abundance profiles
- Building of a LASSO model for colorectal cancer detection
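The quality control and adapter trimming step listed above can be sketched in plain Python as follows. This is illustrative only: the adapter sequence, the Phred+33 encoding, the thresholds, and all function names are assumptions, and the project's pipeline may well have delegated this work to dedicated tools rather than custom code.

```python
ADAPTER = "AGATCGGAAGAGC"  # assumed adapter sequence, for illustration


def parse_fastq(lines):
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    it = (line.rstrip("\n") for line in lines)
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header, seq, qual


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter, if present."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


def clean_reads(lines, adapter, min_mean_quality=20, min_length=10):
    """Trim adapters, then drop reads that are too short or too low in quality."""
    kept = []
    for header, seq, qual in parse_fastq(lines):
        seq, qual = trim_adapter(seq, qual, adapter)
        if len(seq) < min_length:
            continue
        if mean_phred(qual) < min_mean_quality:
            continue
        kept.append((header, seq, qual))
    return kept


# Three hypothetical reads: one good, one low quality, one too short after trimming.
fastq_text = "\n".join([
    "@read1", "ACGTACGTACGT" + ADAPTER, "+", "I" * 25,
    "@read2", "ACGTACGTACGTACGT", "+", "#" * 16,
    "@read3", "ACGT" + ADAPTER, "+", "I" * 17,
])
kept = clean_reads(fastq_text.splitlines(), ADAPTER)
```

Only exact adapter matches are handled here; real trimmers also tolerate mismatches and partial adapters at the read end, which is one reason to prefer them in practice.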
Skills Applied
Python, R, Signal Processing
Tools/Libraries Used
pandas, matplotlib, ggplot2, qiime2, snakemake
