Fetching taxonomic information for any micro-bacterium using Python
python
statistics
bioinformatics
Author
Pankaj Chejara
Published
July 25, 2025
In a recent project, I needed to devise a way to harmonize different gut-based biomarkers for colorectal cancer found in the literature. Since studies report their results at different taxonomic levels (e.g., species, genus), synthesizing the literature required additional information for each reported biomarker.
To address this, I decided to complement each reported biomarker with its lineage information.
In this post, I will walk you through a Python script that retrieves the name of a given bacterium at each taxonomic level.
Goal: Develop a script that fetches the complete lineage of a given bacterial species.
Solution: A quick search on the internet led me to ete3, a Python package that provides access to the NCBI taxonomy database. In the NCBI database, each taxonomic name is stored along with a unique identifier (i.e., Taxa ID). This Taxa ID is what we need to obtain first before fetching the lineage. Note that on first use, ete3 downloads the NCBI taxonomy dump and builds a local database, so the initial run takes a while.
First Version
The first version simply fetches the ID of the given bacterium (or any other organism) and then uses that ID to obtain its lineage information.
from ete3 import NCBITaxa

# Initialize NCBI
ncbi = NCBITaxa()


def get_taxonomic_annotation(name):
    """
    Function to fetch information about given bacterium at different taxonomic levels.

    Args:
    ----
        name: str
            Name of bacterium

    Returns:
        dict: A dictionary containing information about each taxonomic level
    """
    try:
        # Get taxid for the name
        name2taxid = ncbi.get_name_translator([name])
        if name not in name2taxid:
            return f"Taxon '{name}' not found."

        # Extracting taxa id
        taxid = name2taxid[name][0]

        # Get lineage
        lineage = ncbi.get_lineage(taxid)

        # Get ranks and names
        names = ncbi.get_taxid_translator(lineage)
        ranks = ncbi.get_rank(lineage)

        # Build result dictionary
        full_annotation = {ranks[t]: names[t] for t in lineage if ranks[t] != 'no rank'}
        return full_annotation
    except Exception as e:
        return f"Error: {str(e)}"


get_taxonomic_annotation('Streptococcus vestibularis')
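Since the aim was to compare biomarkers reported at different levels, one possible use of the returned dictionary is to group the reported names by a shared rank. The sketch below is an illustration, not the project's actual code: group_by_rank is a hypothetical helper, and the small hand-written dictionaries stand in for real return values (which contain more ranks). Entries for which the lookup returned a message instead of a dictionary are skipped.

```python
from collections import defaultdict


def group_by_rank(annotations, rank="genus"):
    """Group reported names by their name at the given rank.

    `annotations` maps each reported name to the value returned by
    get_taxonomic_annotation: a dict on success, a message string otherwise.
    """
    groups = defaultdict(list)
    for reported_name, lineage in annotations.items():
        # Skip failed lookups (strings) and lineages lacking this rank
        if isinstance(lineage, dict) and rank in lineage:
            groups[lineage[rank]].append(reported_name)
    return dict(groups)


# Hand-written stand-ins for real results, trimmed to at most two ranks
annotations = {
    "Streptococcus vestibularis": {
        "genus": "Streptococcus",
        "species": "Streptococcus vestibularis",
    },
    "Streptococcus gallolyticus": {
        "genus": "Streptococcus",
        "species": "Streptococcus gallolyticus",
    },
    "Fusobacterium": {"genus": "Fusobacterium"},
    "Streptocous vestibularis": "Taxon 'Streptocous vestibularis' not found.",
}

print(group_by_rank(annotations))
```

Grouping at the genus level puts a species-level report and a genus-level report on the same footing, which is the kind of harmonization the lineage information is meant to support.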
The initial script could not handle bacterial names containing typos, which were common in the dataset extracted from research papers.
Second Version with Fuzzy Matching
To address this, I enhanced the script with a fuzzy matching step. If an exact match for a given name is not found, the script falls back to the closest matching bacterial name (based on a string similarity score) and returns its lineage.
For fuzzy matching, I used the rapidfuzz package. Below is the code implementing fuzzy matching.
from rapidfuzz import process  # use this instead of fuzzywuzzy for speed

# Load all available taxon names (only once)
all_taxa_names = list(
    ncbi.get_taxid_translator(
        ncbi.get_descendant_taxa(1, collapse_subspecies=False)
    ).values()
)


def get_closest_taxon_name(query_name, threshold=80):
    match, score, _ = process.extractOne(query_name, all_taxa_names)
    if score >= threshold:
        return match
    return None
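To see what the threshold does without loading the full taxonomy, here is a dependency-free sketch that swaps rapidfuzz for the standard library's difflib. It is not the code used in this post: closest_name is a hypothetical helper, the candidate list is a tiny hand-picked sample, and difflib scores range from 0 to 1 rather than the 0 to 100 scale that the threshold of 80 above refers to. Returning the score along with the match makes it easier to review each correction.

```python
from difflib import SequenceMatcher


def closest_name(query, candidates, threshold=0.8):
    """Return (best_match, score), or (None, score) if below the threshold."""
    best, best_score = None, 0.0
    for candidate in candidates:
        # ratio() is a similarity in [0, 1]; 1.0 means identical strings
        score = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score


# Tiny hand-picked sample instead of the full list of NCBI names
candidates = [
    "Streptococcus vestibularis",
    "Streptococcus gallolyticus",
    "Fusobacterium nucleatum",
]

# A misspelled name that is close to one candidate
print(closest_name("Streptocous vestibularis", candidates))
# A name that resembles none of the candidates
print(closest_name("Escherichia coli", candidates))
```

A higher threshold rejects more typos but also lowers the chance of silently mapping a name to the wrong taxon, so it is worth checking accepted matches by eye for a sample of the data.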
Next, one minor change is made to the get_taxonomic_annotation function: if an exact match is not found, it looks up the closest matching name and returns that taxon's lineage.
def get_taxonomic_annotation(name, fuzzy=True, threshold=80):
    try:
        # Try direct match first
        name2taxid = ncbi.get_name_translator([name])
        if name not in name2taxid:
            # If not found, use fuzzy matching
            if fuzzy:
                corrected_name = get_closest_taxon_name(name, threshold)
                if not corrected_name:
                    return f"No close match found for '{name}'."
                name = corrected_name
                name2taxid = ncbi.get_name_translator([name])
            else:
                return f"Taxon '{name}' not found."

        taxid = name2taxid[name][0]
        lineage = ncbi.get_lineage(taxid)
        names = ncbi.get_taxid_translator(lineage)
        ranks = ncbi.get_rank(lineage)
        full_annotation = {ranks[t]: names[t] for t in lineage if ranks[t] != 'no rank'}
        return full_annotation
    except Exception as e:
        return f"Error: {str(e)}"


get_taxonomic_annotation('Streptocous vestibularis')
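When annotating a whole list of names, it helps to keep failures separate, because the function above returns its error message as a plain string rather than raising an exception. Below is a minimal sketch, assuming the annotation function is passed in as the annotate argument. Both annotate_many and the stub used for the demonstration are hypothetical; the stub only mimics the return types so that the example runs without the NCBI database.

```python
def annotate_many(names, annotate):
    """Run `annotate` on each name; split lineages from error messages."""
    lineages, failures = {}, {}
    for name in names:
        result = annotate(name)
        if isinstance(result, dict):
            lineages[name] = result
        else:
            # Anything else is the message string returned on failure
            failures[name] = result
    return lineages, failures


def stub_annotate(name):
    """Stand-in for get_taxonomic_annotation so the sketch runs offline."""
    known = {"Fusobacterium": {"genus": "Fusobacterium"}}
    return known.get(name, f"No close match found for '{name}'.")


lineages, failures = annotate_many(["Fusobacterium", "xyz"], stub_annotate)
print(lineages)
print(failures)
```

With ete3 set up, the same helper could be called as annotate_many(names, get_taxonomic_annotation), and the failures dictionary then lists the names that still need manual curation.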