In a recent project, I needed a way to harmonize the different gut-based biomarkers for colorectal cancer reported in the literature. Because studies report their results at different taxonomic levels (e.g., species, genus), synthesizing the literature required additional information for each reported biomarker.

To address this, I decided to complement each reported biomarker by incorporating lineage information.

In this post, I will walk you through a Python script that retrieves the full taxonomic lineage of any bacterium.

Goal: Develop a script that fetches the complete lineage of a given bacterial species.

Solution: A quick internet search led me to ete3, a Python package that provides access to the NCBI taxonomy database. In the NCBI database, each taxonomic name is stored along with a unique identifier (the taxonomy ID, or taxid). This taxid is what we need to obtain first, before fetching the lineage.

First Version

The first version simply fetches the ID of the given bacterium (or other organism) and then uses that ID to obtain its lineage information.

from ete3 import NCBITaxa

# Initialize NCBI (on first use, ete3 downloads the taxonomy dump and builds a local database)
ncbi = NCBITaxa()

def get_taxonomic_annotation(name):
    """
    Fetch the name of a given bacterium at each taxonomic level.

    Args:
        name (str): Scientific name of the bacterium.

    Returns:
        dict: Maps each rank (e.g., 'genus') to the taxon name at that rank.
        str: An error message if the name is not found or the lookup fails.
    """
    try:
        # Get taxid for the name
        name2taxid = ncbi.get_name_translator([name])
        if name not in name2taxid:
            return f"Taxon '{name}' not found."

        # Extracting taxa id
        taxid = name2taxid[name][0]

        # Get lineage
        lineage = ncbi.get_lineage(taxid)

        # Get ranks and names
        names = ncbi.get_taxid_translator(lineage)
        ranks = ncbi.get_rank(lineage)

        # Build result dictionary
        full_annotation = {ranks[t]: names[t] for t in lineage if ranks[t] != 'no rank'}
        return full_annotation

    except Exception as e:
        return f"Error: {str(e)}"

get_taxonomic_annotation('Streptococcus vestibularis')

{'cellular root': 'cellular organisms',
 'domain': 'Bacteria',
 'kingdom': 'Bacillati',
 'phylum': 'Bacillota',
 'class': 'Bacilli',
 'order': 'Lactobacillales',
 'family': 'Streptococcaceae',
 'genus': 'Streptococcus',
 'species': 'Streptococcus vestibularis'}
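
One detail of the dictionary-building step is worth spelling out: the result is keyed by rank, so if a lineage contains the same rank more than once (NCBI lineages can include several 'clade' entries, for example), later entries overwrite earlier ones. The sketch below illustrates this with hard-coded stand-in data instead of a live database call, and shows one possible alternative that keeps only a fixed set of ranks. The names build_annotation and KEEP_RANKS are hypothetical, and the taxids and clade names are invented for illustration.

```python
# Stand-ins for the return values of ncbi.get_lineage, ncbi.get_taxid_translator
# and ncbi.get_rank. The IDs and the two "clade" entries are invented.
lineage = [1, 10, 20, 30, 40, 50]
names = {
    1: "root",
    10: "Bacteria",
    20: "Example clade A",
    30: "Example clade B",
    40: "Streptococcus",
    50: "Streptococcus vestibularis",
}
ranks = {1: "no rank", 10: "domain", 20: "clade", 30: "clade", 40: "genus", 50: "species"}

# Ranks to keep (an assumption; adjust to the ranks your database version reports).
KEEP_RANKS = ("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species")

def build_annotation(lineage, names, ranks, keep=KEEP_RANKS):
    """Map each kept rank to its taxon name, ignoring all other ranks."""
    annotation = {}
    for taxid in lineage:
        rank = ranks.get(taxid, "no rank")
        if rank in keep:
            annotation[rank] = names[taxid]
    return annotation

# Same comprehension as in the post: the second "clade" overwrites the first.
naive = {ranks[t]: names[t] for t in lineage if ranks[t] != "no rank"}
filtered = build_annotation(lineage, names, ranks)

print(naive)
print(filtered)
```

Whether dropping the extra ranks is acceptable depends on the use case; for comparing biomarkers at species or genus level, the fixed set is usually enough.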

Challenge

The initial script could not handle bacterial names containing typos, which were common in the dataset extracted from research papers.

Second Version with Fuzzy Matching

To address this, I enhanced the script with a fuzzy matching step. If an exact match for a given name is not found, the script returns the lineage of the closest matching name (based on a string similarity score).

For fuzzy matching, I used the rapidfuzz package. Below is the code implementing it.

from rapidfuzz import process  # use this instead of fuzzywuzzy for speed

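# Load all available taxon names (only once; the list is large and can take a while to build).
# Note: by default get_descendant_taxa returns only leaf taxa, so a fuzzy match
# may resolve to a strain-level name rather than to the species itself.
all_taxa_names = list(ncbi.get_taxid_translator(ncbi.get_descendant_taxa(1, collapse_subspecies=False)).values())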

def get_closest_taxon_name(query_name, threshold=80):
    match, score, _ = process.extractOne(query_name, all_taxa_names)
    if score >= threshold:
        return match
    return None
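
For readers who want to try the threshold logic without installing rapidfuzz or loading the NCBI names, here is a minimal sketch of the same idea using the standard library's difflib instead. This is a swapped-in technique, not what the script above uses: difflib scores similarity on a 0 to 1 scale with a different algorithm, so its scores will not match those from rapidfuzz. The candidate list and the function name closest_name_difflib are made up for illustration.

```python
from difflib import get_close_matches

# Tiny illustrative candidate list; the real script matches against NCBI names.
candidates = [
    "Streptococcus vestibularis",
    "Streptococcus salivarius",
    "Staphylococcus aureus",
]

def closest_name_difflib(query_name, choices, threshold=80):
    """Return the most similar name in `choices`, or None if nothing is close enough."""
    # difflib works on a 0-1 scale, so convert the 0-100 threshold.
    matches = get_close_matches(query_name, choices, n=1, cutoff=threshold / 100)
    return matches[0] if matches else None

# A misspelled name is mapped to its closest candidate.
print(closest_name_difflib("Streptocous vestibularis", candidates))

# A name with no similar candidate yields None.
print(closest_name_difflib("Escherichia coli", candidates))
```

As with the rapidfuzz version, the threshold is a trade-off: a lower value tolerates more typos but raises the risk of silently matching the wrong organism.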

Next, one minor change is made to the get_taxonomic_annotation function: if an exact match is not found, it looks up the ID of the closest matching taxon instead.

def get_taxonomic_annotation(name, fuzzy=True, threshold=80):
    try:
        # Try direct match first
        name2taxid = ncbi.get_name_translator([name])

        if name not in name2taxid:
            # If not found, use fuzzy matching
            if fuzzy:
                corrected_name = get_closest_taxon_name(name, threshold)
                if not corrected_name:
                    return f"No close match found for '{name}'."
                name = corrected_name
                name2taxid = ncbi.get_name_translator([name])
            else:
                return f"Taxon '{name}' not found."

        taxid = name2taxid[name][0]
        lineage = ncbi.get_lineage(taxid)
        names = ncbi.get_taxid_translator(lineage)
        ranks = ncbi.get_rank(lineage)

        full_annotation = {ranks[t]: names[t] for t in lineage if ranks[t] != 'no rank'}
        return full_annotation

    except Exception as e:
        return f"Error: {str(e)}"

get_taxonomic_annotation('Streptocous vestibularis')

{'cellular root': 'cellular organisms',
 'domain': 'Bacteria',
 'kingdom': 'Bacillati',
 'phylum': 'Bacillota',
 'class': 'Bacilli',
 'order': 'Lactobacillales',
 'family': 'Streptococcaceae',
 'genus': 'Streptococcus',
 'species': 'Streptococcus vestibularis',
 'strain': 'Streptococcus vestibularis F0396'}
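
Because get_taxonomic_annotation returns a dictionary on success and a message string on failure, applying it to a whole list of names requires separating the two kinds of result. The sketch below shows one way to do that. The helper annotate_many is a hypothetical name, and fake_annotate is a stand-in with invented behavior so the example runs without the NCBI database; in practice you would pass get_taxonomic_annotation instead.

```python
def annotate_many(names, annotate):
    """Run `annotate` on each name and split results into successes and failures."""
    annotated, failed = {}, {}
    for name in names:
        result = annotate(name)
        if isinstance(result, dict):
            annotated[name] = result   # lineage dictionary
        else:
            failed[name] = result      # error or "not found" message
    return annotated, failed

def fake_annotate(name):
    # Hypothetical stand-in for get_taxonomic_annotation, with one known taxon.
    known = {
        "Streptococcus vestibularis": {
            "genus": "Streptococcus",
            "species": "Streptococcus vestibularis",
        }
    }
    return known.get(name, f"No close match found for '{name}'.")

annotated, failed = annotate_many(
    ["Streptococcus vestibularis", "Not a bacterium"], fake_annotate
)

# Collect the genus of every successfully annotated name.
genera = {name: lineage.get("genus") for name, lineage in annotated.items()}
print(genera)
print(failed)
```

Keeping the failures in their own dictionary makes it easy to review the names that need manual correction.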

I hope you find this implementation helpful for your bioinformatics projects!