PredCMB: Predicting Changes in Microbial metaBolites from shotgun metagenome data

This page provides the downloads and documentation for PredCMB. The PredCMB software is presented in "PredCMB: Predicting changes in microbial metabolites based on the gene-metabolic network analysis of shotgun metagenome data".

Introduction

PredCMB is a method to predict changes in individual metabolites from shotgun metagenome data by utilizing enzymatic gene-metabolite networks. The differential abundances of enzymatic genes are evaluated between two given conditions, and their contributions to the production and consumption of individual metabolites are evaluated by leveraging the structure of enzymatic gene-metabolite networks. PredCMB provides predictions that correlate with actual measurements of metabolites, and can identify classes of metabolites that show major changes. This implementation of PredCMB is freely available to non-commercial users.
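To make the idea of combining enzyme changes over a network concrete, here is a deliberately simplified toy sketch. It is not the PredCMB algorithm: the scoring rule (adding the log2 fold changes of producing enzymes and subtracting those of consuming enzymes), the function name, and all IDs and values are assumptions made up for illustration.

```python
from collections import defaultdict


def toy_metabolite_scores(log2_fold_changes, network):
    """Toy score per metabolite: producers add their change, consumers subtract it.

    This is an illustrative assumption, not the scoring used by PredCMB.
    """
    scores = defaultdict(float)
    for enzyme, metabolite, role in network:
        if role not in ("produces", "consumes"):
            raise ValueError(f"unknown role: {role}")
        change = log2_fold_changes.get(enzyme)
        if change is None:
            continue  # enzyme not observed in the abundance data
        sign = 1.0 if role == "produces" else -1.0
        scores[metabolite] += sign * change
    return dict(scores)


# Hypothetical enzyme and metabolite IDs with made-up fold changes.
log2_fold_changes = {"enzyme_A": 1.5, "enzyme_B": -0.5, "enzyme_C": 2.0}
network = [
    ("enzyme_A", "metabolite_X", "produces"),
    ("enzyme_B", "metabolite_X", "consumes"),
    ("enzyme_C", "metabolite_Y", "consumes"),
]

scores = toy_metabolite_scores(log2_fold_changes, network)
print(scores)
```

In this toy example, metabolite_X receives a positive score because its producer increases while its consumer decreases, and metabolite_Y receives a negative score because its only linked enzyme consumes it and increases.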

Requirements

The programs on this page were developed and tested in a Python environment (version 3.10) with several packages, including pydeseq2.

Additionally, you may need HUMAnN to pre-process your metagenome data into an input-ready form.

Downloads

PredCMB code: Python code to run PredCMB and to summarize its results. This zip file includes a single Python file (predcmb.py).

Enzyme-metabolite information: A file of curated information on compounds and the enzymatic gene families that are involved in their production and consumption. For enzymes that catalyze bidirectional reactions, the reaction was fixed to one direction using eQuilibrator. Enzyme information is given as UniRef IDs and metabolite information as KEGG IDs. This information was curated from the November 2021 versions of KEGG and UniRef90.

Metabolite & metabolite class information: Files that map IDs to metabolite names and metabolite classes.

Preparing PredCMB

1. Download all the files into your working directory.

2. Unzip the zipped files.

Input data preparation

The main input to PredCMB is a file that contains the abundances of enzymatic gene families for each sample. This file can be obtained from the output of HUMAnN, which estimates the abundance of sequence reads mapped to each enzymatic gene family from the FASTQ files of shotgun metagenome sequencing data. The following steps prepare the PredCMB input from the enzymatic gene family abundance output of HUMAnN. (Please refer to the HUMAnN manual for these steps if necessary.)

0. Use HUMAnN with your shotgun metagenome FASTQ files to make an enzymatic gene family abundance file.

1. Re-normalize the enzymatic gene family abundances to CoPM (copies per million) values.

$ humann_renorm_table --input gene_families.tsv --units "cpm" --output gene_families_cpm.tsv

2. Make an unstratified version of the enzymatic gene family abundance file.

$ humann_split_stratified_table --input gene_families_cpm.tsv --output /output_directory/
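The two preparation steps above can be pictured with a small in-memory sketch. It assumes that stratified rows are the ones whose feature ID contains a "|" followed by a taxon label, and that copies per million means scaling the community-level values of a sample so they sum to one million. The feature IDs and numbers are hypothetical, and the HUMAnN utilities remain the tools to use on real data; how they treat special rows is not covered here.

```python
def to_cpm(abundances):
    """Scale one sample's abundances so that they sum to one million."""
    total = sum(abundances)
    if total == 0:
        return [0.0 for _ in abundances]
    return [value / total * 1_000_000 for value in abundances]


def is_unstratified(feature_id):
    """Community-level rows carry no '|taxon' suffix in the feature ID."""
    return "|" not in feature_id


# Hypothetical (feature ID, abundance) rows for a single sample.
rows = [
    ("UniRef90_hypotheticalA", 30.0),
    ("UniRef90_hypotheticalA|g__GenusOne.s__species_one", 20.0),
    ("UniRef90_hypotheticalA|unclassified", 10.0),
    ("UniRef90_hypotheticalB", 10.0),
    ("UniRef90_hypotheticalB|g__GenusOne.s__species_one", 10.0),
]

# Keep community-level rows only, then rescale them to copies per million.
community_rows = [(feature, value) for feature, value in rows if is_unstratified(feature)]
cpm_values = to_cpm([value for _, value in community_rows])

for (feature, _), cpm in zip(community_rows, cpm_values):
    print(feature, cpm)
```

The sketch filters before rescaling only to keep the arithmetic short; on real data, run the two HUMAnN commands in the order given above.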

How to run PredCMB

$ python predcmb.py $INPUTDATA $METADATA $CONTROL $EXPERIMENT $OUTPUT

$INPUTDATA Enzymatic gene family abundance file in .tsv format

$METADATA Sample group information file in .tsv format (Check the example file below for its format.)

$CONTROL The group name for control samples

$EXPERIMENT The group name for experiment samples (target of analysis)

$OUTPUT Result folder name

To see the complete list of optional arguments, use the "--help" option.
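Before a run, it can help to confirm that the group names passed as $CONTROL and $EXPERIMENT actually occur in the metadata file. The sketch below is a hypothetical pre-flight check, not part of predcmb.py. It assumes a two-column, tab-delimited layout (sample ID, then group name) with one header line; the real layout is defined by the example metadata file, so adjust the column index if it differs.

```python
import csv
import io


def count_group_samples(metadata_text, control, experiment):
    """Count samples per requested group; raise if a group has no samples.

    Assumes column 0 is the sample ID and column 1 is the group name,
    with one header line (an assumption, not the documented format).
    """
    reader = csv.reader(io.StringIO(metadata_text), delimiter="\t")
    next(reader, None)  # skip the assumed header line
    counts = {control: 0, experiment: 0}
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        group = row[1].strip()
        if group in counts:
            counts[group] += 1
    missing = [name for name, n in counts.items() if n == 0]
    if missing:
        raise ValueError(f"no samples found for group(s): {missing}")
    return counts


# Hypothetical metadata content, for illustration only.
metadata_text = "sample\tgroup\nS1\tControl\nS2\tControl\nS3\tExperiment\n"
counts = count_group_samples(metadata_text, "Control", "Experiment")
print(counts)
```

For a file on disk, the text can be read with open(path).read() and passed to the same function.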

Example files

Input format example

genefamilies_ex.txt Tab-delimited text file with enzymatic gene family abundances.

metadata_ex.txt Tab-delimited text file with sample group specification.

Example run:

$ python predcmb.py genefamilies_ex.txt metadata_ex.txt Control Experiment example_run

Using HUMAnN with a custom EC-filtered ChocoPhlAn DB

Preparing input for PredCMB requires HUMAnN to generate the abundance information of enzymatic gene families from raw FASTQ files. A conventional HUMAnN run profiles all gene families, whereas PredCMB needs the abundance information of enzymatic gene families only.

To restrict a HUMAnN run to enzymatic gene families, a custom ChocoPhlAn DB containing only enzymatic gene family sequences is needed for the nucleotide sequence alignment step, and an EC-filtered UniRef DB should be used for the amino acid sequence alignment step. HUMAnN provides EC-filtered UniRef90/50 DBs, but no custom ChocoPhlAn DB restricted to enzymatic gene families.

HUMAnN can be run with the EC-filtered UniRef DBs alone, but adding a custom EC-filtered ChocoPhlAn DB can further reduce its running time. For this purpose, we built a custom EC-filtered ChocoPhlAn DB from the original ChocoPhlAn DB of HUMAnN (original ChocoPhlAn download date: April 12, 2024).

Running HUMAnN with this custom EC-filtered ChocoPhlAn DB can reduce the time needed to prepare input for PredCMB. (Roughly 1/3 of the conventional running time; this can vary depending on the input data.)

NOTE: Using this custom EC-filtered ChocoPhlAn DB can reduce the performance of PredCMB, because false-positive sequence alignments to the enzymatic gene families can occur. (See the Supplementary material of the publication.)

Downloads (needs both):

EC-filtered ChocoPhlAn DB (about 11 GB)

Annotation file (about 200 MB)

How to use:

1. Download both the EC-filtered ChocoPhlAn DB and the annotation file.

2. Unzip the files.

3. Run HUMAnN with the downloaded EC-filtered ChocoPhlAn DB and the annotation file, while setting the protein-database option to the EC-filtered UniRef DB. (Please refer to the HUMAnN documentation for detailed configuration.)

Example)

$ humann --input input.fastq --output output_name --bypass-nucleotide-index --nucleotide-database <path to the folder of EC-filtered ChocoPhlAn DB> --id-mapping <path to the annotation file> --protein-database <path to the HUMAnN provided EC-filtered UniRef DB>
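When the same call is repeated for many FASTQ files, it may be convenient to assemble the command from a script. The following is a minimal sketch that builds the argument list with the options shown in the example above. The helper name and all paths are placeholders, and the sketch only prints the command rather than running it.

```python
import shlex


def build_humann_command(fastq, output_dir, chocophlan_dir, annotation_file, uniref_dir):
    """Return the HUMAnN argument list with the options from the example above."""
    return [
        "humann",
        "--input", fastq,
        "--output", output_dir,
        "--bypass-nucleotide-index",
        "--nucleotide-database", chocophlan_dir,
        "--id-mapping", annotation_file,
        "--protein-database", uniref_dir,
    ]


# Placeholder paths; replace them with the locations of your own files.
command = build_humann_command(
    "input.fastq",
    "output_name",
    "/path/to/ec_filtered_chocophlan",
    "/path/to/annotation_file",
    "/path/to/ec_filtered_uniref",
)
print(shlex.join(command))
```

To execute the command, the list can be passed to subprocess.run(command, check=True) once HUMAnN is installed and the paths point to real files.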