PredCMB: Predicting Changes in Microbial metaBolites from shotgun metagenome data

This page provides the downloads and documentation for PredCMB. The PredCMB software is presented in "PredCMB: Predicting changes in microbial metabolites based on the gene-metabolic network analysis of shotgun metagenome data".

Introduction

PredCMB is a method to predict changes in individual metabolites from shotgun metagenome data by utilizing enzymatic gene-metabolite networks. The differential abundance of enzymatic genes is evaluated between two given conditions, and their contribution to the production of individual metabolites is evaluated by leveraging the structure of enzymatic gene-metabolite networks. PredCMB provides predictions that correlate with actual measurements of metabolites, and can identify classes of metabolites that show major changes. This implementation of PredCMB is freely available to non-commercial users.

Requirements

The programs on this page were developed and tested in the R environment (version 4.0.2) with the following packages:

1. DESeq2 (version 1.30)

2. piano (version 2.6.0)

Python (version >= 3.7.7) is also required to use the optionally provided Python script.

Additionally, you may need HUMAnN to pre-process your metagenome data into input-ready form.

Downloads

PredCMB scripts: Scripts to run PredCMB and to summarize its result. This zip file includes the following three files.

run_prediction.R        R script to run PredCMB

run_summary.R          R script to summarize the metabolite-level result of PredCMB into metabolite class-level information

convert_abundance2int.py        Optional Python script that can be used to prepare integer CPM values.

Enzyme-metabolite information: A file of curated enzyme-metabolite pairs, which describes the enzymes that contribute to the production of each metabolite. For enzymes that may catalyze bidirectional reactions, the reaction was fixed to one direction using eQuilibrator. Enzymes are identified by UniRef ID and metabolites by KEGG ID. The enzyme-metabolite information was curated from the November 2021 versions of KEGG and UniRef90.
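As an illustration of how enzyme-metabolite pairs can be organized for lookup, the following minimal Python sketch groups pairs by metabolite. The column layout of the provided file is not described on this page, so the sketch works on an in-memory list. The identifiers are placeholders written in UniRef90 and KEGG compound ID style, not entries taken from the provided file, and the function name is hypothetical.

```python
from collections import defaultdict

# Placeholder (enzyme ID, metabolite ID) pairs for illustration only.
# They are NOT taken from the provided enzyme-metabolite file.
pairs = [
    ("UniRef90_EXAMPLE_A", "C00001"),
    ("UniRef90_EXAMPLE_B", "C00001"),
    ("UniRef90_EXAMPLE_B", "C00002"),
]


def group_enzymes_by_metabolite(pairs):
    """Map each metabolite ID to the sorted list of enzyme IDs paired with it."""
    grouped = defaultdict(set)
    for enzyme_id, metabolite_id in pairs:
        grouped[metabolite_id].add(enzyme_id)
    # Convert sets to sorted lists so the result is deterministic.
    return {metabolite: sorted(enzymes) for metabolite, enzymes in grouped.items()}


enzymes_by_metabolite = group_enzymes_by_metabolite(pairs)
for metabolite_id, enzyme_ids in sorted(enzymes_by_metabolite.items()):
    print(metabolite_id, enzyme_ids)
```

A mapping of this shape makes it easy to see which enzymatic gene families are associated with a given metabolite. It is only a data-handling sketch and does not reproduce the network analysis performed by PredCMB.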

Metabolite & metabolite class information: Files that map IDs to metabolite names and metabolite classes.

Preparing PredCMB

1. Download all the files into your working directory.

2. Unzip the zipped files.

Input data preparation

A major input to PredCMB is a file that contains the abundances of enzymatic gene families for each sample. This file can be obtained from the output files generated by HUMAnN, which estimates the abundance of sequence reads mapped to each enzymatic gene family from the FASTQ files of shotgun metagenome sequencing data. The following steps prepare the input for PredCMB from the enzymatic gene family abundance output of HUMAnN. (Please refer to the HUMAnN manual for steps 0 to 2 if necessary.)

0. Use HUMAnN with your shotgun metagenome FASTQ files to make an enzymatic gene family abundance file.

1. Re-normalize the enzymatic gene family abundances to CPM (copies per million) values.

$ humann_renorm_table --input gene_families.tsv --units "cpm" --output gene_families_cpm.tsv

2. Make an unstratified version of the enzymatic gene family abundance file.

$ humann_split_stratified_table --input gene_families_cpm.tsv --output /output_directory/

3. Using the unstratified version of the enzymatic gene family abundance file (for example, "genefamilies_cpm_unstratified.tsv"), convert the CPM values into integer values so that they can be used with DESeq2. If necessary, the optionally provided Python script can be used for this step as follows:

$ python convert_abundance2int.py -i genefamilies_cpm_unstratified.tsv -o genefamilies_cpm_unstratified_int.tsv
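The arithmetic behind steps 1 and 3 can be sketched in Python as follows. This is a simplified illustration, not the implementation of humann_renorm_table or convert_abundance2int.py. It assumes that CPM scaling means rescaling each sample so that its values sum to one million, and that integer conversion means rounding to the nearest integer. The function names and sample values are hypothetical.

```python
import csv
import io


def to_cpm(column):
    """Rescale one sample's abundances so that they sum to one million."""
    total = sum(column)
    if total == 0:
        # Avoid division by zero for an empty sample.
        return [0.0 for _ in column]
    return [value / total * 1_000_000 for value in column]


def convert_table_to_int(tsv_text):
    """Round every abundance in a tab-delimited table to the nearest integer.

    The header row and the first column (feature IDs) are kept unchanged.
    Assumption: simple rounding; the provided script may behave differently.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(next(reader))  # copy the header row as-is
    for row in reader:
        rounded = [str(int(round(float(value)))) for value in row[1:]]
        writer.writerow([row[0]] + rounded)
    return out.getvalue()


# Hypothetical two-feature sample: 1 and 3 copies become 250000 and 750000 CPM.
cpm_values = to_cpm([1, 3])

# Hypothetical table with placeholder feature IDs and sample names.
example_table = (
    "# Gene Family\tsampleA\tsampleB\n"
    "UniRef90_EXAMPLE_A\t12.7\t0.2\n"
    "UniRef90_EXAMPLE_B\t3.1\t8.9\n"
)
int_table = convert_table_to_int(example_table)
print(cpm_values)
print(int_table)
```

Note that under this rounding assumption, features with CPM values below 0.5 become zero counts.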

How to run PredCMB

$ Rscript run_prediction.R -i $INPUTDATA -m $METADATA -o $OUTPUT -r $CONTROL -gb $ENZ_MET -p $PTHRESHOLD -c $CORENUM

$INPUTDATA        Enzymatic gene family abundance file in .tsv format (integer abundances)

 

$METADATA         Sample group information file in .tsv format (Check the example file below for its format.)

 

$OUTPUT             Result file name

 

$CONTROL          The group name for control samples

 

$ENZ_MET           Enzyme-metabolite information file (Use the provided file.)

 

$PTHRESHOLD    Significance threshold for the p-values from the DESeq2 result. Enzymatic gene families whose change relative to the control has a p-value below this threshold are used to predict changes in metabolites.

$CORENUM          The number of CPU cores that will be used during the analysis
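The effect of $PTHRESHOLD can be pictured with a short Python sketch. It is an illustration under assumptions, not code from run_prediction.R: the function name and the p-values are hypothetical, and the handling of missing p-values shown here (excluding them) is an assumption.

```python
def select_significant(p_values, threshold):
    """Keep gene families whose p-value is strictly less than the threshold.

    Missing p-values represented as NaN are dropped, because any comparison
    with NaN evaluates to False.
    """
    return {gene: p for gene, p in p_values.items() if p < threshold}


# Hypothetical p-values keyed by placeholder gene family IDs.
example_p_values = {
    "UniRef90_EXAMPLE_A": 0.001,
    "UniRef90_EXAMPLE_B": 0.05,
    "UniRef90_EXAMPLE_C": 0.2,
    "UniRef90_EXAMPLE_D": float("nan"),
}

selected = select_significant(example_p_values, threshold=0.05)
print(sorted(selected))
```

With a threshold of 0.05, a gene family whose p-value is exactly 0.05 is not selected, since the description above specifies p-values less than the threshold.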

Example files

Input and output example

        genefamilies_ex.tsv        Tab-delimited text file with enzymatic gene family abundances.

        metadata_ex.tsv             Tab-delimited text file with sample group specification.

        output_ex.tsv                  Output file

Example run:

$ Rscript run_prediction.R -i genefamilies_ex.tsv -m metadata_ex.tsv -o output_ex.tsv -r "Control" -gb uniref_com_kerk_oc.tsv -p 0.05 -c 4
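For batch runs, the same command line can be assembled programmatically. The Python sketch below only builds the argument list from the options documented above. The helper name build_predcmb_command is hypothetical and not part of PredCMB, and the sketch does not launch R itself.

```python
import shlex


def build_predcmb_command(input_data, metadata, output, control,
                          enz_met, p_threshold, core_num):
    """Build the argument list for run_prediction.R from its documented options."""
    return [
        "Rscript", "run_prediction.R",
        "-i", input_data,         # enzymatic gene family abundance file
        "-m", metadata,           # sample group information file
        "-o", output,             # result file name
        "-r", control,            # group name for control samples
        "-gb", enz_met,           # enzyme-metabolite information file
        "-p", str(p_threshold),   # p-value threshold
        "-c", str(core_num),      # number of CPU cores
    ]


command = build_predcmb_command(
    "genefamilies_ex.tsv", "metadata_ex.tsv", "output_ex.tsv",
    "Control", "uniref_com_kerk_oc.tsv", 0.05, 4,
)
# Print a shell-safe rendering of the command for inspection.
print(" ".join(shlex.quote(argument) for argument in command))
```

To actually launch the run, the list can be passed to subprocess.run(command, check=True) from the directory that contains run_prediction.R and the input files.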

Development site

Relevant resources and source code under development are available on GitHub:

https://github.com/jungyongji/PredCMB
