Critical Assessment of Function Annotation

CAFA is a community-wide challenge designed to provide a large-scale assessment of computational methods dedicated to predicting protein function.
More information can be found in the CAFA2 paper (Jiang et al., 2016).
This toolset provides an assessment for CAFA submissions based on precision and recall.
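
To make the precision and recall assessment concrete, here is a minimal Python sketch of protein-centric precision, recall and the maximum F-measure over score thresholds, in the style of the metrics described in the CAFA papers. It is not the toolset's code: the function names, the data layout and the toy term labels are assumptions made for illustration, and the propagation of terms to their ontology ancestors is left out.

```python
def precision_recall(predictions, truth, threshold):
    """Protein-centric precision and recall at one score threshold.

    predictions: {protein: {term: score}}, scores in [0, 1] (assumed layout)
    truth:       {protein: set of true terms}; assumed non-empty, and every
                 benchmark protein is assumed to have at least one true term
    """
    precisions = []
    recalls = []
    for protein, true_terms in truth.items():
        scored = predictions.get(protein, {})
        # Terms the method "calls" for this protein at this threshold.
        called = {term for term, score in scored.items() if score >= threshold}
        hits = len(called & true_terms)
        # Precision is averaged only over proteins with at least one call.
        if called:
            precisions.append(hits / len(called))
        # Recall is averaged over every benchmark protein.
        recalls.append(hits / len(true_terms))
    precision = sum(precisions) / len(precisions) if precisions else None
    recall = sum(recalls) / len(recalls)
    return precision, recall


def f_max(predictions, truth, steps=100):
    """Largest harmonic mean of precision and recall over all thresholds."""
    best = 0.0
    for i in range(1, steps + 1):
        precision, recall = precision_recall(predictions, truth, i / steps)
        if precision is None or precision + recall == 0:
            continue  # no calls at this threshold, or nothing correct
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy data with hypothetical protein and term labels.
toy_truth = {"P1": {"term_a", "term_b"}, "P2": {"term_c"}}
toy_predictions = {
    "P1": {"term_a": 0.9, "term_d": 0.4},
    "P2": {"term_c": 0.6},
}

print(precision_recall(toy_predictions, toy_truth, 0.5))
print(f_max(toy_predictions, toy_truth))
```

Averaging precision only over proteins that received a prediction, while averaging recall over all benchmark proteins, means a method is not rewarded for staying silent on difficult targets.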

GitHub repository is here.

Reconstruction of Ancestral Gene Blocks Using Events

ROAGUE is a tool to reconstruct ancestors of gene blocks in prokaryotic genomes. Gene blocks are genes co-located on the chromosome. In many cases, gene blocks are conserved between bacterial species, sometimes as operons, when genes are co-transcribed. The conservation is rarely absolute: gene loss, gain, duplication, block splitting and block fusion are frequently observed.

GitHub repository is here.

Bacteriocin prediction using Word Embedding with Deep Recurrent Neural Networks

Antibiotic resistance is a major public health crisis, and finding new sources of antimicrobial drugs is crucial to solving it. Bacteriocins, which are bacterially produced antimicrobial peptide products, are candidates for broadening our pool of antimicrobials. The discovery of new bacteriocins by genomic mining is hampered by their sequences' low complexity and high variance, which frustrates sequence similarity-based searches. Here we use word embeddings of protein sequences to represent bacteriocins, and subsequently apply Recurrent Neural Networks and Support Vector Machines to predict novel bacteriocins from protein sequences without using sequence similarity. We developed a word embedding method that accounts for sequence order, providing a better classification than a simple summation of the same word embeddings. We use the UniProt/TrEMBL database to acquire the word embeddings, taking advantage of a large volume of unlabeled data. Our method predicts, with a high probability, six as yet unknown putative bacteriocins in Lactobacillus. More generally, representing sequences with word embeddings that preserve sequence order information can be applied to protein classification problems for which sequence homology cannot be used.
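
The difference between summing word embeddings and keeping them in sequence order can be illustrated with a small Python sketch. Everything here is an assumption made for illustration and not the published method: the overlapping k-mer "words", the default word length, and the `toy_embedding` function, which stands in for vectors that would really be learned from unlabeled sequences.

```python
def kmers(sequence, k=3):
    """Split a protein sequence into overlapping k-mer 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_embedding(kmer, dim=4):
    """Deterministic stand-in for a learned embedding vector (hypothetical)."""
    seed = sum((i + 1) * ord(c) for i, c in enumerate(kmer))
    return [(seed * (j + 1)) % 11 for j in range(dim)]


def summed_representation(sequence, k=3):
    """One fixed-length vector: the element-wise sum of all word vectors.
    The order in which the words occur is lost."""
    vectors = [toy_embedding(word) for word in kmers(sequence, k)]
    return [sum(column) for column in zip(*vectors)]


def ordered_representation(sequence, k=3):
    """One vector per word, kept in sequence order. This is the kind of
    input a recurrent network can consume step by step."""
    return [toy_embedding(word) for word in kmers(sequence, k)]


# Two toy sequences containing the same 2-mers in a different order.
first, second = "AAGA", "AGAA"

print(kmers(first, 2), kmers(second, 2))
print(summed_representation(first, 2) == summed_representation(second, 2))
print(ordered_representation(first, 2) == ordered_representation(second, 2))
```

The two toy sequences are indistinguishable after summation but remain distinct in the ordered representation, which is the information a sequence-aware classifier can exploit.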

The associated paper can be found here.

GitHub repository is here.


Fastclust is an OrthoMCL-like tool. It identifies orthologs, paralogs and co-orthologs across genomes.

GitHub repository is here.

A Pipeline for Operon Evaluation in Metagenomes

POEM is a pipeline that predicts operons from metagenomic data, identifies core functions from the predicted operons, and visualizes the results.

GitHub repository is here.