ABSTRACT
As metagenomic projects gather large quantities of genomic data from novel microbial species, there is a great need for improved computational tools to sort and understand such data.
An important part of this task is the "binning problem": determining the number of species present in a sample and sorting DNA fragments by species of origin.
We present PuzzleCluster, a new clustering algorithm for unsupervised binning in metagenomics. Beyond its new clustering approach, PuzzleCluster introduces several other features, including the use of word agreement information to increase clustering accuracy and the estimation of clustering parameters by fitting the data with the expectation-maximization (EM) algorithm.
PuzzleCluster makes no prior assumptions about the genetic makeup or number of species present. In our tests, PuzzleCluster frequently outperforms the best existing unsupervised binning programs.
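For readers unfamiliar with parameter estimation by expectation maximization, the following is a minimal, generic sketch of the technique. It is not the PuzzleCluster implementation: the model (a two-component one-dimensional Gaussian mixture), the function names normal_pdf and fit_two_gaussians, and the initialization scheme are all assumptions made purely for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(data, iterations=100):
    """Fit a two-component 1-D Gaussian mixture to `data` with EM.

    Illustrative only (hypothetical helper, not PuzzleCluster's model).
    Returns (weights, means, variances), each a list of length 2.
    """
    n = len(data)
    # Crude initialization: means at the data extremes, shared variance.
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    means = [min(data), max(data)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]

    for _ in range(iterations):
        # E-step: posterior probability that each point came from each component.
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1]
            if total == 0.0:
                # Numerical underflow: fall back to equal responsibilities.
                resp.append([0.5, 0.5])
            else:
                resp.append([p[0] / total, p[1] / total])

        # M-step: re-estimate each component from its weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            # Floor the variance so a component cannot collapse onto one point.
            variances[k] = max(var_k, 1e-6)

    return weights, means, variances
```

Calling fit_two_gaussians on a list of numbers drawn from two well-separated populations should recover approximately the mean, variance, and mixing weight of each population. The sketch assumes the data are not all identical, since the shared starting variance would then be zero.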
PAPER
A Novel Unsupervised Clustering Algorithm for Binning DNA Fragments in Metagenomics, submitted.
DOWNLOAD
Click here to download the Python source code.
AUTHORS
Kyler Siegel, Kristen Altenburger, Yusing Hon, Jessey Lin, and Chenglong Yu
Last updated: October 2011