Metagenomics Binning

Genome classification (clustering) based on the k-mer frequencies

Mission of the project

In this project, I'm trying to classify the genomes of different bacteria using the k-mer frequencies of reads from their genomes.

Jan 23 - Jan 29

  • Created the website and selected the project.

    • I've chosen to work on the metagenomics binning project, which is the second project mentioned on the project ideas page.

  • Working on the homework

Jan 30 - Feb 5

Downloading the data from the GenBank website. The data covers more than 23,000 bacteria and takes more than 24 GB of storage.
I selected a few (50) to start with.

Feb 6 - Feb 12

No changes, working on the homework

Feb 13 - Feb 19

Getting familiar with read simulators such as ART, SimSeq and Flux Simulator. In the end, I decided to go with dwgsim because it is easier to install on Linux and therefore easier to use on the Google Colab platform. dwgsim is based on wgsim, which is found in SAMtools.

I've also read the documentation of dwgsim and generated the simulated reads to perform binning on them later.

Feb 20 - Feb 26

Creating the 4-mers


Counting frequencies
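The two steps above (creating the 4-mers and counting their frequencies) can be sketched as follows. This is only a minimal illustration, not the exact code used in the project: the function names `all_kmers` and `kmer_frequencies` are hypothetical, windows containing characters other than A, C, G, T are skipped, and reverse complements are not merged.

```python
from collections import Counter
from itertools import product


def all_kmers(k):
    """All 4**k possible DNA k-mers in a fixed lexicographic order."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def kmer_frequencies(seq, k=4):
    """Return the normalized k-mer frequency vector of one sequence.

    The vector has 4**k entries (256 for k=4), one per k-mer in the
    order given by all_kmers(k), and sums to 1 when the sequence
    contains at least one valid k-mer.
    """
    seq = seq.upper()
    # Slide a window of length k over the sequence and count each k-mer.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Looking up only A/C/G/T k-mers drops windows with N or other symbols.
    valid = [counts[kmer] for kmer in all_kmers(k)]
    total = sum(valid)
    if total == 0:
        return [0.0] * len(valid)
    return [c / total for c in valid]


vec = kmer_frequencies("ACGTACGTAC")
print(len(vec))  # one entry per possible 4-mer
```

Each read (or genome) becomes one fixed-length vector, so a set of reads can be stacked into a matrix with one row per read and passed to a clustering algorithm.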


Performing k-means
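A minimal sketch of the k-means step, assuming scikit-learn is used and that each read has already been turned into a 256-dimensional 4-mer frequency vector. The helper name `cluster_reads` and the toy data are hypothetical and only stand in for the real frequency matrix.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_reads(freq_matrix, n_clusters, seed=0):
    """Cluster rows of a (n_reads, n_kmers) frequency matrix with k-means."""
    X = np.asarray(freq_matrix, dtype=float)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    # fit_predict returns one cluster label per row of X.
    return model.fit_predict(X)


# Toy data (an assumption, not project data): two artificial "genomes",
# each with its own 4-mer profile, and 20 noisy "reads" drawn around each.
rng = np.random.default_rng(0)
profiles = rng.dirichlet(np.ones(256), size=2)
X = np.vstack([p + rng.normal(0, 1e-4, size=(20, 256)) for p in profiles])

labels = cluster_reads(X, n_clusters=2)
print(labels)
```

The number of clusters has to be chosen up front, which in a binning setting corresponds to the number of genomes expected in the sample.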


Feb 27 - Mar 5

Working on presentation.
I did not get good results from k-means, so I spent my time reading more about other approaches such as [1], [2], and [3].
I also ran different experiments on the data to understand it better.

Mar 6 - Mar 13

This week, I added different clustering methods suggested in this link.

The results were pretty close to those of k-means.
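As an illustration of this kind of comparison, the sketch below runs k-means next to two other scikit-learn methods on the same toy data and scores each against known labels with `adjusted_rand_score`. The choice of `AgglomerativeClustering` and `GaussianMixture` is an assumption made for the example, not necessarily the methods tried in the project, and the 2-D toy points stand in for real k-mer frequency vectors.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Toy data (an assumption): three well-separated groups of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(0, 0.3, size=(30, 2)) for c in centers])
y_true = np.repeat([0, 1, 2], 30)

methods = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
    "gaussian mixture": GaussianMixture(n_components=3, random_state=0),
}

# Every estimator above exposes fit_predict, so they can be scored uniformly.
scores = {
    name: adjusted_rand_score(y_true, model.fit_predict(X))
    for name, model in methods.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

On easy data like this the methods agree; differences only show up when the clusters overlap, which is the situation with real reads.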

The metric used to measure accuracy is the adjusted Rand score, which is a measure of similarity between two data clusterings. Its value ranges from -0.5 to 1: the closer to 1, the better the match, a value near 0 corresponds to a random clustering, and negative values mean agreement worse than chance.
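To make the metric concrete, here is a small pure-Python sketch of the adjusted Rand index computed from pair counts. The function name `adjusted_rand_index` is hypothetical; in practice a library routine such as scikit-learn's `adjusted_rand_score` computes the same quantity.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: trivially identical

    # Pairs of items placed together in both clusterings.
    joint = Counter(zip(labels_true, labels_pred))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    # Pairs placed together within each clustering separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings have one cluster
    return (sum_joint - expected) / (max_index - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])     # same grouping, renamed labels
partial = adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1])  # partial agreement
worst = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])    # every pair disagrees
print(same, partial, worst)
```

The first call shows that the score ignores how clusters are named, which matters here because the cluster IDs produced by k-means have no relation to the genome labels.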