Genome classification (clustering) based on k-mer frequencies
Mission of the project
In this project, I'm trying to classify the genomes of different bacteria using the frequencies of the k-mers found in reads from those genomes.
Jan 23 - Jan 29
Created the website and selected the project.
I've chosen to work on the metagenomics binning project, which is the second project mentioned on the project ideas page.
Working on the homework
Jan 30 - Feb 5
Downloading the data from the GenBank website. The data covers more than 23,000 bacteria and takes more than 24 GB of storage.
I selected a few (50) to start with.
Feb 13 - Feb 19
Getting familiar with read simulators such as ART, SimSeq and Flux Simulator. In the end, I decided to go with dwgsim because it is easier to install on Linux and therefore easier to use on the Google Colab platform. dwgsim is based on wgsim, which is found in SAMtools.
I've also read the documentation of dwgsim and generated the simulated reads to perform binning on them later.
Feb 20 - Feb 26
Creating the 4-mers
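A minimal sketch of how 4-mer frequencies could be computed for one read, using only the Python standard library. The function name `kmer_frequencies`, the example read, and the choice to skip windows containing ambiguous bases such as N are assumptions made for illustration, not necessarily what the project code does.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return normalized k-mer frequencies in lexicographic k-mer order."""
    sequence = sequence.upper()
    # All 4**k possible k-mers over the DNA alphabet (256 when k = 4).
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every window of length k in the sequence.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Windows with N or other ambiguous bases are not in `kmers`, so they are ignored.
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


read = "ACGTACGTNACGT"  # hypothetical read, not taken from the project data
vector = kmer_frequencies(read, k=4)
print(len(vector))  # 256
```

Each read (or genome) then becomes a fixed-length vector of 256 numbers, which is the kind of input a clustering method can work with.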
Feb 27 - Mar 5
Working on presentation.
I did not get good results from k-means, so I spent my time reading more about other approaches.
I also ran different experiments on the data to understand it better.
Mar 6 - Mar 13
This week, I added the different clustering methods suggested in this link.
Their results were pretty close to the k-means result.
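For reference, the k-means baseline mentioned above can be sketched in plain Python as follows. This is only an illustration: it uses a deterministic farthest-first initialization instead of random starts, the helper names `sq_dist` and `kmeans` are made up here, and the toy 2-D points stand in for real k-mer frequency vectors.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, n_clusters, n_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    # Farthest-first initialization (an assumption of this sketch, chosen
    # so the result does not depend on a random seed).
    centers = [list(points[0])]
    while len(centers) < n_clusters:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(farthest))

    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(n_clusters), key=lambda c: sq_dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy vectors standing in for k-mer frequency vectors (hypothetical data).
points = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
print(kmeans(points, n_clusters=2))  # [0, 0, 0, 1, 1, 1]
```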
The metric used to measure accuracy is the adjusted Rand score, which measures the similarity between two clusterings of the same data. The score is at most 1 and can be negative; the closer it is to 1, the better the match, and a value near 0 corresponds to a random clustering.
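To make the metric concrete, here is a small sketch that computes the adjusted Rand index directly from the pair counts of two labelings. The function name `adjusted_rand_index` and the toy labels are made up for illustration; scikit-learn's `sklearn.metrics.adjusted_rand_score` computes the same quantity and is what one would normally call.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs of items placed together in both clusterings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs of items placed together in each clustering on its own.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both clusterings are trivial
    return (sum_cells - expected) / (max_index - expected)


# Same grouping with the cluster names swapped: a perfect match.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# A grouping that disagrees on every pair scores below zero.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the score only looks at which items are grouped together, it does not matter that the cluster IDs produced by a clustering method differ from the true genome labels.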