Genome classification (clustering) based on the k-mer frequencies
Mission of the project
In this project, I'm trying to classify the genomes of different bacteria using the k-mer frequencies of reads from their genomes.
Jan 30 - Feb 5
Downloading the data from the GenBank website. The data covers more than 23,000 bacteria and takes up more than 24 GB of storage.
I selected a few (50) to start with.
Feb 13 - Feb 19
Getting familiar with read simulation tools such as ART, SimSeq and Flux Simulator. In the end, I decided to go with dwgsim because it is easier to install on Linux and therefore easier to use on the Google Colab platform. dwgsim is based on wgsim, which is found in SAMtools.
I've also read the documentation of dwgsim and generated the simulated reads to perform binning on them later.
Feb 20 - Feb 26
Creating the 4-mers
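As a rough illustration of this step, here is a minimal sketch of turning one read into a 4-mer frequency vector. The function name kmer_frequencies and the example read are hypothetical, and the choice to skip k-mers containing characters other than A, C, G, T is an assumption, not necessarily what the project does.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return normalized k-mer frequencies in a fixed order over A, C, G, T."""
    sequence = sequence.upper()
    # All 4**k possible k-mers in lexicographic order (256 entries for k=4).
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k in the sequence.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Assumption: windows containing other characters (e.g. N) are ignored.
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Hypothetical example read, not taken from the project data.
read = "ACGTACGTAC"
vec = kmer_frequencies(read)
```

Each read (or genome) then becomes a fixed-length vector of 256 values, which is a convenient input for a clustering algorithm.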
The metric used to compute accuracy is the adjusted Rand score, which measures the similarity between two clusterings of the same data. Its maximum value is 1 and the closer to 1, the better the match; a value of about 0 corresponds to a random clustering, and negative values mean the agreement is worse than chance.
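To make the score concrete, here is a minimal pure-Python sketch of the adjusted Rand score computed from pair counts. The function name adjusted_rand_index and the toy labels are hypothetical; in practice, scikit-learn's sklearn.metrics.adjusted_rand_score computes the same quantity.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand score between two labelings of the same items."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the labelings agree trivially
    # Pairs of items placed together in both clusterings.
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values()
    )
    # Pairs of items placed together within each clustering separately.
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    # Value expected by chance, and the largest attainable value.
    expected = together_true * together_pred / total_pairs
    max_index = (together_true + together_pred) / 2
    if max_index == expected:
        return 1.0
    return (together_both - expected) / (max_index - expected)


# Toy labels: the same grouping with swapped cluster names still scores 1.
same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Every true cluster is split evenly: the score drops below 0.
split_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The score only depends on which items are grouped together, not on the cluster names, so the predicted cluster IDs do not need to match the true genome labels.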