Publications

This page shows all publications that appeared in the IASI annual research reports. Authors currently affiliated with the Institute are always listed with their full names.

IASI Research Report n. 13-19 (2013)

Weitschek E., Cunial F., Giovanni Felici

Discovering genome-wide k-mer compositional rules using logic formulas

ABSTRACT
The increasing availability of biological sequences from massive experiments has led to the growth of the field of sequence analysis. In this field, the similarity of sequences is used to infer related biological functions or to detect common organisms. Analysis algorithms include methods and techniques from statistics and computer science. Most current sequence analysis methods are based on alignment, i.e., they align areas of the sequences that share common properties. These algorithms are computationally demanding and their complexity is exponential in the length of the sequences; therefore, heuristics have been proposed to solve the sequence alignment problem. Alternative methods for sequence classification rely on string matching, pattern recognition, and alignment-free techniques, which can also be combined with supervised and unsupervised machine learning algorithms. In alignment-free methods, the similarity of two sequences is assessed based only on the dictionary of subsequences that appear in the strings, irrespective of their relative position. The subsequences can be represented in a feature vector, treated in a mathematical space, and eventually combined with machine learning algorithms, e.g., logic data mining. In this work, a method for classifying biological sequences is proposed. The method is based on an alignment-free feature vector representation of biological sequences in combination with logic data mining algorithms. It classifies biological sequences without the strict requirement of alignments or of overlapping gene regions. The method is tested on bacterial whole genomes, with promising results in terms of classification accuracy. Finally, the strengths of the method are highlighted: promising classification results on bacterial sequences, no need to align them, and identification of common subsequences (k-mers) for each class (taxon) present in the data set.
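
The alignment-free representation mentioned in the abstract can be illustrated with a minimal sketch. It is not the report's implementation: the function name kmer_vector, the restriction to the DNA alphabet ACGT, and the normalization to relative frequencies are all assumptions made for illustration. The sketch maps a sequence to a fixed-length vector of k-mer frequencies that ignores where each k-mer occurs.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: plain DNA alphabet, no ambiguity codes


def kmer_vector(sequence, k):
    """Return relative frequencies of all 4**k k-mers in lexicographic order.

    Hypothetical helper for illustration; windows containing characters
    outside ALPHABET (for example N) are ignored.
    """
    seq = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed feature order, so vectors of different sequences are comparable.
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


if __name__ == "__main__":
    vec = kmer_vector("ACGTAC", 2)
    print(len(vec), vec[1])  # 16 features; index 1 is the frequency of "AC"
```

Vectors of this kind could then be passed to any classifier; the report combines them with logic data mining, which this sketch does not attempt to reproduce.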