ABSTRACT
Motif discovery from a set of sequences is an
important problem in biology. Although much research
has been done on computational techniques for
sequence motif discovery, discovering motifs in a large
number of sequences remains challenging. We
propose a novel computational framework that combines
multiple computational techniques such as pairwise
sequence comparison, clustering, HMM-based sequence
search, motif finding, and block comparisons. We tested
the ability of this computational framework to extract
motifs from disease resistance genes and candidates in
the Arabidopsis thaliana genome and discovered all known
motifs relating to disease resistance. When the same set
of sequences was submitted as a whole, without clustering,
to the motif discovery tools MEME and Pratt, both tools
failed to detect disease resistance gene motifs. The crucial
component in this framework is clustering. One of the
benefits of clustering is computational efficiency, since the
set of sequences is divided into smaller groups by the
clustering algorithm.
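
To illustrate the clustering step, the following is a minimal sketch of how
a set of sequences might be divided into smaller groups before motif finding.
It is not the framework's actual method: it assumes that k-mer Jaccard
similarity stands in for a real pairwise sequence comparison, that a fixed,
arbitrary threshold decides which pairs are linked, and that groups are formed
by single linkage. The function names, the toy sequences, and the parameter
values are all hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_sequences(seqs, k=3, threshold=0.3):
    """Single-linkage clustering of named sequences.

    Assumption: k-mer Jaccard similarity is a stand-in for a real pairwise
    alignment score, and the threshold is arbitrary.
    """
    names = list(seqs)
    parent = {n: n for n in names}  # union-find forest

    def find(n):
        # Follow parent links to the root, compressing the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    kmers = {n: kmer_set(seqs[n], k) for n in names}
    for a, b in combinations(names, 2):
        if jaccard(kmers[a], kmers[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical toy input: two pairs of near-identical sequences.
toy = {
    "s1": "MKVLGGAGKTTLAQ",
    "s2": "MKVLGGSGKTTLAQ",
    "s3": "PPWLEELDLSNNRF",
    "s4": "PPWLEELDLSYNRF",
}
clusters = cluster_sequences(toy)
```

In a pipeline of this kind, each resulting group would then be passed
separately to a motif finder, so that every run handles only a small set of
related sequences.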