Markov Model for Full Genome Sequence Generation
Abstract
This work introduces a Markov chain method for generating a long sequence written in the four-letter alphabet of DNA: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T). The algorithm can be used to generate a new genomic DNA sequence that preserves the statistical properties of the original sequence for any choice of N-gram length. An N-gram is a subsequence of length N in the genomic DNA. By counting the occurrences of the different N-grams, a signature vector of a genetic text, called the contrast value, is then constructed. Using the contrast value vector and correlation as distance measures, a phylogenetic tree is constructed. The resulting phylogenetic trees group the organisms according to their kingdoms, which does not contradict the commonly accepted phylogenetic tree.
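The idea of generating a sequence from N-gram statistics can be illustrated with a minimal sketch. It assumes a Markov chain of order N-1 whose transition probabilities are estimated from overlapping N-gram counts in the original sequence. The function names (count_ngrams, build_transitions, generate), the uniform fallback for unseen contexts, and the choice to start from the original sequence's first N-1 letters are assumptions made for illustration, not details taken from this work. The contrast value is not computed here because its definition is not given in the abstract.

```python
import random
from collections import Counter, defaultdict

ALPHABET = "ACGT"

def count_ngrams(seq, n):
    """Count every overlapping subsequence of length n in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def build_transitions(seq, n):
    """Map each (n-1)-letter context to counts of the letter that follows it."""
    transitions = defaultdict(Counter)
    for gram, count in count_ngrams(seq, n).items():
        transitions[gram[:-1]][gram[-1]] += count
    return transitions

def generate(seq, n, length, seed=None):
    """Generate a sequence of the given length from an order-(n-1) chain."""
    rng = random.Random(seed)
    transitions = build_transitions(seq, n)
    # Assumption: start from the first n-1 letters of the original sequence.
    out = list(seq[:n - 1])
    while len(out) < length:
        context = "".join(out[len(out) - (n - 1):]) if n > 1 else ""
        followers = transitions.get(context)
        if not followers:
            # Assumption: a context with no observed successor falls back
            # to a uniform draw over the alphabet.
            out.append(rng.choice(ALPHABET))
            continue
        letters = list(followers)
        weights = [followers[c] for c in letters]
        out.append(rng.choices(letters, weights=weights)[0])
    return "".join(out[:length])
```

Calling generate(original, 3, len(original)) would produce a sequence in which each letter is drawn according to the 3-gram frequencies of the original. Comparing count_ngrams on both sequences gives a rough check of how closely the N-gram statistics are preserved.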