// Copyright (c) 1997-99 by Arthur Delcher, Steven Salzberg, Simon // Kasif, and Owen White. All rights reserved. Redistribution // is not permitted without the express written permission of // the authors. Program build-imm.c creates and outputs both an interpolated Markov model (IMM) and, optionally, a standard fixed-order Markov model for the strings specified in its input file. Input comes from the file named on the command-line. Format should be one string per line. Each line has an ID string followed by white space followed by the sequence itself. Output goes to 2 files:.deltas.context contains the IMM. This is stored as a conventional fixed-order Markov model with the linear combination of the various lesser-order models computed and saved. The length of this model is given by the constant MODEL_LEN . .deltas.simple contains a fixed-order Model. The length of this model is also MODEL_LEN . This file is created just to compare with the IMM. The constant SIMPLE_MODEL_LEN is used in the comparison to specify which "level" within this model to use. This file is generated only if the -s option is specified. The format of these files is a single header line giving the model length, the alphabet size, and then the number of subsequence entries. The order of the following entries is: P (empty string) = 1 P (a | empty string) P (c | empty string) P (g | empty string) P (t | empty string) P (a | a) P (c | a) P (g | a) P (t | a) P (a | c) P (c | c) P (g | c) P (t | c) : P (t | t) P (a | aa) P (a | ac) : P (t | tt..t) The IMM is constructed as follows: For a given context, say acgtta, we want to estimate the probability distribution of the next character. We shall do this as a linear combination of the observed probability distributions for this context and all of its suffixes, i.e., cgtta, gtta, tta, ta, a and empty. By observed distributions I mean the counts of the number of occurrences of these strings in the training set. The linear combination is determined by a set of probabilities, lambda, one for each context string. For context acgtta the linear combination coefficients are: lambda (acgtta) (1 - lambda (acgtta)) x lambda (cgtta) (1 - lambda (acgtta)) x (1 - lambda (cgtta)) x lambda (gtta) (1 - lambda (acgtta)) x (1 - lambda (cgtta)) x (1 - lambda (gtta)) x lambda (tta) : (1 - lambda (acgtta)) x (1 - lambda (cgtta)) x (1 - lambda (gtta)) x (1 - lambda (tta)) x (1 - lambda (ta)) x (1 - lambda (a)) We compute the lambda values for each context as follows: - If the number of observations in the training set is >= the constant SAMPLE_SIZE_BOUND, the lambda for that context is 1.0 - Otherwise, do a chi-square test on the observations for this context compared to the distribution predicted for the one-character shorter suffix context. If the chi-square significance < 0.5, set the lambda for this context to 0.0 Otherwise set the lambda for this context to: (chi-square significance) x (# observations) / SAMPLE_WEIGHT To compile the program: g++ build-imm.c -lm -o build-imm Uses include files delcher.h context.h strarray.h gene.h To run the program: build-imm train.seq [-s] This will produce the file train.seq.deltas.context and, if the -s option is specified, the file train.seq.deltas.simple also.