//    Copyright (c) 1997-99 by Arthur Delcher, Steven Salzberg, Simon
//    Kasif, and Owen White.  All rights reserved.  Redistribution
//    is not permitted without the express written permission of
//    the authors.

Program  build-imm.c  creates and outputs both an interpolated Markov
model (IMM) and, optionally,  a standard fixed-order Markov model for the
strings specified in its input file.

Input comes from the file named on the command-line.  Format should be
one string per line.  Each line has an ID string followed by white space
followed by the sequence itself.

Output goes to 2 files:

   .deltas.context  contains the IMM.  This is stored as a
       conventional fixed-order Markov model with the linear combination
       of the various lesser-order models computed and saved.
       The length of this model is given by the constant  MODEL_LEN .

   .deltas.simple  contains a fixed-order Model.  The length
       of this model is also  MODEL_LEN .  This file is created just to
       compare with the IMM.  The constant  SIMPLE_MODEL_LEN  is used
       in the comparison to specify which "level" within this model to use.
       This file is generated only if the  -s  option is specified.
       

The format of these files is a single header line giving the model length,
the alphabet size, and then the number of subsequence entries.  The order
of the following entries is:
    P (empty string) = 1
    P (a | empty string)
    P (c | empty string)
    P (g | empty string)
    P (t | empty string)
    P (a | a)
    P (c | a)
    P (g | a)
    P (t | a)
    P (a | c)
    P (c | c)
    P (g | c)
    P (t | c)
         :
    P (t | t)
    P (a | aa)
    P (a | ac)
         :
    P (t | tt..t)


The IMM is constructed as follows: For a given context, say
acgtta, we want to estimate the probability distribution of the
next character.  We shall do this as a linear combination of the
observed probability distributions for this context and all of
its suffixes, i.e., cgtta, gtta, tta, ta, a and empty.  By
observed distributions I mean the counts of the number of
occurrences of these strings in the training set.  The linear
combination is determined by a set of probabilities, lambda, one
for each context string.  For context acgtta the linear combination
coefficients are:
    lambda (acgtta)
    (1 - lambda (acgtta)) x lambda (cgtta)
    (1 - lambda (acgtta)) x (1 - lambda (cgtta)) x lambda (gtta)
    (1 - lambda (acgtta)) x (1 - lambda (cgtta)) x (1 - lambda (gtta)) x lambda (tta)
                   :
    (1 - lambda (acgtta)) x (1 - lambda (cgtta)) x (1 - lambda (gtta)) 
             x (1 - lambda (tta))  x (1 - lambda (ta))  x (1 - lambda (a))

We compute the lambda values for each context as follows:
  - If the number of observations in the training set is >= the constant
    SAMPLE_SIZE_BOUND, the lambda for that context is 1.0
  - Otherwise, do a chi-square test on the observations for this context
    compared to the distribution predicted for the one-character shorter
    suffix context.
    If the chi-square significance < 0.5, set the lambda for this context to 0.0
    Otherwise set the lambda for this context to:
       (chi-square significance) x (# observations) / SAMPLE_WEIGHT


To compile the program:

      g++ build-imm.c -lm -o build-imm

  Uses include files  delcher.h  context.h  strarray.h  gene.h


To run the program:

      build-imm train.seq [-s]

  This will produce the file  train.seq.deltas.context  and,
  if the  -s  option is specified, the file  train.seq.deltas.simple
  also.