BLASTCLUST - BLAST score-based single-linkage clustering.

1. Clustering procedure.

BLASTCLUST automatically and systematically clusters protein sequences
based on pairwise matches found using the BLAST algorithm. BLASTCLUST 
finds pairs of sequences that have statistically significant matches 
and clusters them using single-linkage clustering.

BLAST parameters used by BLASTCLUST are the default values for protein
sequences (matrix BLOSUM62; gap opening cost 11; gap extension cost 1;
no low-complexity filtering); e-value threshold is 1e-6. For each pair
of sequences the top-scoring alignment is evaluated according to the 
following criteria:

       x1                   x2      HSP length on seqX: Hx = x2-x1+1 
        |                    |      gaps in seqX: Gx
seqX ---======================----- seqX length: Lx
         \\|||||||||||||||||//      BLAST score: S 
seqY ----====================------ number of identical residues: N
         |                  |       seqY length: Ly
        y1                 y2       gaps in seqY: Gy
                                    HSP length on seqY: Hy = y2-y1+1 

coverage of seqX: Cx = Hx/Lx
coverage of seqY: Cy = Hy/Ly
coverage:         max(Cx,Cy) or min(Cx,Cy), depending on the value of -b option 
alignment length  Al = Hx+Gx = Hy+Gy
score density:    S/min(Hx,Hy) or N/Al*100%

If the coverage is above a certain threshold
 AND
the score density is above a certain threshold,

these two sequences are considered to be neighbored.

Thus determined neighbor relationships is considered symmetric and provides
the base for clustering by a single-linkage method (which puts a sequence
to a cluster if the sequence is a neighbor to at least one sequence in the
cluster).

2. Input formats.

The primary input format for BLASTCLUST is a FASTA-format sequence file.
Each sequence should have a unique identifier (as defined by formatdb).
BLASTCLUST formats this sequence set into a BLASTable database
(in the directory pointed to by the environment variable TMPDIR or in
the current directory), then removes the database.

Instead of a FASTA file, a database prepared by formatdb with -o option
set to TRUE can be supplied as an input.

Another type of input is a sequence hit-list previously saved by
BLASTCLUST (in this case BLASTCLUST will use pre-computed HSP data
instead of making de novo comparisons).

You can restrict clustering to a subset of your data by supplying an ID
list file (IDs separated by spaces, tabs, newlines, commas or semicolons).
This is supposed to be used for re-clustering subsets of sequences using
the previously computed hit-list file.

3. Output format.

BLASTCLUST prints out clusters sorted from largest to smallest (by ID of
the first sequence if of the same size), separating clusters by a newline
character. Sequence identifiers within a cluster are space-separated and
sorted from longest to shortest sequence (by IDs if of the same length).

4. Crash recovery.

If the program crashed because of system error you can restart it
using crash recovery mode. This works only if you were saving
hit-list during the clustering. Start the job with the same command
line as before, specifying the hit-list saving to the same file but
also set the "continue unfinished clustering" option to TRUE. The
process will restart from the last saved point and will append the
hit-list file.

5. Environment.

BLASTCLUST is supposed to work in a normal NCBI environment, in
particular:

BLOSUM62 matrix is available via .ncbirc or BLASTMAT environment
variable.

6. Program options

Input:

 -i  sequence file in the FASTA format (default = stdin)
 -d  sequence database name
 -r  name of a hit-list file saved by BLASTCLUST

 These three options are mutually exclusive.

 -l  a file with a list of IDs to restrict the clustering
 
Thresholds:

 -S  similarity threshold
    if <3 then the threshold is set as a BLAST score density
    (0.0 to 3.0; default = 1.75)
    if >=3 then the threshold is set as a percent of identical
    residues (3 to 100)
 -L  minimum length coverage (0.0 to 1.0; default = 0.9)
 -b  require coverage as specified by -L and -S on both (T) or
    only one (F) sequence of a pair (default = TRUE)

Output:

 -o  file to save cluster list (default = stdout)
 -s  file to save hit-list (this file may be not portable across
    platforms)

Misc:

 -C  continue unfinished clustering (crash recovery mode).
    (default = FALSE)
 -a  Number of CPU's to use in a multi-thread mode
    (default = 1).
 -v  Progress report destination (printed every 1000 sequences).
    Set to F to suppress report messages (default = stderr).

7. Credits and complaints:

Ilya Dondoshansky (dondosha@ncbi.nlm.nih.gov)
Yuri Wolf (wolf@ncbi.nlm.nih.gov)

05 August, 2000

APPENDIX A.
Format of the hit-list file.

The hit-list file consists of the following parts:

 - header
 - sequence ID list
 - sequence length list
 - hit list

The byte-by-byte layout is platform-dependent; field sizes given here
are true for most UNIX platforms.

A.1. Header.

	1-byte boolean	IDtype	1 if numeric IDs; 0 is string IDs
	4-byte integer	ListSz	size of the ID list; if IDs are numeric this
				is the number of SeqID records, otherwise this
				is the length of the ID list (in bytes)

A.2. Sequence ID list.

If IDtype is 1 (numeric IDs) then the list is ListSz records of

	4-byte integer	SeqID	sequence ID (numeric)

If IDtype is 0 (string IDs) then the list is a list of records of

	var-length char	SeqID	sequence ID (string)
	space (' ')		separator

(total length is ListSz bytes; the number of sequences is equal to the number
of spaces).

A.3. Sequence length list.

This is a list of

	4-byte integer	SeqLen	sequence length

A.4. Hit list.

The list consists of the following records going to the end of file:

	4-byte integer	N1	ordinal number of the 1st sequence
	4-byte integer	N2	ordinal number of the 2nd sequence
	4-byte integer	HSPL1	HSP length on the 1st sequence
	4-byte integer	HSPL2	HSP length on the 2nd sequence
	8-byte float	Score	BLAST score
	8-byte float	PercId	Percent of identical residues