, and up to three optional input files.
If the user wants EREM to use a specific phylogenetic tree, it should provide it as a separate
text file in a
format. If the user wants EREM to use
specific
upon extant species (as
opposed to simulating them), then he/she should provide two separate text files. The first lists all
the unique patterns observed, and the second lists, for each gene, the number of times each of
the unique patterns appear.
This file provides EREM with all the necessary information to carry out the computation
in full. The name of this file can be anything with an extention *.TXT. However, the convention
PROJECTNAME.input.txt is preferred. Within the file, a percentage symbol (%) anywhere in a line
indicates a comment, and the rest of the line is ignored. A value of -1 in any of the parameters
instructs EREM to take the default value for that parameter.
The format of the input file is pretty rigid, so please follow carefully the instructions. The
easiest way to produce an input file would probably be either to modify one of our example
files, or to use the Matlab function
makeinput. In the following,
we shall describe in detail the structure of the input file line by line:
Line 1: % Phylogenetic Tree:
This is a comment line, indicating the beginning of the section in the input file that
defines what phylogenetic tree should be used in the computations.
Line 2: time units: MYrs % either {MYrs|BYrs}
Defines the time units used to measure the branch lengths on the tree. The user can specify
either MYrs (million years) or BYrs
(billion years).
Line 3: source: tree.phyl phyl % either
{rand|file_name file_format}
If the user provides the phylogenetic tree, he/she should provide an apporpriate
phylogenetic tree input file, and to put the name of this file as
file_name. The
file_format
instruction indicates what is the format of
file_name.
Currently, we support either the Newick format (
file_format
= "newick"), or our own format (
file_format
= "phyl"). If the user wants EREM to randomly generate a phylogenetic tree, he/she should
put here the instruction
rand.
Line 4: number of leaves: 19 % effect only random trees
This line is meaningful only if the phylogenetic tree is simulated, as it
determines the number of terminal nodes in the tree.
Line 5: total time span: 1970.000000 % effects only random trees
This line is meaningful only if the phylogenetic tree is simulated, as it
determines the time span of the entire tree. The time units are as indicated in
the 2nd line.
Line 6:
an empty line, dividing between two logical sections of the input file.
Line 7: % alignment data (-1 stand for default)
This is a comment line, indicating the beginning of the section in the input file that
defines what input data should be used.
Line 8: source: patterns wp.txt ngp.txt
% either {simulation|patterns wp_file ngp_file}
If the user provides this input, he/she should provide an apporpriate
wp and
ngp files, and to put the
name of this files as
wp_file and
ngp_file, respectively. If the user wants EREM to generate
these files by simulation, he/she should put here the instruction
simulation.
Line 9: number of genes: 1 % effective only in simulative
mode {1,2,3,...}
This line is meaningful only if the data are simulated, as it determines the number of genes
in the input data.
Line 10: number of sites: rand -1.000000 % effective only in simulative
mode {rand avg_val{5000}|fixed val_list|const val}
This line is meaningful only if the data are simulated, as it determines the number of
sites in the multiple alignemnt of each gene. There are three modes to provide EREM with
information regarding this number. If the user chooses the
rand mode, the number of sites of each gene is picked at
random from a uniform distribution in the range (1,2*avg_val).
By default, avg_val is 5000. If the user chooses the
fixed mode, it should provide a list of integers that exacly
matches the number of genes determined in the 9th line. If the user chooses the
const mode, all multiple alignments will have precisely the
same length.
Line 11: fraction of unknowns: 0.000000 % effective only in simulative
mode {0.0 - 1.0}
This line is meaningful only if the data are simulated, as it determines the fraction of
sites designated as missing data in the input. If this fraction is greater than zero, EREM,
upon completing the simulations, randomly substitutes in this fraction of the input sites
the value *, indicating missing data (unknown state).
Line 12: track numbers of present characters: on % effective only in
simulative mode {on|off}
This line is meaningful only if the data are simulated, as it instructs EREM to save complete
information about the number of characters in state 1 (during the simulation) in each gene and
in each node in the phylogenetic tree, as well the number of gain, loss, and retention events
for each gene along each branch. This information is stored in
track files,
and can be used to evaluate the accuracy of the ancestral reconstruction.
Line 13: filter unknowns: none % filter input patterns
{none|ukf frac}
EREM is capable of basic filtering of the input data. If the user selects the
none mode, no filtering is performed. If the user
selects the ukf mode, any input pattern with a
fraction of frac or more missing values is discarded.
Line 14: p0: -1.000000 % lower threshold {0.95}; used for random
generator {0.0 - 1.0}
This line is meaningful only if EREM needs to pick a random value to
p0 (prior probability of the root node to be in state 0
at a particular site). In that case, EREM assumes a uniform distribution in the
range [threshold, 1]. By default,
threshold is set to 0.95.
Line 15: theta: -1.000000 % Poisson parameter {1.0}; used for
random generator {>0.0}
This line is meaningful only if EREM needs to pick a random value to
theta (gene-specific loss parameter). In that case,
EREM picks theta at random, independently for each gene,
assuming exponential distribution with mean parameter.
By default, parameter is set to 1.00.
Line 16: phi: -1.000000 Alpha parameter (alpha) {0.4}; Poisson
parameter (mu) {1.0}; used for random generator
This line is meaningful only if EREM needs to pick a random value to
phi (branch-specific loss parameter). In that case,
EREM finds the two deepest branches (those that originate from the root), and assigns
phi values to them by random, assuming a probability
distribution, denoted pdf(mu), and obtained by taking an
exponential random variable x with mean
mu, and taking phi = 1 - exp(x*T),
where T is the branch length. Then,
phi is recursively computed down
the tree, where the value on each branch is determined by two contributions. A
fraction alpha takes the value of the parent branch,
and a fraction 1-alpha is determined at random using
the distribution pdf(mu). Positive
alpha limits the amount by which the value on a branch
can deviate from the value in the parent branch. By default,
alpha is set to 0.4, and mu
is set to 1.0.
Line 17: eta: -1.000000 % Poisson parameter {1.5}; used for random
generator {>0.0}
This line is meaningful only if EREM needs to pick a random value to
eta (gene-specific gain parameter). In that case,
EREM picks eta at random, independently for each gene,
assuming exponential distribution with mean parameter.
By default, parameter is set to 1.5.
Line 18: xi: -1.000000 % % Alpha parameter (alpha) {0.4}; Poisson
parameter (mu) {1.5}; used for random generator
This line is meaningful only if EREM needs to pick a random value to
xi (branch-specific gain parameter). In that case,
EREM finds the two deepest branches (those that originate from the root), and assigns
xi values to them by random, assuming a nonuniform
distribution, denoted pdf(mu), and obtained by taking an
exponential random variable x with mean
mu, and taking xi = 1 - exp(x*T),
where T is the branch length. Then,
xi is recursively computed down
the tree, where the value on each branch is determined by two contributions. A
fraction alpha takes the value of the parent branch,
and a fraction 1-alpha is determined at random using
the distribution pdf(mu). Positive
alpha limits the amount by which the value on a branch
can deviate from the value in the parent branch. By default,
alpha is set to 0.4, and mu
is set to 1.5.
Line 19: nu: -1.000000 % lower threshold {0.8}; used for random
generator {0.0 - 1.0}
This line is meaningful only if EREM needs to pick a random value to
nu (fraction of sites invariant to gain). In that case,
EREM assumes a uniform distribution in the range
[threshold, 1]. By default,
threshold is set to 0.80.
Line 20: l-lambda: -1.000000 % Poisson parameter {1.0}; used for
random generator {>=0.0}
This line is meaningful only if EREM needs to pick a random value to
l-lambda (shape parameter of the loss rate coefficient).
In that case, EREM assumes exponential distribution with mean
parameter. If l-lambda is
greater than the maximal value allowed (5000), or is lower than the lowest value
allowed (see line 31), then it is drawn again, until the value is obtained at the
right range. By default, parameter is set to 1.0.
Line 21: g-lambda: -1.000000 % Poisson parameter {1.0}; used for
random generator {>=0.0}
This line is meaningful only if EREM needs to pick a random value to
g-lambda (shape parameter of the gain rate coefficient).
In that case, EREM assumes exponential distribution with mean
parameter. If g-lambda is
greater than the maximal value allowed (5000), or is lower than the lowest value
allowed (see line 32), then it is drawn again, until the value is obtained at the
right range. By default, parameter is set to 1.0.
Line 22:
an empty line, dividing between two logical sections of the input file.
Line 23: % model parameters
This is a comment line, indicating the beginning of the section in the input file that
defines what EREM should do with the model parameters.
Line 24: p0: estimated % p0 {estimated|fixed rand|fixed val}
This line determines whether p0 should be estimated by
EREM (estimated), or should a constant value be used
throughtout the computation (fixed). In the latter case,
this value can be provided by the user (val), or it can
be drawn at random by EREM (rand), using the distribution
fucntion defined on line 14.
Line 25: theta: estimated % theta {estimated|fixed rand|
fixed val val ...|const val|mixed gene val ...}
This line determines whether all theta's should be estimated
by EREM (estimated), whether fixed values should be used for
all theta's throughout the computation
(fixed or const), or whether
some of the theta's are fixed and some are estimated
(mixed). If all values are fixed, the user can ask
EREM to pick values by random (fixed rand), using the distribution
fucntion defined on line 15. Or, he/she can provide an identical value to all
theta's (const val), or
he/she can designate the value of theta for each gene
by giving a list of values (fixed val val ...). If the
number of genes is larger than 10, multiple lines should be used, with exactly 10
values per line (except for the last line). If only part of the
theta's are determined by the user, they should be listed
as mixed gene val gene val ..., with pairs of gene number
and its theta value. If the list is long, multiple lines
are permitted, without a restriction on the number of items in each line.
Line 26: phi: estimated % phi {estimated|fixed rand|fixed val val ...|
const val|mixed node val ...}
This line determines whether all phi's should be estimated
by EREM (estimated), whether fixed values should be used for
all phi's throughout the computation
(fixed or const), or whether
some of the phi's are fixed and some are estimated
(mixed). If all values are fixed, the user can ask
EREM to pick values by random (fixed rand), using the distribution
fucntion defined on line 16. Or, he/she can provide an identical value to all
phi's (const val), or
he/she can designate the value of phi for each branch
by giving a list of values (fixed val val ...). If the
number of branches is larger than 10, multiple lines should be used, with exactly 10
values per line (except for the last line). If only part of the
phi's are determined by the user, they should be listed
as mixed node val node val ..., with pairs of branch number
and its phi value. If the list is long, multiple lines
are permitted, without a restriction on the number of items in each line.
Line 27: eta: estimated % eta {estimated|fixed rand|
fixed val val ...|const val|mixed gene val ...}
This line determines whether all eta's should be estimated
by EREM (estimated), whether fixed values should be used for
all eta's throughout the computation
(fixed or const), or whether
some of the eta's are fixed and some are estimated
(mixed). If all values are fixed, the user can ask
EREM to pick values by random (fixed rand), using the distribution
fucntion defined on line 17. Or, he/she can provide an identical value to all
eta's (const val), or
he/she can designate the value of eta for each gene
by giving a list of values (fixed val val ...). If the
number of genes is larger than 10, multiple lines should be used, with exactly 10
values per line (except for the last line). If only part of the
eta's are determined by the user, they should be listed
as mixed gene val gene val ..., with pairs of gene number
and its eta value. If the list is long, multiple lines
are permitted, without a restriction on the number of items in each line.
Line 28: xi: estimated % xi {estimated|fixed rand|fixed val val ...|
const val|mixed node val ...}
This line determines whether all xi's should be estimated
by EREM (estimated), whether fixed values should be used for
all xi's throughout the computation
(fixed or const), or whether
some of the xi's are fixed and some are estimated
(mixed). If all values are fixed, the user can ask
EREM to pick values by random (fixed rand), using the distribution
fucntion defined on line 18. Or, he/she can provide an identical value to all
xi's (const val), or
he/she can designate the value of xi for each branch
by giving a list of values (fixed val val ...). If the
number of branches is larger than 10, multiple lines should be used, with exactly 10
values per line (except for the last line). If only part of the
xi's are determined by the user, they should be listed
as mixed node val node val ..., with pairs of branch number
and its xi value. If the list is long, multiple lines
are permitted, without a restriction on the number of items in each line.
Line 29: nu: estimated % nu {estimated|estimated fixed|fixed rand|
fixed val}
This line determines whether nu should be estimated by
EREM (estimated), or should a constant value be used
throughtout the computation. In the latter case, this value can be provided by the user
(fixed val), or it can be drawn at random by EREM
(fixed rand), using the distribution function defined on
line 19. The special option estimated fixed instructs EREM
to use in the initialization phase a crude estimated value for nu,
using a method determined in line 49, and then fix its value throughout the rest of the
computation on 0.9 * the value of the crude estimate.
Line 30: minimal l-lambda value: 0.050000 % minimum value of L-Lambda
{0.05}
For numerical reasons, very low values of l-lambda cannot
be supported. This parameter defines the minimal allowed value for
l-lambda. By default, this value is 0.05. Typically, values
lower than 0.05 give the same discrete gamma distribution as 0.05, with the last category
being the only one significantly different from zero.
Line 31: minimal g-lambda value: 0.060000 % minimum value of G-Lambda
{0.06}
For numerical reasons, very low values of g-lambda cannot
be supported. This parameter defines the minimal allowed value for
g-lambda. By default, this value is 0.06. Typically, values
lower than 0.06 give the same discrete gamma distribution as 0.06, with the last category
being the only one significantly different from zero.
Line 32: l-lambda: fixed inf % loss lambda {estimated|fixed rand|
fixed inf|fixed val}
This line determines whether l-lambda should be estimated by
EREM (estimated), or should a constant value be used
throughtout the computation. In the latter case, this value can be provided by the user
(fixed val), or it can be drawn at random by EREM
(fixed rand), using the distribution function defined on
line 20. The user can use no loss rate variability by setting l-lambda
to infinity using the instruction fixed inf.
Line 33: g-lambda: fixed inf % gain lambda {estimated|fixed rand|
fixed inf|fixed val}
This line determines whether g-lambda should be estimated by
EREM (estimated), or should a constant value be used
throughtout the computation. In the latter case, this value can be provided by the user
(fixed val), or it can be drawn at random by EREM
(fixed rand), using the distribution function defined on
line 21. The user can use no gain rate variability (except for invariant sites) by
setting g-lambda to infinity using the instruction
fixed inf.
Line 34:
an empty line, dividing between two logical sections of the input file.
Line 35: % control parameters
This is a comment line, indicating the beginning of the section in the input file that
defines various parameter that control some overall features of the algorithm.
Line 36: # categories (loss distribution): 4 % in rendering Gamma
discrete {4}
The distribution of the loss rate variable is rendered discrete using the quantile
method to produce equal probability categories. This parameter sets the number of
categories. By default, it is 4. If the user sets l-lambda
to infinity on line 32, EREM overwrite any other value, and force exactly one category.
Line 37: # categories (gain distribution): 5 % in rendering Gamma
discrete {5}
The distribution of the gain rate variable is rendered discrete using the quantile
method to produce equal probability categories for the gamma distributed part. This
parameter sets the number of categories. By default, it is 5. If the user sets
g-lambda to infinity on line 33, EREM overwrite any other
value, and force exactly two categories (one for the invariant sites, and one for the
non-invariant ones).
Line 38: likelihood tolerance: 1.000000e-07 % likelihood convergence
tolerance
EREM stops only if two stopping criteria are met. This parameter sets the likelihood
convergence stopping criterion, which in practice is the one that really determines when
EREM stops. For the problems that I have studied, a tolerance of 1.0e-7 was more than enough.
Line 39: parameter tolerance: 1.000000e+000 % parameter convergence
tolerance
EREM stops only if two stopping criteria are met. This parameter sets the parameter
convergence stopping criterion. In practice, we recommend using a very loose tolerance
(like in this example) to allow for the likelihood tolerance to determine when EREM
should stop.
Line 40: internal tolerance: 1.000000e-002 % tolerance used for the
M-step
The maximization of the auxiliary function in the M-step use this parameter to stop
iterating. We recommend using low-tolerance maximization, like in this example, to
avoid unnecessary time consuming iterations.
Line 41: maximum # iterations: 50000 % maximum number of iterations
allowed
EREM stops after reaching a certain number of iteration. By default, it is 5000. If this
number is set on zero (typically, when only simulation is desired), EREM does not perform
any iterations, and does not produce an output file.
Line 42: ancestral reconstruction: on % {on|off|detailed}
This line determines whether EREM makes no ancestral reconstruction
(off), makes a total (over all genes) ancestral
reconstruction (on), or make both a total and
a gene-by-gene ancestral reconstruction (detailed).
Line 43:
an empty line, dividing between two logical sections of the input file.
Line 44: % initialization parameters
This is a comment line, indicating the beginning of the section in the input file that
defines various parameter that determine how the parameters of the model are
initialized.
Line 45: p0: observed % {observed|rand th_val|fixed val}
This line determines how to initialize the parameter p0.
(1) fixed val. The initial value is provided by
the user. (2) rand th_val. The initial value
is randomly drawn by EREM assuming a uniform distribution in the range
[th_val, 1]. By default, th_val
is set to 0.95. (3) observed. The initial value
is computed from the raw data, as the ratio between the number of zero-state sites
to the number of all known (i.e., not missing) sites.
Line 46: theta: rand -1 % {rand mu_val|fixed val val ...|const val}
This line determines how to initialize the parameters theta.
(1) fixed val val .... The initial values are provided by
the user. If the number of genes is larger than 10, multiple lines should be used, with exactly 10
values per line (except for the last line). (2) const val.
The initial values are all equal to val. (3)
rand mu_val. The initial values are randomly drawn by EREM,
independently for each gene, assuming exponential distribution with mean
mu_val. By default, mu_val is set to 1.00.
Line 47: phi: dollo % {dollo|fixed val val ...}
This line determines how to initialize the parameters phi.
(1) fixed val val .... The initial values are provided by
the user. If the number of genes is larger than 10, multiple lines should be used, with exactly 10
values per line (except for the last line). (2) dollo.
he initial values are computed from the raw data. The phi of a branch
is computed as the ratio between the number of loss events to the total number of both loss
events and retention events.
Line 48: eta: rand -1 % {rand mu_val|fixed val val ...|const val}
This line determines how to initialize the parameters eta.
(1) fixed val val .... The initial values are provided by
the user. If the number of genes is larger than 10, multiple lines should be used, with exactly 10
values per line (except for the last line). (2) const val.
The initial values are all equal to val. (3)
rand mu_val. Historically, the initial values were randomly drawn by
EREM. However, we have observed that the likelihood function is multimodal, and that the
interesting maximum is always the first. Therefore, instead of initializing
eta by random, we just pick a value which is very close to zero.
Line 49: xi: dollo % {dollo|fixed val val ...}
This line determines how to initialize the parameters xi.
(1) fixed val val .... The initial values are provided by
the user. If the number of genes is larger than 10, multiple lines should be used, with exactly 10
values per line (except for the last line). (2) dollo.
he initial values are computed from the raw data. The xi of a branch
is computed as the ratio between the number of gain events to the total number of both gain
events and retention events.
Line 50: nu: observed % {observed|rand th_val|fixed val}
This line determines how to initialize the parameter nu.
(1) fixed val. The initial value is provided by
the user. (2) rand th_val. The initial value
is randomly drawn by EREM assuming a uniform distribution in the range
[th_val, 1]. By default, th_val
is set to 0.80. (3) observed. The initial value
is computed from the raw data, as the ratio between the number of all-zero patterns
to the number of all known pattern (i.e., patterns that do not include missing data).
Line 51: l-lambda: rand -1 % {rand mu_val|fixed val}
This line determines how to initialize the parameter l-lambda.
(1) fixed val. The initial value is provided by
the user. (2) rand mu_val. The initial value
is randomly drawn by EREM assuming an exponential distribution with mean
mu_val. By default, mu_val
is set to 1.00.
Line 52: g-lambda: rand -1 % {rand mu_val|fixed val}
This line determines how to initialize the parameter g-lambda.
(1) fixed val. The initial value is provided by
the user. (2) rand mu_val. The initial value
is randomly drawn by EREM assuming an exponential distribution with mean
mu_val. By default, mu_val
is set to 1.00.