EREM: Evolutionary Reconstruction by Expectation-Maximization

written by: Liran Carmel

last modified:

General Description
We have designed a comprehensive probabilistic model that can be used to describe the evolution of many types of binary characters. A detailed description of the model can be found in Carmel et al. (2005), and in Carmel et al. (2007). EREM is designed to carry out three possible tasks: EREM requires one mandatory input file, and three optional ones. When the computation is over, it produces a number of output files, some of them optional. All these files are text files that can be readily produced or read by the user. A full description of the input files and the output files, as well as a number of example can be found below.

A number of Matlab functions were written to allow for a Matlab user to interact with EREM. These function allow, among other things, to write input files, to read output files, and to visualize some of the output parameters. A full description of these Matlab functions can be found below.

EREM is freely available, both as a source code and as an executable (tested under Windows XP). Also, we freely distribute the supporting Matlab functions, and some example files. To download the software please go to the download section.

Navigate to:     General Description     Input Files     Output Files     Matlab Functions     Download

Input Files
EREM takes a mandatory input file, and up to three optional input files. If the user wants EREM to use a specific phylogenetic tree, it should provide it as a separate text file in a phylogenetic tree format. If the user wants EREM to use specific observations of the binary character upon extant species (as opposed to simulating them), then he/she should provide two separate text files. The first lists all the unique patterns observed, and the second lists, for each gene, the number of times each of the unique patterns appear.

Input File (mandatory):

This file provides EREM with all the necessary information to carry out the computation in full. The name of this file can be anything with an extention *.TXT. However, the convention PROJECTNAME.input.txt is preferred. Within the file, a percentage symbol (%) anywhere in a line indicates a comment, and the rest of the line is ignored. A value of -1 in any of the parameters instructs EREM to take the default value for that parameter.

The format of the input file is pretty rigid, so please follow carefully the instructions. The easiest way to produce an input file would probably be either to modify one of our example files, or to use the Matlab function makeinput. In the following, we shall describe in detail the structure of the input file line by line:
Line 1: % Phylogenetic Tree:
This is a comment line, indicating the beginning of the section in the input file that defines what phylogenetic tree should be used in the computations.
Line 2: time units: MYrs     % either {MYrs|BYrs}
Defines the time units used to measure the branch lengths on the tree. The user can specify either MYrs (million years) or BYrs (billion years).
Line 3: source: tree.phyl     phyl     % either {rand|file_name file_format}
If the user provides the phylogenetic tree, he/she should provide an apporpriate phylogenetic tree input file, and to put the name of this file as file_name. The file_format instruction indicates what is the format of file_name. Currently, we support either the Newick format (file_format = "newick"), or our own format (file_format = "phyl"). If the user wants EREM to randomly generate a phylogenetic tree, he/she should put here the instruction rand.
Line 4: number of leaves: 19     % effect only random trees
This line is meaningful only if the phylogenetic tree is simulated, as it determines the number of terminal nodes in the tree.
Line 5: total time span: 1970.000000     % effects only random trees
This line is meaningful only if the phylogenetic tree is simulated, as it determines the time span of the entire tree. The time units are as indicated in the 2nd line.
Line 6:
an empty line, dividing between two logical sections of the input file.
Line 7: % alignment data (-1 stand for default)
This is a comment line, indicating the beginning of the section in the input file that defines what input data should be used.
Line 8: source: patterns wp.txt ngp.txt     % either {simulation|patterns wp_file ngp_file}
If the user provides this input, he/she should provide an apporpriate wp and ngp files, and to put the name of this files as wp_file and ngp_file, respectively. If the user wants EREM to generate these files by simulation, he/she should put here the instruction simulation.
Line 9: number of genes: 1     % effective only in simulative mode {1,2,3,...}
This line is meaningful only if the data are simulated, as it determines the number of genes in the input data.
Line 10: number of sites: rand -1.000000     % effective only in simulative mode {rand avg_val{5000}|fixed val_list|const val}
This line is meaningful only if the data are simulated, as it determines the number of sites in the multiple alignemnt of each gene. There are three modes to provide EREM with information regarding this number. If the user chooses the rand mode, the number of sites of each gene is picked at random from a uniform distribution in the range (1,2*avg_val). By default, avg_val is 5000. If the user chooses the fixed mode, it should provide a list of integers that exacly matches the number of genes determined in the 9th line. If the user chooses the const mode, all multiple alignments will have precisely the same length.
Line 11: fraction of unknowns: 0.000000     % effective only in simulative mode {0.0 - 1.0}
This line is meaningful only if the data are simulated, as it determines the fraction of sites designated as missing data in the input. If this fraction is greater than zero, EREM, upon completing the simulations, randomly substitutes in this fraction of the input sites the value *, indicating missing data (unknown state).
Line 12: track numbers of present characters: on     % effective only in simulative mode {on|off}
This line is meaningful only if the data are simulated, as it instructs EREM to save complete information about the number of characters in state 1 (during the simulation) in each gene and in each node in the phylogenetic tree, as well the number of gain, loss, and retention events for each gene along each branch. This information is stored in track files, and can be used to evaluate the accuracy of the ancestral reconstruction.
Line 13: filter unknowns: none     % filter input patterns {none|ukf frac}
EREM is capable of basic filtering of the input data. If the user selects the none mode, no filtering is performed. If the user selects the ukf mode, any input pattern with a fraction of frac or more missing values is discarded.
Line 14: p0: -1.000000     % lower threshold {0.95}; used for random generator {0.0 - 1.0}
This line is meaningful only if EREM needs to pick a random value to p0 (prior probability of the root node to be in state 0 at a particular site). In that case, EREM assumes a uniform distribution in the range [threshold, 1]. By default, threshold is set to 0.95.
Line 15: theta: -1.000000     % Poisson parameter {1.0}; used for random generator {>0.0}
This line is meaningful only if EREM needs to pick a random value to theta (gene-specific loss parameter). In that case, EREM picks theta at random, independently for each gene, assuming exponential distribution with mean parameter. By default, parameter is set to 1.00.
Line 16: phi: -1.000000     Alpha parameter (alpha) {0.4}; Poisson parameter (mu) {1.0}; used for random generator
This line is meaningful only if EREM needs to pick a random value to phi (branch-specific loss parameter). In that case, EREM finds the two deepest branches (those that originate from the root), and assigns phi values to them by random, assuming a probability distribution, denoted pdf(mu), and obtained by taking an exponential random variable x with mean mu, and taking phi = 1 - exp(x*T), where T is the branch length. Then, phi is recursively computed down the tree, where the value on each branch is determined by two contributions. A fraction alpha takes the value of the parent branch, and a fraction 1-alpha is determined at random using the distribution pdf(mu). Positive alpha limits the amount by which the value on a branch can deviate from the value in the parent branch. By default, alpha is set to 0.4, and mu is set to 1.0.
Line 17: eta: -1.000000     % Poisson parameter {1.5}; used for random generator {>0.0}
This line is meaningful only if EREM needs to pick a random value to eta (gene-specific gain parameter). In that case, EREM picks eta at random, independently for each gene, assuming exponential distribution with mean parameter. By default, parameter is set to 1.5.
Line 18: xi: -1.000000     % % Alpha parameter (alpha) {0.4}; Poisson parameter (mu) {1.5}; used for random generator
This line is meaningful only if EREM needs to pick a random value to xi (branch-specific gain parameter). In that case, EREM finds the two deepest branches (those that originate from the root), and assigns xi values to them by random, assuming a nonuniform distribution, denoted pdf(mu), and obtained by taking an exponential random variable x with mean mu, and taking xi = 1 - exp(x*T), where T is the branch length. Then, xi is recursively computed down the tree, where the value on each branch is determined by two contributions. A fraction alpha takes the value of the parent branch, and a fraction 1-alpha is determined at random using the distribution pdf(mu). Positive alpha limits the amount by which the value on a branch can deviate from the value in the parent branch. By default, alpha is set to 0.4, and mu is set to 1.5.
Line 19: nu: -1.000000     % lower threshold {0.8}; used for random generator {0.0 - 1.0}
This line is meaningful only if EREM needs to pick a random value to nu (fraction of sites invariant to gain). In that case, EREM assumes a uniform distribution in the range [threshold, 1]. By default, threshold is set to 0.80.
Line 20: l-lambda: -1.000000     % Poisson parameter {1.0}; used for random generator {>=0.0}
This line is meaningful only if EREM needs to pick a random value to l-lambda (shape parameter of the loss rate coefficient). In that case, EREM assumes exponential distribution with mean parameter. If l-lambda is greater than the maximal value allowed (5000), or is lower than the lowest value allowed (see line 31), then it is drawn again, until the value is obtained at the right range. By default, parameter is set to 1.0.
Line 21: g-lambda: -1.000000     % Poisson parameter {1.0}; used for random generator {>=0.0}
This line is meaningful only if EREM needs to pick a random value to g-lambda (shape parameter of the gain rate coefficient). In that case, EREM assumes exponential distribution with mean parameter. If g-lambda is greater than the maximal value allowed (5000), or is lower than the lowest value allowed (see line 32), then it is drawn again, until the value is obtained at the right range. By default, parameter is set to 1.0.
Line 22:
an empty line, dividing between two logical sections of the input file.
Line 23: % model parameters
This is a comment line, indicating the beginning of the section in the input file that defines what EREM should do with the model parameters.
Line 24: p0: estimated     % p0 {estimated|fixed rand|fixed val}
This line determines whether p0 should be estimated by EREM (estimated), or should a constant value be used throughtout the computation (fixed). In the latter case, this value can be provided by the user (val), or it can be drawn at random by EREM (rand), using the distribution fucntion defined on line 14.
Line 25: theta: estimated     % theta {estimated|fixed rand| fixed val val ...|const val|mixed gene val ...}
This line determines whether all theta's should be estimated by EREM (estimated), whether fixed values should be used for all theta's throughout the computation (fixed or const), or whether some of the theta's are fixed and some are estimated (mixed). If all values are fixed, the user can ask EREM to pick values by random (fixed rand), using the distribution fucntion defined on line 15. Or, he/she can provide an identical value to all theta's (const val), or he/she can designate the value of theta for each gene by giving a list of values (fixed val val ...). If the number of genes is larger than 10, multiple lines should be used, with exactly 10 values per line (except for the last line). If only part of the theta's are determined by the user, they should be listed as mixed gene val gene val ..., with pairs of gene number and its theta value. If the list is long, multiple lines are permitted, without a restriction on the number of items in each line.
Line 26: phi: estimated     % phi {estimated|fixed rand|fixed val val ...| const val|mixed node val ...}
This line determines whether all phi's should be estimated by EREM (estimated), whether fixed values should be used for all phi's throughout the computation (fixed or const), or whether some of the phi's are fixed and some are estimated (mixed). If all values are fixed, the user can ask EREM to pick values by random (fixed rand), using the distribution fucntion defined on line 16. Or, he/she can provide an identical value to all phi's (const val), or he/she can designate the value of phi for each branch by giving a list of values (fixed val val ...). If the number of branches is larger than 10, multiple lines should be used, with exactly 10 values per line (except for the last line). If only part of the phi's are determined by the user, they should be listed as mixed node val node val ..., with pairs of branch number and its phi value. If the list is long, multiple lines are permitted, without a restriction on the number of items in each line.
Line 27: eta: estimated     % eta {estimated|fixed rand| fixed val val ...|const val|mixed gene val ...}
This line determines whether all eta's should be estimated by EREM (estimated), whether fixed values should be used for all eta's throughout the computation (fixed or const), or whether some of the eta's are fixed and some are estimated (mixed). If all values are fixed, the user can ask EREM to pick values by random (fixed rand), using the distribution fucntion defined on line 17. Or, he/she can provide an identical value to all eta's (const val), or he/she can designate the value of eta for each gene by giving a list of values (fixed val val ...). If the number of genes is larger than 10, multiple lines should be used, with exactly 10 values per line (except for the last line). If only part of the eta's are determined by the user, they should be listed as mixed gene val gene val ..., with pairs of gene number and its eta value. If the list is long, multiple lines are permitted, without a restriction on the number of items in each line.
Line 28: xi: estimated     % xi {estimated|fixed rand|fixed val val ...| const val|mixed node val ...}
This line determines whether all xi's should be estimated by EREM (estimated), whether fixed values should be used for all xi's throughout the computation (fixed or const), or whether some of the xi's are fixed and some are estimated (mixed). If all values are fixed, the user can ask EREM to pick values by random (fixed rand), using the distribution fucntion defined on line 18. Or, he/she can provide an identical value to all xi's (const val), or he/she can designate the value of xi for each branch by giving a list of values (fixed val val ...). If the number of branches is larger than 10, multiple lines should be used, with exactly 10 values per line (except for the last line). If only part of the xi's are determined by the user, they should be listed as mixed node val node val ..., with pairs of branch number and its xi value. If the list is long, multiple lines are permitted, without a restriction on the number of items in each line.
Line 29: nu: estimated     % nu {estimated|estimated fixed|fixed rand| fixed val}
This line determines whether nu should be estimated by EREM (estimated), or should a constant value be used throughtout the computation. In the latter case, this value can be provided by the user (fixed val), or it can be drawn at random by EREM (fixed rand), using the distribution function defined on line 19. The special option estimated fixed instructs EREM to use in the initialization phase a crude estimated value for nu, using a method determined in line 49, and then fix its value throughout the rest of the computation on 0.9 * the value of the crude estimate.
Line 30: minimal l-lambda value: 0.050000     % minimum value of L-Lambda {0.05}
For numerical reasons, very low values of l-lambda cannot be supported. This parameter defines the minimal allowed value for l-lambda. By default, this value is 0.05. Typically, values lower than 0.05 give the same discrete gamma distribution as 0.05, with the last category being the only one significantly different from zero.
Line 31: minimal g-lambda value: 0.060000     % minimum value of G-Lambda {0.06}
For numerical reasons, very low values of g-lambda cannot be supported. This parameter defines the minimal allowed value for g-lambda. By default, this value is 0.06. Typically, values lower than 0.06 give the same discrete gamma distribution as 0.06, with the last category being the only one significantly different from zero.
Line 32: l-lambda: fixed inf     % loss lambda {estimated|fixed rand| fixed inf|fixed val}
This line determines whether l-lambda should be estimated by EREM (estimated), or should a constant value be used throughtout the computation. In the latter case, this value can be provided by the user (fixed val), or it can be drawn at random by EREM (fixed rand), using the distribution function defined on line 20. The user can use no loss rate variability by setting l-lambda to infinity using the instruction fixed inf.
Line 33: g-lambda: fixed inf     % gain lambda {estimated|fixed rand| fixed inf|fixed val}
This line determines whether g-lambda should be estimated by EREM (estimated), or should a constant value be used throughtout the computation. In the latter case, this value can be provided by the user (fixed val), or it can be drawn at random by EREM (fixed rand), using the distribution function defined on line 21. The user can use no gain rate variability (except for invariant sites) by setting g-lambda to infinity using the instruction fixed inf.
Line 34:
an empty line, dividing between two logical sections of the input file.
Line 35: % control parameters
This is a comment line, indicating the beginning of the section in the input file that defines various parameter that control some overall features of the algorithm.
Line 36: # categories (loss distribution): 4     % in rendering Gamma discrete {4}
The distribution of the loss rate variable is rendered discrete using the quantile method to produce equal probability categories. This parameter sets the number of categories. By default, it is 4. If the user sets l-lambda to infinity on line 32, EREM overwrite any other value, and force exactly one category.
Line 37: # categories (gain distribution): 5     % in rendering Gamma discrete {5}
The distribution of the gain rate variable is rendered discrete using the quantile method to produce equal probability categories for the gamma distributed part. This parameter sets the number of categories. By default, it is 5. If the user sets g-lambda to infinity on line 33, EREM overwrite any other value, and force exactly two categories (one for the invariant sites, and one for the non-invariant ones).
Line 38: likelihood tolerance: 1.000000e-07     % likelihood convergence tolerance
EREM stops only if two stopping criteria are met. This parameter sets the likelihood convergence stopping criterion, which in practice is the one that really determines when EREM stops. For the problems that I have studied, a tolerance of 1.0e-7 was more than enough.
Line 39: parameter tolerance: 1.000000e+000     % parameter convergence tolerance
EREM stops only if two stopping criteria are met. This parameter sets the parameter convergence stopping criterion. In practice, we recommend using a very loose tolerance (like in this example) to allow for the likelihood tolerance to determine when EREM should stop.
Line 40: internal tolerance: 1.000000e-002     % tolerance used for the M-step
The maximization of the auxiliary function in the M-step use this parameter to stop iterating. We recommend using low-tolerance maximization, like in this example, to avoid unnecessary time consuming iterations.
Line 41: maximum # iterations: 50000     % maximum number of iterations allowed
EREM stops after reaching a certain number of iteration. By default, it is 5000. If this number is set on zero (typically, when only simulation is desired), EREM does not perform any iterations, and does not produce an output file.
Line 42: ancestral reconstruction: on     % {on|off|detailed}
This line determines whether EREM makes no ancestral reconstruction (off), makes a total (over all genes) ancestral reconstruction (on), or make both a total and a gene-by-gene ancestral reconstruction (detailed).
Line 43:
an empty line, dividing between two logical sections of the input file.
Line 44: % initialization parameters
This is a comment line, indicating the beginning of the section in the input file that defines various parameter that determine how the parameters of the model are initialized.
Line 45: p0: observed     % {observed|rand th_val|fixed val}
This line determines how to initialize the parameter p0. (1) fixed val. The initial value is provided by the user. (2) rand th_val. The initial value is randomly drawn by EREM assuming a uniform distribution in the range [th_val, 1]. By default, th_val is set to 0.95. (3) observed. The initial value is computed from the raw data, as the ratio between the number of zero-state sites to the number of all known (i.e., not missing) sites.
Line 46: theta: rand -1     % {rand mu_val|fixed val val ...|const val}
This line determines how to initialize the parameters theta. (1) fixed val val .... The initial values are provided by the user. If the number of genes is larger than 10, multiple lines should be used, with exactly 10 values per line (except for the last line). (2) const val. The initial values are all equal to val. (3) rand mu_val. The initial values are randomly drawn by EREM, independently for each gene, assuming exponential distribution with mean mu_val. By default, mu_val is set to 1.00.
Line 47: phi: dollo     % {dollo|fixed val val ...}
This line determines how to initialize the parameters phi. (1) fixed val val .... The initial values are provided by the user. If the number of genes is larger than 10, multiple lines should be used, with exactly 10 values per line (except for the last line). (2) dollo. he initial values are computed from the raw data. The phi of a branch is computed as the ratio between the number of loss events to the total number of both loss events and retention events.
Line 48: eta: rand -1     % {rand mu_val|fixed val val ...|const val}
This line determines how to initialize the parameters eta. (1) fixed val val .... The initial values are provided by the user. If the number of genes is larger than 10, multiple lines should be used, with exactly 10 values per line (except for the last line). (2) const val. The initial values are all equal to val. (3) rand mu_val. Historically, the initial values were randomly drawn by EREM. However, we have observed that the likelihood function is multimodal, and that the interesting maximum is always the first. Therefore, instead of initializing eta by random, we just pick a value which is very close to zero.
Line 49: xi: dollo     % {dollo|fixed val val ...}
This line determines how to initialize the parameters xi. (1) fixed val val .... The initial values are provided by the user. If the number of genes is larger than 10, multiple lines should be used, with exactly 10 values per line (except for the last line). (2) dollo. he initial values are computed from the raw data. The xi of a branch is computed as the ratio between the number of gain events to the total number of both gain events and retention events.
Line 50: nu: observed     % {observed|rand th_val|fixed val}
This line determines how to initialize the parameter nu. (1) fixed val. The initial value is provided by the user. (2) rand th_val. The initial value is randomly drawn by EREM assuming a uniform distribution in the range [th_val, 1]. By default, th_val is set to 0.80. (3) observed. The initial value is computed from the raw data, as the ratio between the number of all-zero patterns to the number of all known pattern (i.e., patterns that do not include missing data).
Line 51: l-lambda: rand -1     % {rand mu_val|fixed val}
This line determines how to initialize the parameter l-lambda. (1) fixed val. The initial value is provided by the user. (2) rand mu_val. The initial value is randomly drawn by EREM assuming an exponential distribution with mean mu_val. By default, mu_val is set to 1.00.
Line 52: g-lambda: rand -1     % {rand mu_val|fixed val}
This line determines how to initialize the parameter g-lambda. (1) fixed val. The initial value is provided by the user. (2) rand mu_val. The initial value is randomly drawn by EREM assuming an exponential distribution with mean mu_val. By default, mu_val is set to 1.00.

Phylogenetic tree file (required unless simulated):

This file provides EREM with all the necessary information about the phylogenetic tree, which is its topology and the branch lengths. Two file formats are supported by EREM. (1) The NEWICK format, which is a standard in phylogenetics. (2) The PHYL format, which was developed by us. We provide Matlab functions that can read and write trees in the PHYL format.

The PHYL format is easy for a human to read throught. However, it is pretty rigid, so please follow carefully the instructions. The easiest way to produce a PHYL file would probably be to use the Matlab function dump. In the following, we shall describe in detail the structure of the PHYL format line by line:
Line 1: N Nodes (ID name age):
Here, N is the number of nodes in the graph. The fields in parenthesis are the headers of the columns of the lines to follow.
Lines 2-(N + 1): d: node_name node_age
In this part, information about nodes is provided. Each node in the tree is indexed by a unique integer (ID) between 1 and N. Here, d is the ID of a node, and is followed by the node name (e.g., 'Chordata') and its age in million years ago (e.g., 770).
Line N + 2:
empty line.
Line N + 3: Edges (ID from to):
The fields in parenthesis are the headers of the columns of the lines to follow.
Lines (N + 4)-(2N + 2): d: from to
In this part, information about branches is provided. Each branch in the tree is indexed by a unique, albeit completely arbitrary, integer (ID) between 1 and (N - 1). Here, d is the ID of a branch, and is followed by the node ID from where the branch begins, and the node ID where the branch ends.

observed character values (required unless simulated):

A pattern is the set of the states of all extant species (in the above phylogenetic tree) in a particular site (location). A state can be zero (absence of the character), one (presence of the character) or two (missing data, lack of knowledge about the character). In any site, we observe one particular pattern. In order to summarize this infromation for a particular data set, the user is required to provide two text files.

WP file. This file lists all the unique patterns observed in the data set. To this end, let S be the number of extant species (terminal nodes in the phylogenetic tree), and let Omega be the number of unique patterns observed. Then, the input file is a text files comprised of a matrix of values 0, 1, or 2, arranged as a matrix of size Omega-by-S, such that each row describes a single pattern. Any name can be chosen for this file. As a convention, we prefer the prefix WP for these files.

NGP file. This file counts how many of each of the unique patterns are observed for each individual gene. Each line describes a single gene, and is comprised of the gene name followed by a list of Omega nonnegative integers, each counts the frequency of the corresponding unique pattern in this gene. Any name can be chosen for this file. As a convention, we prefer the prefix NGP for these files.

Output Files
Once the computation is completed, EREM produces a number of output files. In brief, the summary of the input and the results is given in the output file. The values of the model parameter and the likelihood are constantly recorded in the history files. Detailed ancestral reconstruction is given in the detailed ancestral reconstruction file. In simulation mode, the drawn model parameters are saved in the Matlab model file; the drawn phylogenetic tree is saved in PHYL format; the drawn alignments are saved as two text alignment files. Also, the simulated positions of the present character throughout the evolution along the phylogenetic tree can be recorded in the track file. Upon completion of the computation, EREM produces an empty file by the name DONE.TXT. The only purpose of this file is to allow external software to know when EREM has completed.

Output File (always produced):

This file is the main output report of EREM, summarizing the user input parameters, the user input, the result of the parameter estimation, and the results of the overall ancestral reconstruction. The file is textual, and self-explanatory. The name of this file is always PROJECTNAME.output.txt. In the following, we shall review the differnt sections of this file:
Job Statistics:
Information on the number of iterations and the total time required for the computation.
Phylogenetic Tree:
Link to the phylogenetic tree file that had been used for the computation. If the tree was not provided by the user, EREM saves the tree in PHYL format, under the default name PROJECTNAME.PHYL. In this section, the total number of leaves (extant species), and the total time span are also indicated.
Alignment data:
Summary statistic of the input data. The first part summarizes information about the genes themselves - how many there are, what is the length of each, and a number of statistical features of the length distribution (average, minimum, and maximum). If a filteration took place (see Line 13 in the input file), these statistics are computed for both filtered and non-filtered genes. The second part summarizes the pattern observed. This includes a list of the unique patterns and their frequencies, and a list of the number of unique patterns observed in each individual gene (see also the NGP file).
Fixed parameters:
Values of those model parameters that are not estimated by EREM. That is, the value of these parameters were fixed by the user. The data structure used in EREM to represent a phylogenetic tree always assigns a virtual branch leading into the root. Therefore, the gain and loss coefficients associated with this branch are meaningless, and by convention always fixed to zero. Therefore, these two parameters will always be a part of the list of fixed parameters.
Estimated parameters:
Values of those model parameters that are estimated by EREM. Next to the name of each parameter, we indicate in square parenthesis the allowed range of this parameter.
Loss Rate distribution:
The distribution of the loss rate variable used by EREM is a discrete approximation of the true continuous gamma distribution. We use Ktheta equal-probability categories, and this section lists the value of the rate variable in each category.
Gain Rate distribution:
The distribution of the gain rate variable used by EREM is a discrete approximation of the true continuous invariant-gamma distribution. We use Keta categories, with the first (corresponding to the invariant sites) having a probability nu, and the rest have equal probabilities. and this section lists the value of the rate variable in each category (of course, this value is always zero in the first category).
Control parameters:
The values of the parameters that control the way EREM runs, see lines 35-42 in the input file.
Ancestral reconstruction (optional):
Summary over all genes of the ancestral reconstruction. For each node, the number of present characters is given, as well as the number of events (gain, loss, retention) along the branch leading to it.

History Files (always produced):

The value of all model parameters, as well as the likelihood, is recorded in each iteration, and saved in the form of history files. For each parameter, the file's name is PROJECTNAME.HISTORY.PARAMETERNAME.TXT. For the likelihood, it is PROJECTNAME.HISTORY.LIKELIHOOD.TXT. For estimated model parameters, each line in the file corresponds to a single iteration. If the parameter is not a scalar (e.g., theta), all values are listed in a single line, even if some of the values are fixed. For fixed model parameters, the history file contains a single line with the fixed value(s) of that parameter. As the initial values are recorded in this file too (as the first line), the number of lines in each parameter history file is number of iterations + 1. For the likelihood, the value is recorded after each individual parameter is optimized, and therefore the history file may contain multiple values for a single iteration.

Detailed Ancestral Reconstruction File (optional):

This file, by the name PROJECTNAME.AR, is produced if the user reqests for detailed ancestral reconstruction (see line 42 in the input file). The ancestral reconstruction is provided in a gene-by-gene basis. For each gene, the number of present characters in each node is estimated, as well the number of events (gain, loss, retention) along the branch leading to that node.

Phylogenetic Tree File (optional):

This file is produced in simulation mode, if the phylogenetic tree is drawn by EREM. The tree is saved in PHYL format under the name PROJECTNAME.PHYL.

Model File (optional):

This file is produced in simulation mode, if the model parameters are drawn by EREM. This is a Matlab file by the name PROJECTNAME_TRUEMODEL.M, but it can be easily read also as a text file.

Alignment Files (optional):

This file is produced in simulation mode, if the presence/absence maps of the binary character are simulated by EREM. This inormation is given in the form of the WP file PROJECTNAME.WP.TXT and the NGP file PROJECTNAME.NGP.TXT.

Track File (optional):

This file can be produced in simulation mode, if the presence/absence maps of the binary character are simulated by EREM (see line 12 in the input file). This file records the "true" evolutionary history of the binary character, by counting the number of present characters in each node, and the number of events (gain, loss, retention) on each branch. This information might later be compared with the reconstructed evolution to test for the accuracy of the reconstruction. This file has the name PROJECTNAME.TRACK.TXT. It keeps information on each gene separately, as well as summarizes it for the collection of all genes.

Done File (always produced):

An empty text file with the reserved name DONE.TXT, created once EREM completes the computation. It is used in external software to test when EREM completes its operation.

Matlab Files
We have written a couple of Matlab function to allow a Matlab user easily analyze and visualize the results of EREM. Here we give a one-line description of each of the functions. A detailed help on each function can be obrained by typing in Matlab: help function_name.

Analyzing EREM output:

gaindecomp
decomposes raw gain data to branch & gene contributions.
lossdecomp
decomposes raw loss data to branch & gene contributions.

Pattern manipulations:

condensepatterns
merges identical patterns to form a list of unique patterns only.
groupal
takes alignment data and builds the WP and NGP files.
mergepatterns
merges two different sets of WP and NGP files.
removegenes
removes a selected set of genes from the data.
removepatterns
removes a selected set of patterns from the data.

Performance testing:

comparechars
compares the actual number of present characters in each node with the reconstructed number, and plots the corresponding bar plots (simulation mode only).
compareevents
compares the actual number of evolutionary events along each branch with the reconstructed numbers, and plots the corresponding bar plots (simulation mode only).

Reading and writting files:

makeinput
prepares the input file for EREM.
out2inp
prepares an input file for EREM based on an output file.
out2pv
reads an output file and prepares a structure of the model parameters.
readar
reads-in a detailed ancestral reconstruction file.
readngp
reads-in an NGP file.
readoutput
reads-in an output file.
readstrata
reads-in internal division of genes performed by the function multigene.
readtrack
reads-in a track file.
readwp
reads-in a WP file.
writepatterns
creates the WP and NGP files.

Running EREM:

erem
runs EREM from within Matlab
hasfinished
checks if EREM has finished to computation.
multigene
builds input files and runs the heterogeneous phase.
proflike
profile likelihood technique to find confidence intervals.

Visualization:

ploteta
plots the history file of eta.
plotglambda
plots the history file of g-lambda.
plotlikelihood
plots the history file of the likelihood.
plotllambda
plots the history file of l-lambda.
plotnu
plots the history file of nu.
plotp0
plots the history file of p0.
plotphi
plots the history file of phi.
plottheta
plots the history file of theta.
plotxi
plots the history file of xi.

Download
The software and its code are freely available from this site, as well as sample files and a set of supporting Matlab functions designed for Matlab users. The latest release is EREM_19Oct2006. To recieve updates, new releases, etc, please subscribe below:
Name: email:
To stop recieving emails with updates, new releases, etc, please unsubscribe below:
email: