A simple, fast, and accurate maximum likelihood procedure to estimate amino acid replacement rate matrices from large data sets

Amino acid replacement matrices are essential for many statistical methods that analyze protein sequences. Maximum likelihood (ML) methods typically generate the best replacement matrices because they can fully utilize the information in multiple protein alignments. However, for this application ML analysis is complex and computationally expensive.

We propose a modified ML procedure, called FastMG, to estimate amino acid replacement rate matrices from large data sets and overcome this obstacle. The key innovation of FastMG is the alignment-splitting algorithm, which splits alignments with many sequences into smaller sub-alignments in such a way that each sub-alignment still contains enough information for estimating the amino acid replacement rates.

We designed two splitting algorithms: a naive random splitting algorithm and a tree-based splitting algorithm. The tree-based splitting algorithm follows the neighbor-joining (NJ) algorithm to split large alignments into sub-alignments.
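As a rough illustration of the naive random splitting idea only, the following Python sketch shuffles the sequences of an alignment and partitions them into sub-alignments. It is not the FastMG implementation: the function name `random_split`, the dictionary representation of an alignment, and the reading of `k` as an upper bound on the number of sequences per sub-alignment are all assumptions made for this example.

```python
import random


def random_split(alignment, k, seed=None):
    """Randomly partition an alignment into sub-alignments.

    alignment: dict mapping sequence name -> aligned sequence string.
    k: assumed here to be the maximum number of sequences allowed in a
       sub-alignment (hypothetical interpretation for this sketch).
    seed: optional seed so the shuffle is reproducible.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")

    names = list(alignment)
    rng = random.Random(seed)
    rng.shuffle(names)

    # Smallest number of groups such that no group exceeds k sequences.
    n_groups = max(1, -(-len(names) // k))  # ceiling division

    # Deal the shuffled names round-robin so group sizes differ by at most 1.
    groups = [names[i::n_groups] for i in range(n_groups)]

    # Every sub-alignment keeps the full columns of its sequences.
    return [{name: alignment[name] for name in group} for group in groups]


if __name__ == "__main__":
    toy = {f"seq{i}": "ACDEFGHIKL" for i in range(10)}
    for sub in random_split(toy, k=4, seed=0):
        print(sorted(sub))
```

Dealing the shuffled names round-robin avoids leaving one very small leftover group, which matters if each sub-alignment must retain enough information for estimation. A tree-based variant would group closely related sequences instead of shuffling them.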

FastMG is a simple, fast, and accurate procedure to estimate amino acid substitution matrices from large data sets. FastMGT (FastMG with the tree-based splitting algorithm) with k≥16 was about an order of magnitude faster than the standard estimation procedure without reducing the quality of the estimated matrices. We strongly suggest k=16 as a good choice for the FastMGT estimation procedure.


Last edited Mar 17, 2015 at 2:34 PM by cuongdc, version 20