bioinformatics - home page of Juergen Haas

Automated Hypothesis Formation of Molecular Structures and Interactions Using Physics-Based Modeling and Simulation

Juergen Haas and Dimitris N. Metaxas

DNA superhelical structure a close-up view of the DNA structure

Figure 1. Snapshot of simulation showing DNA super-helical structure interacting with small molecules.

Figure 2. Close-up view showing a recognition event: a protein binds to a site on the DNA that matches its structure and charge distribution.

Described below is the development of efficient representational and algorithmic techniques for automating the process of hypothesizing properties of molecules, mechanisms, and pathways in molecular genetics. Physics-based modeling and simulation is used to efficiently utilize structural and physical/chemical information and for visualization and manipulation in a graphical user interface. Abstract reasoning and learning is used for efficient search for configurations that are consistent with the given constraints. Thus our system CASDIM (Computer Aided Scientific Discovery In Molecular Biology) consists of two tightly integrated components: the simulation environment and the learning/reasoning component. We present simulation experiments in CASDIM demonstrating recognition events as part of a hypothetical gene regulation mechanism.

Introduction

Understanding control and regulation of gene expression remains an important and largely unsolved problem in molecular biology. The complexity of the molecular networks involved in gene regulation is increasingly recognized as more and more experimental data become available on DNA sequences, expression levels, and molecular interactions. Until now, comprehensive hypotheses about molecular mechanisms and networks involved in gene regulation have been formulated exclusively by humans. We are not aware of any software that formulates such hypotheses using physics-based models and simulation, a powerful analysis technique that we believe is similar to the way human biologists often think and reason about biomolecular systems.

Problem statement

CASDIM deals with biomolecular events such as transcription complex formation or chains of molecular reactions/interactions that lead to the synthesis of protein complexes. Many of these reactions are reversible in a molecular system so that processes can be regulated in response to environmental changes. This means the changes in a molecular system over time are highly sensitive to the initial conditions. Due to incomplete knowledge, the complexity of the molecular system, and the random thermal vibrations it is impossible to precisely specify the initial conditions, and a probabilistic approach has to be taken. Thus the kind of problem solved by CASDIM is described by the following input/output specification:

Input: A set of pairs ("training instances"), where Ii is a description of an initial state, and Gi is a description of the experimentally observed final (goal) state (including a time period Ti and a probability Pi specifying the probability of achieving Gi within time Ti); A (user specified) set of parameters PS regarding the structures, properties, and concentrations of the molecules that are subject to modification/exploration by the system.

Output: Parameter settings for PS such that simulation leads any initial states consistent with Ii to final states Fi that are consistent with Gi with probability Pi within time Ti.
System Description

Molecules such as proteins and DNA are modeled as agglomerations of complex three-dimensional polygons. Objects can be linked to form flexible chains such as biopolymers (DNA, RNA, protein sequences). In order to facilitate simulation of a large number of molecules, large objects are hierarchically composed from smaller ones whose locations are specified with respect to the model coordinate system of the large object. Thus, a large molecule can be moved or rotated simply by moving or rotating its coordinate system. The coordinates of the component objects have to be dereferenced only if proximity to other objects has been determined. Proximity of objects is efficiently computed by structuring them in roughly spherical components with precomputed radii. Electrostatic fields are approximated using distributions of point charges of varying strengths. We have devised a procedure for constructing DNA models that takes as input a sequence of nucleotide names (A,C,G,T) and constructs a corresponding three-dimensional DNA model including the super-helical structure (see Fig.1).

The computations of the displacements of the molecules at each step of the simulation are based on the Einstein-Relation, which specifies the relationship between the mean-square displacement and diffusion coefficient: , where is the position of the particle at time t, and D is the diffusion coefficient: D = (RT/N) x [1/(6 x k P)], where N = 6 x 10, T = absolute temperature, k = viscosity coefficient, P = radius of particle, and R = gas constant.

We have implemented a random number generator that generates a distribution of three-dimensional displacements such that the average displacement is consistent with the Einstein relation. The Box-Muller method is used to obtain a gaussian distribution. Translocation and rotation of each molecule are determined by diffusion, electrostatic fields, and collisions. Coulomb's law is used to compute the electrostatic forces that determine the translocations and rotations: , whereis the force experienced by charge . This environment correctly simulates the binding of molecules to each other if they contain sites with complementary structure and charge distribution (Fig. 2).

CASDIM will try to determine values for incompletely specified parameters, such that applying simulations to states consistent with the initial state description of each training pair will lead to a final state consistent with the corresponding final state description. In general it will be necessary to run a set of simulations for each training pair and parameter setting, since final state descriptions are typically probabilistic in nature. E.g., a protein may bind to a particular location on the DNA with 80% probability given a certain amount of time.

Now, the search strategy for determining consistent parameter settings. At an abstract level, the search for correct parameter settings can be viewed as a set of planning problems. Each training instance specifies (1) a partial initial state description Ii , and (2) a partial goal state description Gi . Such descriptions involve for example specification of concentrations of certain molecules or molecule complexes. The system also has a set of rules of the form (C1--> C2), where C1 (the antecedent) is a set of conditions at time t1 and C2 is a set of conditions at time t2 and t1 < t2. This means that if C1 holds in a state S1, then continuing the simulation will lead to a state S2 in which C2 holds. For example, such rules may specify how the change of concentrations of certain molecules influences the concentrations of other molecules; how binding properties between molecules influence concentrations; how the recognition ability of a protein can be disabled through binding to another molecule; etc. These rules are essentially descriptive abstractions of molecular processes and may be either supplied by the user or inferred by the system. CASDIM uses these rules to compile a plan that leads from the initial state description Ii to the final state description Gi. Such a plan may specify for example that some molecule M1 should bind to another molecule M2. CASDIM would then try to modify M1 and/or M2 by changing parameter values and testing it in the simulation environment to achieve this requirement.

In order to obtain a plan for a particular training instance efficiently, CASDIM reasons back from the final state (goal state) description. It searches for rules that achieve (parts) of the goal state description. Then it searches for rules that achieve the antecedents of those rules, etc. until a state description is reached that is consistent with the initial state description. Next, using a special set of rules, parameter settings are determined such that the initial states (in the simulation) are consistent with these refined initial state descriptions. And finally, simulation is used to check if these parameter settings actually lead to the final states specified in the training pairs.

Note: All presentations at this site are in abstract format. Feel free to email me for detailed documents.

Related Publications

"Case-Based Reasoning Driven Gene Annotation," G. Christian Overton and Juergen Haas, book chapter in: Computational Methods in Molecular Biology, Edited by Steven Salzberg, David Searls, and Simon Kasif Published by Elsevier Science, 1998.

"EpoDB: A Bioinformatics System for the Analysis of Gene Expression during Erythropoiesis" presented by Christian J. Stoeckert at the Bioinformatics and Genome Research Conference, Boston, Massachusetts, June 15-16, 1998.

"EpoDB: A Database of Genes Expressed during Vertebrate Erythropoiesis," F. Salas, J. Haas, Brian Brunk, C. J. Stoeckert and G. C. Overton, Nucleic Acids Research, Vol. 26 no. 1, 288-289, 1998.

"Automated Hypothesis Formation of Molecular Structures and Interactions using Physics-based Modeling and Simulation," J. Haas and D.N. Metaxas. Proceedings of the German Conference on Bioinformatics, Leipzig, Germany, Sept. 30, 1996.

"EpoDB: An Erythropoiesis Gene Expression Database/Knowledge Base," F. Salas, J. Haas, C.J. Stoeckert Jr., and G.C. Overton, In: Bioinformatics- German Conference on Bioinformatics, Leipzig, Germany, September/October 1996, Springer Lecture Notes in Computer Science, pp. 52-61.

"QGB: A System for Querying Sequence Database Fields and Features," G.C. Overton, J.S. Aaronson, J. Haas, and J. Adams, Journal of Computational Biology, Vol 1 no.1, 3-14, Spring 1994.

"Using Analogical Reasoning for Knowledge Discovery in a Molecular Biology Database," J. Haas, J.S. Aaronson, and G.C. Overton, Proceedings of the Second International Conference on Information and Knowledge Management, ACM Press, New York, NY, 554-564, 1993.

"Knowledge Discovery in GenBank," J.S. Aaronson, J. Haas, and G.C. Overton, Proceedings of the First International Conference on Intelligent Systems for Molecular Biology, AAAI/MIT Press, 3-11, 1993.

Top