MATHEMATICAL BIOTECHNOLOGY?

MARCELLO BUIATTI(1) and STEFANO RUFFO(2)

(1) Department of Animal Biology and Genetics
University of Florence, Italy.

(2) Department of Energy ``Sergio Stecco''
University of Florence, and INFN, Italy

Introduction

Modern biotechnology employs recombinant DNA techniques for the improvement of productive characters of organisms useful in agriculture or industry, and for therapeutic purposes. To this end, DNA sequences are transferred from one organism to another disregarding their phylogenetic distance, thus overcoming natural barriers.

The basic principles on which biotechnology is founded have very old and deep roots in the history of twentieth century biology. It was a physicist, Erwin Schrödinger, who, in a famous series of lectures in the early forties, [1] foresaw the existence of a macromolecular information container and introduced the concept that the structure and function of an organism can be wholly predicted simply by reading heritable information. The discovery that DNA has structural and functional features very similar to those of the predicted macromolecule [2] led to the formulation of the so-called ``central dogma'' of molecular genetics, stating the existence of an unambiguous, deterministic, unidirectional flow of information from DNA to RNA to proteins or, in other terms, from genotypes to phenotypes. In the early seventies the discovery of restriction enzymes and of other biological tools, allowing the cutting and joining of DNA sequences at specific, known sites, opened the way to the construction of new genetic combinations which, according to the deterministic, additive rules derived from the central dogma, should lead to perfectly predictable new combinations of characters. In that period living beings were described as complicated but not ``complex'' systems, whose rules were nothing more than the expansion of the central dogma itself, genetic engineering being simply their wholly predictable technological consequence.

As often happens, further studies have shown a rather different picture. In particular, the translation of the linear information stored in DNA into four dimensional (space-time) phenotypes has been shown to involve a complex network of non linear, interactive dynamical processes. This is probably the main reason for the relatively few genetically modified (transgenic) plants and animals on the market, in spite of the large number of potentially ``useful'' genes cloned and the availability of reliable transformation techniques. In plants, for instance, in June 1997 most of the transgenic genotypes on the market, or ready for it, had one single character modified, having acquired resistance to herbicides, generally through the integration of specific bacterial genes. The second most frequent modification was resistance to insects, again through genes coming from one bacterium (Bacillus thuringiensis). In a few other cases the new genotypes were male sterile, resistant to fungal pathogens, or produced oils with improved nutritional properties. It should be noted that in very few cases did the introduced gene interfere with plant metabolism in a significant way, and therefore few if any unpredictable negative side effects of genetic manipulation were observed. [3]

The problems that side effects of transgenosis can cause in this and other fields, at the organismal as well as at the population and ecological levels, strongly underline the need for a better knowledge of the dynamical rules of biological processes, particularly at the levels of gene structure and function, gene network interaction and development, for an improved design and control of biotechnological processes.

Mathematical modeling, therefore, after a long period of almost total oblivion, is again becoming an important tool, not only for building theories, of which biology in any case has urgent need, but also for technological applications of the knowledge acquired on the genetic and molecular basis of life. The remaining part of this introductory paper will therefore be devoted to discussing in more detail the problems which biotechnology faces at the different levels of the hierarchical organisation of life, and the conditions under which modeling may eventually lead to successful solutions, ending with a summary of modeling tools. This is also meant to avoid a purely ``selfish'' attitude to modeling, both on the side of biologists and on that of physicists and mathematicians. In the first case, as discussed at length elsewhere, [4] biologists use mathematical tools only as devices for ordering their experimental data in a descriptive way; in the second, mathematicians, more than physicists, use biological structures and processes to build analogies supporting models which are otherwise totally independent of any non-mathematical falsification.

The DNA level

At the beginning of molecular biology, images and concepts from information theory were used to develop a very successful analogical metaphor with great heuristic power. According to this metaphor DNA was considered a ``string of information'', the central dogma giving the rules of linear information flow. Linguistic concepts were also introduced, such as the ``code'' for DNA reading, ``transcription'' (the synthesis of RNA on a DNA template) and ``translation'' (protein assemblage, using a ``language'' of 20 amino acids instead of one based on the four nucleotides A, T, G, C of nucleic acids). The frequent and successful use of both metaphors led to their ``collapse'', as epistemologists would call it, onto the real, basic macromolecules of life, and particularly onto DNA and RNA, in the sense that their material nature was not considered as relevant as their information content. In other words, the fact that DNA indeed contains heritable information for life led to the wrong concept of DNA being purely information. This, in turn, led to the view that DNA constraints are due mainly to the effect of selective forces acting on proteins and, through them, on coding sequences.

The overall picture we have now is much more complex. In the first place, the fraction of the genome covered by coding sequences decreases from viruses and bacteria (where it approaches the whole genome) to animals and plants (down to a few percent). Non coding DNA contains sequences which, when knowledge of their function was almost completely lacking, were called ``junk'' DNA and thought to be so void of relevance for the adaptive value of organisms that they were used as the basis for the so-called theory of ``selfish'' DNA. According to that theory, [5] junk DNA components, and particularly a wide class of sequences called ``repetitive'' for their presence in a high number of copies per genome, are fixed or discarded throughout evolution not because of their effects on phenotypes but according to their relative competitive value within the genome. We know now that non coding DNA is a heterogeneous class of sequences, part of which does not show at first sight any generic distinctive feature, while the rest displays some sort of homogeneity, the repetitive families being the best known case.

Non coding sequences are, moreover, found in different parts of the genome, within genes (as introns) as well as between them.

Obviously, the physico-chemical properties of homogeneous and heterogeneous regions, and of coding and non coding ones, depend on the nucleotide composition and influence the structural organisation and dynamical behavior of DNA molecules and their interaction with other molecules, including RNA and proteins. All processes involved in the transfer of genetic information require complex interactions and are quantitatively modulated by their dynamics. Transcription of DNA into RNA only occurs, both in prokaryotes and in eukaryotes, when an efficient transcription complex gets organised. Transcription complexes involve upstream regions and a variety of proteins (transcription factors and others) which bind to specific short DNA sequences. Different transcription factors are available in cells in different physiological states (and, in multicellular eukaryotes, in different tissues), and the composition of their populations varies with time. The coupling of transcription factors into a complex requires DNA curvature, a process whose occurrence and efficiency are modulated by the physico-chemical nature of the sequences and therefore by their composition.

Moreover, sequence structure influences the opening of DNA double helices, the speed with which RNA is transcribed by RNA polymerase, etc. Other examples of ``working rules'' of the biological machinery can be taken from the translation of RNA into proteins and from splicing in eukaryotes. In the first process, again, interaction occurs between RNA and proteins in the ribosomes, the centers of protein synthesis, and between the ribosomes, the RNA ready for translation (m-RNA) and transfer RNA (t-RNA), the carrier of amino acids. Only a few of the rules governing this network are known, such as the existence of specific recognition sequences between m-RNA and the RNA contained in ribosomes (r-RNA), the need for coherence between the t-RNA populations and the sequence to be translated, etc. Many more rules and constraints, however, can be hypothesized in a complex which probably involves over 100 components. Similarly, constraints on DNA sequences must be respected for the organisation of DNA into chromosomes, particularly in eukaryotes. In this case DNA is compacted through several levels of super-coiling, starting from the formation of nucleosomes, complexes of DNA with octamers of proteins, the histones.

All the examples of constraints described are most probably only a few of the many existing, deriving from the physico-chemical nature of nucleic acids and little or not at all from the adaptive efficiency of the coded proteins. A new area of investigation has therefore been opened on the hidden regularities of DNA sequence composition, their relevance in evolution, and their meaning in terms of efficiency of the complex machinery involved in the structure, organisation and stability of DNA and in the expression of genetic information. The value for biotechnology of the knowledge acquired in this field should at this point be obvious, if we only think that knowing the rules would allow us to program the expression levels, timing and localization of engineered sequences and also, in cases of large genetic distance between donor and host organism, their adaptation to the recipient genetic environment and its rules. As a matter of fact, practical genetic engineering has already been faced with problems deriving from the lack of coherence between donor and host DNA rules. An illuminating example can be taken from early experiments on the production of plants resistant to insects through the integration of an insecticide-producing gene from the bacterium Bacillus thuringiensis. In that case the lack of coherence between the codon usage and the t-RNA population composition of the two partners caused low expression of the bacterial gene, which was overcome by artificially changing its codon composition. The rationale to be followed in the search for hidden rules is therefore, in the first place, to investigate whether there are, in DNA sequences, significant deviations from randomness, and to build models of their distribution. This may then allow the localization and classification of the sequences causing the most significant deviations, thus yielding preliminary information about their possible function on the basis of the existing knowledge of the physical chemistry of DNA components. A considerable amount of knowledge has been acquired in this area in the last few years using several mathematical methods for the search for constraints, ranging from long range correlation analysis to entropy or linguistic complexity measures, Markov analysis, studies on recurrent words, etc. [6] Recently, a first model of deviations from randomness in prokaryotic genomes has also been proposed. [7] All this has led to some significant conclusions of general value (see the contributions by P. Liò in this book).
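As an illustration of the simplest statistics involved in such searches, the following Python fragment (a minimal sketch of ours, not taken from the cited literature; the sequence is invented, and any real analysis would run over genomic data) computes the dinucleotide odds ratios $\rho(xy)=f(xy)/f(x)f(y)$, whose deviation from 1 flags a compositional constraint, together with the Shannon entropy of the k-mer distribution:

\begin{verbatim}
# Minimal sketch of a "deviation from randomness" scan: dinucleotide
# odds ratios (observed / expected under independent bases) and k-mer
# Shannon entropy.  The toy sequence below is hypothetical; a real
# analysis would read a genome from a FASTA file.
from collections import Counter
from itertools import product
from math import log2

seq = "ATGGCGCGCGATTTTTAACGCGCGCCATATATATGGCGCGAATTCCGG" * 20

def dinucleotide_odds(s):
    """rho(xy) = f(xy) / (f(x) * f(y)); rho far from 1 flags a constraint."""
    n = len(s)
    base = Counter(s)
    di = Counter(s[i:i + 2] for i in range(n - 1))
    rho = {}
    for x, y in product("ACGT", repeat=2):
        fx, fy = base[x] / n, base[y] / n
        fxy = di[x + y] / (n - 1)
        rho[x + y] = fxy / (fx * fy) if fx * fy > 0 else float("nan")
    return rho

def kmer_entropy(s, k):
    """Shannon entropy (bits) of the k-mer distribution; values well below
    those of a shuffled sequence indicate repetitive or constrained DNA."""
    counts = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

rho = dinucleotide_odds(seq)
print("most over-represented: ", max(rho, key=rho.get), round(max(rho.values()), 2))
print("most under-represented:", min(rho, key=rho.get), round(min(rho.values()), 2))
print("3-mer entropy (bits):  ", round(kmer_entropy(seq, 3), 3))
\end{verbatim}

The real methods cited above (long range correlations, linguistic complexity, Markov analysis) are of course far more refined, but they share this basic logic: compare observed sequence statistics with a null model of randomness.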

The way is now open to new research on both the theoretical and the applied sides. From the first point of view, the use of the methods developed may offer new insights into the problem of the relative weight of chance and necessity in evolution, through their application to the rapidly growing number of completely or partially sequenced genomes. From the second, on the other hand, the data obtained so far are suggestive of rules fixed during evolution, and therefore probably of ``internal'' adaptive value, but evidence is lacking on their causative sequences and functional role. To tackle this part, the development of specific analytical and computational tools is strongly needed for the localization of the relevant sequences and the study of their composition. In other words, it is necessary to develop methods leading from the knowledge of generic rules to knowledge of the specific nature and composition of the sequences causing them. Moreover, experimental work has to be done to ascertain the effect of those specific sequences on gene expression and on phenotypes. Many tools are available for that, such as directed mutagenesis, which allows sequences to be changed at will and then tested in vivo after integration through recombinant DNA techniques, or the analysis of the dynamical structure of DNA molecules containing tracts of specific interest.

The cell and organism level

As we have seen in the previous section, right from the beginning the transfer of genetic information from DNA into proteins is a dynamical process determined by the interaction among a large number of molecules, each with specific physico-chemical features according to which they respond to changing microenvironments. This is the reason why, as discussed before, knowledge of the composition of the sequences transferred from one organism to another, and thereby of their physico-chemical properties, is critical for the design, prediction and modulation of genetic engineering operations. The same logic applies to all levels of biological organisation. [8]

Dynamic interactions occur among all components of each cell; cell-to-cell and cell-environment communication is made possible by very sophisticated signal transduction systems; the ``division of labor'' between tissues in higher organisms is determined by the fluxes and concentrations of a number of key substances present in the egg and/or produced in the developing organism. Also at the ecosystem and biosphere levels, communication between organisms, and between them and the environment, is crucial for development, survival and reproduction. The whole dynamical order of life follows hierarchical rules, in the sense that some factors influence more processes and structures than others. A good example of this fact is represented by eukaryotic hormones, a heterogeneous group of substances which, even at very low concentrations, have widely pleiotropic effects. Often, relevant switches of development and metabolism derive from small threshold changes in the relative concentrations of different hormones.

As many authors have suggested, [9, 10, 11] this implies that the adaptive values (and productivity) of organisms are the result of non additive, non linear, dynamical interactions within the whole genetic network and that, consequently, the introduction of new genetic components into an organism may induce not wholly predictable changes in areas of the network itself, changes which are larger the higher the position of the engineered sequence in the organismal hierarchy. The consequence of all this is the need for predictive models of the dynamic behavior of transgenic organisms, particularly eukaryotes, allowing the expression of alien engineered genes to be tuned to the host organism.
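To make the threshold behavior just mentioned concrete, here is a minimal Python sketch (our own toy model, not taken from the cited literature) of a bistable two-gene network. The equations are the standard mutual-repression ``toggle switch'' form; the transgene is represented, purely for illustration, as a constant extra drain $k$ on one gene product, and all parameter values are arbitrary choices of ours:

\begin{verbatim}
# Two mutually repressing genes u, v (a standard bistable "toggle
# switch" form):  du/dt = a/(1+v^n) - u,   dv/dt = a/(1+u^n) - v.
# A transgene is modeled, purely for illustration, as an extra
# constant drain k on u.
a, n = 4.0, 3.0

def steady_state(k, u0=1.2, v0=1.0, dt=0.01, steps=20000):
    u, v = u0, v0
    for _ in range(steps):                 # forward Euler integration
        du = a / (1 + v ** n) - u - k
        dv = a / (1 + u ** n) - v
        u, v = u + dt * du, v + dt * dv
    return u, v

for k in (0.0, 0.2, 0.5, 1.0):
    u, v = steady_state(k)
    print(f"k={k:3.1f} -> u={u:5.2f}, v={v:5.2f}",
          "(u-high)" if u > v else "(v-high)")
\end{verbatim}

The point is not the numbers but the shape of the response: below a threshold the perturbation is absorbed and the network stays in its original state, while above it the whole network flips into a qualitatively different state.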

A series of mathematical and computational tools has been developed which allows the study of interacting networks, from Turing's reaction-diffusion equations to coupled maps, cellular automata, neural networks, spin glass systems, etc., but their application to biological systems still suffers from several serious drawbacks. Firstly, many of the mentioned tools are still based on the reduction of the studied networks to binary, discontinuous systems (although a series of corrective measures can be taken), while living networks do not seem, from experimental data, to fall entirely into this category, but rather to show continuous and discontinuous features at the same time. Moreover, and this is certainly a more crucial problem, the number of components of biological systems is far too high for an exhaustive analysis of interactive dynamical behaviors with the available computational methods.

The third obstacle to correct mathematical modeling is the lack, in many cases, of complete quantitative data on the reactions occurring in a given biological network. The general solution adopted by scholars of these problems is essentially to reduce the complexity to a manageable number of emergent variables, possibly choosing them among those for which sufficient experimental data are available. Several interesting examples of this approach are available which lead to dynamical models with putative predictive value, most of which still badly need falsification.

The most extreme version of this approach has probably been taken by the artificial life groups, which seem the farthest away from experimental biology. Kauffman, [12] for instance, uses Boolean networks to mimic the origin of life and, more recently, genetic interactions, with the ambitious aim of constructing something very near to a universal biological theory. Some of the results are undoubtedly relevant at a moment when a surprising amount of data coming from molecular biology certainly needs systematisation, but at the same time they pose some tricky methodological questions. The main worry and warning to be taken from this kind of work concerns, in our view, the possibility that the reduction of extremely complex evolutionary processes to Boolean networks with very few variables may lead to the discovery of interesting rules of such networks but not necessarily of the system of reference, life itself. In other words the danger is, again, the ``collapse'' of the Boolean metaphor onto real life systems, in a way somewhat similar to what happened earlier in the history of contemporary biology, when the informational-linguistic central dogma metaphor, despite its strong heuristic power, led to a misconception of the dynamical processes of life.
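To fix ideas about the object of this discussion, a Kauffman-style random Boolean network can be written down in a few lines. The sketch below is ours, with toy values for the network size N and the connectivity K: each binary ``gene'' reads K randomly chosen inputs through a random truth table, and the deterministic dynamics is iterated until a state repeats, which identifies the attractor:

\begin{verbatim}
import random

# A minimal Kauffman-style random Boolean network: N genes, each reading
# K randomly chosen inputs through a random truth table.  We iterate the
# deterministic dynamics from a random state until a state repeats,
# which identifies the attractor cycle.  Sizes are toy values.
random.seed(0)
N, K = 12, 2

inputs = [random.sample(range(N), K) for _ in range(N)]            # wiring
tables = [[random.randint(0, 1) for _ in range(2 ** K)] for _ in range(N)]

def step(state):
    return tuple(
        tables[i][sum(state[j] << b for b, j in enumerate(inputs[i]))]
        for i in range(N)
    )

state = tuple(random.randint(0, 1) for _ in range(N))
seen, t = {}, 0
while state not in seen:           # finite state space: a cycle is certain
    seen[state] = t
    state = step(state)
    t += 1
print(f"transient length: {seen[state]}, attractor cycle length: {t - seen[state]}")
\end{verbatim}

Since the state space is finite and the dynamics deterministic, an attractor is always reached; what interests Kauffman is how the number and length of such attractors scale with N and K, and whether this says anything about real genomes. The warning above concerns precisely this last step.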

A similar warning should be kept in mind when dealing with the results of mathematical and computational modeling of living processes, such as development, to which experimental rules, reduced to a manageable number, are applied. In models of this kind the aim generally pursued is to produce computer simulations mimicking existing biological processes as closely as possible. To take a well known and relatively early example, Turing's reaction-diffusion equations have been used to simulate the self-organisation of oscillators mimicking the segmentation process in Drosophila larval development. Simulation experiments led in this case to patterns very close to those really observed in Drosophila, based on the non linear interactions of four morphogenetic substances over gradients. Now, this kind of model can certainly be of great use to experimentalists, but only if they take the stimulating suggestions coming from it, namely the existence of specific morphogen gradients, and plan experiments to test the hypotheses built into the model. Similarly, the dynamical patterns obtained with computer simulations of genetic networks are potentially of very high relevance for the prediction of the effects of recombinant DNA operations, but only if used as hypothesis-building tools in experimental planning. Falsification of model-derived hypotheses would lead to a better understanding of the processes studied, along with possible changes in the model itself derived from the new experimental data. An ideal experiment-model spiral would then be built, of great value on both the theoretical and the applied side. It should be noted that, unfortunately, few such spirals seem to be occurring at present.
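As a concrete taste of this class of models, here is a minimal one-dimensional reaction-diffusion sketch in Python. We stress that it uses the well-known Gray-Scott kinetics with standard textbook parameter values, not the actual four-morphogen Drosophila model discussed above, which is far more elaborate; a localized seed self-organises into a stationary spatial pattern:

\begin{verbatim}
import numpy as np

# Minimal 1D reaction-diffusion (Gray-Scott kinetics, explicit Euler).
# A localized perturbation self-organises into a stripe-like pattern.
n, steps = 256, 20000
Du, Dv, F, k = 0.16, 0.08, 0.035, 0.060    # standard pattern-forming values
u = np.ones(n)
v = np.zeros(n)
u[n // 2 - 5 : n // 2 + 5] = 0.5           # seed in the middle
v[n // 2 - 5 : n // 2 + 5] = 0.5

def laplacian(a):                          # periodic boundaries
    return np.roll(a, 1) + np.roll(a, -1) - 2 * a

for _ in range(steps):
    uvv = u * v * v
    u += Du * laplacian(u) - uvv + F * (1 - u)
    v += Dv * laplacian(v) + uvv - (F + k) * v

# crude ASCII rendering of the final concentration profile of v
print("".join(" .:*#"[max(0, min(4, int(x * 10)))] for x in v[::4]))
\end{verbatim}

Exactly as argued above, the pattern produced here is a property of the equations; whether it says anything about a real organism can only be decided by experiments designed around the morphogen gradients the model postulates.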

Modeling

Let us now try to say a few more words about the new tools for biological modeling that have arisen in the last years. We may group them into two large classes, whose overlap is not empty: models inspired by the statistical mechanics of disordered systems, and dynamical systems models.

To the first class belongs, for instance, the paradigm of the ``spin glass'', which has been used to understand the functioning of the brain. [13] Starting from physical systems where interactions among ``spins'' are random (these are really existing physical systems, e.g. low concentration alloys of a magnetic metal (Mn) with a noble non-magnetic metal (Au)), one discovers [14] a very rich pattern of free energy minima (stable states). This has suggested that something similar could happen for neurons (the ``spins'') connected in the brain network (synapses mimic the random magnetic interaction), the many-minima landscape being the explanation of the many states of associative memory. This paradigm has also been applied to the immune system, where the number of receptors (the repertoire) is significantly smaller than the number of neurons; an attempt has been made to explain immunological memory (vaccination). Although some of the results obtained are suggestive of biological behavior, the necessary direction of research in this area leads to the realization that biological systems are not fully random; a ``structure'' must necessarily be imposed onto the ``disorder''. Neurons are not all the same, synapses have asymmetric interactions, cells and molecules in the immune system have different functions, etc. Of course, the strategy of disordered systems is still viable, but much more specialized models are necessary to catch the complexity of the biological system.
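A minimal sketch of the spin-glass-inspired memory just described (a standard Hopfield-type construction; the sizes and the corruption level are our toy choices) shows the mechanism in a few lines of Python: patterns are stored in a symmetric Hebbian coupling matrix, and a corrupted cue relaxes into the nearest free energy minimum, i.e. the stored pattern:

\begin{verbatim}
import numpy as np

# A minimal Hopfield-type associative memory: patterns are stored in a
# symmetric Hebbian coupling matrix, and a corrupted cue relaxes into
# the nearest stored pattern (a free-energy minimum).
rng = np.random.default_rng(1)
N, P = 100, 5                         # neurons ("spins"), stored patterns
patterns = rng.choice([-1, 1], size=(P, N))

J = (patterns.T @ patterns) / N       # Hebb rule
np.fill_diagonal(J, 0.0)              # no self-coupling

def recall(cue, sweeps=10):
    s = cue.copy()
    for _ in range(sweeps):           # asynchronous single-spin updates
        for i in rng.permutation(N):
            s[i] = 1 if J[i] @ s >= 0 else -1
    return s

cue = patterns[0].copy()              # corrupt 20% of the first pattern
flip = rng.choice(N, size=20, replace=False)
cue[flip] *= -1
out = recall(cue)
print("overlap with stored pattern:", (out @ patterns[0]) / N)
\end{verbatim}

With 5 patterns over 100 ``spins'' the load is far below the known capacity of such networks, so recall is essentially perfect; it is when structure (asymmetric synapses, specialized cell types) is added that the simple picture breaks down, as discussed above.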

The hope of the ``physical'' description would have been to find ``universal'' features, not dependent on the specific realization of the system. There are perhaps ``universal'' behaviors: the fact that a system is able to remember (and also forget) is perhaps shared by many possible neuronal organizations; many possible efficient immune systems exist theoretically (some of them much simpler than the human immune system), which reproduce the same pattern of primary and secondary response in vaccination. Natural selection has chosen a few among these many possibilities; the hope of modeling would be at the same time to understand why (though this is perhaps too difficult and deep a scientific/philosophical question), but mainly to be able to predict and control the consequences of given changes in the components and interactions within the system (a typical question being: will the system still function afterwards?).

The class of dynamical systems modeling is wider, but it contains less developed paradigms than the former. First of all, ``dynamical systems'' is a name which includes many different areas of research (a nice textbook with plenty of biological applications is Kaplan and Glass (1995) [15]). Among them (some still in their infancy) are ordinary differential equations (ODE), maps (M), partial differential equations (PDE), coupled maps (CM) and cellular automata (CA).

When modeling spatially homogeneous systems (e.g. well stirred chemical reactions), or when one is interested only in averaged global populations, one resorts to ODE and M. If one must instead reproduce spatial behavior, it is necessary to build models with PDE, CM or CA.

Another main difference is that ODE and PDE are ``continuous'', in the sense that space-time and the variable of interest (e.g. a given species concentration) vary continuously (between position 0 and position 1 there is position 0.5, etc.). M, CM and CA are instead representatives of ``discrete'' modeling. A map is an application from one point in time (or in space), labelled with index $n$, to the next point $n+1$. If one counts the number of flies, $x_{n+1}$, in an ecosystem in summer, it will depend on the number of flies in the previous summer, $x_n$, being proportional to the number of laid eggs. Discretization in this case arises naturally because of the time-discrete observation, but one may even think that the discretization is intrinsic, being related to the life cycle of a fly, which has a given period (egg-adult-egg) of one year. This question will always be present in biological modeling: is the process ``discrete'' or ``continuous''? The answer to this question, and the choice in one sense or the other, may strongly influence the results. It is well known that while the continuous logistic ODE leads to a trivial dynamical evolution, the discrete logistic map leads to ``complex'' dynamical behavior, including ``chaotic motion''. The latter is a kind of behavior which was hypothesized by Henri Poincaré at the beginning of this century and was then pervasively discovered in many physical, chemical and biological systems, also thanks to the use of computers, only in recent years (the last twenty). It means that the motion is aperiodic, it never repeats itself exactly, and it is sensitive to small perturbations: infinitesimal perturbations grow exponentially in time. Some have taken this last property as proof that no prediction is possible, but in our opinion this is wrong: many predictions are still possible, no longer about precise values, but about ``statistical'' behaviors (probability measures). Therefore, modeling logistic growth with a map or with an ODE may produce quite remarkable differences in the outcomes. Which is the correct modeling? Of course the one which reproduces experimental data and is able to make falsifiable predictions; one should not seriously consider any model before this happens. This is why a serious interdisciplinary effort is unavoidable. Chaotic motion is not present only in time-discrete systems: it also appears for autonomous ODE if the number of variables considered is at least three, and in non-autonomous ODE if there are at least two variables.
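The contrast just described can be checked in a few lines of Python (a minimal sketch; the initial values are arbitrary choices of ours). The continuous logistic equation has the closed-form solution $x(t)=1/(1+(1/x_0-1)e^{-rt})$, which relaxes monotonically to the carrying capacity, while the logistic map $x_{n+1}=r x_n(1-x_n)$ at $r=4$ is chaotic: two trajectories started $10^{-10}$ apart become completely decorrelated within a few tens of steps:

\begin{verbatim}
from math import exp

def logistic_ode(t, x0=0.01, r=1.0):
    """Exact solution of dx/dt = r x (1-x): relaxes monotonically to 1."""
    return 1.0 / (1.0 + (1.0 / x0 - 1.0) * exp(-r * t))

def logistic_map(x0, r=4.0, steps=60):
    """Iterate x_{n+1} = r x_n (1 - x_n); chaotic at r = 4."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = logistic_map(0.2)
b = logistic_map(0.2 + 1e-10)          # tiny initial perturbation
for n in (0, 10, 20, 30, 40, 50):
    print(f"n={n:2d}  |a-b| = {abs(a[n] - b[n]):.3e}")
print("ODE at t=20:", round(logistic_ode(20.0), 6))
\end{verbatim}

The printed gap grows exponentially (roughly doubling per step, the Lyapunov exponent of the map being $\ln 2$) until it saturates at order one, which is exactly the sensitivity to perturbations described above.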

The extreme level of discreteness is reached by CA modeling, [16] where even the state variable is discrete (while in CM space-time is discrete but the state variable is continuous). This might seem far from natural reality, but in some cases it proves a useful property. In immune system simulations, for instance, the repertoire is finite and a receptor can be represented as a bit string of finite length $l$, the number of states being then $2^l$. In the Celada-Seiden model [17] recognition happens if there is a bit match (complementarity in the bits). How could this be modeled better by a continuum approach? Continuum ``shape space'' approaches have anyway been attempted. [18] Chaos is also present in CA, in the form of a property known as ``damage spreading'' (the variation of one bit produces a catastrophic effect), but no significant consequences for biology have so far been found.
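A minimal sketch of this kind of bit-string recognition follows; the string length, repertoire size and affinity threshold below are our toy choices, not the actual values of the Celada-Seiden model:

\begin{verbatim}
import random

# Bit-string recognition in the spirit of the Celada-Seiden model: a
# receptor "recognises" an antigen when the two l-bit strings are
# sufficiently complementary.  All sizes here are toy choices.
l = 8                                   # bit-string length: 2**l shapes
THRESHOLD = 6                           # minimum complementary bits

def affinity(receptor, antigen):
    """Number of complementary bit positions (XOR, then popcount)."""
    return bin((receptor ^ antigen) & ((1 << l) - 1)).count("1")

random.seed(2)
repertoire = random.sample(range(2 ** l), 50)
antigen = 0b10110010

matches = [r for r in repertoire if affinity(r, antigen) >= THRESHOLD]
print(f"{len(matches)} of {len(repertoire)} receptors recognise the antigen")
for r in matches:
    print(f"  receptor {r:08b}  complementary bits: {affinity(r, antigen)}")
\end{verbatim}

The finiteness of the repertoire is built directly into the representation, which is precisely what makes the CA-style description natural here.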

When simulating on a computer systems of coupled ODE (coupled oscillators) or PDE, one discovers interesting coherent motions (a lumped concentration which moves coherently); these have been called ``solitons'' or, more recently, in the context of spatially discrete systems, ``breathers'' (because they really breathe with a given pace); see the lectures by M. Peyrard in this book. In biological systems one knows plenty of coherent processes which survive in the underlying noisy environment. One is the DNA transcription process, which could be started by such a coherent motion when the protein transcription complex attaches to the double helix. Coherence is somewhat opposed to chaos, but both are effects of the nonlinearity of the underlying equations of motion; the two features can also coexist, creating coherent chaotic objects.
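One of the simplest lattices supporting such localized coherent excitations is the discrete nonlinear Schrödinger chain; the sketch below is ours, with arbitrary parameters, and is not the DNA model discussed by M. Peyrard. It integrates the chain from a single-site excitation and shows the self-trapping effect: a weak excitation disperses along the chain, while a strong one stays localized:

\begin{verbatim}
import numpy as np

# Discrete nonlinear Schrodinger chain:
#   i dpsi_n/dt = -(psi_{n+1} + psi_{n-1}) + |psi_n|^2 psi_n.
# For a strong enough single-site excitation the energy self-traps
# and the excitation stays localized instead of dispersing.
N, dt, steps = 64, 0.002, 20000

def rhs(psi):
    coupling = np.roll(psi, 1) + np.roll(psi, -1)    # periodic chain
    return -1j * (-coupling + np.abs(psi) ** 2 * psi)

def evolve(amplitude):
    psi = np.zeros(N, dtype=complex)
    psi[N // 2] = amplitude                          # single-site excitation
    for _ in range(steps):                           # classical RK4
        k1 = rhs(psi)
        k2 = rhs(psi + 0.5 * dt * k1)
        k3 = rhs(psi + 0.5 * dt * k2)
        k4 = rhs(psi + dt * k3)
        psi += (dt / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
    # fraction of the norm still on the initially excited site
    return abs(psi[N // 2]) ** 2 / np.sum(np.abs(psi) ** 2)

print("weak excitation  (A=1): fraction on site =", round(evolve(1.0), 3))
print("strong excitation (A=4): fraction on site =", round(evolve(4.0), 3))
\end{verbatim}

The localized object in the strong case oscillates in amplitude while staying put, which is the ``breathing'' referred to above.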

A word which has not appeared yet is ``fractals'', and you may be surprised, because it is often associated with the word ``chaos'' even in the popular press. In fact, in this book you will not find many applications of fractal geometry to biological modeling, although many exist (most interesting are those resulting from fractal growth; see e.g. Vicsek (1992) [19] and Barabasi and Stanley (1995) [20]). Just to quote a few: the structure of the lung has been shown to be fractal, and this can influence breathing rate; long range correlations in DNA sequences have been analysed in terms of fractional Brownian motion; bacterial colonies (E. coli and B. subtilis) display fractal growth over a nutrient, etc.
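For completeness, here is a minimal box-counting sketch in Python (ours; point counts and box sizes are arbitrary choices): a Sierpinski triangle is generated with the ``chaos game'' and its fractal dimension is estimated from the scaling $N(\epsilon) \sim \epsilon^{-D}$. The estimate should come close to the exact value $\log 3 / \log 2 \simeq 1.585$:

\begin{verbatim}
import numpy as np

# Generate a Sierpinski triangle with the "chaos game" and estimate
# its fractal dimension by box counting: N(eps) ~ eps^(-D).
rng = np.random.default_rng(3)
vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])

p = np.array([0.1, 0.1])
points = np.empty((200000, 2))
for i in range(len(points)):           # jump halfway to a random vertex
    p = (p + vertices[rng.integers(3)]) / 2
    points[i] = p

sizes = [2.0 ** -k for k in range(2, 8)]
counts = []
for eps in sizes:                      # count occupied boxes of side eps
    boxes = set(map(tuple, np.floor(points / eps).astype(int)))
    counts.append(len(boxes))

# least-squares slope of log N(eps) against log(1/eps)
D = np.polyfit(np.log(1 / np.array(sizes)), np.log(counts), 1)[0]
print("estimated box-counting dimension:", round(D, 3))   # ~1.585
\end{verbatim}

The same box-counting procedure, applied to a digitised lung cast or to a bacterial colony boundary, is how the fractal claims quoted above are actually quantified.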

To summarise, new tools for modeling have become available quite recently. Will they form the body of ``Mathematical biotechnology''? This book tries to give a projection into this space, which has many more dimensions.

References

1
E. Schrödinger, What is life? The physical aspect of the living cell, Cambridge Univ. Press, Cambridge (1944).

2
F.H.C. Crick, Nature 227, 561 (1970).

3
M. Buiatti and P. Bogani, Euphytica 85, 136 (1995).

4
M. Buiatti, Mathematical biology, a critical assessment, Il Nuovo Cimento D, in press (1997).

5
R. Dawkins, The selfish gene, Oxford Univ. Press, Oxford (1976).

6
P. Liò and S. Ruffo, Searching for genomic constraints, Il Nuovo Cimento D, in press (1997).

7
M. Buiatti Jr., P. Allegrini and P. Grigolini, in preparation.

8
D. Rollo, Phenotypes, Chapman & Hall, London (1995).

9
S. Wright, Evolution and the genetics of populations III, Experimental results and evolutionary deductions, The Univ. of Chicago Press, Chicago (1977).

10
C. H. Waddington, Paradigm for an evolutionary process, in Towards a theoretical biology 2: Sketches, Edinburgh Univ. Press (1972).

11
S. Kauffman, At home in the Universe, Viking, London (1995).

12
S. Kauffman and S. Johnsen, Coevolution to the edge of chaos, in Artificial life II, Addison Wesley Publ. Co., Redwood City, California (1992).

13
D.J. Amit, Modeling brain function (Cambridge Univ. Press, Cambridge 1989).

14
M. Mezard, G. Parisi and M. Virasoro, Spin glasses and beyond (World Scientific, Singapore 1987).

15
D. Kaplan and L. Glass, Understanding nonlinear dynamics (Springer, Berlin 1995).

16
S. Wolfram, Rev. Mod. Phys. 55, 601 (1983).

17
F. Celada and P.E. Seiden, J. Theor. Biol. 158, 329 (1992).

18
A.S. Perelson and G.F. Oster, J. Theor. Biol. 81, 645 (1979).

19
T. Vicsek, Fractal growth phenomena, World Scientific, Singapore (1992).

20
A.L. Barabasi and H.E. Stanley, Fractal concepts in surface growth (Cambridge Univ. Press, Cambridge 1995).
