
The source code of life

Bioinformatics, right? Using computational methods from the tool-belt of informatics to study biology is a thing. And it’s not just some thing. It’s one of the things nowadays, as it lets us process ever-increasing amounts of data, helps us find otherwise unnoticeable connections in the big mess that is nature and can even simulate biological processes. As early as my first exposure to biological research (an internship at the Institute for Virology in Cologne), my supervisor advised me to focus on bioinformatics in my studies because she was aware of its importance. I am glad I followed her suggestion, since it opened many doors for me and paved my way towards synthetic biology. Now that I have arrived in this discipline, it seems that there is another, more conceptual analogy between biology and informatics: the programming of DNA.

Many synthetic biologists like to describe their field as the engineering of nature: you take a biological system or parts thereof and manipulate them to do something new. And because most of life is based on DNA, by changing the DNA we can change the entire system. Luckily, the forces of evolution have made nature quite inventive, so synthetic biologists can take advantage of a range of existing genetic commands to rewrite nature’s source code.

For the sake of this analogy, let’s just say that programming is about running through a set of instructions, made up of rules and conditions, to arrive at some result (I honestly don’t know if that is in any way correct but it is how I think of programming, so let’s roll with it anyway!). In the same way we could imagine that a cell is largely about expressing (aka producing) each of its genes if and only if the conditions call for that gene to be expressed. The result would, for instance, be a protein like keratin, the primary component of your hair, being produced only in specific regions of your body at specific times. DNA, as the set of instructions upon which the cell acts, can therefore be likened to source code in programming.

There are a few basic commands which you can find in pretty much any programming language: IF statements, AND/OR/NOT operators and FOR/WHILE loops. Such commands can then be used to build more complex conditional structures, which determine what to output based on a specific input.

IF (see deer then bark)

IF statements lead down one path if a condition is met and down another if it is not. So let’s imagine we were programming a game where you walk your pet dog Fenton in the park. We could add a condition wherein the dog will start barking IF it sees a deer:

IF there is a deer
    then Fenton will run and bark.
Otherwise
    Fenton will stay calm.

In biology, we encounter a similar structure, called the promoter, which controls the transcription of genes. Transcription is the first step of gene expression, where a processable copy of the gene is created. Loosely speaking, the promoter is the region preceding a gene, which contains some control elements for its production. The control elements then act together to decide whether the gene should be transcribed or not. Hence, a promoter, as the sum of all its contained elements, can be viewed as a switch for the production of its gene. Of course, nothing in biology is truly binary. There is always some ‘leaky’ expression of a gene, regardless of the state of its promoter. This is because, while you have discrete states in programming, biology is always continuous, making it hard to clearly distinguish between states. Generally, however, the promoter is pretty good at its job. So in analogy to programming, a gene would only be expressed IF its promoter allows it:

IF promoter commands transcription
    then express keratin.
Otherwise
    do not express keratin.
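
The promoter-as-switch idea, leak included, can be sketched in a few lines of Python. This is only a toy: the function name `keratin_transcription_rate` and both rate values are invented for illustration and are not measured quantities.

```python
def keratin_transcription_rate(promoter_on: bool,
                               full_rate: float = 100.0,
                               leak_rate: float = 2.0) -> float:
    """Return a transcription rate in arbitrary, made-up units."""
    if promoter_on:
        # The promoter commands transcription: express keratin.
        return full_rate
    # Otherwise the gene is 'off', but a little leaky expression remains.
    return leak_rate


print(keratin_transcription_rate(True))
print(keratin_transcription_rate(False))
```

The only difference from the Fenton example is the `Otherwise` branch: instead of a clean zero, the cell still returns a small non-zero rate.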

AND/OR/NOT: the gates to logic

When programming your game about Fenton, you will, at some point, want to build more complex logical structures. For instance, it is not enough for a deer to be present in the game; Fenton also needs to see it before he starts barking. Further, the same impolite reaction could be elicited by spotting a squirrel. Lastly, maybe Fenton is a bit out of shape, so he will only start running at his ‘playmates’ if he is not exhausted. All of these statements could be broken down into a series of IFs, but to make life easier we can replace them with a couple of operators, such as:

IF (there is a deer OR a squirrel) AND Fenton sees it AND is NOT exhausted
    then he will make trouble.
Otherwise
    he will be a good boy.
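
In an actual programming language, this reads almost the same. Here is a minimal Python sketch, where the function and parameter names are made up for this example:

```python
def fenton_makes_trouble(deer: bool, squirrel: bool,
                         sees_it: bool, exhausted: bool) -> bool:
    """True if Fenton runs off barking, False if he is a good boy."""
    return (deer or squirrel) and sees_it and not exhausted


# A deer is around and Fenton has spotted it, but he is worn out.
print(fenton_makes_trouble(deer=True, squirrel=False,
                           sees_it=True, exhausted=True))
```

The brackets matter: without them, Python evaluates `and` before `or`, so a deer alone would be enough to set Fenton off.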

It seems like biology also came up with that logical structure. We already covered that the state of a promoter is dictated by its control elements. Most of these elements are short DNA sequences that are recognised and bound by proteins called ‘transcription factors’, or TFs for short. With their presence, these proteins either promote or reduce the transcription of the gene, depending on the type of TF and the binding location. The logical structure of promoters comes about because multiple different TFs can influence the state of the very same promoter.

As I hinted at above, TFs can either increase or decrease gene expression, which yields the IF and IF NOT options. Furthermore, while some TFs act autonomously, many others need to interact with a partner to become active or inactive, depending on their mode of activation. In this way, nature has learnt to implement AND operators, which require two things to be true at the same time. Lastly, there is also lots of redundancy in biology, which increases the stability of a system. There often are different TFs with very similar effects, so the one OR the other can be deployed without much difference between the two. As is so often the case, there is much more going on in biology, like two different types of TF competing for the same binding site. In such a case, criteria like the concentration (abundance per volume) of each TF and their respective binding strengths will decide who wins and gets to bind.

So let us imagine that the production of keratin, and with it hair growth, is controlled by the following TFs: there is a hypothetical activator called handsomin, which is necessary to produce keratin. At the same time, the dominant repressor stresserin will inhibit keratin production, if present. In such a case, an activator can be added (let’s call it alpecin), which binds more strongly to the binding site of stresserin and thus outcompetes it, even if stresserin is present:

IF (handsomin AND NOT stresserin) OR alpecin
    then express keratin
Otherwise
    do not express keratin

But wait, there is more! It turns out that transcription factors are a logical goldmine! Remember how the binding probability of a specific type of TF depends on its concentration? The story actually goes on! Many types of TFs bind cooperatively, meaning that once one copy of the protein binds at a site, it will recruit more copies of itself to also bind and further increase transcription. This cooperativity has a fascinating effect on the gene expression curve: depending on the degree of cooperativity, we now approach a more binary scenario where there is either very little or full expression. Of course, things are still never discrete in biology. A cell is still an analogue machine with a continuous transition between states. But through complex interactions and switches, the transition phase can be minimized to mimic discrete-like behaviour. In short: we get a sensor for the amount of TF, which only switches on after a certain concentration threshold is passed. So what does this say about my… eehr someone’s hair?

IF (#handsomin > 500 AND NOT #stresserin > 9000) OR #alpecin > 30
    then express keratin
Otherwise
    do not express keratin

(Any similarities with brand names in the real world are entirely coincidental. I further feel obliged to clarify that this genetic network is completely made up and real-life decisions should not be based on it. Lastly, I take offence at the implication that my stress level feels the need to dominate my handsomeness like that.)
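
How cooperativity sharpens the response can be made a little more concrete. A common way to describe such switch-like curves is the Hill function, where a larger exponent stands in for stronger cooperativity. The sketch below is an illustration under that assumption; the function name, the threshold of 500 (borrowed from the made-up handsomin rule above) and the exponents 1 and 8 are arbitrary choices, not properties of any real TF.

```python
def hill_activation(concentration: float, threshold: float, n: float) -> float:
    """Fraction of full expression (0 to 1) from a Hill function.

    threshold: concentration giving half-maximal expression.
    n: Hill exponent; here a stand-in for the degree of cooperativity.
    """
    if concentration <= 0:
        return 0.0
    return concentration ** n / (threshold ** n + concentration ** n)


# Compare a non-cooperative TF (n=1) with a strongly cooperative one (n=8).
for amount in (250, 500, 1000):
    gradual = hill_activation(amount, threshold=500, n=1)
    switch_like = hill_activation(amount, threshold=500, n=8)
    print(f"{amount:5d}  n=1: {gradual:.3f}  n=8: {switch_like:.3f}")
```

With the exponent at 1 the expression creeps up gradually, while with the exponent at 8 it stays near zero below the threshold and near full expression above it. That is the continuous curve behind the discrete-looking `> 500` in the pseudocode.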

Let’s talk about loops FOR a WHILE…

Loops are another powerful tool in your programming belt: they execute code repeatedly WHILE a condition is met or FOR a specific number of times. Such periodicity can also be replicated in biology using the tools we established above. A beautiful example of an unconditional loop with striking punctuality is the circadian clock, a molecular clock inside your cells built from a bunch of IF, AND, OR and NOT commands, which more or less tracks the time of day regardless of your exposure to environmental cues.

In the case of the WHILE loop, we essentially want the cyclic execution of a set of instructions under the condition of an IF statement. So let’s start with a gene that is activated by a made-up TF called activin, thereby implementing the IF condition. We could envision that this gene in turn encodes another made-up TF called cyclin, which represses its own transcription. If we additionally take into consideration that proteins are degraded in a cell over time, we can see that this might yield cyclic behaviour: activin will initiate the production of cyclin. As more cyclin is produced, it represses its own production (maybe we can design cyclin so that it only starts having an effect beyond a certain concentration, using cooperativity) while the existing cyclin is degraded by the cell. Once cyclin drops below its threshold, it will stop repressing its own production, more cyclin can be produced again and so forth. The loop will finally be terminated by taking away the activin.

WHILE activin is present
    produce a wave of cyclin.
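
To get a feeling for this loop, here is a toy simulation in Python. It is a sketch under heavy assumptions: time moves in discrete steps (which acts as a built-in delay), repression is a hard on/off threshold, and all numbers (`threshold`, `production`, `decay`) are invented. Whether a real circuit of this kind would oscillate depends on details this toy ignores.

```python
def simulate_cyclin(activin_duration: int,
                    threshold: float = 100.0,
                    production: float = 30.0,
                    decay: float = 0.2) -> list:
    """Return the cyclin level after each time step while activin is present."""
    cyclin = 0.0
    levels = []
    step = 0
    while step < activin_duration:      # WHILE activin is present
        if cyclin < threshold:          # cyclin is not yet repressing itself
            cyclin += production        # transcribe and produce more cyclin
        cyclin -= decay * cyclin        # the cell degrades part of the cyclin
        levels.append(cyclin)
        step += 1
    return levels                       # activin is gone: the loop ends


for level in simulate_cyclin(20):
    print(f"{level:6.1f} " + "#" * int(level / 4))
```

With these made-up numbers, cyclin first builds up and then keeps bouncing just above and below its threshold for as long as activin is around.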

FOR loops are a little different, because they require a memory component that keeps track of which iteration in the series of loops is currently being executed. We could try to use protein counts for that by running through something like a WHILE loop and instead putting the activin under the control of accumulating cyclin. But maybe that wouldn’t work reliably, because of the uncertain location of the proteins in space and the imperfect repeatability that comes with the natural variation we see in all living things. There are, however, biological mechanisms that do track such processes more accurately. One well-known example is telomeres. Telomeres are protective caps on the ends of our chromosomes. As our cells divide, the entire genome is copied in an awe-inspiringly delicate and exact process. However, with each cell division, the telomeres get a little shorter. It is like a molecular countdown, tracking how old the lineage of a cell is and can get. This means that it is not impossible for FOR loops to exist or be implementable in a biological cell.
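
The telomere countdown maps quite naturally onto a FOR loop with a counter. The following Python sketch treats telomere length as a plain integer in arbitrary ‘units’; the function name and the numbers are assumptions for illustration and say nothing about real telomere lengths.

```python
def replicative_countdown(telomere_units: int,
                          loss_per_division: int = 1) -> list:
    """Return the remaining telomere units after each possible division."""
    max_divisions = telomere_units // loss_per_division
    remaining = []
    for _ in range(max_divisions):           # FOR a fixed number of divisions
        telomere_units -= loss_per_division  # each division shortens the caps
        remaining.append(telomere_units)
    return remaining


print(replicative_countdown(5))
```

The memory component is the telomere itself: the cell does not need to remember which round it is in, because the remaining length already encodes it.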

Building and testing

Great, so now that we have all that we need to do some DNA coding, we can start replenishing my totally intact and not at all receding hairline! So I sit down, I design the necessary networks, I genetically modify myself and suddenly I am completely bald instead. Hm. It would seem like there are some mistakes in the code… In our other example, a programming mistake might lead to Fenton running away from the deer instead of chasing it.

When encountering such an issue while programming, I would go back to the source code and add a bunch of output commands at crucial steps to figure out at what point the script fails. Now, biology is far more complex than a programming environment, because there is much more (well, life) going on besides your small network. Because of this, in synthetic biology one would most often not try to directly modify complex phenotypical traits (meaning things that you can see on an organism) but rather test a network step by step using marker proteins. These can, for instance, be fluorescent proteins, which let you directly measure the output of a circuit. As an example, we could have cyclin also activate the gene of a fluorescent protein, whereby we should see oscillating light as a control signal. In a way, these markers have precisely the same functionality as the output commands in programming: a detectable feedback from your circuit which tells you something about its status. This shows that we do not just find basic programming building blocks in DNA, we can and do also test our genetic code in a similar fashion!
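
On the programming side, this debugging habit could look like the sketch below, which reuses the made-up hair network and its invented thresholds from earlier. The `report` parameter is a hypothetical stand-in for a marker protein: by default it prints, but any function that accepts a message will do.

```python
def express_keratin_traced(handsomin: int, stresserin: int, alpecin: int,
                           report=print) -> bool:
    """Evaluate the made-up keratin network, reporting each intermediate step."""
    handsomin_active = handsomin > 500
    report(f"handsomin active: {handsomin_active}")

    stresserin_active = stresserin > 9000
    report(f"stresserin active: {stresserin_active}")

    alpecin_active = alpecin > 30
    report(f"alpecin active: {alpecin_active}")

    keratin = (handsomin_active and not stresserin_active) or alpecin_active
    report(f"keratin expressed: {keratin}")
    return keratin


# Lots of handsomin, but even more stress and no alpecin.
express_keratin_traced(handsomin=600, stresserin=9500, alpecin=0)
```

Each `report` call plays the role of one fluorescent marker: it does not change what the circuit does, it only makes an intermediate state visible.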

What is the meaning of all of this?

I do wonder what the reason for these similarities is. How is it possible to draw such analogies between two otherwise seemingly unrelated fields? Clearly, they did not derive from each other, because the development of informatics came before and was completely unrelated to the discovery of gene regulatory networks. But maybe there is some conceptual reason for this, nonetheless.

As I so cheekily claimed in the beginning, programming is just about running through clearly laid out instructions to arrive at a result, which is much the same in genetics. DNA is a molecular way of storing and executing information and in this manner can be simplified to mere logic. And logic, turning it around again, is itself the tool of programming. So maybe the two are so similar because they are based on the same concept of simple gated logic for fulfilling their purpose.

Now let us keep playing the why-game (I loved playing that game as a child… must have been annoying for everyone else): Why are they based on the same concept, then? Hm. So there is a reason for the term ‘programming language’, right? In a way, a language could be described as a limited number of expressions (words, gestures etc.), which we can flexibly combine to create meaning. When programming, we can also use our limited number of basic expressions to create whatever protocol we want. It is easy to give new meaning by simply shuffling the expressions around a bit. If we want to add new functionality to a piece of code, sometimes all that must be done is to add another IF statement. So source code is quickly adaptable. Adaptability, in turn, might ring a bell in connection with biology. It is of the utmost importance for organisms to be able to adapt to a potentially changing environment. Even the most well-adapted but unchanging species would start being outcompeted if its environment changed in the slightest. So, there must always be the possibility for change in DNA. Of course, there is: it’s called mutation. But tiny changes in an otherwise massive, continuous, rigid thing would hardly make a difference. Instead it would be handy if a tiny change on the level of DNA could lead to a significant change in its translation to meaning.

DNA must be prone to variation. The modular concept of expressions in genetics might have arisen because through it the change of one base (A, T, C or G) can completely rewrite the meaning of an entire sequence and determine whether a gene is expressed or not, just by destroying a TF binding site or creating a new one. Through this, random mutations can help a species adapt quickly to a changing environment, rather than having to develop completely new genes from scratch. The existing ones can simply be reshuffled instead. Maybe DNA even worked differently a couple of billion years ago and this modular approach just turned out to work best.

Is that the reason why DNA is like it is? Maybe. To be honest, I don’t know. And what does this mean for programming in return? If this system was the ‘best’ naturally evolved one, would this mean that we are doing programming ‘right’? Or is there even any other way of programming? I guess the first question gets answered with a ‘not necessarily’. After all, nature has no way of looking at its source code and arbitrarily letting Fenton run away from the deer instead, just because it feels like it. As for the second one, I really don’t understand enough about how computers work to confidently answer it. But is each ‘front-end’ programming language not compiled into a computer-readable binary… thing? In that case the answer would be yes: just write binary. But that is tedious, so let’s stick with modular. Maybe we can even regard the pure DNA as binary code, while the top-level gene regulatory structure can be likened to the top-level programming language of choice. How intriguing!

So now, last question: What about biohackers…?
