I have this (cousin, sister, friend, daughter, fill in the blank) who is working on her PhD at Yale in something called computational biology, whatever that means. I am curious about what she does, but it seems hard and incomprehensible. I never much liked science, and computers intimidate me.
She works close to seven days a week, and yet she seems to like what she does. But what exactly does she do? Wonder no more. Below is a somewhat whimsical three-part series: What does Tara do?
What does Tara do? A three part series
Part I What is Computational Biology?
Part II What is Machine Learning?
Part III But what do you do?
Part I What is Computational Biology?
In its most basic form, computational biology uses computers to answer questions in biology. Like what? Imagine that I gave you a completely random string of letters: ZATOLLEHYATHT. If you look at this string very quickly, you might not notice that backwards and in the middle of all this nonsense is the word HELLO. Now imagine that instead of just a 13-letter string, I gave you a string of 3 billion letters. Do you think you could find HELLO if it were written backwards or, worse yet, if we had stuck extra letters in between ("HAELLIO") or left letters out ("HLLO")? The problem of identifying HELLO starts to become more difficult.
But wait, you might ask, who cares? Why do we want to find HELLO at all? Many of you have heard about DNA from crime specials or paternity suits. The human body is composed of cells, and nearly every cell of the body contains a copy of the same DNA. DNA is really just a four-letter code, ATCG, but the arrangement of these letters provides the instructions underlying (no exaggeration) all of life. What does this mean, though? What color your hair is, how tall you are, and so on are determined in large part by your particular arrangement of all these ATCGs, about 3 billion of them for humans. Now finding HELLO might mean that your eyes are brown or, more seriously, that you have a predisposition for a certain disease.
Ok, you say, I understand why it is important now, but how do you do that? This is where computers come into play. In the next section, I'll explain how we use something called Machine Learning to teach computers how to identify things like HELLO even when they are missing letters or have extra letters added in. We can also use computers to decide what HELLO means. And what if I just told you there is a word in the string, but I didn't tell you which one? How would you find it? If you were an alien from another planet, how would you learn that HELLO is a greeting between two earthlings and not a type of shoe or some other object?
I'll talk about both these problems in Part II.
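For the curious, here is a minimal sketch in Python of the HELLO hunt described above. It is only an illustration, not what real DNA software does: the function names (edit_distance, find_fuzzy) and the max_edits setting are made up for this example. It slides a window along the string, compares each piece against the word written forwards and backwards, and counts how many single-letter changes separate the two.

```python
def edit_distance(a: str, b: str) -> int:
    """Fewest single-letter insertions, deletions, or swaps turning a into b."""
    prev = list(range(len(b) + 1))
    for i, letter_a in enumerate(a, start=1):
        curr = [i]
        for j, letter_b in enumerate(b, start=1):
            cost = 0 if letter_a == letter_b else 1
            curr.append(min(prev[j] + 1,          # drop a letter from a
                            curr[j - 1] + 1,      # add a letter to a
                            prev[j - 1] + cost))  # keep or swap a letter
        prev = curr
    return prev[-1]


def find_fuzzy(text: str, word: str, max_edits: int = 1):
    """List (start, piece, direction, edits) for every near-match of word."""
    hits = []
    for direction, target in (("forward", word), ("backward", word[::-1])):
        # Extra or missing letters change the length, so try nearby lengths too.
        shortest = max(1, len(word) - max_edits)
        for length in range(shortest, len(word) + max_edits + 1):
            for start in range(len(text) - length + 1):
                piece = text[start:start + length]
                edits = edit_distance(piece, target)
                if edits <= max_edits:
                    hits.append((start, piece, direction, edits))
    return hits


if __name__ == "__main__":
    print(find_fuzzy("ZATOLLEHYATHT", "HELLO", max_edits=0))
    print(edit_distance("HAELLIO", "HELLO"), edit_distance("HLLO", "HELLO"))
```

With max_edits=0 the only hit in ZATOLLEHYATHT is the backwards OLLEH starting at position 3 (counting from 0). "HAELLIO" is two edits away from HELLO and "HLLO" is one. This brute-force approach checks every window, so it would be far too slow for 3 billion letters; it is here only to make the idea concrete.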
Part II What is Machine Learning?
Last time, we talked a little bit about the importance of DNA. We left off with our alien trying to determine the meaning or functional significance of the word HELLO. We're going to leave him in suspense just a little longer. In this part, I'm going to explain what Machine Learning is (no biology today), and then in the third part I'm going to tie everything back together and give you a real-world example of something I've actually done (really, I do something).
Spent any time with a little kid lately? A bird nosedives for a speck of crumb. You point and say "Bird!" "Bird!" says the little kid, again and again. Later that day, he looks up at the sky: "Bird, bird." "No," you patiently reply, "airplane" (or helicopter, or Superman; you get the picture). Basically, anything in the sky is a bird to the little kid, but as time goes on he learns how to distinguish between a bird and an airplane. But how? For one thing, you keep giving him feedback (in Machine Learning speak, we would say you are training him). No, not a bird, a plane. He may start to notice things about the different objects. Airplanes do not have beaks. They make a different sound. They are often much higher in the sky. There are lots of attributes that airplanes and birds just do not have in common. As more and more of these attributes become obvious, the child will have less and less of a problem telling them apart. Basically, machine learning is similar to the explanations you would give a small child, except that the information is encoded mathematically.
What do I mean by this? Ok, let's formulate the little kid problem a little differently. Let's say I wanted to give a computer a picture and have it answer the age-old question: Is it a bird, a plane, or Superman? To simplify things further, let's just deal with the difference between a plane and Superman. What are some things we can measure? We could look at relative size. Superman is about 6 feet tall. A plane is closer to 40 feet. Superman's "wingspan" is also significantly smaller than the plane's. We could also add the feature hasRedCape. If hasRedCape is true, we would assign a 1, and if hasRedCape is false, we would assign a 0. Ok, so now we have three features: overall size, wingspan, and hasRedCape. We could now give the computer hundreds of pictures of planes and Superman, each one already labeled with the right answer. The computer could measure the values of these three features in each picture, learn which values go with which label, and then label a new picture plane or Superman accordingly.
Of course, one can make this situation much more complicated. If the picture shows the alter ego Clark Kent, sans red cape, would our learner do better than Lois Lane at divining that it is really Superman, or at the very least not an airplane? In this case, we might want to learn a weight for each of our features. That is, if hasRedCape is false but the wingspan and overall size fit the Superman description, we might still want to label the picture Superman, so we would give a lower weight to hasRedCape. We could also count the number of times the computer makes a mistake, that is, labels a picture a plane when it is Superman or vice versa. We could then go back to our features and maybe add some new ones in order to improve the performance of our learner's labeling.
But what does any of this have to do with biology? In the third part of What does Tara do?, I'll give a slightly more realistic example of when we use Machine Learning.
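For anyone who wants to see the Superman problem as actual code, here is a minimal sketch in Python. Everything specific in it is an assumption made up for illustration: the five example pictures and their measurements, the 20-foot cutoff for "small," and the short list of candidate cape weights. It is a toy weighted vote, not a real Machine Learning library. Each feature casts a vote for Superman, each vote has a weight, and we "learn" the cape weight by trying a few values and keeping the one that makes the fewest mistakes.

```python
from dataclasses import dataclass


@dataclass
class Picture:
    size_ft: float
    wingspan_ft: float
    has_red_cape: int  # 1 if true, 0 if false
    label: str         # the right answer: "Superman" or "plane"


# Invented training pictures (not real measurements).
PICTURES = [
    Picture(6, 6, 1, "Superman"),
    Picture(6, 6, 1, "Superman"),
    Picture(6, 6, 0, "Superman"),  # Clark Kent: no cape
    Picture(40, 36, 0, "plane"),
    Picture(38, 34, 0, "plane"),
]

SMALL_FT = 20  # assumed cutoff between "Superman-sized" and "plane-sized"


def predict(pic: Picture, w_size: float, w_span: float, w_cape: float) -> str:
    """Weighted vote: Superman if more than half the total weight says so."""
    score = (w_size * (pic.size_ft < SMALL_FT)
             + w_span * (pic.wingspan_ft < SMALL_FT)
             + w_cape * pic.has_red_cape)
    return "Superman" if score > (w_size + w_span + w_cape) / 2 else "plane"


def count_mistakes(pictures, weights) -> int:
    """How many pictures get a label that differs from the right answer."""
    return sum(predict(p, *weights) != p.label for p in pictures)


def learn_cape_weight(pictures, candidates=(3.0, 2.0, 1.0, 0.5)) -> float:
    """Keep the first candidate cape weight that makes the fewest mistakes."""
    return min(candidates,
               key=lambda w: count_mistakes(pictures, (1.0, 1.0, w)))


if __name__ == "__main__":
    print(count_mistakes(PICTURES, (1.0, 1.0, 3.0)))  # cape counts heavily
    print(learn_cape_weight(PICTURES))
```

With a heavy cape weight of 3.0, the learner makes one mistake on these pictures: it calls Clark Kent a plane. Lowering the cape weight to 1.0 lets size and wingspan outvote the missing cape, and the mistake count drops to zero.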
Part III But what do you do?
At this point, you may be saying to yourself: Tara, you have now talked about aliens and Superman. I've been patient. I think I have the first few concepts, but I still don't see what you do. In this section, I am going to take the concepts from the first two parts and tell you about something I actually did.
My lab does research on a particularly nasty microbe called Acinetobacter baumannii. Why are we studying baumannii? A. baumannii is really only a problem if your immune system is weak. If you are relatively healthy, you can fight it off, but one place where there are a large number of people with weak immune systems is hospitals, and unfortunately, A. baumannii is ideally suited for living in hospitals. It has an amazing ability to live for very long periods of time with no water or food, and it particularly likes the rubber tubing, catheters, and plastics that are ubiquitous in hospitals. A. baumannii causes a bunch of things, including pneumonia, septicemia, all kinds of respiratory infections, and meningitis, just to name a few. It is endemic in hospitals treating soldiers returning from Iraq and Afghanistan. The other problem with baumannii is that it is multidrug resistant. In other words, most types cannot be killed by the vast majority of antibiotics, forcing doctors to use much older drugs with brutal side effects, including in one case kidney failure. So all in all it is a pretty nasty microbe. However, people know very little about it. Why does it make us sick, or in other words, what makes baumannii pathogenic? Last year, my lab and I sequenced A. baumannii in the hopes of learning more about both its overall lifestyle and what in it could be making us so sick.
In the first part of this series, we talked about DNA. We said that it is made up of four letters, and that the specific order of those letters decides things like what color our hair will be. Well, sequencing is the experiment that we do in order to read out those letters, so at the end of the day I was sent an extremely large file containing about 4 million letters. My screen at that moment looked something like this:
ATCCATCGATCGATCGATCGATGCTACGTACGTAGCTACTACGCTAGCTA
GCATCATCGTAGATGCTACTAGCTAGCTACGTAGCTAGCTAGCTTACGTCT
AATTATAATTATATTACGCGGCGCGGCGCGGCGGCGCGCGGCGGCGCGCG
Four million of them, and now this is finally what Tara does. I use the same kinds of techniques we discussed for identifying Superman in a picture to look at these letters and try to decide what in them is making baumannii pathogenic. But how? Well, suppose you were reading this document and all of a sudden you saw the following: "Hola senorita, donde esta hotel?" You might say, huh, that doesn't look like English, and if you were Spanish, you might say it doesn't look like Spanish either (but that is beside the point). What made that sentence look so strange to you? After all, it is composed of the same letters as our alphabet, so why isn't it English? Those aren't English words because the ordering of the letters is wrong, even though the alphabet is exactly the same.
This, in a nutshell, is what I do. I use statistics to find regions that look like another language, and those statistics are based on the same kinds of models. Just to be a little clearer, suppose I were assigned the problem of building a learner that classified words as either English or Spanish. I would first get a Spanish dictionary and an English dictionary, and I would do things like count the number of times a particular set of letters, say "est", occurs at the beginning of a word. Then I could assign some probability to each of these pieces. Then, when given a new set of words, even if a word itself did not appear in my dictionaries, I could make an educated guess based on how frequently those letters appeared together in each dictionary.
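For the curious, the dictionary idea can be sketched in a few lines of Python. This is a toy under loud assumptions, not the method we used on baumannii: the two eight-word "dictionaries" are invented and far too small, I count pairs of neighboring letters (with ^ and $ marking the start and end of a word) rather than longer pieces like "est", and I add one to every count so that a pair never seen before does not get a probability of zero.

```python
import math
from collections import Counter

# Tiny invented "dictionaries"; real ones would have thousands of words.
ENGLISH_WORDS = ["the", "there", "where", "hello",
                 "hotel", "house", "mother", "water"]
SPANISH_WORDS = ["hola", "donde", "esta", "senorita",
                 "casa", "madre", "agua", "amigo"]


def letter_pairs(word: str):
    """Split a word into neighboring letter pairs, marking start and end."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def count_pairs(words) -> Counter:
    """Count how often each letter pair occurs across a dictionary."""
    return Counter(pair for word in words for pair in letter_pairs(word))


def log_probability(word: str, counts: Counter, vocabulary_size: int) -> float:
    """Sum of log probabilities of the word's pairs, with add-one smoothing."""
    total = sum(counts.values())
    return sum(math.log((counts[pair] + 1) / (total + vocabulary_size))
               for pair in letter_pairs(word))


def classify(word: str, models: dict) -> str:
    """Pick the language whose dictionary makes the word look most likely."""
    vocabulary_size = len(set().union(*models.values()))
    return max(models,
               key=lambda name: log_probability(word, models[name],
                                                vocabulary_size))


MODELS = {"English": count_pairs(ENGLISH_WORDS),
          "Spanish": count_pairs(SPANISH_WORDS)}

if __name__ == "__main__":
    for new_word in ("weather", "estado"):
        print(new_word, classify(new_word, MODELS))
```

Neither "weather" nor "estado" appears in either dictionary, yet the sketch labels the first English and the second Spanish, because their letter pairs (th, he, er versus es, st, ta) are common in one dictionary and absent from the other. That is the educated guess in action.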
We did this for baumannii, and we identified a number of regions that looked strange. Another person in my lab then took a baumannii, removed only those odd stretches of letters (you can pretty much copy and paste with microbes), and measured whether or not it still killed worms (obviously, we can't do this study on people). He found that in several cases, if you deleted those regions, the microbe lost its ability to kill worms. Now, we can't go around cutting pieces out of baumannii in patients, but what you can do is try to design drugs that do the same thing. Of course, this type of research takes a really long time, and finding those regions is basically where my part of the story ends. So here is just one example of the type of things that I do. I hope it makes sense and was somewhat entertaining, and maybe now you are at least partially convinced that I do do something.
After I put this together, our baumannii work was highlighted on NOVA Science Now. Check it out here.