Wednesday, March 2, 2016

March 03, 2016 at 02:35AM

Today I Learned: 1) The story of the race between the Human Genome Project (HGP, funded by the NIH, making it a public project) and Celera (a private company run by Craig Venter) is quite fascinating, and it's a really fun story to pick sides on and argue about who "won". You can really go back and forth on it: HGP published the first assembled draft something like a few days before Celera, making it technically the first project to publish; but HGP had an eight-year head start on Celera, so Celera was really the spiritual victor (though it should be noted that even the public project finished under budget and two years ahead of schedule), especially because the public project's draft wasn't particularly complete; but Celera was trying to essentially patent the important parts of the human genome and sell access to them, while HGP was going to make everything open access (and did, which massively accelerated research in many fields of biology), so HGP was the ethical victor; but Celera developed super-powerful sequencing techniques that totally revolutionized genomic sequencing (and which allowed HGP to finish on time...), *and* they completed the project for about a tenth of the money the public project spent, so Celera was clearly the technological and scientific victor. Today I learned a new piece of the story. To explain it, I'll need to explain genomic assembly, which requires explaining shotgun sequencing, the technique Celera championed and the HGP eventually adopted to finish the project. If you already know how shotgun sequencing works and what assembly is, you can skip the next few paragraphs.

Before shotgun sequencing, there was Sanger sequencing. Sanger sequencing works like this: you amplify a stretch of DNA with PCR (a technique for producing lots of DNA from a targeted template sequence), but you also randomly incorporate special fluorescent versions of C, G, T, and A that terminate amplification. That gives you a bunch of fragments of the target DNA of different lengths, with a fluorescent nucleotide on the end of each, and the color depends on which letter that nucleotide is. You then run this mix on a gel, which separates the fragments by length. You look at the color of each band, and that tells you which nucleotide is at the end of the fragment of that length, so you can read the sequence right down the line of bands. (Technically that isn't *quite* how Sanger sequencing was done in the early days of the project, but it's close enough.) The problem with this technique is that you have to already know the sequence at one end of the thing you're trying to sequence for the PCR to work, and preferably both ends -- which is a problem when the whole point is to figure out what the sequence is. What the HGP did was sequence from some known bit of the human genome, then use the new sequence to amplify the next bit, then the next bit, and a few hundred nucleotides at a time they could eventually sequence it all.

Shotgun sequencing uses the same idea, except that before the Sanger reaction to add fluorescent nucleotides, you shear the DNA you want to sequence into bazillions of little fragments of a few hundred bases each, which you can then clone into a ton of plasmids. You amplify each plasmid in bacteria, then sequence it. Since you already know the sequence of the plasmid backbone, each fragment is trivial to amplify. This way, you don't have to bootstrap your way through the genome -- you can just sequence thousands or millions of fragments at once.

Combined with new machines for performing the sequencing reactions more efficiently, shotgun sequencing allowed Celera to churn through the genome ridiculously faster than the HGP could with its step-by-step Sanger approach. There's a bit of a catch, though. With the step-by-step approach, you get a bunch of slightly overlapping sequences, in order, which you can easily string together into a whole genome. With shotgun sequencing, you get millions of *randomly selected* fragments, with no information about where they came from other than "somewhere in the genome". With enough random fragments, you can find overlapping fragments and stitch them together into entire chromosomes... but that's a computationally difficult puzzle to solve. This is the problem of assembly.
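To make that concrete, here's a toy sketch in Python. Everything about it is made up for illustration (the function names, the tiny random "genome", the perfectly error-free fixed-length reads), and it is emphatically not how Celera's or the public project's assemblers actually worked: it just shears a pretend genome into reads taken at random positions, then greedily glues together whichever two sequences share the longest suffix-prefix overlap, over and over, until nothing overlaps anymore.

import random

def shotgun_reads(genome, n_reads, read_len):
    # Simulate shearing: pull fixed-length "reads" from random spots in the genome.
    starts = [random.randint(0, len(genome) - read_len) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]

def overlap(a, b, min_len):
    # Longest suffix of a that exactly matches a prefix of b (0 if shorter than min_len).
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def greedy_assemble(reads, min_overlap):
    # Repeatedly merge whichever two sequences overlap the most, until nothing overlaps.
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:  # no remaining overlaps big enough to trust
            break
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return max(contigs, key=len)

random.seed(1)
genome = "".join(random.choice("ACGT") for _ in range(200))  # a made-up 200-base "genome"
reads = shotgun_reads(genome, n_reads=40, read_len=50)       # roughly 10x coverage, no errors
contig = greedy_assemble(reads, min_overlap=10)
print(len(genome), len(contig), contig in genome)
# The longest contig is almost always a correct chunk of the genome, and usually
# nearly all of it (the very ends only get covered if a read happens to start there).

Even at this toy scale you can see why the real thing was hard: the greedy loop compares every fragment against every other fragment, which blows up quickly when forty reads becomes tens of millions, and real reads come with sequencing errors and a genome full of repeated sequence.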
This is where UCSC comes in (that's the University of California Santa Cruz). David Haussler, a professor at UCSC, was tasked with assembling the fragments from the public genome project. I think there were others involved, but Haussler is the only one I know of, and he ended up being the most critical. Anyway, one of Haussler's graduate students, an industrious fellow named Jim Kent, was particularly interested in the assembly project and came up with an assembly algorithm, then wrote up an assembler in about four weeks of intense coding. During that time, he also wrote the world's first interactive web-based genome browser, which was released with the public genome project draft as the UCSC Genome Browser, probably the most popular single tool in the world for retrieving data about the human genome (with the possible exception of BLAST?). Kent and Haussler deployed the assembler on a hastily assembled cluster of about 100 Pentium III desktops. Now, this was 2000, and Pentium IIIs weren't as laughable as they are now, but for reference, consider that a typical Pentium III is slower and less powerful than your average cell phone processor today. Admittedly, there were 100 of them... but Celera was using rather more powerful computers, better optimized for the kind of string comparisons required for assembly, and they had THOUSANDS of them. Haussler and Kent released their first assembled genome three days before Celera.

2) DNA sequences, especially for populations, are really more naturally viewed as graphs than as strings. The standard way to represent a DNA sequence (say, the human genome) and its variants (say, YOUR genome in all of its quirky variation) is to have a canonical reference sequence against which you align all sequences, with individual variants noted against the reference. For humans, there is one public reference genome, which is now on its 38th version, against which essentially all human genome mapping is performed. If you were to get sequenced, your genome would be mapped against this reference, and you would essentially get a list of differences between your genome and the reference. Instead, you can think of a set of sequences as a graph of connected short sequences. This is way easier to explain with pictures, but I run a text-only TIL, gosh darnit, so here goes. Consider these four example genomes of a very, very small organism: 1) ATTGTTTTGCGCA 2) ATTGTTATGCGCA 3) ATTGTTTTGCGCA 4) ATTGTTTCCCCCTGCGCA. The traditional way to think about these genomes would be to build a "consensus sequence" that best represents the common elements of all of them, which would probably be "ATTGTTTTGCGCA". Genomes 1 and 3 are exactly the reference sequence; genome 2 has a mutation in the 7th nucleotide; genome 4 has an insertion, "CCCCC", in between the 7th and 8th nucleotides. Alternatively, you could represent these sequences collectively as a graph where each node is a short sequence and a directed edge between nodes means "that sequence follows this one", with a weight proportional to how often those two sequences go together. So the graph would start with the node "ATTGTT"; that node would point to each of the nodes "T", "A", and "TCCCCC" (the reference T plus the inserted CCCCC), with weights of 2, 1, and 1, respectively; and each of those nodes would point to the sequence "TGCGCA". If you trace any path through the graph from beginning to end, it spells out a sequence. You do lose some information in this representation, like which mutations are associated with which other mutations, but there are advantages too. For one thing, it's a natural way to visualize the variation within a population. It's also helpful for assembly -- variations from the reference sequence mess up the assembly process, but with a graph representation you know what the most common variations are and can account for them. The graph representation also gives you a clear idea of which regions of the genome are the most variable, and which ones tend to stay constant.
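Here's that toy example as a few lines of Python. Again, this is only a sketch: the helper names are made up, and the shared-prefix/shared-suffix trick only works because this example has a single variable region (real graphs of this kind are built from alignments of many genomes, with many branching points). It splits the four genomes into the piece they all start with, the piece they all end with, and whatever sits in between, then counts how often each middle shows up.

from collections import Counter

genomes = [
    "ATTGTTTTGCGCA",       # 1: same as the consensus
    "ATTGTTATGCGCA",       # 2: mutation at the 7th nucleotide
    "ATTGTTTTGCGCA",       # 3: same as the consensus
    "ATTGTTTCCCCCTGCGCA",  # 4: CCCCC inserted after the 7th nucleotide
]

def shared_prefix(seqs):
    # Longest prefix that every sequence starts with.
    n = 0
    while n < min(len(s) for s in seqs) and len({s[n] for s in seqs}) == 1:
        n += 1
    return seqs[0][:n]

prefix = shared_prefix(genomes)
suffix = shared_prefix([s[::-1] for s in genomes])[::-1]  # shared suffix, found by reversing

# Count the variable "middle" piece of each genome between the shared ends.
middles = Counter(s[len(prefix):len(s) - len(suffix)] for s in genomes)

# The graph: the prefix node points to each middle (edge weight = how many genomes
# use that middle), and every middle points on to the suffix node.
graph = {prefix: dict(middles)}
for mid in middles:
    graph[mid] = {suffix: middles[mid]}

print(graph)
# {'ATTGTT': {'T': 2, 'A': 1, 'TCCCCC': 1}, 'T': {'TGCGCA': 2},
#  'A': {'TGCGCA': 1}, 'TCCCCC': {'TGCGCA': 1}}

# Any path from the prefix node to the suffix node spells out one of the genomes:
for mid, count in middles.items():
    print(prefix + mid + suffix, "-- seen in", count, "genome(s)")

Tracing prefix, then a middle, then the suffix spells out each genome, and the counts on the middles are exactly the population-level variant frequencies described above.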
3) A liquid biopsy is when you take blood from a cancer patient, filter out all the cells, and sequence all of the bits of DNA you can find floating around. It's not a very efficient way of sequencing a cancer, but it *is* minimally invasive and easy to perform, and apparently there's usually enough cancer DNA floating around to get a good idea of what cancer it is.
