Saturday, June 28, 2008

The DNA Network

WALL-E and Paul Ehrlich [Tomorrow's Table]

Posted: 28 Jun 2008 06:54 PM CDT

Professor Paul Ehrlich and the makers of the new film WALL-E have a lot in common. They see the need to "shift our dominance away from malignant" (please see Stewart Brand's review of the lecture below) and toward the benign, the living and the beautiful.

They also both discuss the ways in which humans will need to evolve socially and culturally to make this happen.

In the movie WALL-E, humans have fled an earth that is no longer habitable. There are great scenes of stuff (calling to mind the George Carlin riff) piled as high as the abandoned skyscrapers nearby. In Ehrlich's world, we are well on the way to an equally dismal fate. With the warming of the earth's atmosphere and the accumulation of toxic chemicals in our soils, we are destroying humanity's life support system: the climate, food and water we need to survive.

Ehrlich's lecture, entitled "The Dominant Animal: Human Evolution and the Environment," was one of a series of Seminars About Long Term Thinking hosted by the Long Now Foundation. The fabulous Long Now Foundation aims to provide a counterpoint to today's "faster/cheaper" mindset and promote "slower/better" thinking.

Today I recycled some stuff and hung my clothes out to dry.


Here is founding board member Stewart Brand's summary of the event:

"To track how humans became Earth's dominant animal, Ehrlich began with a photo of a tarsier in a tree. The little primate had a predator's binocular vision and an insect-grabber's fingers. When (possibly) climate change drove some primates out of the trees, they developed a two-legged stance to get around on the savannah. Then the brain swoll up, and the first major dominance tool emerged---language with syntax.

About 2.5 million years ago, the beginnings of human culture became evident with stone tools. "We don't have a Darwin of cultural evolution yet," said Ehrlich. He defined cultural evolution as everything we pass on in a non-genetic way. Human culture developed slowly at first---the stone tools changed little from millennium to millennium---but then it accelerated. There was a big leap about 50,000 years ago, after which culture took over human evolution---our brain hasn't changed in size since then.

With agriculture's food surplus, specialization took off. Inuits that Ehrlich once studied had a culture that was totally shared; everyone knew how everything was done. In high civilization, no one grasps a millionth of current cultural knowledge. Physicists can't build a TV set.

Writing freed culture from the limitations of memory, and burning old solar energy (coal and oil) empowered vast global population growth. Our dominance was complete. Ehrlich regretted that we followed the competitive practices of chimps instead of bonobos, who resolve all their disputes with genital rubbing.

"The human economy is a wholly-owned subsidiary of the Earth's natural systems," said Ehrlich, and when our dominance threatens the ecosystem services we depend on, we have to understand the workings of the cultural evolution that gave us that dominance. The current two greatest threats that Ehrlich sees are climate change (10 percent chance of civilization ending, and rising) and chemical toxification of the biosphere. "Every cubic centimeter of the biosphere has been modified by human activity."

The main climate threat he sees is not rising sea levels ("You can outwalk that one") but the melting of the snowpack that drives the world's hydraulic civilizations---California agriculture totally dependent on the Sierra snowpack, the Andes running much of Latin America, the Himalayan snows in charge of Southeast Asia. With climate in flux, Ehrlich said, we may be facing a millennium of constant change. Already we see the outbreak of resource wars over water and oil.

He noted with satisfaction that human population appears to be leveling off at 9 to 10 billion in this century, though the remaining increase puts enormous pressure on ecosystem services. He's not worried about depopulation problems, because "population can always be increased by unskilled laborers who love their work."

The major hopeful element he sees is that cultural evolution can move very quickly at times. The Soviet Union disappeared overnight. The liberation of women is a profound cultural shift that occurs in decades. Facing dire times, we need to understand how cultural evolution works in order to shift our dominance away from malignant and toward the benign.

In the Q & A, Ehrlich described work he's been doing on cultural evolution. He and a graduate student in her fifties at Stanford have been studying the progress of Polynesian canoe practices as their population fanned out across the Pacific. What was more conserved, they wondered, practical matters or decoration? Did the shape of a canoe paddle change constantly, driven by the survival pressure of greater efficiency, or did the carving and paint on the paddles change more, driven by the cultural need of each group to distinguish itself from the others?

Practical won. Once a paddle shape proved really effective, it became a cultural constant."

--Stewart Brand

Health News in Second Life: Health 2.0? [ScienceRoll]

Posted: 28 Jun 2008 02:18 PM CDT


This week, I organized a session for 23andMe in Second Life and it turned out to be quite an interesting event. Jen McCabe Gorman gave Second Life a try and listed some reasons to use it for Health 2.0:

  • A hell of a lot cheaper than traveling
  • More interesting Q&A
  • Flying in Second Life was a blast
  • Credibility is established beforehand
  • Speed

Medical & Psychological Sites In Second Life:

OMG!!! Canadians SERIOUSLY Worried About High Gas Prices!!! [Bayblab]

Posted: 28 Jun 2008 02:16 PM CDT


Yup, that's right: according to a Globe and Mail survey, the high price of gas is now Canadians' number one concern. Last year's top issue, the environment ("like OMG, Al Gore figured out that New Orleans is sinking because of global warming!!!"), has been relegated to third place. According to Gloria Galloway, this is like total bad news for the Liberals' carbon tax proposal:

"This shift could make it more difficult for Liberal Leader Stéphane Dion to sell the carbon-tax plan he unveiled last week, a complex scheme to cut greenhouse-gas emissions that will be the cornerstone of his party's platform in the next election."

What might be useful here is if people divert their attention away from gas price signs and political documentaries for a second and do some thinking. Hmmmm....maybe these issues are somehow linked? Maybe the larger issue is economic dependence on burning fossil fuels? Hey look - every time we use a litre, there's a litre less left in the world!!! And hey, pollutants are released at the EXACT SAME TIME!!!

Politicians - what Canadians need is a visionary plan that engineers us an escape from the sinking ship that is the fossil fuel economy and builds us a new one. Neither pansy-ass environmentalist appeasement measures nor short-term gasoline price-fixing measures are going to cut it.

More skin cancer and pigmentation genetic variants [Yann Klimentidis' Weblog]

Posted: 28 Jun 2008 11:58 AM CDT

There are three studies on skin pigmentation and cancer in the latest Nature Genetics, and a commentary piece (see below) synthesizing the findings. One study is a whole genome association study for skin cancer, finding two SNPs on chromosome 20. Another is a candidate gene study looking at the association of variants in TYR and ASIP with skin cancer risk. Finally, another study finds an association of "new" variants in ASIP and TPCN2 with pigmentation among Icelanders and Dutch individuals. I think that the most interesting, but not necessarily surprising, thing here is how the associations with skin cancer risk differ between populations.

Shedding light on skin cancer

Paul D P Pharoah
Nature Genetics 40, 817 - 818 (2008)
Abstract: Pigmentation traits are known risk factors for skin cancer. Now, three new studies provide insights into the genetic factors underlying these effects, and the results reveal a surprisingly complex picture of the relationship between pigmentation traits and disease risk.

ASIP and TYR pigmentation variants associate with cutaneous melanoma and basal cell carcinoma
Daniel F Gudbjartsson, Patrick Sulem et al.
Nature Genetics 40, 886 - 891 (2008)
Abstract: Fair color increases risk of cutaneous melanoma (CM) and basal cell carcinoma (BCC). Recent genome-wide association studies have identified variants affecting hair, eye and skin pigmentation in Europeans. Here, we assess the effect of these variants on risk of CM and BCC in European populations comprising 2,121 individuals with CM, 2,163 individuals with BCC and over 40,000 controls. A haplotype near ASIP, known to affect a similar spectrum of pigmentation traits as MC1R variants, conferred significant risk of CM (odds ratio (OR) = 1.45, P = 1.2 × 10^-9) and BCC (OR = 1.33, P = 1.2 × 10^-6). The variant in TYR encoding the R402Q amino acid substitution, previously shown to affect eye color and tanning response, conferred risk of CM (OR = 1.21, P = 2.8 × 10^-7) and BCC (OR = 1.14, P = 6.1 × 10^-4). An eye color variant in TYRP1 was associated with risk of CM (OR = 1.15, P = 4.6 × 10^-4). The association of all three variants is robust with respect to adjustment for the effect of pigmentation.
Common sequence variants on 20q11.22 confer melanoma susceptibility
Kevin M Brown, et al.
Nature Genetics 40, 838 - 840 (2008)
Abstract: We conducted a genome-wide association pooling study for cutaneous melanoma and performed validation in samples totaling 2,019 cases and 2,105 controls. Using pooling, we identified a new melanoma risk locus on chromosome 20 (rs910873 and rs1885120), with replication in two further samples (combined P < 1 × 10^-15). The per allele odds ratio was 1.75 (1.53, 2.01), with evidence for stronger association in early-onset cases.
Two newly identified genetic determinants of pigmentation in Europeans
Patrick Sulem, et al.
Nature Genetics 40, 835 - 837 (2008)
Abstract: We present results from a genome-wide association study for variants associated with human pigmentation characteristics among 5,130 Icelanders, with follow-up analyses in 2,116 Icelanders and 1,214 Dutch individuals. Two coding variants in TPCN2 are associated with hair color, and a variant at the ASIP locus shows strong association with skin sensitivity to sun, freckling and red hair, phenotypic characteristics similar to those affected by well-known mutations in MC1R.

How much data is a human genome? It depends what you store. [Genetic Future]

Posted: 28 Jun 2008 10:07 AM CDT

Andrew from Think Gene has finally prompted me to write a post I've been working on sporadically for a month or so. The question is pretty simple: in the not-too-distant future you and I will have had our entire genomes sequenced (except perhaps those of you in California) - so how much hard drive space will our genomes take up?

Andrew calculates that a genome will take up about two CDs' worth of data, but that's only if it's stored in one possible format (a flat text file storing each chromosome). There are other ways you might want to keep your genome, depending on your purpose.

The executive summary
For those who don't want to read through the tedious details that follow, here's the take-home message: if you want to store the data in a raw format for later re-analysis, you're looking at between 2 and 30 terabytes (one terabyte = 1,000 gigabytes). A much more user-friendly format, though, would be as a simple text file containing each and every DNA letter in your genome, which would take up around 1.5 gigabytes (small enough for three genomes to fit on a standard data DVD). Finally, if you have very accurate sequence data and access to a high-quality reference genome you can squeeze your sequence down to around 20 megabytes.

The details
For the first two formats I'll assume that someone is having their genome sequenced using one of today's cutting-edge sequencing technologies, the Illumina 1G platform. The 1G platform and its rivals, Roche's 454 and Applied Biosystems' SOLiD, are the instruments currently being used to sequence over 1,000 individuals for the international 1000 Genomes Project; if you were to have your genome sequenced right now, it would almost certainly be using one of these platforms.

The Illumina technology basically sequences DNA as a huge number of short (36-letter) fragments, called reads. Because read lengths are so short and the system has a fairly high error rate, assembling an entire genome would require what's called 30x coverage - which basically means each base in the genome is sequenced an average of 30 times.
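
To put some numbers on that, here's a back-of-the-envelope sketch in Python, using only the figures above (a ~3 billion base haploid genome, 36-base reads, 30x coverage); the variable names are just for illustration:

```python
# Rough read-count estimate for 30x coverage with 36-base reads.
# Figures taken from the post; names are illustrative only.
GENOME_BASES = 3e9      # haploid human genome, ~3 billion bases
READ_LENGTH = 36        # bases per Illumina read
COVERAGE = 30           # each base sequenced ~30 times on average

total_bases_sequenced = GENOME_BASES * COVERAGE       # ~9e10 bases
n_reads = total_bases_sequenced / READ_LENGTH         # ~2.5 billion reads

print(f"{total_bases_sequenced:.1e} bases sequenced in {n_reads:.1e} reads")
```

That's roughly 2.5 billion short reads per genome, which is why the raw data volumes below get so large.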

Once the reads have been generated, they are assembled into a complete genome with the help of a universal reference genome, the sequence created by the Human Genome Project from a mixture of DNA from several individuals. Even with this high level of coverage there is still considerable uncertainty involved in the process of re-assembling the genomes from very short fragments, and both the algorithms used to perform this assembly and the reference genome are being constantly improved. Thus for the moment there may be some advantage in storing your data in a raw format, so that in a few months' time you can take advantage of better software and a more complete reference genome to reconstruct your own sequence more fully.

For the third and fourth formats, I've moved into the future: basically, I'm assuming that we now have access to affordable sequencing technology that can generate extremely long and accurate reads from a single molecule of DNA. That would allow you to reconstruct your entire genome - both sets of chromosomes, one from your mother and one from your father - with very high confidence. In that case you no longer need to store your raw data, and we can instead start thinking about the most efficient possible way to keep your entire genome on disk.

Note that in what follows, for the sake of simplicity I am ignoring the effects of data compression algorithms. It's likely that you could shrink down these data-sets (especially the image files) by quite a bit using even straightforward compression.

Anyway, enough background. Let's get started.

1. For hard-core data junkies only: raw image files
To put it very simply, the Illumina 1G platform sequences your DNA by first smashing it up into millions of fragments, binding those fragments to a surface, and then feeding in a series of As, Cs, Gs and Ts. As these bases are incorporated into the DNA fragments they set off flashes of light that are captured by a very high-resolution camera, resulting in a series of pretty coloured images such as the one on the left (which is actually a montage of four images, one for each base). Each of those spots represents a separate fragment of DNA, captured at the moment that a single base (A, C, G or T, each labelled with a different colour) is read from that fragment. By building up a series of these images the machine accumulates the sequence of the first ~36 bases of those fragments in the image, after which the sequence quality starts to drop off.

Almost as soon as these images are generated they are fed into an algorithm that processes them, creating a set of text files containing the sequence of each of the fragments. The image files are then almost always discarded. Why are they discarded? Because, as you will see in a minute, storing the raw image data from each run in even a moderate-scale sequencing facility quickly becomes prohibitively expensive - in fact, several people have suggested to me that it would be cheaper to just repeat the sequencing than to store these data long-term.

How much data? Each tile of an Illumina machine will give you accurate sequence information for around 25,000 DNA fragments. A separate image is obtained for each of the four bases, with each "snap-shot" comprising around 2 MB of data. That comes to a total of 320 bytes/base. For an entire genome with 30x coverage, that works out to around 28.8 terabytes of data. That's almost 30,000 gigabytes!
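
If you want to check that figure yourself, here's a minimal sketch (Python) that reproduces the arithmetic; the constants are the estimates quoted above, not Illumina specifications:

```python
# Raw-image storage estimate, using the figures quoted above.
FRAGMENTS_PER_TILE = 25_000   # usable DNA fragments per tile
IMAGE_BYTES = 2e6             # ~2 MB per snap-shot
IMAGES_PER_BASE = 4           # one image per nucleotide (A, C, G, T)

bytes_per_base = IMAGES_PER_BASE * IMAGE_BYTES / FRAGMENTS_PER_TILE
total_bases = 3e9 * 30        # haploid genome at 30x coverage

print(f"{bytes_per_base:.0f} bytes/base")                      # ~320
print(f"{bytes_per_base * total_bases / 1e12:.1f} terabytes")  # ~28.8
```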

Why store your genome like this? Well, either you believe that image-processing algorithms are likely to improve in the near-future, thus allowing you to squeeze a few more bases out of your data; or you have a huge bunch of data servers lying idle that you want to do something with; or you're just a data junkie. However, your actual sequence data is not readily accessible in this format, so you'd also want to be keeping at least a roughly assembled version of your genome around to examine as new information about risk variants becomes available.

2. For DIY assemblers: storing individual reads
I mentioned above that those monstrous image files are rapidly converted into text files containing the sequence of each of your ~36-base reads. The files that are generally used here are called Sequence Read Format (SRF) files, which are used to store the most likely base at each position in the read along with other associated data (such as quality scores).

How much data? It depends what sort of quality information you keep: at the high end you'd be looking at around 22 bytes/base (1.98 terabytes total) to store raw trace data, while at the low end you could just store sequence plus confidence values for around 1 byte/base (90 gigabytes total). That's starting to become feasible - you could now store your genome data on an affordable portable hard drive.
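
Again, the arithmetic is simple enough to sketch (Python, same 30x-coverage assumption as above):

```python
# Read-storage estimates for SRF-style files, figures from the post.
total_bases = 3e9 * 30                  # 30x coverage of a haploid genome

with_traces = 22 * total_bases          # ~22 bytes/base with raw trace data
minimal     = 1 * total_bases           # ~1 byte/base: base call + quality

print(f"With traces: {with_traces / 1e12:.2f} TB")   # ~1.98 TB
print(f"Minimal:     {minimal / 1e9:.0f} GB")        # ~90 GB
```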

Why store your genome like this? This is a pretty efficient way to store your raw read data while you wait for improvements in both the reference human genome sequence (which is still far from complete) and assembly algorithms. As with the previous format, though, you'd also want to store your sequence in a more readily accessible assembled sequence so you could actually use it.

3. Your genome, your whole genome, and nothing but your genome
OK, now let's gaze a few years into the future, and assume (fairly safely) that new technologies for generating accurate, long reads of single DNA molecules have become available. This means you can stitch your entire genome together very easily, allowing you to store the whole 6 billion bases of it in a text file - this is the type of data storage approach that Andrew discussed in his post. In essence, you're storing every single base in your genome as a separate character in a massive, 6 billion letter long text file.

How much data? Each DNA base can be stored in two bits of data, so your complete genome (both sets of chromosomes) tallies up to around 1.5 gigabytes of data. If you wanted to store some associated confidence scores for each base (indicating how likely it is that you sequenced that section of your DNA correctly) that might take you up to 1 byte/base, or a total of around 6 gigabytes. Either way, you could now fit your genome on a cheap USB thumb drive.
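
For the curious, two-bit packing is trivial to sketch. This is a toy example in Python, not a real file format - a practical encoder would also need to handle ambiguous bases (N) and keep any quality scores separately:

```python
# Toy two-bit packing: four bases fit in one byte.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq: str) -> bytes:
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)

print(len(pack("ACGTACGTACGT")))     # 3 bytes for 12 bases
print(6e9 * 2 / 8 / 1e9, "GB")       # 1.5 GB for a 6-billion-base diploid genome
```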

Why store your genome like this? This is probably the easiest possible format to store your genome in - it contains all the information you need to compare your sequence with someone else's, or to find out if you have that rare mutation in your GABRA3 gene that you saw on the news last night. It's everything you need and nothing you don't.

Now, most sensible people will probably be content with their 1.5 GB genome, especially as data storage becomes ever cheaper. But a few will want to squash it down further, particularly if they're storing lots of genomes (like a large sequencing facility, or your insurance company). In that case they can go one step further by taking advantage of the fact that at the DNA level all of us are very much alike.

4. The minimal genome: exploiting the universal reference sequence
I don't know who you are, but I do know that if you lined up our genomes you would find that we have a lot in common - almost all of the bases in our genomes are absolutely identical. Indeed, for any two randomly selected humans you will find, on average, that around 99.5% of their DNA is precisely the same (although the precise pieces that are different will of course differ from person to person). We can use this commonality to compress our genomes further using a clever trick: if we have a very good universal human reference sequence, we can ignore all the parts of our genome that match it, and only store the differences.

In practice, then, your personal genome sequence will comprise (1) a header, stating which reference sequence to compare to (this would ideally be a reference sequence from your own ethnic group), and (2) a set of instructions providing all the information required to transform that reference sequence into your own 6 billion base genome.

For convenience, I'll assume that the reference is stored as a single contiguous text file containing all 46 chromosomes joined together. To make your genome, a software package will start at one end of the reference sequence; each instruction will tell it to move a certain number of bases through the genome, and then change the sequence at that position in a specific way (it could either change that base to something else, insert new sequence, or delete the base entirely). In this way, sequentially running your personal instruction set will convert the reference sequence into your own genome, base by base.
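
Here's a toy version of that idea in Python. The instruction format is entirely hypothetical - just enough to show how a list of edits can rebuild a personal sequence from a reference:

```python
def apply_diffs(reference: str, instructions):
    """Rebuild a personal sequence from a reference plus edit instructions.

    Each instruction is (skip, op, payload):
      skip    - reference bases to copy unchanged before this edit
      op      - 'sub' (replace one base), 'ins' (insert payload),
                or 'del' (drop payload-many reference bases)
      payload - new base(s) for 'sub'/'ins', or a length for 'del'
    """
    out, i = [], 0
    for skip, op, payload in instructions:
        out.append(reference[i:i + skip])   # copy unchanged stretch
        i += skip
        if op == "sub":
            out.append(payload)
            i += 1
        elif op == "ins":
            out.append(payload)
        elif op == "del":
            i += payload
    out.append(reference[i:])               # unchanged tail of the reference
    return "".join(out)

reference = "ACGTACGTACGT"
print(apply_diffs(reference, [(2, "sub", "T"), (3, "ins", "GG"), (1, "del", 2)]))
# -> ACTTACGGGCGT
```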

How much data? This one is tricky because we still don't have a great idea of exactly how many differences exist between people. I'm going to make some rough guesses using this paper, which compares the genome of Craig Venter to the sequence generated by the Human Genome Project, and assume that this paper under-represents the total number of variable sites by around 30% (due to missed heterozygotes and poor coverage of repetitive areas). For a diploid genome (i.e. one containing two copies of each chromosome, one from each parent) this gives an estimate of around 6 million single base polymorphisms, about 90,000 polymorphisms changing multiple bases, and about 1.8 million insertions, deletions and inversions.

Now, the instructions for each of the single base polymorphisms can be stored as 1.5 bytes each on average (enough space to store both the distance from the previous polymorphism, and a new base). Multiple-base polymorphisms will be perhaps 2 bytes each, allowing for the storage of a few additional changed bases. Deletions might be around 3 bytes to store the length of the deleted region. Insertions will be more complicated: if they simply duplicate existing material they might only take up 3 or 4 bytes, but if they involve the insertion of brand new material they will be much larger (3 bytes, plus 1 byte for every 4 new bases inserted). From the Venter genome the average insertion size is 11.3 bases, so let's say insertions take up 7 bytes on average.

Making some more assumptions and tallying everything up I get a total data-set on the order of 20 megabytes. In other words, you could fit your genome and the sequences of about 34 of your friends onto a single CD.
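
If you want to see where the 20 megabytes comes from, here's the tally in Python. The 50/50 split between insertions and deletions is my own assumption - the post only gives a combined figure of about 1.8 million:

```python
# Tallying the diff-based genome size from the estimates above.
snps       = 6_000_000 * 1.5   # single-base polymorphisms, ~1.5 bytes each
multi_base =    90_000 * 2     # multi-base polymorphisms, ~2 bytes each
deletions  =   900_000 * 3     # assumed half of ~1.8M indels/inversions
insertions =   900_000 * 7     # ~7 bytes each on average (Venter genome)

total = snps + multi_base + deletions + insertions
print(f"{total / 1e6:.1f} MB")   # ~18 MB, i.e. on the order of 20 MB
```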

Why store your genome like this? If you have a fetish for efficiency, or if you have a whole lot of genomes you need to store, this is the system to use. Of course, it relies on having access to a universally accessible reference sequence of high quality - and you would probably want to recalculate it whenever a new and better reference became available.

Squeezing your genome even further
Want to get even more genomes per gigabyte? Here's one efficiency measure you might want to consider: use databases of genetic variation. These might be especially useful for large, common insertions (rather than storing the entire sequence of the insertion, you can simply have a pointer to a database entry that stores this sequence).

Acknowledgments: the raw numbers and calculations in this post owe a lot to David, Tom, James and Zam from the Sanger Institute, UK. Thanks guys!

Subscribe to Genetic Future.

Medicine Meets Virtual Reality 17: Organizing Committee [ScienceRoll]

Posted: 28 Jun 2008 08:57 AM CDT


This January, I attended the Medicine Meets Virtual Reality 16 conference in Long Beach, CA. It was a once-in-a-lifetime experience and I enjoyed talking about Medicine 2.0 and organizing a live Second Life medical exercise for the medical students of USC.

Now, I’m truly honoured to be on the organizing committee of next year’s event. Take a look at the list of famous scientists and innovators. I hope they will be delighted by my efforts.

Of course, I will try to come up with some medicine 2.0 related ideas.

I’ll keep you posted about the conference through the whole year.

What’s on the web? (28 June 2008) [ScienceRoll]

Posted: 28 Jun 2008 08:37 AM CDT


  • Radiopaedia beta 2.0: A fantastic, comprehensive medical wiki with nice improvements and a clear business model. Frank Gaillard did an exceptional job.

Treating hypertension decreases mortality and disability from cardiovascular disease, but most hypertension remains inadequately controlled. Objective: To determine if a new model of care that uses patient Web services, home blood pressure (BP) monitoring, and pharmacist-assisted care improves BP control.

oh no, internet blues for me [the skeptical alchemist]

Posted: 28 Jun 2008 08:29 AM CDT

Unfortunately, I am currently in a place with very limited internet access, so it will basically be impossible to post anything until September. I deeply apologize - I am writing this while away from my usual spot, having managed to find a place with fast internet.

The skeptical alchemist, and the Molecular and Cell Biology Carnival, will be back in September.

Apologies again,

steppen wolf

Sciencewear [Bayblab]

Posted: 28 Jun 2008 07:07 AM CDT



Wear your love of science on your sleeve, or illustrate 'teach the controversy' silliness. The above pictures are clothing designs from wearscience.com. For more of their graphics, go here.

A simple therapy for brain injury [Think Gene]

Posted: 28 Jun 2008 01:23 AM CDT

Severe brain injury due to blunt force trauma could be reduced by the application of a simple polymer, polyethylene glycol (PEG), mixed in sterile water and injected into the bloodstream, as reported in BioMed Central's Journal of Biological Engineering.

Andrew Koob and Richard Borgens from Purdue University, Indiana, performed experiments in rats which showed that PEG was effective in limiting damage if administered within four hours after the head injury. However, if treatment was delayed for a further two hours, the beneficial effects were lost. During the experiments, rats were injured with a falling weight and then PEG was administered fifteen minutes, two hours, four hours or six hours later. The authors then carried out a series of behavioural tests on the rats to determine the effectiveness of the PEG treatment.

According to Borgens, "These data suggest that PEG may be clinically useful to victims of traumatic brain injury if delivered as rapidly as possible after an injury." Such a treatment could feasibly be carried out at the scene of an accident, where PEG could be delivered as a component of IV fluids, thus reducing long-term brain injury.

Source: BioMed Central

Behavioral recovery from traumatic brain injury after membrane reconstruction using polyethylene glycol. Andrew O Koob, Julia M Colby, Richard B Borgens. Journal of Biological Engineering 2008, 2:9 (27 June 2008)

Josh says:

I wonder how this works. Does anyone have any ideas?

MIT probe may help untangle cells’ signaling pathways [Think Gene]

Posted: 28 Jun 2008 01:11 AM CDT

CAMBRIDGE, Mass.–MIT researchers have designed a new type of probe that can image thousands of interactions between proteins inside a living cell, giving them a tool to untangle the web of signaling pathways that control most of a cell’s activities.

“We can use this to identify new protein partners or to characterize existing interactions. We can identify what signaling pathway the proteins are involved in and during which phase of the cell cycle the interaction occurs,” said Alice Ting, the Pfizer-Laubach Career Development Assistant Professor of Chemistry and senior author of a paper describing the probe published online June 27 by the Journal of the American Chemical Society.

The new technique allows researchers to tag proteins with probes that link together like puzzle pieces if the proteins interact inside a cell. The probes are derived from an enzyme and its peptide substrate. If the probe-linked proteins interact, the enzyme and substrate also interact, which can be easily detected.

To create the probes, the researchers used the enzyme biotin ligase and its target, a 12-amino-acid peptide.

Their work is conceptually related to an approach that uses GFPs (green fluorescent proteins), which glow when activated, as probes. Half of each GFP molecule is attached to the proteins of interest, and when the proteins interact, the GFP halves fuse and glow. However, this technique results in many false positives, because the GFP halves seek each other out and bind even when the proteins they are attached to are not interacting, said Ting.

The new probes could be used to study nearly any protein-protein interaction, Ting said. The researchers tested their probes on two signaling proteins involved in suppression of the immune system, and on two proteins that play a role in cell division. They are currently using the probe to image the interaction of proteins involved in synapse growth in live neurons.

Source: Massachusetts Institute of Technology

Protein-Protein Interaction Detection in Vitro and in Cells by Proximity Biotinylation. Fernández-Suárez, M.; Chen, T. S.; Ting, A. Y. J. Am. Chem. Soc. 2008; ASAP Article; DOI: 10.1021/ja801445p

Josh says:

The split-GFP method they mention here is more commonly referred to as bimolecular fluorescence complementation (BiFC); it is related to, but distinct from, FRET (Förster/fluorescence resonance energy transfer), which uses two intact fluorophores.
