Wednesday, October 15, 2008

The DNA Network

Puget Sound scientists: be a Biotech Expo mentor and help students [Discovering Biology in a Digital World]

Posted: 15 Oct 2008 08:47 PM CDT

Every year students in the Puget Sound area gather at the Biotech Expo to celebrate the life sciences and compete for prizes. Their projects are diverse in nature, spanning categories like research, art, journalism, drama, and music, but all of the students learn about science as part of their work.

You can help a high school student learn about science by being a mentor for the Biotech Expo. It is especially helpful to students if they can bounce questions off of a real-live person who works in a scientific field.

Read the rest of this post... | Read the comments on this post...

DNA Testing Goes to Hollywood [DNA Testing Blog]

Posted: 15 Oct 2008 04:14 PM CDT

This week, two of DDC’s scientists are attending the 19th International Symposium on Human Identification in Hollywood, California. Laboratory director Dr. Michael Baird co-hosted a pre-symposium workshop Sunday on the use of non-autosomal markers in relationship analysis. Speakers presented Y-chromosomal and X-chromosomal STR analysis in addition to mitochondrial DNA sequencing. Methods to incorporate the results [...]

20 year study on psychological impact of gene testing [Microarray and bioinformatics]

Posted: 15 Oct 2008 12:17 PM CDT


Against the backdrop of mounting negative publicity against consumer gene testing companies, Navigenics is teaming up with the Scripps Research Institute to study how such tests affect patient behavior.

This is a long-term study, running over 20 years, with help from Affymetrix to do the scanning, Navigenics to interpret the results and offer lifestyle guidance, and Microsoft HealthVault as the place where participants can enter information and share it as they see fit with physicians.

Share your thoughts on LinkedIn Answers.


Baldness genetics [Yann Klimentidis' Weblog]

Posted: 15 Oct 2008 12:05 PM CDT

Dan at Genetic Future has a great post on a couple of recent genome-wide association studies on male pattern baldness.

Local genomes [genomeboy.com]

Posted: 15 Oct 2008 11:49 AM CDT

Some of my colleagues have had their genomes scanned as part of the Duke Personal Variome Project:

 Willard, though not a participant in the study, said he has had his genome analyzed several times, revealing that he is at a greater risk of developing cancer. Still, he knew of this risk before he ever submitted blood or spit to a laboratory; he was diagnosed with colon cancer and also has a family history of the disease.

“Getting colon cancer changed my lifestyle,” he said. “My genome testing hasn’t changed my lifestyle… but I’m already in this behavioral mindset of thinking about cancer risk and trying to modify behavior to reduce that risk in a very general way.”

Family history of a disease can be just as indicative of one’s chance of getting the disease as genomic testing, Willard said. But he added that for some people, the results of a genetic test might spur them to take care of their health in a way that family history does not.

“As people think about what DNA means and about what the genome means, there really is a sense that this is much more directive and impactful than the generic ‘Uncle Joe had heart disease,’” he said.

GFP Researchers Endorse Obama! [The Daily Transcript]

Posted: 15 Oct 2008 11:48 AM CDT

A message from Marty Chalfie, winner of the 2008 Nobel Prize in Chemistry for the discovery of GFP:

Read the comments on this post...

The Canadian Election, my 2 cents. [The Daily Transcript]

Posted: 15 Oct 2008 11:36 AM CDT

You may not have noticed, but yesterday the US's largest trading partner had an election. Watching the returns with my wife, I was struck yet again by how different Canada is from the US. Just like Americans, Canadians get upset at the government, but unlike Americans, Canadians want the government to work and are ready to punish their leaders if they feel like they are getting screwed.

About three elections ago, the Liberals were punished for a financial scandal (incredibly small by US standards, but too big for the patience of most Canadians). What was the result? In 2004 Canadians couldn't bring themselves to vote Conservative, because most of the population is just not conservative enough, so the Liberals lost many seats in parliament but held on to enough ridings to form a minority government. The government fell in late 2005 over a vote. In January 2006 most Canadians didn't think that the Liberals had cleaned up their act, and so they finally gave the Conservatives power, albeit as a minority government.

The biggest problem for the Conservatives was winning seats in Quebec. To try to woo the finicky voters of La Belle Province, the Conservatives pushed an agenda that favored a shift of powers from the federal government to the provinces. Meanwhile the Liberals changed leaders and tried to regroup, but most Canadians weren't impressed. So a few months ago the Conservatives, emboldened by recent poll numbers, called a new election hoping to gain a majority. All was going well: Stephen Harper, the leader of the Conservatives, had instructed his troops to stay on message, and it looked like the Conservatives would win enough seats in Quebec to get a majority. But then the Tories started acting like Tories and tried to cut funding to the arts. Artists were upset, and the Quebecois abandoned the Conservatives to vote for the Bloc Quebecois, a separatist party that until then had been on the verge of extinction.

So the end result? No change whatsoever. Canadians are still upset at the Liberals (who will have to change leadership once again), and Quebec is still suspicious of both the Liberals and the Conservatives. Traditionally the NDP, a party to the left of the Liberals, had never won in very left-leaning Quebec, probably because there are two other left-of-center parties (the BQ and the Liberals); the NDP was thus seen as a wasted vote in my home town. BUT last night the NDP actually won a seat in Quebec for the first time in a general election! Now that a taboo of Quebecois politics has been broken, perhaps we'll see a change in the not-so-distant future.

Read the comments on this post...

Canadian Doc says government run health care not the answer [Mary Meets Dolly]

Posted: 15 Oct 2008 11:18 AM CDT


While I was visiting a relative in hospice care last week, I had the most interesting conversation with the hospice doctor on staff. I relate it to you only because it is rare, for me at least, to speak to someone who has worked in both "socialized" and "free market" health care systems.

This doctor was born and raised in Canada. He became an obstetrician and took care of mothers and their babies for many years in the socialized Canadian health care system. For the care he provided during 40 weeks of pregnancy and then labor, he only received $155. He continued to work in this system until one case finally pushed him to move to the United States. He had a pregnant patient who developed severe diabetes. She needed to be hospitalized for the last 16 weeks of her pregnancy. This doctor visited her every day for 16 weeks. He called in a diabetic specialist to help with her care. And when it came time to deliver, he had another doctor come in and do a C-section.

The reimbursement from the Canadian health system for prenatal and delivery services was $155. The diabetic specialist received $100 and the doctor who performed the C-section received $58. After 16 weeks of diligent care of a high-risk pregnancy, this hospice doctor got a letter from the Canadian health care system that said he owed them $3!!!!!

Needless to say this doctor high-tailed it to the United States and set up a family medicine practice. This practice did very well until Medicare and Medicaid stopped paying. When the government stopped paying him for his services, his family medicine practice went under. He then got a job as a hospice physician.

After hearing this amazing story I had to ask this doctor what he thought would fix our "broken" health care system here in America. First he said, "Your system is not broken. Canada's system is broken. Yours is just damaged."

Second, he emphatically said, "Socialized medicine IS NOT the answer. When Barack Obama says he can cover everyone, he has no idea what he is talking about. Socialized medicine on a national scale in the U.S. would bankrupt the country in 10 years." In essence he said, "Health care for all is health care for no one."

So I asked, "If socialized medicine is not the answer, what is?" And he replied, "Health savings accounts." I will try to paraphrase his reasoning. Many families or employers pay close to $2000 a month in insurance premiums, and most people do not come close to using $2000 a month in medical expenses. That $2000 goes to the insurance company and is never seen again. Instead of giving that money to the insurance company, the monthly premium would go into a tax-free savings account from which all medical expenses are paid directly by the patient. No referrals, no statements of benefits, etc. You see who you want to see about whatever, whenever. Instead of lining the pockets of the insurance company, that monthly premium will grow and grow, because it still belongs to you. This is a huge incentive to stay healthy and not abuse the health care system by, say, bringing your kid into the ER with an ear infection. You can also use the health savings account to retire on if you wish. Along with the savings account is a catastrophic insurance policy. This policy costs much less and basically covers health care costs over a set amount, like $5000 per year.

I am no expert on health care. I am not pretending to be. I just wanted to share my experience talking to someone who really seemed to know the system and had some real suggestions on how to move forward.  I am interested in knowing what my readers think.


Send your mail to Pete Best instead [genomeboy.com]

Posted: 15 Oct 2008 10:34 AM CDT

Ach, Ringo, we hardly knew ye…

Apparently “Peace and love” is the new “Get the hell off my lawn.”

Pacific Bio releases some details on SMRT Sequencer read lengths, library prep [SEQanswers.com]

Posted: 15 Oct 2008 09:38 AM CDT

Our friends at another site posted a very interesting article containing details from Pacific Biosciences regarding their much-anticipated third-gen uber-sequencer. I certainly don't want to steal their scoop, so I won't go into detail here... I encourage you to visit their site, as it's the first detailed look at what PB may offer as soon as 2010. Some points include long (Sanger-sized) or short reads, and run times!

Read more and join the community...

Web as platform: A web of data services [business|bytes|genes|molecules]

Posted: 15 Oct 2008 08:00 AM CDT

The New York Times just released an API, one for Campaign Finance. Marshall Kirkpatrick calls this significant because “steps like this are going to prove key if big media is to thrive in the future.”

To me this is another data point for something I have been thinking a lot about lately. We’ve often talked about a web of data. We’ve also talked about how data is pretty much useless until you can do something with it. This fits in well with my theory that raw data on its own doesn’t really have any value; it’s what you do with it that makes it valuable. That leads us to the next evolution of the web and the data-centric age: a web of data services. This is also the answer to the question Tom Tague asks: how will we interact with the web of data?

The NY Times providing APIs to its resources of data is the natural evolution of the web of data. Data must not be locked up. It must be made available, and we must be able to do something with it. Services on the fabric of the web are the way to go, and APIs enable services.

In this article on the web of data, Tom Tague writes:

The Web of data is the logical extension, letting developers create links between data sources that are themselves exposed on the Web for others to reuse to build large-scale, ad hoc mashups, while simultaneously reducing the challenges of integrating heterogeneous data.

The web of data services is not only about letting developers create links between data sources, but also about computing on those data sources and repurposing the data and results to provide information that is more meaningful and has greater value than the data sources could provide individually.

Tom asks what a Semantic Web browser should look like. To me it looks like a command prompt, or an IDE, where developers can leverage the services that make the data available, and the services that make the methods that act on the data available, to provide useful information to the end user, who then consumes that information in a human-readable form. Perhaps that happens in an interactive environment that allows users to leverage some underlying canned services and pre-computed results to look at the information in a way that makes more sense to them.
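As a rough sketch of that kind of session, here is what it might look like from an R prompt today: pull data from a (purely hypothetical) web data service, compute on it locally, and write out a result that could itself be exposed as another service. The URL and the column names ("amount", "state") are invented for illustration only.

# Data service in, local method applied, repurposed result out.
svc <- "http://api.example.org/campaign-finance/contributions.csv"   # hypothetical endpoint
contributions <- read.csv(url(svc))                # raw data pulled from the service

by_state <- aggregate(amount ~ state,              # compute on the data source
                      data = contributions, FUN = sum)

write.csv(by_state, "contributions_by_state.csv",  # a derived dataset, ready to be
          row.names = FALSE)                       # served back onto the web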

We aren’t there. It will take a while, but we’ll get there. IMO, in the life sciences, the very epitome of data-driven science, we need to think seriously about data and data services, as well as methods and method services. Matt Wood talked about essentially this approach in a talk to the Informatics group at the Sanger Institute earlier this year.


Everything that's great comes from blogging! [The Gene Sherpa: Personalized Medicine and You]

Posted: 15 Oct 2008 07:43 AM CDT

Today, I want to point out what Webicina is and why it is important. You may be asking yourselves....webicina? What the heck is that? and Why should I care.... Here's why. Just about everything in...

[[ This is a content summary only. Visit my website for full links, other content, and more! ]]

Navigenics to add gene sequencing to its personal genomic service [Genetic Future]

Posted: 15 Oct 2008 07:01 AM CDT

Navigenics has announced in the industry publication In Sequence (subscription only) that it plans to add gene sequencing to its personal genomics service. This would make it the first of the "Big Three" personal genomics companies (Navigenics, 23andMe and deCODEme) to offer analysis of rare as well as common genetic variants.

The move into sequencing has always been inevitable for the personal genomics industry. Currently all three of the major players in the affordable personal genomics field (as opposed to Knome's high-end service) use chip-based technology to analyse up to a million common sites of variation, known as SNPs, scattered throughout the genome. SNP chips provide remarkable insight into common variants (that is, variations with a frequency of 5% or greater in the general population), but they don't provide any real information about rarer variants - particularly those with a frequency of less than 1%.

It has become increasingly clear over the last few years that common variants play a disappointingly small role in most common diseases, as SNP chip approaches on ever-larger sample sizes have consistently failed to find the majority of disease-causing variation. Rather it appears likely that a substantial proportion of disease risk lurks in individually rare, large-effect polymorphisms - variants found in just a small fraction of the population, each contributing a substantial increase in disease risk. Most of these variants will never be picked up by SNP chip technology, so new approaches will be required to find them - and that's where sequencing comes in. By determining the complete DNA code within a set of target genes, sequencing identifies both common and rare variants alike.

Until now, the shift of the personal genomics industry into the sequencing market has been held back by two major barriers: cost, and the difficulty of interpreting rare variants. The first barrier is dropping with alarming speed, but the second is still a major challenge - and one that will pose some serious dilemmas for Navigenics and other companies as they launch their sequencing ventures.

Of course, these aren't new dilemmas: molecular diagnostics labs have been facing the challenge of determining whether or not a novel mutation is disease-causing for decades, in the context of both rare Mendelian diseases (like muscular dystrophy) and particularly in complex diseases such as breast cancer (BRCA1 mutation analysis is a particularly subtle art that probably warrants its own post). Navigenics will thus be taking advantage of the experience and the databases of a company called Correlagen Diagnostics, which already offers sequencing-based tests for a range of known disease-causing genes. I don't know enough about Correlagen to comment on their expertise, but it certainly makes sense for personal genomics companies to team up with experienced molecular diagnostics teams as they face the challenges of the sequencing era.

Navigenics will initially restrict the complexity of the problem by focusing on a set of known disease genes, and will draw on Correlagen's database to see if any new variants they find in a client's genes are known to be associated with diseases in other patients. However, most of the possible disease-causing variants they find will be completely novel - such is the nature of rare variants - and their disease-causing status will thus need to be predicted de novo. Navigenics' solution is roughly laid out in the In Sequence article:
In many cases, though, a rare gene variant will never have been seen before and, thus, be more difficult to interpret. Based on the variant's properties, like its evolutionary conservation, or whether it results in an amino acid change, Navigenics will attempt to assign it a probability score that predicts its clinical relevance. "And that's a really, really hard problem," Stephan said.

What is needed, he said, is sequencing-based genome-wide association studies. "What you ideally would want to do is take thousands of people with a complex genetic disease and thousands of people without one, sequence their entire exomes, and look for hotspots of accumulation of rare variants in certain regions of the genome [of cases vs. controls, where] the specific variants look like they have some sort of functional consequences."

"Then, the next time a person comes through the door, you can start to informatically stratify the loci that you see variants in ... based on sequencing all these genomes."
I think Stephan is under-estimating the sample sizes required for these studies to be effective - we're talking hundreds of thousands of whole genomes, at least - but the overall message is on-target, and it's not good news for personal genomics customers expecting to find out what their genome means right now. It's going to take a long time and a tremendous amount of work before de novo functional prediction becomes a reliable proposition.

Of course, that's not going to stop personal genomics companies from staking out claims in the sequencing arena, and from offering risk predictions from rare variants - however provisional and imperfect - to customers. 23andMe has long expressed interest in a sequencing approach, although co-founder Linda Avey is coy about the company's ambitions in the In Sequence article:
"23andMe is closely following the next-generation sequencing field and will offer an expanded service when the data quality, balanced by the cost, of these offerings meets our criteria," said Linda Avey, co-founder of 23andMe, in an e-mail message. Once the company decides to include sequencing analysis in its service, "we will examine any and all sequencing companies in determining which would work best with our platform," she said.
Navigenics will apparently be offering whole-exome sequencing (analysis of the protein-coding regions of all genes in the genome) some time next year, and complete genome sequencing at some stage after that. You can bet that 23andMe's desire to remain at the lead of the personal genomics industry will ensure that Navigenics will not be alone; at the same time, the whole-genome sequencing services offered by industry pioneer Knome and other emerging players will be dropping to affordable levels. When you throw in the current obscene rate of change in the sequencing technology sphere, this is likely to turn into a chaotic and fascinating race.

Subscribe to Genetic Future.

Baldness genes: one old, one new [Genetic Future]

Posted: 15 Oct 2008 06:57 AM CDT


From a geneticist's point of view, male pattern baldness - also known as androgenic alopecia - is a tempting target. Baldness is common in the general population, with a prevalence that increases sharply with age (as a rule of thumb, a male's percentage risk of baldness is approximately equal to his age, e.g. 50% at age 50, and 90% at age 90), so there is no shortage of cases to study. It's also a strongly heritable trait, with about 80% of the variation in risk being due to genetic factors. Finally, baldness has been reported to be associated with a wide range of diseases such as prostate cancer, heart disease and diabetes, so learning about the genes that underlie this condition may help to dissect out the molecular pathways behind more serious disorders.

So it was only a matter of time before researchers targeted baldness with their favoured tool of the moment, the genome-wide association study. This week two separate groups published the results of genome-wide scans for baldness genes in the prestigious journal Nature Genetics. In both cases, their findings strongly support a known genetic association with the androgen receptor gene on the X chromosome, and also highlight a new region on chromosome 20 with a smaller (but still significant) effect on baldness risk.

I'm a little late to the party on this story - see posts by Razib, Hsien, Grace and Erin from 23andMe - but there are some interesting facets to this story that warrant a little extra attention.

The candidate gene approach got there first
One of the most striking findings of these studies is the massive signal of association around the androgen receptor (AR) gene, which is located on the X chromosome - it's a clear outlier on the signal plot shown below (each dot is a single genetic variant, with each chromosome labelled in alternating colours, and the height on the Y axis is the strength of the association with baldness). In contrast, the novel association on chromosome 20 is fairly modest.

[Figure: genome-wide association signal plot for male pattern baldness, from Hillmer et al.]

The unusual thing about this signal is that the association between the AR gene and male pattern baldness has been known since 2001, when it was reported by Justine Ellis from the University of Melbourne (as an aside, The Spittoon erroneously suggests that the first report was in 2005). This is unusual because the pre-genomic era of "candidate gene" association studies, in which only a few selected genes at a time were screened for associations with a disease or trait, was notoriously bad at finding the most important genes. In most cases, the top hits in recent genome-wide association studies are in genes that would never have been identified by the candidate gene approach (e.g. FTO and obesity, the 5p13.1 gene desert in Crohn's disease). Baldness thus represents a rare success story for the candidate gene approach.

The androgen receptor was originally selected for analysis by Ellis on the basis of biological plausibility - it's well-known that baldness is associated with the testosterone pathway, and the androgen receptor is the molecule that signals testosterone's presence to cells all over the body. This means the chromosome X result from these genome-wide studies comes with an immediate biological explanation; unfortunately, the same cannot be said for the chromosome 20 signal.

It's unclear which gene is the culprit on chromosome 20
Both papers highlight the same stretch of DNA on chromosome 20 as the second strongest signal of association (although the two studies highlight different markers as the top hit, both top markers fall within a region of high linkage disequilibrium - which is just a fancy way of saying that they're almost always inherited together, so they're almost certainly both tagging the same underlying causal variant). However, unlike the chromosome X story, there's no obvious candidate gene lurking in this region - the nearest gene (PAX1) is almost 200,000 base pairs away, and has no known role in the testosterone pathway.

One of the studies provides experimental data showing that PAX1 is expressed in the scalp - but it's also expressed (and at much higher levels) in muscle and thymus, so this isn't compelling evidence of a causal role in hair loss. It will take some serious experimental work to unravel the real genetic culprit in this region.

HairDX may be testing the right gene, but the wrong marker
The genetic testing company HairDX offers testing of androgen receptor variants to predict the risk of premature baldness in both males and females. For their male test they examine the marker rs6152, which is located close to the beginning of the androgen receptor gene, but more than 250,000 bases away from the best hit in either of the two genome-wide studies. This suggests that the predictions made by the HairDX test could well be substantially improved by shifting to different markers (and, of course, incorporating markers from the chromosome 20 region).

I'll be discussing the current HairDX tests in more detail over the next few weeks. For the moment, let's just say that they're not something I'll be rushing out to purchase any time soon.

Genes --> baldness cure?
There are very few things on the internet more depressing than a Google search for "baldness cure" - in a single click you are transported into a sordid world of shame, desperation and rampant greed; ad-riddled forums for lonely men looking for a way to restore their once-luxurious manes, and an army of clinicians and researchers willing to sacrifice their credibility for a share of the resulting cash. As in any medical arena fuelled by desperation, that cash is plentiful (one of the Nature Genetics studies notes that annual sales of a single anti-baldness treatment recently surpassed $405 million).

To pharmaceutical companies baldness must be almost as good a target as obesity: it's extremely common, afflicts the wealthy as well as the poor, and its sufferers will readily fork over cash for a potential cure. But to find effective treatments, big pharma needs to have a clear idea of how baldness occurs at the molecular level - and that, in theory, is where genetic studies can help. By finding new genetic variations that influence baldness risk, genome scans might highlight unexpected pathways that ultimately lead to new drug targets.

However, these two new studies haven't provided much to help feed the wallets of pharmaceutical executives: the androgen pathway has long been known to influence baldness risk, and is already targeted by a number of existing baldness drugs (e.g. finasteride, a.k.a. Propecia), while the chromosome 20 region doesn't yield any clear-cut targets or clues regarding baldness pathways.

Judging from the chromosome scan shown above there are no more low-hanging genes on the baldness tree; it's going to be extremely difficult (i.e. requiring much larger sample sizes and/or different research approaches, such as large-scale sequencing) to drill down to find the next tier of small-effect risk genes. However, that's precisely what will be required for effective molecular dissection of the genetic basis of baldness.

Perhaps if a fraction of the money from online sales of dubious baldness therapies went into actual hair loss research we'd have answers more quickly - but I won't be holding my breath.

References
J Brent Richards, Xin Yuan, Frank Geller, Dawn Waterworth, Veronique Bataille, Daniel Glass, Kijoung Song, Gerard Waeber, Peter Vollenweider, Katja K H Aben, Lambertus A Kiemeney, Bragi Walters, Nicole Soranzo, Unnur Thorsteinsdottir, Augustine Kong, Thorunn Rafnar, Panos Deloukas, Patrick Sulem, Hreinn Stefansson, Kari Stefansson, Tim D Spector, Vincent Mooser (2008). Male-pattern baldness susceptibility locus at 20p11. Nature Genetics. DOI: 10.1038/ng.255

Axel M Hillmer, Felix F Brockschmidt, Sandra Hanneken, Sibylle Eigelshoven, Michael Steffens, Antonia Flaquer, Stefan Herms, Tim Becker, Anne-Katrin Kortüm, Dale R Nyholt, Zhen Zhen Zhao, Grant W Montgomery, Nicholas G Martin, Thomas W Mühleisen, Margrieta A Alblas, Susanne Moebus, Karl-Heinz Jöckel, Martina Bröcker-Preuss, Raimund Erbel, Roman Reinartz, Regina C Betz, Sven Cichon, Peter Propping, Max P Baur, Thomas F Wienker, Roland Kruse, Markus M Nöthen (2008). Susceptibility variants for male-pattern baldness on chromosome 20p11. Nature Genetics. DOI: 10.1038/ng.228

Subscribe to Genetic Future.

Genome Technology interview [Mailund on the Internet]

Posted: 15 Oct 2008 06:22 AM CDT

Last week I gave an interview to Genome Technology about R/parallel (which I blogged about a few days earlier).

I don’t know if the article about R/parallel is out yet — I haven’t seen it — but below you can see the questions I got and the answers I gave…

The interview

When it comes to parallelizing software such as R, what are the inherent challenges that are beyond the average bench biologist?

There is a lot of parallelism going on in modern hardware, most of which you never worry about.  The compilers and CPUs take care of it for you.  This has to do with how data and program instructions float around on the silicon, and it is usually not something you want to know about unless you develop hardware or compilers.  When you do notice it, what you usually notice is how data from main RAM is vastly slower to access than data in the cache.  If you are careful with your data structures you can avoid this problem in most cases by just following a few rules of thumb (and all that means is that you should try to keep data that is accessed together close together in memory).

In languages like C++ you will also notice how virtual functions are much slower to call than non-virtual functions.  This is also a consequence of parallelism.  When you are going to call a function — which means that instructions now need to be read from a different place in memory — the CPU can see where you are going and start fetching the instructions.  With virtual functions, you don’t know where you are going until you’ve computed exactly where you want to execute (this is called virtual table lookup, but it just means that you are computing the point in the instructions to jump to, rather than having that point explicit in the instructions).  In that case, the CPU cannot start fetching the instructions, so there will be a delay in the function call.

But so far I am not really answering the question; I’m just warming up to that.

I just wanted to make it clear that parallelism is not new to computing and doesn’t normally need to be something we worry about.  Except that we do have to worry about it right now, until systems developers can hide it away under layers of abstraction — just like R/parallel is trying to do.

Why do we have to worry about it now?

To make CPUs faster it is no longer sufficient to just add more and more transistors, closer and closer together, on a chip.  There is a physical limit to how much longer we can do that.

If we cannot increase the speed of the individual processors, then at least we can do more in parallel, but now the software needs to help the hardware.  The parallelisation needs to be dealt with in the software layer.

On your desktop computer, you are seeing this in several places.

  1. The CPU has instructions, so-called SIMD instructions, for performing the same operation on multiple data in parallel.
  2. The CPU has multiple cores (it is hard to find one now with fewer than two, and soon that will be four).
  3. Your graphics card is a very powerful computer in itself (the processor there is called a GPU, and is a bit harder to program than a CPU, but that will change over time).

I’m not going to comment on 3. any further.  Using GPUs for scientific computing is very hot these days, but is probably best left to the computer scientists for now.

The parallelisation in 1. is relatively easy to work with as a programmer.  There are some hardware considerations to deal with, like how data should be formatted in memory, but it is not conceptually hard to deal with.

Even better, it can be automatically applied to a large degree in languages like R.  In C/C++, for example, you don’t have high-level constructs to perform an operation on all elements in a vector (which is the kind of operation that is ideal for this kind of parallelisation).  In R you do, so whereas it can be hard for a C/C++ compiler to automatically use this kind of parallelisation, in R it would be almost trivial to apply it.  This is what I wrote about in my blog post.
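For concreteness, here is a small sketch of the difference in plain R; whether the interpreter actually applies SIMD under the hood depends on the build and the hardware, but the vectorised form is the one that makes such optimisation feasible.

# The same element-wise computation written two ways.
v <- runif(1e6)

res_loop <- numeric(length(v))       # explicit loop: one element at a time
for (i in seq_along(v)) {
  res_loop[i] <- exp(v[i])
}

res_vec <- exp(v)                    # vectorised: "exponentiate all of v" at once

all.equal(res_loop, res_vec)         # TRUE: same result, far easier to optimise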

The parallelisation in 2. is more difficult to deal with.

Even if computations have been running on highly parallelised hardware for a while, conceptually the programs have been “single threaded”. At every point of time in the computation, conceptually a single instruction is being executed, and the state of the program is deterministically determined by the input and the instructions executed so far.

With multiple cores, the program can run several threads in parallel, and the program is now conceptually parallelised.

Multiple threads are not new to computer science.  We have had them as a concept for decades.  Even when the actual processor could only execute a single instruction at a time, it would sometimes be beneficial to think of the program as running in parallel.  And for various reasons that I won’t go into, you could also get significant speed improvements, but for different reasons than we typically think about in scientific computations.

Now, with multiple cores, you get a significant runtime benefit for your computations if you use multiple threads.  If you have four cores on your CPU, then you can, under ideal conditions, run your programs four times faster than you could on a single core.

Dealing with this kind of parallelism is very hard, though (and now I am finally getting to the actual question).

The problem, as I see it, is that our brains just find it hard to think about parallelism.  I know, it sounds weird, ’cause in the real world everything happens concurrently and we deal with it.

On the other hand, people who wouldn’t recognise a differential equation if it sat on their lap, can still catch a ball, so clearly there are some things we find easy to deal with but hard to reason about, and concurrency is one of these things.

The problem is that when the program solves several tasks at the same time, we need a way of coordinating these tasks.  We need to have the input available before we start processing, and deliver the output when it is needed.

This is something we know from our own life, but on the computer there is an added complexity:  the program is not smart enough to know that it doesn’t have the data it needs, and will happily process garbage input if the real input isn’t available.  Likewise, it will not think twice about overwriting important current data with bogus outdated data, if it gets behind the main program.

We have to explicitly program what the communication and synchronisation rules should be, and our brains are pretty bad at reasoning about this.

We forget about important situations where synchronisation is necessary, and we sometimes add rules that let each thread wait for some other thread to complete, so no thread can actually continue its work and the whole system is deadlocked.

There are rules you can follow to avoid these problems, but even very experienced people find it hard and make mistakes.

On a side note, a large field in theoretical computer science — concurrency theory — works on ways to deal with this problem: rules for constructing programs so the errors are avoided, or methods for analysing models of systems to prove that they have the intended behaviour.  (My PhD work was on the latter.)

Unfortunately, most of the theoretical work is far from being usable in real programming situations.  So in practice we are relying on rules of thumb and experience, and neither works that well.

Of course, all the problems with multi-threaded programs just get much, much harder on multi-computer systems (clusters or networks of servers).  There the synchronisation is even worse, and on top of that individual computers can crash and the system must be able to deal with that.

Leave that to the computer scientists for now…

If you are not experienced in parallel programming, and your interests are in, say, biology and not computer science, you will probably find dealing with concurrency issues a pain in the neck.  You just want the computer to crunch your numbers with the equations you’ve given it, and really you shouldn’t have to worry about how it does it.

Programming languages do not yet hide this kind of parallelism.  We have high-level constructions to describe our mathematics (to varying degrees), but when it comes to parallel execution, we are just not there yet.

R/parallel is a small step in that direction.  It gives you a language construction for executing, in parallel, what would otherwise be a sequential loop.  It then deals with distributing the tasks to threads and synchronising them, so you, as the programmer, won’t have to worry about it.
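As a rough illustration of the same style of construct (this is not R/parallel's actual interface, just a sketch using the snow package), the loop body is handed to a small pool of worker processes and the package handles distributing the tasks and collecting the results:

library(snow)

slow_step <- function(x) {              # stand-in for a real per-iteration task
  Sys.sleep(0.1)
  sqrt(x)
}

cl <- makeCluster(2, type = "SOCK")          # two local worker processes
results <- parLapply(cl, 1:20, slow_step)    # what would otherwise be lapply()
stopCluster(cl)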

This idea is not new.  There have been programming languages with such constructions for ages.  It just hasn’t made it into the mainstream languages.

How should one go about deciding what parts of code to make parallel and what parts to leave alone? As you say, the BMC paper leaves it up to the researcher to decide, but do you think this might be beyond the average user and better left to parallel programming experts?

To be absolutely honest, even the experts probably wouldn’t know where to parallelise just by looking at a program.

Just like it is notoriously hard to reason about where the bottlenecks are in a sequential program, and how to get around them, it is hard to reason about where parallelisation will give you a boost.

The good news is that it isn’t that hard to figure it out.  All you have to do is profile your program, and then you know the hotspots you need to focus on.
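A minimal profiling session with R's built-in profiler looks like this; the workload below is just an arbitrary stand-in for your own analysis code.

Rprof("profile.out")                    # start sampling
x <- matrix(rnorm(2e5), ncol = 100)
d <- dist(x)                            # the kind of call that often dominates
h <- hclust(d)
Rprof(NULL)                             # stop sampling
summaryRprof("profile.out")$by.self     # time spent in each function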

Trial and error will show you where you get a performance improvement by parallelising (and with R/parallel there isn’t much work involved in doing this testing, compared to programming it up manually).

The experts have done this many times before and might have a better intuition about where to try parallelisation first, but really I would say that knowing the problem the program is solving — and thereby knowing where the real number crunching is going on — is just as helpful.

This is not where we want to invoke computer scientists.  If they just give us the tools to experiment without much of an overhead, then we’re fine.

You mention there are some limitations with parallelizing R through an add-on package such as the one described in the paper vs. what you describe in your post; can you explain the difference in layman’s terms?

An add-on package can give you a new abstraction, like in R/parallel, that lets you tell R to “do this stuff here in parallel rather than in sequence”.

It is a lot easier to use such an abstraction than to program the parallelisation yourself, but it still leaves it up to you to worry about parallelisation.

Ideally, you just want to tell the computer which equations to solve, and not have to worry about how it does it.

Although you might not think this when you are programming, you are a lot closer to this ideal than you might realise.  You might think that you have to tell the computer exactly how to solve a problem, because you have to worry about loops and exceptions and whatnot, but you are actually very far removed from the actual computations.

With high-level languages, you tell the program to perform some operation (say exponentiation) on all elements in a list.  The operations that the computer actually has to do are at a much lower level.

To take the same example in the context of parallelising a program, you should be able to tell the computer to exponentiate all elements in a list and not have to worry about whether it needs to do it in parallel or in sequence.

You want to be able to just write

> result <- exp(v)

and not worry about whether you want

> result <- exp(v)

or

> result <- run.parallel(exp,v)

Of course, if we need all the trial and error to figure out when parallelisation is worthwhile, can we trust the computer to make the decision for us?

If you just compile the program and have to make the decision without knowing about performance hotspots, then no.

But here’s the thing: when we are running the program, we can profile it, and then we know the hotspots, so we can at that point replace the sequential execution with parallel execution where it will actually improve performance.

Modern virtual machines already do something similar for code generation.  You might interpret byte-code in your virtual machine, but the performance-critical parts will quickly be recognised and compiled into machine code, so those parts are executed much faster than they would otherwise be.

The virtual machines do much more than that, though: they recognise the common cases in your methods and optimise for them.  Remember the virtual functions I mentioned above? Since there is a high penalty with those, a virtual machine can remove them by generating code where the function is not looked up in a table but called directly in the code.  When the virtual machine recognises that it is running the common-case scenario, it runs that generated code (with an efficient function call), and otherwise the generic code (which is slower, but won’t be called that often).

To put it simply: when we are running our programs we know where the bottlenecks are, and can optimise them.  This requires that the virtual machine does the profiling and optimisation, though, and is not something you can just add onto it with a library.

Do you think that some of the currently available commercial parallel tool kits, such as the ones offered by Interactive Supercomputing and Revolution Computing, which both offer ways to parallelize R, offer something more powerful or robust than what’s being described in the R/parallel paper?

I am not familiar with those tool kits, so I don’t know.  What is offered in R/parallel is pretty basic.  It is the first small step in the right direction.  If you have money and time to throw at it, it shouldn’t be much of a problem to improve it a lot, so I wouldn’t be surprised if commercial packages are better.

They won’t necessarily stay better, though.  If an Open Source project for parallelising R really takes off (like R itself has done), then there are a lot of motivated programmers working on it, and that is hard to keep up with as a company.

New Web Tools to Promote Your Site [Think Gene]

Posted: 15 Oct 2008 05:26 AM CDT

To go with our new layout, new web tools for news.thinkgene.com!

First: Did you forget your password or account login for news.thinkgene.com? Reset your password.

Help

Don’t know any HTML? Use an unsupported blogging platform? Do computers scare you? Email us and we’ll help you out, no problem.

Top Links

Copy-and-Paste this HTML code into your website where you want the Top Links web widget to appear:

HTML Code

Widget Preview

Bookmarklet

Submit whatever webpage you are viewing by adding this special bookmark to your browser. Add it, then simply click it to submit.

post to news.thinkgene.com

Submit Buttons

Embed these buttons directly into your website. When clicked, the page will be submitted by your users. Copy-and-Paste the HTML code into your website or blogging platform.

Example:
post to news.thinkgene.com

Add as Blogger Gadget
In Layout, click “Add a Gadget” in the sidebar, choose “HTML/JavaScript”, copy and paste the code below into the “content” area, and press “SAVE”.

Add as WordPress Template
Paste this code into your theme, probably in “post.php”.

Add as raw HTML
submit URL of TITLE (you must replace these values in your code!)

Other Images
You can replace the image URL in “src=” with these to display another image


Caltech biologists spy on the secret inner life of a cell [Think Gene]

Posted: 14 Oct 2008 10:58 PM CDT

Josh: This isn’t a field I’m that familiar with, but is it known how the baby’s immune system uses the antibodies provided by the mother’s milk? How does it stimulate the baby’s cells to produce the same antibodies? The mention of the clathrin coat not being completely shed is particularly interesting. Either their observations were flawed, other researchers never noticed that the coat was not completely shed, or this is a special case where the coat is just not completely shed.

The transportation of antibodies from a mother to her newborn child is vital for the development of that child’s nascent immune system. Those antibodies, donated by transfer across the placenta before birth or via breast milk after birth, help shape a baby’s response to foreign pathogens and may influence the later occurrence of autoimmune diseases. Images from biologists at the California Institute of Technology (Caltech) have revealed for the first time the complicated process by which these antibodies are shuttled from mother’s milk, through her baby’s gut, and into the bloodstream, and offer new insight into the mammalian immune system.

Newborns pick up the antibodies with the aid of a protein called the neonatal Fc receptor (FcRn), located in the plasma membrane of intestinal cells. FcRn snatches a maternal antibody molecule as it passes through a newborn’s gut; the receptor and antibody are enclosed within a sac, called a vesicle, which pinches off from the membrane. The vesicle is then transported to the other side of the cell, and its contents–the helpful antibody–are deposited into the baby’s bloodstream.

Pamela Bjorkman, Max Delbrück Professor of Biology at Caltech and an investigator with the Howard Hughes Medical Institute, and her colleagues were able to watch this process in action using gold-labeled antibodies (which made FcRn visible when it picked up an antibody) and a technique called electron tomography. Electron tomography is an offshoot of electron microscopy, a now-common laboratory technique in which a beam of electrons is used to create images of microscopic objects. In electron tomography, multiple images are snapped while a sample is tilted at various angles relative to the electron beam. Those images can then be combined to produce a three-dimensional picture, just as cross-sectional X-ray images are collated in a computerized tomography (CT) scan.

“You can get an idea of movement in a series of static images by taking them at different time points,” says Bjorkman, whose laboratory studies how the immune system recognizes its targets, work that is offering insight into the processes by which viruses like HIV and human cytomegalovirus invade cells and cause disease.

The electron tomography images revealed that the FcRn/antibody complexes were collected within cells inside large vesicles, called “multivesicular bodies,” that contain other small vesicles. The vesicles previously were believed to be responsible only for the disposal of cellular refuse and were not thought to be involved in the transport of vital proteins.

The images offered more surprises. Many vesicles, including multivesicular bodies and other more tubular vesicles, looped around each other into an unexpected “tangled mess,” often forming long tubes that then broke off into the small vesicles that carry antibodies through the cell. When those vesicles arrived at the blood-vessel side of the cell, they fused with the cell membrane and delivered the antibody cargo. The vesicles also appeared to include a coat made from a molecule called clathrin, which helps form the outer shell of the vesicles. Researchers previously believed that a vesicle’s clathrin cage was completely shed before the vesicle fused with the cell membrane. The new results suggest that only a small section of that coating is sloughed off, which may allow the vesicle to more quickly drop its load and move on for another.

“We are now studying the same receptor in different types of cells in order to see if our findings can be generalized, and are complementing these studies with fluorescent imaging in live cells,” Bjorkman says. “The process of receptor-mediated transport is fundamental to many biological processes, including detection of developmental decisions made in response to the binding of hormones and other proteins, uptake of drugs, signaling in the immune and nervous systems, and more. So understanding how molecules are taken up by and transported within cells is critical for many areas of basic and applied biomedical research,” she adds.

Source: California Institute of Technology

FcRn-mediated antibody transport across epithelial cells revealed by electron tomography. Wanzhong He, Mark S. Ladinsky, Kathryn E. Huey-Tubman, Grant J. Jensen, J. Richard McIntosh,  &  Pamela J. Björkman. Nature 455, 542-546 (25 September 2008) | doi:10.1038/nature07255

Panda genome arrives [Omics! Omics!]

Posted: 14 Oct 2008 09:54 PM CDT

China announced over the weekend the completion of the giant panda genome.

For the benefit of presidential candidates who can't conceive of the value of scientific research on bears, I'll suggest a few questions worth exploring in the panda genome (beyond the obvious direction of weapons development).

First, the panda genome is one more mammalian genome to add to the zoo. For comparative purposes you can never have too many. Since other carnivore genomes are done (first & foremost the dog, but the cat as well), this is an important step towards understanding genome evolution within this important group. It is the first bear genome, but with the price of sequencing falling, it is likely that the other bears will not be far behind (with the possible exception of Ursa theodoris).

Second, completion of a genome gives a rich resource of potential genetic variants. In the case of an endangered wildlife species such as panda, these will be useful for developing denser genetic maps which can be used to better understand the wild population structure and the gene flow within that structure. Again, if you are running for president please read this carefully: this has nothing to do with paternity suits. If you want to manage wildlife intelligently and make intelligent decisions about the state of a species, you want to know this information.

Third, pandas have many quirks. That bambooitarian diet for starters. Since they once were carnivores, it is likely that their digestive systems haven't fully adapted to the bamboo lifestyle. Comparisons with other carnivores and with herbivores may reveal digestive tract genes at various steps in the route from meat-eater to plant-eater.

Fourth, as the press release points out, there are many questions critical to preserving the species which (with a lot of luck) the genome sequence may give clues to. First among these: why is panda fertility so low? U.S. zoos have been doing amazingly well in this century, but that's only 4 breeding pairs. The Chinese zoos have many more pandas & many more babies, but it's going to take a lot more to save the species.

Accelerators: It’s all about the programming [business|bytes|genes|molecules]

Posted: 14 Oct 2008 09:21 PM CDT

[Image: graphics processor on an Nvidia GeForce 6600GT, via Wikipedia]

My favorite source of hardware acceleration commentary returns to that subject.

Joe asks

My question now is, given the intent of Intel in this market, will Larrabee be able to get traction in the graphics world? And therefore, effectively displace nVidia (and to a lesser extent, AMD) as the accelerator king?

IMO, if Intel nails the programmability issue, they can make a huge dent in nVidia’s leadership. It’s all about the programming at this stage.

