According to a 2010 article in Bio-IT World, the term $1,000 Genome has been around since 2001. The University of Wisconsin’s David Schwartz claims to have coined the term at an NHGRI retreat during a breakout session. Whatever its origin, the $1,000 Genome soon became the target for the rapid development of next-gen sequencing (NGS).
With Illumina, the dominant player in the NGS market, claiming this year that they’ve reached that target with their HiSeq X Ten system, it’s fair to stop and ask just what has been achieved. What do you get for that $1,000? And furthermore, where does NGS go from here?
Beginning next week, we're launching a new series, The Rise of Long Read Sequencing.
I first heard “long read” sequencing differentiated from “short read” in an interview last year with Mike Hunkapiller, CEO of Pacific Biosciences. I had asked him the obvious question about how he expects to compete with Illumina, and he responded that “short read technologies” had serious drawbacks.
“Wait a minute,” I remember thinking at the time, “did Mike just dismiss Illumina’s technology outright? And what are these long reads he’s talking about?”
There’s no doubt that Illumina is a major success story. In the current edition of Forbes, Matthew Herper crowns Illumina with a glowing article, naming the rapid decrease in the price of sequencing after their CEO, “Flatley’s Law.” This is no small praise for Illumina’s Jay Flatley, who has led the company from a startup that used to offer oligos for $0.15/base to the dominant player in the sequencing space, now strongly poised as an upcoming contender in the clinical diagnostics industry.
But this is the story you’ll hear everywhere.
What is less known is the story of the turnabout of Pacific Biosciences and the rise of long read sequencing. PacBio had a much-touted beginning, raising north of $600 million. But they disappointed the industry by not delivering on some early hype that they could compete with Illumina on throughput by sequencing a human genome in fifteen minutes. In fact, not only did PacBio fail to improve on Illumina’s high throughput, their technology had an unattractively high error rate of 15%. And to top it off, their machine was more expensive.
However, for over a year now, we’ve been following an emerging trend among researchers toward the use of PacBio’s long reads not only to do de novo sequencing, but also to probe areas of the human genome that have defied short read technologies. From better characterization of RNA isoforms to raising the quality of the human reference genome, more and more papers are being published touting the new possibilities of PacBio's long reads.
There’s also now some data coming from Oxford Nanopore’s new MinION that is exciting the first round of users. This is long read data. In addition, I recently toured Genia Technologies’ facility in Mountain View and was shown their new sequencer, now in alpha testing. Genia’s CEO, Stefan Roever, says their new chip will read over a million long reads per run.
Once you have long reads and high throughput, is there any use for short read technology? I asked Stefan. “Not really,” he confirmed.
To chronicle the rise of long reads, we went to PacBio and asked them if they’d introduce us to some of their users and sponsor a series on the topic. They did.
Take the story of Gene Myers, for instance. Gene helped develop the BLAST algorithm for sequence alignment, published in 1990, and later worked on sequencing the human genome at Celera. Then he got out of sequencing to pursue “more interesting science.” He thought that the future of sequencing was pretty straightforward and not that provocative for a scientist.
“Everything basically went short because that’s where you could get the reduction in cost,” says Myers in our upcoming interview. “Today everyone does it routinely but I don’t think they should be. . . . They’re using 100 bp reads, and the assemblies are crappy,” Gene says.
Gene is now back into sequencing, working at the Max Planck Institute in Germany. And he’s very excited about long reads. He says that for the first time ever it is theoretically possible to get to 100% accuracy with PacBio’s technology.
Wait a minute. What about PacBio’s terrible accuracy rate?
It turns out that even though the error rate of the PacBio SMRT system was quite high, the errors were random. So if you stacked the sequences deep enough, the errors at any given position would be outvoted by the correct reads, and you could greatly improve the accuracy.
We ask Gene how it is that the industry has bought into short read technology for so long.
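To make the "stack the reads deep enough" idea concrete, here is a minimal Python sketch of a per-position majority vote. It is a toy model, not PacBio's actual consensus method: it assumes the reads are already aligned and that every error is a random substitution, whereas real reads also contain insertions and deletions and have to be aligned first. The function names, the 2,000-base template and the depths tried are all illustrative choices; only the 15% error rate comes from the text above.

```python
import random
from collections import Counter

BASES = "ACGT"


def noisy_read(template, error_rate, rng):
    """Copy the template, swapping each base for a different random base
    with probability error_rate (assumption: substitution errors only)."""
    out = []
    for base in template:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def consensus(reads):
    """Majority vote at each position across equal-length, pre-aligned reads."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


def accuracy(called, truth):
    """Fraction of positions where the called sequence matches the truth."""
    return sum(x == y for x, y in zip(called, truth)) / len(truth)


def simulate(depth, length=2000, error_rate=0.15, seed=0):
    """Build a random template, sample `depth` noisy reads from it,
    and return the accuracy of the majority-vote consensus."""
    rng = random.Random(seed)
    template = "".join(rng.choice(BASES) for _ in range(length))
    reads = [noisy_read(template, error_rate, rng) for _ in range(depth)]
    return accuracy(consensus(reads), template)


if __name__ == "__main__":
    # Print consensus accuracy at a few illustrative coverage depths.
    for depth in (1, 5, 15, 30):
        print(depth, round(simulate(depth), 4))
```

Because the errors in this model are independent from read to read, a wrong base rarely wins the vote once enough reads cover a position. A systematic error, one that recurs at the same position in every read, would not average out this way, which is why the randomness matters.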
“I think it’s because they weren’t offered anything else. It’s what you got,” says Myers.
We start off the series with Mike Snyder from Stanford, who explains how PacBio’s long read technology has opened up his research into the transcriptome. Often there are various RNA isoforms that are hard to analyze with Illumina’s short read technology, Mike says. He’s recently published a couple of papers showing that with PacBio’s long reads he is able to cover full-length RNA molecules completely, thereby characterizing areas that have not previously been annotated.
After that we’ll be talking with the former CSO of PacBio, Eric Schadt, now at the Icahn Institute at Mt. Sinai in New York. In his current job he’s working to bring sequencing to the clinic and says that the PacBio long reads are very important for getting a better picture of the genome. From Eric's interview:
“In order to drive the throughput super high, we’ve been ignoring a lot of the structural features in the genome that are as important as some of the single nucleotide hits, whether it’s long tandem repeats that vary, or bigger structural variations, or focal variants that are important in cancer--those things are difficult to characterize unambiguously with the current short read technology. [Short reads] were attuned to certain problems and had certain advantages that enabled this big advance, but they are absolutely not hitting the entire problem like we need [to] hit.”
In addition to improving our understanding of the transcriptome and structural variation of the genome, the long read technology is helping us nail down that troublesome area of the genome known as the HLA region. This is a region that holds much promise for biomedical research because not only has it defied easy characterization, it just happens to be connected to many of the common diseases we have.
Dan Geraghty has been sequencing the HLA region for many years. Some of his work was used in the original Human Genome Project. Dan says that long read sequencing is a game changer.
“Long reads is the NGS story of the year,” he told me in our pre-interview chat.
For now this long read story is pretty much owned by PacBio. But all of these researchers say they are platform agnostic and are happy to see new technologies on the horizon that promise long reads. There’s Oxford Nanopore and Genia and others, including Nabsys, which we’ve profiled here as well. Illumina offers their Moleculo technology, which assembles long reads from shorter reads, but not many have seen the datasets or other details about this technology.
So what does this mean for the future of NGS? Do long reads open up vast new territories in genomics that have yet to be discovered, or are they just a nice bonus? We’ll be pursuing these questions with other guests as well, including upcoming chats with Shawn Baker, CSO of the sequencing marketplace AllSeq, and with George Church of Harvard.