Data Storage in DNA
October 16, 2017 – UW Daily
UW scientists coded Miles Davis into DNA
In an unassuming office in the Computer Science and Engineering Building, Ph.D. student Lee Organick holds a small test tube in her hand. Maybe an inch and a half long, the tube doesn’t look particularly special.
“Theoretically, that test tube can hold many petabytes of data, and could probably store most of Facebook’s photo archive if you encoded it into DNA,” said Luis Ceze, a UW associate professor of computer science and engineering.
The data in warehouses full of computers, entire data centers, could be encoded into DNA with the volume of just a few sugar cubes and placed inside that test tube.
As the information technology industry grows, it is producing information faster than mainstream storage can keep up. There is an enormous gap between the amount of information produced and total storage capacity, and the industry is approaching its maximum storage limit.
“There is a clear trend towards trying to store information in as few atoms as possible,” Ceze said.
This concept is known as molecular storage, a term for data storage technologies that use molecules as the storage element rather than circuits, magnetic media, or inorganic materials.
Ceze works in the Molecular Information Systems Lab (MISL), a project co-sponsored by Microsoft that explores the “intersection of information technology and molecular-level manipulation using in-silico and wet lab experiments.” The project began three years ago and brings together students and faculty with backgrounds in programming, biology, and chemistry, among others.
The team’s latest effort to push the boundaries of DNA-based storage was part of a collaboration with Twist Bioscience and the Montreux Jazz Festival, which has one of the largest music archives in the world, posing an interesting data storage problem. MISL was able to successfully encode two archival-quality audio recordings in DNA, amounting to nearly 140 megabytes of data.
But why DNA?
Exploration of DNA as a storage medium began back in the ’60s, not long after the structure of DNA was worked out, but there has been a revival in the past few years.
“DNA is incredibly dense, about a million times denser than the densest information storage medium available today,” Ceze said. “It is compact and durable, making it the ideal storage medium for what is known as archival storage.
“Storage systems are built in hierarchies, with the fastest, most expensive storage being used for the things users access most frequently. At the bottom of the hierarchy lies cold storage or archival storage, where data that isn’t accessed as frequently is stored.”
The lab at the UW is focused specifically on information technology uses of DNA, and it started small-scale experiments on random-access storage about two years ago.
The theoretical limit was calculated as one exabyte, or one million terabytes, per cubic millimeter, a volume roughly equal to a few grains of sand. That is far denser than any electronic storage that exists today. For reference, storing the information held in a cubic millimeter of DNA would require a stack of one-terabyte hard drives about 6 miles high.
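The hard-drive comparison can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope estimate: the 9.5 mm drive thickness is an assumption (a common height for 2.5-inch drives) and does not come from the lab, and the function name is made up for illustration.

```python
TB_PER_EB = 1_000_000       # 1 exabyte = one million terabytes (decimal units)
DRIVE_THICKNESS_MM = 9.5    # ASSUMPTION: typical 2.5-inch drive height, not from the article
MM_PER_MILE = 1_609_344     # 1 mile = 1,609.344 meters

def stack_height_miles(exabytes, drive_tb=1.0, thickness_mm=DRIVE_THICKNESS_MM):
    """Height of a stack of hard drives holding `exabytes` of data, in miles."""
    drives = exabytes * TB_PER_EB / drive_tb   # number of drives needed
    return drives * thickness_mm / MM_PER_MILE

# One exabyte is the figure quoted for a cubic millimeter of DNA.
print(round(stack_height_miles(1.0), 1))  # 5.9
```

With thicker 3.5-inch desktop drives the stack would be taller still, so "about 6 miles" is on the conservative side.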
In the case of the jazz festival, the audio files were converted from binary code (0s and 1s) to the four nucleotide bases that make up a strand of DNA: A, C, G, and T. Theoretically, the entire six-petabyte collection would result in DNA smaller than one grain of rice.
Encoding the audio files into DNA took about a week and required an entire team ranging from computer scientists to biochemists.
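To make the idea of converting 0s and 1s into bases concrete, here is a minimal sketch of the simplest possible mapping, in which every pair of bits becomes one base. The mapping (00 to A, 01 to C, 10 to G, 11 to T) and the function names are assumptions chosen for illustration; this is not the encoding the lab used, and practical schemes add addressing and error correction on top.

```python
BASES = "ACGT"  # ASSUMPTION: 00->A, 01->C, 10->G, 11->T (illustrative mapping only)

def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, two bits per base, most significant bits first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)

def dna_to_bytes(strand: str) -> bytes:
    """Invert bytes_to_dna; the strand length must be a multiple of four."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)  # ValueError on a non-ACGT character
        out.append(byte)
    return bytes(out)

strand = bytes_to_dna(b"Hi")
print(strand)                # CAGACGGC
print(dna_to_bytes(strand))  # b'Hi'
```

At two bits per base, one byte takes four bases, so a file of N bytes needs 4N bases before any redundancy is added.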
“The pipeline is we get the bits, we encode it into DNA, create the list of sequences that need to be printed, and then we send this list back to Twist Bioscience,” Ceze said. “Then after a little while, when they are done with their processes, they send a FedEx envelope back to the lab.”
This is where the wet lab gets involved to amplify the DNA and add certain primers; after a few days, the DNA is run through the sequencer and converted back to digital data.
“Theoretically this entire process could be condensed down to seconds,” Ceze said. “But this project was interesting because it demonstrated a real use of the archive. It shows that [this type of storage] is becoming a reality.”
For DNA archival storage to become widespread, the cost would have to decrease substantially. The way scientists think about DNA would also have to change. In the life science industry, the sequences all have to be perfect. For data storage, however, the DNA doesn’t have to be perfect, because a lot more redundancy is built into the coding. Even if there are errors in the stored DNA, the data can still be recovered.
Increased automation is also key. Currently, the protocols used to prepare DNA for sequencing are extremely time-consuming; being able to automate the entire process would make the technology more widely applicable.
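One simple way to see how redundancy lets data survive errors is a majority vote across several noisy reads of the same strand. The sketch below is an illustration under stated assumptions, not the lab's method: it assumes every read has the same length and that errors are substitutions only (no inserted or deleted bases), and the function name is hypothetical. Real systems generally rely on stronger error-correcting codes.

```python
from collections import Counter

def consensus(reads):
    """Return the per-position majority base across equal-length reads.

    ASSUMPTION: reads are aligned and contain substitution errors only.
    """
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("all reads must have the same length")
    # zip(*reads) walks the reads column by column, one position at a time.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))

# Five reads of the same strand; four of them carry one wrong base each.
reads = ["CAGACGGC", "CAGTCGGC", "CAGACGGC", "CTGACGGC", "CAGACGGA"]
print(consensus(reads))  # CAGACGGC
```

No single read has to be perfect. As long as most reads agree at each position, the original sequence comes back intact.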
Right now, the team holds the world record for most data stored in DNA at 400 megabytes.
“This is potentially feasible in a decade,” Ceze said. “It would change everything.”