A Model Approach: How scientists wrangle Big Data

By Koren Wetmore

IN THIS INTERNET AGE, most of us can relate to the concept of information overload. But that’s nothing compared to the data volume that bombards research scientists.

Faced with 60 million DNA sequences or a record containing 750,000 data points, the last thing a biologist wants to do is perform an analysis using spreadsheets. Instead, today’s researcher reaches for computational and mathematical tools, producing models and scripts to process information at lightning speed.

Such tools not only improve the pace at which research moves but also broaden the scope of questions scientists can ask.

“If we’re looking for new photoproteins or fluorescent proteins involved in how a jellyfish makes light, we can find those easily now by sequencing an entire transcriptome of an organism,” says marine biologist Steve Haddock ’88. “Instead of saying, ‘We think it’s going to look like this, and we’re going to try and fish that gene out,’ we say, ‘What are all the genes this animal is expressing right now?’ It really allows you to think more about your science instead of your analysis.”

Haddock often applies computational skills in his work at the Monterey Bay Aquarium Research Institute, where he studies how jellyfish and other deep-sea gelatinous creatures interact with light. His research involves describing species, building phylogenetic trees and working with the DNA sequences and chemicals the organisms use to produce light. “There’s obviously a lot of computation,” he says. “So, we’ll have a program that will do each of the steps: cleaning up the sequences, pulling out those that look like they should be used to assemble, assembling the sequences and then doing diagnostics on the assembly.”
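For readers curious what such a step-by-step program might look like, here is a minimal Python sketch of the cleaning and filtering steps, plus one common assembly diagnostic, the N50 statistic. It is not Haddock's code: the function names, the length and ambiguity thresholds, and the sample reads are all hypothetical, and the assembly step itself is left to dedicated software because real assemblers are far more involved.

```python
def clean_read(read):
    """Uppercase a read and trim ambiguous 'N' bases from both ends."""
    return read.strip().upper().strip("N")


def keep_for_assembly(read, min_length=20, max_n_fraction=0.05):
    """Keep reads that are long enough and contain few ambiguous bases.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if len(read) < min_length:
        return False
    return read.count("N") / len(read) <= max_n_fraction


def n50(contig_lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together hold at least half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # no contigs were supplied


if __name__ == "__main__":
    # Hypothetical raw reads; a real run would parse them from a file.
    raw_reads = ["nnACGTACGTACGTACGTACGTACGTnn", "acgt", "ACGTNNNNNNACGTACGTACGTAC"]
    cleaned = [clean_read(r) for r in raw_reads]
    usable = [r for r in cleaned if keep_for_assembly(r)]
    print(f"{len(usable)} of {len(raw_reads)} reads kept for assembly")

    # Hypothetical contig lengths, as an assembler might report them.
    print("N50 of assembly:", n50([2, 3, 4, 5, 6]))
```

Keeping each step in its own small function mirrors the workflow Haddock describes: any one stage can be swapped out or rerun without touching the others.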

By translating their questions into mathematical language and then encoding that language in a computer program, researchers can build models to predict outcomes or identify potential causes of observed phenomena.

The Centers for Disease Control and Prevention, for example, uses mathematical models to determine how a disease might spread in a given population and how to potentially stop or minimize its spread. Cancer researchers use models to explore potential ways to treat or prevent various cancers.
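To give a flavor of how a disease-spread model works, here is a minimal Python sketch of a classic textbook approach, the SIR model, which sorts a population into susceptible, infected and recovered groups. It is not the CDC's model; the function name `simulate_sir`, the population size and the rate values are illustrative assumptions only.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, dt=0.1):
    """Simulate a basic SIR epidemic using simple Euler time steps.

    beta  -- transmission rate per day (assumed, not fitted to data)
    gamma -- recovery rate per day (assumed, not fitted to data)
    Returns a list of (day, susceptible, infected, recovered) tuples,
    one entry per whole day, starting with day 0.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    steps_per_day = round(1 / dt)
    history = [(0, s, i, r)]
    for day in range(1, days + 1):
        for _ in range(steps_per_day):
            # New infections depend on contact between the S and I groups.
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((day, s, i, r))
    return history


if __name__ == "__main__":
    # Compare a baseline outbreak with one where contact is halved.
    baseline = simulate_sir(10_000, 10, beta=0.4, gamma=0.1, days=160)
    reduced = simulate_sir(10_000, 10, beta=0.2, gamma=0.1, days=160)
    peak_baseline = max(i for _, _, i, _ in baseline)
    peak_reduced = max(i for _, _, i, _ in reduced)
    print(f"Peak infected, baseline: {peak_baseline:.0f}")
    print(f"Peak infected, reduced contact: {peak_reduced:.0f}")
```

Running the two scenarios side by side shows, in miniature, the kind of question such models answer: lowering the transmission rate lowers the peak number of people infected at one time.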

In fact, according to a study published last August in the Annals of Internal Medicine, a risk-prediction model developed at the University of Liverpool’s Cancer Research Center proved more accurate at determining a person’s lung cancer risk than assessments based upon family history or years of smoking…


Published in HMC Bulletin