SciTech

Kingsford receives grant for genetic data searching research

Carl Kingsford, associate professor at Carnegie Mellon’s Lane Center for Computational Biology, will use his Moore Investigators in Data-Driven Discovery Award to research searching algorithms for genetic data. (credit: Carl Kingsford)

Millions of people use Google every day to search and navigate the Internet. Thanks to smarter search algorithms, finding a match for a keyword or phrase has become faster and more efficient, even as the Internet continues to grow. Similarly, as the collection of genetic data grows, computational biologists are striving to develop their own algorithms so that the data can be used efficiently for advances in biology and medicine.

In support of this research, the Gordon and Betty Moore Foundation has awarded Carl Kingsford, an associate professor at Carnegie Mellon’s Lane Center for Computational Biology, one of its fourteen Moore Investigators in Data-Driven Discovery Awards. This five-year, $1.5 million grant will support Kingsford and his research team in their efforts to develop advanced algorithms for searching genetic data.

The past few decades have seen a burst in available genomic data, driven by improvements in biological techniques and increased interest in understanding biology from a genomic perspective. Much of this growth can be seen in the databases of the National Center for Biotechnology Information (NCBI). Kingsford explained that the Moore award was given with the mission of “finding new uses for existing large datasets to use the data in ways that those who collected the data may not have expected.”

Genomic data exists primarily as sequences of four different letters that represent the four different bases in a DNA or RNA molecule. It is now understood that the specific sequence of letters is key to biological function. A searching algorithm has to be able to take a sequence of letters of any length and search through a massive database for other, similar sequences.

Kingsford says that many ask, “Why not just ‘Google’ it?” However, he explained that searching for a string of words is completely different from searching through biological data. First, there are many types of data, from DNA sequences to high-throughput RNA sequence data, and each requires different techniques to match sequences. Furthermore, exact matches are rare, so sequences are compared based on their similarity to judge whether they are likely to share the same function in an organism.
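To illustrate why similarity matters more than exact matching, here is a minimal Python sketch that scores two sequences by edit distance, the number of single-letter changes needed to turn one into the other. Edit distance is only one simple stand-in for a similarity measure, chosen here for illustration. The article does not say which measure Kingsford's group uses, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of single-letter insertions, deletions, or
    substitutions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Illustrative score between 0 and 1; 1 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


# Two sequences that differ by a single letter are not an exact match,
# but they still score as highly similar.
print("ACGTTGCA" == "ACGTAGCA")
print(similarity("ACGTTGCA", "ACGTAGCA"))
```

An exact comparison rejects the pair outright, while the similarity score shows that they differ in only one of eight positions.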

For biologists, this information is extremely important. For example, experiments that analyze a biological function or a disease may identify that a sequence of DNA is highly expressed. In order to identify the function of this sequence and the biological role it plays, biologists need the ability to compare the sequence to those with known functions in the database.

Currently, Kingsford’s group is working to develop a unique method of building a database that will make it easier to search through gene expression data. Kingsford explained that each sequence is first broken into smaller fragments of length k called k-mers. The k-mers are then stored in structures called Bloom filters, which compactly record which of all the possible k-mers are present. These Bloom filters are then used to construct a tree in which the filter at any point of the tree contains all the k-mers found in the branches below that point. Thus, when searching for a match, instead of searching through all the sequences and comparing similarities, this approach quickly identifies where in the database the matching sequences exist by counting the number of matching k-mers.
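The approach described above can be sketched in a few dozen lines of Python. This is a toy illustration under stated assumptions, not the group's actual software: the class and function names, the filter size, the number of hash functions, and the match threshold are all hypothetical choices made for the example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k fragments (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class BloomFilter:
    """Compact, approximate set. It can report a k-mer that was never
    added (false positive) but never misses one that was added."""

    def __init__(self, size=1024, num_hashes=3):  # assumed toy parameters
        self.size = size
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other):
        merged = BloomFilter(self.size, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


class Node:
    def __init__(self, bloom, name=None, children=()):
        self.bloom = bloom
        self.name = name
        self.children = list(children)


def leaf(name, seq, k):
    """A leaf holds the k-mers of one sequence."""
    bloom = BloomFilter()
    for kmer in kmers(seq, k):
        bloom.add(kmer)
    return Node(bloom, name=name)


def parent(left, right):
    """An internal node holds every k-mer found in the branches below it."""
    return Node(left.bloom.union(right.bloom), children=[left, right])


def search(node, query_kmers, threshold):
    """Return names of leaves containing at least `threshold` (a fraction)
    of the query k-mers, skipping branches that cannot match."""
    hits = sum(1 for kmer in query_kmers if kmer in node.bloom)
    if hits < threshold * len(query_kmers):
        return []  # too few matching k-mers: prune this whole branch
    if not node.children:
        return [node.name]
    results = []
    for child in node.children:
        results.extend(search(child, query_kmers, threshold))
    return results


K = 4
sample_a = leaf("sample_a", "ACGTACGTGGCA", K)
sample_b = leaf("sample_b", "TTTTGGGGCCCC", K)
root = parent(sample_a, sample_b)
print(search(root, kmers("ACGTACGT", K), threshold=0.8))
```

In this sketch the query's k-mers all appear in sample_a, so the search descends into that leaf and prunes sample_b after a single filter check. On a large tree, that pruning is what avoids comparing the query against every stored sequence.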

Kingsford explained that improvements in searching and utilizing genomic data will help reduce the number of experiments biologists will have to conduct, and help speed up the rate of discovery.

Kingsford’s open-source project, Sailfish, was recently shown to drastically reduce the time it takes to analyze gene expression data. The Data-Driven Discovery Award strongly supports open-source projects and researchers who have experience and success in this field. Kingsford and his team will use the funding to continue developing smarter techniques to aid advances in biological research.