Neural network used to analyze DNA

By adapting computational methods originally used to synthesize missing pixels in low-quality images or videos, computational biologists at Carnegie Mellon University have recovered missing information about how DNA is organized in a cell.

“Filling in this missing information will make it possible to more readily study the 3D structure of chromosomes and, in particular, subcompartments that may play a crucial role in both disease formation and determining cell functions,” said Associate Professor of Computational Biology Jian Ma in a press release.

In a research paper published on Nov. 7 in the journal Nature Communications, Ma and Kyle Xiong, a Carnegie Mellon Ph.D. student in the joint computational biology program with the University of Pittsburgh, explained how they applied a machine learning method to nine cell lines. Doing so allowed them to compare the spatial organization of subcompartments across these lines. Before this method was developed, researchers could see subcompartments in only one cell line, a lymphoblastoid line known as GM12878. This line has been sequenced extensively, and at considerable expense, by researchers using Hi-C technology, which measures spatial interactions across all regions of the genome.

“We now know a lot about the linear composition of DNA in chromosomes, but in the nuclei of human cells, DNA isn’t linear. Chromosomes in the cell nucleus are folded and packaged into 3D shapes. That 3D structure is critical to understanding the cellular functions in development and diseases,” remarked Xiong.

To Ma, Xiong, and others studying this topic, subcompartments are particularly interesting because they reflect the spatial segregation of highly interactive chromosome regions.

According to Ma, scientists want to learn more about how subcompartments are juxtaposed and how this juxtaposition affects cell function. Before Ma and Xiong’s work, however, researchers could calculate subcompartment patterns only from Hi-C datasets with extremely high coverage. In other words, they could calculate these patterns only if the DNA had already been sequenced in great detail. That is true of GM12878 but not of other cell lines.
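To make the idea of coverage concrete, the following is a minimal Python sketch, not taken from the paper, of how a deeply sequenced contact matrix turns sparse when only a small fraction of its reads is kept. The toy matrix, the distance-decay formula, the 2 percent keep fraction, and the `downsample` helper are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins = 50

# Hypothetical "deep" contact counts: bins that are close together get more contacts.
idx = np.arange(n_bins)
distance = np.abs(idx[:, None] - idx[None, :])
expected = 200.0 / (1.0 + distance)
deep = rng.poisson(expected)
deep = np.triu(deep) + np.triu(deep, 1).T  # make the matrix symmetric


def downsample(counts, keep_fraction, rng):
    """Keep each read independently with probability keep_fraction."""
    upper = rng.binomial(np.triu(counts), keep_fraction)
    return upper + np.triu(upper, 1).T  # mirror the upper triangle


shallow = downsample(deep, 0.02, rng)

empty_deep = np.mean(deep == 0)
empty_shallow = np.mean(shallow == 0)
print(f"fraction of empty entries, deep data:    {empty_deep:.3f}")
print(f"fraction of empty entries, shallow data: {empty_shallow:.3f}")
```

In this toy setting, the shallow matrix has far more empty entries than the deep one, which is the kind of gap that makes patterns hard to compute directly from low-coverage data.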

Ma and Xiong used a denoising autoencoder, a type of neural network, to fill in the gaps in incomplete Hi-C datasets. In computer vision, such an autoencoder learns which kinds of pixels tend to occur together and uses that knowledge to estimate what a missing pixel should be. Xiong adapted the approach to high-throughput genomics. He trained the autoencoder on the GM12878 dataset, which eventually allowed it to recognize sequences of DNA pairs from different chromosomes that may be interacting with other DNA pairs in the cell nucleus.
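The general technique can be sketched in a few lines of Python. This is a toy denoising autoencoder written with NumPy. It is not the researchers' architecture or code, and it is trained on synthetic rows rather than on GM12878 data. The function names, layer sizes, learning rate, and the two-pattern toy data are all assumptions made for illustration. The network sees rows with half of their entries blanked out and is trained to reproduce the complete rows.

```python
import numpy as np


def make_toy_rows(n_rows, n_cols, rng):
    """Synthetic rows drawn from two block patterns plus mild noise.

    A hypothetical stand-in for contact profiles; not real Hi-C data.
    """
    half = n_cols // 2
    pattern_a = np.concatenate([np.ones(half), np.zeros(n_cols - half)])
    pattern_b = 1.0 - pattern_a
    labels = rng.integers(0, 2, size=n_rows)
    rows = np.where(labels[:, None] == 0, pattern_a, pattern_b)
    return rows + 0.05 * rng.standard_normal((n_rows, n_cols))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_denoising_autoencoder(clean, hidden=8, drop_prob=0.5,
                                epochs=2000, lr=0.1, seed=0):
    """Train a one-hidden-layer autoencoder to rebuild rows from gappy copies."""
    rng = np.random.default_rng(seed)
    n, d = clean.shape
    w1 = 0.1 * rng.standard_normal((d, hidden))
    b1 = np.zeros(hidden)
    w2 = 0.1 * rng.standard_normal((hidden, d))
    b2 = np.zeros(d)
    for _ in range(epochs):
        # Corrupt the input: blank out a fresh random subset of entries.
        keep = rng.random(clean.shape) > drop_prob
        corrupted = clean * keep
        # Forward pass: sigmoid encoder, linear decoder.
        h = sigmoid(corrupted @ w1 + b1)
        out = h @ w2 + b2
        # Backward pass for the squared error against the CLEAN rows.
        err = (out - clean) / n
        dh = (err @ w2.T) * h * (1.0 - h)
        w2 -= lr * (h.T @ err)
        b2 -= lr * err.sum(axis=0)
        w1 -= lr * (corrupted.T @ dh)
        b1 -= lr * dh.sum(axis=0)
    return w1, b1, w2, b2


def reconstruct(params, x):
    """Run gappy rows through the trained network."""
    w1, b1, w2, b2 = params
    return sigmoid(x @ w1 + b1) @ w2 + b2


rng = np.random.default_rng(1)
train_rows = make_toy_rows(200, 20, rng)
test_rows = make_toy_rows(50, 20, rng)
params = train_denoising_autoencoder(train_rows)

# Blank out half of each held-out row, then compare against the full rows.
keep = rng.random(test_rows.shape) > 0.5
gappy = test_rows * keep
baseline_err = np.mean((gappy - test_rows) ** 2)
model_err = np.mean((reconstruct(params, gappy) - test_rows) ** 2)
print(f"error with gaps left as zeros: {baseline_err:.4f}")
print(f"error after reconstruction:    {model_err:.4f}")
```

The key design choice is that the loss compares the output with the complete row, not with the corrupted input, so the network is pushed to learn which entries tend to occur together instead of simply copying what it is given.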

Ma and Xiong have named this computational method SNIPER. It has successfully identified subcompartments in eight cell lines whose Hi-C datasets were incomplete. As a control, SNIPER was also applied to the GM12878 data. According to Xiong, the researchers do not yet know whether the tool can be used on every other cell type. Xiong and Ma are continuing to develop the method so that it can be used under many cellular conditions and in other organisms.

“We need to understand how subcompartment patterns are involved in the basic functions of cells, as well as how mutations can affect these 3D structures. Thus far, in the few cell lines we’ve been able to study, we see that some subcompartments are consistent across cell types, while others vary,” remarked Ma. “Much remains to be learned.”