Producing research that combines computational novelty with biological importance and impact is a key motivator for computer scientist Xin Gao. His group at KAUST has seen a recent surge in publications. Since January 1, 2018, they have produced 27 papers, including 11 published in the top three computational biology journals and seven presented at the top artificial intelligence and bioinformatics conferences.
Originally from China, Gao joined KAUST in 2010 after a stint with the University of Waterloo in Canada and a prestigious fellowship at Carnegie Mellon University in the U.S. His group collaborates closely with experimental scientists to develop novel computational methods to solve key open problems in biology and medicine, he explains. “We work on building computational models, developing machine-learning techniques, and designing efficient and effective algorithms. Our focus ranges from analyzing protein amino acid sequences to determining their 3D structures to annotating their functions and understanding and controlling their behaviors in complex biological networks,” he says.
Gao describes one-third of his lab’s research as methodology driven, where the group develops theories and designs algorithms and machine-learning techniques. The other two-thirds is driven by problems and data. One example of his methodology-driven research is work [1] on improving non-negative matrix factorization (NMF), a dimension-reduction and data-representation tool comprising a group of algorithms that decompose a complex dataset expressed as a matrix.
NMF is used to analyze samples that have many features, not all of which may be important for the purpose of the study. It breaks down the data to reveal patterns that can indicate importance. Gao’s team improved on NMF by developing max-min distance NMF (MMDNMF), which runs through a very large amount of data to highlight the high-order features that describe a sample more efficiently.
To demonstrate their approach, Gao’s team applied the technique to human faces, using the images of 11 people with different expressions. Each image was treated as a sample with 1,024 features. After MMDNMF was trained to derive data representing the features of each face, it could assign any black-and-white facial image more accurately than traditional NMF could.
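The decomposition that NMF performs can be illustrated with a short sketch. The code below implements traditional NMF using the widely known multiplicative update rules; it is not the MMDNMF method from the paper. The data are synthetic, and the function name `nmf`, the choice of 30 samples and the rank of 5 are assumptions made purely for illustration. Only the figure of 1,024 features per sample comes from the text above.

```python
import numpy as np

def nmf(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix X (features x samples) as W @ H.

    Plain NMF with multiplicative updates; NOT the MMDNMF variant.
    W holds `rank` basis patterns, H holds each sample's weights on them.
    """
    rng = np.random.default_rng(seed)
    n_features, n_samples = X.shape
    W = rng.random((n_features, rank))
    H = rng.random((rank, n_samples))
    for _ in range(n_iter):
        # Element-wise updates keep every entry of W and H non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic stand-in data: 30 hypothetical samples, 1,024 features each.
rng = np.random.default_rng(42)
X = rng.random((1024, 30))

W, H = nmf(X, rank=5)
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(W.shape, H.shape, rel_error)
```

Each column of `H` is a five-number representation of one sample, which is the kind of compact description that a downstream classifier could work with in place of the original 1,024 features.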
Opening biology’s Pandora’s box
Gao has many successful collaborations with KAUST researchers, but he says one of the most successful is with structural biologist Stefan Arold.
Together, they have worked on several projects, including one [2] that has led to a computational pipeline that can help pharmaceutical companies discover new protein targets for existing, approved drugs.
“Drug repositioning is commercially and scientifically valuable,” explains Gao. “It can reduce the time needed for drug development from 20 to 6 years, and the costs from around 2 billion USD to 300 million USD. The National Institutes of Health in the United States estimates that 70 percent of drugs on the market can potentially be repositioned for use in other diseases.”
Gao discovered that methods for drug repositioning face several challenges: they rely on very limited amounts of information and usually focus on a single drug or disease, leading to results that aren’t statistically meaningful.
However, Gao’s computational pipeline can integrate multiple sources of information on existing drugs and their known protein targets to help researchers discover new targets.
The model was tested for its ability to predict targets for a number of drugs and small molecules, including a known metabolite in the body called coenzyme A (CoA), which is important in many biological reactions, including the synthesis and oxidation of fatty acids. It predicted 10 previously unknown protein targets for CoA. Gao chose the top two, and Arold and his colleagues then tested whether they really did interact with CoA.
The collaboration verified Gao’s predictions, and the computational pipeline is now being patented in several countries. It could eventually be licensed to pharmaceutical companies to enable already-approved drugs to be used for treating other diseases. The method can also help drug companies understand the molecular basis for drug toxicities and side effects.
“What makes our collaboration so synergistic is that our areas of expertise provide the minimal overlap needed to understand each other without creating redundancy,” says Arold. “He brings the computational side and I bring the experimental side to the table. Our worlds touch, but don’t overlap. Our discussions complement each other in a very stimulating way, without stumbling over too many semantic hurdles.”
Another collaboration of Gao and Arold’s involves enhancing the analysis of data gathered by electron microscopy. Arold explains that despite much progress in electron microscopy hardware and software—allowing it to be used to determine the 3D structures of proteins and other biomolecules—the analysis of its data still needs to be improved. Gao and Arold are developing methods to reduce noise and thus improve the resolution of electron microscopic images of complex biomolecular particles.
They are also developing processes that can automate the interpretation of genetic variants and that enhance the process of assigning functions to genes. “If you put us together in a room for more than 15 minutes, we will probably come up with a new idea!” says Arold.
Improving current technologies
Other research by Gao’s team includes a computational approach that can simulate a genetic sequencing technology called nanopore sequencing. Gao’s DeepSimulator [3] can be used to evaluate newly developed downstream software for nanopore sequencing. It can also save time and resources through experimental simulations, reducing the need for real experiments.
His team also recently developed Gracob [4], a method used to sift through genetic information and determine what pathways are turned on in microorganisms by stressful conditions, such as changes in acidity or temperature or exposure to antibiotics. This can identify genes that are dispensable under normal conditions but essential when the microorganism is stressed.
A collaborative environment
KAUST has a unique infrastructure for bioinformatics research, says Gao. “Bioinformatics people may often be considered just service providers,” he explains. “But at KAUST, we have an equal dialogue between experimental and computational scientists. This is possible because we have flexible resources in terms of grants, facilities and manpower, which means that we do not need to work on very specific predefined topics. If we see opportunities, then we can pursue them.”
Gao says he tries to leverage this to stand in the middle between computer science and biology. “I want my work to be appreciated by both communities. That’s why I collaborate with a lot of biologists. Stefan Arold is one of my closest collaborators. But I also work with several other KAUST faculty members: I work with Mo Li on nanopore sequencing and single-particle cryo-EM, Takashi Gojobori on molecular evolution and comparative genomics, Carlos Duarte on metagenomics of oceans, Charlotte Hauser on self-assembling peptides, and Samir Hamdan on enzymatic activities at the single-molecule level.”
1. Gao, X. & Wang, J.J. Max-min distance nonnegative matrix factorization. Neural Networks 61, 75–84 (2015).
2. Naveed, H., Hameed, U.S., Harrus, D., Bourguet, W., Arold, S.T. & Gao, X. An integrated structure- and system-based framework to identify new targets of metabolites and known drugs. Bioinformatics 31, 3922–3929 (2015).
3. Li, Y., Han, R., Bi, C., Wang, S. & Gao, X. DeepSimulator: a deep simulator for Nanopore sequencing. Bioinformatics 34, 2899–2908 (2018).
4. Alzahrani, M., Kuwahara, H., Wang, W. & Gao, X. Gracob: a novel graph-based constant-column biclustering method for mining growth phenotype data. Bioinformatics 33, 2523–2531 (2017).