
Bioscience | Computer Science

Can AI finally bring order to biology’s data deluge?

How a "Super Transformer" model could unify genomics, imaging and cellular data — and help biologists see the whole.

The “super transformer” AI architecture could integrate data from multiple technologies to reveal a more coherent picture of life inside cells and tissues.

Modern biology is awash in data. Scientists can sequence DNA, track gene activity cell-by-cell, map proteins in space, and image tissues at microscopic resolution. Yet putting all that information together into a cohesive view remains a struggle.

A KAUST-led vision for artificial intelligence (AI) could help bridge that gap. Members of the AI4BioMedicine lab in the Biomedical Division have described an AI system that combines multiple biological data modalities in a single model. Called a “super transformer”, the new AI architecture aims to turn today’s fragmented measurements from different technologies into a more coherent picture of life inside cells and tissues[1].

“This bridges the gaps between siloed computational approaches,” says Jesper Tegnér, professor of bioscience and computer science at KAUST who led the work.

“Such integration will be necessary if AI is to move beyond narrow, single-purpose biological analyses,” says research scientist Sumeer Khan, who co-authored the paper with Xabier Martínez de Morentin, a postdoctoral researcher. The proposed architecture, he explains, is meant to “facilitate scalable integration across data types” and thus provide a framework that can be used across genomic and biomedical research.

The idea fits into a broader effort by Tegnér and his KAUST colleagues to build AI systems that can both integrate biological measurements and explain what they have inferred from them. A single model that can learn from DNA sequences, gene activity, tissue images, and other data simultaneously could begin to link cause and effect across levels of biology, connecting genetic changes to altered cells, tissues, and, eventually, disease.

The appeal of such a system becomes clearer when considering how biological data are analyzed today. Most computational tools are built for a single task: one algorithm for DNA sequences, another for single-cell gene expression, another for tissue images. Integrating their outputs often requires bespoke pipelines, expert judgment, and guesswork. And as datasets grow in scale and complexity, this patchwork approach begins to fray.

Transformers offer one possible way forward. Originally developed for language processing, they were designed to understand how words relate to one another across sentences, paragraphs or entire documents.

Their key innovation lies in a machine-learning technique called “attention,” which enables models to weigh relationships across a dataset. Rather than processing information strictly in order, a transformer learns which elements matter most to one another, even when they are far apart. That ability proved essential for systems that translate languages or summarize documents, and the same logic applies to biology.
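The attention idea can be sketched in a few lines of Python. This is a generic illustration of scaled dot-product attention using NumPy, not code from the KAUST paper: every element computes a relevance score against every other element, so distant pairs can influence each other just as strongly as adjacent ones.

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Generic scaled dot-product attention.

    Each output row is a weighted average of `values`, with weights
    reflecting how strongly each query relates to each key --
    regardless of how far apart the elements sit in the sequence.
    """
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)            # pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ values, weights

# Toy example: 4 "tokens" (words, genes, image patches...) with
# 3-dimensional features, attending to themselves (self-attention).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
out, w = scaled_dot_product_attention(x, x, x)
# Each row of `w` sums to 1: every token distributes its attention
# across all others, near or far, in proportion to similarity.
```

In a trained transformer the queries, keys and values are learned projections of the input rather than the raw features used here, but the weighting logic is the same.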

Biological systems are full of distant and indirect interactions. Genes influence one another across long stretches of DNA. Cells respond to signals from their neighbors and from faraway tissues. Molecular events, including disease-related deficiencies, can ripple upward to shape organs and whole organisms. In that sense, life — and disease — has a grammar of its own, and transformers trained on enough data may be able to learn it.

The “super” in “super transformer” reflects an ambition to extend this approach. Rather than applying transformers to a single data type at a time, the KAUST team envisions an architecture that can handle multiple modalities simultaneously. In their proposal, DNA sequences, gene-expression profiles, spatial maps and images would all be translated into a shared internal representation, then linked through the same attention-based machinery.
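The shared-representation idea can be sketched schematically. This is a hypothetical illustration under assumed dimensions, not the architecture from the paper: each modality gets its own encoder that projects its raw features into a common embedding width, after which a single attention mechanism could operate over all the resulting tokens at once.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-modality encoders: each projects its raw features
# into the same shared embedding dimension (here 8).
D_SHARED = 8

def make_encoder(input_dim):
    W = rng.normal(size=(input_dim, D_SHARED)) / np.sqrt(input_dim)
    return lambda x: x @ W

encode_dna  = make_encoder(16)   # e.g. one-hot k-mer features
encode_expr = make_encoder(20)   # e.g. gene-expression profiles
encode_img  = make_encoder(32)   # e.g. image-patch features

# Toy inputs: a few "tokens" per modality.
dna  = rng.normal(size=(5, 16))
expr = rng.normal(size=(3, 20))
img  = rng.normal(size=(4, 32))

# Translate every modality into the shared space, then stack the
# tokens so one attention-based model could relate them all.
tokens = np.vstack([encode_dna(dna), encode_expr(expr), encode_img(img)])
print(tokens.shape)  # (12, 8): 5 + 3 + 4 tokens, one shared width
```

Once every measurement lives in the same space, the attention machinery described above can weigh relationships between, say, a DNA token and an image token just as it would between two words in a sentence.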

That vision comes with important caveats. As Tegnér and his collaborators reported late last year, architectural choices and the fine structure of large neural networks can strongly influence robustness, bias, and interpretability, particularly when biological data are noisy or incomplete[2]. Scaling models without careful design can amplify spurious correlations rather than reveal meaningful structure.

Those concerns motivate the design of the next generation of AI systems tailored for biomedicine — with innovations emerging from across KAUST laboratories. Taken together, these efforts point toward a future in which AI acts less like a collection of specialized tools and more like a unifying layer for biology. It could integrate diverse data, reason across them, and return answers that make sense to human researchers.

In that light, the “super transformer” is best understood not as a finished product, but as a blueprint for how biological AI might finally be built to connect scales rather than fragment them.

Many practical challenges remain, from data standardization to computational cost. Still, Tegnér argues that the direction is clear. Biology no longer lacks data; it lacks a way to see the whole.

References
  1. Khan, S.A., Martínez-de-Morentin, X., Alsabbagh, A.R., Maillo, A., Lagani, V., Gomez-Cabrero, D., Lehmann, R. & Tegner, J. Multimodal foundation transformer models for multiscale genomics. Nature Methods 23, 299–311 (2026).
  2. Zhang, H., Yang, C. H., Zenil, H., Chen, P. Y., Shen, Y., Kiani, N. A., & Tegnér, J. N. Leveraging network motifs to improve artificial neural networks. Nature Communications 16, 11495 (2025).