The MetaGraph framework. Credit: Nature (2025). DOI: 10.1038/s41586-025-09603-w
Rare genetic diseases can be diagnosed in patients and specific mutations in tumor cells detected: DNA sequencing revolutionized biomedical research decades ago. In recent years, new sequencing methods (next-generation sequencing) in particular have led to numerous scientific breakthroughs. In 2020/2021, for example, they enabled the rapid decoding and global monitoring of the SARS-CoV-2 genome.
Meanwhile, more and more researchers are making the results of sequenced DNA publicly accessible. This has given rise to huge volumes of data, which are stored in central databases such as the American SRA (Sequence Read Archive) or the European ENA (European Nucleotide Archive). Around 100 petabytes of data are stored there, roughly the same amount as all the text on the internet; one petabyte is the equivalent of one million gigabytes.
Until now, biomedical scientists have needed enormous computing power and other resources to search through this volume of DNA sequences and compare them with their own sequences, which made efficient searching in such mountains of data practically impossible. Computer scientists at ETH Zurich have now solved this problem.
Full-text search instead of downloading entire data sets
The scientists have developed a method that greatly shortens and simplifies this search. The research is published in the journal Nature.
The “MetaGraph” digital tool searches the raw data of all DNA or RNA sequences stored in the databases, just like a conventional internet search engine. After entering a sequence they are interested in as full text into a search mask, researchers can find out within seconds or minutes, depending on the query, where it has already appeared.
“It’s a kind of Google for DNA,” says Professor Gunnar Rätsch, data scientist at the Department of Computer Science at ETH Zurich. Until now, researchers could only search the databases for descriptive metadata. To access the raw data, they had to download the respective data sets. These searches were incomplete, time-consuming and expensive.
“MetaGraph” is comparatively inexpensive, as the researchers state in their study. The representation of all public biological sequences would fit on a few computer hard drives, while larger queries should cost no more than 0.74 dollars per megabase.
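To put the quoted price ceiling in concrete terms, the short sketch below converts a query size into a maximum cost. It assumes the 0.74 dollars per megabase figure applies linearly as an upper bound; the function name and the 5-megabase example are hypothetical and not taken from the study.

```python
# Illustrative upper bound only. The 0.74 dollars per megabase figure is the
# one quoted in the article; everything else here is a hypothetical example.
MAX_USD_PER_MEGABASE = 0.74
BASES_PER_MEGABASE = 1_000_000


def max_query_cost_usd(num_bases: int) -> float:
    """Return the assumed upper bound, in US dollars, for querying num_bases bases."""
    if num_bases < 0:
        raise ValueError("num_bases must be non-negative")
    return num_bases / BASES_PER_MEGABASE * MAX_USD_PER_MEGABASE


# A hypothetical query of 5 million bases (5 megabases).
print(f"{max_query_cost_usd(5_000_000):.2f}")  # prints 3.70
```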
Because the DNA search engine developed by the ETH researchers is both precise and efficient, it could help to accelerate genetic research, for example in the case of little-researched pathogens or new pandemics.
In this way, the tool could become a catalyst for research into antibiotic resistance, for example by identifying resistance genes in the databases, or useful viruses that can destroy bacteria, known as bacteriophages.
Compression by a factor of 300
In the study, the ETH researchers demonstrate how MetaGraph works: the tool indexes the data and presents it in compressed form. This is achieved by means of complex mathematical graphs that improve the structure of the data, similar to spreadsheet programs such as Excel. “Mathematically speaking, it is a huge matrix with millions of columns and trillions of rows,” as Rätsch states.
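The matrix picture can be illustrated with a toy index. The sketch below is not MetaGraph's code or API: it assumes that sequences are split into short overlapping substrings of fixed length (k-mers), which the article does not spell out, and it stores the "matrix" as an uncompressed dictionary in which each k-mer is a row and each sample label is a column. All names and sequences are made up for illustration.

```python
from collections import defaultdict


def kmers(sequence: str, k: int):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def build_index(samples: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map each k-mer (a row) to the set of sample labels (columns) containing it."""
    index = defaultdict(set)
    for label, sequence in samples.items():
        for kmer in kmers(sequence, k):
            index[kmer].add(label)
    return dict(index)


def search(index: dict[str, set[str]], query: str, k: int) -> dict[str, float]:
    """Return, per sample label, the fraction of the query's k-mers found in it."""
    query_kmers = list(kmers(query, k))
    if not query_kmers:
        return {}
    hits = defaultdict(int)
    for kmer in query_kmers:
        for label in index.get(kmer, ()):
            hits[label] += 1
    return {label: count / len(query_kmers) for label, count in hits.items()}


# Hypothetical toy data: two "samples" and one query sequence.
samples = {"sample_A": "ACGTACGTGA", "sample_B": "TTGACGTTTC"}
index = build_index(samples, k=4)
result = search(index, "ACGTGA", k=4)
print(result)
```

Here the query matches sample_A completely and sample_B only partially. A real system at the scale described in the article cannot hold such a dictionary in memory, which is why compressing both the graph and the matrix matters.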
The idea of making large amounts of data searchable with the help of indexes is standard practice in computer science research.
What is new about the work of the ETH researchers, however, is the complex linking of raw data and metadata and the compression by a factor of about 300. This is similar to a book summary: it no longer contains every word, but all the main storylines and connections remain intact. The result is more compact, yet without any relevant loss of information.
“We are pushing the limits of what is possible in order to keep the data sets as compact as possible without losing necessary information,” says Dr. André Kahles, who, like Rätsch, is a member of the Biomedical Informatics Group at ETH Zurich.
Compared with other DNA search masks currently being researched, the ETH researchers’ approach is scalable. This means that the larger the amount of data queried, the less additional computing power the tool requires.
Half of the data is already available now
The ETH researchers first presented MetaGraph in 2020 and have been continuously improving it ever since. The tool is already available for queries. It provides a full-text search engine for millions of sequence sets from DNA and RNA, as well as proteins from viruses, bacteria, fungi, plants, animals and humans.
At present, just under half of the sequence data sets available worldwide are indexed. According to Gunnar Rätsch, the rest should follow by the end of the year. Given that MetaGraph is available as open source, it may also be of interest to pharmaceutical companies that have large amounts of internal research data.
Kahles even believes it is possible that the DNA search engine will one day be used by private individuals. “In the early days, even Google didn’t know exactly what a search engine was good for. If the rapid development in DNA sequencing continues, it may become commonplace to identify your balcony plants more precisely.”
More information:
Mikhail Karasikov et al, Efficient and accurate search in petabase-scale sequence repositories, Nature (2025). DOI: 10.1038/s41586-025-09603-w
Citation:
‘Google for DNA’ enables rapid full-text searches of huge genetic archives (2025, October 9)
retrieved 9 October 2025
from https://medicalxpress.com/information/2025-10-google-dna-enables-rapid-full.html

