A pioneering collaboration has unveiled two vast scientific datasets that could help AI systems think across disciplines – from exploding stars to blood flow patterns – marking a significant step toward machines that can make unexpected connections between seemingly unrelated fields.
What if artificial intelligence could think like a Renaissance scientist, drawing insights across astronomy, biology, physics and more? The Polymathic AI project has taken a major stride toward this goal by releasing 115 terabytes of diverse scientific data – more than twice the size of the training data behind GPT-3 – specifically curated to help AI systems develop multidisciplinary scientific understanding.
“These groundbreaking datasets are by far the most diverse large-scale collections of high-quality data for machine learning training ever assembled for these fields,” explains Michael McCabe, a research engineer at New York City’s Flatiron Institute. “Curating these datasets is a critical step in creating multidisciplinary AI models that will enable new discoveries about our universe.”
The initiative draws its name from the concept of polymaths – those rare individuals whose expertise spans multiple fields. But rather than relying on singular brilliant minds, the project aims to encode cross-disciplinary thinking into AI systems themselves. The datasets encompass everything from James Webb Space Telescope galaxy portraits to simulations of biological systems and fluid dynamics.
“Machine learning has been happening for around 10 years in astrophysics, but it’s still very hard to use across instruments, across missions and across scientific disciplines,” notes Polymathic AI research scientist Francois Lanusse. “Datasets like the Multimodal Universe are what will allow us to build models that natively understand all of these data and can be used as a Swiss Army knife for astrophysics.”
The data is split into two major collections. The Multimodal Universe provides 100 terabytes of astronomical observations and measurements. The Well offers 15 terabytes of numerical simulations that use partial differential equations to model complex processes such as supernova explosions and embryo development. These equations are mathematical descriptions that emerge repeatedly across scientific fields.
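To make the idea of one equation recurring across fields concrete, here is a minimal sketch in Python. It is an illustration only and is not taken from the Well or from Polymathic AI's code: the function name `diffuse_1d`, the grid, and the coefficients are all assumptions chosen for this example. It solves the one-dimensional diffusion equation, a textbook partial differential equation that describes both heat spreading through a rod and dye spreading through still water, with a simple explicit finite-difference scheme.

```python
def diffuse_1d(u, coeff, dx, dt, steps):
    """Advance du/dt = coeff * d2u/dx2 with an explicit finite-difference scheme.

    The two end points are held fixed. All names and values here are
    hypothetical, chosen only to illustrate the equation.
    """
    r = coeff * dt / dx**2
    if r > 0.5:
        # The explicit scheme is only stable when r <= 0.5.
        raise ValueError("unstable step: need coeff * dt / dx**2 <= 0.5")
    u = list(u)
    for _ in range(steps):
        new = u[:]
        for i in range(1, len(u) - 1):
            # Each interior point relaxes toward the average of its neighbours.
            new[i] = u[i] + r * (u[i - 1] - 2 * u[i] + u[i + 1])
        u = new
    return u


# The same starting profile: a single spike in the middle of the grid.
initial = [0.0] * 10 + [1.0] + [0.0] * 10

# Only the coefficient changes between the two "fields".
heat = diffuse_1d(initial, coeff=1.0, dx=1.0, dt=0.25, steps=50)  # fast spreading
dye = diffuse_1d(initial, coeff=0.2, dx=1.0, dt=0.25, steps=50)   # slow spreading

print(max(heat), max(dye))
```

The solver never needs to know whether `u` stands for temperature or concentration. Only the coefficient differs, which is the sense in which a model that learns one such equation might carry that knowledge over to another discipline.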
“The freely available datasets are an unprecedented resource for developing sophisticated machine learning models that can then tackle a wide range of scientific problems,” says Ruben Ohana, a research fellow at the Flatiron Institute’s Center for Computational Mathematics. “The machine learning community has always been open-sourced; that’s why it’s been so fast-paced compared to other fields.”
Glossary
- Polymathic AI: Artificial intelligence systems designed to work across multiple scientific disciplines, similar to human polymaths who have expertise in many fields
- Machine Learning: A type of artificial intelligence that improves automatically through experience and data analysis
- Partial Differential Equations: Mathematical equations that describe many physical phenomena and appear repeatedly across different scientific fields
Test Your Knowledge
How large are the new datasets compared to GPT-3’s training data?
The new datasets total 115 terabytes, which is more than twice the size of GPT-3’s 45 terabytes of training data.
What are the two main collections in the released datasets?
The Multimodal Universe (100TB of astronomical data) and the Well (15TB of numerical simulations).
How do partial differential equations connect seemingly different scientific phenomena?
These equations appear in diverse processes from quantum mechanics to embryo development, providing mathematical descriptions that bridge different scientific fields.
What fundamental shift in AI development does this project represent compared to traditional scientific AI tools?
While traditional AI tools are purpose-built for specific applications, this project aims to develop truly polymathic models that can work across disciplines and find unexpected connections between fields.