AI Tools Extract Buried Experimental Data from Scientific Papers to Accelerate Materials Discovery

By FisherVista

TL;DR

NIMS researchers developed LLM tools to accelerate materials database construction, giving scientists a competitive edge in discovering new functional materials faster than traditional methods.

The Starrydata project uses LLMs to extract structured data from scientific papers, automating the conversion of complex information into organized databases for materials property analysis.

By digitizing and sharing experimental data globally, this research accelerates materials development for sustainable technologies, potentially improving energy efficiency and environmental solutions worldwide.

Researchers are using AI like ChatGPT to mine millions of scientific papers, transforming untapped experimental data into searchable databases that reveal hidden patterns in materials science.

Materials scientists developing new functional materials for technologies like smartphones and automobiles face significant challenges in predicting material properties, as theoretical models alone cannot provide reliable predictions. The complex relationship between materials and their properties means even slight differences in composition or synthesis methods can result in entirely different characteristics, traditionally requiring researchers to rely on intuition built through years of experience.

Machine learning offers a potential solution by learning empirical trends rather than relying solely on theory, potentially replicating researcher intuition computationally. Large language models (LLMs) like ChatGPT now enable flexible information extraction that takes background knowledge and context into account, opening possibilities for automating the conversion of complex information sources like scientific papers into structured data. Large-scale experimental datasets built this way could give researchers a comprehensive overview of existing data as a source of inspiration, and could let machine learning models predict properties from empirical trends.

A research team led by Dr. Yukari Katsura at the National Institute for Materials Science (NIMS) has developed two new tools to accelerate construction of Starrydata, a materials property database built from data collected from scientific papers. This work was recently published in the journal Science and Technology of Advanced Materials: Methods. "Graphs in the millions of papers published to date contain valuable experimental data collected by past researchers, and much of it remains untapped," says Katsura, who launched the Starrydata project in 2015.

The project initially relied on manual data collection supported by the independently developed Starrydata2 web system, successfully amassing unprecedented volumes of experimental data. The new AI tools further streamline this process. "We found that by specifying a data structure and giving instructions to an LLM, we can accurately and comprehensively extract information about figures, tables, and samples from the text of paper PDFs across a wide range of fields," Katsura explained. She noted that many publishers prohibit AI use on paper PDFs, so the system currently targets open-access papers.

The first tool, Starrydata Auto-Suggestion for Sample Information, reads paper text and suggests candidate entries for data fields pre-designed for each materials domain. Already integrated into the Starrydata2 web system, this function takes text that users paste from a paper's abstract or experimental methods section, sends it to an OpenAI GPT model through the API, and automatically displays candidate entries in English below each input field.
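The suggestion flow described above can be sketched roughly as follows. This is a minimal illustration, not the Starrydata2 implementation: the field names, the prompt wording, and the `call_llm` callable (a stand-in for whatever API client is actually used) are all assumptions made for the example.

```python
import json
from typing import Callable, Dict, List

# Hypothetical field schema for one materials domain.
# These are illustrative names, not Starrydata's actual data fields.
SAMPLE_FIELDS: Dict[str, str] = {
    "composition": "Chemical composition of the sample",
    "synthesis_method": "How the sample was prepared",
    "sample_form": "Physical form, e.g. bulk, thin film, single crystal",
}


def build_prompt(paper_text: str, fields: Dict[str, str]) -> str:
    """Assemble an instruction asking for one JSON object keyed by field name."""
    field_lines = "\n".join(f'- "{name}": {desc}' for name, desc in fields.items())
    return (
        "Extract sample information from the text below.\n"
        "Return a JSON object with exactly these keys; each value is a list "
        "of candidate strings (an empty list if the text does not say).\n"
        f"{field_lines}\n\nText:\n{paper_text}"
    )


def suggest_entries(
    paper_text: str,
    fields: Dict[str, str],
    call_llm: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Return candidate entries per field, tolerating malformed model output.

    `call_llm` is an assumed interface: it takes a prompt string and returns
    the model's raw text reply.
    """
    raw = call_llm(build_prompt(paper_text, fields))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}

    suggestions: Dict[str, List[str]] = {}
    for name in fields:  # keys outside the schema are ignored
        value = parsed.get(name, [])
        if isinstance(value, str):
            value = [value]  # accept a bare string as a single candidate
        elif not isinstance(value, list):
            value = []
        suggestions[name] = [str(v) for v in value if isinstance(v, (str, int, float))]
    return suggestions


if __name__ == "__main__":
    # Canned reply standing in for a real model call.
    def fake_llm(prompt: str) -> str:
        return '{"composition": ["Bi2Te3"], "synthesis_method": ["hot pressing"]}'

    for field, candidates in suggest_entries("...", SAMPLE_FIELDS, fake_llm).items():
        print(f"{field}: {candidates}")
```

Passing the model call in as a function keeps the schema handling and output validation testable without network access; a human still chooses among the candidates shown under each input field.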

The second tool, Starrydata Auto-Summary GPT, deconstructs entire open-access paper PDFs uploaded by users and automatically summarizes all descriptions of figures, tables, and samples appearing in the papers as structured JSON data. The tool was built with ChatGPT's custom GPT feature, and its output can be viewed as easy-to-read tables in a web browser. While this data isn't currently incorporated directly into the Starrydata database, it dramatically speeds up data collectors' work by helping them quickly locate target data and enter information. Reading data points from graph images remains difficult for LLMs, so this task is performed by data collectors using an independently developed semi-automated tool.
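To give a sense of what "structured JSON viewed as a table" might look like, here is a small sketch. The JSON layout, the key names, and the example entries are invented for illustration and are not the actual output format of Starrydata Auto-Summary GPT.

```python
import json
from typing import Dict, List

# Made-up example of a paper summary; the key names are assumptions.
SUMMARY_JSON = """
{
  "figures": [
    {"id": "Fig. 1", "description": "Seebeck coefficient vs. temperature",
     "samples": ["S1", "S2"]}
  ],
  "tables": [
    {"id": "Table 1", "description": "Nominal compositions",
     "samples": ["S1", "S2"]}
  ],
  "samples": [
    {"id": "S1", "description": "Undoped reference"},
    {"id": "S2", "description": "Doped sample"}
  ]
}
"""

COLUMNS = ["kind", "id", "description", "samples"]


def summary_rows(summary: Dict) -> List[Dict[str, str]]:
    """Flatten the figures/tables/samples lists into uniform rows."""
    rows = []
    for kind in ("figures", "tables", "samples"):
        for item in summary.get(kind, []):
            rows.append({
                "kind": kind[:-1],  # "figures" -> "figure"
                "id": item.get("id", ""),
                "description": item.get("description", ""),
                "samples": ", ".join(item.get("samples", [])),
            })
    return rows


def render_table(rows: List[Dict[str, str]], columns: List[str]) -> str:
    """Render rows as a fixed-width plain-text table."""
    widths = {c: max([len(c)] + [len(r[c]) for r in rows]) for c in columns}
    header = " | ".join(c.ljust(widths[c]) for c in columns)
    rule = "-+-".join("-" * widths[c] for c in columns)
    body = [" | ".join(r[c].ljust(widths[c]) for c in columns) for r in rows]
    return "\n".join([header, rule, *body])


if __name__ == "__main__":
    print(render_table(summary_rows(json.loads(SUMMARY_JSON)), COLUMNS))
```

Listing which samples each figure or table refers to is the kind of cross-reference that could help a data collector find the right graph quickly before digitizing its data points by hand.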

"A paper is a logical structure assembled to convey the author's claims, but by deconstructing it and returning it to the form of experimental data, other researchers can also use it for their own research," says Dr. Katsura. "In this way, we are aiming for a future where experimental data from all materials science fields can be shared in digital format and viewed from a bird's-eye perspective."

So far, Starrydata has built up databases for specific materials science fields, such as thermoelectric materials, which convert between heat and electricity, and magnets. As an open dataset that can be used for developing new materials, it is starting to see use, mainly among leading researchers worldwide. The team continues its research to raise broader awareness of the potential of large-scale experimental data and to establish data collection from papers as a recognized form of research within the scientific community.

Curated from NewMediaWire
