September 21, 2020 -- An analysis of COVID-19 scientific abstracts using natural language processing and machine-learning techniques suggests that the literature currently lacks basic research on the pathogenesis and viral transmission of SARS-CoV-2, according to a September 16 article published in Patterns.
The study set out to identify research areas in the current COVID-19 literature that are understudied relative to earlier research on other coronaviruses, on the premise that these areas may correspond to knowledge gaps in the study of the virus and its transmission.
Volunteer scientists from the COVID-19 Dispersed Volunteer Research Network analyzed more than 137,000 abstracts through July 31, 2020, from the COVID-19 Open Research Dataset (CORD-19), a corpus of scientific papers on COVID-19 and related coronaviruses that includes studies published in PubMed Central and those archived in the preprint servers bioRxiv and medRxiv. They analyzed the full text of the abstracts rather than keywords alone, the traditional approach, which allowed for deeper insights into the literature, according to the authors.
The analysis found that COVID-19 studies to date are primarily clinical-, modeling- or field-based, in contrast to the vast quantity of laboratory-driven research for other (non-COVID-19) coronavirus diseases.
"In a crisis like this pandemic, we would expect research outside the lab to happen at a faster pace than lab research," said first author Anhvinh Doanvo, data scientist with the COVID-19 Dispersed Volunteer Research Network, in a statement. "Nevertheless, the relative lack of lab-based studies seems to be unique to SARS-CoV-2, compared to other human coronaviruses. This shortage of lab-based research means that the scientific community may miss key aspects of the virus that could impact our ability to contain this pandemic and to counter future ones."
The researchers also analyzed the specific topics covered in the COVID-19 and non-COVID-19 abstracts using a technique called latent Dirichlet allocation topic modeling. This analysis revealed that the COVID-19 publications tended to focus on public health, outbreak reporting, clinical care, and testing for coronaviruses.
In contrast, there was a relative lack of papers focused on the basic microbiology of COVID-19, including pathogenesis and transmission.
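For readers unfamiliar with the technique, latent Dirichlet allocation (LDA) treats each document as a mixture of topics and each topic as a distribution over words. The sketch below illustrates the idea with a small collapsed Gibbs sampler written with the Python standard library only. It is not the authors' pipeline: the four toy "abstracts," the function name `fit_lda`, and the values of `n_topics`, `alpha`, and `beta` are all assumptions made for illustration.

```python
import random
from collections import Counter


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    Returns per-document topic proportions and per-topic word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * n_topics for _ in docs]          # tokens per topic, per doc
    topic_word = [Counter() for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                        # tokens per topic overall
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed topic proportions for each document (each row sums to 1).
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word


# Toy, made-up "abstracts" (not taken from CORD-19), already lowercased.
raw = [
    "patients hospital clinical symptoms patients treatment",
    "clinical treatment hospital outcomes patients",
    "spike protein receptor binding cell entry protein",
    "receptor cell protein replication binding",
]
docs = [text.split() for text in raw]
theta, topic_word = fit_lda(docs, n_topics=2)

for k, counts in enumerate(topic_word):
    print(f"topic {k}:", [w for w, _ in counts.most_common(3)])
for d, row in enumerate(theta):
    print(f"doc {d}:", [round(p, 2) for p in row])
```

On a real corpus one would typically use an established implementation and remove stop words first; the hand-rolled sampler here is only meant to show what "topics" are in this kind of analysis.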
"Basic microbiological research has been slow to pick up the pace, leaving potential knowledge gaps in its wake," Doanvo said. "It's possible that stronger resourcing in these time- and resource-intensive efforts would better enable the scientific community to respond quickly to this virus."
The researchers also documented the evolution of COVID-19 research over time. They found a growing number of studies examining public health responses, clinical issues related to the virus, the societal impact of the outbreak, and how the disease spreads across populations. Meanwhile, reporting on the status of the outbreak has begun to plateau.
"This is a positive development, as it indicates that the scientific community has transitioned from the role of a passive observer of the virus into a group studying ways to fight its spread," said co-author Maimuna Majumder, PhD, a computational epidemiologist at Harvard Medical School and in the computational health informatics program at Boston Children's Hospital.
The researchers hope the natural language processing framework they developed can be used to infer research gaps for emerging pathogens before outbreaks escalate to the level of a pandemic. The approach would compare the cross-topic distribution of the literature on a new pathogen with that of previously explored pathogens.
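As a rough illustration of what comparing topic distributions across two bodies of literature could look like, the sketch below computes each topic's share in a "new pathogen" corpus and a "reference" corpus, then ranks topics by the log ratio of the two shares. This is an assumption about one simple way to do such a comparison, not the method described in the paper. The topic labels and counts are invented toy values and do not reflect the study's results; the function names `topic_shares` and `coverage_gaps` are hypothetical.

```python
import math
from collections import Counter


def topic_shares(labels, topics, pseudo=1.0):
    """Share of documents per topic, with add-one style smoothing."""
    counts = Counter(labels)
    total = len(labels) + pseudo * len(topics)
    return {t: (counts[t] + pseudo) / total for t in topics}


def coverage_gaps(new_labels, ref_labels):
    """Rank topics by log2(new share / reference share), lowest first.

    Negative values mean the topic makes up a smaller share of the new
    literature than of the reference literature.
    """
    topics = sorted(set(new_labels) | set(ref_labels))
    new = topic_shares(new_labels, topics)
    ref = topic_shares(ref_labels, topics)
    ratios = {t: math.log2(new[t] / ref[t]) for t in topics}
    return sorted(ratios.items(), key=lambda kv: kv[1])


# Invented toy labels: one dominant topic per abstract (illustrative only).
new_labels = (["clinical"] * 6 + ["public health"] * 5
              + ["modeling"] * 3 + ["pathogenesis"] * 1)
ref_labels = (["clinical"] * 3 + ["public health"] * 2
              + ["modeling"] * 2 + ["pathogenesis"] * 8)

gaps = coverage_gaps(new_labels, ref_labels)
for topic, log_ratio in gaps:
    print(f"{topic:15s} {log_ratio:+.2f}")
```

Topics at the top of the printed list are the candidates for a research gap in this toy setup. Smoothing keeps the ratio defined when a topic is absent from one of the corpora.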