Using AI to understand the pathogenesis of COVID-19

KAUST Associate Professor of computer science and Acting Associate Director of the Computational Bioscience Research Center Xin Gao.

The ongoing COVID-19 pandemic has revealed itself to the world as an unprecedented viral threat with a crippling power to disrupt society as we know it.

The disease caused by the novel coronavirus is not yet wholly understood. Governments and policymakers worldwide are continuously issuing updated information and guidance on curtailing the virus's spread as they attempt to manage the pervasive disease, understand its symptoms, epidemiology and transmission, and predict its future growth.

A group of KAUST faculty, coordinated by Donal Bradley, KAUST vice president for research, and Pierre Magistretti, dean of the University's Biological and Environmental Science and Engineering division, recently formed the Rapid Research Response Team (R3T). The team's focus is to collaborate with and strongly support the Kingdom's healthcare stakeholders to help combat the spread of COVID-19.

Sciencetown interview with Xin Gao

We spoke with Xin Gao for episode 11 of Sciencetown - Racing to understand.

Xin Gao, KAUST associate professor of computer science and acting associate director of the Computational Bioscience Research Center, is a key member of R3T. He and his Structural and Functional Bioinformatics (SFB) Group have been focusing on developing an artificial intelligence (AI)-based diagnostic pipeline that works from the CT scans of COVID-19 patients.

SARS-CoV-2, the virus that causes COVID-19, is an RNA virus that consists of a single RNA strand plus a series of proteins. The gold standard for diagnosing patients should therefore be to detect the virus's nucleic acid sequence (its gene sequence) or the antibodies produced by patients' immune systems.

False negatives

However, Gao explained that the experience in China has shown that this kind of approach can generate a surprisingly high false negative rate.

"It's not about the testing itself," he said. "Very often it's about when and how you take samples properly from the people and whether you transport or prepare the samples [and] whether there are any errors along the way. The final consequence is that the presumed gold standard has a high false negative rate, with some reports in China of 30 to 50 percent, which is ridiculously high."

Specialist labs, like those at KAUST, would be able to avoid this high false negative rate. However, at the peak of a pandemic, with many inexperienced people collecting and handling samples, it is not so easy. As a result, the Chinese government has been adding supporting components to the diagnostic process.

This means that Chinese hospitals now rely on four criteria: nucleic acid detection or gene sequencing; biomedical imaging mainly from CT scans; big data-based epidemiology analysis—whether there has been contact either direct or indirect with people from Wuhan; and clinical symptoms, such as dry cough or difficulty breathing.

"This is not only about diagnostics but rather also about prognosis and treatment," Gao said. "I have been talking to a lot of clinicians in China, and what they are saying is that it is critical to say where the infection area in the lung is, which lobe contains the infection, because the statistics have shown that if a patient's infection area is more than 50 percent of the entire lung volume, then very likely this patient will die."

Diagnosis, prognosis and treatment

However, if the infection only happens in one of the five lobes, then a patient can very quickly recover without many side effects or consequences—and this is where AI can help.

"We are not only doing the diagnostics and classification of the patient," Gao said. "We are also using artificial intelligence to segment the exact infection area from the CT scan of the patient's lung. We then quantify the volume with respect to the total volume of the lung and give that to clinicians as a guideline to help them to decide what kind of medication they should give to the patients."

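The quantification Gao describes, infected volume relative to total lung volume, can be illustrated with a minimal sketch. It assumes that a segmentation step has already produced two binary masks on the same voxel grid, one for the lung and one for the infection. The mask layout and the function name `infected_fraction` are hypothetical and are not taken from the group's pipeline. Because every voxel in a single scan has the same size, the ratio of voxel counts equals the ratio of volumes.

```python
from typing import Sequence

# A 3D mask indexed as [slice][row][column]; 1 marks a voxel inside the region.
# This layout is an assumption made for the example.
Mask3D = Sequence[Sequence[Sequence[int]]]


def infected_fraction(lung_mask: Mask3D, infection_mask: Mask3D) -> float:
    """Return infected lung voxels divided by total lung voxels."""
    lung_voxels = 0
    infected_voxels = 0
    for lung_slice, inf_slice in zip(lung_mask, infection_mask):
        for lung_row, inf_row in zip(lung_slice, inf_slice):
            for in_lung, infected in zip(lung_row, inf_row):
                if in_lung:
                    lung_voxels += 1
                    # Count infection only where it overlaps the lung.
                    if infected:
                        infected_voxels += 1
    if lung_voxels == 0:
        raise ValueError("lung mask is empty")
    return infected_voxels / lung_voxels


# Toy 2 x 2 x 2 volume: all 8 voxels are lung, 2 of them are infected.
lung = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
infection = [[[1, 0], [0, 0]], [[0, 1], [0, 0]]]
print(f"{infected_fraction(lung, infection):.0%} of lung volume infected")
# prints "25% of lung volume infected"
```

A real scan would contain millions of voxels and would normally be handled with an array library, but the ratio reported to clinicians is the same calculation.
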
To help with their efforts to develop machine-learning techniques, Gao and his group are sourcing actual CT scan imaging data from collaborators in China and are talking to the Saudi Center for Disease Prevention and Control, the Kingdom's Ministry of Health and King Faisal Specialist Hospital & Research Center to do the same with Saudi patients' CT scans.

"The input into our model is what we call a sequence," he said. "It's not a genomic sequence but rather a series of images from a patient's CT scan. When a patient is admitted to hospital, they will have multiple CT scans—once every two to three days—to see how the disease is progressing or recurring.

"So, we can obtain multiple CT scan sequences from the same patient, and right now we are obtaining them from about 100 Chinese patients as the first batch. We then feed these 3D images into our machine learning framework and we teach the computers to try to identify the infection areas exactly and quantify the volume."

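One way to picture how a sequence of scans from the same patient shows whether the disease is progressing is to compare the infected fraction from one scan to the next. The sketch below assumes each scan has already been reduced to a single infected-fraction value. The `ScanResult` type, the `daily_change` function and the numbers are all hypothetical and serve only as illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScanResult:
    day: int                  # days since admission (assumed bookkeeping)
    infected_fraction: float  # infected volume / total lung volume


def daily_change(scans: List[ScanResult]) -> List[float]:
    """Average change in infected fraction per day between consecutive scans."""
    ordered = sorted(scans, key=lambda s: s.day)
    changes = []
    for prev, cur in zip(ordered, ordered[1:]):
        span = cur.day - prev.day
        if span <= 0:
            raise ValueError("scans must be taken on distinct days")
        changes.append((cur.infected_fraction - prev.infected_fraction) / span)
    return changes


# Made-up values: the infection grows between day 0 and day 2, then recedes.
scans = [
    ScanResult(day=0, infected_fraction=0.10),
    ScanResult(day=5, infected_fraction=0.14),
    ScanResult(day=2, infected_fraction=0.20),
]
for change in daily_change(scans):
    print(f"{change:+.3f} per day")
```

A positive value means the infected share of the lung grew between two scans, and a negative value means it shrank.
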
Gao has a number of ambitious goals for the group's model. First, it needs to be fully automatic without human intervention, allowing busy clinicians to focus on patient care. Second, it must be rapid enough to cope with the exponential growth of the current pandemic.

A pressing deadline

The toughest goal is to make the model machine-agnostic.

"We are getting as much data as possible from different sources generated from different patients on different CT machines from different hospitals from different radiologists and using different parameters," Gao noted. "That requires us to develop a very smart, pre-processing normalization approach for all these different data sets so that our model won't be biased or confused by different parameters or machines."

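The article does not describe the group's normalization approach, so the following is only a generic illustration of one common pre-processing step rather than their method. CT intensities are expressed in Hounsfield units (HU), and a frequent first step is to clip them to a fixed window and rescale to the range 0 to 1 so that scans from different machines share a common intensity scale. The window bounds of -1000 and 400 HU and the function name `window_normalize` are assumptions chosen for the example.

```python
from typing import List, Sequence


def window_normalize(
    hu_values: Sequence[float],
    lower: float = -1000.0,  # assumed lower window bound, in HU
    upper: float = 400.0,    # assumed upper window bound, in HU
) -> List[float]:
    """Clip intensities to [lower, upper] and rescale them to [0, 1]."""
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    width = upper - lower
    return [(min(max(v, lower), upper) - lower) / width for v in hu_values]


# Values outside the window are clipped; -300 HU sits halfway through it.
print(window_normalize([-2000.0, -1000.0, -300.0, 400.0, 1000.0]))
# prints [0.0, 0.0, 0.5, 1.0, 1.0]
```

Intensity rescaling alone does not remove every difference between scanners. Slice thickness, voxel spacing and reconstruction settings also vary, which is part of why Gao calls this the toughest goal.
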
The time-frame to achieve the team's goal is daunting.

"For previous projects, my group and I spent six months up to one year building a system, and now the situation requires us to build one in three to five days," he continued. "For this particular project, I have invested four of the best people in my group to work day and night—some of them actually do not sleep—and our goal is to finish the one-year amount of work within one week."

However, the group isn't starting from scratch. For the past four months, they have been developing a similar platform for breast cancer diagnostics and segmentation from DCE-MRI data, much of which can be refined and optimized for COVID-19.

"The aim is that our model becomes so robust that it doesn't care about [where] machine data comes from or which parameters you are using, and that's why, once it's built, it can easily be deployed on all different kinds of platforms and hospitals to best benefit the users," Gao said.