Correspondence: Sharon C. Kiang, MD, Department of Surgery, Division of Vascular Surgery, Loma Linda University School of Medicine, 11175 Campus St, Ste 21123, Loma Linda, CA 92350
Department of Surgery, Division of Vascular Surgery, Loma Linda University School of Medicine, Loma Linda, CA
Department of Surgery, Division of Vascular Surgery, VA Loma Linda Healthcare System, Loma Linda, CA
Objective
To identify confounding variables influencing the accuracy of a convolutional neural network (CNN) specific for infrarenal abdominal aortic aneurysms (AAAs) on computed tomography angiograms (CTAs).
Methods
A Health Insurance Portability and Accountability Act-compliant, institutional review board-approved, retrospective study analyzed abdominopelvic CTA scans from 200 patients with infrarenal AAAs and 200 propensity-matched control patients. An AAA-specific trained CNN was developed by the application of transfer learning to the VGG-16 base model using model training, validation, and testing techniques. Model accuracy and area under the curve were analyzed based on data sets (selected, balanced, or unbalanced), aneurysm size, extra-abdominal extension, dissections, and mural thrombus. Misjudgments were analyzed by review of heatmaps, via gradient weighted class activation, overlaid on CTA images.
Results
The trained custom CNN model reported high test group accuracies of 94.1%, 99.1%, and 99.6% and area under the curve of 0.9900, 0.9998, and 0.9993 in selected (n = 120), balanced (n = 3704), and unbalanced image sets (n = 31,899), respectively. Despite an eightfold difference between balanced and unbalanced image sets, the CNN model demonstrated high test group sensitivities (98.7% vs 98.9%) and specificities (99.7% vs 99.3%) in unbalanced and balanced image sets, respectively. For aneurysm size, the CNN model demonstrates decreasing misjudgments as aneurysm size increases: 47% (16/34) for aneurysms <3.3 cm, 32% (11/34) for aneurysms 3.3 to 5 cm, and 20% (7/34) for aneurysms >5 cm. Aneurysms containing measurable mural thrombus were over-represented within type II (false-negative) misjudgments compared with type I (false-positive) misjudgments (71% vs 15%, P < .05). Inclusion of extra-abdominal aneurysm extension (thoracic or iliac artery) or dissection flaps in these imaging sets did not decrease the model's overall accuracy, indicating that the model performance was excellent without the need to clean the data set of confounding or comorbid diagnoses.
Conclusions
An AAA-specific CNN model can accurately screen for and identify infrarenal AAAs on CTA despite varying pathology and quantitative data sets. The highest anatomic misjudgments were with small aneurysms (<3.3 cm) or the presence of mural thrombus. Accuracy of the CNN model is maintained despite the inclusion of extra-abdominal pathology and imbalanced data sets.
A convolutional neural network (CNN), a class of deep learning model within artificial intelligence (AI), has been spotlighted in medical imaging for solving computer-based visual tasks (ie, image analysis, object identification, categorization, and segmentation). The application of CNNs has been investigated in a wide range of medical fields and could potentially lead to the development of new approaches for the diagnosis, prognosis, or treatment of patients.
In the era of personalized medicine and big data analytics, AI AAA imaging programs have the potential ability to predict personalized-patient outcomes.
One of the biggest challenges with deep learning models in medical imaging is the set of variables that influence their accuracy and generalizability across institutions. CNN models amplify the aspects of the input that are important for discrimination and suppress irrelevant variations (ie, normal variants, confounding pathology, and small data sets).
Thus, the quality and applicability of AI algorithms across institutions may be questionable without extensive testing and subanalysis.
The goal of this study is to analyze confounding variables influencing the accuracy of a newly developed CNN specific for detecting infrarenal AAAs. Our previously reported machine learning model (without segmentation training/programming) automatically detects the presence of an infrarenal AAA in various locations and sizes with nearly 99% accuracy. The model is a binary classifier that automatically recognizes the presence or absence of an AAA on computed tomography angiograms (CTAs) of the abdomen and pelvis.
The AI model accuracy will be analyzed against simulated real-world confounding variables such as data set size (selected, balanced, or unbalanced), aneurysm size, extra-abdominal extension, dissections, and mural thrombus.
Methods
Study population
The local institutional review board approved this Health Insurance Portability and Accountability Act-compliant study and waived the requirement for written informed consent. A retrospective review of the hospital’s internal radiology database (mPower Clinical Analytics; Nuance Communications, Inc) identified 4821 CTA scans of the abdomen and pelvis performed between January 2015 and January 2020. Within this group, 398 CTAs of the abdomen and pelvis reported the presence of an aortic aneurysm. These examinations were individually reviewed for the presence of infrarenal AAAs (diameter >3.0 cm). From this group, 68 cases were excluded because of ruptured aneurysm, absence of an infrarenal AAA, prior repair of an infrarenal AAA, image nonavailability in the picture archiving and communication system and/or protocol errors (absence of intravenous [IV] contrast material, etc). Subsequently, 200 CTA scans containing infrarenal AAAs were identified. Clinical and demographic data (ie, date of birth, patient gender, presence or absence of hypertension, history of tobacco use, and scanner type) were collected from the medical record system. For the development of a propensity-matched control group, analysis of the 4821 CTA scans of the abdomen and pelvis identified 200 propensity-matched nonaneurysmal aorta control patients who were selected based on similar demographics, comorbidities, and technical imaging factors of the study group.
Convolutional neural network model development
As described in our prior work, for the initial CNN model development, axial reconstructions from all selected CT scans were exported in noncompressed JPEG format at preset window widths and levels.
All axial reconstruction images were resized to 512 × 512 pixels. A total of 6175 axial images containing infrarenal AAAs were sorted. A total of 100,249 axial nonaneurysmal images were sorted. The aneurysm set was randomized to 60% training (n = 3705), 10% validation (n = 618), and 30% testing (n = 1852) subsets. A nonaneurysm set was generated through sampling of nonaneurysm axial reconstruction images at fixed intervals. The nonaneurysm set was randomized to 60% training (n = 3705), 10% validation (n = 618), and 30% testing (n = 1852) subsets.
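The 60%/10%/30% randomization described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `split_indices`, the fixed seed, and the round-half-up rule for the subset sizes are assumptions made for the example. Applied to the 6175 aneurysm images, it reproduces the reported subset sizes.

```python
import random


def split_indices(n_images, seed=0):
    """Shuffle image indices and split them 60/10/30 into train/val/test.

    The seed and the rounding rule are hypothetical choices for this sketch.
    """
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)

    # Integer arithmetic avoids floating-point surprises at the .5 boundary.
    n_train = (n_images * 60 + 50) // 100  # 60%, rounded half up
    n_val = (n_images * 10 + 50) // 100    # 10%, rounded half up

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]       # remainder, about 30%
    return train, val, test


train, val, test = split_indices(6175)
print(len(train), len(val), len(test))  # 3705 618 1852
```

Assigning the remainder to the test subset guarantees that the three subsets always sum to the full image count.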
The VGG-16 neural network architecture was selected for its robust performance in a variety of image recognition tasks.
Transfer learning was applied to the neural network using weights pretrained on ImageNet, a data set of over 14 million hand-labeled images in over 20,000 categories.
Normalization, rotation, flipping, width, height, zoom level, and shear intensity were varied. No segmentation of axial reconstruction images was performed. Stochastic gradient descent was employed as the model optimizer. The penultimate network layer consisted of a dense layer containing 1024 neurons. The last fully connected layer was connected to a logistic layer using a rectified linear unit as the activation function for binary output (infrarenal AAA or nonaneurysm). Initial learning rate, decay, and momentum were set to 1 × 10−3, 1 × 10−6, and 0.9, respectively. Gradient descent optimization was applied via Nesterov accelerated gradient. The model was trained for 40 epochs with batch sizes of 15 to stable convergence of the loss function in the validation set. To address class imbalance, the majority class (nonaneurysm) was undersampled to the same size as the minority class (infrarenal AAA). Model development and analysis were performed using Keras (version 2.4.3), TensorFlow (version 2.4.1), imgaug (version 0.2.5), Scipy (version 1.2.1), NumPy (version 1.8.2), scikit-learn (version 0.23.1), and Matplotlib (version 3.2.2). All experiments were performed on a computer equipped with an NVIDIA Quadro P5000 graphical processing unit with 16 GB GDDR5 video memory.
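To make the optimizer settings above concrete, the following is a minimal pure-Python sketch of one stochastic gradient descent step with Nesterov momentum and time-based learning rate decay, using the reported values (learning rate 1 × 10−3, decay 1 × 10−6, momentum 0.9). The update form is an assumption: it follows the formulation commonly used by Keras-style SGD optimizers and is not taken from the authors' code. The scalar toy loss w² merely stands in for the network loss.

```python
def sgd_nesterov_step(w, v, grad, step, lr=1e-3, decay=1e-6, momentum=0.9):
    """One SGD update with Nesterov momentum and time-based decay.

    Assumed (Keras-style) formulation, shown for a single scalar weight.
    """
    lr_t = lr / (1.0 + decay * step)           # learning rate shrinks with step count
    v_new = momentum * v - lr_t * grad         # velocity update
    w_new = w + momentum * v_new - lr_t * grad  # Nesterov "look-ahead" step
    return w_new, v_new


# Toy problem: minimize loss(w) = w**2, whose gradient is 2*w.
w, v = 1.0, 0.0
for step in range(2000):
    grad = 2.0 * w
    w, v = sgd_nesterov_step(w, v, grad, step)
print(abs(w) < 1e-6)  # True: the weight has converged toward the minimum at 0
```

With a decay of 1 × 10−6, the learning rate changes very little over a short run, so the decay mainly matters for long training schedules.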
The model was assessed for overall diagnostic accuracy at the image level. Loss and accuracy of training and validation groups were plotted by epoch to observe for stable convergence of model performance.
Fig 1. The design and performance of the optimized CNN model for AAA. (A) Flowchart of the study process depicting patient selection and study design. (B) In the optimized CNN model, as the number of epochs increases, there is an appropriate reduction of the loss function and increase in overall accuracy demonstrating an optimal fitting for model performance.
Machine learning analysis of data confounding variables influencing the accuracy of output
Optimization of the model included randomization to sets of 60%, 10%, and 30% for model training, validation, and testing, respectively. A total of 6175 axial images containing infrarenal AAAs were sorted. A total of 100,249 axial nonaneurysmal images were sorted. The aneurysm set was randomized to 60% training (n = 3705), 10% validation (n = 618), and 30% testing (n = 1852) subsets. A numerically balanced nonaneurysm set was generated through the sampling of nonaneurysm axial reconstruction images at fixed intervals. The balanced nonaneurysm set was randomized to 60% training (n = 3705), 10% validation (n = 618), and 30% testing (n = 1852) subsets. Training and validation subsets were used for model hyperparameter tuning. Test subsets were used for the evaluation of model performance.
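The numerically balanced nonaneurysm set was generated by sampling at fixed intervals. One way this could be done is sketched below; the exact interval rule is not stated in the text, so the evenly spaced index formula and the function name `sample_at_fixed_intervals` are assumptions. Integer stand-ins replace the image files.

```python
def sample_at_fixed_intervals(items, n_keep):
    """Pick n_keep items spread evenly across an ordered list.

    Hypothetical helper: the index rule (i * n_total) // n_keep is one
    possible reading of "sampling at fixed intervals".
    """
    n_total = len(items)
    if n_keep > n_total:
        raise ValueError("cannot keep more items than are available")
    return [items[(i * n_total) // n_keep] for i in range(n_keep)]


# Stand-ins for the 100,249 nonaneurysm axial images, in acquisition order.
nonaneurysm_ids = list(range(100249))

# Undersample the majority class to the size of the aneurysm set (6175).
balanced = sample_at_fixed_intervals(nonaneurysm_ids, 6175)
print(len(balanced))  # 6175
```

Sampling across the whole ordered list, instead of taking the first 6175 images, keeps all patients and all anatomic levels represented in the reduced set.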
After the finalized optimization of the model’s hyperparameters, the model’s accuracy and area under the curve (AUC) were then subanalyzed based on data set variations and pathological variables. In order to replicate real-world institutional variability and applicability, three data sets of different sizes were created and tested for accuracy: selected, balanced, and unbalanced (120 images, 3704 images, and 31,899 images, respectively). In the subanalysis of pathology, the following variables were evaluated for misjudgments and accuracy: aneurysm size (<3.3 cm, 3.3-5.0 cm, >5.0 cm), extra-abdominal extension, dissections, and mural thrombus.
In regard to the variable data set size, a confusion matrix (two-by-two) table was generated from each testing set. Sensitivity, specificity, positive predictive value, and negative predictive value were calculated from the classification results. In regard to the pathological variables, misjudgments were analyzed by review of heatmaps, via gradient weighted class activation, overlaid on CTA images. Plots and figures were generated by Matplotlib and converted to vector graphic format in Visio Professional 2019 (Microsoft) or OmniGraffle Pro (version 7.18.1; The Omni Group).
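The four measures derived from each two-by-two confusion matrix follow directly from the cell counts. The sketch below uses hypothetical counts, not the study's data, and compares a balanced test set with an unbalanced one that has ten times as many negatives while holding the per-class error rates fixed.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute standard diagnostic measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical counts for illustration only.
balanced = diagnostic_metrics(tp=95, fp=2, fn=5, tn=98)
unbalanced = diagnostic_metrics(tp=95, fp=20, fn=5, tn=980)

for name in ("sensitivity", "specificity", "ppv", "npv", "accuracy"):
    print(f"{name}: {balanced[name]:.3f} vs {unbalanced[name]:.3f}")
```

In this toy comparison, sensitivity and specificity are identical in both sets, whereas the positive predictive value drops in the unbalanced set because false positives accumulate with the larger negative class. This is why sensitivity and specificity are the more informative measures when comparing data sets with different class ratios.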
Results
The demographics of the propensity-matched groups (AAA vs non-AAA) were similar: age (73.2 years and 72.1 years, P = .359), male gender (71.5% and 72.0%, P = .999), tobacco use (79.9% and 76.5%, P = .891), or history of hypertension (86.3% and 82.7%, P = .878).
The trained custom CNN model reported high test group accuracies of 94.1%, 99.1%, and 99.6% and AUC of 0.9900, 0.9998, and 0.9993 in selected (n = 120), balanced (n = 3704), and unbalanced image sets (n = 31,899), respectively (Table I). As demonstrated by the confusion matrices, despite an eightfold difference between balanced and unbalanced image sets (3704 vs 31,899), the CNN model demonstrated high test group sensitivities (98.7% vs 98.9%) and specificities (99.7% vs 99.3%) in unbalanced and balanced image sets, respectively (Fig 2).
Table I. The trained custom CNN model reported high test group accuracies in varying sized data sets
Fig 2. Despite an eightfold difference between balanced (A) and unbalanced (B) image sets (3704 vs 31,899), the CNN model demonstrated low rates of misjudgments in both the balanced and unbalanced image sets.
In the subanalysis of these misjudgment cases (n = 34, 0.092%), the share of misjudgments decreased as aneurysm size increased: 47% (16/34) for aneurysms <3.3 cm, 32% (11/34) for aneurysms 3.3 to 5 cm, and 20% (7/34) for aneurysms >5 cm (Fig 3). The presence of a measurable mural thrombus was also a notable confounding variable, present in 50% of these misjudgments (Fig 4). Aneurysms containing measurable mural thrombus were over-represented within type II (false-negative) misjudgments compared with type I (false-positive) misjudgments (71% vs 15%, P < .05). The average thickness of the mural thrombi that caused a false-negative misjudgment was 12.8 mm (±6.1 mm). Inclusion of extra-abdominal aneurysm extension (n = 347 images) or dissection flaps (n = 80 images) in these imaging sets did not appear to decrease the model’s overall accuracy because they were not heavily represented in the error set relative to their incidence in the overall data set.
Fig 3. The share of CNN misjudgments decreases as aneurysm size increases: 47% (16/34) for aneurysms <3.3 cm, 32% (11/34) for aneurysms 3.3-5 cm, and 20% (7/34) for aneurysms >5 cm. Below are heat maps generated via gradient weighted class activation mapping overlaid on CT images. (A) This aneurysm bordered on ectasia; its diameter, very close to the 3 cm threshold, contributed to the false-positive misjudgment. (B) The relatively small size of the enhancing region in combination with mural thrombus contributed to a false-negative misjudgment.
Fig 4. The presence of a measurable mural thrombus was a notable confounding variable that was present in 50% of these misjudgments. Below are heat maps generated via gradient weighted class activation mapping overlaid on CT images. (A) Correct classification. The algorithm successfully avoided misjudging an aneurysm with significant mural thrombus. (B) Incorrect classification. The relatively small size of the enhancing region in combination with mural thrombus contributed to a false-negative misjudgment.
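The heat maps in Figs 3 and 4 come from gradient weighted class activation mapping. As a rough sketch of the core computation only, the function below takes the feature maps of one convolutional layer and the gradients of the class score with respect to those maps, both supplied here as small hand-made nested lists. Extracting these arrays from a trained network, and upsampling and overlaying the result on the CT image, are not shown, and the function name `grad_cam` and the toy inputs are assumptions made for illustration.

```python
def grad_cam(activations, gradients):
    """Combine K feature maps (each H x W) into one normalized heat map.

    activations[k][i][j]: value of feature map k at pixel (i, j).
    gradients[k][i][j]: gradient of the class score w.r.t. that value.
    """
    # Channel weight = spatial average of that channel's gradients.
    weights = [
        sum(sum(row) for row in grad) / (len(grad) * len(grad[0]))
        for grad in gradients
    ]
    height, width = len(activations[0]), len(activations[0][0])

    # Weighted sum of feature maps, then ReLU to keep positive evidence only.
    cam = [
        [
            max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Scale to [0, 1] so the map can be rendered as a color overlay.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy input: channel 0 supports the class, channel 1 argues against it.
activations = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
heatmap = grad_cam(activations, gradients)
print(heatmap)  # [[1.0, 0.0], [0.0, 0.0]]
```

Only the top-left pixel lights up, because it is the only location where a positively weighted feature map is active. In a misjudgment review, this is the kind of map that shows which image region drove the classification.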
Discussion
The implementation of AI in medicine is undergoing continuous evolution. The integration of AI imaging and biologic analysis could potentially lead to the development of revolutionary predictive models for the management of patients.
Before the application of machine learning to image analysis, manual human interpretation was required to convert an image finding into a binary or categorical variable for analysis. However, with the use of a CNN, we now have the ability to objectively break down an image into large biostatistical data sets. The tidal wave of medical records, in the form of imaging data, clinical data, and genomic data, is only likely to increase exponentially. Thus, the future of medical research is likely to be even more data dependent, with the synergy between medical scientists and AI technology becoming more pronounced. Based on the analysis of large data sets of disease profiles and treatment responses, machine learning programs will likely provide the opportunity to predict personalized, patient-specific, clinical outcomes.
The main challenge in AI data science applications is achieving high fidelity together with widespread applicability. Because CNNs are trained on sample cases from the general population, it is not possible to provide every anatomic scenario for AAA that could occur. As a result, many factors can challenge a CNN’s widespread applicability. These “confounding” factors include varying imaging modalities (different types and qualities of CT scan imaging), pathologies (atherosclerotic plaques, dissection flaps, mural thrombus, and penetrating ulcers), protocols (timing of IV contrast opacification of the aorta), and practices (small hospital setting vs large tertiary referral center). In typical AI data science applications, the larger the input data set (ie, thousands to hundreds of thousands), the more accurate the algorithm output for sorting true signal from noise. However, surgical clinical research is often limited by the total number of patients (ie, hundreds to thousands) who can be studied at a particular institution. Thus, the quality and applicability of AI algorithms across institutions may be questionable without extensive testing and subanalysis. This study analyzed the anatomic and data set variables influencing the accuracy of a newly developed CNN specific for detecting infrarenal AAAs.
Some of the early AI studies in vascular surgery were implemented to assess the predictive nature of clinical markers for AAA patient outcomes.
However, imaging is an integral component for the diagnosis, surveillance, and management of AAAs. In the vascular surgery literature, there have been studies in the past that have examined the role of semiautomated AAA image analysis focusing on image segmentation.
Although these were significant advances in programming techniques, these programs all require some baseline manual input. However, Lareyre et al recently described a fully automated pipeline to characterize the AAA, including the presence of intraluminal thrombus and calcifications. This rapid method was tested on a set of 40 patients with CTA images and demonstrated a good correlation with results obtained from manual segmentation by human experts.
Other investigators designed a CNN classifier for the aorta with detection reported at 98.62% and a Hough Circles algorithm that classified a group of 120 aorta patches according to their diameter with an accuracy of 98.33%.
Our fully automated, novel, trained CNN model demonstrated a robust accuracy of 99.1% (95% confidence interval: 98.72%-99.36%) and an AUC of 0.9900 tested on 3600 images from 400 patients in two propensity-matched cohorts.
These results are derived from real-world, unaltered, nonsegmented images that contain varying acquisition methods, contrast agent used, resolution, concomitant comorbid pathology, and noise and artifacts. With this robust CNN, we have demonstrated a proof of concept model that can be used for a variety of potential future applications.
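A 95% confidence interval for a classification accuracy, such as the one reported above, can be computed from the number of correct classifications and the number of test images. The text does not state which interval method was used, so the Wilson score interval below is an assumption, as is the pairing of 34 misjudgments with the 3704-image balanced test set in the example call.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed pairing for illustration: 34 misjudgments among 3704 test images.
n_images = 3704
n_correct = n_images - 34
lower, upper = wilson_interval(n_correct, n_images)
print(f"accuracy {n_correct / n_images:.4f}, 95% CI {lower:.4f} to {upper:.4f}")
```

The Wilson interval is often preferred over the simple normal approximation when the proportion is close to 0 or 1, as it is here, because it never extends beyond the 0 to 1 range.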
Although CNNs have great potential in medical imaging, many confounding variables limit their widespread applicability because of varying imaging modalities, pathologies, protocols, and practices.
Because of this, the next step in the development process has been to address the applicability of these CNNs to real-world scenarios. In the 2017 American Association of Physicists in Medicine challenge, the winning CNN for cardiac autosegmentation demonstrated decreased accuracy when applied to data from another local institution, compared with the testing cases from the challenge.
Other investigators proposed a method to improve CNN-based model generalizability for cross-scanner image segmentation tasks.
Khened et al proposed a novel network structure with residual connections to improve the generalizability of another CNN. They pointed out that networks with a large number of parameters may easily suffer from overfitting problems with limited data. As with any new science or technological advancement, the pitfalls and challenges become more apparent when a more detailed subanalysis of these tools is performed. In our current study, we performed a preliminary subanalysis of our AAA-specific CNN model in regard to varying pathology and quantitative data sets. The highest anatomic misjudgments were with small aneurysms (<3.3 cm) or the presence of mural thrombus. The accuracy of the CNN model is maintained despite the inclusion of extra-abdominal pathology and imbalanced data sets. With this information, vascular surgeons can better design and fine-tune their future AI algorithms for AAA research. To the best of our knowledge, this is the first work to explore the generalizability of a CNN-based AI algorithm for the CTA image analysis of AAAs with variable data set sizes, concomitant comorbid aortic pathology, multiple scanners, and techniques.
There are several limitations of this study. First, this is a retrospective single-center study with a limited number of subjects who had certain exclusion criteria (ruptured aneurysm, prior repair of an infrarenal AAA, and/or protocol errors [absence of IV contrast material, timing issues, etc]). Second, the sample size was underpowered for the subanalysis of comprehensive anatomic copathology; there was a lack of all varieties of aneurysm sizes and all forms of comorbid aortic pathologies (mural thrombus, dissection, extra-abdominal extension, etc). Third, the model is not 100% accurate and still demonstrates a <1% misjudgment rate.
In summary, this preliminary subanalysis shows that an AAA-specific CNN model can accurately screen for and identify infrarenal AAAs on CTA despite varying pathology and quantitative data sets. The highest anatomic misjudgments were with small aneurysms (<3.3 cm) or the presence of a mural thrombus. The accuracy of the CNN model is maintained despite the inclusion of extra-abdominal pathology and imbalanced data sets.
Additional material for this article may be found online at www.jvascsurg.org.
The editors and reviewers of this article have no relevant financial relationships to disclose per the JVS-Vascular Science policy that requires reviewers to decline review of any manuscript for which they may have a conflict of interest.