Clin Res Cardiol (2023). https://doi.org/10.1007/s00392-023-02302-4

Application of Attention Mechanism with Vision Transformers Artificial Intelligence Architecture for the Detection of Severe Aortic Stenosis
F. Schölzel1, A. Efremidis1, A. G. Bejinariu1, M. Spieker1, M. Kelm1, O. R. Rana1, H. Makimoto2
1Klinik für Kardiologie, Pneumologie und Angiologie, Universitätsklinikum Düsseldorf, Düsseldorf; 2Data Science Center, Jichi Medical University, Tochigi, Japan

Aortic stenosis (AS) is the most prevalent valvular disease in the aging population, which creates the need for a screening method to identify patients eligible for further cardiological work-up. Convolutional neural networks (CNNs) have been shown to distinguish patients with severe AS from those without by analyzing recorded heart sounds, with confounding heart murmurs remaining a limiting factor. The Vision Transformer (ViT) is a novel deep learning architecture that adapts the attention-based Transformer, which has shown high capability in processing sequential data such as natural language, to image inputs.
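For orientation, the following is a minimal sketch of how a ViT-style classifier can ingest spectrogram images; it assumes PyTorch, and the patch size, embedding width, depth, and head count are illustrative choices, not the study's actual hyperparameters.

```python
# Minimal ViT-style classifier for single-channel spectrogram "images".
# All hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=128, patch=16, dim=64, depth=4, heads=4, classes=2):
        super().__init__()
        n = (img // patch) ** 2  # number of non-overlapping patches
        # Patch embedding: each patch is projected to a `dim`-dimensional token.
        self.embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))       # [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))   # positional embedding
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)    # self-attention stack
        self.head = nn.Linear(dim, classes)

    def forward(self, x):                                  # x: (B, 1, img, img)
        tok = self.embed(x).flatten(2).transpose(1, 2)     # (B, n, dim)
        tok = torch.cat([self.cls.expand(len(x), -1, -1), tok], dim=1) + self.pos
        return self.head(self.encoder(tok)[:, 0])          # classify via [CLS]

# Example: two spectrograms scored as severe AS vs. not severe AS.
logits = TinyViT()(torch.randn(2, 1, 128, 128))
```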

The aim of this study was to examine the applicability and performance of the ViT architecture for the detection of severe AS based on recorded phonocardiograms.

Our dataset included auscultation recordings from 1150 patients with a variety of valvular diseases (158 with severe AS), assessed by echocardiography at study inclusion. Severe AS was defined as an aortic valve area ≤ 1 cm² by the continuity equation (sketched below). Digital auscultation at the second intercostal space along the right sternal border was selected because it yielded the best recording quality.
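For reference, the standard echocardiographic continuity equation equates flow across the left ventricular outflow tract (LVOT) and the aortic valve (AV); the study's specific measurement protocol is not detailed in the abstract:

\[
\mathrm{AVA} \;=\; \frac{\mathrm{CSA}_{\mathrm{LVOT}} \cdot \mathrm{VTI}_{\mathrm{LVOT}}}{\mathrm{VTI}_{\mathrm{AV}}},
\qquad
\mathrm{CSA}_{\mathrm{LVOT}} \;=\; \pi \left(\frac{d_{\mathrm{LVOT}}}{2}\right)^{2}
\]

where VTI denotes the velocity-time integral and \(d_{\mathrm{LVOT}}\) the LVOT diameter.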
We trained the ViT with randomly chosen balanced datasets (standard ViT), each consisting of 316 patients (50% severe AS). For data augmentation, we used time shifting to split each 15-second recording into 2-second intervals and converted these into spectrograms, resulting in a set of 7228 samples (60% training, 20% validation, 20% test); a sketch of this preprocessing follows below. To assess the impact of concomitant mitral regurgitation (MR), we over- and underexposed the ViT by training models with imbalanced sets, one containing all 84 cases of severe MR and the other containing none. Each model was trained 250 times.
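The following is a minimal sketch of such a time-shifting pipeline. The 2-second window and 15-second recording lengths come from the abstract; the sampling rate, window shift, and spectrogram parameters are illustrative assumptions (the sample count implies overlapping windows, but the actual shift is not reported).

```python
# Sketch of time-shifting augmentation plus spectrogram conversion.
import numpy as np
from scipy.signal import spectrogram

FS = 4000          # assumed sampling rate (Hz) of the phonocardiogram
WINDOW_S = 2.0     # 2-second intervals, as stated in the abstract
HOP_S = 0.5        # assumed shift between consecutive windows

def time_shift_windows(recording: np.ndarray, fs: int = FS) -> list:
    """Split a recording into overlapping 2-second windows via time shifting."""
    win, hop = int(WINDOW_S * fs), int(HOP_S * fs)
    return [recording[s:s + win]
            for s in range(0, len(recording) - win + 1, hop)]

def to_spectrogram(window: np.ndarray, fs: int = FS) -> np.ndarray:
    """Convert one window into a log-magnitude spectrogram image."""
    _, _, sxx = spectrogram(window, fs=fs, nperseg=256, noverlap=128)
    return np.log1p(sxx)  # log scaling compresses the dynamic range

# Example: one 15-second recording yields a stack of spectrogram samples.
recording = np.random.randn(int(15 * FS))  # placeholder phonocardiogram
samples = [to_spectrogram(w) for w in time_shift_windows(recording)]
```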

The standard ViT achieved good performance for the detection of severe AS, with a mean F1 score, accuracy, sensitivity, and specificity of 71.2%, 71.8%, 70.3%, and 73.4%, respectively. The models over- or underexposed to severe MR showed no significant difference in performance compared with the standard model (F1 score 70.5%, p = 0.072, and 70.6%, p = 0.22). Evaluation of misclassified samples revealed no relevant variation with respect to the underlying valvular pathology (p = 0.23). Half of the misclassified samples originated from 8% of the patients, and 75.2% of those samples were misclassified by at least two models.

In this study, the novel Vision Transformer was applied for the first time to the task of detecting severe AS by means of digital auscultation in a population of patients with complex multivalvular disease. Although its accuracy fell short of previously published CNNs, we documented high robustness of the ViT towards concomitant valvular disease, with similar accuracy maintained when the models were over- or underexposed to severe MR. Given the increasing use of transformer architectures, our data support the applicability of the ViT approach for the detection of severe AS in the presence of concomitant multivalvular disease. Larger datasets are needed for an exact assessment of accuracy.


https://dgk.org/kongress_programme/ht2023/aV86.html