Clin Res Cardiol (2022). https://doi.org/10.1007/s00392-022-02002-5

CARDIO:DE - A distributable German clinical corpus containing clinical routine discharge letters from the cardiology domain
P. Richter-Pechanski1, D. Schwab1, C. Kiriakou1, C. Dieterich1, N. A. Geis1
1Klinik für Innere Med. III, Kardiologie, Angiologie u. Pneumologie, Universitätsklinikum Heidelberg, Heidelberg
Motivation and objective:
In medicine in general, and in cardiology in particular, much treatment-relevant information is still stored in unstructured free text. Recent advances in natural language processing (NLP) based on deep neural networks show impressive results for information extraction from such texts. To develop these models, shared corpora are essential, as they support transparent and reproducible science. Models can be compared more easily, which in turn can foster innovation in the field. To date, only a single distributable German clinical corpus is available, containing 200 oncological discharge summaries (Kittner et al. 2021). To the best of our knowledge, no distributable German corpus covers the field of cardiovascular medicine. We present CARDIO:DE, the first freely available and distributable large German clinical corpus from the cardiovascular domain. The goal of CARDIO:DE is to support research in clinical information extraction on German texts and to make research outcomes more reproducible.

Materials and methods:
CARDIO:DE will contain 500 clinical routine German discharge letters from the cardiovascular domain as 500 separate plain-text files. We designed the CARDIO:DE corpus project prospectively: after a patient has signed the study consent, we include the next discharge letter generated for that patient in the corpus. Before further processing, all discharge letters are first automatically de-identified (Richter-Pechanski et al. 2019) and then, in a second pass, manually de-identified by medical experts. In a final step, we apply methods of k-anonymity by randomizing rare laboratory values. To preserve time information, we use a well-established randomization method (Johnson et al. 2016), shifting all dates by a random value per discharge letter. The final corpus will be made available to NLP researchers using a three-step approach, including a description of the intended purpose of use, a data exchange and a data usage agreement.
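
The per-letter date shift described above can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the function name shift_dates, the DD.MM.YYYY date pattern and the 365-day maximum offset are assumptions chosen for illustration only. The point it shows is that a single random offset per letter changes the absolute dates while leaving the intervals between events within a letter intact.

```python
import random
import re
from datetime import datetime, timedelta

# German-style dates such as 03.02.2021 (assumed format; real letters may
# contain other date notations that this pattern does not catch).
DATE_PATTERN = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")


def shift_dates(letter_text, rng, max_shift_days=365):
    """Shift every DD.MM.YYYY date in one letter by the same random offset.

    Hypothetical helper: the offset range is an illustrative assumption.
    """
    # One offset per letter keeps the intervals between dates unchanged.
    offset = timedelta(days=rng.randint(1, max_shift_days))

    def replace(match):
        day, month, year = (int(group) for group in match.groups())
        try:
            shifted = datetime(year, month, day) + offset
        except (ValueError, OverflowError):
            # Not a valid calendar date (or out of range): leave untouched.
            return match.group(0)
        return shifted.strftime("%d.%m.%Y")

    return DATE_PATTERN.sub(replace, letter_text)


if __name__ == "__main__":
    example = "Aufnahme am 03.02.2021, Entlassung am 10.02.2021."
    # A fresh random offset is drawn for each letter.
    print(shift_dates(example, random.Random()))
```

In this sketch the admission and discharge dates both move by the same number of days, so the length of stay remains seven days after shifting.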

Results:
At the time of writing, we have collected 350 patient consents and included all 350 corresponding discharge letters in the CARDIO:DE corpus. These documents cover the broad clinical spectrum of a tertiary care cardiovascular center: discharge letters from inpatients, the outpatient clinic and the cardiac emergency room (chest pain unit) are included, resulting in a diverse mixture of clinical documents. The included discharge letters thus cover both complex multiple-day hospitalizations and brief outpatient presentations. The result is a representative collection of clinical documents covering common discharge letter sections (e.g. anamnesis, physical examination, medication) in varying degrees of detail.

Discussion:
While collecting patient consents can be time-consuming and tedious, a prospective study design best complies with current data protection regulations and keeps the content of clinical documents as consistent as possible. Because time information is preserved in the documents, the corpus can be used for various information extraction tasks in the cardiovascular domain.

Conclusion:
CARDIO:DE aims to fill the gap in shared German clinical corpora for the cardiovascular domain by distributing a corpus for research purposes, providing the opportunity to collaboratively develop and validate natural language processing models on German medical texts.

https://dgk.org/kongress_programme/jt2022/aP1148.html