Exploring ECG Data: PTB-XL for Training Generative Models
- Kasturi Murthy
- Jun 23
- 7 min read
Updated: Jun 28
Electrocardiogram (ECG) data is essential for assessing heart health. In data science and engineering, particularly when developing AI algorithms, access to a diverse range of high-quality ECG data is crucial. This blog post, inspired by dissertation research on generating synthetic ventricular tachycardias, covers key ECG data sources and the significant challenge posed by noise and artifacts. Building on my previous post about synthetic sine wave generation, this entry introduces a practical example. We will explore various publicly available ECG data sources before concentrating on the PTB-XL dataset [8], which is used to train a Variational Autoencoder (VAE), later a Generative Adversarial Network (GAN), and combinations of the two.
Excited to share PTB_XL_Data_Overview.ipynb, my Google Colab notebook for an in-depth look at the PTB-XL ECG dataset.
WFDB at its Core: This notebook extensively uses WFDB, PhysioNet's [PhysioNet Databases] robust software for managing and processing physiological signals. You'll see how it enables efficient data handling, advanced signal processing (filtering, resampling), and deep analysis (heartbeat detection, feature extraction) vital for cardiac research.
PTB-XL Data Insights: Explore how I've structured and loaded PTB-XL's ECG data, handled its rich metadata, and processed diagnostic information. The notebook demonstrates flexible data loading with custom sampling rates and smart aggregation of diagnostic classes using SCP codes.
Access by Request: To ensure personalized support and quality, I'm offering access on a request basis.
Get Access: Simply fill out this quick form: [Response - Google Sheets]
Heads Up: A Gmail account is preferred for direct access. I'll review requests regularly and notify you via email once access is granted.
Understanding the ECG Waveform and Its Features
A typical ECG signal provides insights into cardiac, cardiovascular, and cardiorespiratory functions. It consists of key components:
P wave: Represents atrial depolarization.
QRS complex: Signifies ventricular depolarization and the heart's powerful pumping action. An abnormally large Q wave can indicate a past heart attack.
T wave: Represents ventricular repolarization (relaxation and recovery phase) and is vital for cardiac electrical stability.
PR interval: The time from the onset of the P wave to the onset of the QRS complex.
QT interval: The time from the onset of the QRS complex to the end of the T wave.
RR interval: The time between successive R-peaks. When only normal (sinus) beats are considered it is called the NN interval, and the two terms are often used interchangeably.
Deviations from these normal features can indicate various cardiac issues, including arrhythmias. These abnormalities can manifest as altered heart rates, irregular rhythms, or changes in wave morphology.
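As a small illustration of the RR interval described above, here is a minimal sketch that converts R-peak positions into RR intervals and a mean heart rate. The R-peak sample indices and the function names are hypothetical, and the 500 Hz sampling rate is an assumption chosen for the example. In practice the peaks would come from a beat detector rather than being typed in by hand.

```python
def rr_intervals_seconds(r_peak_samples, sampling_rate_hz):
    """Return the time in seconds between successive R-peaks."""
    return [
        (later - earlier) / sampling_rate_hz
        for earlier, later in zip(r_peak_samples, r_peak_samples[1:])
    ]


def mean_heart_rate_bpm(rr_seconds):
    """Convert the mean RR interval into beats per minute."""
    mean_rr = sum(rr_seconds) / len(rr_seconds)
    return 60.0 / mean_rr


# Hypothetical R-peak positions (sample indices) in a signal sampled at 500 Hz.
r_peaks = [120, 520, 930, 1320]
rr = rr_intervals_seconds(r_peaks, sampling_rate_hz=500)
heart_rate = mean_heart_rate_bpm(rr)
print([round(x, 2) for x in rr], round(heart_rate, 1))
```

Irregular rhythms show up here as a wide spread in the RR values, while an altered heart rate shifts their mean.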

The Landscape of ECG Data Sources
When working with ECG data, researchers often turn to comprehensive resources like PhysioNet [PhysioNet Databases]. Established in 1999 under the National Institutes of Health (NIH), PhysioNet offers free access to a vast collection of physiological and clinical data, coupled with open-source software for analysis. Key components include:
PhysioBank: A digital archive containing ECG recordings from healthy individuals and patients with various cardiac conditions.
PhysioToolkit: A library of open-source software for processing and analyzing physiological signals.
PhysioNet hosts numerous ECG databases tailored to different research needs, such as studying arrhythmias or benchmarking algorithms. Some notable databases include:
MIMIC-III Waveform Database [1]: Contains a wide range of physiological signals, including ECG waveforms, from critical care patients.
ICENTIA11k [2]: A large-scale ECG dataset with 11,000 patients and over 2 billion labeled beats.
Chapman-Shaoxing 12-Lead ECG Database [3]: Focuses on arrhythmia research with 12-lead ECG recordings.
MIT-BIH Malignant Ventricular Ectopy Database (MVED) [4]: A collection of ECG recordings for detecting and analyzing cardiac arrhythmias, specifically malignant ventricular ectopy.
MIT-BIH Arrhythmia Database [5]: Widely used for developing and evaluating algorithms for cardiac arrhythmia detection, ECG signal processing, and machine learning applications, featuring 48 half-hour two-channel recordings and expert annotations of approximately 110,000 heartbeats.
CU Ventricular Tachyarrhythmia Database [6]: Contains 35 eight-minute ECG recordings of patients with sustained ventricular tachycardia, flutter, and fibrillation.
INCART Database [7]: The St. Petersburg Institute of Cardiological Technics (INCART) 12-lead Arrhythmia Database, with 75 annotated 30-minute recordings extracted from 32 Holter records.
PTB-XL [8] - A large publicly available electrocardiography dataset (version 1.0.1)
A Closer Look at the PTB-XL Database [8]
Among these valuable resources, the PTB-XL ECG Dataset [8] stands out as a large and freely accessible dataset, providing 21,837 clinical 12-lead ECG records from 18,885 patients. Each record is 10 seconds long. What makes PTB-XL particularly valuable is its inclusion of predefined train-test splits based on stratified sampling, which makes results easier to compare than with datasets that provide only raw data.
The diagnostic classes in the PTB-XL dataset are interconnected through a structured system that categorizes specific cardiac abnormalities and pathologies based on common characteristics and clinical interpretations. Each diagnostic class represents a broader category of ECG findings, while subclasses provide further granularity by specifying particular types of abnormalities within each class.
The linkage between classes is established through a classification scheme that organizes related diagnostic statements into hierarchical relationships. For instance, within the superclass "CD" (Conduction Disorders), subclasses like "LAFB/LPFB" (Left Anterior Fascicular Block/Left Posterior Fascicular Block) and "CRBBB" (Complete Right Bundle Branch Block) are grouped under this overarching category of conduction abnormalities.
Similarly, subclasses under the superclass "HYP" (Heart Hypertrophy) such as "LVH" (Left Ventricular Hypertrophy) and "RVH" (Right Ventricular Hypertrophy) are linked by their shared characteristic of hypertrophy in specific regions of the heart.
Furthermore, the subclasses within the superclass "MI" (Myocardial Infarction) are connected based on the location and nature of myocardial ischemic injury. Subclasses like "IMI" (Inferior Myocardial Infarction), "AMI" (Anterior Myocardial Infarction), and others represent distinct types of myocardial infarctions, each linked to the superclass through their pathological features.
By establishing these interconnections between diagnostic classes and subclasses, the PTB-XL dataset offers a comprehensive framework for classifying and understanding diverse ECG abnormalities and cardiac conditions.
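The class hierarchy described above can be sketched as a lookup from individual SCP codes to their superclass. This is a minimal sketch, not the notebook's actual code: the mapping below is a small hand-written excerpt for illustration only (the full mapping lives in scp_statements.csv), the function name is hypothetical, and the input is assumed to be a dictionary-like string of codes and likelihoods, as in the dataset's scp_codes field.

```python
import ast

# Hand-written excerpt of an SCP-code -> superclass mapping, for illustration
# only. The complete mapping is provided by scp_statements.csv.
SCP_TO_SUPERCLASS = {
    "NORM": "NORM",
    "IMI": "MI",
    "AMI": "MI",
    "LVH": "HYP",
    "RVH": "HYP",
    "CRBBB": "CD",
    "LAFB": "CD",
    "LPFB": "CD",
}


def aggregate_superclasses(scp_codes_field):
    """Map the SCP codes of one record to a sorted list of superclasses."""
    codes = ast.literal_eval(scp_codes_field)  # safely parse the dict string
    return sorted(
        {SCP_TO_SUPERCLASS[code] for code in codes if code in SCP_TO_SUPERCLASS}
    )


# Assumed example entry: two diagnostic codes plus one rhythm code (SR).
example = "{'IMI': 100.0, 'LAFB': 50.0, 'SR': 0.0}"
print(aggregate_superclasses(example))  # ['CD', 'MI']
```

Codes that have no diagnostic superclass, such as the rhythm code SR in the example, are simply ignored, so a record can end up with zero, one, or several superclass labels.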
The PTB-XL dataset comprises two main files:
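ptbxl_database.csv: Contains one row per ECG record, with metadata such as patient information, diagnostic annotations (SCP codes), the assigned stratified fold, and the paths to the waveform files. The signals themselves are stored separately in WFDB format at 100 Hz and 500 Hz.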
scp_statements.csv: Details the SCP-ECG statements used in the dataset, representing specific findings or characteristics in the ECG recordings. These statements provide structured and standardized information about diagnoses, forms, and rhythms, and are linked to the ECG records for integrated analysis.
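To show how the metadata file links to the signal files, here is a minimal sketch that picks the waveform path for a chosen sampling rate. It assumes the metadata exposes the columns filename_lr (100 Hz) and filename_hr (500 Hz), as in the public PTB-XL release. The function name and the two sample rows are made up for illustration.

```python
import csv
import io

# Column that holds the relative waveform path for each sampling rate.
PATH_COLUMN = {100: "filename_lr", 500: "filename_hr"}


def record_paths(metadata_file, sampling_rate_hz=100):
    """Return the relative WFDB record path of every row in the metadata."""
    column = PATH_COLUMN[sampling_rate_hz]
    return [row[column] for row in csv.DictReader(metadata_file)]


# Two made-up rows standing in for the real metadata file.
sample_csv = io.StringIO(
    "ecg_id,filename_lr,filename_hr\n"
    "1,records100/00000/00001_lr,records500/00000/00001_hr\n"
    "2,records100/00000/00002_lr,records500/00000/00002_hr\n"
)
paths_500 = record_paths(sample_csv, sampling_rate_hz=500)
print(paths_500)
```

Each returned path, joined with the dataset folder, can then be handed to a WFDB reader such as wfdb.rdsamp to load the 12-lead signal.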
The SCP-ECG statements fall into three categories:
Diagnostic statements: Describe diagnoses (e.g., inferior myocardial infarction, left ventricular hypertrophy, complete right bundle branch block).
Form statements: Describe the shape of the waveform (e.g., non-diagnostic T abnormalities, abnormal QRS, ventricular premature complex).
Rhythm statements: Describe the rhythm (e.g., sinus rhythm, atrial fibrillation, sinus tachycardia, sinus arrhythmia).
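The "strat_fold" column in PTB-XL assigns each record to one of ten folds created by stratified sampling, so that each fold maintains a representative distribution of characteristics. This is crucial for training robust machine learning models, especially with imbalanced datasets. The dataset authors recommend folds 1 to 8 for training, fold 9 for validation, and fold 10 for testing.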
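A fold-based split can be sketched in a few lines. This is a minimal sketch under stated assumptions: folds 1 to 8 are used for training, fold 9 for validation, and fold 10 for testing, which is a commonly used convention for PTB-XL. The function name and the toy records are hypothetical.

```python
def split_by_fold(records, val_fold=9, test_fold=10):
    """Split metadata rows into train/val/test using their strat_fold value."""
    splits = {"train": [], "val": [], "test": []}
    for record in records:
        fold = int(record["strat_fold"])
        if fold == test_fold:
            splits["test"].append(record)
        elif fold == val_fold:
            splits["val"].append(record)
        else:
            splits["train"].append(record)
    return splits


# Toy metadata: 20 made-up records spread evenly over folds 1..10.
records = [{"ecg_id": i, "strat_fold": (i % 10) + 1} for i in range(1, 21)]
splits = split_by_fold(records)
print({name: len(rows) for name, rows in splits.items()})
```

Because the folds are fixed in the metadata, every experiment that follows this convention evaluates on the same held-out records.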

![The black inset window displays diagnostic superclasses, while the plot window features a typical ECG waveform generated using the Neurokit2 [9] Python library based on PTB-XL data set. This visualization provides a clear representation of ECG data, highlighting key diagnostic insights.](https://static.wixstatic.com/media/48f00c_6553fcfcca044226845b19d93ee31b0e~mv2.jpg/v1/fill/w_147,h_83,al_c,q_80,usm_0.66_1.00_0.01,blur_2,enc_avif,quality_auto/48f00c_6553fcfcca044226845b19d93ee31b0e~mv2.jpg)
Training of Generative Models

Efficient Latent Space Sampling for Synthetic ECG Generation Using Circular Buffers. This video demonstrates the creation of synthetic ECG signals with a Variational Autoencoder (VAE). Random latent codes, representing abstracted ECG features, are drawn from the VAE's latent space and fed into a circular read buffer for efficient, pseudo-continuous generation. Once the buffer is full, the decoder converts these latent codes into synthetic ECG waveforms. The write buffer stores these outputs along with their Empirical Mode Decomposition (EMD) components t₁ and t₂.
In the right panel window, the red waveform depicts the synthetic ECG and the blue waveform shows the t₁ component on which it is based. A modified NeuroKit2 [9] ECG plotting function plots the ECG (t₁) and selectively renders ECGs based on their quality, either Excellent or Barely Acceptable. This procedure repeats at regular intervals to generate diverse ECG signals.
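The buffering scheme described above can be sketched with two fixed-size queues. This is a minimal sketch, not the implementation shown in the video: toy_decoder is a hypothetical stand-in for a trained VAE decoder, all sizes and names are assumptions, and the EMD and quality-filtering steps are left out.

```python
import math
import random
from collections import deque

LATENT_DIM = 4     # assumed size of one latent code
BUFFER_SIZE = 8    # assumed capacity of both circular buffers
SIGNAL_LEN = 16    # assumed length of one decoded waveform


def sample_latent(rng):
    """Draw one latent code from a standard normal prior."""
    return [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]


def toy_decoder(z):
    """Hypothetical stand-in for a trained decoder: a weighted sum of sines."""
    return [
        sum(
            weight * math.sin((k + 1) * 2 * math.pi * t / SIGNAL_LEN)
            for k, weight in enumerate(z)
        )
        for t in range(SIGNAL_LEN)
    ]


def generate(n_cycles, seed=0):
    """Fill the read buffer, decode it, and keep the latest waveforms."""
    rng = random.Random(seed)
    read_buffer = deque(maxlen=BUFFER_SIZE)
    write_buffer = deque(maxlen=BUFFER_SIZE)  # oldest outputs are overwritten
    for _ in range(n_cycles):
        # Fill the read buffer with fresh latent codes.
        while len(read_buffer) < BUFFER_SIZE:
            read_buffer.append(sample_latent(rng))
        # Once full, decode every code and store the result.
        while read_buffer:
            write_buffer.append(toy_decoder(read_buffer.popleft()))
    return write_buffer


waveforms = generate(n_cycles=2)
print(len(waveforms), len(waveforms[0]))
```

Because both buffers have a fixed capacity, memory use stays constant however long generation runs, and each new cycle replaces the oldest waveforms in the write buffer.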
References
1. Moody, B., Moody, G., Villarroel, M., Clifford, G. D., & Silva, I. (2020). MIMIC-III Waveform Database (version 1.0). PhysioNet. https://doi.org/10.13026/c2607m
2. Tan, S., Ortiz-Gagné, S., Beaudoin-Gagnon, N., Fecteau, P., Courville, A., Bengio, Y., & Cohen, J. P. (2022). Icentia11k Single Lead Continuous Raw Electrocardiogram Dataset (version 1.0). PhysioNet. https://doi.org/10.13026/kk0v-r952
3. Zheng, J., Guo, H., & Chu, H. (2022). A large scale 12-lead electrocardiogram database for arrhythmia study (version 1.0.0). PhysioNet. https://doi.org/10.13026/wgex-er52
6. Nolle, F. M., Badura, F. K., Catlett, J. M., Bowser, R. W., & Sketch, M. H. (1986). CREI-GARD, a new concept in computerized arrhythmia monitoring systems. Computers in Cardiology, 13, 515-518. CU Ventricular Tachyarrhythmia Database v1.0.0 (physionet.org)
7. Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. Ch., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C.-K., & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation, 101(23), e215-e220. http://circ.ahajournals.org/cgi/content/full/101/23/e215. St. Petersburg Institute of Cardiological Technics 12-lead Arrhythmia Database
8. Wagner, P., Strodthoff, N., Bousseljot, R., Samek, W., & Schaeffter, T. (2020). PTB-XL, a large publicly available electrocardiography dataset (version 1.0.1). PhysioNet. https://doi.org/10.13026/x4td-x982
9. Makowski, D., Pham, T., Lau, Z. J., Brammer, J. C., Lespinasse, F., Pham, H., Schölzel, C., & Chen, S. A. (2021). NeuroKit2: A Python toolbox for neurophysiological signal processing. Behavior Research Methods, 53(4), 1689-1696. https://doi.org/10.3758/s13428-020-01516-y

