Explorative Data Analysis of the ORCHESTRA Public Data Set¶

Welcome! The following statistics provide some visual insights into the ORCHESTRA Public Data Set. The Public Data Set comprises patient data from the ORCHESTRA cohort after data cleaning and includes patients documented up to January 17, 2023.

The ORCHESTRA Public Data Set originates from the central ORCHESTRA database. The data anonymisation pipeline is described by Jakob et al. in "Design and evaluation of a data anonymisation pipeline to promote Open Science on COVID-19" and "Statistical biases due to anonymization evaluated in an open clinical dataset from COVID-19 patients". The public data are anonymised according to our data protection concept. The anonymisation was carried out with the ARX software.

Copyright: This work is licensed under the Creative Commons Attribution Non-Commercial 4.0 License. With the use of this data you agree to include a proper acknowledgement of the ORCHESTRA study group in any work based on the data set. By working with this notebook you agree to maintain the confidentiality of the data set at all times and to not attempt to compromise or otherwise violate the privacy of the patients described. To view a copy of the license, visit https://creativecommons.org/licenses/by-nc/4.0/.

If you have any comments on the notebook, please drop us a message at support@orchestra-cohort.eu.

Data Set Structure¶

Here we provide information on the basic structure of the ORCHESTRA Public Data Set.

The data set consists of 3396 patients before anonymisation, 3026 patients after anonymisation, and 38 variables.

Each row represents the anonymised data of a single patient.
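The row and column layout can be checked directly after loading. The following is a minimal sketch, assuming the data are available as a CSV export and read with pandas; the inline sample is a hypothetical toy extract with three of the 38 variables and invented rows, not records from the Public Data Set.

```python
import io

import pandas as pd

# Toy stand-in for a CSV export of the data set (hypothetical rows).
toy_csv = io.StringIO(
    "age,gender,hospitalisation\n"
    "40 - 59 years,Female,Yes\n"
    "60 - 79 years,Male,No\n"
    ">= 80 years,Male,Yes\n"
)

df = pd.read_csv(toy_csv)

# One row per patient, one column per variable.
n_patients, n_variables = df.shape
print(f"{n_patients} patients, {n_variables} variables")
```

Applied to the full Public Data Set, the same check should report 3026 patients and 38 variables, in line with the figures given above.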

*The Clinical Phases are defined according to the WHO clinical progression scale:

To get to know the Public Data Set better, the values that each variable takes in this data set are listed below. Please be aware that the Public Data Set is only a part of the complete ORCHESTRA data set. Anonymisation may leave a variable with fewer distinct values than in the complete ORCHESTRA data set. For example, the variable 'gender' can also have the value 'diverse', but there is no patient with this gender in the Public Data Set.

age:
18 - 39 years, 40 - 59 years, 60 - 79 years, >= 80 years, nan

gender:
Female, Male

quarter_of_diagnosis:
Q1-2020, Q2-2020, Q3-2020, Q4-2020, Q1-2021, Q2-2021, Q3-2021, Q4-2021, Q1-2022

chronic_heart_disease:
No, Unknown, Yes

chronic_lung_disease:
No, Unknown, Yes

chronic_liver_disease:
No, Unknown, Yes

chronic_kidney_disease:
No, Unknown, Yes

active_tumor_cancer:
No, Unknown, Yes

auto_inflammatory_disease:
No, Yes

diabetes:
No, Yes

neurological_psychiatric_disease:
No, Yes

transplant:
No, Unknown, Yes

cigarette_abusus:
Former, No, Unknown, Yes

covid_vaccination:
No, Unknown, Yes

covid_therapy:
Unknown, Yes

dialysis:
No, Unknown, Yes

intensive_care_treatment:
No, Unknown, Yes

events_embolic:
No, Yes

events_pulmonary_embolism:
No, Yes

events_neurological:
No, Yes

events_cardiac:
No, Yes

events_bacterial_pneumonia:
No, Yes

highest_level_respiratory_support:
High flow, Invasive ventilation, Mask or nasal prongs, No oxygen, Non-invasive ventilation, None

most_severe_stage_acute:
Mild, Moderate, Severe

hospitalisation:
No, Yes

any_symptoms_acute:
No, Unknown, Yes

general_symptoms_acute:
No, Yes

neurological_symptoms_acute:
No, Yes

respiratory_symptoms_acute:
No, Yes

gastrointestinal_symptoms_acute:
No, Yes

systolic_blood_pressure:
100-119 mmHg, 120-139 mmHg, 140-159 mmHg, 160-179 mmHg, 80-99 mmHg, < 80 mmHg, > 179 mmHg, Unknown

diastolic_blood_pressure:
110-119 mmHg, 40-59 mmHg, 60-89 mmHg, 90-109 mmHg, < 40 mmHg, > 119 mmHg, Unknown

heart_frequency:
60-100/min, < 60/min, > 100/min, Unknown

peripheral_oxygen_saturation:
60-69 %, 70-79 %, 80-89 %, 90-95 %, 96-100 %, < 60 %, Unknown

respiratory_frequency:
16-20/min, 21-29/min, < 16/min, > 29/min, Unknown

type_of_discharge_acute:
Alive, Ambulant, Death, Referral to another insitution, Unknown

availability_6month_followup:
Yes

any_symptom_6month_followup:
No, Unknown, Yes
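A listing like the one above can be produced by collecting the distinct values of each column. The following is a minimal sketch, assuming the data are held in a pandas DataFrame; the helper name `observed_values` and the toy rows are hypothetical, and only the column names and category labels are taken from the list above.

```python
import pandas as pd

# Hypothetical toy frame; rows are invented for illustration.
df = pd.DataFrame({
    "age": ["18 - 39 years", ">= 80 years", None, "40 - 59 years"],
    "gender": ["Male", "Female", "Female", "Male"],
    "heart_frequency": ["> 100/min", "60-100/min", "Unknown", "< 60/min"],
})


def observed_values(frame: pd.DataFrame) -> dict:
    """Return the sorted distinct values of every column.

    Missing entries are reported as the string 'nan'.
    """
    return {
        col: sorted(frame[col].fillna("nan").astype(str).unique())
        for col in frame.columns
    }


values = observed_values(df)
for name, levels in values.items():
    print(f"{name}:")
    print(", ".join(levels))
```

Plain string sorting places labels that start with a digit before those starting with `<` or `>`, which is the order in which most of the binned variables appear above.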

1. Descriptive Analysis¶

The following descriptive statistics are computed in this section:

Quarter of diagnosis Distribution
Gender Distribution
Age Distribution
Age - Gender Distribution
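A joint distribution such as age by gender can be tabulated with a cross-tabulation. The following is a minimal sketch using `pandas.crosstab`; the four rows are invented toy data, and this is one possible way to obtain such counts, not necessarily the one used for the plots in this notebook.

```python
import pandas as pd

# Hypothetical toy rows using the category labels of the Public Data Set.
df = pd.DataFrame({
    "age": ["18 - 39 years", "40 - 59 years", "40 - 59 years", ">= 80 years"],
    "gender": ["Female", "Male", "Female", "Male"],
})

# Rows: age groups, columns: gender, cells: number of patients.
age_by_gender = pd.crosstab(df["age"], df["gender"])
print(age_by_gender)
```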

The number of patients before anonymisation is 3396.
The number of patients after anonymisation is 3026.
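The two counts imply how many records were removed during anonymisation. The short calculation below uses only the two figures stated above; the variable names are illustrative.

```python
n_before = 3396  # patients before anonymisation
n_after = 3026   # patients after anonymisation

n_removed = n_before - n_after
share_removed = n_removed / n_before

print(f"{n_removed} patients removed ({share_removed:.1%})")
```

This corresponds to 370 patients, or roughly 10.9 % of the original cohort.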

2. Patient status at the end of acute phase¶

The following descriptive statistics on the health status at the end of medical consultation are computed in this section:

Frequency of Health Status at the End of Medical Consultation
Hospitalisation in the acute phase
Intensive care treatment in the acute phase
Highest oxygen level reached in the acute phase

Note that we will use a filtered data set for computing the rates, which we describe below.
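As an illustration of what filtering before computing a rate can look like, the sketch below assumes that patients with the value 'Unknown' are excluded from the denominator. This filter is an assumption made for the example and may differ from the one actually applied in this notebook; the rows are invented toy data.

```python
import pandas as pd

# Hypothetical toy rows; the labels follow the intensive_care_treatment variable.
df = pd.DataFrame({
    "intensive_care_treatment": ["Yes", "No", "Unknown", "No", "Yes", "No"],
})

# ASSUMPTION: drop 'Unknown' so that only documented answers enter the denominator.
known = df[df["intensive_care_treatment"] != "Unknown"]

# Share of patients with documented status who received intensive care.
icu_rate = (known["intensive_care_treatment"] == "Yes").mean()
print(f"{icu_rate:.1%} of {len(known)} patients with known status")
```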

Frequency of Health Status at the End of Medical Consultation¶

Hospitalisation in the acute phase¶

Intensive care treatment in the acute phase¶

Invasive ventilation in the acute phase¶

3. Clinical Phases¶

From here on we will indicate the three clinical phases as

Mild Phase
Moderate Phase
Severe Phase

In the following we will plot:

Maximum phase reached by patients
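Because the three phases have a natural order, it can help to store them as an ordered categorical so that counts and plots follow Mild, Moderate, Severe instead of frequency order. The following is a minimal sketch with invented toy values; only the variable name `most_severe_stage_acute` and its three labels come from the data set.

```python
import pandas as pd

# Hypothetical toy values for the most severe stage per patient.
stages = pd.Series(
    ["Moderate", "Mild", "Severe", "Mild", "Moderate", "Mild"],
    name="most_severe_stage_acute",
)

# Declare the clinical order explicitly.
phase_order = ["Mild", "Moderate", "Severe"]
stages = stages.astype(pd.CategoricalDtype(categories=phase_order, ordered=True))

# Count patients per phase and sort by clinical order.
phase_counts = stages.value_counts().sort_index()
print(phase_counts)
```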

4. 6 Months Follow Up¶

In the following we will plot:

Any Symptom - 6MFU

type_of_discharge_acute	Before Anonymisation	After Anonymisation
Alive	2682	2412
Ambulant	623	539
Unknown	55	43
Referral to another insitution	28	24
Death	8	8
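To compare the two columns of the table above on a common scale, the counts can be converted to percentages of the respective cohort. The following is a minimal sketch using pandas; the counts and labels are copied from the table, while the column names `before` and `after` are chosen for the example.

```python
import pandas as pd

# Counts copied from the table above (labels kept exactly as in the data set).
counts = pd.DataFrame(
    {
        "before": [2682, 623, 55, 28, 8],
        "after": [2412, 539, 43, 24, 8],
    },
    index=["Alive", "Ambulant", "Unknown", "Referral to another insitution", "Death"],
)

# Column totals equal the cohort sizes before and after anonymisation.
totals = counts.sum()

# Percentage of each category within its column, rounded to one decimal.
shares = (counts / totals * 100).round(1)
print(shares)
```

The column totals reproduce the cohort sizes of 3396 and 3026, which is a useful consistency check on the table.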