In this tutorial you will practice what we've learned during the lessons on
- Relational Databases
- NoSQL Databases
- Programmatic Access
- Molgenis
NOTE In the directory named solutions, examples of solutions and answers to the questions posed in the tutorial are provided, divided into sections, one per chapter. The names of the example files start with the paragraph where the question is posed, e.g., 06-loads_data_into_tables.py in directory solutions/01-Relational-databases refers to paragraph numbered 6 of the Relational databases tutorial. Please use the solutions as a reference to compare your work, or in case you are stuck on some steps.
In this section we will perform some operations on a relational database that uses the OMOP Common Data Model.
We will talk about OMOP CDM in future lessons; for now, we will use it as a database with a schema and some preloaded data.
OMOP CDM can be loaded using different DBMSs. For this tutorial we will adopt PostgreSQL, run using Docker.
1. Clone the repository with the tutorials:
$ git clone https://github.com/crs4/bbmri-it-school-tutorials.git
2. Download the file Data_Management_Tutorial/omop.zip from https://space.crs4.it/s/JzzPC2wPGFiHR7y and extract the ZIP file in the tutorial-data-management/01-Relational-databases directory (i.e., the directory with the PostgreSQL docker compose file). The file omop.sql is extracted.
3. Run the compose:
docker compose up -d
The compose file contains two services:
- PostgreSQL (postgres)
- pgAdmin (pgadmin), the most common client to manage PostgreSQL
The postgres service in the compose file mounts the omop.sql file in the initdb directory, so the OMOP schema will be loaded automatically.
NOTE The schema is quite big (2.4 GB), so it will take some time to load. To check when PostgreSQL is ready, use docker compose logs -f postgres and wait for the message "PostgreSQL init process complete; ready for start up." Don't worry about the messages 'The role "root" does not exist.'
4. Access the pgAdmin web interface using a browser at the URL http://localhost:8888 and log in using the credentials user: admin@bbmri-school.it, password: password
5. Configure a new server:
- Click on "Add New Server"
- In the "General" tab, set the name bbmri-it-school-omop
- In the "Connection" tab, set the following parameters:
  - Host Name/Address: postgres (Question: why can't we use localhost?)
  - Port: 5432
  - Maintenance database: bbmri-it-school
  - Username: postgres
  - Password: postgres
- Click on "Save"
You should be connected to the database loaded before. The tree on the left should look like this:
6. This OMOP schema has only the terminology tables loaded; now we can proceed to load the data into the CDM tables. There is a series of .csv files in the tutorial-data-management/01-Relational-databases/omop_data directory that contains the data to be loaded into the CDM tables. Using Python, implement a script that reads the .csv files and loads the data into the corresponding tables (a sketch follows below).
NOTE Due to foreign key constraints you will have to follow this order: person, provider, visit_occurrence, visit_detail, observation, condition_occurrence, device_exposure, drug_exposure, death.
TIP To access the database use the psycopg2 package. To read the CSVs use the built-in csv package, or directly use the copy_expert method of the psycopg2 cursor.
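A minimal sketch of such a loader (one possible solution, not the reference one), assuming each table's data is in a CSV named after the table (e.g., omop_data/person.csv) whose header matches the table's columns, and using the connection parameters configured above:

import psycopg2

# CDM tables in foreign-key-safe loading order
TABLES = ["person", "provider", "visit_occurrence", "visit_detail", "observation",
          "condition_occurrence", "device_exposure", "drug_exposure", "death"]

conn = psycopg2.connect(host="localhost", port=5432, dbname="bbmri-it-school",
                        user="postgres", password="postgres")
with conn:
    with conn.cursor() as cur:
        for table in TABLES:
            with open(f"omop_data/{table}.csv") as f:
                # COPY ... FROM STDIN streams the CSV straight from the file object
                cur.copy_expert(
                    f"COPY omop_cdm.{table} FROM STDIN WITH (FORMAT csv, HEADER true)",
                    f,
                )
conn.close()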
7. Once the data is loaded, open a pgAdmin query tab (Query Tool button) so that we can run some queries and create some objects in the OMOP DB via pure SQL. First, let's run this aggregation query:
SELECT person_id, COUNT(*) AS observation_count FROM omop_cdm.observation GROUP BY person_id ORDER BY observation_count DESC
What is this query doing?
8. Try to write the queries that answer these business questions (a sketch for the first one follows below):
- Count the number of persons per gender;
- Get all the persons that were diagnosed with a specific diagnosis (choose one) each year. Remember that the Condition table is the one that carries the diagnosis information. In particular, the conditions are expressed as SNOMED codes and found in the column condition_source_value.
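A possible shape for the first query (a sketch; it groups on gender_concept_id, and you could additionally join the concept table to turn the ids into readable labels):

SELECT gender_concept_id, COUNT(*) AS person_count
FROM omop_cdm.person
GROUP BY gender_concept_id;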
9. Drop one of the indexes on the omop_cdm.concept_ancestor table (depending on your search_path, you may need to schema-qualify the name as omop_cdm.idx_concept_ancestor_id_1):
DROP INDEX idx_concept_ancestor_id_1
Now execute this query:
SELECT * FROM omop_cdm.concept_ancestor WHERE ancestor_concept_id = 45635110
How long does it take to execute? Now, recreate the index:
CREATE INDEX idx_concept_ancestor_id_1 ON omop_cdm.concept_ancestor(ancestor_concept_id);
This will take a long time. Now, re-execute the previous query. How long does it take now?
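To see where the time goes, you can also ask PostgreSQL for the actual execution plan: without the index the plan should show a sequential scan, with it an index scan:

EXPLAIN ANALYZE SELECT * FROM omop_cdm.concept_ancestor WHERE ancestor_concept_id = 45635110;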
10. Create a view named v_person_observation that contains the person_id and the observation_count for each person. Use the query shown in step 7 as a base (a possible definition is sketched below). In this view, find and annotate the count for the person '28'.
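A possible definition of the view, reusing the step 7 query (a sketch):

CREATE VIEW omop_cdm.v_person_observation AS
SELECT person_id, COUNT(*) AS observation_count
FROM omop_cdm.observation
GROUP BY person_id;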
11. Add this log table:
CREATE TABLE omop_cdm.log (
    id SERIAL PRIMARY KEY,
    action VARCHAR(255) NOT NULL,
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
12. Create a trigger that logs every time a new record is inserted in the omop_cdm.observation table. The trigger should insert a record in the omop_cdm.log table with the action "insert" and the current timestamp (a possible solution is sketched below).
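In PostgreSQL a trigger always calls a trigger function, so a possible solution needs both (a sketch):

CREATE OR REPLACE FUNCTION omop_cdm.log_observation_insert() RETURNS trigger AS $$
BEGIN
    -- the timestamp column gets its DEFAULT CURRENT_TIMESTAMP value
    INSERT INTO omop_cdm.log(action) VALUES ('insert');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_observation_insert
AFTER INSERT ON omop_cdm.observation
FOR EACH ROW EXECUTE FUNCTION omop_cdm.log_observation_insert();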
13. Add a new row in the observation table (note the quotes around '93027-1': left unquoted, PostgreSQL would evaluate it as the subtraction 93027 - 1):
INSERT INTO omop_cdm.observation(
    observation_id, person_id, observation_concept_id, observation_date,
    observation_datetime, observation_type_concept_id, value_as_number,
    value_as_string, value_as_concept_id, qualifier_concept_id,
    unit_concept_id, provider_id, visit_occurrence_id, visit_detail_id,
    observation_source_value, observation_source_concept_id, unit_source_value,
    qualifier_source_value, value_source_value, observation_event_id,
    obs_event_field_concept_id)
VALUES ('8100', '28', '40766239', '2025-07-01', null, '38000280', null, null,
    '0', 0, 0, 49, 1755, 1001755, '93027-1', 37020580, null, null, null, null, null);
14. Check the log table to see if the new record has been logged.
15. Check the view to see if the count for the person '28' has been updated.
In this section we will perform some operations on a NoSQL database that uses the MongoDB engine. We will run a simple MongoDB instance in a container using Docker, inspect the data, and perform some simple queries using the mongo shell.
1. In a shell, go inside the tutorial-data-management/02-NoSQL-databases/mongodb directory and build the image:
docker compose build
Then, once the build is completed, run the container:
docker compose up -d
2. Enter the container, using the command:
docker compose exec -it mongodb bash
3. Enter the mongo shell, using the command:
mongosh
4. Once inside the shell, change the database to use ours:
use biobankDB
5. List the collections in the database. There should be five collections: biobanks, diseases, patients, samples, sampletypes:
show collections
6. List the documents in the biobanks collection:
db.biobanks.find().pretty()
7. Filter the biobanks by name, e.g., to find the biobank with name "Biobanca di Ricerca Napoli":
db.biobanks.find({name: "Biobanca di Ricerca Napoli"}).pretty()
8. Execute one of the query examples in the queries_examples directory (simply paste the whole code into the shell). For each query, try to understand in detail what it is doing, according to the operators it uses.
9. Try to modify the "sample per year" query by also adding the sample type to the aggregation (a possible shape is sketched below).
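A possible shape for the modified aggregation, assuming each sample document has a collection date field (here called collection_date) and a sample type field (here called type); adapt the field names to the actual documents:

db.samples.aggregate([
  {
    $group: {
      // group by (year, sample type) instead of year only
      _id: { year: { $year: "$collection_date" }, type: "$type" },
      count: { $sum: 1 }
    }
  },
  { $sort: { "_id.year": 1, "_id.type": 1 } }
])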
In this part we will practice the creation of a database programmatically using SQLAlchemy and versioning of the database with Alembic.
We'll create a simple database based on this diagram:
erDiagram
PARTICIPANTS {
string participant_id PK
string first_name
string last_name
date date_of_birth
string gender "Possible Values: M for male, F for Female, U for Unknown"
}
SAMPLES {
string sample_id PK
date collection_date
string sample_type
string participant_id FK
}
DIAGNOSIS {
string diagnosis_id PK
string condition
date diagnosis_date
string severity
string participant_id FK
}
PARTICIPANTS ||--o{ SAMPLES : ""
PARTICIPANTS ||--o{ DIAGNOSIS : ""
We will use as DBMS the same PostgreSQL instance used for the 01-Relational-databases tutorial.
If you have doubts, you can use the SQLAlchemy and Alembic documentation as a reference.
1. First of all we need to create the directory and the virtual environment where we will install the needed dependencies:
$ mkdir biobank_manager        # Create the new directory
$ cd biobank_manager           # Enter the directory
$ python -m venv venv          # Create the virtual environment
$ source venv/bin/activate     # Activate the virtual environment (from now on the python commands will use the one in the venv)
2. Now we can install the dependencies:
pip install sqlalchemy alembic psycopg2
3. Create a tree structure like the one shown below:
biobank_manager/            # main directory
    biobank_manager/        # main python module
        __init__.py         # empty file for module
        conf.py             # file with configurations
        database/           # directory with the database definition
            __init__.py     # empty file for module
            models.py       # file with the SQLAlchemy models
            repository.py   # file with SQLAlchemy queries
4. In the conf.py add a DATABASE_URL variable using the following template; set the values for user, password and db_name:
"postgresql+psycopg2://{user}:{password}@localhost:5432/{db_name}"
Add also the DATABASE_SCHEMA_NAME variable with the value "biobank_manager"
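A possible conf.py, assuming the postgres credentials used in the first tutorial (adapt them if yours differ):

DATABASE_URL = "postgresql+psycopg2://postgres:postgres@localhost:5432/bbmri-it-school"
DATABASE_SCHEMA_NAME = "biobank_manager"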
5. In the models.py create three models based on the ER Diagram (a sketch follows below).
TIP Remember to create the Base class first. Use biobank_manager as schema name.
TIP To restrict the gender possible values, create a GenderEnum class and declare the attribute as Mapped[GenderEnum].
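A minimal sketch of models.py, assuming the SQLAlchemy 2.x declarative style; column types and sizes are illustrative choices:

import enum
from datetime import date

from sqlalchemy import Date, ForeignKey, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship

from biobank_manager.conf import DATABASE_SCHEMA_NAME


class GenderEnum(enum.Enum):
    M = "M"  # male
    F = "F"  # female
    U = "U"  # unknown


class Base(DeclarativeBase):
    pass


class Participant(Base):
    __tablename__ = "participants"
    __table_args__ = {"schema": DATABASE_SCHEMA_NAME}

    participant_id: Mapped[str] = mapped_column(String, primary_key=True)
    first_name: Mapped[str]
    last_name: Mapped[str]
    date_of_birth: Mapped[date] = mapped_column(Date)
    gender: Mapped[GenderEnum]  # rendered as an enum type in the database

    samples: Mapped[list["Sample"]] = relationship(back_populates="participant")
    diagnoses: Mapped[list["Diagnosis"]] = relationship(back_populates="participant")


class Sample(Base):
    __tablename__ = "samples"
    __table_args__ = {"schema": DATABASE_SCHEMA_NAME}

    sample_id: Mapped[str] = mapped_column(String, primary_key=True)
    collection_date: Mapped[date] = mapped_column(Date)
    sample_type: Mapped[str]
    participant_id: Mapped[str] = mapped_column(
        ForeignKey(f"{DATABASE_SCHEMA_NAME}.participants.participant_id")
    )

    participant: Mapped["Participant"] = relationship(back_populates="samples")


class Diagnosis(Base):
    __tablename__ = "diagnosis"
    __table_args__ = {"schema": DATABASE_SCHEMA_NAME}

    diagnosis_id: Mapped[str] = mapped_column(String, primary_key=True)
    condition: Mapped[str]
    diagnosis_date: Mapped[date] = mapped_column(Date)
    severity: Mapped[str]
    participant_id: Mapped[str] = mapped_column(
        ForeignKey(f"{DATABASE_SCHEMA_NAME}.participants.participant_id")
    )

    participant: Mapped["Participant"] = relationship(back_populates="diagnoses")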
6. In the main directory, create a script named create_schema.py that will initialize the database (a sketch is shown after this list). Add the instructions to:
- create the SQLAlchemy engine for the DATABASE_URL db
- create the schema DATABASE_SCHEMA_NAME
- create all the tables
7. Do not create the repository.py yet. Run the script:
$ python create_schema.py
This will run the database initialization.
8. Check the database in pgAdmin4. If everything is correct you should see the schema biobank_manager in the bbmri-it-school database with the 3 tables.
9. Let's populate the database with some data.
Create in the main directory a script named insert_data.py (a sketch follows below) that creates:
- 100 participants
- for each participant, a random number of samples between 1 and 10. Use this list of sample types:
["DNA", "RNA", "Blood", "Urine", "Feces", "Buffy Coat"]
- for each participant, a random number of diagnoses between 0 and 3. Use this list of diagnoses:
["Urinary tract infection", "Human immunodeficiency virus disease", "Sepsis", "Malignant neoplasm of breast", "Acute tonsillitis", "Acute appendicitis", "Streptococcus group A infection", "Iron deficiency anemia", "Hypocalcemia", "Fatty liver"]
NOTE To communicate with the database you need to create an engine (e.g., engine = create_engine(DATABASE_URL)) and a Session. You can use a Context Manager to do that (i.e., with Session(engine) as session).
TIP To create random names and surnames you can use the names-generator package.
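A possible insert_data.py (a sketch, assuming the models sketched above and the names-generator package: pip install names-generator):

import random
from datetime import date, timedelta

from names_generator import generate_name
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

from biobank_manager.conf import DATABASE_URL
from biobank_manager.database.models import (Diagnosis, GenderEnum,
                                             Participant, Sample)

SAMPLE_TYPES = ["DNA", "RNA", "Blood", "Urine", "Feces", "Buffy Coat"]
DIAGNOSES = ["Urinary tract infection", "Human immunodeficiency virus disease",
             "Sepsis", "Malignant neoplasm of breast", "Acute tonsillitis",
             "Acute appendicitis", "Streptococcus group A infection",
             "Iron deficiency anemia", "Hypocalcemia", "Fatty liver"]


def random_date(start: date, end: date) -> date:
    """Pick a uniformly random date between start and end."""
    return start + timedelta(days=random.randint(0, (end - start).days))


engine = create_engine(DATABASE_URL)
with Session(engine) as session:
    for i in range(100):
        first_name, last_name = generate_name(style="capital").split(" ", 1)
        participant = Participant(
            participant_id=f"P{i:03d}",
            first_name=first_name,
            last_name=last_name,
            date_of_birth=random_date(date(1940, 1, 1), date(2005, 12, 31)),
            gender=random.choice(list(GenderEnum)),
        )
        for j in range(random.randint(1, 10)):
            participant.samples.append(Sample(
                sample_id=f"S{i:03d}-{j}",
                collection_date=random_date(date(2015, 1, 1), date(2024, 12, 31)),
                sample_type=random.choice(SAMPLE_TYPES),
            ))
        for j in range(random.randint(0, 3)):
            participant.diagnoses.append(Diagnosis(
                diagnosis_id=f"D{i:03d}-{j}",
                condition=random.choice(DIAGNOSES),
                diagnosis_date=random_date(date(2015, 1, 1), date(2024, 12, 31)),
                severity=random.choice(["mild", "moderate", "severe"]),
            ))
        session.add(participant)
    session.commit()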
10. Create a new file database/repository.py and add some functions to perform some queries (a sketch of a few of them follows below):
- get_all_participants
- get_participant_by_id
- get_samples_for_participant_by_id
- get_samples_for_participant_by_first_name_and_last_name
- get_samples_by_type
- get_diagnosis_for_participant_by_id
- get_diagnosis_before_date
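A sketch of a few of these functions, using the SQLAlchemy 2.x select() style; the remaining ones follow the same pattern:

from datetime import date

from sqlalchemy import select
from sqlalchemy.orm import Session

from biobank_manager.database.models import Diagnosis, Participant, Sample


def get_all_participants(session: Session):
    return session.scalars(select(Participant)).all()


def get_samples_for_participant_by_id(session: Session, participant_id: str):
    return session.scalars(
        select(Sample).where(Sample.participant_id == participant_id)
    ).all()


def get_diagnosis_before_date(session: Session, before: date):
    return session.scalars(
        select(Diagnosis).where(Diagnosis.diagnosis_date < before)
    ).all()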
11. Create another script in the main directory, named query_data.py, that calls the functions (a sketch follows below).
NOTE As for the previous script, you should create the engine and the Session.
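A possible query_data.py calling a couple of the repository functions (a sketch):

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

from biobank_manager.conf import DATABASE_URL
from biobank_manager.database import repository

engine = create_engine(DATABASE_URL)
with Session(engine) as session:
    for p in repository.get_all_participants(session):
        samples = repository.get_samples_for_participant_by_id(session, p.participant_id)
        print(f"{p.first_name} {p.last_name}: {len(samples)} samples")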
In this section we will update the database schema and create migrations with Alembic.
1. First of all we need to create the alembic environment in our project.
Move into the root directory of the project biobank_manager/ and run the command to initialize alembic:
alembic init alembic
This will create:
- the alembic.ini file, which contains the alembic configuration
- the alembic/ directory, which will contain the migration scripts
2. Now we need to edit the alembic.ini file to set the URL of the database.
Search for the parameter sqlalchemy.url in the [alembic] section and set the usual URL of the database.
3. We are ready to create the baseline of the database. Run the following command:
alembic revision -m "baseline"
NOTE Does -m remind you of something?
The command will generate a file in the alembic/versions/ directory called <commit_id>_<commit_message>.py. Let's open the file <commit_id>_baseline.py. We can see some variables and two functions.
The variables are:
- revision: the id of the revision represented by the file
- down_revision: the id of the preceding revision. Since this is the very first revision, we leave this as None
NB: we can ignore branch_labels and depends_on for now.
The two functions are:
- upgrade: here we add the statements to change the database with modifications
- downgrade: here we add the statements to restore the database to the previous version
Since this is the baseline, we can leave the two functions empty.
4. It's time to make some changes to our database. We decide to update the Participant entity to add the place_of_birth and the ssn (Social Security Number), which in our case (we are in Italy!) is the "Codice Fiscale": it means it contains exactly 16 characters. We want both to be mandatory (i.e., NOT NULL). The ssn must also be unique.
Open models.py and change the Participant table to add the two attributes (a possible shape is sketched below).
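A possible shape for the two new attributes, to be added inside the Participant class (a sketch):

place_of_birth: Mapped[str] = mapped_column(String, nullable=False)
ssn: Mapped[str] = mapped_column(String(16), nullable=False, unique=True)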
5. Let's create our migration with the same command:
$ alembic revision -m "add place of birth and ssn"
Alembic created another script in the versions directory. Let's check it: in this case the revision contains the new value and the down_revision contains the id of the baseline revision.
6. The next step is to add the instructions to change the database in the upgrade function.
Here we show the first change, which adds the place_of_birth:

from alembic import op
import sqlalchemy as sa

...

def upgrade() -> None:
    # add_column adds the column defined with the sqlalchemy definition to the table passed as first argument
    op.add_column(
        "participants",  # NB: this is the name of the table (i.e., the __tablename__), not the SQLAlchemy model class name
        sa.Column(name="place_of_birth", type_=sa.String, nullable=False),
        schema="biobank_manager",
    )

Add the second instruction to add the ssn column. Remember that it must be non-null and that it has an exact length of 16 (a possible shape is sketched below).
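A possible shape for the second instruction (a sketch); note that, exactly like place_of_birth, adding a NOT NULL column will fail on the rows already present, as we are about to see:

op.add_column(
    "participants",
    sa.Column(name="ssn", type_=sa.String(16), nullable=False),
    schema="biobank_manager",
)
# the uniqueness is added as a separate named constraint
op.create_unique_constraint(
    "uq_participants_ssn", "participants", ["ssn"], schema="biobank_manager"
)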
7. Let's apply the changes with the alembic command:
$ alembic upgrade head
INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO  [alembic.runtime.migration] Will assume transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> 1227802f7ccb, baseline
INFO  [alembic.runtime.migration] Running upgrade 1227802f7ccb -> a3e5da281584, add place of birth and ssn
...
psycopg2.errors.NotNullViolation: column "place_of_birth" of relation "participants" contains null values
Doh! Something went wrong!
We added two new "not null" columns, but the rows already present in our database have null values for them.
We need to change the upgrade function so that we add values for the already present rows. To do that, we can't directly create the two columns as NOT NULL; we need to:
- Add the two columns as nullable
- Add the values for the already present rows
- Alter the two columns to be NOT NULL
As before, here is the example for place_of_birth. Add the part for ssn:

import random

ITALIAN_CITIES = ['Bari', 'Bologna', 'Cagliari', 'Catania', 'Firenze', 'Genova',
                  'Milano', 'Napoli', 'Palermo', 'Roma', 'Sassari', 'Tonara', 'Torino']


def upgrade() -> None:
    # Add nullable column
    op.add_column(
        "participants",
        sa.Column(name="place_of_birth", type_=sa.String, nullable=True),
        schema="biobank_manager",
    )
    conn = op.get_bind()
    # Fetch all person IDs
    results = conn.execute(
        sa.text("SELECT id, last_name, first_name, gender, date_of_birth FROM biobank_manager.participants")
    ).fetchall()
    # Assign a random city to each
    for row in results:
        city = random.choice(ITALIAN_CITIES)
        conn.execute(
            sa.text("UPDATE biobank_manager.participants SET place_of_birth = :city WHERE id = :id"),
            {"city": city, "id": row.id},
        )
    op.alter_column('participants', 'place_of_birth', nullable=False, schema='biobank_manager')
WARNING Of course, choosing a random city is not something you would do in the real world; here we're doing it for the sake of training.
TIP To generate the "codice fiscale" value you can adopt the python-codicefiscale package
As we have seen in the lessons, Molgenis is a platform that allows you to create and manage data models, import data, and create reports.
The purpose of this tutorial is to create with Molgenis the same schema created in the SQLAlchemy tutorial.
We will create the updated version, the one with the changes done with Alembic
Molgenis adopts a custom format, namely EMX2, which allows both the definition of the database's schema and the upload of the data using spreadsheets (Excel or CSV). In particular:
- molgenis.csv: contains the definition of the schema: the tables and, for each table, the attributes with references and constraints
- <table_name>.csv: these contain the data for the tables. The columns of the CSV are the names of the attributes defined in the schema. For example, if we have a table named Participants, the data should be added to the Participants.csv file
The molgenis.csv file is a CSV file that contains the definition of the schema. It has the following columns:
- tableName
- tableExtends
- columnName
- columnType
- label
- key
- required
- isReadonly
- description
- refSchema
- refTable
- refBack
- refLabel
- defaultValue
- validation
- message
- computed
- semantics
For a detailed explanation, you can check the official documentation
This is the header of the file. All the data following the header constitute the definition of the schema. Not all of the columns above need to be valued; let's analyze the most important ones:
- tableName: the name of the table
- columnName: the name of the attribute
- columnType: the type of the attribute, e.g., string, integer, date, boolean, ref (used for references to other tables)
- required: whether the attribute is required (i.e., not null) or not
- key: a number to indicate that the attribute is part of a key (primary or unique). The primary key has value 1. If a key is composed of multiple attributes, they must have the same number in this column
- refTable: if the columnType is a ref, it needs to specify the referred table (e.g., Participant for the participant attribute of the Sample entity)
- refBack: used for reverse references; its value names the column in the other table that holds the original ref pointing back to this one
1. Let's start creating our schema.
Here is the first part of the molgenis.csv file for our database:

tableName,tableExtends,columnName,label,columnType,key,required,isReadonly,description,refSchema,refTable,refBack,refLabel,defaultValue,validation,message,computed,semantics
Participants,,,,,,,Participants of the study,,,,,,,,,,,
Participants,,id,Identifier of the participant,int,1,,Identifier of the participant,,,,,,,,,,,
Participants,,first_name,First Name,string,,,First Name of the participant,,,,,,,,,,,
Participants,,last_name,Last Name,string,,,Last Name of the participant,,,,,,,,,,,
Participants,,gender,Gender,string,,,Gender of the participant,,,,,,,,,,,
Participants,,date_of_birth,Date of birth,date,,,Date of birth of the participant,,,,,,,,,,,
Participants,,place_of_birth,The place of birth,string,,,Place of birth of the participant,,,,,,,,,,,
Participants,,ssn,Social Security Number,string,,,The Social Security Number. In Italy corresponds to the "Codice Fiscale",,,,,,,,,,,

The first row of the CSV is the table itself: we recognize it because it just has the tableName and the label. The other rows define the attributes of the table Participants. Notice the 1 value in the key column of the attribute id: it means it is the primary key.
We leave to you the addition of the other two tables with their attributes to the CSV. Be careful with the ref column for the participant attribute.
2. We can now try to upload the schema in molgenis.
In the 04-molgenis-EMX2 directory you can find a docker-compose.yml file with the molgenis and postgres services.
Run the docker-compose.yml file:
docker compose up -d
3. Access molgenis using a browser at the link http://localhost:8080/apps/central/#/
You should see the pet store test schema.
4. Login using the admin/admin credentials. You should see this:
5. Click on the + button to add a new database and call it bbmri-it-school-biobank.
6. Go to the Up/Download section in the menu. Click on the Browse button and search for the molgenis.csv file created before. Finally, click on Import. If everything is correct you should see the following:
Note
At the beginning, the definition of the schema may have some issues. Molgenis gives feedback about the problems when importing the schema.
If you're having trouble defining the schema correctly, use the one in 04-Molgenis-EMX2/data.
7. Now you can navigate using the menu to the Schema section, where you can see the defined tables.
For each table defined in the molgenis.csv file, we have a corresponding .csv file that contains the data for that table.
Each csv file must have a header with the names of the fields, followed by a row for each record.
For example, the Participants.csv file will have the following header and data (one example record only):
id,first_name,last_name,gender,date_of_birth,place_of_birth,ssn
1,Funny,Davinci,M,1982-12-03,Bari,DVNFNY82B12A662U
1. Edit the three csv files: Participants, Samples and Diagnosis.
You have three choices:
- manually add some data to each csv
- write a python script similar to the one used to generate random data in SQLAlchemy
- only for the brave: write a python script that reads the data from the biobank_manager database and generates the corresponding csv for this new database (a sketch follows below). NB: the postgres db of biobank_manager is accessible at port 5432, the molgenis one at port 5434, so you can run both services at the same time.
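A sketch of the "brave" option for Participants: read from the biobank_manager database and write the EMX2 CSV. Table and column names here are assumptions; adapt them to your actual models:

import csv

from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://postgres:postgres@localhost:5432/bbmri-it-school"
)

with engine.connect() as conn, open("Participants.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "first_name", "last_name", "gender",
                     "date_of_birth", "place_of_birth", "ssn"])
    rows = conn.execute(text(
        "SELECT participant_id, first_name, last_name, gender, "
        "date_of_birth, place_of_birth, ssn "
        "FROM biobank_manager.participants"
    ))
    # each result row is written as one CSV record
    writer.writerows(rows)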
2. After you have created the csvs, you can upload the data as you did with the schema.
Molgenis allows you to create reports on the data using direct SQL queries on the database.
In this section we will create a simple report as an example.
1. Go to the Report section using the menu and click on the + button.
Then click the ✏️ icon to edit the Report. You will see an error message: ignore it and click the ✏️ icon next to View report id=uniqueid.
2. Fill the form with the definition of the report. Set:
- the id of the report (without blank spaces)
- a description
- the SQL query to select the participant id and the number of samples for each participant (a possible query is sketched below)
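A possible shape for the query (a sketch, assuming the ref column participant of the Samples table is stored under that name; adapt the identifiers to your schema):

SELECT participant, COUNT(*) AS sample_count
FROM "Samples"
GROUP BY participant
ORDER BY participant;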
Click the Save Button
You should see a table with the results of the query
In this last section we will see how Molgenis provides functionality to run scripts and make analyses on the data using python.
We will create a very simple script that uses the Molgenis PyClient just to write to a file the count of samples for each type.
1. First of all, go to the Jobs & Scripts section.
2. Then click the + button. You will see a form to define the script. The script has the following inputs:
- name: just the name of the script
- type: Python or Bash. We will use Python
- script: the code of the script
- dependencies: the libraries that the script needs to perform the analysis (e.g., numpy or molgenis-emx2-pyclient)
- outputFileExtension: the extension of the file produced by the script. Set txt
- failureAddress: the email address to notify if the job fails
- cron: a string to schedule the script at a planned time
In the script, add the following code:
import os
from collections import defaultdict

from molgenis_emx2_pyclient import Client

token = os.environ.get("MOLGENIS_TOKEN")
outfile = os.environ.get("OUTPUT_FILE")

with Client(url='http://localhost:8080', token=token) as client:
    samples = client.get(table="Samples", columns=["type", "participant"],
                         schema="bbmri-it-school-biobank")
    sample_type_count = defaultdict(int)
    for s in samples:
        sample_type_count[s['type']] += 1
    with open(outfile, "w") as f:
        for t, c in sample_type_count.items():
            f.write(f"There are {c} samples of type: {t}\n")
Some points to notice:
- The two environment variables MOLGENIS_TOKEN and OUTPUT_FILE. These are always passed by Molgenis when the script is run inside Molgenis. Indeed, the same script can be run from outside the server.
  - The token is a string used by the client to authenticate with Molgenis when performing calls to the REST API. The token is valid for the local server.
  - The OUTPUT_FILE is the name of the file where the script writes its output.
- The url is http://localhost:8080. localhost here means the internal address in the molgenis docker container, i.e., the script is referring to the server it runs on. If we run the script from outside, we should use the external url of molgenis.
Here is the result of the filled form:
Click on Save Scripts.
3. To run the script click on the play button, don't add any parameter, and wait for the script to finish.
4. Finally, go to the Jobs tab, where you can see the result of the run.
Notice the last column, output, where you can check the output file produced by the script.