Mastering XML with Python: A Guide to Handling ODM-XML Files
ODM-XML files might seem daunting, but Python can simplify their handling.

Context
In clinical studies, it's common to work with datasets that can come in various formats, depending on what the chosen EDC system can offer. However, in some situations, it will be necessary to extract a complete dataset, according to what regulatory entities require for the approval submissions of drug and vaccine use, mainly. From that point, there isn't much choice; an XML document will always appear as the most facilitated option. For those used to dealing with all types of data, some effort might be needed to transform it into something that can be used in various analyses. And for those used to dealing only with structured data (like spreadsheets), panic sets in. What now? What to do from there?
In this article, we will explore some of the secrets hidden in an ODM-XML document, as well as how we can uncover these secrets and produce a spreadsheet that makes sense for clinical study data analysis.
Why XML?
In general, there are some good reasons to use XML when working with data. For example:
You can create a file that works with many data manipulation tools. Since an XML document is a text file, it can be easily transferred between different technologies. Many technologies can read and process XML for this reason (some examples can be found here).
It maintains data integrity, which is crucial when working with databases. This is achieved by combining metadata and data in a single file, ensuring all resources are available without needing to import other files (we will explore this in more detail throughout the article)1.
The organization and categorization of data are more efficient, making it easier and faster to find specific information within an XML document.
What is ODM?
ODM stands for Operational Data Model. According to CDISC, "it is a format that can be used to store and exchange data between data management systems, as well as to store data, metadata, and administrative data related to a clinical study." For some health regulatory entities, like the FDA in the United States 2 3 4 5, CDISC has suggested this format as the standard for data archiving because of its ability to carry all relevant information for clinical studies in a single interchangeable file when needed.
The ODM model offers an XML model that facilitates the capture of clinical data; from a schema like the one in the figure below, we can divide the XML document into two main parts: the metadata, which provides the definition of the variables to be used, and the actual data, where the clinical study information is stored.

The schema of an XML document for clinical data, according to the ODM model established by CDISC. 6

Another schema of an XML document for clinical data. 7
Importing libraries
When using Python to extract data from .xml documents, you need to import some libraries. If you use platforms like Anaconda or WinPython, simply using import might be enough. If you're using a "pure" version of Python, I recommend installing the libraries using pip before proceeding with the following steps.
In this article, we will use libraries such as:
requests, the simplest library for making web page requests (because life is complex enough);lxml, another easy-to-use library for processing XML files;pandas, the most well-known library for data manipulation and analysis; with it, you can create a spreadsheet with the XML data.
import requests
from lxml import etree #In this case, we are going to use the ElementTree API available in the lxml library.
import pandas as pd
Getting the data
The .xml document we will use comes from a CDISC GitHub repository. To obtain it directly, i.e., without downloading any files, we will use the requests library. The get function of this library makes a request to a GitHub page containing the file and waits for a response from that page, which we are storing in the response variable. If the response is code 200, it means the request was successful.
url = 'https://github.com/cdisc-org/DataExchange-ODM/raw/main/examples/Demographics_RACE/Demographics_RACE_check_all_that_apply.xml'
response = requests.get(url)
print(response)
<Response [200]>
From this response, we will use the content object to actually get the XML that will be explored. The content will be available in the tree variable, as shown in the code below:
tree = response.content
The variable tree will be with us throughout the entire process of building the spreadsheet from the information we have.
Transforming and Checking the XML Structure
After extracting the content from the GitHub page, you can observe the structure of the XML that will be explored when the content is transformed into a readable variable. This is important for locating the tags, attributes, and values where the relevant data is found.
From here, we begin using another library we imported: lxml, which is dedicated to obtaining the XML elements mentioned earlier. We will store the entire XML in the variable tree, where the content from the GitHub page will be passed along with the type of parser (the transformer) that will be used. The variable tree2, in this case, serves to allow us to actually observe the "skeleton" of the XML.
tree = etree.XML(tree, etree.XMLParser(remove_comments=True))
tree2 = etree.tostring(tree, pretty_print = True, encoding = str)
print(tree2)
<ODM xmlns="http://www.cdisc.org/ns/odm/v2.0" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" CreationDateTime="2020-07-06T10:20:15+01:00" FileOID="DEMOGRAPHICS_EXAMPLE" FileType="Snapshot" Granularity="Metadata" ODMVersion="2.0" SourceSystem="XML4Pharma CDISC ODM Study Designer" SourceSystemVersion="2015-R1">
<Study OID="ST.DEMOGRAPHICS_EXAMPLE" StudyName="Study with Demographics example" ProtocolName="MyStudy">
<Description><TranslatedText xml:lang="en" Type="text/plain">Demographics example with Race, with "check all that apply"</TranslatedText></Description>
<MetaDataVersion Name="Version 1" OID="MV.1.0">
<Description><TranslatedText xml:lang="en" Type="text/plain">Version 1</TranslatedText></Description>
<StudyEventDef Name="Screening visit with demographics" OID="SE.SCREENING" Repeating="No" Type="Scheduled">
<ItemGroupRef ItemGroupOID="FO.DEMOGRAPHICS" Mandatory="Yes"/>
</StudyEventDef>
<ItemGroupDef Name="Demographics form" OID="FO.DEMOGRAPHICS" Type="Form" Repeating="No">
<ItemGroupRef ItemGroupOID="IG.DEMOGRAPHICS" Mandatory="Yes"/>
</ItemGroupDef>
<ItemGroupDef Name="Demographics" OID="IG.DEMOGRAPHICS" Type="Section" Repeating="No">
<ItemRef ItemOID="IT.DOB" Mandatory="Yes"/>
<ItemRef ItemOID="IT.SEX" Mandatory="Yes"/>
<ItemRef ItemOID="IT.ETHNIC" Mandatory="Yes"/>
<ItemGroupRef ItemGroupOID="IG.RACE" Mandatory="Yes"/>
</ItemGroupDef>
...
<CodeList DataType="integer" Name="Sex" OID="CL.SEX">
<CodeListItem CodedValue="1">
<Decode>
<TranslatedText xml:lang="en" Type="text/plain">Male</TranslatedText>
</Decode>
</CodeListItem>
<CodeListItem CodedValue="2">
<Decode>
<TranslatedText xml:lang="en" Type="text/plain">Female</TranslatedText>
</Decode>
</CodeListItem>
</CodeList>
<CodeList DataType="integer" Name="Ethnicity" OID="CL.ETHNIC">
<CodeListItem CodedValue="1">
<Decode>
<TranslatedText xml:lang="en" Type="text/plain">Hispanic</TranslatedText>
</Decode>
</CodeListItem>
<CodeListItem CodedValue="2">
<Decode>
<TranslatedText xml:lang="en" Type="text/plain">Non-hispanic</TranslatedText>
</Decode>
</CodeListItem>
</CodeList>
...
</MetaDataVersion>
</Study>
<ClinicalData StudyOID="ST.DEMOGRAPHICS_EXAMPLE" MetaDataVersionOID="MV.1.0">
<SubjectData SubjectKey="001">
<StudyEventData StudyEventOID="SE.SCREENING">
<ItemGroupData ItemGroupOID="FO.DEMOGRAPHICS">
<ItemGroupData ItemGroupOID="IG.DEMOGRAPHICS">
<ItemData ItemOID="IT.DOB"><Value>1957-05-07</Value></ItemData>
<ItemData ItemOID="IT.SEX"><Value>1</Value></ItemData>
<ItemData ItemOID="IT.ETHNIC"><Value>2</Value></ItemData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="1">
<ItemData ItemOID="IT.RACE_CODE"><Value>1</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>false</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="2">
<ItemData ItemOID="IT.RACE_CODE"><Value>2</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>true</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="3">
<ItemData ItemOID="IT.RACE_CODE"><Value>3</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>false</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="4">
<ItemData ItemOID="IT.RACE_CODE"><Value>4</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>4</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="5">
<ItemData ItemOID="IT.RACE_CODE"><Value>5</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>false</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="6">
<ItemData ItemOID="IT.RACE_CODE"><Value>99</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>false</Value></ItemData>
</ItemGroupData>
</ItemGroupData>
</ItemGroupData>
</StudyEventData>
</SubjectData>
...
<SubjectData SubjectKey="003">
<StudyEventData StudyEventOID="SE.SCREENING">
<ItemGroupData ItemGroupOID="FO.DEMOGRAPHICS">
<ItemGroupData ItemGroupOID="IG.DEMOGRAPHICS">
<ItemData ItemOID="IT.DOB"><Value>1961-06-09</Value></ItemData>
<ItemData ItemOID="IT.SEX"><Value>2</Value></ItemData>
<ItemData ItemOID="IT.ETHNIC"><Value>1</Value></ItemData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="1">
<ItemData ItemOID="IT.RACE_CODE"><Value>1</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>false</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="2">
<ItemData ItemOID="IT.RACE_CODE"><Value>2</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>false</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="3">
<ItemData ItemOID="IT.RACE_CODE"><Value>3</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>false</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="4">
<ItemData ItemOID="IT.RACE_CODE"><Value>4</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>false</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="5">
<ItemData ItemOID="IT.RACE_CODE"><Value>5</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>false</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="6">
<ItemData ItemOID="IT.RACE_CODE"><Value>99</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>true</Value></ItemData>
<ItemData ItemOID="IT.RACEOTH"><Value>Native Amazonian</Value></ItemData>
</ItemGroupData>
</ItemGroupData>
</ItemGroupData>
</StudyEventData>
</SubjectData>
</ClinicalData>
</ODM>
Checking the namespaces
When exploring an XML, it's essential to be aware of the namespaces it contains. Namespaces can provide a unique identification for each component tag. The use of namespaces is recommended by the W3C and is very necessary when we have tags or attributes with similar names but associated with different elements or tags. To see which namespaces appear in the XML document, use the nsmap object, as shown below:
ns = tree.nsmap
print(ns)
{None: 'http://www.cdisc.org/ns/odm/v2.0', 'xs': 'http://www.w3.org/2001/XMLSchema', 'xlink': 'http://www.w3.org/1999/xlink'}
Checking tags
Now that we know about the namespaces, we can analyze what the names of the tags are. These names will allow us to extract the data of interest later on.
elements = []
for elem in tree.iter():
elements.append(elem.tag)
elements = list(set(elements))
print(elements)
['{http://www.cdisc.org/ns/odm/v2.0}TranslatedText', '{http://www.cdisc.org/ns/odm/v2.0}Description', '{http://www.cdisc.org/ns/odm/v2.0}ItemGroupDef', '{http://www.cdisc.org/ns/odm/v2.0}Decode', '{http://www.cdisc.org/ns/odm/v2.0}ClinicalData', '{http://www.cdisc.org/ns/odm/v2.0}ODM', '{http://www.cdisc.org/ns/odm/v2.0}ItemGroupData', '{http://www.cdisc.org/ns/odm/v2.0}ItemRef', '{http://www.cdisc.org/ns/odm/v2.0}Question', '{http://www.cdisc.org/ns/odm/v2.0}CodeListItem', '{http://www.cdisc.org/ns/odm/v2.0}SubjectData', '{http://www.cdisc.org/ns/odm/v2.0}ItemGroupRef', '{http://www.cdisc.org/ns/odm/v2.0}MetaDataVersion', '{http://www.cdisc.org/ns/odm/v2.0}StudyEventData', '{http://www.cdisc.org/ns/odm/v2.0}Study', '{http://www.cdisc.org/ns/odm/v2.0}Value', '{http://www.cdisc.org/ns/odm/v2.0}CodeListRef', '{http://www.cdisc.org/ns/odm/v2.0}StudyEventDef', '{http://www.cdisc.org/ns/odm/v2.0}ItemDef', '{http://www.cdisc.org/ns/odm/v2.0}ItemData', '{http://www.cdisc.org/ns/odm/v2.0}CodeList']
Extracting the first attributes
Knowing how the XML file is structured and which tags are present, it's possible to select where the data of interest is located. The next step is to analyze what's inside the tags. This content is called attributes. Attributes are internal parts of the tags that follow a pattern 'Name="Value"'. Notice the attribute 'ItemOID' at the end of the first line; it is accompanied by a value "IT.DOB". This is the pattern used and recommended by the W3C when it comes to attributes. 8
print(etree.tostring(tree.find('.//ItemData', ns), pretty_print = True, encoding = str))
<ItemData xmlns="http://www.cdisc.org/ns/odm/v2.0" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" ItemOID="IT.DOB">
<Value>1957-05-07</Value>
</ItemData>
To access the attributes of a specific tag, you can use two functions from lxml, depending on the situation: find and findall. The first one is used to retrieve only the first occurrence of a tag in the XML file, while the other returns all occurrences of tags with the mentioned name.
tree.find('.//ClinicalData', ns) #Only the first occurence of the tag 'ClinicalData'.
<Element {http://www.cdisc.org/ns/odm/v2.0}ClinicalData at 0x21052103e80>
tree.findall('.//SubjectData', ns) #Todas as ocorrências da tag SubjectData
[<Element {http://www.cdisc.org/ns/odm/v2.0}SubjectData at 0x21052102e00>,
<Element {http://www.cdisc.org/ns/odm/v2.0}SubjectData at 0x21052108540>,
<Element {http://www.cdisc.org/ns/odm/v2.0}SubjectData at 0x21052108840>]
Notice that when we use these functions, the return is an object of the class Element; however, this is not exactly what we are looking for. To find out what the tags and their attributes really are, we can use two objects with suggestive names: attrib and tag. For attributes, the results return as a dictionary, where the key is the attribute's name, and the value is the attribute's value. The tag function, on the other hand, returns the names of the mentioned tags.
tree.find('.//ClinicalData', ns).attrib #The attributes of the tag 'ClinicalData' (the only tag, in this case).
{'StudyOID': 'ST.DEMOGRAPHICS_EXAMPLE', 'MetaDataVersionOID': 'MV.1.0'}
for subject in tree.findall('.//SubjectData', ns):
subj = subject.attrib #Attributes of each 'SubjectData' tag.
print(subj)
{'SubjectKey': '001'}
{'SubjectKey': '002'}
{'SubjectKey': '003'}
tree.find('.//SubjectData', ns).tag #Name of the tag.
'{http://www.cdisc.org/ns/odm/v2.0}SubjectData'
for subject in tree.findall('.//SubjectData', ns):
for eve in subject:
print(eve.attrib) #All attributes of the 'SubjectData' tag.
{'StudyEventOID': 'SE.SCREENING'}
{'StudyEventOID': 'SE.SCREENING'}
{'StudyEventOID': 'SE.SCREENING'}
for subject in tree.findall('.//SubjectData', ns):
for event in subject:
for item in event:
ite = item.attrib
print(ite) #All the attributes of 'ItemGroupOID'
{'ItemGroupOID': 'FO.DEMOGRAPHICS'}
{'ItemGroupOID': 'FO.DEMOGRAPHICS'}
{'ItemGroupOID': 'FO.DEMOGRAPHICS'}
If you want to extract only the value of a specific attribute, just select the attribute using its name in brackets, similar to how you find the value of a specific key stored in a dictionary:
print(subj['SubjectKey']) #Value of the last 'SubjectKey' attribute in the 'SubjectData' tag.
003
Checking the XML structure for just one subject
We have already seen how to filter one or more tags by their names and discover their attributes and values. If it is interesting or necessary to analyze the XML structure of just one tag occurrence (in this case, an individual) more deeply, you can use the find function, which we already know. With the code below, we can check the XML structure for the first subject.
print(etree.tostring(tree.find('.//SubjectData', ns), pretty_print = True, encoding = str))
<SubjectData xmlns="http://www.cdisc.org/ns/odm/v2.0" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" SubjectKey="001">
<StudyEventData StudyEventOID="SE.SCREENING">
<ItemGroupData ItemGroupOID="FO.DEMOGRAPHICS">
<ItemGroupData ItemGroupOID="IG.DEMOGRAPHICS">
<ItemData ItemOID="IT.DOB"><Value>1957-05-07</Value></ItemData>
<ItemData ItemOID="IT.SEX"><Value>1</Value></ItemData>
<ItemData ItemOID="IT.ETHNIC"><Value>2</Value></ItemData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="1">
<ItemData ItemOID="IT.RACE_CODE"><Value>1</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>false</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="2">
<ItemData ItemOID="IT.RACE_CODE"><Value>2</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>true</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="3">
<ItemData ItemOID="IT.RACE_CODE"><Value>3</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>false</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="4">
<ItemData ItemOID="IT.RACE_CODE"><Value>4</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>4</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="5">
<ItemData ItemOID="IT.RACE_CODE"><Value>5</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>false</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="6">
<ItemData ItemOID="IT.RACE_CODE"><Value>99</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>false</Value></ItemData>
</ItemGroupData>
</ItemGroupData>
</ItemGroupData>
</StudyEventData>
</SubjectData>
Now, if we want to filter the tags and attributes related to a specific subject, we need to mention the SubjectKey attribute and specify the subject's value (in this case, '002'), as shown in the code below:
print(etree.tostring(tree.find('.//SubjectData[@SubjectKey="002"]', ns), pretty_print = True, encoding = str))
<SubjectData xmlns="http://www.cdisc.org/ns/odm/v2.0" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" SubjectKey="002">
<StudyEventData StudyEventOID="SE.SCREENING">
<ItemGroupData ItemGroupOID="FO.DEMOGRAPHICS">
<ItemGroupData ItemGroupOID="IG.DEMOGRAPHICS">
<ItemData ItemOID="IT.DOB"><Value>1975-01-31></Value></ItemData>
<ItemData ItemOID="IT.SEX"><Value>2</Value></ItemData>
<ItemData ItemOID="IT.ETHNIC"><Value>2</Value></ItemData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="1">
<ItemData ItemOID="IT.RACE_CODE"><Value>1</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>1</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="2">
<ItemData ItemOID="IT.RACE_CODE"><Value>2</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>false</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="3">
<ItemData ItemOID="IT.RACE_CODE"><Value>3</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>true</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="4">
<ItemData ItemOID="IT.RACE_CODE"><Value>4</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>false</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="5">
<ItemData ItemOID="IT.RACE_CODE"><Value>5</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>false</Value></ItemData>
</ItemGroupData>
<ItemGroupData ItemGroupOID="IG.RACE" ItemGroupRepeatKey="6">
<ItemData ItemOID="IT.RACE_CODE"><Value>99</Value></ItemData>
<ItemData ItemOID="IT.RACE_BOOLEAN"><Value>false</Value></ItemData>
</ItemGroupData>
</ItemGroupData>
</ItemGroupData>
</StudyEventData>
</SubjectData>
Looking for data from just one tag
Just as we can analyze how an entire XML document is structured, it is possible to specify a tag and separate its structure, allowing us to examine how the tag itself is arranged, as well as the tags connected to it.
print(etree.tostring(tree.find('.//ItemData', ns), pretty_print = True, encoding = str))
<ItemData xmlns="http://www.cdisc.org/ns/odm/v2.0" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" ItemOID="IT.DOB">
<Value>1957-05-07</Value>
</ItemData>
Now, check the Value tag: it has a value we want to extract, but it's not in an attribute; it's text between the tags. To access the content between the opening and closing tags, we can use the text class from lxml.
tree.find('.//Value', ns).text
'1957-05-07'
We can do the same with the tag name, this time using the tag class.
tree.find('.//Value', ns).tag
'{http://www.cdisc.org/ns/odm/v2.0}Value'
Making spreadsheets
Now that we know some basic points about XML and how to manipulate them using Python libraries, we can move on to the next phase: creating spreadsheets (or producing DataFrames). But before we do that, with what we've already learned, we'll organize a data extraction scheme using lxml.
Our goal here is to create a DataFrame containing the date of birth, gender, and ethnicity data for each individual recorded in the XML document. To achieve this, we'll adopt a strategy of establishing a for loop, so that with each iteration, we can extract the data associated with the SubjectData tag and its descendants. After completing the iterations, we'll store the data in a results list, whose outcome is shown below. This is an interesting method because the extracted values are organized into three distinct groups, which will become the columns of our future DataFrame.
results = []
for ide in tree.findall('.//SubjectData', ns):
for subj in ide.findall('.//StudyEventData/ItemGroupData/ItemGroupData/ItemData', ns):
for value in subj:
results.append([ide.attrib['SubjectKey'], subj.attrib['ItemOID'], value.text])
results
[['001', 'IT.DOB', '1957-05-07'],
['001', 'IT.SEX', '1'],
['001', 'IT.ETHNIC', '2'],
['002', 'IT.DOB', '1975-01-31>'],
['002', 'IT.SEX', '2'],
['002', 'IT.ETHNIC', '2'],
['003', 'IT.DOB', '1961-06-09'],
['003', 'IT.SEX', '2'],
['003', 'IT.ETHNIC', '1']]
Once the list is created and stored in the results variable, we can easily use pandas to make a DataFrame with the pd.DataFrame function. It's easy because we can use the list we produced directly, without needing more transformations. In this function, we'll pass the list as an argument and specify the names of the columns we want to appear: ID, Variable, Value. It's possible, without any issues, to assign any name to each column.
results = pd.DataFrame(results, columns=['ID', 'Variable', 'Value'])
results
| ID | Variable | Value | |
| 0 | 001 | IT.DOB | 1957-05-07 |
| 1 | 001 | IT.SEX | 1 |
| 2 | 001 | IT.ETHNIC | 2 |
| 3 | 002 | IT.DOB | 1975-01-31> |
| 4 | 002 | IT.SEX | 2 |
| 5 | 002 | IT.ETHNIC | 2 |
| 6 | 003 | IT.DOB | 1961-06-09 |
| 7 | 003 | IT.SEX | 2 |
| 8 | 003 | IT.ETHNIC | 1 |
We have a DataFrame, but we need to go a bit further before considering it complete. The main idea here is to have results with only one ID per row, ensuring all data related to that ID is in the same row. Here, we see that we could use the values from the Variable column as column names, and what's in the Value column would be the values for each column by ID.
The good news is that pandas allows us to do this without any problem. What we will perform now is a pivot of the data we have. For this, we can use two functions: pivot and pivot_table, with a slight difference between them. The function chosen here is pivot_table, where we will use results as the data source, the ID column as the temporary index of the DataFrame, the Variable column to name the new columns, and the Value column to provide the values for the columns. Additionally, we need to invoke a function in the aggfunc argument, which serves as a function to perform calculations or organize data in a specific order (the first data to appear, the last, etc.). In this case, we will use the first function because we simply want the first (and only) value to appear in the DataFrame. Finally, we will use reset_index, making ID a manipulable column again.
results = pd.pivot_table(results, index='ID', columns='Variable', values='Value', aggfunc='first').reset_index()
results
| Variable | ID | IT.DOB | IT.ETHNIC | IT.SEX |
| 0 | 001 | 1957-05-07 | 2 | 1 |
| 1 | 002 | 1975-01-31> | 2 | 2 |
| 2 | 003 | 1961-06-09 | 1 | 2 |
This way, we are giving a more definitive appearance to our DataFrame. However, notice that the column names, as well as the values in the last two sheets, still do not clearly tell us what information is being shown. By looking more closely at the XML document, we can see that there is a part that provides valuable information to give more meaning to the data we have: the metadata section. This is what we will explore from now on.
Searching for the metadata of the XML document (and improving the spreadsheet)
Metadata are the definitions of each variable and value within the XML document. It is interesting, even necessary, to obtain them so we can truly understand what each column and value represents in a DataFrame. In this specific document, the metadata we are interested in are located in the tags named ItemDef, and we can extract them with the block of code illustrated below:
for meta in tree.findall('.//ItemDef', ns):
print(meta.attrib)
{'DataType': 'date', 'Name': 'Date of birth', 'OID': 'IT.DOB'}
{'DataType': 'integer', 'Length': '1', 'Name': 'Sex', 'OID': 'IT.SEX'}
{'DataType': 'integer', 'Length': '1', 'Name': 'Ethnicity', 'OID': 'IT.ETHNIC'}
{'OID': 'IT.RACE_CODE', 'Name': 'Race code', 'DataType': 'integer', 'Length': '1'}
{'DataType': 'boolean', 'Length': '1', 'Name': 'Race', 'OID': 'IT.RACE_BOOLEAN'}
{'DataType': 'text', 'Length': '20', 'Name': 'Other Race', 'OID': 'IT.RACEOTH'}
Notice that the output is a series of dictionaries with keys indicating aspects like 'Name' and 'OID' (a unique ID for each object in the XML). We can extract these to convert the coded column names into names that show what each column actually represents. To do this, we'll create a dictionary with each column's OID as the key and the column name as the value, as shown below:
names = {}
for meta in tree.findall('.//ItemDef', ns):
names[meta.attrib['OID']] = meta.attrib['Name']
names
{'IT.DOB': 'Date of birth',
'IT.SEX': 'Sex',
'IT.ETHNIC': 'Ethnicity',
'IT.RACE_CODE': 'Race code',
'IT.RACE_BOOLEAN': 'Race',
'IT.RACEOTH': 'Other Race'}
Now, with the rename function from pandas, we can change the column names in results, and things start to make more sense.
results = results.rename(columns=names)
results
| Variable | ID | Date of birth | Ethnicity | Sex |
| 0 | 001 | 1957-05-07 | 2 | 1 |
| 1 | 002 | 1975-01-31> | 2 | 2 |
| 2 | 003 | 1961-06-09 | 1 | 2 |
We also have two columns here that are in numeric codes: Ethnicity and Sex. The definitions of these codes are usually at the beginning of the XML document, along with the metadata. To get the names of the sex and ethnicity codes, in this case, there is a tag that will be our target: CodeList. The code below shows how the tag is structured:
print(etree.tostring(tree.find('.//CodeList', ns), pretty_print = True, encoding = str))
<CodeList xmlns="http://www.cdisc.org/ns/odm/v2.0" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" DataType="integer" Name="Sex" OID="CL.SEX">
<CodeListItem CodedValue="1">
<Decode>
<TranslatedText xml:lang="en" Type="text/plain">Male</TranslatedText>
</Decode>
</CodeListItem>
<CodeListItem CodedValue="2">
<Decode>
<TranslatedText xml:lang="en" Type="text/plain">Female</TranslatedText>
</Decode>
</CodeListItem>
</CodeList>
Now, recall the previous method we used to get column names. Basically, the principle for obtaining the values of numeric codes is the same; however, it is a bit more complex, requiring a few more lines of code. When accessing the CodeList tag, if we want to extract values for Ethnicity, we need to filter by mentioning the name of this attribute using XPATH. This comes right after the tag name; thus, we can create a new dictionary, ethnicity, to hold codes and values for the ethnicities represented in the DataFrame.
ethnicity = {}
for eth in tree.findall('.//CodeList[@Name="Ethnicity"]', ns):
for code in tree.findall('.//CodeList[@Name="Ethnicity"]/CodeListItem', ns):
for decode in code:
for name in decode:
ethnicity[code.attrib['CodedValue']] = name.text
ethnicity
{'1': 'Hispanic', '2': 'Non-hispanic'}
To get the values of the codes in the Sex column, just use the same method as before, only replacing the name of the attribute to be filtered.
sex = {}
for eth in tree.findall('.//CodeList[@Name="Sex"]', ns):
for code in tree.findall('.//CodeList[@Name="Sex"]/CodeListItem', ns):
for decode in code:
for name in decode:
sex[code.attrib['CodedValue']] = name.text
sex
{'1': 'Male', '2': 'Female'}
With the dictionaries ready, we can proceed with the remaining substitutions. Here, we'll use the pandas map function, which associates the keys (i.e., the numeric codes) of the dictionaries we created with the values that name the numbers. We'll do this for both columns, Ethnicity and Sex; notice that the spreadsheet now makes much more sense and is ready for the desired analyses.
results['Ethnicity'] = results['Ethnicity'].map(ethnicity)
results['Sex'] = results['Sex'].map(sex)
results
| Variable | ID | Date of birth | Ethnicity | Sex |
| 0 | 001 | 1957-05-07 | Non-hispanic | Male |
| 1 | 002 | 1975-01-31> | Non-hispanic | Female |
| 2 | 003 | 1961-06-09 | Hispanic | Female |
To save the DataFrame that was created, simply use the built-in pandas functions for this purpose. This way, the extracted information will be stored in a file that is lighter and faster to handle.
## Exportando
results.to_csv('results.csv') #.csv file
results.to_excel('results.xlsx') #.xlsx file
results.to_parquet('results.parquet') #.parquet file
Notes
1. This is succinctly and very well supported in Shabo et al. (2006): “The Clinical Data Interchange Standards Consortium (CDISC) is leading the development of standards to improve data quality and accelerate product development in the pharmaceutical industry.19 The CDISC model focuses on the use of metadata, and the approach is to combine XML representation with the tabular presentation traditionally used for clinical-trial data.” - Shabo, A., S. Rabinovici-Cohen, e P. Vortman. “Revolutionary impact of XML on biomedical information interoperability”. IBM Systems Journal 45, nº 2 (2006): 361–72. https://doi.org/10.1147/sj.452.0361. ↩
2. “Interest in ODM as a research topic has grown significantly over the last several years with increasing interest in the CDISC data standards from regulatory authorities such as the FDA and the Japanese Pharmaceutical and Medical Devices Agency (PMDA)” - Hume, Sam, Jozef Aerts, Surendra Sarnikar, e Vojtech Huser. “Current Applications and Future Directions for the CDISC Operational Data Model Standard: A Methodological Review”. Journal of Biomedical Informatics 60 (abril de 2016): 352–62. https://doi.org/10.1016/j.jbi.2016.02.016. ↩
3. “While it is a requirement to submit pre-clinical and clinical data in CDISC format to regulatory bodies such as the US FDA and Japan’s Pharmaceuticals and Medical Devices Agency (PDMA), the actual usage of CDISC standards spans a much wider array of entities.” Hufstedler, Heather, Yannik Roell, Andressa Peña, Ankur Krishnan, Ian Green, Adriano Barbosa-Silva, Andreas Kremer, et al. “Navigating data standards in public health: A brief report from a data-standards meeting”. Journal of Global Health 14 ([s.d.]): 03024. https://doi.org/10.7189/jogh.14.03024. ↩
4. "The CDISC Operational Data Model (ODM) is an XML format that facilitates the exchange of clinical data captured during a clinical study. ODMbased files contain the study data and the associated descriptions of the data items, their groupings into Case Report Forms (CRFs), which are electronic documents to record the study data, and the associated questions and code lists. Furthermore, the FDA has mandated the use of other CDISC standards in clinical studies." Leroux, Hugo, e Laurent Lefort. “Semantic Enrichment of Longitudinal Clinical Study Data Using the CDISC Standards and the Semantic Statistics Vocabularies”. Journal of Biomedical Semantics 6, nº 1 (dezembro de 2015): 16. https://doi.org/10.1186/s13326-015-0012-6. ↩
5. “The Federal Drug Administration has mandated the use of the CDISC standards for the electronic capture and reporting of clinical study data” Leroux, Hugo, Alejandro Metke-Jimenez, e Michael J. Lawley. “Towards Achieving Semantic Interoperability of Clinical Study Data with FHIR”. Journal of Biomedical Semantics 8, nº 1 (19 de setembro de 2017): 41. https://doi.org/10.1186/s13326-017-0148-7. ↩
6. Lefort, Laurent, e Hugo Leroux. “Design and generation of Linked Clinical Data Cubes”, 2013. https://doi.org/10.13140/RG.2.1.3677.2967. ↩
7. Brix, Tobias Johannes, Philipp Bruland, Saad Sarfraz, Jan Ernsting, Philipp Neuhaus, Michael Storck, Justin Doods, Sonja Ständer, e Martin Dugas. “ODM Data Analysis—A Tool for the Automatic Validation, Monitoring and Generation of Generic Descriptive Statistics of Patient Data”. PLOS ONE 13, nº 6 (22 de junho de 2018): e0199242. https://doi.org/10.1371/journal.pone.0199242. ↩
8. An important note: From now on, in addition to Python libraries and XML, we will use another language, this one for querying: XPath. With it, we can properly access XML elements and attributes. I won't go into details about it in this article, but if you want to understand more about it, you can see more details here and here, as well as a good cheat sheet here. We won't use many different things from this language here, but it's interesting to explore it further later.. ↩






