Todd Mike CDISC SDTM Mapping 2009
Disclaimer
The views and opinions expressed in the following PowerPoint slides are those of the individual presenter and should not be attributed to Drug Information Association, Inc. (DIA), its directors, officers, employees, volunteers, members, chapters, councils, Special Interest Area Communities or affiliates, or any organization with which the presenter is employed or affiliated. These PowerPoint slides are the intellectual property of the individual presenter and are protected under the copyright laws of the United States of America and other countries. Used by permission. All rights reserved. Drug Information Association, DIA and DIA logo are registered trademarks or trademarks of Drug Information Association Inc. All other trademarks are the property of their respective owners.
Introduction
Current SDTM mapping methodology is well-established but limited
Many companies use some version of a metadata-driven ETL mapping system
However, such a system requires a mapping expert to define the metadata
The number of experts is limited
We need a fully automated expert system to convert clinical trial data to SDTM on a massive scale
Current Technology
All code is developed to be generic, using the metadata to indicate when variations are required
New studies only require changes to metadata
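The idea of generic code driven by metadata can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the metadata layout, the raw column names (`PROTO`, `PATID`, `EVENT`), and the function `apply_mapping` are hypothetical and are not taken from the presenter's system, which is described as being built on SAS.

```python
# Hypothetical mapping metadata: one rule per target SDTM variable.
# "source" names a raw column; "constant" supplies a fixed value instead.
AE_METADATA = [
    {"target": "STUDYID", "source": "PROTO"},
    {"target": "DOMAIN", "constant": "AE"},
    {"target": "USUBJID", "source": "PATID"},
    {"target": "AETERM", "source": "EVENT", "transform": str.upper},
]


def apply_mapping(raw_rows, metadata):
    """Generic mapper: every study-specific detail lives in `metadata`."""
    mapped = []
    for raw in raw_rows:
        out = {}
        for rule in metadata:
            if "constant" in rule:
                value = rule["constant"]
            else:
                value = raw[rule["source"]]
            transform = rule.get("transform")
            if transform is not None:
                value = transform(value)
            out[rule["target"]] = value
        mapped.append(out)
    return mapped


# Illustrative raw record; the column names are assumptions.
raw = [{"PROTO": "FAB-10", "PATID": "X312", "EVENT": "headache"}]
sdtm_ae = apply_mapping(raw, AE_METADATA)
```

In this sketch a new study would supply a different metadata list, while `apply_mapping` stays unchanged.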
[Figure: project directory structure. Each project (Project 1, Project 2, etc.) has its own Data, Programs, and Reports.]
ETL Process
Why it Works
Role of standards
Standards drive the process
The target has a standard structure, so the process can be standardized
While source variables differ, commonalities can be exploited
Knowledge required
CDISC Standards
Understanding of raw data issues
Study design
Limited derivation
Technical Lead
Dataflow
[Figure: dataflow diagram]
Documents and data: SDTM V3.1.1 Guide, Protocol, Statistical Analysis Plan, Raw Data, Annotated CRF, SDTM Data
Roles: Data Integration Specialist, Mapping Specialist, Statistician, Technical Lead
Trial Design Datasets: TE, TA, TV, TI, TS
SDTM Datasets and SAS Programs: SE.XPT and SE.SAS, SV.XPT and SV.SAS, AE.XPT and AE.SAS, SUPPAE.XPT and SUPPAE.SAS, CM.XPT and CM.SAS, SUPPCM.XPT and SUPPCM.SAS, DM.XPT and DM.SAS, ...
Project-Level Metadata, Project-Level SAS Macros
DEFINE.XML
Limitations
Requires experts
Severely limited throughput, relative to the amount of clinical trial data
Converting legacy data on a systemic scale is infeasible
Future Directions
Challenge
Convert unstructured information such as text into relational tables that can be used to generate code to create SDTM & DEFINE.XML
To create this system, imagine thinking like a computer:
You have sources of information
You have a set of rules
You have a store of knowledge available
Apply heuristics to create SDTM datasets with a certain probability of accuracy
Sources of Information
Data
Main source of information
Can assume data exists, while protocol & CRF may not for legacy studies
CRF
Usually this is an image; can it be processed?
The system must handle tests that may not exist today but would still fit into findings, events, or interventions.
Each dataset contains keys: variable(s) that enable datasets to be joined together
Dates and times have a sequence
Discoverable by sorting
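One way a system might look for key variables is sketched below. It is an illustrative assumption, not the presenter's method: it checks uniqueness with sets rather than by sorting, the function name `candidate_keys` is hypothetical, and the sample rows are invented, reusing the column labels C1 to C3 from the example dataset in this deck.

```python
from itertools import combinations


def candidate_keys(rows, max_width=3):
    """Return the narrowest column combinations whose values are unique per row."""
    if not rows:
        return []
    columns = list(rows[0])
    for width in range(1, max_width + 1):
        found = [
            combo
            for combo in combinations(columns, width)
            if len({tuple(r[c] for c in combo) for r in rows}) == len(rows)
        ]
        if found:
            # Stop at the first width that yields any key.
            return found
    return []


# Invented sample rows for illustration only.
rows = [
    {"C1": "FAB-10", "C2": "X312", "C3": 1},
    {"C1": "FAB-10", "C2": "X312", "C3": 2},
    {"C1": "FAB-10", "C2": "X313", "C3": 1},
]
keys = candidate_keys(rows)
```

No single column is unique in the sample, so the search widens to pairs and reports the pair C2 and C3. Columns that appear in the keys of several datasets are natural candidates for joining them.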
Identifying C1
The left-most column is often a protocol number
A mixture of letters, numbers, and special characters is probably a code
A dictionary lookup returns no hits for meaningful terms
If the sponsor is known, there may be a list of protocols for lookup
Identifying C2
The same things we noted for C1 also apply to C2. FAB-10 is as likely to be a protocol number as X312. It is only because C1 has the same value for all records in the dataset that we can conclude with a high probability that FAB-10 is a protocol number.
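The reasoning for C1 and C2 can be sketched as a small rule set. This is a hedged illustration only: the regular expression, the labels returned, and the function names are assumptions, the dictionary lookup for meaningful terms is omitted, and the value `X313` is invented.

```python
import re

# Assumed definition of "code-like": at least one letter and one digit,
# built only from letters, digits, and a few separator characters.
CODE_PATTERN = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9\-_/.]+$")


def looks_like_code(value):
    return bool(CODE_PATTERN.match(value))


def classify_code_column(values, known_protocols=frozenset()):
    """Label a column using the constant-value and lookup heuristics."""
    distinct = set(values)
    if not distinct or not all(looks_like_code(v) for v in distinct):
        return "not a code"
    if distinct <= known_protocols:
        # A sponsor's protocol list, when available, settles the question.
        return "protocol (confirmed by lookup)"
    if len(distinct) == 1:
        # The same code on every record suggests a study-level identifier.
        return "probably protocol"
    return "code, probably not protocol"


c1 = ["FAB-10", "FAB-10", "FAB-10"]
c2 = ["X312", "X312", "X313"]
c1_label = classify_code_column(c1)
c2_label = classify_code_column(c2)
```

Both columns pass the code-like test, so only the constant value in C1 separates them, which matches the argument above.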
Identifying C3
Is it a sequence number?
C3 contains only integers
Increasing series from 1 to n with some gaps and some ties
Most subjects have the same number of records
Implies a series of checkboxes on the CRF, preprinted choices
If we select distinct C2, C3, there should be only one record for each combination.
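The checks listed for C3 could be collected as evidence along the following lines. This is a sketch under stated assumptions: the function `sequence_evidence`, the names of the fields it returns, and the sample rows are all hypothetical, and how the evidence would be weighed into a probability is left open.

```python
from collections import Counter


def sequence_evidence(rows, subject_col, candidate_col):
    """Collect simple evidence that candidate_col is a within-subject sequence number."""
    if not rows:
        raise ValueError("no rows to inspect")
    values = [r[candidate_col] for r in rows]
    all_integers = all(
        isinstance(v, int) and not isinstance(v, bool) for v in values
    )
    # One record per (subject, candidate) combination.
    pairs = [(r[subject_col], r[candidate_col]) for r in rows]
    unique_within_subject = len(set(pairs)) == len(pairs)
    # Share of subjects that have the most common number of records.
    per_subject = Counter(r[subject_col] for r in rows)
    subjects_at_mode = Counter(per_subject.values()).most_common(1)[0][1]
    return {
        "all_integers": all_integers,
        "unique_within_subject": unique_within_subject,
        "share_with_modal_count": subjects_at_mode / len(per_subject),
    }


# Invented sample: two subjects with records 1 to 3, one subject with a gap.
rows = [
    {"C2": "X312", "C3": 1}, {"C2": "X312", "C3": 2}, {"C2": "X312", "C3": 3},
    {"C2": "X313", "C3": 1}, {"C2": "X313", "C3": 2}, {"C2": "X313", "C3": 3},
    {"C2": "X314", "C3": 1}, {"C2": "X314", "C3": 3},
]
evidence = sequence_evidence(rows, "C2", "C3")
```

A high share of subjects with the same record count would be consistent with preprinted choices on the CRF, although the threshold for "most subjects" is a design decision this sketch does not make.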
Identifying C4
Identifying C5
Identifying C6 and C7
If C4 is the body system and C5 the status, the remaining columns probably are verbatim descriptions
There are several disease-related words
Appears to be verbatim text
Unclear why there are multiple columns of information
Probably a legacy data structure with each description in a separate column
Definitive Identification
Summary
Current SDTM mapping technology depends on experts
This severely limits throughput relative to all the legacy data needed for a comprehensive clinical trial database
A fully automated expert system that can perform SDTM conversions with a high probability of accuracy is a promising approach