About

I am a health data scientist and clinical epidemiologist who generates real-world evidence (RWE) from population-scale health data. I design and run cohort and observational studies that turn electronic health records (EHR), registries, and other real-world data into evidence supporting clinical, regulatory, and commercial decisions. My methods span causal inference, comparative effectiveness, survival analysis, mixed-effects and latent-class models, machine learning, and clinical NLP/LLMs, implemented in R, Python, and SQL.

As a Research Fellow at NTU Singapore, I work on PRECISE-SG100K — a multi-ancestry Asian population cohort of ~100,000 participants whose deep phenotypes and whole-genome data are linked to electronic health records and analysed within the secure TRUST platform. I build reproducible endpoint-analysis pipelines for chronic diseases (cardiovascular disease, type 2 diabetes, chronic kidney disease, liver disease, and cancer) and LLM pipelines that normalise free-text medications to OMOP/RxNorm concepts.

Before NTU, my Oxford DPhil used hospital electronic health records to improve infection management by modelling infection-response trajectories, antibiotic prescribing, and drug dosing. As a postdoctoral health data scientist, I led a national study of rare-disease prevalence and COVID-19 burden across 62.5M people in the NHS England Secure Data Environment and collaborated on genetic analyses of shared mechanisms between hypertension and type 2 diabetes. Earlier, I worked on the industry side of RWE at IQVIA and Oracle (Cerner Enviza).

Research interests

Real-world evidence and comparative effectiveness — designing and running cohort and observational studies that turn EHR, registries, and linked population data into evidence for clinical, regulatory, and commercial decisions
Infectious disease epidemiology and antimicrobial stewardship — modelling infection presentation, treatment response, and antibiotic prescribing patterns to inform clinical management and stewardship policy
Longitudinal phenotyping and trajectory modelling — characterising disease-course heterogeneity using latent-class mixed models, time-series biomarker analysis, and subgroup discovery in longitudinal EHR data
Clinical NLP and large language models — applying NLP and LLMs across the EHR pipeline for feature extraction, clinical phenotyping, medication normalisation, disease staging, and cohort identification from free-text records, including fine-tuning and evaluating domain-specific models
Multi-source data linkage and population-scale studies — linking EHR, genomics, registries, and mortality records to study disease burden and outcomes at national scale across diverse populations