Health Statistics for Data Scientists

Module code: HPDM096
Academic year: 2021/2
Dr Eilis Hannon (Convenor)

Professor William Henley (Convenor)

Duration: Term123
This module provides a broad introduction to statistical modelling for data scientists. It starts by considering the different stages of a statistical investigation and emphasises the importance of problem formulation. It highlights the benefits of exploratory data analysis based on descriptive statistics and graphs. Key concepts in probability theory and the role of statistical distributions in modelling health data will be covered. The core part of the module provides a foundation in regression modelling, including simple linear regression, logistic regression, survival analysis and models that account for complex temporal and hierarchical data structures. In the later part of the module, you will learn advanced techniques for making causal inferences about the effectiveness of health interventions, including instrumental variable analysis, regression discontinuity designs and the difference-in-differences method. Throughout the module, you will gain practical experience of statistical computing using the R software environment and exposure to case studies based on real-world health data.

The aim of the module is to provide a modern statistical framework for answering health research questions through interrogation of health datasets derived from randomised trials, electronic health records or other observational studies. The module will equip you with the theoretical underpinning and computational skills needed for advanced regression modelling of health data. Both frequentist and Bayesian approaches to modelling will be considered and contrasted. The module will consider potential sources of bias when key variables are unmeasured or contain missing values, and explore a range of advanced statistical methods for strengthening causal inferences in real-world health evaluations. It will also emphasise the fundamental role of the statistician as a problem solver and consider the different stages of the “problem solving” cycle. Case studies will be used to help you develop an appreciation of modelling strategy and to give you practical experience of interpreting model findings in the context of real health problems.

On successfully completing the module you will be able to...

  • 1. Examine and apply fundamental concepts in statistical modelling and inference, including conditional probability, statistical distributions, sampling variability, estimators, bias and likelihood functions
  • 2. Critically evaluate the Bayesian and frequentist frameworks for statistical inference including their strengths, limitations and differences
  • 3. Examine the theoretical basis of linear regression, generalised linear models and survival analysis
  • 4. Apply a range of regression and causal inference methods to address health data science problems
  • 5. Critically evaluate the strengths and limitations of different statistical methods, including regression models and causal inference methods, within a health data science project

On successfully completing the module you will be able to...

  • 6. Formulate health research questions as statistical problems
  • 7. Draw conclusions from the results of a data analysis and justify those conclusions, appropriately acknowledging uncertainty in the results

On successfully completing the module you will be able to...

  • 8. Use the R software environment for statistical computing
  • 9. Understand and critically appraise academic research papers in the research field
  • 10. Communicate arguments, evidence and conclusions effectively, using a variety of formats in a manner appropriate to the intended audience

Syllabus plan

Whilst the module’s precise content may vary from year to year, an example of an overall structure is as follows:

  • Formulating statistical problems
  • Statistical computing using R
  • Exploratory data analysis
  • Probability theory and statistical distributions
  • Interval estimation and hypothesis testing including parametric and non-parametric methods
  • Likelihoods and maximum likelihood estimation
  • Bayesian and frequentist inference
  • Monte Carlo simulation
  • Power and sample size calculations
  • Linear regression modelling
  • Generalised linear models
  • Longitudinal and survival analysis
  • Multilevel models for hierarchical data
  • Missing data mechanisms and multiple imputation
  • Causal inference for healthcare evaluations including adjustment methods for addressing measured and unmeasured confounding

Potential changes to Learning & Teaching due to COVID-19:

  • Face-to-face scheduled lectures may be replaced by short pre-recorded videos for each topic (15-20 minutes) and/or brief overview lectures delivered via MS Teams/Zoom, with learning consolidated by self-directed learning resources and ELE activities.

  • Small-group discussion in tutorials and seminars may be replaced by synchronous group discussion on Teams/Zoom, or by asynchronous online discussion, for example via Yammer or the ELE discussion board.

  • Workshops involving face-to-face classroom teaching may be replaced by synchronous sessions on Teams/Zoom, or by asynchronous workshop activities supported by a discussion forum.

  • Skills workshops involving practical skills acquisition demonstrations may be replaced by short pre-recorded videos as pre-learning, or by a workshop via Teams/Zoom.

  • Face-to-face meetings with dissertation supervisors may be replaced by meetings supported by email/phone/Teams/Zoom, and some lab/data projects may be replaced by literature or data projects only.


Potential changes to Assessment due to COVID-19:

  • Written examinations (e.g. timed, invigilated, closed-book formal exam) may be replaced by an online equivalent (e.g. timed, non-invigilated, open-book, online exam).

  • Presentations (e.g. PowerPoint-based presentation to a group in a face-to-face setting) may be replaced by a PowerPoint-based presentation to the group using Teams/Zoom, or by submission of a narrated PowerPoint.

  • Practical skills, or contribution to discussions, which are usually observed in class, may be replaced by observation via Teams/Zoom or monitoring of discussion boards, or may be replaced with a different assessment format.

Learning activities and teaching methods (given in hours of study time)

Scheduled Learning and Teaching Activities | Guided independent study | Placement / study abroad

Details of learning activities and teaching methods

Category | Hours of study time | Description
Scheduled Learning and Teaching | 15 | Lectures (10 x 1.5 hours)
Scheduled Learning and Teaching | 20 | Computer-based workshops (10 x 2 hours)
Guided independent study | 115 | Background reading and preparation for module assessments


Form of assessment | Size of the assessment (e.g. length / duration) | ILOs assessed | Feedback method
Multiple choice questions will be given as part of the workshops, and will be self-assessed | 10 questions for each workshop session | All | Oral, from staff where required

Coursework | Written exams | Practical exams

Details of summative assessment

Form of assessment | % of credit | Size of the assessment (e.g. length / duration) | ILOs assessed | Feedback method
Examination including a mixture of multiple choice questions and traditional exam questions | 50 | 1 hour | 1-7 | Written
Group presentation | 20 | 10 minutes (in groups of 3/4) | 4-10 | Written
Written assignment | 30 | 1,500 words | 4-10 | Written


Original form of assessment | Form of re-assessment | ILOs re-assessed | Timescale for re-assessment
Examination including a mixture of multiple choice questions and traditional exam questions (50%) | 1 hour | 1-7 | Typically within six weeks of the result
Group presentation (20%) | Individual presentation, 10 minutes | 4-10 | Typically within six weeks of the result
Written assignment (30%) | 1,500 words | 4-10 | Typically within six weeks of the result

Please refer to the TQA section on Referral/Deferral: 


Basic reading:

  • An Introduction to Generalized Linear Models, Third Edition. Dobson, AJ and Barnett, AG, Chapman & Hall (2008).

  • An Introduction to Statistical Learning with Applications in R. James, G, Witten, D, Hastie, T and Tibshirani, R.

ELE page:

Module has an active ELE page

probability, statistical distribution, estimator, likelihood, regression, survival analysis, Cox model, mixed effects model, bias, confounding, causation, Bayesian methods

