University of Exeter Medical School

Applied Data Science

Module title: Applied Data Science
Module code: CSC3031
Academic year: 2021/2
Credits: 15
Module staff

Dr Tj McKinley (Convenor)

Duration: Term: 1 / 2 / 3
Duration: Weeks: 11

Number of students taking module (anticipated)

30

Module description

Data science is an inter-disciplinary scientific field that aims to extract knowledge and insights from observed data, encompassing ideas from statistical modelling, machine learning, data visualisation and computer science. The ability to process, visualise and analyse complex data sets is a sought-after skill in modern biomedical science, particularly in the era of “big data” such as medical records and gene expression data. Prerequisite modules have introduced the fundamentals of statistical modelling and computer programming; this module will augment these by focusing on many of the practical skills required when working with complex data sets, such as collating, transforming and manipulating complex data sets, advanced visualisation tools and reproducible science. We will use the freely available, open-source, multi-platform statistical programming language R, software that is now very widely used amongst biomedical and health scientists. In particular, we will introduce a powerful suite of packages known as the ‘tidyverse’, which is designed to make these routine, yet challenging, data science tasks faster, easier and, dare we say it, enjoyable. The course will be almost entirely practical, using real-world examples of complex data sets.

This module requires that you have some experience of computer programming. It does not require that you have used R before, since R will be introduced at the beginning of the module. All software is freely available, open-source and multi-platform.
The module has been developed for BSc Medical Sciences, but if capacity permits, it will be available to students from other programmes in the College of Medicine and Health or the wider University.

Module aims - intentions of the module

The aim of this module is to develop skills and experience in key areas that are at the forefront of efficient data science in the digital age. In particular you will understand the complexities involved in storing, manipulating, visualising and analysing complex data sets, and learn powerful ways to approach these challenges using the R statistical language, which is freely available and open-source, and is one of the principal software packages used in modern data science. With plenty of hands-on examples, you will see how each data set brings its own set of challenges, but by understanding key concepts you will be able to write effective code to facilitate efficient data analysis.
Alongside this you will also learn key areas of good research practice, such as scripting, literate programming for reproducible science, and version control. In particular you will learn how to use the ‘tidyverse’ suite of packages, which was designed to make many key data science tasks, such as manipulating, transforming and visualising complex data sets, faster and easier.

Intended Learning Outcomes (ILOs)

ILO: Module-specific skills

On successfully completing the module you will be able to...

  1. Program efficiently in the open-source statistical programming language R.
  2. Write efficient code using key R packages to import, clean and manipulate complex data sets.
  3. Demonstrate insightful and flexible ways to visualise complex data sets, and implement these ideas in R.
  4. Implement reproducible workflows using literate programming packages in R.

ILO: Discipline-specific skills

On successfully completing the module you will be able to...

  5. Demonstrate an awareness of the ways that data are stored and structured, understand key challenges in using real-world data, and appreciate the benefits of moving beyond the limitations of spreadsheets.
  6. Demonstrate an awareness of key challenges in the storage, manipulation, visualisation and analysis of modern data sets, which are becoming even more pertinent in the era of “big data”.
  7. Demonstrate an awareness of the importance of reproducible science and how it underpins good data analysis practice, and implement structured workflows to make programming practice more efficient and reproducible.

ILO: Personal and key skills

On successfully completing the module you will be able to...

  8. Apply skills learned in critical thinking to tackle and solve problems in a variety of complex real-world data sets.
  9. Implement novel and informative ways to communicate complex ideas.
  10. Write fully reproducible data-driven reports that convey complex information in a clear and concise manner.

Syllabus plan

Whilst the module’s precise content may vary from year to year, an example of an overall structure is as follows:
Introduction to R: We will introduce the R statistical programming language, using the RStudio integrated development environment. This will build on the ideas introduced in CSC2020, so we will cover familiar ideas such as variables, data types, for-/while-loops, if-else-statements, vectorised operations and other fundamental concepts. We will introduce the R syntax for implementing these ideas, and show how it differs from that of, e.g., Matlab.
Advanced Visualisation: We will show how to produce various plots using R’s default plotting interface. Whilst this is extremely powerful and versatile, we will show how producing complex plots in this way can quickly become challenging, and instead introduce the ‘ggplot2’ package (part of the ‘tidyverse’), which is a pioneering package designed to make flexible yet complex visualisations straightforward to implement and customise. We will illustrate these ideas with some complex real-world data sets, including some from the Gapminder project. We will also show how we can create animations to help further understand patterns in data sets.
Data Wrangling: We will introduce various other ‘tidyverse’ packages that are designed to import and manipulate data sets. We will introduce the idea of ‘tidy’ data (at least as far as ‘tidyverse’ is concerned) and understand that this format is most often required by statistical analysis packages (and indeed visualisation tools such as ‘ggplot2’). We will introduce key ideas, such as filtering, transforming, grouping and summarising data sets, and explore common ways that large data sets are stored, such as in databases. We will show how different data sets can be joined together by common variables.
Literate Programming and Reproducible Science: We will reiterate the importance of using scripts to ensure that analyses are reproducible. We will also introduce literate programming tools, such as ‘rmarkdown’, to quickly and easily allow you to turn your analyses into high-quality documents, reports and presentations. This enables code and output to be seamlessly incorporated into the same document, improving efficiency and reducing common sources of error, such as copy-and-paste errors.
Version Control with Git: This final section will introduce how you can use the open-source software Git to document and keep track of all changes to your analysis code. This has many benefits: it can act as a backup of your code (particularly when coupled with an online Git repository, such as GitHub or Bitbucket); it allows you to revert changes easily, or back-pedal to an earlier snapshot of the code; it allows you to develop multiple versions of your code without worrying about losing any individual version; it allows you to see how the code has changed from earlier versions; and it makes it very difficult to lose and/or overwrite code. It also means you don’t need to keep many copies of a script with increasingly unintelligible names (e.g. final.R, final1.R, honestlyFinal.R, definitelyFinal.R, …). It really shines when collaborating with others on the same piece of code, since it enforces a workflow that lets multiple people work easily and efficiently on the same code. Git is independent of any programming language, and is widely used in academia and industry.
Interspersed throughout the module will be hands-on practical and revision sessions to consolidate these ideas.

Learning activities and teaching methods (given in hours of study time)

Scheduled Learning and Teaching Activities | Guided independent study | Placement / study abroad
33 | 117 |

Details of learning activities and teaching methods

Category | Hours of study time | Description
Scheduled Learning and Teaching | 33 | Practical-based workshops
Guided Independent Study | 117 | Worksheets and programming practice, revision

Formative assessment

Form of assessment | Size of the assessment (e.g. length / duration) | ILOs assessed | Feedback method
Practical worksheets (x4) | 1-2 hr each | 1-10 | Model answers provided

Summative assessment (% of credit)

Coursework | Written exams | Practical exams
40 | 0 | 60

Details of summative assessment

Form of assessment | % of credit | Size of the assessment (e.g. length / duration) | ILOs assessed | Feedback method
Coursework | 40 | 2000 word equivalent + code | 1-10 | Written
Practical ‘coding exam’ (open book) | 60 | 2 hrs (plus 30 mins uploading time) | 1-10 | Written

Details of re-assessment (where required by referral or deferral)

Original form of assessment | Form of re-assessment | ILOs re-assessed | Timescale for re-assessment
Coursework (40%) | Coursework (2000 word equivalent + code) | 1-10 | Ref/Def period
Practical ‘coding exam’ (open book) (60%) | Practical ‘coding exam’ (open book) (2 hrs, plus 30 mins uploading time) | 1-10 | Ref/Def period

Re-assessment notes

Please refer to the TQA section on Referral/Deferral: http://as.exeter.ac.uk/academic-policy-standards/tqa-manual/aph/consequenceoffailure/

Indicative learning resources - Basic reading

“R for Data Science”—Hadley Wickham and Garrett Grolemund (https://r4ds.had.co.nz/)

“Pro Git Book”—Scott Chacon and Ben Straub (https://git-scm.com/book/en/v2)

Key words search

Data Science; R; programming; data visualisation

Credit value: 15
Module ECTS

7.5

Module pre-requisites

CSC2020, CSC2023 (or equivalents)

Module co-requisites

N/A

NQF level (module)

6

Available as distance learning?

No

Origin date

25/01/2021

Last revision date

24/02/2021