Applied Data Science

Module title	Applied Data Science
Module code	CSC3031
Academic year	2021/2
Credits	15
Module staff	Dr Tj McKinley (Convenor)

Duration: Term	1	2	3
Duration: Weeks	11

Number students taking module (anticipated)	30

Module description

Data science is an inter-disciplinary scientific field that aims to extract knowledge and insights from observed data, encompassing ideas from statistical modelling, machine learning, data visualisation and computer science. The ability to process, visualise and analyse complex data sets is a sought-after skill set in modern biomedical science, particularly in the era of “big data” such as medical records and gene expression data. Prerequisite modules have introduced the fundamentals of statistical modelling and computer programming; this module will augment these by focusing on many of the practical skills required when working with complex data sets, such as collating, transforming and manipulating complex data sets, using advanced visualisation tools and practising reproducible science. We will use the freely available, open-source, multi-platform statistical programming language R, which is now very widely used amongst biomedical and health scientists. In particular, we will introduce a powerful suite of packages known as the ‘tidyverse’, which is designed to make these routine, yet challenging, data science tasks faster, easier and, dare we say it, enjoyable. The course will be almost entirely practical-based, using real-world examples of complex data sets.

This module requires that you have some experience of computer programming. It does not require that you have used R before, since R will be introduced at the beginning of the module. All software is freely available, open-source and multi-platform.
The module has been developed for BSc Medical Sciences, but if capacity permits, it will be available to students from other programmes in the College of Medicine and Health or the wider University.

Module aims - intentions of the module

The aim of this module is to develop skills and experience in key areas that are at the forefront of efficient data science in the digital age. In particular you will understand the complexities involved in storing, manipulating, visualising and analysing complex data sets, and learn powerful ways to approach these challenges using the R statistical language, which is freely available and open-source, and is one of the principal software packages used in modern data science. With plenty of hands-on examples, you will see how each data set brings its own set of challenges, but by understanding key concepts you will be able to write effective code to facilitate efficient data analysis.
Alongside this you will also learn key areas of good research practice, such as scripting, literate programming for reproducible science, and version control. In particular you will learn how to use the ‘tidyverse’ suite of packages, which was designed to make many key data science tasks, such as manipulating, transforming and visualising complex data sets, faster and easier.

Intended Learning Outcomes (ILOs)

ILO: Module-specific skills

On successfully completing the module you will be able to...

1. Program efficiently in the open-source statistical programming language R.
2. Write efficient code using key R packages to import, clean and manipulate complex data sets.
3. Demonstrate insightful and flexible ways to visualise complex data sets, and implement these ideas in R.
4. Implement reproducible workflows using literate programming packages in R.

ILO: Discipline-specific skills

On successfully completing the module you will be able to...

5. Demonstrate an awareness of the ways that data are stored and structured, and understand key challenges in using real-world data, and the benefits of moving beyond the limitations of spreadsheets.
6. Demonstrate an awareness of key challenges in the storage, manipulation, visualisation and analysis of modern data sets, which are becoming even more pertinent in the era of big data.
7. Demonstrate an awareness of the importance of reproducible science and how it underpins good data analysis practice, and implement structured workflows to make programming practice more efficient and reproducible.

ILO: Personal and key skills

On successfully completing the module you will be able to...

8. Apply skills learned in critical thinking to tackle and solve problems in a variety of complex real-world data sets.
9. Implement novel and informative ways to communicate complex ideas.
10. Write fully reproducible data-driven reports that convey complex information in a clear and concise manner.

Syllabus plan

Whilst the module’s precise content may vary from year to year, an example of an overall structure is as follows:
Introduction to R: We will introduce the R statistical programming language, using the RStudio integrated development environment. This will build on the ideas introduced in CSC2020, so we will revisit familiar ideas such as variables, data types, for-/while-loops, if-else-statements, vectorised operations and other fundamental concepts. We will introduce the R syntax for implementing these ideas, and show how it differs from that of other languages, e.g. MATLAB.
Advanced Visualisation: We will show how to produce various plots using R’s default plotting interface. Whilst this is extremely powerful and versatile, we will show how producing complex plots with it can quickly become challenging, and instead introduce the ‘ggplot2’ package (part of the ‘tidyverse’), which is a pioneering package designed to make flexible yet complex visualisations straightforward to implement and customise. We will illustrate these ideas with some complex real-world data sets, including some from the Gapminder project. We will also show how we can create animations to help further understand patterns in data sets.
Data Wrangling: We will introduce various other ‘tidyverse’ packages that are designed to import and manipulate data sets. We will introduce the idea of ‘tidy’ data (at least as far as ‘tidyverse’ is concerned) and understand that this format is most often required by statistical analysis packages (and indeed visualisation tools such as ‘ggplot2’). We will introduce key ideas, such as filtering, transforming, grouping and summarising data sets, and explore common ways that large data sets are stored, such as in databases. We will show how different data sets can be joined together by common variables.
Literate Programming and Reproducible Science: We will reiterate the importance of using scripts to ensure that analyses are reproducible. We will also introduce literate programming tools, such as ‘rmarkdown’, to quickly and easily allow you to turn your analyses into high-quality documents, reports and presentations. This enables code and output to be seamlessly incorporated into the same document, improving efficiency and reducing common sources of error, such as copy-and-paste errors.
Version Control with Git: This final section will introduce how you can use the open-source software Git to document and keep track of all changes to your analysis code. This has many benefits: it can act as a backup of your code (particularly when coupled with an online Git repository, such as GitHub or Bitbucket); it can allow you to revert changes easily, or back-pedal to an earlier snapshot of the code; it allows you to develop multiple versions of your code, without worrying about losing any individual versions; it allows you to see changes in code from earlier versions; and it makes it very difficult to lose and/or overwrite code. It also means you don’t need to keep many copies of a script with increasingly unintelligible names (e.g. final.R, final1.R, honestlyFinal.R, definitelyFinal.R, …). It really shines when collaborating with others, since it enforces a workflow that lets multiple people work easily and efficiently on the same piece of code. Git is independent of any programming language, and is widely used in academia and industry.
Interspersed throughout the module will be hands-on practical and revision sessions to consolidate these ideas.

Learning activities and teaching methods (given in hours of study time)

Scheduled Learning and Teaching Activities	Guided independent study	Placement / study abroad
33	117

Details of learning activities and teaching methods

Category	Hours of study time	Description
Scheduled Learning and Teaching	33	Practical-based workshops
Guided Independent Study	117	Worksheets and programming practice, revision

Formative assessment

Form of assessment	Size of the assessment (eg length / duration)	ILOs assessed	Feedback method
Practical worksheets (x4)	1-2 hr each	1-10	Model answers provided

Summative assessment (% of credit)

Coursework	Written exams	Practical exams
40	0	60

Details of summative assessment

Form of assessment	% of credit	Size of the assessment (eg length / duration)	ILOs assessed	Feedback method
Coursework	40	2000 word equivalent + code	1-10	Written
Practical coding exam (open book)	60	2 hrs (plus 30 mins uploading time)	1-10	Written

Details of re-assessment (where required by referral or deferral)

Original form of assessment	Form of re-assessment	ILOs re-assessed	Timescale for re-assessment
Coursework (40%)	Coursework (2000 word equivalent + code)	1-10	Ref/Def period
Practical coding exam (open book) (60%)	Practical coding exam (open book) (2 hrs plus 30 mins uploading time)	1-10	Ref/Def period

Re-assessment notes

Please refer to the TQA section on Referral/Deferral: http://as.exeter.ac.uk/academic-policy-standards/tqa-manual/aph/consequenceoffailure/

Indicative learning resources - Basic reading

“R for Data Science”—Hadley Wickham and Garrett Grolemund (https://r4ds.had.co.nz/)

“Pro Git Book”—Scott Chacon and Ben Straub (https://git-scm.com/book/en/v2)

Key words search

Data Science; R; programming; data visualisation

Credit value	15
Module ECTS	7.5
Module pre-requisites	CSC2020, CSC2023 (or equivalents)
Module co-requisites	N/A
NQF level (module)	6
Available as distance learning?	No
Origin date	25/01/2021
Last revision date	24/02/2021