Data Management Plan FAQs
How are SDF datasets made up?
What metadata is available for the datasets?
How is data privacy upheld and maintained?
What is the Smart Data Foundry Trusted Research Environment?
How are SDF datasets made up?
Our datasets comprise UK banking microdata and, where applicable, associated relevant datasets in CSV format, curated and held by Smart Data Foundry. For more detailed information about specific datasets, please see our data catalogue on myFoundry.
What metadata is available for the datasets?
Smart Data Foundry datasets have detailed metadata that conform with DataCite Metadata Schema 4.6 and contain, at minimum, the following fields:
- Title
- Creator
- Identifier
- Publisher
- Publication year
- Subject
- Resource type
- Format
- Description
- Geolocation
- Access type
- Access requirements
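As an illustration, a metadata record covering these minimum fields could be represented and validated as follows. All field values below are hypothetical examples, not entries from the actual catalogue:

```python
# Illustrative metadata record covering the minimum fields listed above.
# All values are hypothetical examples, not real Smart Data Foundry entries.
record = {
    "Title": "Example UK Banking Microdata Extract",
    "Creator": "Smart Data Foundry",
    "Identifier": "doi:10.0000/example",
    "Publisher": "Smart Data Foundry",
    "PublicationYear": 2024,
    "Subject": "Household finance",
    "ResourceType": "Dataset",
    "Format": "text/csv",
    "Description": "Anonymised transaction-level banking microdata.",
    "Geolocation": "United Kingdom",
    "AccessType": "Restricted",
    "AccessRequirements": "Approved project within the SDF TRE",
}

# A simple completeness check: every minimum field must be present and non-empty.
required = [
    "Title", "Creator", "Identifier", "Publisher", "PublicationYear",
    "Subject", "ResourceType", "Format", "Description", "Geolocation",
    "AccessType", "AccessRequirements",
]
missing = [f for f in required if not record.get(f)]
print("missing fields:", missing)  # an empty list means the record is complete
```

A check of this shape is one way a data catalogue can enforce that every published dataset carries the full minimum metadata set.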
How is data privacy upheld and maintained?
The UK banking microdata that Smart Data Foundry holds is effectively anonymised at source, meaning individuals cannot be identified in the data as supplied.
Data are stored inside a Trusted Research Environment (TRE) – see below for further technical and procedural details.
What is the Smart Data Foundry Trusted Research Environment?
The SDF Trusted Research Environment (TRE) is managed in partnership with EPCC. It enables researchers to gain controlled access to microdata within a secure environment.
For a research user, a typical use case involves access to a highly secure Virtual Machine environment with standard research data manipulation tools such as Python, R and Jupyter Notebooks – but with significant controls: import and export checks on all data by our data governance operations team, no ability to copy-paste, and no internet access apart from controlled access to certain software package repositories.
What controls are in the TRE?
- Segregated and controlled storage for the data
- A secure virtual desktop interface to access the data – without access to external networks
- An encrypted file transfer system to take data in and out of the TRE
- Researchers are given access to highly secure, high-utility data from data partners, as well as controlled access to software tools and reference data – with no other internet access and strict copy-paste controls
- All data transfers in and out of the Research Environment are encrypted and closely managed by the SDF Data Operations team
- All data in and out are checked by Data Operations to ensure the data is not re-identifiable and complies with our responsibilities to data subjects and our data partners
Segregated and controlled storage
Physical controls
All service infrastructure and all project data are hosted and operated at EPCC’s Advanced Computing Facility (ACF) with significant on-premises controls.
Logical controls
Smart Data Foundry VMs are in a dedicated zone, preventing access to other Trusted Research Environment zones and their datasets and vice versa.
Access controls
The ACF data centre maintains strict firewalls and network controls – including no access to the internet – to prevent unauthorised and malicious access to the data. By default, to enable research and necessary data transfers, we set up:
- Access to a VPN for researchers
- Use of multifactor authentication to access the environment
- Specific IP addresses cleared to receive or send data to our partner institutions
A secure virtual desktop interface
Virtual Machine access
- Virtual VPN-managed access (with multifactor authentication) to the data centre, managed by our information governance controls and Data Operations team
Access to data processing tools
- Access to Python and R, along with access to PyPI and CRAN package libraries from within the VM
- Access to Visual Studio, LibreOffice, Jupyter Notebooks
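In a locked-down environment like this, package installs are typically routed to an approved mirror rather than the open internet. A sketch of what such a pip configuration might look like – the mirror URL is a placeholder, not SDF's actual repository address:

```ini
# pip.conf – hypothetical configuration pointing pip at an approved
# internal PyPI mirror; the URL is a placeholder, not a real SDF address.
[global]
index-url = https://packages.example.internal/pypi/simple
trusted-host = packages.example.internal
```

With a configuration of this shape, `pip install` requests resolve only against the permitted repository, which is consistent with the "controlled access to certain software package repositories" described above.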
A secure encrypted file transfer system
File transfers
We use a combination of airlocked staging zones, file encryption and access controls to receive encrypted files from our data partners and to release approved outputs from the TRE.
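One building block of a transfer process like this is an integrity check on each file as it crosses a staging boundary. The following is a minimal sketch of that idea using a SHA-256 digest – an illustration of the general technique, not SDF's actual transfer code, and the file names are invented:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file in streaming chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Simulate a transfer: write a file into a staging area, then confirm the
# received copy matches the digest recorded on the sending side.
staging = Path("staging_zone")
staging.mkdir(exist_ok=True)
sent = staging / "dataset.csv"
sent.write_bytes(b"id,amount\n1,9.99\n")
sender_digest = sha256_of(sent)

received = staging / "dataset_received.csv"
received.write_bytes(sent.read_bytes())
ok = sha256_of(received) == sender_digest
print("integrity verified:", ok)
```

In a real airlocked workflow, the digest would travel with the encrypted file so the receiving side can confirm nothing was altered or truncated in transit.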
Independent disclosure checks
We complete information governance and disclosure checks on all data cleared into and out of the secure environment, performed by a team independent of the data science and research teams. These checks verify that data sent by providers falls within the scope and purpose of the project, assess data quality, and mitigate the risk of re-identification. In addition, some data quality checks are performed inside the research environment.
Secure outputs
All outputs are checked and verified by our governance team to ensure the data is sufficiently aggregated to the point of being information (with a minimum "checksum" value of 10). Once approved, the output can be accessed outside of the Trusted Research Environment via a secure weblink to download the information.
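A common form of such an aggregation check is a minimum cell-count threshold: any aggregate backed by fewer than a minimum number of underlying records is suppressed before release. The sketch below uses 10, echoing the minimum value mentioned above, but it illustrates the general technique only – the exact rule SDF applies may differ:

```python
MIN_COUNT = 10  # illustrative minimum underlying-record count per released cell

def suppress_small_cells(cells):
    """Return only aggregates backed by at least MIN_COUNT records.

    `cells` maps a cell label to (aggregate_value, record_count).
    Cells below the threshold are dropped (suppressed) rather than released.
    """
    return {
        label: value
        for label, (value, count) in cells.items()
        if count >= MIN_COUNT
    }

# Hypothetical aggregates: (value, number of underlying records).
cells = {
    "region_A": (1520.0, 42),   # safe: 42 underlying records
    "region_B": (310.5, 7),     # suppressed: only 7 records
    "region_C": (980.0, 10),    # safe: exactly at the threshold
}
released = suppress_small_cells(cells)
print(sorted(released))  # region_B is withheld from the release
```

Suppressing low-count cells is a standard statistical disclosure control step, since small aggregates are the ones most at risk of revealing information about individual data subjects.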