How to build a data platform to support pioneering medical research

While medical imaging has long been integral to healthcare, recent years have seen its use expand far beyond traditional hospital settings. For example, most high street opticians now offer OCT (optical coherence tomography) scans of your retina as part of a regular eye test.

This means that where access to imaging equipment was once the bottleneck in diagnosing conditions, the issue now is finding time for doctors to assess images from all the referrals they receive. And with many of those referrals turning out to be false positives, the clinician time spent assessing these images can result in delays in providing treatment to those who urgently need it.

To address this new challenge, clinicians and researchers are now exploring ways to use artificial intelligence (AI) to assess medical images, and flag up the ones requiring further attention.

Encouraging early signs

Early indications are that this approach can be extremely effective: studies involving Moorfields Eye Hospital and Google DeepMind have included work on a system to predict whether an eye will convert to exAMD (exudative age-related macular degeneration) within the next six months.

While much of the fanfare has been around the use of different forms of AI in these situations, including machine learning (ML) and deep learning (DL), training the algorithms relies on having access to the right data. Moreover, AI isn’t the only way the vast quantities of data being collected through medical imaging can be useful: access to big datasets can be valuable to researchers in other ways as well. The AlzEye programme, for example, is looking to draw on an unprecedentedly large dataset to explore the patterns of retinal change associated with the development of dementia.

Supplying health data for research

Providing the right healthcare data to researchers, and doing so in a safe and secure way, requires a lot of work behind the scenes. Below, we provide a high-level overview of the main elements required to create a large-scale healthcare research data platform, such as the solution Softwire built for Moorfields Eye Hospital.

Data governance underpinnings

While it isn’t a topic we’re going to cover in detail, rigorous data governance is essential around any healthcare data platform. Importantly, it must strike the necessary balance between protecting patient information and encouraging cutting-edge research.

Data pipeline

To get the data into the platform, you first need a pipeline capable of ingesting large volumes of medical images from disparate sources. It needs to be able to extract key metadata from these images, such as the part of the body that’s been captured, the type of image, and the device make and model.
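As a rough illustration of the metadata-extraction step, here is a minimal Python sketch. It assumes each image header has already been parsed into a dictionary keyed by DICOM-style attribute names (BodyPartExamined, Modality, Manufacturer, ManufacturerModelName). The image format is our assumption for the example, and the ImageMetadata class and extract_metadata function are hypothetical names rather than part of any particular platform.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ImageMetadata:
    """The fields the pipeline keeps for each ingested image (illustrative)."""
    body_part: Optional[str]
    image_type: Optional[str]
    device_make: Optional[str]
    device_model: Optional[str]


def _clean(value: object) -> Optional[str]:
    """Trim whitespace and treat blank or absent values as missing."""
    if value is None:
        return None
    text = str(value).strip()
    return text or None


def extract_metadata(header: Mapping[str, object]) -> ImageMetadata:
    """Pull the key metadata out of an already-parsed image header.

    Assumption: keys follow DICOM-style attribute names. Sources that
    use other formats would need their own mapping.
    """
    return ImageMetadata(
        body_part=_clean(header.get("BodyPartExamined")),
        image_type=_clean(header.get("Modality")),
        device_make=_clean(header.get("Manufacturer")),
        device_model=_clean(header.get("ManufacturerModelName")),
    )


# Example header with made-up values; the blank model name becomes None.
example_header = {
    "BodyPartExamined": "EYE ",
    "Modality": "OPT",
    "Manufacturer": "ExampleVendor",
    "ManufacturerModelName": "",
}
print(extract_metadata(example_header))
```

Normalising blanks to a single "missing" value at ingestion keeps later filtering simple, because images from disparate sources rarely agree on how they represent an empty field.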

This needs to be combined with other information about the individual from whom the image has been taken, by integrating with electronic health record systems. This could include their age, sex, ethnic background and details of any conditions they have. Knowing this information is essential for researchers, because conditions can affect different groups in different ways. Lastly, the pipeline needs to include suitable pseudonymisation/de-identification, to protect individuals’ data.
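To make the combination and pseudonymisation steps more concrete, here is a hedged sketch using only the Python standard library. It assumes a keyed hash (HMAC-SHA256) is an acceptable way to derive pseudonyms and that exact ages are coarsened into ten-year bands. Both are illustrative choices, not a description of any specific platform, and the field names (patient_id, age, sex and so on) are hypothetical.

```python
import hashlib
import hmac


def pseudonymise_id(patient_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym with HMAC-SHA256.

    The same patient always maps to the same pseudonym, but the mapping
    cannot be recomputed without the secret key.
    """
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def age_band(age: int, width: int = 10) -> str:
    """Coarsen an exact age into a band, e.g. 67 becomes '60-69'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def build_research_record(image_meta: dict, ehr_record: dict, secret_key: bytes) -> dict:
    """Combine image metadata with selected health-record fields.

    Assumption: image_meta holds only non-identifying fields. The raw
    patient_id from the health record is never copied into the output.
    """
    return {
        **image_meta,
        "pseudo_id": pseudonymise_id(ehr_record["patient_id"], secret_key),
        "age_band": age_band(ehr_record["age"]),
        "sex": ehr_record.get("sex"),
        "ethnic_background": ehr_record.get("ethnic_background"),
        "conditions": sorted(ehr_record.get("conditions", [])),
    }


# Example with made-up values; a real key would come from a secrets manager.
record = build_research_record(
    {"image_type": "OPT", "body_part": "EYE"},
    {"patient_id": "P001", "age": 67, "sex": "F", "conditions": ["glaucoma"]},
    secret_key=b"example-only-key",
)
print(sorted(record))
```

A keyed hash rather than a plain one matters here: patient identifiers are often short and guessable, so an unkeyed hash could be reversed simply by hashing every possible identifier.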

All this data about each image then needs to be appropriately stored, so that it can be securely accessed (with the proper authorisation in place) and queried by researchers.

Optimised storage

The cloud may offer conceptually limitless storage, but when you’re working with such large volumes of images and their associated metadata, it’s essential you use storage smartly. All the major cloud providers offer a range of storage services, from high-speed solid-state to archival options. When you’re building a research data platform, the key is to choose storage services and tiers that strike the optimal balance between cost and access speeds, based on your use cases.
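One way to picture the cost-versus-speed trade-off is a simple tiering rule based on how recently an image was accessed. The sketch below is illustrative only: the tier names ("hot", "cool", "archive") and the 30-day and 365-day thresholds are assumptions, and each cloud provider has its own tier names, pricing and lifecycle tooling.

```python
from datetime import date


def choose_tier(last_accessed: date, today: date, in_active_study: bool = False) -> str:
    """Pick a storage tier for an image (hypothetical tiers and thresholds).

    Images in an active study stay on fast storage regardless of age;
    everything else moves to cheaper tiers the longer it sits idle.
    """
    if in_active_study:
        return "hot"
    idle_days = (today - last_accessed).days
    if idle_days <= 30:
        return "hot"
    if idle_days <= 365:
        return "cool"
    return "archive"


# Example: an image last read about four months before the check date.
print(choose_tier(date(2024, 1, 10), today=date(2024, 5, 10)))
```

In practice you would more likely express a rule like this as a lifecycle policy in your cloud provider's storage service than run it yourself, but the decision logic is the same.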

Self-service data warehouse search

To extract real value from the image data you’ve ingested and stored, you then require a way for researchers to access it. Users will have varying levels of technical skill, so it’s important you have a front-end that enables even non-technical researchers to browse on their own.

What they see must be tightly controlled. Certain underlying data relating to the images needs to be protected. But at the same time, you do need to enable researchers to browse a controlled set of metadata, and filter based on parameters they set.

For example, they may wish to request all scans from patients within a certain age range, who have one or more specified conditions. Given this is healthcare data, these filters require an additional level of control, to ensure the criteria cannot be so specific as to enable a researcher to identify an individual patient.
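A common way to stop search criteria from singling out an individual is to suppress results for very small cohorts. The sketch below assumes that approach. The threshold of 10 is purely illustrative (the real value is a governance decision), the record fields are hypothetical, and a production system would run this check inside the data warehouse rather than over an in-memory list.

```python
from typing import Iterable, Optional, Set

MIN_COHORT_SIZE = 10  # hypothetical threshold, set by data governance in practice


def matches(record: dict, min_age: int, max_age: int, conditions: Set[str]) -> bool:
    """True if the record is within the age range and has at least one
    of the requested conditions (an empty set means 'any condition')."""
    in_range = min_age <= record["age"] <= max_age
    has_condition = not conditions or bool(conditions & set(record["conditions"]))
    return in_range and has_condition


def safe_count(
    records: Iterable[dict],
    min_age: int,
    max_age: int,
    conditions: Set[str],
    min_cohort: int = MIN_COHORT_SIZE,
) -> Optional[int]:
    """Return the number of matching records, or None when the cohort is
    too small to disclose without risking re-identification."""
    total = sum(1 for r in records if matches(r, min_age, max_age, conditions))
    return total if total >= min_cohort else None


# Example with made-up records: 12 patients aged 60-71, all with glaucoma.
sample = [{"age": 60 + i, "conditions": ["glaucoma"]} for i in range(12)]
print(safe_count(sample, 60, 75, {"glaucoma"}))  # broad filter: count is shown
print(safe_count(sample, 60, 62, {"glaucoma"}))  # narrow filter: suppressed
```

Returning "too few results to display" instead of an exact small number lets researchers keep refining their search without the counts themselves becoming a disclosure route.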

The front-end then needs to show the researcher how many images match their criteria, and selected metadata about the results. From a user experience perspective, this must happen near-enough instantly. The user also needs to be able to amend their search and update results.

Serverless data warehouses are your friend here, enabling the system to quickly crunch through huge volumes of data on-demand, while eliminating the need to manage and pay for infrastructure that would spend significant amounts of time idle.

Once the researcher identifies the images and metadata they want access to, your platform should enable them to submit an access request, to be reviewed in line with your data governance processes. Once this is approved and conditions set, the platform then needs to extract the relevant images from where they’re stored, and the metadata from the data warehouse. It must apply any additional pseudonymisation that’s required, and transform the metadata and image-set into the shape required by the researcher.
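The request-and-release flow described above can be sketched as a small state machine plus a step that trims the metadata to the approved fields. The state names, the allowed transitions and the shape_export helper below are all hypothetical, intended only to show the kind of control the platform needs, not how any particular system implements it.

```python
from enum import Enum
from typing import Iterable, List


class RequestState(Enum):
    SUBMITTED = "submitted"
    UNDER_REVIEW = "under_review"
    APPROVED = "approved"
    REJECTED = "rejected"
    FULFILLED = "fulfilled"


# Which states each state may move to (illustrative governance workflow).
ALLOWED_TRANSITIONS = {
    RequestState.SUBMITTED: {RequestState.UNDER_REVIEW},
    RequestState.UNDER_REVIEW: {RequestState.APPROVED, RequestState.REJECTED},
    RequestState.APPROVED: {RequestState.FULFILLED},
    RequestState.REJECTED: set(),
    RequestState.FULFILLED: set(),
}


def advance(current: RequestState, target: RequestState) -> RequestState:
    """Move a request to a new state, refusing any step that skips review."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


def shape_export(records: Iterable[dict], approved_fields: List[str]) -> List[dict]:
    """Keep only the metadata fields approved for this request."""
    return [{f: r[f] for f in approved_fields if f in r} for r in records]


# Example: a request passes review, then only approved fields are released.
state = advance(RequestState.SUBMITTED, RequestState.UNDER_REVIEW)
state = advance(state, RequestState.APPROVED)
rows = [{"pseudo_id": "abc", "age_band": "60-69", "postcode_area": "N1"}]
print(shape_export(rows, ["pseudo_id", "age_band"]))
```

Encoding the transitions explicitly means data can only be extracted for a request that has actually been approved, which keeps the platform's behaviour aligned with the governance process.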

Trusted research environments

To support the researcher’s work, while maintaining control over where data is stored and how it’s used, the platform should then make it available via ‘trusted research environments’ (or ‘secure research environments’). Hosted in the cloud, these provide ring-fenced areas for authorised researchers to work with the data.

Within the environment, you can provide a huge variety of general and healthcare-specific data-analysis tools, including those from your chosen cloud provider and its marketplace. And you can base an environment on a range of cloud compute instance types, to suit different performance needs and budgets.

Operating these trusted research environments in the cloud eliminates the need for you to run expensive and capacity-constrained resources on-premises. Instead, it gives your researchers access to the tools they need, as soon as they need them.

Immense rewards

That concludes our whistle-stop tour of how to build a healthcare data research platform that encourages researchers to further push the boundaries of their fields.

While working with patient data introduces added complexities when building large-scale data warehousing and analytics solutions, the rewards for getting governance spot-on can be immense. Key among them is the ability to give researchers access to potentially bigger and richer datasets than ever before, and to unlock the potential of AI in healthcare.

And as Moorfields Eye Hospital have shown, in the hands of the right people, the outcomes look set to be genuinely transformational for population health and wellbeing.