Librarian-in-the-Loop Deep Learning To Curate Very Large Biomedical Image Datasets

A Brief Intro For Librarians

AI Enhanced Segmentation of Very Large FIB-SEM Images For Cell Biology

For Data Science or Computer Science Students

Collaborators

Students

Synopsis

From the double helix to the Pillars of Creation, science is often driven by innovative instrumentation and imaging. Recent breakthroughs in Focused Ion Beam Scanning Electron Microscopy (FIB-SEM) technology have made it possible to reliably acquire nanometer-scale 3D images of sizable volumetric biomedical samples, each yielding tens to hundreds of terabytes of raw image data. After generating landmark datasets for neuroscience and cell biology research, scientists at Yale are now using enhanced FIB-SEM to enable discoveries in translational and clinical research. As with many other data-intensive science challenges, the bottleneck has shifted from data collection and storage to data curation, whose primary purpose is to extract insights and knowledge from the data, under more stringent requirements for efficiency, timeliness, replicability, and reusability.

Existing data curation models and frameworks are insufficient to address these challenges. In addition, the very large data volume makes comprehensive close reading and manual image annotation impractical. For example, it has been estimated that the FIB-SEM images taken from a single cell may take up to 60 person-years to annotate manually. To make sense of these images, researchers increasingly turn to machine learning methods. Supervised deep learning has been applied to FIB-SEM images, but its performance can be unreliable: training a model for automatic image segmentation may take months on a GPU cluster and still result in overfitting. A recent study suggests that interventions from experienced domain experts and data curators may drastically speed up training, although the performance gain from such human interventions has not been carefully benchmarked. Very large FIB-SEM datasets therefore present an archetypal test case for how best to orchestrate scientists, data curators, cyberinfrastructure, software, and deep learning algorithms to achieve the best data-to-insight performance.

This project will draw on insights from our prior IMLS-funded project on curating very large research datasets. Our past experience has shown that: 1) Data curators and librarians should be brought into the big data pipeline as early as possible, even at the stage of physically acquiring data, because knowledge of data acquisition often reveals opportunities to optimize the pipeline. 2) Data curation should be driven primarily by data use and reuse, which closely aligns librarians and data curators with domain scientists. Long-term preservation activities are better performed as a side effect of data use and reuse. 3) The efficiency, cost, and performance of extracting insights from data are often the critical success factors for data curation, and they are closely tied to both the data format and the choice of cyberinfrastructure. Experimenting and benchmarking are often the most effective way to achieve balanced results, hence this prototyping project.

Inputs & Expected Outputs

The inputs to this project include three sets of 3D FIB-SEM neuron images at 8-nanometer resolution taken from genetically modified mice. Each set is about 2-4 TB, as shown below.
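To give a feel for what 2-4 TB at 8-nanometer resolution means physically, the following back-of-the-envelope sketch converts dataset size into voxel count and imaged volume. It assumes isotropic 8 nm voxels, 8-bit grayscale (1 byte per voxel), uncompressed storage, and decimal terabytes; none of these are stated on this page, so the numbers are only rough orders of magnitude. The function name `estimate_volume` is purely illustrative.

```python
# Assumptions (not from the project description): isotropic 8 nm voxels,
# 8-bit grayscale (1 byte per voxel), uncompressed data, decimal terabytes.
VOXEL_NM = 8
BYTES_PER_VOXEL = 1
TB = 10**12


def estimate_volume(size_tb):
    """Return (voxel count, imaged volume in cubic micrometers)."""
    n_voxels = size_tb * TB // BYTES_PER_VOXEL
    # One cubic micrometer is 1e9 cubic nanometers.
    volume_um3 = n_voxels * VOXEL_NM**3 / 1e9
    return n_voxels, volume_um3


for size_tb in (2, 4):
    n_voxels, volume_um3 = estimate_volume(size_tb)
    side_um = volume_um3 ** (1 / 3)  # edge of a cube with the same volume
    print(f"{size_tb} TB: {n_voxels:.1e} voxels, "
          f"~{volume_um3:.2e} um^3, equivalent cube side ~{side_um:.0f} um")
```

Under these assumptions, each dataset holds on the order of trillions of voxels, which is why slice-by-slice manual annotation does not scale.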

Expected outputs include fully segmented images containing all voxels of nuclear pores on all imaged nuclei and related statistics and analysis.
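As a minimal sketch of how pore-level statistics could be derived from a segmentation, the example below labels connected components in a binary 3D mask and reports a count, per-pore volumes, and centroids. It is an illustration, not the project's actual pipeline: the function name `pore_statistics`, the toy mask, the face-connectivity rule, and the assumption of isotropic 8 nm voxels are all hypothetical, and a multi-terabyte volume would need chunked or out-of-core processing rather than a single in-memory array.

```python
import numpy as np
from scipy import ndimage

VOXEL_NM = 8  # voxel edge length from the text; isotropy is an assumption


def pore_statistics(mask):
    """Label connected components in a binary 3D mask and summarize them."""
    # The default structuring element gives face connectivity (6-connected in 3D).
    labels, n_pores = ndimage.label(mask)
    # bincount index 0 is the background, so drop it.
    sizes_vox = np.bincount(labels.ravel())[1:]
    volumes_nm3 = sizes_vox * VOXEL_NM**3
    # Centroids are in voxel coordinates, one tuple per labeled component.
    centroids = ndimage.center_of_mass(mask, labels, np.arange(1, n_pores + 1))
    return n_pores, volumes_nm3, centroids


# Toy stand-in for a predicted pore mask: two separate blobs.
toy = np.zeros((8, 8, 8), dtype=bool)
toy[1:3, 1:3, 1:3] = True  # 8 voxels
toy[5:6, 5:7, 5:6] = True  # 2 voxels

n_pores, volumes_nm3, centroids = pore_statistics(toy)
print("pores found:", n_pores)
print("volumes (nm^3):", volumes_nm3.tolist())
print("centroids (voxel coords):", centroids)
```

Per-component sizes and centroids like these are a natural starting point for the "related statistics and analysis", for example pore density per nucleus, once each pore is assigned to a nucleus.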

Zoomed out to 1/8000 of the original resolution, showing multiple neuron nuclei

A tiny section of the same image at the original resolution, showing the pores

Preliminary Results (Updated Oct 2024)

Our initial work focused primarily on reproducing the results from the Open Organelle project, as described in this Nature paper.

Since Oct 2022, we have been working on FIB-SEM images taken from Bordey Lab's mouse brain samples. Our current focus is on identifying nuclear pores. Zhiwu leads the labeling efforts to produce ground truth data and works with Yinlin on the training and prediction efforts.

Initial Labeling Scheme

Modified Labeling Scheme

Initial Prediction

Predicted Nuclear Pores in 3D (Jan 2024) 


Before

After Adding 2 More Tiny Slices of Ground Truth

Predicted Nuclear Pores in 3D (July 2024, after Post-Processing)


Predicted Nuclear Pores in 3D (Sep 2024, before Post-Processing)