A deep learning pipeline for three-dimensional brain-wide mapping of local neuronal ensembles in teravoxel light-sheet microscopy

Datasets and experiments
Training dataset
Our training data consisted of two LSFM cohorts of TRAP2-Ai9 mice (total n = 18): (1) ten animals aged 3–4 months, with whole-brain data acquired, and (2) eight animals aged 2 months, with data acquired from the left hemisphere. All animal procedures followed animal care guidelines approved by the Institutional Animal Care and Use Committee at the Scripps Research Institute. These animals carried an immediate early gene (c-fos) promoter driving the activity-dependent expression of the inducible recombinase Cre-ERT2. Expression of the fluorescent protein tdTomato was activated following the injection of 20 mg kg−1 4-hydroxytamoxifen. We used the following splits for training and evaluation: the training set consisted of ten animals, with three and five used for the validation and test sets, respectively.
Unseen dataset 1
We used whole-brain LSFM data from a model of spinal cord injury (n = 3 mice) to validate our DL segmentation models. Briefly, anesthesia was induced with a mixture of isoflurane and O2. Mice were then placed on a heating pad set to 37 °C and maintained on 1–3% isoflurane and O2. A midline skin incision was made and the T10 lamina identified. A T10 laminectomy was performed, followed by left lateral hemisection using either microscissors or a microscalpel61. Muscle closure was performed with 6-0 Vicryl followed by skin closure with 6-0 Ethilon. Postoperatively, mice were placed on a heating pad and given subcutaneous fluids as needed. Pain control for 48 h postoperatively was provided via daily subcutaneous administration of Rimadyl (5 mg kg−1). Bladders were expressed twice daily until spontaneous recovery of bladder function. All hemisection lesions were histologically confirmed as adequate. All three mice were adult female C57BL/6 mice (≥8 weeks of age at the start of the experiment, 15–30 g body weight). All procedures were performed in compliance with the Swiss Veterinary Law guidelines and were approved by the Veterinary Office of the Canton of Geneva, Switzerland (license no. GE/112/20).
Unseen dataset 2
We used whole-brain LSFM data obtained from a male C57BL/6J mouse (Jackson Laboratory, strain 000664), aged 18 months. All animal procedures followed protocols approved by the Stanford University Institutional Animal Care and Use Committee, and met the guidelines of the National Institutes of Health Guide for the Care and Use of Laboratory Animals.
Cold dataset
We analyzed the cold and thermoneutral c-Fos dataset from our previous publication48. Briefly, single-housed male wild-type (WT) C57BL/6J mice were exposed to either 4 °C (cold) or 30 °C (thermoneutral) for 6 h, with free access to food and water, before perfusion. Mouse brains were harvested and fixed in 4% paraformaldehyde (PFA) following perfusion.
Walking dataset
We used two groups of adult female C57BL/6 mice (≥8 weeks of age at the start of the experiment, 15–30 g body weight, n = 3 per group). Briefly, mice were trained to run quadrupedally on a treadmill (Robomedica, Inc.) 5 days per week for 2 weeks before perfusion. To elicit c-Fos expression, mice ran on the treadmill for 45 min at a speed of 9 cm s−1 and were perfused 1 h later (walking group). Control mice were perfused directly from their home cages (homecage group).
Fine-tuning dataset
We used LSFM data from a hemisphere of a mouse brain. The clearing-assisted tissue click chemistry method was used for in situ fluorescence imaging of drug molecules that bind to specific targets. In this method, the covalent monoamine oxidase inhibitor pargyline-yne was administered to the mouse by intraperitoneal injection at a dose of 10 mg kg−1 and allowed to act for 1 h. The drug was labeled with AF647 dye using a click reaction, allowing us to visualize drug molecules within brain tissue.
Tissue clearing and immunolabeling
Training dataset
Two weeks following injection of 4-hydroxytamoxifen, mice were perfused; their brains were collected and underwent overnight fixation in 4% PFA solution. Fixed brain specimens were treated using the SHIELD8 protocol to maintain the integrity of protein antigenicity. Thereafter, an active clearing procedure was used for tissue CLARITY5. Samples were index matched by adding them to EasyIndex medium before imaging with a light-sheet microscope.
Unseen dataset 1 and walking dataset
Adult mice were anesthetized with intraperitoneal pentobarbital (150 mg kg−1), followed by intracardiac perfusion of 1× PBS then 4% PFA in PBS. The brain was dissected and the sample postfixed in 4% PFA overnight at 4 °C. Brains then underwent processing with iDISCO+ (ref. 1). Briefly, samples underwent methanol pretreatment by dehydration with a methanol/H2O series, each for 1 h, as follows: 20, 40, 60, 80 and 100%. Samples were then washed with 100% methanol for 1 h and chilled at 4 °C, followed by overnight incubation in 66% dichloromethane/33% methanol at room temperature. This was followed by two washes in 100% methanol at room temperature, then bleaching in chilled fresh 5% H2O2 in methanol overnight at 4 °C. Samples were rehydrated with a methanol/H2O series as follows: 80, 60, 40 and 20% then PBS, each for 1 h at room temperature. Samples underwent washing for 2× 1 h at room temperature in PTx.2 buffer, and were then incubated in permeabilization solution for 2 days at 37 °C. Samples were then incubated in blocking solution (42 ml of PTx.2, 3 ml of normal donkey serum and 5 ml of DMSO, for a total stock volume of 50 ml) for 2 days at 37 °C with shaking. This was followed by incubation in a primary antibody solution consisting of PBS/0.2% Tween-20 with 10 μg/ml heparin (PTwH), 5% DMSO, 3% normal donkey serum and c-Fos (rabbit anti-c-Fos, 1:2,000, Synaptic Systems, catalog no. 226003) for 7 days at 37 °C with shaking. Next, samples were washed in PTwH for 24 h, followed by incubation in a secondary antibody solution consisting of PTwH, 3% normal donkey serum and donkey anti-rabbit Alexa Fluor 647 (1:500, Thermo Fisher Scientific) for 7 days at 37 °C with shaking. Samples were then washed in PTwH for 24 h, followed by tissue clearing; final clearing was performed using iDISCO+ (ref. 1). Briefly, samples were dehydrated in a methanol/H2O series as follows: 20, 40, 60, 80 and 100%, each 2× 1 h at room temperature. This was followed by a 3-h incubation in 66% dichloromethane/33% methanol at room temperature, then incubation in 100% dichloromethane for 2× 15 min. Samples were then incubated in dibenzyl ether for at least 24 h before imaging.
Cold dataset
Whole-brain clearing was performed by LifeCanvas Technologies through a contracted service. To preserve samples’ protein architecture, they were fixed with proprietary SHIELD8 solutions from LifeCanvas Technologies. Samples were then placed in the SmartBatch+ system to carry out active clearing and immunolabeling with rabbit anti-c-Fos primary antibody (CST, catalog no. 2250S). Samples were index matched by placing them in EasyIndex medium.
Unseen dataset 2
The animal was anesthetized with isoflurane and transcardially perfused with PBS, followed by 4% PFA. The whole mouse brain was extracted carefully and fixed in 4% PFA for 24 h; the PFA-fixed sample was then processed using the SHIELD8 protocol (LifeCanvas Technologies). Active clearing of samples was carried out using the SmartBatch+ device. Samples were then immunolabeled using eFLASH62 technology, incorporating the electrotransport and SWITCH63 methods. Samples were labeled with a recombinant anti-c-Fos antibody (Abcam, catalog no. ab214672) and then fluorescently conjugated with Alexa Fluor 647 (Invitrogen, catalog no. A-31573) at a primary:secondary molar ratio of 1.0:1.5. Following labeling, samples were index matched (n = 1.52) by incubation in EasyIndex medium (LifeCanvas Technologies).
LSFM imaging
Training dataset
Samples were index matched with EasyIndex medium and imaged on a light-sheet microscope (SmartSPIM) with a ×4 objective lens. For eight subjects, data were acquired from a hemisphere while, for the remainder, data were acquired from the whole brain. Nominal lateral spatial resolution was 3.5 µm, and step size was set to 4 µm in the z-direction, with 561-nm excitation and an emission filter of 600/52 nm.
Unseen dataset 1 and walking dataset
Imaging of whole brains was performed using a CLARITY-optimized light-sheet microscope, as previously described64, with a pixel resolution of 1.4 × 1.4 µm² in the x and y dimensions and a z-step of 5 µm, using a ×4/0.28 numerical aperture objective, which is suitable for resolving cell nuclei labeled by c-Fos. Whole brains were imaged in a custom-made quartz cuvette filled with dibenzyl ether. Two channels were imaged: one with autofluorescence (auto channel, 488 nm) to demonstrate anatomy, and a second demonstrating c-Fos labeling (cell channel, 647 nm). All raw images were acquired as 16-bit TIFF files and were stitched together using TeraStitcher65.
Cold dataset
Whole-brain imaging and automated analysis were performed by LifeCanvas Technologies through a contracted service. Samples were index matched by placing them in EasyIndex medium, and imaged on a light-sheet microscope (SmartSPIM) with a ×4 objective lens. Nominal lateral spatial resolution was 1.75 µm, and step size was set to 4 µm in the z-direction.
Unseen dataset 2
The index-matched (n = 1.52) whole-brain sample was imaged using a SmartSPIM light-sheet microscope with a ×4 objective lens and a 642-nm laser. Nominal lateral spatial resolution was 1.8 µm, and step size was set to 4 µm in the z-direction.
GT label generation
Our annotation strategy for generating silver standard GT labels of neuronal somas comprised three stages: MIRACL10 segmentation, Ilastik pixel classification and postprocessing.
MIRACL segmentation
To create GT labels, we first used the MIRACL10 segmentation workflow, which incorporates image-processing tools implemented as FIJI/ImageJ macros38. The workflow includes a marker-controlled 3D watershed algorithm and a postprocessing 3D shape filter to remove false positives (FP). This resulted in binary GT labels across the entire dataset with a low FP rate and a relatively higher false-negative (FN) rate.
Ilastik pixel classification
To improve the MIRACL-generated GT labels, we used Ilastik22, which performs pixel classification using image filters (as input features) and a random forest (RF) algorithm (as a classifier) through a user-friendly interface. Filters include pixel color and intensity descriptors, edginess and texture, in 3D and at different scales. The RF combines hundreds of decision trees, training each one on a slightly different set of features. The final RF predictions are made by averaging the predictions of the individual trees. We used 37 three-dimensional filters and an RF classifier with 100 trees. To train the RF classifier, we imported MIRACL's segmentation outputs as silver input annotations (initialization) to Ilastik (that is, in lieu of manual annotations). To address the prohibitive speed and memory requirements of the RF algorithm, we trained it on image patches (512³ voxels). For each brain, three 512³ patches from different depths were randomly selected. The RF model was trained several times by providing feedback (that is, by correcting the results with expert annotations and modifying the labels to achieve optimal results). Ilastik output did not correctly detect the boundaries of neuronal somas, and frequently overestimated their spatial extent.
Postprocessing
To further reduce the number of FP voxels in the GT labels, we applied a 3D shape filter, using the ImageJ shape filter plugin66, on the Ilastik-generated labels. To address the volume overestimation, we applied a 3D erosion filter with a sphere-like kernel (radius of one voxel) to the GT labels. The final whole-brain GT labels were visually quality checked in three 512³ patches per brain by two raters; specifically, the raters were asked to randomly select and validate one 512³ patch in each of the cerebrum, brain stem and cerebellum.
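As a rough illustration of the erosion step, the operation can be expressed with scipy; this is a sketch only, assuming the labels are held as a binary NumPy array named gt_labels (a hypothetical name), and the ImageJ 3D shape-filter step is not shown.

```python
# Hedged sketch of the postprocessing erosion: a radius-one, sphere-like
# structuring element applied to the binary GT labels to counter the
# volume overestimation from Ilastik.
from scipy import ndimage

# radius-1, sphere-like structuring element (6-connected in 3D)
struct = ndimage.generate_binary_structure(rank=3, connectivity=1)
gt_eroded = ndimage.binary_erosion(gt_labels, structure=struct)
```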
Deep neural network architectures
ACE’s segmentation module consisted of 3D ViT-based and CNN-based (U-Net) architectures (Extended Data Fig. 1).
UNETR
The popular U-Net architecture has powerful representation learning capabilities, and generates more accurate dense-segmentation masks than do other CNN architectures (such as Mask R-CNN), thanks to the preservation of spatial information26,67. However, fully convolutional models are limited in their ability to learn long-range dependencies, resulting in potentially suboptimal segmentation of objects in large volumes, including neuronal cell bodies of varying shape and size (for example, in different regions of the brain). To address this issue, a ViT architecture (UNETR) has been proposed in which the encoder path of a U-Net is replaced by a transformer to learn contextual information from the embedded input patches68. This motivated us to develop and deploy optimized UNETR-based segmentation models with residual blocks and dropout layers to improve the robustness and generalizability of previous pipelines. Specifically, in UNETR68 the encoder was replaced with a stack of 12 transformer blocks, operating on a 1D sequence embedding (token size 16³ = 4,096 voxels) of the input (one-channel 3D image patch, 96³ voxels ≃ 0.27 × 0.27 × 0.48 mm³). Subsequently, a linear layer projected the vectors into a lower-dimensional embedding space (embedding size K = 768), which remained constant throughout the transformer layers. A 1D learnable positional embedding was added to the projected patch sequence to preserve spatial information. This embedded patch sequence, with dimension N × K (where N is the sequence length, N = H/16 × W/16 × D/16, and H, W and D are the 3D input's height, width and depth), was passed as input to the stack of transformer blocks. Each transformer block consisted of multihead attention and multilayer perceptron layers. All multihead attention modules consisted of multiple self-attention heads, where each self-attention head learned a mapping of the patch sequence in parallel. We extracted learned sequence representations at four different depths of the transformer stack and reshaped each back to \(\frac{H}{16}\times \frac{W}{16}\times \frac{D}{16}\times K\). The embedded sequence vectors then passed through five different encoder blocks with consecutive convolutional layers to achieve supervision at different depths (Extended Data Fig. 1).
The encoder was connected to a decoder via skip connections at multiple resolutions, to predict segmentation outputs. We developed an optimized UNETR68 architecture with some modifications, including the addition of dropout layers in all blocks and residual units69 in the encoder's convolution blocks, as a regularization strategy to avoid overfitting and vanishing-gradient problems. For a given 3D input cube, the segmentation models generated a 3D volume containing voxel-wise probabilities of neuronal cell bodies (0 < P < 1). Prediction maps were binarized using either a default (0.5) or user-fed threshold to generate a map indicating whether each voxel belongs to a neuron.
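For illustration, the configuration described above maps onto MONAI's reference UNETR implementation roughly as follows. This is a hedged sketch rather than the authors' exact model definition; the channel counts are assumptions, and the dropout probability of 0.2 anticipates the Monte Carlo dropout section below.

```python
# Minimal sketch using MONAI's reference UNETR; hyperparameter values
# follow the text, but this is not the authors' exact model definition.
from monai.networks.nets import UNETR

unetr = UNETR(
    in_channels=1,            # one-channel LSFM image patch (assumption)
    out_channels=2,           # background versus neuronal soma (assumption)
    img_size=(96, 96, 96),    # 96^3-voxel input patch
    feature_size=16,
    hidden_size=768,          # embedding size K
    num_heads=12,             # attention heads per transformer block
    res_block=True,           # residual units in the convolution blocks
    dropout_rate=0.2,         # dropout reused for Monte Carlo inference
)
```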
U-Net
We similarly implemented an optimized version of the seminal U-Net70,71,72 architecture, with some modifications, based on our previous work40. The U-Net architecture consists of contracting (encoder) and expanding (decoder) paths. The encoder is based on 3D convolution and pooling operators; it takes an image patch as input and generates feature maps at various scales, creating a multilevel, multiresolution feature representation. Meanwhile, the decoder with up-convolution operators leverages these feature representations to classify all pixels at the original image resolution. The decoder assembles the segmentation, starting with low-resolution feature maps that capture large-scale structures, and gradually refines the output to include fine-scale details. In our U-Net architecture, the standard building blocks have been replaced with residual blocks69. In addition, parametric rectifying linear units73 have been deployed to provide different parameterized nonlinearity for different layers. We converted all 2D operations, including convolution and max-pooling layers, into 3D and used batch normalization rather than instance normalization to achieve a stable distribution of activation values throughout training, and to accelerate training74. A dropout layer was added between all convolution blocks in the architecture as a regularizer, to avoid overfitting36,75. Except for the first residual unit, all convolutions and transpose convolutions had a stride of 2 for downsampling and upsampling, respectively. The first residual unit used a stride of 1, which has been shown to increase performance by not immediately downsampling the input image patch70.
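A corresponding hedged sketch with MONAI's reference 3D U-Net is shown below; the values mirror the description above (five levels, initial channel size 16, 3 × 3 × 3 kernels, residual units, PReLU activations, batch normalization, dropout) but do not reproduce the authors' exact implementation (for example, MONAI's reference U-Net strides its first block, whereas the text keeps the first residual unit at stride 1).

```python
# Minimal sketch using MONAI's reference 3D U-Net; not the authors' exact code.
from monai.networks.nets import UNet

unet = UNet(
    spatial_dims=3,
    in_channels=1,                    # assumption, as above
    out_channels=2,
    channels=(16, 32, 64, 128, 256),  # five-layer encoder, initial size 16
    strides=(2, 2, 2, 2),             # stride-2 down/upsampling between levels
    kernel_size=3,                    # 3 x 3 x 3 convolutions
    num_res_units=2,                  # residual blocks replace plain convolutions
    act="PRELU",                      # parametric rectifying linear units
    norm="BATCH",                     # batch rather than instance normalization
    dropout=0.2,                      # regularizing dropout between blocks
)
```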
All brain light-sheet data are first divided into smaller image patches. ACE's segmentation models then produce binary and uncertainty maps per image patch, and the pipeline automatically stitches all the maps to create whole-brain segmentation and uncertainty maps that match the input data.
Loss function
The DSC is a widely used metric that measures similarity between two labels. The class average of the DSC can be computed as
$$D({\mathrm{GT}},P)=\frac{2}{N}\sum_{n=1}^{N}\frac{\sum_{k=1}^{K}{\mathrm{GT}}_{k,n}{P}_{k,n}}{\sum_{k=1}^{K}{\mathrm{GT}}_{k,n}^{2}+\sum_{k=1}^{K}{P}_{k,n}^{2}},$$
where N is the number of classes, K is the number of voxels, and \({\mathrm{GT}}_{k,n}\) and \(P_{k,n}\) denote the GT and output prediction, respectively, for class n at voxel k.
Cross-entropy (CE) measures the difference between two probability distributions over the same set of underlying events, and was computed as
$${\mathrm{CE}}({\mathrm{GT}},P)=-\frac{1}{K}\sum_{k=1}^{K}\sum_{n=1}^{N}{\mathrm{GT}}_{k,n}\log ({P}_{k,n}).$$
For ACE's DL architectures, we used an equally weighted Dice–cross-entropy loss, which combines the Dice loss (one minus the DSC defined above) with the cross-entropy loss; this was computed in a voxel-wise manner as
$$L({\mathrm{GT}},P)=\left(1-D({\mathrm{GT}},P)\right)+{\mathrm{CE}}({\mathrm{GT}},P).$$
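For reference, MONAI provides this combined loss as DiceCELoss, which matches the formulation above when both weighting terms are left at their default of 1; whether the authors used this exact class is an assumption.

```python
# Hedged sketch: an equally weighted Dice-cross-entropy loss via MONAI.
from monai.losses import DiceCELoss

loss_fn = DiceCELoss(
    to_onehot_y=True,   # one-hot encode the integer GT labels
    softmax=True,       # apply softmax to the network logits to obtain P
    lambda_dice=1.0,    # equal weighting of the Dice term
    lambda_ce=1.0,      # and the cross-entropy term
)
```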
Model training
To train our DL models, we used a total of 36,480 unique input patches (96³ voxels), not accounting for data augmentation. To address the issue of class imbalance in our dataset (the majority of voxels representing background), patches containing only background—or a small number of foreground voxels (<100,000)—were filtered out from the 512³ image patches generated by the annotation strategy. Hence, for the UNETR model with an input size of 96³, we used 15,200, 9,120 and 12,160 unique input patches (not accounting for data augmentation) for training, validation and testing, respectively (n = 30,400 patches with augmentation for training alone); for the U-Net model with an input size of 128³, we used 7,600, 3,840 and 5,120 patches, respectively.
For hyperparameter tuning, we used a Bayesian optimization approach via the Adaptive Experimentation Platform (https://ax.dev). The following hyperparameters were optimized during training: input image size, encoder and decoder depth, kernel size, learning rate, batch size and loss function. For the UNETR architecture, which in total had 92.8 million parameters, the best model based on DSC performance on the validation set had the following parameters: 12 attention heads, feature size of 16, input patch size of 96³ voxels and batch size of 24. The U-Net architecture, which in total had 4.8 million parameters, had the following parameters: a five-layer encoder with an initial channel size of 16 and kernel size of 3 × 3 × 3, input patch size of 128³ voxels and batch size of 27.
The UNETR and U-Net models were trained for 700 and 580 epochs, respectively. To avoid overfitting, early stopping was triggered after 50 epochs without improvement in performance (DSC) on the validation dataset. The Adam optimizer76 was used with an initial learning rate of 0.0001.
Implementation
All DL models were implemented in Python using the Medical Open Network for Artificial Intelligence framework (MONAI77) and the PyTorch machine learning framework78. All training was performed on the Cedar and Narval clusters provided by the Digital Research Alliance of Canada (www.alliancecan.ca), using NVIDIA V100 Volta graphics processing units with 32 GB of memory and A100 graphics processing units with 40 GB of memory. Registration of our data to ARA was performed with our in-house, open-source MIRACL10 software. MIRACL is fully containerized and available as Docker and Singularity/Apptainer images. For results visualization, we used a variety of open-source software applications and Python libraries, including matplotlib, seaborn, Brainrender79, Fiji/ImageJ38, itk-SNAP80 and Freeview. We also used BioRender for figure creation in this manuscript (https://www.biorender.com/).
Data augmentation
To increase the generalizability of the DL models and to model distribution shifts in LSFM data, the training set images were randomly augmented in real time at every epoch. Supplementary Fig. 3 shows all the data augmentation transforms used, namely: affine transformations, contrast adjustment, histogram shift, random axis flipping and different noise distributions such as salt and pepper, and Gaussian. These transforms were selected because they are representative of distortions that occur during LSFM imaging. Briefly, intensity scaling to the range [0, 1] was applied to each 512³ image patch based on the intensity distribution of that patch: intensities between the 0.05th and 99.95th percentiles of the image patch were linearly mapped to [0, 1]. Subsequently, each data augmentation transform was randomly applied (with a probability of P = 0.5) to the 512³ image patch at each epoch. The parameters of each data augmentation transform were also randomly selected at each epoch from a predetermined range of values.
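A minimal sketch of such an augmentation pipeline, approximated with standard MONAI dictionary transforms, is shown below; the parameter ranges are illustrative assumptions, and the salt-and-pepper noise would require a custom transform, omitted here.

```python
# Hedged sketch of the real-time augmentation pipeline using MONAI.
from monai.transforms import (
    Compose, ScaleIntensityRangePercentilesd, RandAffined,
    RandAdjustContrastd, RandHistogramShiftd, RandAxisFlipd,
    RandGaussianNoised,
)

train_transforms = Compose([
    # deterministic percentile-based rescaling to [0, 1]
    ScaleIntensityRangePercentilesd(keys="image", lower=0.05, upper=99.95,
                                    b_min=0.0, b_max=1.0),
    # each stochastic transform fires with probability 0.5 at every epoch;
    # the ranges below are assumptions, not the authors' exact settings
    RandAffined(keys=("image", "label"), prob=0.5,
                rotate_range=(0.1, 0.1, 0.1), scale_range=(0.1, 0.1, 0.1)),
    RandAdjustContrastd(keys="image", prob=0.5),
    RandHistogramShiftd(keys="image", prob=0.5),
    RandAxisFlipd(keys=("image", "label"), prob=0.5),
    RandGaussianNoised(keys="image", prob=0.5),
])
```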
Voxel-wise uncertainty map
Epistemic uncertainty, as described previously35, is rooted in a lack of knowledge about model parameters and structure, rather than stemming from inherent variability in the observed data; this type of uncertainty is often referred to as model uncertainty35. To estimate the models' uncertainty and confidence in predictions, we used the Monte Carlo dropout approach. During training, dropout layers (with a probability of P = 0.2) were utilized as a regularization technique. It has been shown that turning on dropout layers (randomly switching neurons off) in inference mode can be interpreted as a Bayesian approximation of the Gaussian process36. At test time, with the dropout layers turned on, each forward pass yields a stochastically different prediction, as a sample from the approximate parametric posterior distribution \(P(y|X)\)37. This technique has been shown to provide useful insights into the model's uncertainty, by computing the variance of numerous predictions (\(Y=\{y_{1},y_{2},\ldots,y_{N}\}\)). Voxel-wise uncertainty (variance) can thus be defined as7:
$$\text{Voxel-wise uncertainty}\simeq \frac{1}{N}\sum_{n=1}^{N}{y}_{n}^{2}-{\left(\frac{1}{N}\sum_{n=1}^{N}{y}_{n}\right)}^{2}$$
The resulting voxel-wise uncertainty map provides a measure of variation in predictions of slightly different models for the same data, which can be particularly useful in identifying regions of high uncertainty.
Ensemble of ensembles
Each model's output was obtained by averaging the probability maps of 50 models (\(Y=\{y_{1},y_{2},\ldots,y_{50}\}\)) using the Monte Carlo dropout technique:
$${\hat{Y}}_{{{\mathrm{Ensembles}}}}=\frac{1}{50}\mathop{\sum}\limits_{i=1}^{50}{y}_{i}$$
Subsequently, to combine the prediction maps of both models, a final mapping of neurons was generated using an ensemble of both models (ensemble of ensembles):
$${\hat{Y}}_{{{\mathrm{Ensemble}}\; {\mathrm{of}}\; {\mathrm{ensembles}}}}=\frac{1}{2}\left({\hat{Y}}_{{{\mathrm{Ensembles}}}_{{\mathrm{U}}-{\mathrm{Net}}}}+{\hat{Y}}_{{{\mathrm{Ensembles}}}_{{{\mathrm{UNETR}}}}}\right).$$
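Putting the two preceding subsections together, a minimal sketch of Monte Carlo dropout sampling and the ensemble of ensembles might look as follows. It assumes the unetr/unet models sketched earlier and a single preprocessed input patch (in the actual pipeline the two models use different patch sizes, and the resulting maps are stitched into whole-brain volumes).

```python
# Hedged sketch: Monte Carlo dropout inference and ensemble of ensembles.
import torch

def mc_dropout_samples(model, x, n=50):
    """Draw n stochastic foreground-probability maps with dropout active."""
    model.eval()
    for m in model.modules():
        if isinstance(m, (torch.nn.Dropout, torch.nn.Dropout3d)):
            m.train()  # keep dropout stochastic at inference time
    with torch.no_grad():
        # softmax over classes; channel 1 is the foreground (neuron) class
        return torch.stack([model(x).softmax(dim=1)[:, 1] for _ in range(n)])

y_unet = mc_dropout_samples(unet, patch)     # patch: preprocessed input cube
y_unetr = mc_dropout_samples(unetr, patch)
uncertainty = y_unet.var(dim=0)              # voxel-wise uncertainty map
y_final = 0.5 * (y_unet.mean(dim=0) + y_unetr.mean(dim=0))  # ensemble of ensembles
```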
Model evaluation
Evaluation metrics
To evaluate the performance of the segmentation models, several volume- and shape-based metrics were used, including DSC, recall, precision, F1 score and HD95. We used the metrics derived from the resulting confusion matrix and associated true-positive (TP), FP and FN values. Each neuronal soma was defined as a 3D connected component. Given this definition, TP was defined as the number of correctly detected neurons following comparison of GT with prediction P. Sensitivity or recall measures the proportion of TP relative to the number of individual neurons delineated in GT, and was defined as
$${\rm{Recall}}=\frac{{{\mathrm{TP}}}}{{{\mathrm{TP}}}+{{\mathrm{FN}}}}.$$
Precision measures the proportion of TP against all positive predictions, and was defined as
$${\rm{Precision}}=\frac{{{\mathrm{TP}}}}{{{\mathrm{TP}}}+{{\mathrm{FP}}}}.$$
While recall is useful to gauge the number of FN pixels in an image, precision is useful to evaluate the number of FP pixels in a prediction.
The F1 score, the harmonic mean of precision and recall, is often used to measure the overall performance of a model; it penalizes both missed and wrongly detected neurons in P:
$${{F}}\,1\,{\rm{score}}=\frac{2\times {{\mathrm{Precision}}}\times {{\mathrm{Recall}}}}{{{\mathrm{Precision}}}+{{\mathrm{Recall}}}}.$$
DSC measures the number of elements common to GT and P datasets, divided by the sum of the elements in each dataset, and is defined as
$${{\mathrm{DSC}}}({{\mathrm{GT}}},P)=2\times \frac{\left|{{\mathrm{GT}}}\cap P\right|}{\left|{{\mathrm{GT}}}\right|+\left|P\right|}.$$
Hausdorff distance is a mathematical measurement of the 'closeness' of two sets of points that are subsets of a metric space; it is the greatest of all distances from a point in one set to the closest point in the other. We used the 95th percentile of the Hausdorff distance rather than the maximum, to provide a more robust and representative measure of segmentation performance in image analysis tasks. Given two sets of points, X and Y, the Hausdorff distance between these two sets was defined as
$${\mathrm{HD}}(X,Y)=\max \left\{\sup_{x\in X}\,\inf_{y\in Y}\,d(x,y),\;\sup_{y\in Y}\,\inf_{x\in X}\,d(x,y)\right\}.$$
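A hedged sketch of how the object-level confusion-matrix metrics can be computed with connected components follows; the authors' exact matching criterion between GT and predicted somas may differ from the simple overlap rule used here.

```python
# Hedged sketch: object-level recall, precision and F1 score, with each
# neuronal soma defined as a 3D connected component.
import numpy as np
from scipy import ndimage

def detection_metrics(gt, pred):
    gt_lbl, n_gt = ndimage.label(gt)      # GT somas as connected components
    pr_lbl, n_pr = ndimage.label(pred)    # predicted somas
    overlap = (gt_lbl > 0) & (pr_lbl > 0)
    tp = np.unique(gt_lbl[overlap]).size  # GT somas hit by any prediction
    fp = n_pr - np.unique(pr_lbl[overlap]).size
    fn = n_gt - tp
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1
```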
Simulation distribution shift
We deployed our recently published ROOD-MR platform on our test data; the platform includes methods for simulating distribution shifts in datasets at varying severity levels using imaging transforms, and for benchmarking segmentation algorithms based on their robustness to distribution shifts and corruptions. We employed three commonly used transforms that disrupt low-level spatial information (Gaussian noise, smoothing and sharpening). We added Gaussian noise to the image with zero mean and 0 < σ < 1. For sharpening, we used a Gaussian blur filter with zero mean and 0.05 < σ < 2. We also applied a Gaussian smoothing filter to the input data based on the specified sigma (σ) parameter (0.05 < σ < 2).
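As an illustration, the three corruption families can be approximated with MONAI transforms; this is a sketch only, as ROOD-MR ships its own implementations, and the parameter ranges follow the text.

```python
# Hedged sketch of the three distribution-shift corruptions via MONAI.
from monai.transforms import (
    RandGaussianNoised, RandGaussianSmoothd, RandGaussianSharpend,
)

noise = RandGaussianNoised(keys="image", prob=1.0, mean=0.0, std=0.5)  # 0 < sigma < 1
smooth = RandGaussianSmoothd(keys="image", prob=1.0,
                             sigma_x=(0.05, 2.0), sigma_y=(0.05, 2.0),
                             sigma_z=(0.05, 2.0))                      # 0.05 < sigma < 2
sharpen = RandGaussianSharpend(keys="image", prob=1.0)  # unsharp masking via Gaussian blurs
```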
Comparison against Ilastik
We used the pixel classification module in Ilastik and trained a RF classifier using all training subjects (18 whole-brain LSFM images). We used 37 three-dimensional filters and an RF classifier with 100 trees. To train the RF model, we randomly selected three 512³-voxel image patches from each subject and applied the same scale-intensity transform (Supplementary Fig. 3) used in the DL training approach, to provide a fair comparison. Next, we dedicated around 2 h per image patch on a personal computer (with 24 central processing unit cores and 512 GB of memory) to annotating neurons and providing feedback to the RF algorithm, to achieve optimal results. Finally, the trained RF classifier was used to generate segmentation maps for all test and unseen datasets. For quantitative evaluation of ACE robustness across different regions of the brain, we registered the test set to the 10-µm ARA, warped ARA labels back to their native space and integrated the warped labels with segmentation maps.
Comparison against Cellfinder
We used Cellfinder to generate whole-brain detection maps of neuronal cell bodies. We first used their pretrained ResNet model to generate detection maps on both test and unseen datasets. We deployed the Cellfinder command with the --no-registration flag to detect and classify cells as either background or neuron. For retraining Cellfinder, we randomly selected two subjects from the test set and generated ~6,100 (~3,200 cells and ~2,900 non-cells) annotated cell candidates for the first subject and ~9,000 (~4,300 cells and ~4,700 non-cells) for the second subject, using their Napari-cellfinder plugin. The training data were then used to retrain the ResNet model by incorporating the function cellfinder_train and the flag --continue-training, keeping other options as default. Lastly, the best retrained model based on validation error was used to generate whole-brain detection maps. For quantitative comparison against Cellfinder, we transformed ACE segmentation maps into detection maps by finding the center of mass of each neuron in 3D.
Voxelization, registration and heatmap generation
Voxelization
To synthesize and correlate our segmentation results within the ARA space, a voxelization process was used. Voxelization entailed the transformation of high-resolution segmentation outcomes into a 10-μm resolution space while minimizing loss of information. Segmentation volumes underwent convolution with a spherical kernel featuring a radius of ~5 μm—a dimension that aligns with the downsampling factor and ARA space. Subsequently, within each convolved sphere, the average count of labels (cells or nuclei) was computed, resulting in a voxelized map. This 3D voxelized representation, which was generated using Python’s skimage library with parallel computation facilitated by the joblib and multiprocessing libraries, allowed for efficient feature extraction summarized by ARA regions/labels.
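A minimal sketch of the voxelization step follows; the kernel radius in voxels and the handling of the downsampling are assumptions, and the actual pipeline parallelizes this computation with joblib over sub-volumes.

```python
# Hedged sketch of voxelization: local averaging of a high-resolution
# binary segmentation with a spherical kernel before summarizing at
# 10-um ARA resolution.
import numpy as np
from scipy.ndimage import convolve
from skimage.morphology import ball

def voxelize(seg, radius_vox=5):
    kernel = ball(radius_vox).astype(np.float32)
    kernel /= kernel.sum()               # mean label count within the sphere
    return convolve(seg.astype(np.float32), kernel, mode="constant")
```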
Registration
To bring whole-brain tissue-cleared microscopy images, segmentation maps and a reference atlas (ARA) into spatial correspondence, we used our open-source MIRACL10 platform. MIRACL contains specialized workflows that are optimized for multimodal registration of cleared data, based on tools from ANTs81 (https://github.com/ANTsX/ANTs). Registration workflows include a cascade of image-preprocessing techniques, such as denoising and intensity correction, as well as an intensity-based alignment40. The alignment process consists of two main steps. In the first, an initial alignment is carried out using the antsAffineInitializer tool from ANTs81; the second consists of an intensity-based, multistage, b-spline registration algorithm encompassing a rigid (six degrees of freedom), an affine (12 degrees of freedom) and a nonrigid (deformable) b-spline symmetric normalization stage. The rigid and affine stages are based on maximization of the mutual information similarity metric between ARA and microscopy data, while the deformable stage uses cross-correlation as the similarity metric. The resulting transformations perform bidirectional warping of images to and from tissue-cleared microscopy native space and ARA space.
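For orientation, an analogous alignment can be expressed with ANTsPy. This is a sketch only: MIRACL invokes the underlying ANTs command-line tools with its own preprocessing and stage parameters, and the file names here are hypothetical.

```python
# Hedged sketch of an ANTs-based alignment using ANTsPy (not MIRACL's
# exact calls); SyNCC combines affine initialization with a deformable
# SyN stage driven by cross-correlation.
import ants

fixed = ants.image_read("ara_template_10um.nii.gz")             # hypothetical path
moving = ants.image_read("clarity_autofluo_downsampled.nii.gz") # hypothetical path
reg = ants.registration(fixed=fixed, moving=moving, type_of_transform="SyNCC")
warped = reg["warpedmovout"]                          # microscopy volume in ARA space
fwd, inv = reg["fwdtransforms"], reg["invtransforms"] # bidirectional warps
```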
Heatmap generation
Voxelized segmentation maps were warped to the ARA space (10-μm resolution) using the deformation field obtained through registration. Subsequently, a smoothing Gaussian filter was applied with a sigma of four pixels. Difference heatmaps were then computed by subtracting the group averages of the voxelized, warped segmentation maps.
ROI-based analysis
We passed the voxelized and warped segmentation map at a resolution of 10 μm, in addition to the registered labels for each subject, to MIRACL's function seg feat_extract, to extract the density of cells per ARA region for both the whole brain and labels grouped to a maximum atlas ontology depth of 6. The ARA labels are structured by descending depth, from coarse- to fine-grained groupings of brain regions, with MIRACL's function combining labels at a higher depth by their parent labels. The resulting density results were then passed to MIRACL's function group_unpaired_ttest, which applied an unpaired Student's t-test (two-sided) per label with an alpha value of 0.05, for both whole-brain regions and depth-6 regions, in both applications. This function creates bar plots that compare the density of cells per ARA label, including both significant ROIs and trending regions (P < 0.1). Lastly, MIRACL's function proj_stats_on_atlas was used to project the resulting P values onto the atlas regions.
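A hedged sketch of the per-label test follows; MIRACL's group_unpaired_ttest wraps equivalent logic, and the region_densities structure used here is hypothetical.

```python
# Hedged sketch: unpaired two-sided t-test on regional cell densities.
from scipy.stats import ttest_ind

for label, dens_a, dens_b in region_densities:  # one entry per ARA label
    t_stat, p = ttest_ind(dens_a, dens_b)       # two-sided by default
    if p < 0.05:
        print(f"{label}: significant (P = {p:.4f})")
    elif p < 0.1:
        print(f"{label}: trending (P = {p:.4f})")
```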
Cluster-wise, permutation-based statistical algorithm and analysis
We developed a cluster-wise, permutation-based statistical algorithm with TFCE. The statistical pipeline consisted of three main steps, and is based on the implementation of the function spatio_temporal_cluster_test from MNE (https://mne.tools). In the first step, a voxel-wise statistical test using two-way ANOVA was performed between the two study groups. To incorporate the correlation structure of the data and correct for multiple comparisons, we considered a null hypothesis regarding the sizes of clusters in our data rather than focusing on individual voxels. Thus, in the second step, clusters were defined using the adjacency structure of our data (connecting each voxel to its neighbors in 3D) and the TFCE technique44, which addresses the challenge of selecting a threshold for both cluster forming and smoothing. Given the exceedingly large number of voxels in teravoxel LSFM data (and hence the number of statistical tests), we optimized the adjacency structure of the data using a priori knowledge from group-wise heatmaps to boost the sensitivity of the statistical results. Specifically, the adjacency matrix was masked using a thresholded (90th percentile) and then dilated version of the group-difference heatmap, allowing the algorithm to focus only on putative clusters. TFCE transforms a raw statistical image into a map that reflects the strength of local spatial clustering. The new value assigned to each voxel is determined by aggregating the scores of all supporting sections below it. The score of each section is computed by taking the height of the section (raised to a power H) and multiplying it by the extent of the section (raised to a power E):
$${\mathrm{TFCE}}(v)=\int_{h={h}_{0}}^{{h}_{v}}{e(h)}^{E}\,{h}^{H}\,{\mathrm{d}}h,$$
where h0 is typically around zero and hv is the statistic value corresponding to voxel v. In practice, this integral is estimated as a sum, using finite step sizes (dh). The exponents E and H are free parameters, but fixing these values has been shown to yield robust outcomes justified by theory and empirical results44,59. Increasing H gives more weight to clusters with a higher effect size, while increasing E gives more weight to larger clusters44. In our analysis we chose E = 2, H = 0.5, h0 = 0 and a step size of 5 to mitigate the potential impact of FP in LSFM data, and fixed these values for both applications (the walking and food-seeking datasets). The application of TFCE yields a weighted sum of local cluster-like signals, eliminating the need for a fixed threshold to define clusters while keeping the local minima and maxima in the same spot. The size of each cluster was measured as the sum of voxels' F-values within the cluster. In the final step, a nonparametric permutation test (N = 1,000) was applied and new cluster-wise F-statistics were obtained. In each permutation, the same procedure was applied to define clusters and compute their statistics, with the largest cluster size being retained. We also used the stepdown P value procedure to boost sensitivity while controlling the family-wise error rate82. To test the significance of clusters in our actual data, a null distribution was obtained via permutations: cluster sizes observed in the actual data were compared with those in the null distribution to calculate P values, which were then compared with a predetermined threshold (such as alpha < 0.05, or false-discovery-rate-corrected P values) to test for significance. Next, we applied connected component analysis on the resulting P value image to summarize significant clusters. Finally, by integrating the connected component analysis results, ARA labels and density heatmaps, we extracted the center, mean effect size and volume of each cluster, along with the percentage volume within each brain region spanned.
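A hedged sketch of the corresponding MNE call is shown below, assuming flattened heatmaps X_a and X_b of shape (n_subjects, 1, n_voxels) and the masked sparse adjacency matrix described above; whether MNE's TFCE exponent keys map exactly onto the authors' E and H is an assumption.

```python
# Hedged sketch: cluster-wise TFCE permutation test built on MNE.
from mne.stats import spatio_temporal_cluster_test

f_obs, clusters, cluster_pv, h0_dist = spatio_temporal_cluster_test(
    [X_a, X_b],
    threshold=dict(start=0, step=5,        # TFCE with h0 = 0, step size 5
                   h_power=0.5, e_power=2),  # H = 0.5, E = 2 (assumed mapping)
    n_permutations=1000,
    adjacency=adjacency,                   # masked voxel-neighborhood matrix
    step_down_p=0.05,                      # stepdown procedure for sensitivity
    tail=1,                                # F statistics are one-tailed
)
```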
Cluster-wise connectivity analysis
The significant clusters identified through ACE's cluster-wise, permutation-based statistical algorithm were systematically ranked based on their statistical effect size and volume. To streamline the subsequent connectivity analysis, we selected the top 20 clusters from this ordered list. Integrating each cluster's location with the voxelized and warped segmentation maps, we calculated the mean intensity of each cluster per subject in both treated and control groups. Subsequently, we employed a Pearson correlation test, utilizing the function pearsonr from the scipy library, between the mean intensity values of each pair of clusters. For each correlation coefficient, we established an 80% confidence interval using the bias-corrected and accelerated bootstrap method with 10,000 iterations, and retained the lower bound. Finally, the P value associated with each correlation test was determined through permutation testing, involving 10,000 permutations. For visualization, significant correlation coefficients (P < 0.01) were plotted alongside a heatmap illustrating the Euclidean distance between each pair of cluster centroids.
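A hedged sketch of this association test for one cluster pair using scipy follows; the variable names are hypothetical, and scipy.stats.bootstrap with method="BCa" gives the bias-corrected and accelerated interval.

```python
# Hedged sketch: Pearson r between mean cluster intensities, 80% BCa
# bootstrap confidence interval and a permutation P value.
from scipy import stats

def pearson_r(x, y):
    return stats.pearsonr(x, y)[0]

r = pearson_r(mean_int_i, mean_int_j)          # per-subject mean intensities
boot = stats.bootstrap((mean_int_i, mean_int_j), pearson_r, paired=True,
                       method="BCa", confidence_level=0.80, n_resamples=10_000)
r_lower = boot.confidence_interval.low         # lower bound retained, as above
perm = stats.permutation_test((mean_int_i, mean_int_j), pearson_r,
                              permutation_type="pairings", n_resamples=10_000)
p_value = perm.pvalue
```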
Native space cluster validation
The significant clusters identified through ACE's cluster-wise TFCE permutation-based statistical algorithm were binarized at a significance threshold of 0.05 (or a user-fed threshold). Next, binarized clusters underwent dilation (one iteration, using binary_dilation from the scipy.ndimage package) and a connected component analysis (using the label function from the scipy.ndimage package) to differentiate each cluster. Processed significant clusters were then warped to the native space of each subject using the registration transformations and the function miracl lbls warp_clar. Utilizing ACE segmentation maps, the number of neurons within each cluster was computed by identifying the coordinates of each neuron. A two-sided Mann–Whitney U-test (from the scipy.stats package) was deployed to compare the number of neurons within each cluster across the two groups. The numbers of neurons per subject for each cluster—in addition to their volume, atlas space P value and cluster-wise statistics (obtained by ACE cluster-wise TFCE permutation), percentage volume within each overlapping brain region and native space P value (obtained by post hoc Mann–Whitney U-test)—are summarized in a comma-separated values file.
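A hedged sketch of the post hoc group comparison for a single cluster; counts_group_a and counts_group_b are hypothetical arrays of per-subject neuron counts within that cluster.

```python
# Hedged sketch: two-sided Mann-Whitney U-test on per-cluster neuron counts.
from scipy.stats import mannwhitneyu

u_stat, native_p = mannwhitneyu(counts_group_a, counts_group_b,
                                alternative="two-sided")
```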
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.