The MacqD deep-learning-based model for automatic detection of socially housed laboratory macaques

Data collection

A collection of video recordings, subsequently referred to as Macaque data, was acquired at the macaque research facility of Newcastle University, UK, between 2014 and 2020. The facility complies with the NC3Rs Guidelines for ‘Primate Accommodation, care and use’52 and comprises cages 2.1 m wide, 3 m deep, and 2.4 m high, exceeding the minimum requirements of UK legislation, in which animals are housed in pairs. In addition to the presence of a social partner, the cages are enriched with a variety of structural elements and objects (e.g. shelves, swings, ropes) to promote the well-being of the animals. Data recording was approved by Newcastle University Animal Welfare and Ethical Review Body (project number: ID 928).

Video recordings were collected from 20 macaques selected as the primary subjects of observation (focal animals), using remotely monitored, wall-mounted digital cameras (Cube HD Y-cam, 1080p, and Axis M1065-L, 1080p) fixed outside and positioned directly opposite each focal cage. While the cameras remained stationary for most of the study, they could be manually repositioned or zoomed in/out when necessary to improve visibility. Data were stored in .mp4 or .mov formats, with a spatial resolution of 1280 × 720 pixels, and sampled at 15 frames per second. The number of macaques visible in the videos varied: although both cagemates were present by default, one animal was sometimes temporarily absent (e.g. when it was in the experimental laboratory). Because some cages were positioned back-to-back, some videos also included non-focal animals in neighbouring cages. Examples of video frames from Macaque data are shown in Fig. 1.

To further evaluate the model’s generalisation capabilities, a video from the Institut des Sciences Cognitives Marc Jeannerod, referred to as the ISC dataset, was also used. The ISC dataset has a frame rate of 24 frames per second and a spatial resolution of 2880 × 2160 pixels.

Fig. 1

Example video frames used in this study. (a) Single macaque, partially hidden, with light overexposure; (b) Single macaque with cage railing occlusion; and (c) Pair of macaques in the focal cage, with partial overlap of the two individuals, partial occlusion from cage enrichment and one macaque from a neighbouring cage appearing in the background.

Data description

Macaque data were divided into several training and testing datasets (see Table 1) for use in Experiments 1 and 2. The two experiments differed only in the number of macaques in the focal cage: one (Experiment 1) or two (Experiment 2). For the training datasets, video recordings from 10 individuals (either alone or in pairs) were selected, and the same individuals were used for both experiments. Individual video frames were pseudo-randomly selected from recordings spanning various dates and times of day, ensuring that the datasets encompassed a wide range of cage settings, macaque postures, and positions within the cage. This approach aimed to ensure that the training datasets were representative of the macaques housed at the Newcastle facility.

For the testing datasets, 5-min videos of each single individual (Experiment 1) and each pair (Experiment 2) were used. In both experiments, two different testing datasets were employed: (1) the ‘Same’ dataset, comprising new video recordings of the same 10 individuals used in training, and (2) the ‘Different’ dataset, consisting of recordings from 10 new individuals. The ‘Different’ dataset was included to assess the generalisability of the models to macaques not encountered during training. While the training datasets were composed of isolated frames, consecutive frames were used for testing, in order to mirror real-world scenarios where behaviour recognition applications require dynamic information across frames. Because consecutive frames are often highly similar in terms of the macaque’s position in the cage and body posture, the total number of frames was increased by a factor of 20 to enhance variability and representation. As a result, each testing video consisted of 45,000 frames for single macaques and 22,500 frames for paired macaques (see Table 1). Further variability was ensured by using videos from a relatively high number of individuals (n = 10).

Table 1 Description of Macaque data sub-setting into different training and testing datasets. For paired animals (Experiment 2), the same videos were used to assess the detection of each pair member (see “Results” section for details).

Both the training and testing datasets included instances of occlusion, reflections, rapid motion, and overexposure. They also incorporated footage of animals displaying a comprehensive range of natural macaque behaviours (e.g. slow and fast locomotion, body shaking, foraging, interacting with objects, allogrooming and self-scratching). These diverse challenges were included to enhance the comprehensiveness of the study and ensure that the analysis was conducted under realistic and varied conditions.

Additionally, a 17-second video from the Institut des Sciences Cognitives Marc Jeannerod (ISC dataset), containing 420 frames, was used to further test the model. This video presented additional challenges, including occlusion, the presence of objects such as toys used as stimuli, and human reflections on the glass.

Data annotation

Training and testing datasets were annotated using the VIA image annotator53. In the training datasets, each video frame was annotated with pixel-level masks, a technique known as segmentation, in which each pixel is assigned to an individual macaque. For the testing datasets, including the ISC dataset, annotations were made with bounding boxes: rectangles drawn around each macaque to include all body parts while minimising the box area. This approach was selected to facilitate the comparison of different algorithms, some of which only output bounding boxes (see section “Performance metrics”). Annotations were performed by ten different research assistants, with each annotation verified by at least one other assistant and subsequently validated by the first authors. In the training datasets, macaques visible in neighbouring cages were also annotated to maximise learning, whereas for the testing datasets, detections and ground truths of macaques in the background were excluded to focus the evaluation on animals in the focal cage. Consequently, correct detections of neighbouring macaques did not count towards model performance, and models were not penalised for failing to detect them.

Macaque detection algorithms

In this study, macaque detection was assessed using three different algorithms. The first two implemented a deep learning approach using Mask R-CNN as the framework, training a neural network through supervised learning (where the network learns from labelled examples). The third algorithm was based on background elimination, an approach that does not require any training.

Mask R-CNN

Mask R-CNN54 is a state-of-the-art, two-stage framework widely used for segmenting individual objects in images (instance segmentation). It not only detects objects but also outlines their precise location with pixel-level masks. In the first stage, the image passes through a feature extractor (a series of convolutional layers) that generates feature maps representing key characteristics of the image. These maps are enhanced by a Feature Pyramid Network (FPN), which improves detection at different scales by combining detailed high-resolution features with more abstract low-resolution ones. This helps the model detect objects of various sizes in complex scenes. The refined maps are then processed by the Region Proposal Network (RPN), which suggests areas likely to contain objects (Regions of Interest, or ROIs). In the second stage, features from each ROI are extracted using RoIAlign, a method that ensures precise alignment with the original image. This alignment is critical for refining bounding boxes, classifying objects, and generating accurate segmentation masks, especially for small or detailed objects (Fig. 2).

In this study, a modified Mask R-CNN model, referred to as MacqD, was created for detecting macaques in video frames, incorporating SWIN55, a state-of-the-art transformer-based feature extractor. Throughout this paper, the term ‘MacqD’ refers specifically to this modified model. Images processed by MacqD were resized to a maximum of 1333 × 800 pixels while maintaining the aspect ratio (smaller images retained their original size), padded to meet network requirements, and normalised using standard ImageNet values. During training, random horizontal flipping was applied to improve the model’s generalisability by increasing dataset variability while preserving biological validity; no augmentation was applied during testing to ensure consistent evaluation. As a benchmark, MacqD was compared with SegNet, another Mask R-CNN variant implemented in SIPEC51, which uses ResNet10156, a widely used convolutional neural network, as its feature extractor. SegNet was trained on frames resized to 1280 × 1280 pixels, maintaining the aspect ratio. During training, random rectangular regions were hidden (converted to black pixels) to help the model detect partially occluded objects; no data augmentation was applied during testing.
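For illustration, the preprocessing described above could be expressed in MMDetection 2.x configuration syntax roughly as follows. This is a minimal sketch assuming standard ImageNet statistics and default transform names; it is not MacqD’s actual configuration file.

```python
# Hypothetical MMDetection (v2.x) data pipeline approximating the preprocessing
# described above; exact values in MacqD's configuration may differ.
img_norm_cfg = dict(  # standard ImageNet normalisation values
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),  # cap size, keep aspect ratio
    dict(type='RandomFlip', flip_ratio=0.5),                      # random horizontal flip
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),                            # pad to meet network requirements
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='MultiScaleFlipAug',
         img_scale=(1333, 800), flip=False,                       # no augmentation at test time
         transforms=[
             dict(type='Resize', keep_ratio=True),
             dict(type='Normalize', **img_norm_cfg),
             dict(type='Pad', size_divisor=32),
             dict(type='ImageToTensor', keys=['img']),
             dict(type='Collect', keys=['img']),
         ]),
]
```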

Fig. 2

Overview of the Mask R-CNN framework for macaque detection.

Background elimination (BE)

Unlike deep-learning techniques, which require extensive training and computational resources, the background elimination method is more efficient, as it does not require any training. In this research, an optimised Background Elimination (BE) pipeline was designed specifically for extracting macaques from video footage.

Our pipeline (Fig. 3) processes a video by extracting every 10th frame to build a background image, which is first resized to 1280 × 720 pixels to ensure consistency. The extracted frames are grouped into sets of 120, slightly blurred to smooth out minor variations, and the 70th percentile of each pixel’s RGB (red, green, blue) values is calculated to create sub-background images. These sub-backgrounds are then combined by taking the median RGB values to generate the final background image. Each video frame is compared to this background using the MOG257 algorithm from the OpenCV library58, which detects macaques by separating them from the static background. Even after the background is removed, small amounts of noise or gaps within the macaque’s outline may remain. To correct these imperfections, morphological operations are applied to group nearby pixels into clusters and fill gaps. A convex hull is then calculated for each cluster, forming a polygon that encloses the object by connecting its outermost points. The overlapping convex hulls are merged to smooth the outline, and small irrelevant clusters are filtered out, isolating the primary macaque. Finally, a bounding box is placed around the refined cluster to localise the detected macaque within the frame.
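The core steps of such a pipeline can be sketched with OpenCV as below. This is an illustrative simplification with assumed parameter values (blur kernel, morphology kernel, minimum cluster area), not the exact MacqD implementation.

```python
# Simplified background-elimination sketch (hypothetical parameters).
import cv2
import numpy as np

def build_background(frames):
    """Combine blurred frames into a background via per-pixel percentiles and a median."""
    blurred = [cv2.GaussianBlur(f, (5, 5), 0) for f in frames]
    groups = [blurred[i:i + 120] for i in range(0, len(blurred), 120)]
    sub_backgrounds = [np.percentile(g, 70, axis=0) for g in groups]  # 70th percentile per pixel
    return np.median(sub_backgrounds, axis=0).astype(np.uint8)

def detect_macaque(frame, subtractor, min_area=2000):
    """Foreground segmentation, clean-up and bounding-box extraction for one frame."""
    mask = subtractor.apply(frame, learningRate=0)            # compare frame to background
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)    # fill gaps within the outline
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)     # remove speckle noise
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        hull = cv2.convexHull(c)                              # convex polygon around each cluster
        if cv2.contourArea(hull) >= min_area:                 # drop small irrelevant clusters
            boxes.append(cv2.boundingRect(hull))              # (x, y, w, h)
    return boxes

# Usage sketch: prime the MOG2 subtractor with the precomputed background image,
# then detect with a zero learning rate so the background model does not drift.
subtractor = cv2.createBackgroundSubtractorMOG2()
# subtractor.apply(background_image, learningRate=1.0)
# boxes = detect_macaque(frame, subtractor)
```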

Fig. 3

Overview of the background elimination pipeline.

Experimental design: Experiments 1 & 2

Figure 4 illustrates the models used in the first two experiments: Experiment 1 tested models on the detection of single macaques, and Experiment 2 on paired macaques. In Experiment 1, the MacqD model trained on a dataset in which a single animal was present in the focal cage (Macaque Single dataset) was compared with background elimination (BE) and three SegNet variants: SegNet – Primate, from the original paper51, trained on a primate dataset with macaque images from the authors’ research facility; SegNet – Macaque Single, trained exclusively on our Macaque Single dataset; and SegNet – Primate + Macaque Single, which used SegNet – Primate as a starting point and was further trained with our Macaque Single dataset.

In Experiment 2, the MacqD model trained on the Macaque Single dataset was compared with two other MacqD derivatives: MacqD – Macaque Curriculum, which used MacqD – Macaque Single as a starting point and was further trained with a dataset featuring paired macaques in the focal cage (Macaque Paired dataset); and MacqD – Macaque Combine, which was trained on a merged dataset combining the Macaque Single and Macaque Paired datasets. MacqD – Macaque Curriculum was used to assess curriculum learning59, a strategy in which models are trained on tasks of gradually increasing complexity, mimicking human learning by starting with simpler concepts and progressing to more difficult ones. In contrast, MacqD – Macaque Combine was exposed to diverse scenarios all at once, saving training time. MacqD models were also compared with BE, but not with the SegNet models, because of the poor results obtained with the latter in Experiment 1 (see section “Experiment 1: Detection of single macaques”). All MacqD models and the SegNet – Macaque Single model were trained for 100 epochs, with the final model selected from the epoch with the minimum validation loss. All final models were tested on the Macaque ‘Different’ dataset and, where applicable, the Macaque ‘Same’ dataset (Fig. 4).

Fig. 4

Overview of Experiments 1 and 2, illustrating how the models compared in this study differed in terms of training and testing datasets. ‘Same’ and ‘Different’ correspond to the datasets described in Table 1, the labels indicating whether the model was tested on videos of individual macaques ‘seen’ during the training phase (‘Same’) or not (‘Different’) (see section “Data description” for more details).

Tracking algorithm (Experiment 3)

In computer vision, tracking algorithms monitor object movement across consecutive video frames by estimating the target object’s positions in subsequent frames, given its initialised position60,61. In Experiment 3, results from the detection models were compared before and after implementing a tracking algorithm to test whether tracking improved performance. The tracking algorithm aims to maintain detection continuity across frames by estimating the location and motion of macaques, thereby reducing instances where the macaque is not detected in subsequent frames.

The centroid (geometric centre) of each detected macaque, whether from a mask produced by MacqD and SegNet or a cluster from BE, was used as input for the Kalman filter62, a mathematical algorithm that predicts an object’s position based on past movements. If a macaque was not detected in a given frame, the Kalman filter estimated its position using the centroid from previous frames while retaining the same bounding box size from the last known detection. This prediction process continued until a match was found between the predicted and detected positions or until 20 frames had passed without a match. To associate predicted positions with current detections, the Hungarian algorithm63 was used. This algorithm matches the predicted position of an object to the closest detected object in the current frame, identifying which detection belongs to which macaque (see Fig. 5).
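A minimal sketch of this tracking step is given below, assuming a constant-velocity Kalman filter per animal and SciPy’s implementation of the Hungarian algorithm; the noise parameters and state layout are illustrative assumptions rather than the values used in this study.

```python
# Illustrative centroid tracking: Kalman prediction plus Hungarian matching.
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

class CentroidTracker:
    def __init__(self, max_missed=20):
        self.max_missed = max_missed   # give up after 20 frames without a match
        self.missed = 0
        # State vector [x, y, vx, vy]; constant-velocity motion model (assumed).
        self.x = np.zeros(4)
        self.P = np.eye(4) * 10.0
        self.F = np.array([[1, 0, 1, 0], [0, 1, 0, 1],
                           [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * 0.01      # process noise (assumed)
        self.R = np.eye(2) * 1.0       # measurement noise (assumed)

    def predict(self):
        """Predict the next centroid position from past movement."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, centroid):
        """Correct the state with a matched detection's centroid."""
        y = np.asarray(centroid, float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        self.missed = 0

def match(predicted, detected):
    """Assign detections to predicted centroids by minimising total distance."""
    cost = np.linalg.norm(predicted[:, None, :] - detected[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))
```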

Fig. 5

Tracking algorithm pipeline.

Performance metrics

Evaluation of the different models was based on bounding boxes in order to standardise comparisons across all models (MacqD and SegNet provide bounding boxes and pixel-based masks but BE only outputs bounding boxes). The Intersection over Union (IoU) metric was used to measure the overlap between predicted and ground-truth boxes, with an IoU of 0.50 or higher considered a true positive (TP) and an IoU below this threshold classified as a false positive (FP).

$$\begin{aligned} IoU = \frac{\text {Ground truth} \cap \text {Detected box}}{\text {Ground truth} \cup \text {Detected box}} = \frac{\text {Area of Overlap}}{\text {Area of Union}} \end{aligned}$$

(1)
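For axis-aligned bounding boxes, Eq. (1) can be computed as in the following sketch (the (x1, y1, x2, y2) box format is an assumption):

```python
# Illustrative IoU computation for axis-aligned boxes in (x1, y1, x2, y2) format.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0

# A detection counts as a true positive when iou(ground_truth, detection) >= 0.50.
```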

Performance was assessed using precision, recall, and the F1 score at the individual level: precision measures how often the model was correct when it identified a portion of the frame as containing an individual macaque, and recall measures the model’s ability not to miss a macaque when one was present in the frame. The F1 score is the harmonic mean of precision and recall; it is high only when both precision and recall are high (reaching 1 only if both are 1) and drops towards 0 when either is low (reaching 0 if either is 0). Macaques in the non-focal cage were ignored.

$$\begin{aligned} Precision = \frac{TP}{TP+FP} \end{aligned}$$

(2)

$$\begin{aligned} Recall = \frac{TP}{TP+FN} \end{aligned}$$

(3)

$$\begin{aligned} F1 score = \frac{2*Precision*Recall}{Precision+Recall} \end{aligned}$$

(4)
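Given TP, FP, and FN counts obtained from the IoU matching, Eqs. (2)–(4) correspond to the following sketch:

```python
# Precision, recall and F1 from detection counts (sketch; counts are assumed to
# come from the IoU-based matching described above).
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```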

MacqD and SegNet models assign a confidence score to each bounding box, indicating the likelihood of correct identification. In Experiment 1, the optimal threshold for filtering out false positives was determined by maximising precision and recall on the validation dataset. Confidence scores from 0.50 to 0.95, in 0.05 increments, were assessed, and median precision was plotted against median recall to identify the optimal threshold (see Supplementary material Fig. S1).
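A hypothetical sketch of this threshold sweep is shown below; the data layout and the evaluate callback are illustrative assumptions.

```python
# For each candidate threshold, keep only detections scoring at least that value
# and record the median precision and recall across validation videos.
import numpy as np

def sweep_thresholds(detections_per_video, evaluate):
    """`evaluate(dets)` is assumed to return (precision, recall) for one video."""
    results = {}
    for t in np.arange(0.50, 1.00, 0.05):
        per_video = [evaluate([d for d in dets if d['score'] >= t])
                     for dets in detections_per_video]
        precisions, recalls = zip(*per_video)
        results[round(t, 2)] = (np.median(precisions), np.median(recalls))
    return results
```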

Statistical test

Precision, recall, and F1 score were used in statistical tests to assess differences in performance. Initial attempts to fit linear mixed-effects models indicated non-normally distributed residuals, violating parametric assumptions. Consequently, non-parametric approaches previously applied to evaluate machine learning methods were used64. The Friedman test compared the performance of multiple models across datasets, while the Wilcoxon signed-rank test compared pairs of models. The Benjamini–Hochberg procedure65 was employed to control the false discovery rate across multiple comparisons. Additionally, Wilcoxon tests were conducted to compare results before and after implementing the tracking algorithm.
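These tests are available in SciPy and statsmodels; a sketch of how they might be combined is shown below, assuming one score per model per video.

```python
# Non-parametric model comparison sketch (illustrative data layout).
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

def compare_models(scores_by_model):
    """scores_by_model: dict mapping model name -> list of per-video scores (e.g. F1)."""
    # Friedman test across all models, paired by video.
    _, p_friedman = friedmanchisquare(*scores_by_model.values())
    # Pairwise Wilcoxon signed-rank tests, corrected with Benjamini-Hochberg.
    names = list(scores_by_model)
    pairs, pvals = [], []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            _, p = wilcoxon(scores_by_model[names[i]], scores_by_model[names[j]])
            pairs.append((names[i], names[j]))
            pvals.append(p)
    rejected, p_adjusted, _, _ = multipletests(pvals, method='fdr_bh')
    return p_friedman, dict(zip(pairs, zip(p_adjusted, rejected)))
```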

Computing environment

MacqD models were implemented with the open-source framework MMDetection version 2.2566, and SegNet models with the open-source pipeline67. We used Python (3.7.13), the CUDA toolkit (10.1) and the GPU-accelerated library cuDNN (7.6.3) with an NVIDIA GeForce GTX 1080 Ti GPU. Python (3.6.13), CUDA (10.2.89) and cuDNN (7.6.5) were used to develop BE.
