Improved food image recognition by leveraging deep learning and data-driven methods with an application to Central Asian Food Scene

One of the main challenges for computer vision (CV) in the food domain is the variability of food images. Environmental and technical factors such as lighting conditions and camera angles, together with differences in cooking styles and ingredients, make food recognition more difficult than many other image tasks. Moreover, fine-grained classification is complicated by intra-class variance: foods within the same category can look very different because of cooking styles and cultural influences. These challenges highlight the complexity of accurately detecting and classifying food items [41,42]. Fig. 5 shows samples illustrating the intra-class variation for six classes of our CAFSD, together with their mAP50 scores.

Fig. 5. Samples of six different classes illustrating the intra-class variation, with their mAP50 scores.

Figure 6a shows how the mAP50 score varies across classes with different numbers of instances, for both the validation and test sets. Because of the complexity of the food domain, classes with low intra-class variation may need fewer samples to reach a given class mAP than classes with high intra-class variation. Image quality and bounding box size also affect the predictions, as seen in Fig. 6b, which shows how the mAP50 score varies with bounding box size for both sets. Generally, larger bounding boxes tend to yield more accurate predictions and higher mAP50 scores, likely because they contain finer features and details.
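As a rough illustration of the kind of analysis behind Fig. 6b, the sketch below computes a bounding box diagonal in pixels from YOLO-style normalized width and height and averages a score within diagonal-size bins. It is a minimal sketch under stated assumptions: the function names, bin edges, image size, and sample values are hypothetical, and a simple per-box score stands in for the per-bin mAP50, whose exact computation is not described in the text.

```python
import math
from collections import defaultdict


def box_diagonal_px(w_norm, h_norm, img_w, img_h):
    """Diagonal length in pixels of a box given normalized width and height."""
    return math.hypot(w_norm * img_w, h_norm * img_h)


def mean_score_by_diagonal(records, bin_edges):
    """Average a per-box score within diagonal-size bins.

    records: iterable of (diagonal_px, score) pairs.
    bin_edges: ascending edges; bin i covers [edges[i], edges[i + 1]).
    Returns {(low, high): mean_score} for non-empty bins only.
    """
    sums = defaultdict(lambda: [0.0, 0])  # bin -> [score total, box count]
    for diag, score in records:
        for low, high in zip(bin_edges, bin_edges[1:]):
            if low <= diag < high:
                sums[(low, high)][0] += score
                sums[(low, high)][1] += 1
                break
    return {b: total / n for b, (total, n) in sums.items()}


# Hypothetical values for illustration only (not taken from CAFSD).
records = [
    (box_diagonal_px(0.10, 0.10, 640, 640), 0.40),
    (box_diagonal_px(0.15, 0.10, 640, 640), 0.50),
    (box_diagonal_px(0.50, 0.50, 640, 640), 0.80),
]
result = mean_score_by_diagonal(records, [0, 200, 400, 1000])
print(result)
```

With these made-up inputs, the two small boxes fall into the first bin and the large box into the last one, so the middle bin is empty and is omitted from the result.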

Fig. 6. Detailed evaluation of the model performance based on the mAP50 score and the dataset parameters: (a) average mAP50 score for classes with different numbers of instances in the test and validation sets; (b) average mAP50 score variation with the bounding boxes' diagonal size for the test and validation sets; (c) average mAP50 score per image for food scenes with different numbers of food items.

Besides intra-class variation, several other factors affect model performance, such as cluttered image backgrounds and a large number of bounding boxes per image. Figure 6c shows that, in general, the mAP50 score decreases as the number of bounding boxes per image increases, in both the validation and test sets.
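For context, mAP50 counts a prediction as correct when it overlaps a ground-truth box of the same class with an intersection over union (IoU) of at least 0.5. The sketch below shows this criterion with a simplified greedy matching, which also hints at why crowded scenes are harder: each ground-truth box can be matched only once, so nearby or duplicate predictions become false positives. The function names and sample boxes and labels are hypothetical, and this is not the exact evaluation code used for the reported results.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_at_iou50(preds, gts, thr=0.5):
    """Greedy matching of predictions to ground truth at a fixed IoU threshold.

    preds: list of (box, class_name, confidence).
    gts: list of (box, class_name).
    Returns one flag per prediction, in descending-confidence order
    (True = true positive, False = false positive).
    """
    used = set()  # indices of ground-truth boxes already matched
    flags = []
    for box, cls, _conf in sorted(preds, key=lambda p: p[2], reverse=True):
        best_iou, best_j = 0.0, None
        for j, (gt_box, gt_cls) in enumerate(gts):
            if j in used or gt_cls != cls:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j is not None and best_iou >= thr:
            used.add(best_j)
            flags.append(True)
        else:
            flags.append(False)
    return flags


# Hypothetical boxes and labels for illustration only.
gts = [((0, 0, 10, 10), "plov"), ((20, 20, 30, 30), "kurt")]
preds = [
    ((1, 1, 10, 10), "plov", 0.9),    # IoU 0.81 with the first box
    ((0, 0, 9, 9), "plov", 0.8),      # duplicate: ground truth already matched
    ((20, 20, 26, 26), "kurt", 0.7),  # IoU 0.36, below the threshold
]
flags = match_at_iou50(preds, gts)
print(flags)
```

In a full evaluation, these flags would be accumulated over all images to build a precision-recall curve per class, and the average precision would then be averaged over classes.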

Overall, the training results on the CAFSD indicate that the models can generalize, localize, and predict across 239 food classes, and they compare favorably with results reported on publicly available datasets. From the parametric experiments performed using our CAFSD, the best results were obtained using the YOLOv8xl model (i.e., mAP50 of 69.9% and 67.7% for the validation and test sets, respectively). As for other detection datasets, the reported macro average accuracy (MAA) on UNIMIB2016, which spans 65 classes, is 56% [28]. Experiments performed on the BTBUFood-60 dataset, which contains 60 food categories, using the VGG16 model resulted in an mAP of 67.7% [29]. For comparison, the parametric experiments on our first classification dataset, CAFD, with 42 classes showed Top-1 and Top-5 accuracies of 88.70% and 98.59%, respectively, using the ResNet152 model [30].

Historically, the Central Asian diet is known for the high consumption of meat dishes and dairy products due to the nomadic lifestyle [43]. Figures 7 and 8 show the distribution of different dishes within these categories.

Fig. 7. Distribution of instances across meat and processed meat products.

Fig. 8. Distribution of instances across dairy products.

Figure 7 shows that the dataset's meat-based classes with the highest number of instances are beef/lamb shashlik (11.9%), chicken shashlik (9.5%), sausages (8.8%), and fried beef/lamb (8.4%). Beef and lamb are grouped because these types of red meat are difficult to distinguish visually. Notably, kazy-karta, a national dish based on horse meat, represents 6.9% of these instances. This distribution mirrors national statistics from the Bureau of National Statistics, which indicate that beef is the most consumed meat in Kazakhstan, with an average of 24.68 kg per capita in 2023 [44]. The per capita consumption of horse meat is 6.74 kg. Lamb and chicken are also popular, with per capita consumption of 5.29 kg and 5.04 kg, respectively. Additionally, sausages are consumed at 2.12 kg per capita, while minced meat products have a per capita consumption of 7.07 kg.

High meat consumption in Central Asia is influenced by cultural preferences and local availability. The average meat consumption in Central Asia ranges from 50 to 70 kg per capita annually, with an average intake of 124.76 g/day, among the highest globally. According to the World Population Review, Kazakhstan has the highest per capita lamb consumption worldwide, averaging 8.5 kg per person annually. Recent studies suggest that shifting from a diet high in animal-based proteins to one higher in plant-based proteins may help to reduce risk factors associated with cardiovascular diseases and overall mortality [43].

As for dairy products, Fig. 8 shows the distribution of image instances in the dataset by dairy product type. The most frequently represented dairy products are smetana (18.4%), kurt (15.7%), and kymyz-kymyryan (15.6%), followed by cheese (14.4%), irimshik (9.2%), butter (8.1%), suzbe (6.7%), airan-katyk (5.6%), milk (4.9%), and condensed milk (1.5%). According to the Bureau of National Statistics, per capita dairy product consumption in Kazakhstan was 227.2 kg in 2023 [44]. This includes, among other products, 14.889 liters of raw milk, 0.230 liters of concentrated milk without sugar, 13.166 liters of airan, 3.818 liters of smetana, 2.198 kg of irimshik, 0.581 kg of processed cheese, 4.224 kg of suzbe, and 1.084 kg of butter [44].
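As a quick consistency check on the dairy shares reported for Fig. 8, the short sketch below sums the ten percentages and confirms that they add up to 100% within rounding. The share values are taken from the text above; the variable names and the tolerance rule are assumptions made for illustration.

```python
# Instance shares (%) by dairy product type, as reported for Fig. 8.
dairy_shares = {
    "smetana": 18.4,
    "kurt": 15.7,
    "kymyz-kymyryan": 15.6,
    "cheese": 14.4,
    "irimshik": 9.2,
    "butter": 8.1,
    "suzbe": 6.7,
    "airan-katyk": 5.6,
    "milk": 4.9,
    "condensed milk": 1.5,
}

total = sum(dairy_shares.values())

# Each share is rounded to one decimal place, so every class can contribute
# up to 0.05 percentage points of rounding error to the total.
tolerance = 0.05 * len(dairy_shares)

print(f"total = {total:.1f}%")
assert abs(total - 100.0) <= tolerance
```

The shares sum to 100.1%, so the small excess over 100% is explained by rounding alone.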

Kazakhstan and Central Asia have a rich tradition of dairy consumption, with beverages like kumys and airan being particularly popular, especially in rural areas. Kumys, made from fermented mare's milk, is known for its distinct taste and slight alcohol content, which results from lactic acid and alcoholic fermentation. It is a nutritious, protein- and vitamin-rich drink that may also offer probiotic benefits [45,46]. Similarly, airan, a fermented milk drink made from mixed goat, cow, and sheep milk, is a daily staple known for its probiotics and its easily digestible fatty acids and amino acids, which promote digestive health [45,46].

Dairy products such as kurt, smetana, irimshik, and suzbe are also common in local cuisine. Kurt, a hard cheese made from dried sour milk, is a concentrated source of protein and calcium and is often consumed as a snack or with tea. Smetana, which is similar to sour cream, is widely used in various dishes and contributes fats and vitamins A and D. Irimshik, or dried curd, has a sweet, baked-milk flavor and a long shelf life, and provides protein and minerals. Suzbe, a type of curd made by straining sour milk, is prepared seasonally and used in soups or consumed with milk or water, offering fats and proteins [46].

These traditional dairy products are integral to local diets, carrying cultural significance and providing essential nutrients that support health and well-being. Their consumption reflects both their importance in Central Asian culinary practices and their nutritional benefits [45,46].

To conclude, CAFSD makes a valuable contribution to computer vision-assisted food and dietary tracking applications, which can also be used in settings such as smart restaurants and supermarkets. It could play a significant role in advancing local CV-based food-tracking applications, with the potential to enhance nutrition literacy, increase dietary awareness, and promote healthier food choices. These advancements could in turn affect agriculture, the environment, and the food system in the region.

The performance of the trained food object detection models on our dataset demonstrates the effectiveness of the YOLOv8xl model, compared with smaller models of the YOLOv8 family, in accurately retrieving food-related information. As next steps, we plan to integrate the model into a smartphone application and to develop a dataset with macronutritional values and corresponding prediction models for more detailed guidance. A follow-up project will extend the dataset composition and the number of classes by integrating regional food composition databases that encompass local foods, dishes, and their nutritional profiles. This step ensures our research is aligned with precise, region-specific data to support evidence-based decision-making for public health policies tailored to the target population. Furthermore, a comprehensive codebook will be developed to systematically link food images to their nutritional labels, categorizing visual data by food type, portion size, and preparation method, with each image cross-referenced to nutritional data, including macronutrient and micronutrient profiles. This integration bridges the gap between visual representations of food and their corresponding nutritional values, making the dataset a valuable resource for dietary analysis and research, particularly in nutrition-related interventions.
