Anatomy of climate change research in Italian doctoral dissertations using a machine learning approach

In Table 1, we compare the performance of several supervised learning models for classifying climate change-related doctoral dissertations in both English and Italian. The models include LinearSVC, Logistic Regression, GBM, Random Forest, Naïve Bayes, and MLP, each evaluated with and without the ROSE balancing technique. Accuracy metrics for each model are presented in the table. Given the variability in linguistic structure and content richness between English and Italian dissertations, we aim to assess how different models generalize across languages and whether oversampling influences classification accuracy.

Table 1 Classification accuracy of machine learning models for English and Italian dissertations on climate change.

LinearSVC (both weighted and with ROSE) is the most consistent performer across the two languages, achieving the highest accuracy for Italian dissertations (0.97) and a strong score for English dissertations (0.95). Logistic Regression performs similarly, with accuracy scores lower by just 0.01 (rounded up to 0.97), highlighting its robustness as a baseline model. These results indicate that LinearSVC effectively captures key discriminative features in both languages, making it a reliable model for climate change-related dissertation classification overall. For English dissertations specifically, however, Random Forest is the best-performing model, with an accuracy of 0.96, although its performance on Italian dissertations is lower (0.89).
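The class-balancing and accuracy steps behind this comparison can be illustrated with a minimal sketch. It uses plain random oversampling as a simple stand-in for ROSE, which generates new synthetic examples rather than duplicating existing ones, so it is not the procedure used in the study. The function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class examples until every class matches the largest one.

    Stand-in for ROSE (assumption): plain duplication, no synthetic examples.
    """
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing examples at random from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy data: 1 = climate change-related, 0 = not related.
texts = ["climate adaptation", "roman law", "medieval poetry", "protein folding"]
labels = [1, 0, 0, 0]
balanced_texts, balanced_labels = random_oversample(texts, labels)
class_counts = Counter(balanced_labels)
```

Balancing should be applied to the training split only, so that accuracy is measured on data with the original class proportions.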

Using the optimized Random Forest model for English dissertations and LinearSVC for Italian dissertations, we applied the classification out-of-sample to the entire corpus of Italian doctoral dissertations, systematically mapping climate change research across disciplines and geographical areas. As a result, we identified a total of 1,178 PhD dissertations written in English addressing climate change (11.1% of English-language dissertations), compared to 318 in Italian (18.1% of Italian-language dissertations). We ran topic modelling separately for the two languages, and sentiment analysis only for English-language dissertations, because pretrained sentiment dictionaries were available for English.

Having identified the climate change-related doctoral dissertations with our classification models, we explored how their disciplinary patterns have evolved over time (Fig. 1). We analyzed the relative distribution of the ERC (European Research Council) domains, namely Social Sciences and Humanities (SSH), Physical Sciences and Engineering (PE), and Life Sciences (LS), from 2008 to 2021. Figure 1 presents the annual percentage shares of these three domains, separately for dissertations written in Italian and English, along with an aggregated profile.
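The annual percentage shares described above amount to a per-year normalization of domain counts. The following sketch shows one way to compute them; the record layout (year, domain) and the toy values are assumptions for illustration, not data from the corpus.

```python
from collections import Counter, defaultdict

def annual_shares(records):
    """Map year -> {domain: percentage of that year's dissertations}."""
    counts = defaultdict(Counter)
    for year, domain in records:
        counts[year][domain] += 1
    shares = {}
    for year, domain_counts in counts.items():
        total = sum(domain_counts.values())
        shares[year] = {d: 100 * n / total for d, n in domain_counts.items()}
    return shares

# Hypothetical records: one (year, ERC domain) pair per dissertation.
records = [
    (2008, "SSH"), (2008, "PE"), (2008, "PE"), (2008, "LS"),
    (2009, "SSH"), (2009, "SSH"),
]
shares = annual_shares(records)
```

Running the same function on the Italian subset, the English subset, and the pooled records would give the three profiles plotted in the figure.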

Fig. 1 Annual distribution of climate change-related dissertations by ERC domain and language (2008–2021).

The analysis reveals a marked prevalence of SSH in Italian-language dissertations, where Social Sciences and Humanities consistently account for a larger share of climate change-related research compared to PE and LS. In contrast, English-language dissertations show a clear predominance of Physical Sciences and Engineering (PE), followed by Life Sciences (LS), while SSH is markedly underrepresented. These findings suggest that language is not merely a medium of communication, but also a structural dimension of how climate change research is framed in Italian doctoral education. Italian-language dissertations appear to engage more with climate change through social, political, and cultural lenses, possibly reflecting nationally oriented research agendas, while English-language dissertations are more rooted in technical and scientific approaches, likely oriented toward international academic audiences and journals.

Italian language dissertations

Figure 2 presents the frequency distribution of the most common unigrams, bigrams, and trigrams identified in the corpus of PhD dissertations related to climate change. The layout of the figure is structured to allow a hierarchical interpretation of lexical patterns, with single-word terms (1-grams) positioned at the top left, two-word combinations (2-grams) at the top right, and three-word sequences (3-grams) centred below. These keywords provide a useful representation of linguistic structures in the text, as they highlight recurrent patterns in the corpus.

Fig. 2 Top N-grams Frequency Distribution in Italian dissertations.

The most frequently occurring unigrams include ambientale (environmental), urbano (urban), processo (process), and energetico (energy-related). This highlights a strong emphasis on environmental and urban sustainability, as well as on processes related to energy systems and economic aspects (economico). The presence of words such as strumento (tool), valutazione (evaluation), and politico (political) suggests a methodological and governance-related component in climate change research. The bigrams provide a more contextualized understanding of key research themes. The most frequent phrase, cambiamento climatico (climate change), confirms the centrality of the topic. Other high-frequency bigrams include policy-related terms such as processo decisionale (decision-making process) and sostenibilità ambientale (environmental sustainability), along with technical aspects like efficienza energetica (energy efficiency) and consumo suolo (land consumption). The presence of pinus nigra and spazio urbano (urban space) suggests a research focus on ecosystem and urban studies. The trigram distribution further refines these themes, showing more specialized phrases such as emissioni gas serra (greenhouse gas emissions) and adattamento cambiamento climatico (climate change adaptation), indicating a research focus on both mitigation and adaptation strategies. Additionally, the keyword sistema edificio impianto (building system) suggests applications in architecture and urban planning.
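The n-gram counts discussed above can be sketched in a few lines. This assumes that the keywords have already been lower-cased, lemmatized and stripped of stop words, which is consistent with trigrams such as adattamento cambiamento climatico but is not stated explicitly; the helper names and the toy documents are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(documents, n, k=3):
    """Count n-grams across tokenized documents and return the k most frequent."""
    counts = Counter()
    for tokens in documents:
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)

# Hypothetical token lists (assumed preprocessed), one per dissertation.
docs = [
    ["adattamento", "cambiamento", "climatico", "urbano"],
    ["cambiamento", "climatico", "efficienza", "energetica"],
    ["emissioni", "gas", "serra", "cambiamento", "climatico"],
]
top_bigrams = top_ngrams(docs, 2, k=1)
top_trigrams = top_ngrams(docs, 3, k=1)
```

N-grams are counted within each document separately, so no sequence spans the boundary between two dissertations.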

The lexical patterns revealed in this analysis show the interdisciplinary nature of climate change research, spanning from energy transition and environmental impact assessment to urban studies and governance. The presence of both mitigation-related (e.g., efficienza energetica, sostenibilità ambientale) and adaptation-focused (e.g., adattamento cambiamento climatico) terms highlights the dual approach that characterizes climate-related research. Moreover, the strong occurrence of decision-making and evaluation-related terminology suggests an interest in evidence-based policy frameworks.

Figure 3 shows the yearly trends in log-scaled relative frequencies of four selected climate-related keywords in Italian PhD abstracts (2008–2022): Ambientale (Environmental), Urbano (Urban), Cambiamento Climatico (Climate Change) and Emissione Gas Serra (Greenhouse Gas Emission). Values are normalized by the total keyword frequency in each year.
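As a rough illustration of this normalization, the sketch below divides each keyword's yearly count by the total keyword count of that year and then takes the logarithm. The base-10 logarithm, the handling of years in which a term is absent, and the toy counts are assumptions; the study does not specify them.

```python
import math

def log_relative_frequency(counts_by_year, term):
    """Per year: log10 of the term's count divided by the year's total keyword count."""
    series = {}
    for year, counts in sorted(counts_by_year.items()):
        total = sum(counts.values())
        count = counts.get(term, 0)
        # log10 is undefined at zero, so years without the term are reported as None.
        series[year] = math.log10(count / total) if count and total else None
    return series

# Hypothetical keyword counts, not taken from the corpus.
counts_by_year = {
    2008: {"ambientale": 10, "urbano": 30, "cambiamento climatico": 60},
    2009: {"ambientale": 1, "urbano": 9},
}
series = log_relative_frequency(counts_by_year, "ambientale")
missing = log_relative_frequency(counts_by_year, "cambiamento climatico")
```

Normalizing by the yearly total makes years with different numbers of dissertations comparable, and the log scale keeps rare and frequent terms readable on the same axis.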

Fig. 3 Temporal Trends of Key N-grams in Italian dissertations.

The term Ambientale (Environmental) has remained consistently present across the years, with minor fluctuations. This suggests that environmental concerns have been a stable component of Italian climate-related PhD research, with no drastic peaks or declines in attention. The term Urbano (Urban) exhibits a fluctuating but consistently high presence, indicating a sustained focus on urban sustainability, planning, and climate resilience. The curve for Cambiamento Climatico (Climate Change) shows a clear upward trend, with notable peaks around 2012, 2015, 2018 and 2021. This suggests an increasing recognition of climate change as a major research topic, potentially aligned with key international climate agreements (e.g., the Paris Agreement in 2015). The trajectory of Emissione Gas Serra (Greenhouse Gas Emission) reveals alternating periods of high and low emphasis, with particularly low values observed in 2012, 2016 and 2019.

Topic modelling

Topic analysis was conducted using two complementary approaches: a model based on Bidirectional Encoder Representations from Transformers (BERT) and K-means clustering applied to document embeddings. Specifically, we used ClimateBERT, a domain-specific language model fine-tuned on climate-related texts [28], to extract contextual semantic representations of PhD dissertation abstracts. This approach enables the automatic identification and structuring of emerging climate-related topics.

In parallel, we applied the K-means clustering algorithm to the document embeddings to group abstracts by semantic similarity. Figure 4 shows the distribution of topics identified by BERT, where each point corresponds to a document and the clustering is determined by semantic similarity. The algorithm detects latent structures in the data and assigns clusters automatically, without the need for predefined categories, which yields a data-driven categorization of climate change doctoral dissertations in Italian.

Fig. 4 Topic Clustering using BERT Embeddings.

The visualization reveals distinct thematic areas emerging from the data. The cluster labels were initially in Italian and then translated into English for visualization purposes. In the upper left region, a cluster in pink represents research focused on climate change and biodiversity, covering topics such as environmental sustainability, biodiversity conservation and the broader impacts of climate change. On the right side of the graph, the blue cluster highlights studies related to energy, buildings and thermal systems, emphasizing advancements in energy efficiency, innovative building technologies and the optimization of thermal infrastructures. At the centre of the graph, a light blue cluster brings together research on food sustainability and ecological design, reflecting discussions on sustainable food production, green infrastructure and the integration of ecological principles in design practices. Meanwhile, in the bottom left section, a red cluster captures work on urban planning, landscape and urban development, focusing on spatial planning strategies, landscape architecture and approaches to enhancing urban sustainability. The spatial arrangement of these clusters suggests meaningful relationships among the topics, with some areas positioned closer together, indicating thematic overlaps and interdisciplinary connections. In particular, on the y-axis, we can observe the distance between more generic dimensions of climate change (Climate Change and Biodiversity) compared to applied themes of research related to urban landscapes (Urban Planning, Landscape, and City Design). Between these and the topic related to buildings and infrastructures (Energy, Buildings, and Thermal Systems), a cluster of documents is positioned at the intersection, focusing on the sustainable design of infrastructures (Sustainable Food and Ecological Design).

Table 2 presents the results of K-means clustering applied to Italian dissertations. To select the number of clusters, we used a perplexity-inspired heuristic adapted to the elbow method, which identified six as the optimal number of topics (Fig. 15). This choice yields a well-balanced thematic categorization without excessive fragmentation or overlap. Each topic, as summarized in Table 2, is characterized by a set of representative keywords that highlight its thematic focus. The representative keywords are identified by analysing the most dominant terms within each cluster and further refined through feature importance analysis (using Random Forest) to ensure they are both frequent and thematically discriminative. The labels were assigned by an expert, taking into account the most important keywords identified in the text.
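The idea of choosing the number of clusters at an elbow can be sketched as follows. This is a simplified illustration and not the perplexity-inspired heuristic used in the study: it scores each candidate k by the within-cluster sum of squared distances and picks the k where the improvement drops off most sharply. The deterministic farthest-point initialization, the function names and the two-dimensional toy points (standing in for document embeddings) are all assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_inertia(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    # Deterministic farthest-point initialization (assumption, chosen for reproducibility).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))

def elbow(inertias):
    """inertias[i] belongs to k = i + 1; return the k with the sharpest bend."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(inertias) - 1):
        bend = (inertias[i - 1] - inertias[i]) - (inertias[i] - inertias[i + 1])
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k

# Hypothetical 2-D "embeddings": three well-separated groups of four points.
points = [(x + dx, y + dy) for x, y in [(0, 0), (10, 0), (0, 10)]
          for dx in (0, 1) for dy in (0, 1)]
inertias = [kmeans_inertia(points, k) for k in range(1, 5)]
best_k = elbow(inertias)
```

On real embeddings the bend is rarely this clean, which is why the selected number of topics is usually also checked against the interpretability of the resulting clusters.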

Fig. 5 Distribution of abstracts across Topics identified by K-Means Clustering.

Table 2 Topic classification using K-Means Clustering.

One of the main themes that emerged is related to the environment, biodiversity and land use (Topic 0). This category includes discussions on ecosystem management and the impact of human activities on natural resources, as suggested by keywords such as cambiamento climatico, biodiversità, uso suolo and vegetazione. Another topic (Topic 1) revolves around sustainability, environmental management and the green economy, which captures studies on sustainable practices, policy frameworks and eco-friendly decision-making processes. A separate cluster (Topic 2) focuses on industrial thermal processes and energy optimization, where research explores energy efficiency and thermal process improvements in industrial contexts. Keywords such as efficienza energetica, consumo energetico, scudo termico, and flusso termico suggest a strong focus on heat transfer, energy conservation and carbon footprint reduction. Similarly, another topic (Topic 3) is centred on environmental assessment, water and soil quality and chemical analysis, highlighting research related to pollution monitoring, chemical analysis and risk evaluation in environmental systems. Urban and rural development, spatial planning, and governance represents another key area (Topic 4), covering research on urbanization, spatial organization and governance mechanisms. The discussion in this cluster addresses urban sustainability, rural development, and policy interventions, with keywords including urbano, paesaggio, città, sviluppo rurale, and spazio pubblico. Lastly, a cluster (Topic 5) dedicated to sustainable building design and energy performance captures studies on green construction technologies and energy-efficient building practices. This topic encompasses discussions on sustainable architecture, energy planning and the use of innovative materials.

The comparison between BERT-based topic modelling and K-Means clustering reveals that BERT tends to generate broader, semantically rich clusters, whereas K-Means is more effective in identifying specialized subcategories, providing a finer-grained segmentation of the dataset. This difference might be partially influenced by the number of clusters: in our case, BERTopic produced four clusters, while K-Means identified six. One of the most evident similarities is the identification of topics related to climate change, biodiversity and land use, which appear consistently in both models. The BERT-based approach groups these elements under a general category, highlighting the relationship between climate change and biodiversity conservation, while K-Means isolates land use aspects more explicitly, focusing on specific environmental and ecological dynamics. A similar alignment is observed in the classification of urban and rural development, spatial planning, and governance, where both methods recognize the importance of urban sustainability, governance structures and territorial policies. However, divergences emerge in how each method structures topics related to energy, sustainability and environmental management. While K-Means clustering differentiates between industrial thermal processes and energy optimization as a distinct category, BERT integrates energy topics into a broader cluster that includes buildings, thermal systems and sustainable construction. This indicates that BERT tends to form clusters that encompass multiple related concepts under a larger thematic umbrella, whereas K-Means is more inclined to separate topics into finer-grained subdomains. Another key difference is in the treatment of sustainability and environmental management. K-Means groups these aspects under a general sustainability and green economy category, emphasizing decision-making processes, certifications and impact assessments. 
In contrast, BERT identifies a more specific cluster focusing on sustainable food and ecological design, demonstrating its ability to recognize conceptual themes that might be embedded within broader sustainability discussions. Further discrepancies arise in the categorization of environmental assessment, water and soil quality and chemical analysis. K-Means clustering identifies this as a distinct topic, highlighting research related to pollution monitoring, water resource evaluation and chemical risk assessment. BERT, by comparison, does not generate a separate cluster for these studies but incorporates environmental considerations into broader sustainability and urban planning discussions. This suggests that BERT prioritizes contextual relationships between research topics, while K-Means is more attuned to specific thematic divisions based on textual similarity.

The bar chart (Fig. 5) shows the distribution of abstracts across the topics identified by the K-Means algorithm.

Fig. 6 Topic Distribution across Scientific Sectors (SSD) based on K-Means Clustering.

Topic 4 (Urban and Rural Development, Spatial Planning, and Governance) has the highest number of abstracts, indicating strong research interest in urban sustainability and governance. Conversely, Topics 0 and 5 have the fewest abstracts, suggesting that studies on environment, biodiversity and land use, as well as sustainable building design and energy performance are less represented in the dataset.

Figure 6 illustrates the percentage distribution of the topics across Scientific Disciplinary Sectors (SSD).

Fig. 7 Topic Distribution across Geographical Areas based on K-Means Clustering.

Topic 0 (Environment, Biodiversity and Land Use) is dominant in biological sciences (BIO/05, BIO/07), architectural and urban design (ICAR/14) and law (IUS/13), reflecting the focus of these fields on biodiversity conservation and land-use policies. Topic 1 (Sustainability, Environmental Management and Green Economy) appears strongly represented in economics and policy-related disciplines (SECS-P/07). Topic 2 (Industrial Thermal Processes and Energy Optimization) is more prevalent in chemistry (CHIM/02). Topic 3 (Environmental Assessment, Water and Soil Quality, and Chemical Analysis) is well represented in chemistry (CHIM/06), biology (BIO/07) and geosciences (GEO/05). Topic 4 (Urban and Rural Development, Spatial Planning, and Governance) is mostly found in architecture and planning disciplines (ICAR/12, ICAR/21). Finally, Topic 5 (Sustainable Building Design and Energy Performance) is strongly linked to engineering and architecture (ING-IND/11, ICAR/14), particularly in sectors related to infrastructure development, energy-efficient building design and sustainable construction practices.

Figure 7 shows the distribution of topics for Italian-language dissertations across the geographical areas of university affiliation in Italy.

Fig. 8 Top N-grams Frequency Distribution in English dissertations.

Topic 0 (Environment, Biodiversity, and Land Use) is prominently represented in all regions except the South, where it is nearly absent. This suggests that research on biodiversity conservation and land-use policies is concentrated in the Centre, Northeast and Northwest. Topic 2 (Industrial Thermal Processes and Energy Optimization) appears in the Northwest, reflecting a strong focus on energy efficiency and industrial applications in this macro region. Similarly, Topic 3 (Environmental Assessment, Water and Soil Quality, and Chemical Analysis) is well represented in the Centre and Northwest, highlighting a focus on environmental monitoring and resource management in these areas. A significant share of dissertations in the South and Northwest is dedicated to Topic 4 (Urban and Rural Development, Spatial Planning and Governance), suggesting that studies on urban sustainability, governance and spatial planning are more prevalent in these regions. Topic 5 (Sustainable Building Design and Energy Performance) is mainly addressed in the Centre and Northeast of Italy.

English language dissertations

Figure 8 presents the frequency distribution of the most common unigrams, bigrams, and trigrams extracted from the keywords associated with PhD dissertations written in English on climate change.

Fig. 9 Temporal Trends of Key N-grams in English dissertations.

The most frequently occurring unigrams include energy, process, change, water and plant. This suggests a strong focus on energy systems, environmental changes and resource management. Terms like production, effect and species indicate an interest in ecological dynamics and industrial processes, while development and application highlight technological and applied aspects of climate change research. The bigrams provide a deeper contextual understanding. The most frequent phrase, climate change, confirms the core focus of the research. Other notable bigrams include policy-related expressions such as environmental impact, long term and point view, alongside technical and industrial aspects like renewable energy, fuel cell and energy consumption. The presence of study area and supply chain suggests an interest in regional analysis and industrial sustainability. For the trigram distribution, key expressions such as greenhouse gas emission and impact climate change indicate a focus on climate mitigation strategies, while renewable energy source and energy efficiency measure emphasize technological solutions for sustainability. The presence of wastewater treatment plant and municipal solid waste reflects research on urban sustainability and waste management. Additionally, decision make process and climate change adaptation suggest an interest in governance, decision-making, and adaptation strategies.

Figure 9 illustrates the temporal evolution of the relative frequency of four key terms in English-language PhD dissertations from 2008 to 2020: Energy, Water, Climate Change, and Greenhouse Gas Emission. The y-axis, on a logarithmic scale, indicates the frequency of each term normalized by the total keyword occurrences in the respective year.

Fig. 10 Topic Clustering using BERT Embeddings.

The term Energy shows relatively minor fluctuations over time, suggesting a continuous and sustained focus on energy-related research. This aligns with the long-standing importance of energy systems in addressing climate challenges. The trajectory of Water exhibits a sharp increase between 2008 and 2010, followed by a period of relative stabilization. This initial growth may reflect increasing awareness of climate-induced hydrological changes, water resource management and drought resilience. The Climate Change curve shows a trend with cyclical fluctuations. Notable peaks around 2008, 2014, 2016 and 2019 may correspond to major international climate summits.

The trend for Greenhouse Gas Emission exhibits periodic peaks and declines, with peaks in 2008 and 2016.

Figure 10 presents the topic clustering analysis based on BERT embeddings, illustrating the semantic relationships among different themes in climate change.

Fig. 11 Distribution of abstracts across Topics identified by K-Means Clustering.

Figure 10 shows the results of BERTopic applied to climate-related research. In the top-right region, the corporate innovation and sustainability strategies cluster (cyan) emerges, reflecting research on business approaches to sustainability. Slightly below, the smart grids and renewable energy systems cluster (yellow) groups studies on energy transition and sustainable power generation. On the leftmost side, the climate risk assessment and disaster management topic (blue) is positioned, representing research on assessing and mitigating climate-related hazards.

Further down on the left, the air pollution and atmospheric monitoring cluster (green) appears, covering studies related to air quality assessment and environmental impact monitoring. Close to this, the advanced materials for clean energy and decarbonization cluster (orange) is present, focusing on innovative materials designed to enhance energy efficiency and reduce carbon emissions.

In the central-lower section, the biodiversity conservation and climate change impacts cluster (purple) is identified, gathering research on ecosystem resilience. In the bottom-left region, the sustainable transport and engine emissions reduction cluster (grey) groups documents addressing strategies to minimize the environmental impact of transportation through technological advancements.

Finally, on the right side, the sustainable building and energy efficiency cluster (brown) is located, representing research on green architecture, energy-efficient construction, and sustainable urban planning.

Table 3 shows the results of K-means clustering applied to English dissertations. The perplexity analysis indicates that the optimal number of topics for this dataset is eight (Fig. 16). The most prevalent terms in each cluster are examined to determine the representative keywords, which are then further refined through feature importance analysis (using Random Forest) to ensure they are both frequent and thematically discriminative. The labels were assigned by an expert, taking into account the most important keywords.

Fig. 12 Topic Distribution across Scientific Sectors (SSD) based on K-Means Clustering.

Table 3 Topic classification using K-Means Clustering.

The topic modelling analysis identifies eight distinct research areas. The first topic (Topic 0) is Ecology and Plant Sciences, which includes studies on plant growth, species interactions and microbial communities. Closely related to this field, Agriculture and Food Production (Topic 1) emerges as another key area, focusing on agricultural management, food production and livestock farming. The dataset also highlights research in Microbiology and Biodiversity (Topic 2), which explores microbial ecosystems, species adaptation and the role of microorganisms in environmental processes. Keywords such as bacterial community, microbial community, planktonic foraminifer, cold adapt and alien species point to investigations into biodiversity at a microscopic level and its implications for ecosystem stability. Another critical research area is Climate Policy and Regulations (Topic 3), which captures discussions on environmental governance, sustainability policies and legal frameworks. The presence of terms like climate change, sustainability, environmental social, life cycle, policy, and competition law suggests a strong focus on regulatory approaches to climate mitigation and the intersection between environmental policies and socio-economic factors. Water resource management and hydrological studies are also well represented in Hydrology, Water Resources and Population Studies (Topic 4), where research examines the impact of climate change on water systems and population dynamics.

Energy-related topics are divided into two distinct clusters. Energy Performance and Management (Topic 5) addresses energy efficiency, consumption patterns and sustainable resource use. The presence of keywords such as energy consumption, energy performance, heat pump and water use highlights efforts to optimize energy efficiency and implement long-term sustainable solutions. Meanwhile, Energy, Building Design, and Infrastructure (Topic 6) focuses on the integration of energy-efficient technologies within urban planning and building construction. Terms such as building energy, residential building, smart grid, and synthetic polymer suggest an emphasis on sustainable architecture, smart infrastructure, and energy-efficient design. Finally, Biodiversity and Cryosphere (Topic 7) represents a specialized research area dedicated to species distribution, genetic variation and the impact of climate change on polar and glacial environments.

In K-Means topic modelling, climate change is distributed across multiple categories, including Climate Policy and Regulations, Hydrology and Water Resources, and Biodiversity and Cryosphere. In contrast, BERT clusters climate-related discussions into broader thematic areas, such as Climate Risk Assessment and Disaster Management, highlighting its ability to capture semantic relationships within a single cluster. Energy-related topics also differ in their segmentation. K-Means separates Energy Performance and Management from Energy, Building Design and Infrastructure, making a clear distinction between energy efficiency in systems and energy use in construction. BERT, however, combines these aspects into Smart Grids and Renewable Energy Systems and Sustainable Building and Energy Efficiency, suggesting a stronger emphasis on technological solutions and energy integration. The treatment of biodiversity and ecology also varies. K-Means distinguishes Ecology and Plant Sciences from Microbiology and Biodiversity, identifying microbiology as a separate topic. Conversely, BERT merges biodiversity aspects into Biodiversity Conservation and Climate Change Impacts, integrating plant, microbial, and species-level studies within a broader ecological framework. Additionally, corporate sustainability and innovation is more explicit in BERT, with a dedicated cluster (Corporate Innovation and Sustainability Strategies), whereas K-Means does not create a distinct topic for corporate strategies. Air quality and pollution studies are captured differently as well. K-Means clusters these topics under Hydrology, Water Resources and Population Studies, linking water resources with climate impacts. BERT, on the other hand, forms a separate category for Air Pollution and Atmospheric Monitoring, indicating a finer distinction between different environmental monitoring approaches. 
Finally, the transport and emissions reduction topic is more explicitly recognized in BERT (Sustainable Transport and Engine Emissions Reduction), whereas K-Means does not form a distinct category for transportation.

The bar chart (Fig. 11) illustrates the distribution of abstracts across the identified research topics.

Fig. 13 Topic Distribution across Geographical Areas based on K-Means Clustering.

Energy, Performance and Management (Topic 5) has the highest number of abstracts, indicating a strong focus on energy efficiency, consumption patterns and sustainable resource use. This suggests a significant research interest in optimizing energy consumption and developing long-term sustainable solutions. Similarly, Agriculture and Food Production (Topic 1) also has a high representation, reflecting the importance of sustainable agricultural practices and food systems in environmental research. Other well-represented topics include Ecology and Plant Sciences (Topic 0) and Energy, Building Design, and Infrastructure (Topic 6). In contrast, Microbiology and Biodiversity (Topic 2) and Climate Policy and Regulations (Topic 3) are less represented, suggesting that research in these areas is more specialized or less frequently addressed.

Figure 12 illustrates the distribution of topics across Scientific Disciplinary Sectors (SSD). Each bar represents a specific SSD, with different colours indicating the percentage of abstracts associated with each topic.

Fig. 14 Sentiment Analysis using ClimateBERT.

The agricultural sciences (AGR/02, AGR/07, AGR/09, AGR/11, AGR/12, AGR/14, AGR/18) exhibit a high diversity of topics, with topics 0, 3, and 5 (Ecology and Plant Sciences; Climate Policy and Regulations; and Energy, Performance, and Management) being the most represented. In biological sciences (BIO/01, BIO/03, BIO/05, BIO/07), Topic 0 is also dominant, reinforcing the focus on ecological and plant-related aspects of climate change. However, these disciplines also show a notable presence of Topic 1 (Agriculture and Food Production), Topic 2 (Microbiology and Biodiversity), and Topic 4 (Hydrology, Water Resources and Population Studies). Chemistry-related SSDs (CHIM/01, CHIM/05, CHIM/06) are primarily associated with Topic 5 (Energy, Performance and Management) and Topic 7 (Biodiversity and Cryosphere), suggesting a focus on energy systems optimization, material performance under climate stress, and the chemical processes related to biodiversity conservation and cryospheric dynamics. In geosciences (GEO/04, GEO/05, GEO/08), there is a strong representation of Topic 6 (Energy, Building Design, and Infrastructure), highlighting the discipline’s contribution to the assessment of geophysical factors in energy and construction planning. Architecture and urban planning disciplines (ICAR/01, ICAR/02, ICAR/08, ICAR/09, ICAR/20, ICAR/21) are primarily associated with Topic 5 (Energy, Performance and Management) and Topic 6 (Energy, Building Design, and Infrastructure), reinforcing their engagement in the analysis of energy consumption and the design of energy-efficient infrastructure. The presence of Topic 6 in these disciplines suggests an increasing interest in integrating energy efficiency principles into architectural and spatial planning. 
The engineering fields (ING-IND/08, ING-IND/09, ING-IND/11, ING-IND/17, ING-IND/22, ING-IND/24, ING-IND/25, ING-IND/26, ING-IND/27, ING-IND/33, ING-IND/34, ING-IND/35, ING-INF/05) display a strong representation of Topic 1 (Agriculture and Food Production), Topic 5 (Energy, Performance and Management) and Topic 7 (Biodiversity and Cryosphere), reflecting their multifaceted role in developing technological solutions for sustainable agriculture, energy systems, and environmental monitoring. IUS/04 (Business Law) and MED/42 (Hygiene and Public Health) focus on Topic 3 (Climate Policy and Regulations), addressing subjects such as life cycle assessment and human rights from both legal and public health perspectives. Finally, economics and social sciences (SECS-P/06, SECS-P/07, SECS-P/08, SECS-S/06, SPS/01, SPS/04) show a stronger presence of Topic 1 (Agriculture and Food Production), Topic 3 (Climate Policy and Regulations), and Topic 6 (Energy, Building Design, and Infrastructure), suggesting an interest in the economic and social implications of environmental policies.

Fig. 13 illustrates the distribution of topics in English-language dissertations across the macro geographical areas of Italian universities.

Fig. 15

Perplexity Analysis for optimal topic selection in Italian dissertations using K-Means. The best choice for the number of topics appears to be 6, as increasing beyond this range results in rapidly increasing perplexity, which suggests a decline in model quality. At 6 topics, the slope of perplexity starts changing, marking the transition point between coherence and unnecessary complexity.
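The "change of slope" criterion described in this caption can be sketched as follows. This is only an illustrative heuristic, assuming perplexity has already been computed for each candidate number of topics; the function name `elbow_by_second_difference` and the perplexity values are hypothetical and do not reproduce the study's figures.

```python
def elbow_by_second_difference(ks, perplexities):
    """Return the k at which the slope of the perplexity curve rises the most."""
    if len(ks) != len(perplexities) or len(ks) < 3:
        raise ValueError("need at least three (k, perplexity) pairs")

    best_k, best_change = None, float("-inf")
    for i in range(1, len(ks) - 1):
        # Slope of the curve just before and just after candidate ks[i].
        slope_before = (perplexities[i] - perplexities[i - 1]) / (ks[i] - ks[i - 1])
        slope_after = (perplexities[i + 1] - perplexities[i]) / (ks[i + 1] - ks[i])
        change = slope_after - slope_before
        if change > best_change:
            best_k, best_change = ks[i], change
    return best_k


# Hypothetical perplexity values, for illustration only.
ks = [2, 4, 6, 8, 10]
perplexities = [510.0, 505.0, 502.0, 540.0, 600.0]
best_k = elbow_by_second_difference(ks, perplexities)
print(best_k)
```

A heuristic of this kind only flags a candidate; the final choice would normally also be checked against the interpretability of the resulting topics.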

In the Northwest, research is well-balanced across multiple themes, with a strong presence of studies on Ecology and Plant Sciences. This suggests a significant focus on plant growth, species interactions, and biodiversity conservation, likely influenced by the region’s natural landscapes and environmental research initiatives. The Northeast follows a similar trend but places greater emphasis on Agriculture and Food Production and Hydrology, Water Resources, and Population Studies. This is underscored by the fact that among the representative keywords for Topic 4 is “Northern Italy”, emphasizing the regional context. The strong presence of agricultural research reflects the importance of farming, livestock production, and food sustainability in this area, while the attention to hydrology and water resources indicates a focus on climate change impacts (e.g. drought), groundwater management, and flood risk assessment, particularly relevant for Northern Italy. Moving to the Centre of Italy, the research landscape is characterized by a diverse range of topics, with a notable focus on Energy, Performance, and Management. This suggests that studies in this region are largely centred on energy efficiency, resource consumption, and sustainable technological advancements, possibly linked to urban infrastructure and industrial applications. A similar trend is observed in the South, where Energy, Performance, and Management also emerges as a dominant topic, with an even greater proportion of research dedicated to this area. This indicates a strong regional interest in energy efficiency and long-term sustainability, likely driven by efforts to integrate renewable energy solutions and optimize resource use in response to climatic conditions. The Islands, in contrast, show a predominant focus on Climate Policy and Regulations and Hydrology, Water Resources, and Population Studies. 
This suggests that research on the Islands is particularly concentrated on sustainability policies, environmental governance, and climate adaptation strategies, alongside studies on water management and ecosystem services, which are critical for island territories facing challenges such as water scarcity and environmental vulnerability.

Sentiment analysis

Fig. 14 presents the results of the ClimateBERT sentiment analysis on English-language dissertations, examining how different perspectives on climate change have evolved over time. The main graph shows the overall distribution of sentiments (categorized as “Challenges”, “Neutral” and “Solutions”) from 2008 to 2021, while the smaller inset graph focuses specifically on BIO/07 (Biology), which is the most frequently occurring Scientific Disciplinary Sector (SSD) in English-language dissertations discussing climate change. Dissertations categorized as “Challenges” tend to highlight the risks, negative impacts, or unresolved issues surrounding climate change. Those identified as “Solutions” present more optimistic or action-oriented perspectives, often focusing on proposed interventions, mitigation strategies, or innovations. Lastly, “Neutral” texts discuss climate change in a more descriptive or analytical way, without explicitly presenting it as a problem to solve or a solution to implement.

Fig. 16

Perplexity Analysis for optimal topic selection in English dissertations using K-Means. The optimal number of topics is 8, as this range maintains low perplexity while preventing excessive topic fragmentation. At 8 topics, the slope of perplexity starts changing, marking the transition point between coherence and unnecessary complexity.

In the main box of Fig. 14, the bars illustrate the percentage distribution of the three sentiment categories over time. The yellow section represents “Solutions”, the green section denotes “Neutral” perspectives, and the purple section indicates “Challenges”. Neutral and solutions-oriented perspectives consistently hold the highest shares across the years, while the share of challenges remains relatively stable. The proportion of solutions-oriented sentiment is particularly high between 2008 and 2010, before decreasing slightly and settling at around 35% of the dissertations in each subsequent year. This suggests that academic discourse has maintained a focus on actionable solutions, albeit with a slightly moderated emphasis over time. The inset chart, which isolates BIO/07 (Biology), shows a predominance of dissertations focusing on solutions to climate change. Neutral perspectives also have a significant share, while no dissertations in the biological field are classified as explicitly addressing challenges. This pattern indicates that biological research is heavily oriented towards finding ways to address climate-related issues, such as biodiversity conservation, ecosystem restoration and adaptation strategies. The consistently high representation of solutions-oriented and neutral perspectives over time suggests that climate change research in academia has gradually shifted towards a more proactive and action-focused approach.
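The yearly percentage distribution described above can be computed along the following lines. This is a minimal sketch, assuming each dissertation already carries a publication year and one of the three sentiment labels; the function name `yearly_sentiment_shares` and the toy data are hypothetical and are not drawn from the study.

```python
from collections import Counter, defaultdict

CATEGORIES = ("Challenges", "Neutral", "Solutions")


def yearly_sentiment_shares(labelled):
    """Return {year: {category: percentage}} from (year, label) pairs."""
    counts = defaultdict(Counter)
    for year, label in labelled:
        if label not in CATEGORIES:
            raise ValueError(f"unexpected label: {label!r}")
        counts[year][label] += 1

    shares = {}
    for year in sorted(counts):
        total = sum(counts[year].values())
        # Categories absent in a given year get a share of 0.0.
        shares[year] = {c: 100.0 * counts[year][c] / total for c in CATEGORIES}
    return shares


# Hypothetical toy data: (year, sentiment label) for each dissertation.
labelled = [
    (2008, "Solutions"), (2008, "Solutions"), (2008, "Neutral"), (2008, "Challenges"),
    (2015, "Solutions"), (2015, "Neutral"), (2015, "Neutral"), (2015, "Challenges"),
]
sentiment_shares = yearly_sentiment_shares(labelled)
for year, dist in sentiment_shares.items():
    print(year, dist)
```

Restricting the input pairs to a single SSD, such as BIO/07, would give the corresponding distribution for the inset chart.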
