Shared genetic architectures of educational attainment in East Asian and European populations

Study selection

This study has been approved by the ethics committee of National Health Research Institutes, Taiwan (TWB; EC1090402-E and EC1110608-E) and Seoul National University Bundang Hospital, South Korea (KoGES; X-2107-699-902).

TWB

TWB is a population-based prospective cohorts study, which was planned to recruit 200,000 volunteers between 20 and 70 years of age with no prior diagnosis of cancer from 29 recruitment centres across Taiwan16 (see URLs). In total, TWB has recruited 159,195 participants since 2012. Baseline characteristic data were collected from structured interviews, physical measurements, biomarkers and genetic data. We obtained genome-wide genotype data from two customized chips, including 27,719 samples in the TWB v1 custom array and 81,236 samples in the TWB v2 custom array. The TWB v1 custom array (batch 1) was designed on the basis of the Thermo Fisher Axiom Genome-Wide CHB Array with customized contents in 2011, and the TWB v2 custom array (batch 2) was designed by Thermo Fisher Scientific in 2017 on the basis of whole-genome sequencing data from 946 TWB samples with customized contents51.

KoGES

The KoGES is a large prospective cohort study initiated by the National Institute of Health, South Korea. KoGES provides epidemiological and genetic data from three population-based cohorts: Ansan/Ansung, Health Examinee and the Cardiovascular Disease Association study. Ansan/Ansung is a community-based cohort that recruited 10,030 individuals aged 40–69 years living in Ansan or Ansung. Health Examinee is an urban-based cohort study that recruited 173,208 individuals aged 40–79 years between 2004 and 2013 at a hospital health check-up centre. The Cardiovascular Disease Association study is a rural-based cohort conducted between 2005 and 2011 and recruited 28,337 individuals aged 40–69 years. From these three cohorts, we obtained genome-wide genotype data from the Korea Biobank Array, which is a customized Korean-specific chip52,53. In total, 71,678 individuals with genotypic and phenotypic information were included in this study.

Genotype data QC and imputation

We conducted stringent pre-imputation QC, followed by the PBK genotype QC project pipeline (see URLs), for samples in TWB batch 1, TWB batch 2 and KoGES. First, we included samples with a call rate >0.98 and variants with a call rate >0.98. We then filtered out variants that were duplicated, monogenic or incorrectly mapped to a genomic position. Using a random forest model with the top six principal components (PCs) and the 1KG Project phase 3 data as ref. 54, we classified genetic ancestry and identified samples with a predicted probability of EAS ancestry >0.8. When we estimated the PCs with LD pruning at r2 = 0.1, we removed multi-allelic and strand ambiguous SNPs, SNPs with call rate <0.98, SNPs with MAF >5%, and SNPs located in long-range LD regions (chromosome 6: 25–35 Mb; chromosome 8: 7–13 Mb). We then excluded samples with mismatched genetic and self-reported sex, as well as samples with heterozygosity rates outside six standard deviations from the sample average. We also excluded population outliers by conducting in-sample PC analysis in three rounds. We excluded samples with any of the top ten PCs that were more than six standard deviations away from the sample average in each round of the in-sample PC analysis. Finally, we included homogeneous EAS samples and discarded variants with call rate <0.98 and Hardy–Weinberg equilibrium P-value <1010. After pre-imputation QC, we performed imputation independently for TWB batch 1, TWB batch 2 and KoGES using Eagle v2.4 (ref. 55) for pre-phasing and Minimac4 for genotype imputation, with the 1KG Project phase 3 EAS data as the reference panel56.

Phenotype: EduYears

The education system in Taiwan is similar to that in South Korea (for cohort details, see Supplementary Note). EduYears was collected from different questionnaires using a multiple-choice question in the TWB and KoGES when the participant was 30 years old or older. To ensure comparability across cohorts, including TWB, KoGES and cohorts in GWAS meta-analysis in EUR population, we mapped each category in these questions to the International Standard Classification of Education (ISCED) category. We then imputed each ISCED category to the number of years of schooling, which is referred to as EduYears. We have summarized the questions and results of mapping from the ISCED Level to EduYears in Supplementary Table 21.

Genetic association analysis

We performed genetic association analyses on EduYears using post-QC imputed genotype data and a two-step whole-genome regression model implemented in Regenie v2.2.4 (ref. 57), which accounts for sample relatedness and population structure. We excluded duplicate samples by randomly removing one sample from each pair in the two-step whole-genome regression models for TWB and KoGES. We adjusted for the birth year (BY), BY2, BY3, sex, BY by sex interaction, BY2 by sex interaction, BY3 by sex interaction, and the top ten PCs, based on previous GWAS for EduYears14. The top ten PCs were included as covariates to control for population stratification. We used the GWAS summary statistics derived from Regenie57 as the main result and utilized them for all downstream analyses, except for the analyses using LDSC18 and LDSC-based methods such as S-LDXR31 and stratified LDSC.

For analyses using LDSC-based methods (for example, LDSC, S-LDXR, stratified LDSC and so on), we separately performed association analyses using linear regression implemented in PLINK v2.0 (ref. 58) in unrelated samples of TWB and KoGES. We estimated genetic relatedness to check for family relationships using PLINK58 with kinship coefficients of 0.0884 and 0.354 as thresholds for second-degree relations and duplicate samples, respectively. We randomly excluded one sample from each pair of second-degree or more closely related relatives within TWB batch 1, TWB batch 2 and KoGES independently. Across batches in TWB, we also excluded batch 2 samples from each pair of duplicated samples. We then performed association tests adjusted for BY, BY2, BY3, sex, BY by sex interaction, BY2 by sex interaction, BY3 by sex interaction, and the top ten PCs on the remaining unrelated individuals for TWB batch 1, TWB batch 2 and KoGES.

GWAS meta-analyses in EAS population

Before performing the following meta-analyses, we first filtered the variants in individual biobank association summary statistics (TWB batch 1, TWB batch 2 and KoGES) by imputation INFO >0.6 and MAF >0.5% for Regenie whole-genome linear regression analyses and by imputation INFO >0.8 and MAF >1% for PLINK linear regression analyses. We first synthesized TWB batch 1 and TWB batch 2 GWASs using an inverse-variance-weighted (IVW) fixed-effect model implemented in METAL 2020-05-05 (ref. 29). We conducted a GWAS meta-analysis of EduYears in EAS population, including TWB and KoGES, using an IVW fixed-effect model. In the meta-analyses of EduYears, we used the ID obtained from imputation56 as unique identifiers for each variant, and we checked the heterogeneity in effect size using Cochran’s Q test implemented in METAL29. We estimated Fst between TWB and KoGES to measure population differentiation due to genetic structure59. We removed variants with inconsistent allele on the same strand. We also removed variants that were not included in both biobanks. Subsequently, we annotated the reference SNP with the dbSNP build 155 data from the National Center for Biotechnology Information Search database60 using SnpSift v4_3t_core61.

Heterogeneity of genetic effects within EAS population

To identify underlying factors contributing to heterogeneity between TWB and KoGES, we performed the following procedures.

Phenome-wide association study lookup

To investigate the pleiotropic effects of variants showing heterogeneity, we conducted a search in the KoGES PheWeb (see URLs).

Global and local genetic correlation analyses

To explore the relationship between alcohol-related traits and EduYears, we performed global and local genetic correlation analyses using KoGES data. The global genetic correlation was estimated using LDSC v1.0.1 (ref. 18), while the local genetic correlation within specific genomic regions was assessed using LAVA62. The details are summarized in Supplementary Note.

Stratified GWAS analysis

In KoGES, individuals were classified as drinkers if they had a history of past or current alcohol consumption, and non-drinkers if they had no history of alcohol consumption. We then performed genetic association analyses for EduYears using Regenie v2.2.4 (ref. 57), adjusting for BY, BY2, BY3, sex, BY by sex interaction, BY2 by sex interaction, BY3 by sex interaction, and the top ten PCs.

Heritability and genetic correlation analyses

We performed LDSC v1.0.1 (ref. 18) to estimate the SNP-based heritability of EduYears in the TWB, KoGES, EAS and EUR populations14 using population-specific LD scores based on the 1KG Project phase 3 data. We applied LDSC18 and S-LDXR v0.3-beta31 to estimate the genetic correlations for within-EAS and cross-population genetic correlations between EAS and EUR populations, respectively. We used the default LD scores for the EAS and EUR populations provided by S-LDXR31 as reference panels to estimate the cross-ancestral genetic correlation. For comparability and unbiased estimation, we used GWAS summary statistics derived from linear regression models implemented in PLINK to perform LDSC18 and S-LDXR31, which excluded strand ambiguous SNPs and variants with imputation INFO <0.8 and MAF <1%.

eQTL analysis

To investigate the influence on gene regulation of SNPs associated with EduYears, we performed an eQTL analysis using gene expression data implemented in FUMA v1.3.7 (ref. 22) with brain tissue expression data from the GTEx v8 database. We then performed gene mapping for significant SNP-gene pairs with an FDR <5%.

Gene-based and gene set enrichment analyses

Gene-based and gene set analyses were performed using MAGMA gene-property analyses20,21 implemented in FUMA v1.3.7 (ref. 22) to identify the genes and gene sets related to EduYears. Gene-based analysis was performed by mapping SNPs to 18,123 protein-coding genes using the SNP-wide mean model. Next, gene set analysis was conducted with 10,678 gene sets, including curated gene sets and GO terms from MsigDB v6.2. We employed a competitive test to determine whether genes in a gene set were more strongly associated with EduYears than the other genes. We then applied a Bonferroni correction to all tested genes and gene sets to account for multiple comparisons. MAGMA gene property analysis was performed using test statistics obtained from gene-based and gene set analyses.

Partitioned heritability analysis

Based on GWAS summary statistics of EAS samples, we used LDSC-SEG v1.0.1 (ref. 25) to prioritize tissues and cell types relevant to EduYears. We partitioned genome-wide SNP heritability into 97 baseline-LD annotations introduced by Gazal et al.26 and 9 tissue-specific categories as specified by Finucane et al.24. We used LD scores for the EAS and EUR populations using the 1KG Project phase 3 data provided by LDSC GitHub repository as a reference (see URLs).

Pathway enrichment analyses in EAS and EUR populations

We applied GSA-SNP2 (ref. 28) based on all P values from both EAS and EUR GWAS to detect biological pathways associated with EduYears. GSA-SNP2 employs the Z-statistics of the random set model, assessing pathways by combining adjusted gene scores for SNP counts in each gene using a monotone cubic spline trend curve. We evaluated gene set enrichment using the MSigDB C5 collection v5.2 database63,64. For the detailed options regarding the genes and pathways in the analysis, the race was selected as ancestry-matched (EUR or EAS), the reference genome version was set as GRCh37 (hg19), the padding size for genes was set to 20 kb, and the pathway size window was chosen as 10–200. Significantly enriched pathways were defined as those with q value <0.05.

Cross-ancestry GWAS meta-analyses in EAS and EUR populations

We obtained summary statistics for all variants that passed QC filters in the GWAS meta-analysis for EduYears in EUR population from all discovery cohorts except 23andMe14, which included 766,345 participants of EUR ancestry with 10.1 million genetic variants. Next, we conducted a cross-ancestry GWAS meta-analysis to synthesize EUR and EAS data using an IVW fixed-effect model implemented in METAL29, in which genomic control correction was applied to the EUR and EAS data. To evaluate the effect size heterogeneity between the two populations, we examined Cochran’s Q statistic implemented in METAL29. We applied MAMA30 to account for potential differences between the EAS and EUR populations in effect size, allele frequency and LD. We used the 1KG Project phase 3 data as a reference panel54 to calculate the LD score for the EAS and EUR populations. Next, we filtered out variants with MAF <0.5%, and the remaining options for running the meta-analysis were set to default values. Okbay et al.10 reported an updated meta-analysis of EduYears in a sample of 3,037,499 individuals in 2022, which was nearly three times larger than that reported by Lee et al. in 2018 (n = 1,131,881)14. However, there is an access limitation to publicly released data for both GWAS results. The sample size in the publicly released data from Lee et al. (n = 766,345) was slightly larger than that reported by Okbay et al. (n = 765,283); therefore, we used the GWAS results from Lee et al. in this study.

Assessment of transferability

To assess the transferability of EduYears-associated loci between EAS and EUR populations, we employed the PAT ratio approach32. We initiated the analysis with 246 loci identified from publicly available EUR summary statistics (n = 766,345) by Lee et al.14. For each locus, we generated credible sets, incorporating lead SNPs and proxy SNPs, using the same criteria of the study by Huang et al.32. Specifically, we included SNPs within a 50-kb window of the lead SNP with r2 ≥ 0.8 and P < 100 × Plead using the 1KG Project phase 3 EUR data as the reference panel. A locus was considered ‘transferable’ if at least one variant within its credible set exhibited an association with EduYears in EAS (P < 0.05) and demonstrated the same effect direction as observed in EUR. To estimate statistical power, we used the default parameter (α = 0.05) and the summed-up power estimates for all published loci to obtain the expected number of transferable loci. Finally, by dividing the observed number of loci by the expected number of loci, we calculated the PAT ratio to estimate the transferability of EduYears loci between the EAS and EUR populations.

Fine-mapping analysis

We used the SuSiEx approach33, which builds on the Sum of Single Effects model, to perform within-EAS fine-mapping for EduYears in genome-wide significant loci and cross-population fine-mapping integrating EAS and EUR EduYears GWAS14. The GWAS summary statistics in EAS population were derived from the two-step whole-genome regression models implemented in Regenie57, which filtered out variants with imputation INFO <0.6 and MAF <0.5%. The 1KG Project phase 3 data were used as the reference panel to calculate the LD matrix in the corresponding populations. We extended the region of a significant locus, identified through FUMA22, by adding 250 kb to each side, if the region was less than 1 Mb. We then identified a 95% credible set in each region with the maximum number of the causal signals set to 10 and the default settings in the remaining options. We showed regional plots for these genome-wide significant loci in both EAS and EUR populations using LocusZoom v0.9.6 (ref. 65). We also showed plots of PIPs in all credible sets identified from within- and cross-population fine-mapping, after filtering out variants with P > 5 × 10−8. Moreover, as the mismatch of the LD patterns between the reference panel and the GWAS discovery sample may bias the results for fine-mapping, we applied the SuSiE-RSS model34 to evaluate the consistency of the LD using the susieR package v0.12.10 implemented in R v4.2.1 (ref. 66). A larger ‘s’ metric from SuSiE-RSS implies a strong inconsistency between GWAS summary statistics and the LD matrix from the reference panel. We also constructed diagnostic plots to compare the observed z-scores against the expected z-scores for SNPs included in the fine-mapping.

Genetic correlation analysis with other traits

We estimated the cross-trait genetic correlation between EduYears and other traits by using LDSC v1.0.1 (refs. 18,67). We used publicly available GWAS summary statistics of socioeconomic and health-related traits for the EAS and EUR populations. A full list of GWAS summary statistics used in the analysis can be found in Supplementary Table 18. For both populations, we downloaded LD scores calculated from the 1KG Project phase 3 data via the LDSC GitHub repository (see URLs). We then applied FDR correction to control for false positive discoveries.

Polygenic prediction

We assessed the predictive ability of PGSs derived from the current EAS genome-wide meta-analysis for EduYears and the EUR GWAS for EduYears14 by using three testing cohorts of EAS ancestry, which are the EMCIT, a Korean-based cohort and the Chinese samples in UKBB, and one testing cohort of EUR ancestry, which is the NIA-LOAD. These testing cohorts are summarized in Supplementary Note.

We constructed PGSs for EduYears using two Bayesian polygenic prediction methods, PRS-CS v1.0.0 (ref. 35) and PRS-CSx v1.0.0 (ref. 36). The advantages of PRS-CS are robustness to varying genetic architectures, accurate LD modelling and computational efficiency. The posterior SNP effect sizes in PRS-CS were inferred from the EAS and EUR GWAS meta-analysis for EduYears. PRS-CSx can be considered as an extension of PRS-CS, which improves cross-population polygenic prediction by integrating GWAS summary statistics from multiple ancestry groups. The posterior SNP effect sizes in PRS-CSx were inferred from both the EAS and EUR GWAS meta-analyses for EduYears14. We then synthesized the SNP effect sizes across populations using an IVW meta-analysis of population-specific posterior effect size estimates. The 1KG Project phase 3 samples (EAS (n = 504), EUR (n = 503)) that matched the ancestry of the discovery samples were used as external LD reference panels. We fixed the global shrinkage parameter to be 0.01 in both PRS-CS and PRS-CSx, which is suitable for highly polygenic traits.

We evaluated the prediction accuracy of PGS using the partial R2 in each testing cohort, which was implemented in R v4.2.1. We adjusted for BY, sex, and BY by sex interaction, and the top ten PCs in all models. The P value of the partial R2 was derived from a likelihood ratio test comparing the goodness of fit of the models with and without PGS. To infer confidence intervals, we used the boot package68 in R with 1,000 bootstrap replicates over samples.

The details for the analyses under the same sample sizes are summarized in Supplementary Note.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

link