The genomic portrait of the Picene culture provides new insights into the Italic Iron Age and the legacy of the Roman Empire in Central Italy

The analyzed samples are under the protection of the Superintendence Archaeology, Fine Arts and Landscape of the Marche Region and the Superintendence Archaeology, Fine Arts and Landscape of the Toscana Region. They were collected in the frame of the agreements between the Department of Biology and Biotechnologies “C. Darwin” of the Sapienza University of Rome and the aforementioned Superintendencies: protocol Number 0010166 for the Picene necropolises, protocol number 0000934 for the Pesaro necropolis and protocol number 10527 for the Etruscan one. In this article, we labeled the genetic clusters composed of our samples by using the archeological culture nomenclature [5, 11, 12, 59]. Nevertheless, we note that archeological/cultural labels do not necessarily correspond to genetic clusters or individual identities.

DNA extraction and library preparation were performed at the ancient DNA laboratory at the Estonian Biocenter, Institute of Genomics, University of Tartu, Tartu, Estonia. Quantification and sequencing of the libraries were carried out at the Estonian Biocenter Core Laboratory. Radiocarbon dating was performed at Vilnius Radiocarbon, Vilnius, Lithuania.

DNA extraction

We extracted DNA from a total of 102 human remains samples, consisting of 3 petrous bones and 99 tooth roots. Small slices of bone were sampled from petrous bone and root portions were taken from teeth samples. Both these procedures were performed with a sterile drill wheel that was sterilized with 6% (w/v) bleach followed by distilled water and ethanol rinse in between the samples. The collected samples were placed in 6% (w/v) bleach for 5 min, then rinsed with 18.2 MΩcm water for 3 times and soaked for 2 min in 70% (v/v) ethanol. During the previous bleach and ethanol steps the samples were shaken to allow the detachment of particles. Later, the samples were placed on a clean paper towel inside a class IIB hood, and they were left to dry for 2 h with the UV light on. To calculate the correct volume of EDTA and Proteinase K needed for the extraction, the samples were weighted. We considered 20× EDTA (µL) of sample mass (mg) and 0.5× Proteinase K (µL) of sample mass (mg). The samples, EDTA and Proteinase K were placed into PCR-clean conical tubes (Eppendorf) of 5 or 15 mL, depending on the total volume required, under the IIB hood. These tubes were incubated on a slow shaker for 72 h at room temperature. The resulting DNA extracts were concentrated to a final volume of 250 µL with the Vivaspin Turbo 15 (Sartorius). Then, they were purified in large volume columns using the High Pure Viral Nucleic Acid Large Volume Kit (Roche) with 2.5 mL of PB buffer, 1 mL of PE buffer, and 100 µL of EB buffer (MinElute PCR Purification Kit, QIAGEN). The silica columns were placed in a collection tube to dry and later in a 1.5-mL DNA lo-bind tube (Eppendorf) for elution. Samples were incubated at 37 °C for 10 min with 100 µL of EB buffer and then centrifuged at 13,000 rpm for 2 min. The silica columns were removed after centrifugation and the samples were stored at − 20 °C, 30 µL of the samples were used for library preparation.

Library preparation and sequencing

The libraries for sequencing were made with NEBNext DNA library Prep Master Mix Set for 454 (E6070, New England Biolabs) and with Illumina-specific adaptors [60] using established protocols [60,61,62]. The end repair part was implemented (as described in Saupe et al. [23]) using 118.75 µL of water, 7.5 µL of buffer, and 3.75 µL of enzyme mix, incubated at 20°C for 30 min. Then, the samples were purified with 500 µL of PB buffer and 650 µL of PE buffer and eluted in 30 µL of EB buffer (MinElute PCR Purification Kit, QIAGEN). As previously, the adaptor ligation step was implemented as in Saupe et al. [23], using 10 µL of buffer, 5 µL of T4 ligase, and 5 µL of adaptor mix [60], incubating for 14 min at 20°C. The samples were purified as described above and then eluted in 30 µL of EB buffer (MinElute PCR Purification Kit, QIAGEN). The step of the adapter fill-in was performed using 13 µL of water, 5 µL of buffer, and 2 µL of Bst DNA polymerase, incubating 30 min at 37°C and for 20 min at 80°C [23]. For PCR amplification of the libraries, we used 50 µL of DNA library, 1X PCR buffer, 2.5 mM MgCL2, 1 mg/mL BSA, 0.2 µM inPE1.0, 0.2 mM dNTP each, 0.1 U/µL HGS Taq Diamond and 0.2 µM indexing primer. Cycling conditions were settled in the following way: 5 s at 94°C, followed by 18 cycles of 30 s each at 94°C, 60°C and 68°C, with a final extension of 7 min at 72°C. After PCR amplification, the samples were purified in 35 µL of EB buffer (MinElute PCR Purification Kit, QIAGEN). To measure the concentration of dsDNA/sequencing libraries and to confirm that library preparation was successful, we performed three verification steps: fluorometric quantification (Qubit, Thermo Fisher Scientific), parallel capillary electrophoresis (Fragment Analyzer, Agilent Technologies) and qPCR. DNA sequencing was performed using the Illumina NextSeq500/550 High-Output single-end 75 cycle kit and 20 samples were sequenced together on one flow cell.

Mapping

The sequences of the adapters, the indexes, the poly-G tails that occur because of the NextSeq 500 technology specifics, and sequences shorter than 28 bp (--minimum-length 28, to reduce the risk of random mapping of sequences from other species) were removed from the DNA sequences before the mapping with cutadapt-2.1 [63]. The resulting sequences were mapped to the human reference sequence GRCh37 (hs37d5) with BWA-0.7.17 [64]. To reduce the effect of reference bias in the following analyses we used the command bwa aln with relaxed alignment parameters (-n 0.01 -o 2) in combination with disabling seeding (-l 1024) [65,66,67]. The sequences were converted to BAM file format subsequent to the alignment and only the mapped sequences were retained using samtools-1.9 [68]. Duplicates were removed with picard-2.20.8 (http://broadinstitute.github.io/picard/index.html) and indels were realigned using GATK-3.5. With samtools-1.9 we filtered out reads with mapping quality lower than 25 as suggested by Martiniano et al. [65]. We estimated the number of final reads, average read length, average coverage, and other parameters with samtools-1.9 on the final BAM files. The endogenous DNA content of the samples (calculated as the proportion reads mapping to the human reference genome) spanned between 0.16 and 69.34% with an average of 22.33% (Additional file 2: Table S1).

aDNA authentication and contamination rate

We employed the program MapDamage-2.0 [69] to estimate the frequency 5′ ends of sequences C->T transitions, one of the characteristic patterns of ancient DNA (aDNA) damage, to confirm that the sequences we obtained are mostly ancient. We estimated contamination rates on the mitochondrial DNA (mtDNA) with the method detailed in Jones et al. [70] by computing the fraction of non-consensus bases at mtDNA haplogroup defining position. Samples with a contamination rate lower than the 3% were used for subsequent analyses. We additionally estimated nuclear contamination for males on the basis of the X chromosome with the two methods described in Rasmussen et al. [71] implemented in the ANGSD tool [72] and with HapCon [73]. In these cases, samples with a contamination estimate lower than 6% were considered suitable for further analyses.

Genetic sex estimation

Genetic sex was estimated with the procedure described in Skoglund et al. [74], computing the proportion of reads mapping to the Y chromosome with respect to the total number of reads mapping either to the X or the Y chromosome. In some cases, it was not possible to unequivocally assign the genetic sex, but we note that these samples were not considered for population genomics analysis because of other issues (i.e., low coverage or contamination, Additional file 2: Table S1).

mtDNA haplogroups assignment

We assigned the mitochondrial DNA haplogroups with the online tool Haplogrep2 [75] (https://haplogrep.i-med.ac.at/haplogrep2/index.html). The VCF file that was used to perform haplogroup prediction was obtained by calling the variants from the .bam files with bcftools-1.14 [76]. The command used was bcftools mpileup with the additional flag --ignore-RG and then only the variant positions were called with the command bcftools call -m --ploidy 1 -v.

Y chromosome haplogroup assignment and phylogeny

Y chromosome phylogenetic relationship among newly reported samples and other ancient Euroasiatic samples [8,9,10, 23, 24, 26, 28,29,30,31,32, 77,78,79] (Additional file 2: Table S5) were reconstructed with pathPhynder [80] starting from the .bam files, using standard parameters. As a reference tree, we used the one provided in Martiniano et al. [80] which spans across all the genetic variability of human Y chromosome haplogroups with a total of 2014 individuals included. To visualize the resulting tree, we used the R package ggtree [81].

Variant calling on autosomes

Variant calling for autosomes was performed with ANGSD-0.917 [72]. We called haploid genotypes sampling a random base (-doHaploCall 1) for each position that is present in the 1240K SNPs panel using the -sites options. In addition, we specified the major and minor alleles as they are indicated in the 1240K SNPs panel with the option -doMajorMinor 3. The function haploToPlink was used to convert the .haplo file, resulting after the SNP calling, into Plink [82] format files.

Design of datasets for genomic analysis

For genomic analyses, we compared our newly reported samples with modern and ancient individuals from the literature. The included samples belong to the AADR dataset v52.2 ([20, 21], both the 1240K and the 1240K+HO dataset, Lazaridis et al. [26] and Aneli et al. [10]. To maximize the number of SNPs available for each analysis we built three different datasets: (A) including 1240K and HO data, used for PCA and Admixture analysis; (B) including only 1240K data, used for kinship analysis, f3-statistic, D-statistics, qpAdm; (C) including 1240K, HO and modern Italians from Raveane et al. [57], employed for PCA to compare ancient and modern Italians (Additional file 3: Fig. S2). All the data coming from different datasets were merged with Plink-1.9 [82]. To minimize the possible error due to post-mortem damage of ancient DNA, for all the population genetics analyses employing these 3 datasets (PCA, Admixture, f3, D-statistics, qpWave/qpAdm, and f4 NNLS analysis) we used only transversions. The resulting total number of autosomal SNPs was 98,845 for dataset (A), 209,089 for dataset (B), and 98,842 for dataset (C). After merging, we retained for subsequent analysis only the newly reported samples showing mtDNA contamination lower than 3% and a coverage of SNPs of at least 10,000 from dataset (A) and 5000 from dataset (B) and (C). The label for the population assigned to each individual sample used in this study can be found In Additional file 1: Table S4 and S5. Because they belong to the same material culture and for the genetic similarities outlined, for several analysis (D-statistic, qpWave/qpAdm, f4/NNLS, IBDs, ROHs and phenotype prediction) Etruscan individuals reported in this study were grouped together with individuals from the literature [9], maintaining putative outliers individuals separated.

Kinship analysis

Kinship estimations were performed independently with READ [83] and TKGWV2 [84]. Since READ with default parameters estimates the median pairwise distance of unrelated individuals in the population under study to assess the potential kinship relationships, using different individuals may change these estimates. Therefore, for READ, we performed tests with four different datasets: (1) samples from Central Italy Iron Age [8, 9], but excluding the ones identified as outliers; (2) only the Picenes, with both the necropolises of Novilara and Sirolo-Numana; (3) only the Picenes from the necropolises of Novilara; (4) the LA individuals from Pesaro with the Imperial time individuals from Antonio et al. [8]. For each of these tests, we performed the analysis with all the SNPs and with only the transversions (Additional file 2: Table S2). The samples were selected from the original datasets with the option --keep of Plink [82] and converted into the .tped format that is required for READ. With TKGWV2 we performed two different tests with all SNPs and only transversions including all the samples of interest at the same time (newly reported individuals, other IA samples from Central Italy, and Imperial time individuals), since it requires an external source for allele frequencies estimation. We used the allele frequencies of the European populations of the 1000 Genomes data [85] as provided here: https://github.com/danimfernandes/tkgwv2 (Additional file 2: Table S3). In all the tests performed the two Etruscan samples EV7A and EV19 resulted to be the same individuals (or monozygotic twins), therefore for population genetics analysis the sample with the lowest coverage (EV19) was excluded.

Principal component analysis

For PCA we selected a subset of modern and ancient Eurasian and North African samples from dataset A, for a total of 1464 individuals (Additional file2: Table S4). The subset was performed with the option --keep of Plink [82]. The resulting Plink format files were converted into EIGENSTRAT format with the program convertf of the EIGENSOFT-7.2.0 package [86, 87] using the parameter “familynames:NO”. The program smartpca of the EIGENSOFT-7.2.0 package was used to perform PCA with the parameters lsqproject:YES, autoshrink:YES, and outliermode:2. We projected all the ancient individuals onto the PCs built based on modern samples. In the PCA performed including modern Italians (employing dataset C, Additional file 3: Fig. S2), these had to be projected like ancient individuals due to the high number of missing SNPs (>60%), since the original data were produced on a different beadchip with respect to the HO and 1240K data (Infinium Omni2.5-8 Illumina beadchip). The results of the first two PCs were visualized in R-4.1.3 (https://www.r-project.org/) with the package ggplot2 (https://ggplot2.tidyverse.org/). We note that for low coverage (less than 10,000 transversions) putative genetic outliers, their deviation from the general population may be influenced lower amount of available markers.

Admixture analysis

We performed unsupervised Admixture analysis [22] with 1708 modern and ancient Eurasian and North African individuals selected from dataset A (Additional file 2: Table S4). Before running Admixture, SNPs were pruned for linkage disequilibrium with Plink [82] (option --indep-pairwise with parameters: 50 5 0.5) for a final set of 98,759 transversions. Moreover, modern and high-coverage ancient diploid genomes were converted into pseudo-haploid by picking a random allele for each variant position. We performed 10 independent repetitions (giving different random seeds with the -s option) for k values from 2 to 10 using the option --haploid='*'. The results from different runs were merged and visualized with the R package pophelper (Additional file 3: Fig. S3) [88]. We portrayed k = 4 because it is the most representative, showing a differentiation between CHG/Iran Neolithic and EHG/Yamnaya components. The names of the ancestral components in the main text (Anatolia Neolithic, Serbia Mesolithic, EHG/Yamnaya, CHG/Iran Neolithic) were given on the basis of the populations considered basal to modern Europeans which have the highest amount of that component.

The interpolation maps with the proportion of the ancestral components obtained with Admixture were performed with QGIS-3.26.1 (https://www.qgis.org/en/site/). For this analysis we kept only the samples from the 1st millennium BCE excluding known outliers for a total of 396 samples and, in archeological sites where more than one individual was present, we performed the mean for each component (Additional file 2: Table S7). The interpolation was calculated with the IDW method and for representation the option “singleband pseudocolor” was chosen, the minimum value was set to 0.01 and the maximum to 0.6, and, finally, the option “Interpolation: Discrete” was selected.

Outgroup f3 statistic

To explore the genetic relationships between different IA Italian groups, the Pesaro LA individuals and the putative outliers identified with the PCA and/or Admixture analysis in the form f3(X,Y;YRI.SG), where X is one of the following population: Picene, Etruscan, Italy_IA_Romans, Italy_IA_Apulia, Pesaro_LA, Italy_Imperial; or one of the following samples: EV15A, EV18, PN43, PN146, PN91, PN20, PN3, PN87 PF1, PF32; Y is a modern or ancient Eurasian or North African population selected from the dataset B (Additional file2: Table S4). f3 statistics were computed with the program qp3pop of the package admixtools-7.0.1 [89] using the option “inbreed:YES”.

D-statistic

D-statistic was performed in the form D(X, Y; Italy_BA_EBA, YRI.SG), where X and Y are a set of the populations present in Additional file 2: Table S4, selected from the dataset B. An additional test comparing Italy_IA_Tyrrhenian and Italy_IA_Adriatic populations was added. D-statistic was calculated with the program qpDstat of the package admixtools-7.0.1. [89] with the options “printsd: YES” and “inbreed: YES”.

qpWave/qpAdm analysis

To explore more in detail the ancestral genetic components of the Italian IA and LA populations, and the putative outliers we identified, we exploited a qpWave/qpAdm framework [90]. As a target, we used the populations and single individuals in the following list: Italy_IA_Adriatic, Italy_IA_Tyrrhenian, Picene, Etruscan, Italy_IA_Romans, Italy_IA_Apulia, Pesaro_LA, Italy_Imperial, EV15A, EV18, PN43, PN146, PN91, PN20, PN3, PN87, PF1, PF32. In detail:

i)

We exploited qpWave to evaluate if the targets can be described as a combination of the sources.

ii)

If (i) returned a p-value > 0.01, we used qpAdm to model the targets as a mixture of the sources.

We tested models with two and three left populations (sources) using all the possible combinations of the following: Anatolia_N_Barcin, Serbia_IronGates_Mesolithic, Yamnaya, Morocco_EN, Levant_PPN, CHG, Iran_N, Italy_N, Sardinia_N, Germany_BellBeaker, Italy_CA, Italy_BA_EBA, EHG. We discussed plausible models with a p-value ≥ 0.01 as also reported in Skourtanioti et al. [91]. As right populations (outgroup), we always used the following: Mbuti.DG, ISR_Natufian_EpiP, Morocco_Iberomaurusian, Mesopotamia, Russia_AfontovaGora3, Russia_MA1_HG.SG, Turkey_Boncuklu_N, Turkey_Epipaleolithic, WHG2, Ethiopia_4500BP.SG (the corresponding samples can be found in Additional file 2: Table. S4).

f4 NNLS analysis

To explore at a finer level the ancestral components of the newly reported samples and other populations from the literature, we performed a Non-Negative Least Squares (NNLS) analysis exploiting different f4-statistics vectors as a proxy for the relationships among the analyzed populations. In detail, we performed approximately 1.5 million of f4 in the form f4(X,Y;Z,Mbuti.DG) where X, Y, and Z are all the possible combinations of 112 populations. For each of these populations, a f4 vector was created and used to perform NNLS analysis (reconstructing each target individual copying vector as a mixture of different proportions of the putative sources employing a slightly modified version of the nnls function in the R package “nnls,” as described in Saupe et al. [23] and Wangkumhang et al. [92]) with these two sets of ancestral populations:

1)

Anatolia_N_Barcin, Serbia_IronGates_Mesolithic, Iran_N, Levant_PPN, Yamnaya;

2)

Anatolia_N_Barcin, Serbia_IronGates_Mesolithic, Iran_N, Levant_PPN, EHG.

The issue with this method is that some NNLS values are not computed from f4 vectors, in particular when the same populations are present between the X, Y, and Z populations. To avoid the presence of missing data, we set all these f4 values to 0, which is formally correct only when X and Y, and not X and Z, are the same populations. Additionally, to get an overview of the genetic similarity among the analyzed populations we performed an UPGMA tree based on the Euclidean distance between f4 values.

Genome imputation

We performed genome imputation following a slightly different pipeline to the one described in Hui et al. [93]. In particular, genotypes were called with ANGSD-0.917 [72] on the SNPs panel (removing indels) present in the global population of the 1000 Genomes Project Phase 3 [85], using the parameters -doMajorMinor 3 -GL 1 -doPost 1 -doVcf 1 -doMaf 1 -checkBamHeaders 0. Genotype likelihoods were updated in Beagle-4.1 [94] with -gl mode. Then, we performed imputation from sites where the genotype probability (GP) of the most likely genotype is equal or higher than 0.99, using Beagle-5 [95] with -gt mode. We exploited the 1000 Genomes Project Phase 3 [85] world population as a reference for the Beagle -gl step, while we used the Human Reference Consortium [96] dataset for the Beagle -gt step. After genome imputation, an additional GP filter (MAX(GP) ≥ 0.99) was applied before performing subsequent analyses.

We selected for imputation only samples with a coverage ≥ 0.1×, for shotgun data, or covering more than 300K SNPs from the 1240K panel, for chip data. To compare our data with other ancient populations, we selected unrelated individuals from different studies [8,9,10, 23, 24, 26,27,28,29,30,31,32] for a total of 815 individuals. The complete list of samples and the corresponding publication can be found in Additional file 2: Table S5).

Identity-by-descent (IBD) analysis

In order to identify segments of the genome that are identical by descent, we exploited the program IBDseq-vr1206 [97] on the imputed dataset, following the procedure illustrated in Ariano et al. [98]. In detail, with Plink-1.9 [82] we filtered out SNPs showing genotype missingness > 0.02 and a MAF < 0.05 (parameters --geno 0.02 and --maf 0.05 in Plink). The resulting Plink format files were converted into VCF with the option --vcf of Plink, and this was used as the input file for IBDseq. The parameters used for IBDseq were errormax = 0.005 and LOD ≥ 3, as indicated in Ariano et al. [

Comments (0)

No login
gif