The dataset analyzed comprised 20,433 records. We assessed the completeness of the variables within the dataset, with values ranging from 11 to 100%. Incomplete variables that were incomplete were excluded from further analysis, including family name (32%), contact information of the father or mother (13%), and family names of the father and mother (11–12%) (Fig. 2).
Fig. 2Comprehensiveness of the matching variables used for analysis, including DOB, gender, village, mother’s name, father’s name, and children’s names
We examined the distribution of values in the “Name” column, focusing on the frequency of distinct values and missing entries. Frequently occurring names (top 10) were identified, with Eanoi (a nickname meaning “baby”) appearing approximately 500 times, followed by “Mr.” and “Miss” (110 and 80 times, respectively). In contrast, the least frequent names (bottom five) represented permanent names (Fig. 3).
Fig. 3Analysis of the top ten and bottom five values in the “Name” aimed to identify the most and least frequent entries
Manual Review (Gold Standard)We calculated the ratio of matches to nonmatches identified through a combination of DOB, sex, and village. A total of 5,740 potential matches (28.09%) were flagged for manual review. Three reviewers assessed the matches and nonmatches within these 5,740 records. Reviewer 1 identified 3,409 matches (16.68%), Reviewer 2 identified 3,311 matches (16.2%), and Reviewer 3 identified 3,240 matches (15.86%). Compared with individual reviewers, the final consensus of the gold standard matches resulted in a marginally reduced match rate, with a total of 3,191 matches detected (15.62%) (Table 2).
Table 2 Results of a review process, evaluated by three reviewers and considered the gold standardThe breakdown of matching pairs, which is based on the number of records that are matched collectively, yielded several groupings. Most of the matching pairs consisted of two records, representing 86.9% of the total (1,290 pairings out of 1,484). A smaller proportion of matching pairs consisted of three (11.4%), four (1.4%), and five records (0.3%). The largest matching pair comprised 2,580 records, with other groupings consisting of 507, 84, and 20 records (Table 3).
Table 3 The distribution of matching pairs among several groupingsPerformance of Matching TechniquesThe recall of the blocking matching criterion was adjusted from 65 to 95%, revealing that increasing thresholds improved precision but reduced recall. The model consistently achieved a high F1 score, with values between 65% and 75%. We selected a threshold of 70% for further evaluation (Fig. 4).
Fig. 4Adjustment of the blocking matching sensitivity for DOB and villages from 65–95% in the probabilistic matching technique and comparison of its performance with that of manual matching
Figure 5 presents the confusion matrix comparing the performance of three matching methods: deterministic, probabilistic, and hybrid. According to the gold standard, the methods identified 16,979, 16,915, and 16,909 records, respectively, as true negatives; 263, 327, and 333 records as false positives; 1,054, 312, and 310 records as false negatives; and 2,137, 2,879, and 2,881 records as true positives. The precisions of the three methods are notably similar, reaching approximately 89–90%. However, differences are observed in the recall and F1 scores. The probabilistic and hybrid methods outperformed the deterministic method in terms of matching performance, achieving recall rates of 90%, 90%, and 67%, respectively, and F1 scores of 90%, 90%, and 76%, respectively (Fig. 6).
Fig. 5Confusion matrix for three matching techniques applied to the SCHR dataset, including true negatives, true positives, false negatives, and false positives
Fig. 6Comparison of precision, recall, and F1 score using the SCHR dataset across three methodologies: deterministic, probabilistic, and hybrid
The precision‒recall curve provides a visual assessment of model performance by illustrating the trade-off between precision and recall across different threshold settings. The deterministic model, with an area under the curve (AUC) of 0.81, initially demonstrated high precision. The precision sharply decreased as the recall increased, indicating a significant trade-off. Compared with the deterministic model, the probabilistic and hybrid models achieved an AUC of 0.91, indicating superior flexibility. These methods provide a better balance between precision and recall (Fig. 7).
Fig. 7The combined precision‒recall curves plotted using the new data, including the AUC values for the deterministic, probabilistic, and hybrid methods (presented separately)
Comments (0)