Evaluation Metrics
How submissions are scored: partiality-corrected precision, recall and F1 for Task 1; Hits@k and MRR with NIL abstention for Task 2.
This document provides a brief overview of the evaluation metrics used for both tasks. A full description will appear in the supplementary material (available at track launch).
Task 1 - Global Alignment
To score Task 1, we use an implementation equivalent to the scorers in MELT for incomplete, partial reference alignments (i.e., our repaired silver standard). Specifically, we use partiality-corrected precision, recall and F1, as defined below.
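As a rough illustration of what a partiality-corrected count can look like, the sketch below assumes one common convention for partial references: a system correspondence absent from the reference counts as a false positive only if its source or target entity is covered by some reference correspondence, and all other unmatched system correspondences are ignored rather than penalised. This convention, the function names `partial_counts` and `prf`, and the toy IRIs are assumptions for illustration, not the track's scorer.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]  # (source IRI, target IRI)


def partial_counts(system: Iterable[Pair], reference: Iterable[Pair]) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) against a partial reference alignment.

    Assumption: a system pair outside the reference is a false positive
    only when it touches an entity that the reference already covers.
    """
    sys_set, ref_set = set(system), set(reference)
    ref_sources = {s for s, _ in ref_set}
    ref_targets = {t for _, t in ref_set}

    tp = len(sys_set & ref_set)
    fn = len(ref_set - sys_set)
    fp = sum(
        1
        for s, t in sys_set - ref_set
        if s in ref_sources or t in ref_targets
    )
    return tp, fp, fn


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from counts; empty denominators score 0.0."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Toy data: the third system pair touches no entity covered by the
# reference, so under the assumed convention it is ignored, not penalised.
reference = [("a:1", "b:1"), ("a:2", "b:2")]
system = [("a:1", "b:1"), ("a:2", "b:9"), ("a:7", "b:7")]
counts = partial_counts(system, reference)
scores = prf(*counts)
```

Under a complete-reference scorer the pair `("a:7", "b:7")` would count as a second false positive; under the partial convention assumed here it is left out of the counts.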
Averaging: results are reported both micro-averaged (pool the cell counts $TP$, $FP$, $FN$ across all six pairs, then compute P/R/F1 once) and macro-averaged (compute P/R/F1 per pair, then take the unweighted mean over the six pairs).
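The difference between the two averaging modes can be sketched as follows. The per-pair counts and the pair labels `pair-A` and `pair-B` are made up for illustration (the track uses six pairs), and the helper names are hypothetical.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (TP, FP, FN) for one ontology pair


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from counts; empty denominators score 0.0."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def micro_average(per_pair: Dict[str, Counts]) -> Tuple[float, float, float]:
    """Pool the counts over all pairs, then compute P/R/F1 once."""
    tp = sum(c[0] for c in per_pair.values())
    fp = sum(c[1] for c in per_pair.values())
    fn = sum(c[2] for c in per_pair.values())
    return prf(tp, fp, fn)


def macro_average(per_pair: Dict[str, Counts]) -> Tuple[float, float, float]:
    """Compute P/R/F1 per pair, then take the unweighted mean of each."""
    scores = [prf(*c) for c in per_pair.values()]
    n = len(scores)
    p, r, f1 = (sum(s[i] for s in scores) / n for i in range(3))
    return p, r, f1


# Hypothetical counts: a large, easy pair and a small, hard pair.
per_pair = {"pair-A": (8, 2, 2), "pair-B": (1, 1, 3)}
micro = micro_average(per_pair)
macro = macro_average(per_pair)
```

With these toy counts the micro-average is dominated by the larger pair, while the macro-average gives the small pair equal weight and so comes out lower.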
Dual reference: headline scores are computed against the repaired silver standard; the same metrics against the unrepaired silver standard are reported as secondary, for comparison only.
Task 2 - Local Ranking
For the local ranking task, we report Mean Reciprocal Rank (MRR) and Hits@k.
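Given a pair of ontologies $O_1$ and $O_2$, a source entity $e$ from $O_1$, and a pool $C$ of 50 potential target entities (49 entities from $O_2$ plus a NIL entity), a system produces a ranking of $C$, ordered by its confidence that $e \equiv c$ for each $c$ in $C$. Where no equivalence correspondence between $O_1$ and $O_2$ through $e$ exists, the system should instead map $e$ to NIL. Then, writing $r_q$ for the 1-based position of the correct answer (the validated equivalent, or NIL) in the system’s ranking of $C$ for query $q$, and $Q$ for the set of queries,

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{r_q}, \qquad \mathrm{Hits@}k = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\left[r_q \le k\right].$$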
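A minimal sketch of how these two metrics follow from the rank of the correct answer is given below. The query identifiers, candidate names, the `NIL` placeholder string and the small pool sizes are hypothetical. It also assumes every ranking contains the correct answer, which holds when a system ranks the whole pool.

```python
from typing import List, Sequence

NIL = "NIL"  # hypothetical stand-in for the NIL IRI


def rank_of(ranking: Sequence[str], gold: str) -> int:
    """1-based position of the gold answer; raises ValueError if absent."""
    return list(ranking).index(gold) + 1


def mrr(ranks: List[int]) -> float:
    """Mean of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: List[int], k: int) -> float:
    """Fraction of queries whose gold answer is ranked in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


# Toy rankings (best first); real pools hold 50 candidates including NIL.
rankings = {
    "q1": ["t3", "t1", NIL, "t2"],
    "q2": [NIL, "t5", "t4"],
    "q3": ["t8", "t9", "t7", NIL],
}
gold = {"q1": "t1", "q2": NIL, "q3": NIL}

ranks = [rank_of(rankings[q], gold[q]) for q in rankings]
```

Here the gold answers sit at ranks 2, 1 and 4, so `mrr(ranks)` is the mean of 1/2, 1 and 1/4, `hits_at_k(ranks, 1)` is 1/3 and `hits_at_k(ranks, 3)` is 2/3.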
Abstention ($\tau$): a system abstains by ranking the NIL IRI early. The task is unsupervised, so the reference baselines use a fixed threshold $\tau$. NIL is assigned the confidence $\tau$ and the pool is sorted by score, so a candidate scoring below $\tau$ ranks below NIL. Participant systems are free to place NIL however they judge best.
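The threshold rule used by the baselines can be sketched as follows. The function name `rank_with_nil`, the `NIL` placeholder and the example scores are hypothetical, and the handling of a candidate scoring exactly at the threshold (placed above NIL here) is an assumption, since the text only fixes the behaviour for scores below it.

```python
from typing import Dict, List

NIL = "NIL"  # hypothetical stand-in for the NIL IRI


def rank_with_nil(scores: Dict[str, float], tau: float) -> List[str]:
    """Give NIL the confidence tau and sort the pool by descending score.

    `scores` holds the non-NIL candidates only.
    Assumption: a candidate scoring exactly tau is ranked above NIL.
    """
    # The second tuple field breaks ties: candidates (1) sort before NIL (0).
    entries = [(score, 1, cand) for cand, score in scores.items()]
    entries.append((tau, 0, NIL))
    entries.sort(key=lambda e: (e[0], e[1]), reverse=True)
    return [cand for _, _, cand in entries]


confident = rank_with_nil({"t1": 0.9, "t2": 0.4, "t3": 0.6}, tau=0.5)
abstain = rank_with_nil({"t1": 0.2, "t2": 0.1}, tau=0.5)
```

In the first call two candidates clear the threshold, so NIL lands in third place. In the second call no candidate does, so NIL is ranked first and the system effectively abstains.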
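Macro-average: each metric is reported per pair (overall, plus matched-only, NIL-only and per-type breakdowns), and the headline score is the unweighted mean of the two pairs’ overall values,

$$M_{\text{headline}} = \tfrac{1}{2}\left(M_{\text{uco-stix}} + M_{\text{stix-d3fend}}\right), \qquad M \in \{\mathrm{MRR}, \mathrm{Hits@}k\}.$$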
The pairs are weighted equally (not by query count) because a system must do well on both: uco-stix is data-property-dominated and stix-d3fend is class-dominated, so neither pair can be carried by the other.
Simply put: these metrics look at where your ranking places the (private) right answer, that is, the validated equivalent, or NIL when there is none. Hits@$k$ is the fraction of queries whose right answer appears in your top $k$; MRR rewards placing it as early as possible (rank 1 scores $1$, rank 2 scores $1/2$, and so on). The final score averages the two ontology pairs equally.
Note: pool construction and the NIL semantics will be described in the supplementary materials (available at track launch).