Evaluation Metrics

How submissions are scored: partiality-corrected precision, recall and F1 for Task 1; Hits@k and MRR with NIL abstention for Task 2.

This document provides a brief overview of the evaluation metrics used for both tasks. A full description will appear in the supplementary material (available at track launch).

Task 1 - Global Alignment

To score Task 1, we use an implementation equivalent to the scorers in MELT for incomplete, partial reference alignments (i.e., our repaired silver standard). Specifically, we use partiality-corrected precision, recall and f-measure, as defined below.

A^=predictions touching a covered entity,\hat{A} = \text{predictions touching a covered entity}, Precisionpartial=ARA^,\mathrm{Precision}_{partial} = \frac{\vert A \cap R \vert}{\vert \hat{A} \vert}, Recall=ARR.\mathrm{Recall} = \frac{\vert A \cap R \vert}{\vert R \vert}. F1=2PrecisionpartialRecallPrecisionpartial+Recall.\mathrm{F1} = \frac{2 \cdot \mathrm{Precision}_{partial} \cdot \mathrm{Recall}}{\mathrm{Precision}_{partial} + \mathrm{Recall}}.

Averaging: results are reported both micro-averaged (pool the cell counts AR\vert A \cap R \vert, A^\vert \hat{A} \vert, R\vert R \vert across all six pairs, then compute P/R/F1 once) and macro-averaged (compute P/R/F1 per pair, then take the unweighted mean over the six pairs).

Dual reference: headline scores are computed against the repaired silver standard R+R_{\approx}^{+}; the same metrics against the unrepaired RR_{\approx} are reported as secondary, for comparison only.

Task 2 - Local Ranking

For our local ranking task, we measure Mean Reciprocal Rank (MRR) and Hits@k (k={1,3,5,10}k=\{1,3,5,10\}).

Given a pair of ontologies, Osrc\mathcal{O}_{src} and Otgt\mathcal{O}_{tgt}, a source entity from Osrc\mathcal{O}_{src}, denoted esrcEsrce_{src} \in \mathcal{E}_{src}, a set of 50 potential target entities (consisting of 49 entities from Otgt\mathcal{O}_{tgt} and a NIL entity), denoted K\mathcal{K}, produce a ranking for K\mathcal{K}, ordered by confidence that esrcetgte_{src} \equiv e_{tgt} for each etgte_{tgt} in K\mathcal{K}. In the case where an equivalence correspondence between Osrc\mathcal{O}_{src} and Otgt\mathcal{O}_{tgt} through esrce_{src} does not exist, instead map esrce_{src} to NIL. Then, writing rank(esrc)\mathrm{rank}(e_{src}) for the 1-based position of the correct answer (the validated equivalent, or NIL) in the system’s ranking of K\mathcal{K},

Hits@k=1EsrcesrcEsrc1[rank(esrc)k],\mathrm{Hits@}k = \frac{1}{\vert \mathcal{E}_{src} \vert} \sum_{e_{src} \in \mathcal{E}_{src}} \mathbb{1}[\mathrm{rank}(e_{src}) \leq k], MRR=1EsrcesrcEsrc1rank(esrc).\mathrm{MRR} = \frac{1}{\vert \mathcal{E}_{src} \vert} \sum_{e_{src} \in \mathcal{E}_{src}} \frac{1}{\mathrm{rank}(e_{src})}.

Abstention (τ=0.5\tau=0.5): a system abstains by ranking the NIL IRI early. The task is unsupervised, so the reference baselines use a fixed threshold. NIL is assigned the confidence τ=0.5\tau=0.5 and the pool is sorted by score, so a candidate scoring below τ\tau ranks below NIL. Participant systems are free to place NIL however they judge best.

Macro-average: each metric is reported per pair (overall, plus matched-only, NIL-only and per-type breakdowns), and the headline score is the unweighted mean of the two pairs’ overall values,

MACRO(m)=12(muco-stix+mstix-d3fend).\mathrm{MACRO}(m) = \tfrac{1}{2} \left( m_{\text{uco-stix}} + m_{\text{stix-d3fend}} \right).

The pairs are weighted equally (not by query count) because a system must do well on both: uco-stix is data-property-dominated and stix-d3fend is class-dominated, so neither pair can be carried by the other.

Simply put: these metrics look at where your ranking places the (private) right answer — the validated equivalent, or NIL when there is none. Hits@kk is the fraction of queries whose right answer appears in your top kk; MRR rewards placing it as early as possible (rank 1 scores 11, rank 2 scores 1/21/2, and so on). The final score averages the two ontology pairs equally.

Note: pool construction and the NIL semantics will be described in the supplementary materials (available at track launch).