Baselines & Leaderboard

Organiser-run baselines for both tasks. Participant results will appear here once the evaluation window opens (19 July 2026, provisional).

Task 1 — global alignment

Partiality-corrected P/R/F1

Headline scores are computed against the repaired silver standard R⁺. Metrics score all named-entity equivalences in the reference (classes, properties, and individuals alike); see evaluation metrics. Eight matchers cover all six pairs; five further matchers were run only on the two openly redistributable JC3IEDM pairs and are listed separately, because their numbers are not comparable to the full set.
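
To make the scoring concrete, here is a minimal sketch of set-based precision, recall, and F1 for one ontology pair. It assumes an alignment is a set of (source entity, target entity) equivalence pairs; the function name `prf1` and the toy identifiers are hypothetical. It does not implement the partiality correction or the reference repair, which are defined on the evaluation metrics page.

```python
def prf1(system, reference):
    """Precision, recall and F1 of a system alignment against a reference.

    Both arguments are iterables of (source_entity, target_entity) pairs.
    Illustrative only: no partiality correction is applied here.
    """
    system, reference = set(system), set(reference)
    tp = len(system & reference)  # correspondences found in both sets
    precision = tp / len(system) if system else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Toy data with made-up identifiers: classes, a property and an individual.
reference = {
    ("a:Sensor", "b:Sensor"),
    ("a:Zone", "b:Zone"),
    ("a:hasLocation", "b:locatedIn"),
    ("a:room1", "b:room_1"),
}
system = {
    ("a:Sensor", "b:Sensor"),
    ("a:Zone", "b:Zone"),
    ("a:Device", "b:Room"),  # a false positive
}

p, r, f = prf1(system, reference)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

Two of the three system correspondences are in the reference, and two of the four reference correspondences are found, so precision is 2/3 and recall is 1/2.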

Micro-F1 on R⁺ — all six pairs

LogMap: micro-F1 0.858 (P 0.957, R 0.777)
LogMapLt: micro-F1 0.549 (P 0.778, R 0.424)
AML: micro-F1 0.499 (P 0.894, R 0.346)
Matcha: micro-F1 0.463 (P 0.441, R 0.488)
BertMapLt: micro-F1 0.350 (P 0.906, R 0.217)
SecureBertMap: micro-F1 0.335 (P 0.961, R 0.203)
BertMap: micro-F1 0.325 (P 0.941, R 0.197)
CySecBertMap: micro-F1 0.324 (P 0.960, R 0.195)

Comparable set — 8 matchers, all 6 pairs

Matcher P (µ, R⁺) R (µ, R⁺) F1 (µ, R⁺) F1 (M, R⁺) F1 (µ, R≈) F1 (M, R≈)
LogMap 0.957 0.777 0.858 0.813 0.793 0.729
LogMapLt 0.778 0.424 0.549 0.665 0.645 0.740
AML 0.894 0.346 0.499 0.644 0.535 0.635
Matcha 0.441 0.488 0.463 0.566 0.464 0.536
BertMapLt 0.906 0.217 0.350 0.475 0.372 0.479
SecureBertMap 0.961 0.203 0.335 0.469 0.303 0.386
BertMap 0.941 0.197 0.325 0.463 0.295 0.379
CySecBertMap 0.960 0.195 0.324 0.459 0.294 0.377

µ = micro-averaged, M = macro-averaged. Headline = micro-F1 on R⁺.
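
The difference between the µ and M columns can be sketched as follows. Micro-averaging pools true positives, false positives, and false negatives over all pairs before computing F1; macro-averaging computes F1 per pair and takes the unweighted mean. The counts below are invented for illustration and are not taken from the result files; the function names are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


def micro_f1(counts):
    """Pool the counts over all pairs, then compute one F1."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1_from_counts(tp, fp, fn)


def macro_f1(counts):
    """Compute F1 per pair, then take the unweighted mean."""
    return sum(f1_from_counts(*c) for c in counts) / len(counts)


# Hypothetical (tp, fp, fn) per ontology pair: one large pair with low
# recall and one small pair that is matched well.
counts = [(90, 10, 110), (40, 5, 5)]

micro = micro_f1(counts)
macro = macro_f1(counts)
print(f"micro-F1={micro:.3f} macro-F1={macro:.3f}")
```

In this toy case the large, poorly covered pair dominates the micro score while the macro score weights both pairs equally, which is one way a matcher can show a macro-F1 well above its micro-F1.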

Partial coverage — 5 matchers, JC3IEDM pairs only (n=2)
Matcher n pairs P (µ, R⁺) R (µ, R⁺) F1 (µ, R⁺) F1 (M, R⁺)
LogMapLLM 2 0.996 0.877 0.933 0.946
ALOD2Vec 2 0.963 0.255 0.403 0.592
ATMatcher 2 0.926 0.242 0.384 0.537
Fine-TOM 2 0.625 0.161 0.256 0.263
KGMatcher 2 1.000 0.084 0.155 0.268

Run only on jc3iedm-mio and jc3iedm-facility; not comparable with the six-pair set above.

Best matcher per pair

Pair Best matcher F1 (R⁺) P R |R≈| |R⁺≈| Repaired
thinkhome-brick AML 0.774 0.828 0.727 116 66 yes
brick-smartenv LogMap 0.829 0.739 0.944 47 18 yes
city-brick LogMap 0.757 0.778 0.737 61 19 yes
jc3iedm-mio LogMapLLM 0.928 0.996 0.869 267 267 —
jc3iedm-brick LogMapLt 0.819 0.824 0.813 75 75 —
jc3iedm-facility LogMap 0.977 0.977 0.977 43 43 —

“Repaired: —” means the original alignment was already coherent (R ≡ R⁺); no repair was necessary. Full matcher × pair detail: global_detail.tsv · global_aggregate.tsv · global_results.json.

Task 2 — local equivalence ranking

Hits@k and MRR

Reference baselines use a fixed abstention threshold τ=0.5; references are private. The headline is the unweighted macro-average over the two pairs. The silver standard is lexically clean (same-label equivalences dominate), so the naive string baseline beats the neural encoder; the contribution is the task design.
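
As a rough illustration of the metrics, the sketch below computes Hits@k and MRR over ranked candidate lists with an abstention threshold. How abstention enters the ranking is an assumption made here, not the official scorer: a NIL entry is inserted at the point where candidate scores drop below τ, so a query whose gold answer is NIL is scored by the rank of that entry. All names (`rank_with_nil`, `hits_and_mrr`) and the toy queries are hypothetical.

```python
NIL = "NIL"  # assumed placeholder for "no equivalent entity"


def rank_with_nil(scored, tau=0.5):
    """Sort candidates by score and insert NIL where scores fall below tau."""
    ordered = sorted(scored, key=lambda cs: cs[1], reverse=True)
    above = [c for c, s in ordered if s >= tau]
    below = [c for c, s in ordered if s < tau]
    return above + [NIL] + below


def hits_and_mrr(queries, ks=(1, 3, 5, 10), tau=0.5):
    """queries: list of (scored_candidates, gold), where gold may be NIL."""
    hits = {k: 0 for k in ks}
    reciprocal_ranks = 0.0
    for scored, gold in queries:
        ranking = rank_with_nil(scored, tau)
        if gold not in ranking:
            continue  # gold never retrieved: contributes 0 to every metric
        rank = ranking.index(gold) + 1
        reciprocal_ranks += 1.0 / rank
        for k in ks:
            if rank <= k:
                hits[k] += 1
    n = len(queries)
    return {f"Hits@{k}": hits[k] / n for k in ks}, reciprocal_ranks / n


# Toy queries with invented candidates and scores.
queries = [
    ([("x1", 0.9), ("x2", 0.6), ("x3", 0.2)], "x1"),  # gold at rank 1
    ([("y1", 0.7), ("y2", 0.4)], NIL),                 # NIL at rank 2
    ([("z1", 0.3), ("z2", 0.1)], "z2"),                # gold at rank 3
]

metrics, mrr = hits_and_mrr(queries)
print(metrics, f"MRR={mrr:.3f}")
```

Under this assumption NIL queries receive graded reciprocal ranks rather than a binary score; a scorer that treats abstention differently would need a different `rank_with_nil`.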

Macro-averaged over both pairs

Baseline Hits@1 Hits@3 Hits@5 Hits@10 MRR
naive-lexical (char-3-gram Jaccard) 0.835 0.968 0.970 0.986 0.899
all-MiniLM-L6-v2 0.665 0.881 0.943 0.970 0.786

Per pair

Pair Baseline Queries Hits@1 Hits@3 Hits@5 Hits@10 MRR
uco-stix lexical 217 0.917 0.977 0.982 1.000 0.950
uco-stix MiniLM 217 0.728 0.926 0.954 0.982 0.830
stix-d3fend lexical 73 0.753 0.959 0.959 0.973 0.849
stix-d3fend MiniLM 73 0.603 0.836 0.931 0.959 0.743

Matched-only / NIL-only breakdown

Pair Baseline Subset Queries Hits@1 MRR
uco-stix lexical matched 194 0.954 0.977
uco-stix lexical NIL 23 0.609 0.719
uco-stix MiniLM matched 194 0.789 0.873
uco-stix MiniLM NIL 23 0.217 0.465
stix-d3fend lexical matched 38 0.816 0.890
stix-d3fend lexical NIL 35 0.686 0.805
stix-d3fend MiniLM matched 38 0.789 0.871
stix-d3fend MiniLM NIL 35 0.400 0.605
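
The breakdown relates to the per-pair figures by simple arithmetic: the overall Hits@1 for a pair is the query-weighted mean of its matched and NIL subsets, and the headline is the unweighted mean of the two pairs. The sketch below checks this for the lexical baseline using values from the tables on this page; the helper name `combine` is hypothetical, and agreement is only expected to within rounding of the published three-decimal figures.

```python
def combine(subsets):
    """Query-weighted mean of a metric over disjoint subsets.

    subsets: list of (n_queries, metric_value) tuples.
    """
    total = sum(n for n, _ in subsets)
    return sum(n * v for n, v in subsets) / total


# Lexical baseline, Hits@1: (queries, value) for matched and NIL subsets.
uco_stix = combine([(194, 0.954), (23, 0.609)])
stix_d3fend = combine([(38, 0.816), (35, 0.686)])

# Headline: unweighted macro-average of the published per-pair values.
macro_hits1 = (0.917 + 0.753) / 2

print(f"uco-stix overall Hits@1    ~ {uco_stix:.4f} (table: 0.917)")
print(f"stix-d3fend overall Hits@1 ~ {stix_d3fend:.4f} (table: 0.753)")
print(f"macro Hits@1               = {macro_hits1:.3f} (table: 0.835)")
```

Small differences in the third decimal are expected because the subset values are themselves rounded.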

Metrics are reported overall plus matched-only, NIL-only, and per-type; see evaluation metrics and the Task 2 description.