Baselines & Leaderboard
Organiser-run baselines for both tasks. Participant results will appear here once the evaluation window opens (19 July 2026, provisional).
Task 1 — global alignment
Partiality-corrected P/R/F1
Headline scores are computed against the repaired silver standard (R⁺≈). The metrics score all named-entity equivalences in the reference (classes, properties, and individuals alike); see evaluation metrics. Eight matchers cover all six pairs; five further matchers were run only on the two openly redistributable JC3IEDM pairs and are listed separately because their numbers are not comparable to the full set.
Micro-F1 on R⁺≈ — all six pairs
Comparable set — 8 matchers, all 6 pairs
| Matcher | P (µ, R⁺) | R (µ, R⁺) | F1 (µ, R⁺) | F1 (M, R⁺) | F1 (µ, R≈) | F1 (M, R≈) |
|---|---|---|---|---|---|---|
| LogMap | 0.957 | 0.777 | 0.858 | 0.813 | 0.793 | 0.729 |
| LogMapLt | 0.778 | 0.424 | 0.549 | 0.665 | 0.645 | 0.740 |
| AML | 0.894 | 0.346 | 0.499 | 0.644 | 0.535 | 0.635 |
| Matcha | 0.441 | 0.488 | 0.463 | 0.566 | 0.464 | 0.536 |
| BertMapLt | 0.906 | 0.217 | 0.350 | 0.475 | 0.372 | 0.479 |
| SecureBertMap | 0.961 | 0.203 | 0.335 | 0.469 | 0.303 | 0.386 |
| BertMap | 0.941 | 0.197 | 0.325 | 0.463 | 0.295 | 0.379 |
| CySecBertMap | 0.960 | 0.195 | 0.324 | 0.459 | 0.294 | 0.377 |
µ = micro-averaged, M = macro-averaged. Headline = micro-F1 on R⁺≈.
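The gap between the µ and M columns comes from where the averaging happens. The following is a minimal sketch, assuming per-pair true-positive, false-positive and false-negative counts are available. The `Counts` class and the toy numbers are illustrative, not leaderboard data, and the partiality correction itself is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # correspondences found that are in the reference
    fp: int  # correspondences found that are not in the reference
    fn: int  # reference correspondences that were missed


def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def micro(counts):
    """Pool the counts over all pairs, then compute P/R/F1 once."""
    return prf(sum(c.tp for c in counts),
               sum(c.fp for c in counts),
               sum(c.fn for c in counts))


def macro_f1(counts):
    """Compute F1 per pair, then take the unweighted mean."""
    return sum(prf(c.tp, c.fp, c.fn)[2] for c in counts) / len(counts)


# Hypothetical counts: one small pair matched well, one large pair matched poorly.
pairs = [Counts(tp=9, fp=1, fn=1), Counts(tp=10, fp=10, fn=190)]
micro_p, micro_r, micro_f = micro(pairs)
macro_f = macro_f1(pairs)
```

In this toy case the large, low-recall pair dominates the pooled counts, so micro-F1 falls well below macro-F1. An effect of this kind may be part of why a matcher such as LogMapLt scores 0.549 micro but 0.665 macro.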
Partial coverage — 5 matchers, JC3IEDM pairs only (n=2)
| Matcher | n pairs | P (µ, R⁺) | R (µ, R⁺) | F1 (µ, R⁺) | F1 (M, R⁺) |
|---|---|---|---|---|---|
| LogMapLLM | 2 | 0.996 | 0.877 | 0.933 | 0.946 |
| ALOD2Vec | 2 | 0.963 | 0.255 | 0.403 | 0.592 |
| ATMatcher | 2 | 0.926 | 0.242 | 0.384 | 0.537 |
| Fine-TOM | 2 | 0.625 | 0.161 | 0.256 | 0.263 |
| KGMatcher | 2 | 1.000 | 0.084 | 0.155 | 0.268 |
Run only on jc3iedm-mio and jc3iedm-facility; not comparable with the six-pair set above.
Best matcher per pair
| Pair | Best matcher | F1 (R⁺) | P | R | \|R≈\| | \|R⁺≈\| | Repaired |
|---|---|---|---|---|---|---|---|
| thinkhome-brick | AML | 0.774 | 0.828 | 0.727 | 116 | 66 | yes |
| brick-smartenv | LogMap | 0.829 | 0.739 | 0.944 | 47 | 18 | yes |
| city-brick | LogMap | 0.757 | 0.778 | 0.737 | 61 | 19 | yes |
| jc3iedm-mio | LogMapLLM | 0.928 | 0.996 | 0.869 | 267 | 267 | — |
| jc3iedm-brick | LogMapLt | 0.819 | 0.824 | 0.813 | 75 | 75 | — |
| jc3iedm-facility | LogMap | 0.977 | 0.977 | 0.977 | 43 | 43 | — |
“Repaired: —” means the original alignment was already coherent (R≈ ≡ R⁺≈); no repair was necessary. Full matcher × pair detail: global_detail.tsv · global_aggregate.tsv · global_results.json.
Task 2 — local equivalence ranking
Hits@k and MRR
Reference baselines use a fixed abstention threshold τ = 0.5; the references are private. The headline is the unweighted macro-average over the two pairs. The silver standard is lexically clean (same-label equivalences dominate), so the naive string baseline beats the neural encoder; the contribution here is the task design rather than the baselines.
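Hits@k and MRR with abstention can be sketched as follows. This is one plausible reading, not the organisers' scorer: it assumes NIL is treated as a pseudo-candidate with score τ, so a NIL query is a hit at rank 1 only when no real candidate scores above τ. The function names and toy queries are hypothetical.

```python
NIL = None  # marker for "no equivalent exists in the target ontology"


def rank_with_nil(scored, tau=0.5):
    """Sort (candidate, score) pairs by descending score, with NIL inserted at tau."""
    items = list(scored) + [(NIL, tau)]
    # Python's sort is stable, so a real candidate tied with tau stays ahead of NIL.
    items.sort(key=lambda cs: cs[1], reverse=True)
    return [cand for cand, _ in items]


def reciprocal_rank(ranking, gold):
    """1/rank of the gold answer, or 0.0 if it is not in the ranking."""
    for i, cand in enumerate(ranking, start=1):
        if cand == gold:
            return 1.0 / i
    return 0.0


def evaluate(queries, ks=(1, 3, 5, 10), tau=0.5):
    """queries: list of (scored_candidates, gold), where gold is an id or NIL."""
    rrs = [reciprocal_rank(rank_with_nil(s, tau), g) for s, g in queries]
    out = {f"hits@{k}": sum(rr >= 1.0 / k for rr in rrs) / len(rrs) for k in ks}
    out["mrr"] = sum(rrs) / len(rrs)
    return out


# Toy queries (illustrative only): two matched, two NIL.
queries = [
    ([("a", 0.9), ("b", 0.6)], "a"),   # gold ranked first
    ([("c", 0.7), ("d", 0.55)], "d"),  # gold ranked second
    ([("e", 0.3)], NIL),               # nothing above tau: NIL ranked first
    ([("f", 0.8)], NIL),               # confident wrong candidate: NIL ranked second
]
result = evaluate(queries)
```

Under this reading a NIL query can still earn partial reciprocal-rank credit when NIL is not ranked first, which would be consistent with the NIL-only rows reporting an MRR higher than their Hits@1.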
Macro-averaged over both pairs
| Baseline | Hits@1 | Hits@3 | Hits@5 | Hits@10 | MRR |
|---|---|---|---|---|---|
| naive-lexical (char-3-gram Jaccard) | 0.835 | 0.968 | 0.970 | 0.986 | 0.899 |
| all-MiniLM-L6-v2 | 0.665 | 0.881 | 0.943 | 0.970 | 0.786 |
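The naive-lexical row names its similarity: Jaccard overlap of character 3-grams. Below is a minimal sketch of that measure. Lower-casing and the absence of padding are assumptions, since the baseline's exact normalisation is not stated here, and the example labels are made up.

```python
def char_ngrams(label, n=3):
    """Set of character n-grams of a lower-cased label."""
    s = label.lower()
    if len(s) < n:
        return {s}  # short labels are kept as a single gram
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity of the two labels' n-gram sets, in [0, 1]."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def rank_candidates(query_label, candidate_labels):
    """Candidates sorted by descending similarity to the query label."""
    return sorted(candidate_labels,
                  key=lambda c: jaccard(query_label, c),
                  reverse=True)


same = jaccard("Malware", "malware")
close = jaccard("attack pattern", "attack-pattern")
ranked = rank_candidates("attack pattern", ["identity", "attack-pattern"])
```

Labels that differ only in case score 1.0 under these assumptions, which fits the observation that same-label equivalences dominate the silver standard.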
Per pair
| Pair | Baseline | Queries | Hits@1 | Hits@3 | Hits@5 | Hits@10 | MRR |
|---|---|---|---|---|---|---|---|
| uco-stix | lexical | 217 | 0.917 | 0.977 | 0.982 | 1.000 | 0.950 |
| uco-stix | MiniLM | 217 | 0.728 | 0.926 | 0.954 | 0.982 | 0.830 |
| stix-d3fend | lexical | 73 | 0.753 | 0.959 | 0.959 | 0.973 | 0.849 |
| stix-d3fend | MiniLM | 73 | 0.603 | 0.836 | 0.931 | 0.959 | 0.743 |
Matched-only / NIL-only breakdown
| Pair | Baseline | Subset | Queries | Hits@1 | MRR |
|---|---|---|---|---|---|
| uco-stix | lexical | matched | 194 | 0.954 | 0.977 |
| uco-stix | lexical | NIL | 23 | 0.609 | 0.719 |
| uco-stix | MiniLM | matched | 194 | 0.789 | 0.873 |
| uco-stix | MiniLM | NIL | 23 | 0.217 | 0.465 |
| stix-d3fend | lexical | matched | 38 | 0.816 | 0.890 |
| stix-d3fend | lexical | NIL | 35 | 0.686 | 0.805 |
| stix-d3fend | MiniLM | matched | 38 | 0.789 | 0.871 |
| stix-d3fend | MiniLM | NIL | 35 | 0.400 | 0.605 |
Metrics are reported overall plus matched-only, NIL-only, and per-type; see evaluation metrics and the Task 2 description.
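The Hits@1 figures in the three Task 2 tables can be cross-checked against one another. The sketch below assumes the per-pair score is the query-weighted combination of the matched and NIL subsets and that the headline is the plain mean of the two pairs. Whole hit counts are inferred by rounding and are not published values.

```python
def hits(n_queries, hits_at_1):
    """Infer the whole number of hits from a rounded Hits@1 value."""
    return round(n_queries * hits_at_1)


def pair_score(subsets):
    """subsets: list of (n_queries, hits_at_1); returns the pooled Hits@1."""
    total_hits = sum(hits(n, h) for n, h in subsets)
    total_queries = sum(n for n, _ in subsets)
    return total_hits / total_queries


# Lexical baseline, values copied from the matched/NIL breakdown table.
uco_stix = pair_score([(194, 0.954), (23, 0.609)])
stix_d3fend = pair_score([(38, 0.816), (35, 0.686)])
macro = (uco_stix + stix_d3fend) / 2  # unweighted mean over the two pairs

print(round(uco_stix, 3), round(stix_d3fend, 3), round(macro, 3))
# prints: 0.917 0.753 0.835
```

The same check does not carry over to MRR, because reciprocal ranks cannot be recovered as whole counts from a rounded mean.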