Baselines & Leaderboard

Organiser-run baselines for both tasks. Participant results will appear here once the evaluation window opens (19 July 2026, provisional).

Task 1 — global alignment

Partiality-corrected P/R/F1

Headline scores are computed against the repaired silver standard, R⁺_≈. Metrics score all named-entity equivalences in the reference (classes, properties, and individuals alike); see evaluation metrics. Eight matchers cover all six pairs; five further matchers were run only on the two openly redistributable JC3IEDM pairs and are listed separately (their numbers are not comparable to the full set).

Micro-F1 on R⁺_≈ — all six pairs

Comparable set — 8 matchers, all 6 pairs

Matcher	P (µ, R⁺)	R (µ, R⁺)	F1 (µ, R⁺)	F1 (M, R⁺)	F1 (µ, R≈)	F1 (M, R≈)
LogMap	0.957	0.777	0.858	0.813	0.793	0.729
LogMapLt	0.778	0.424	0.549	0.665	0.645	0.740
AML	0.894	0.346	0.499	0.644	0.535	0.635
Matcha	0.441	0.488	0.463	0.566	0.464	0.536
BertMapLt	0.906	0.217	0.350	0.475	0.372	0.479
SecureBertMap	0.961	0.203	0.335	0.469	0.303	0.386
BertMap	0.941	0.197	0.325	0.463	0.295	0.379
CySecBertMap	0.960	0.195	0.324	0.459	0.294	0.377

µ = micro-averaged, M = macro-averaged. Headline = micro-F1 on R⁺_≈.
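
To make the µ / M distinction concrete, here is a minimal Python sketch. It assumes micro-averaging pools the match counts over all pairs before computing P, R and F1, and that macro-averaging takes the unweighted mean of per-pair F1. The function names and the toy counts are hypothetical, and the partiality correction itself is not modelled.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_macro_f1(counts):
    """counts: list of (true_positives, n_predicted, n_reference), one per pair."""
    # Micro: pool the counts over all pairs, then compute P, R, F1 once.
    tp = sum(c[0] for c in counts)
    pred = sum(c[1] for c in counts)
    ref = sum(c[2] for c in counts)
    micro = f1(tp / pred if pred else 0.0, tp / ref if ref else 0.0)
    # Macro: compute F1 per pair, then take the unweighted mean.
    per_pair = [f1(t / p if p else 0.0, t / r if r else 0.0) for t, p, r in counts]
    macro = sum(per_pair) / len(per_pair)
    return micro, macro


# LogMap row from the comparable set: P = 0.957, R = 0.777.
logmap_f1 = round(f1(0.957, 0.777), 3)

# Hypothetical counts (not challenge data): one large pair, one small pair.
toy_micro, toy_macro = micro_macro_f1([(90, 100, 200), (9, 10, 10)])
```

With the toy counts, the large low-recall pair dominates the micro score while the macro score weights both pairs equally. The same effect is one plausible reason the µ and M columns diverge for matchers whose recall varies strongly across pairs.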

Partial coverage — 5 matchers, JC3IEDM pairs only (n=2)

Matcher	n pairs	P (µ, R⁺)	R (µ, R⁺)	F1 (µ, R⁺)	F1 (M, R⁺)
LogMapLLM	2	0.996	0.877	0.933	0.946
ALOD2Vec	2	0.963	0.255	0.403	0.592
ATMatcher	2	0.926	0.242	0.384	0.537
Fine-TOM	2	0.625	0.161	0.256	0.263
KGMatcher	2	1.000	0.084	0.155	0.268

Run only on jc3iedm-mio and jc3iedm-facility; not comparable with the six-pair set above.

Best matcher per pair

Pair	Best matcher	F1 (R⁺)	P	R	|R≈|	|R⁺≈|	Repaired
thinkhome-brick	AML	0.774	0.828	0.727	116	66	yes
brick-smartenv	LogMap	0.829	0.739	0.944	47	18	yes
city-brick	LogMap	0.757	0.778	0.737	61	19	yes
jc3iedm-mio	LogMapLLM	0.928	0.996	0.869	267	267	—
jc3iedm-brick	LogMapLt	0.819	0.824	0.813	75	75	—
jc3iedm-facility	LogMap	0.977	0.977	0.977	43	43	—

“Repaired: —” means the original alignment was already coherent (R_≈ ≡ R⁺_≈); no repair was necessary. Full matcher × pair detail: global_detail.tsv · global_aggregate.tsv · global_results.json.

Task 2 — local equivalence ranking

Hits@k and MRR

Reference baselines use a fixed abstention threshold τ=0.5; references are private. The headline is the unweighted macro-average over the two pairs. The silver standard is lexically clean (same-label equivalences dominate), so the naive string baseline beats the neural encoder; the contribution is the task design.
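
As an illustration of how a char-3-gram Jaccard ranker with an abstention threshold might work, the following Python sketch ranks candidate labels for a query and scores the ranking with Hits@k and reciprocal rank. It is not the organisers' implementation: the lower-casing, the NIL marker, the rule of placing NIL first when the best score falls below τ, and the example labels are all assumptions.

```python
NIL = None  # hypothetical marker for "no equivalent in the target ontology"


def char_ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams of a lower-cased label."""
    s = text.lower()
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two labels' character 3-gram sets."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def rank(query: str, candidates: list, tau: float = 0.5) -> list:
    """Candidates sorted by similarity; NIL goes first if the best score < tau."""
    scored = sorted(candidates, key=lambda c: jaccard(query, c), reverse=True)
    if not scored or jaccard(query, scored[0]) < tau:
        return [NIL] + scored  # abstain: predict "no match" at rank 1
    return scored


def hits_at_k(ranking: list, gold, k: int) -> float:
    """1.0 if the gold answer (a label or NIL) is in the top k, else 0.0."""
    return 1.0 if gold in ranking[:k] else 0.0


def reciprocal_rank(ranking: list, gold) -> float:
    """1 / rank of the gold answer, or 0.0 if it is absent from the ranking."""
    return 1.0 / (ranking.index(gold) + 1) if gold in ranking else 0.0


# Hypothetical labels, for illustration only.
matched = rank("malware", ["Tool", "MalwareAnalysis", "Malware"])
abstained = rank("course-of-action", ["Malware", "Tool"])
```

Here `matched` puts the exact-label candidate first, while `abstained` starts with NIL because no candidate shares a 3-gram with the query. Hits@k and MRR over a query set would then be the means of `hits_at_k` and `reciprocal_rank` across queries.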

Macro-averaged over both pairs

Baseline	Hits@1	Hits@3	Hits@5	Hits@10	MRR
naive-lexical (char-3-gram Jaccard)	0.835	0.968	0.970	0.986	0.899
all-MiniLM-L6-v2	0.665	0.881	0.943	0.970	0.786

Per pair

Pair	Baseline	Queries	Hits@1	Hits@3	Hits@5	Hits@10	MRR
uco-stix	lexical	217	0.917	0.977	0.982	1.000	0.950
uco-stix	MiniLM	217	0.728	0.926	0.954	0.982	0.830
stix-d3fend	lexical	73	0.753	0.959	0.959	0.973	0.849
stix-d3fend	MiniLM	73	0.603	0.836	0.931	0.959	0.743
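
The macro-averaged headline can be recomputed from the per-pair rows above as an unweighted mean over the two pairs. The short Python check below uses the rounded values from the table, so it agrees with the published macro row only up to rounding in the third decimal; the dictionary layout and function name are illustrative.

```python
from statistics import mean

# Per-pair rows copied from the table: (Hits@1, Hits@3, Hits@5, Hits@10, MRR).
per_pair = {
    "lexical": {
        "uco-stix": (0.917, 0.977, 0.982, 1.000, 0.950),
        "stix-d3fend": (0.753, 0.959, 0.959, 0.973, 0.849),
    },
    "MiniLM": {
        "uco-stix": (0.728, 0.926, 0.954, 0.982, 0.830),
        "stix-d3fend": (0.603, 0.836, 0.931, 0.959, 0.743),
    },
}


def macro_average(rows: dict) -> tuple:
    """Unweighted mean of each metric column over the pairs."""
    return tuple(mean(column) for column in zip(*rows.values()))


macro = {name: macro_average(rows) for name, rows in per_pair.items()}
```

Because the mean is unweighted, the 73-query stix-d3fend pair counts as much as the 217-query uco-stix pair.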

Matched-only / NIL-only breakdown

Pair	Baseline	Subset	Queries	Hits@1	MRR
uco-stix	lexical	matched	194	0.954	0.977
uco-stix	lexical	NIL	23	0.609	0.719
uco-stix	MiniLM	matched	194	0.789	0.873
uco-stix	MiniLM	NIL	23	0.217	0.465
stix-d3fend	lexical	matched	38	0.816	0.890
stix-d3fend	lexical	NIL	35	0.686	0.805
stix-d3fend	MiniLM	matched	38	0.789	0.871
stix-d3fend	MiniLM	NIL	35	0.400	0.605

Metrics are reported overall plus matched-only, NIL-only, and per-type; see evaluation metrics and the Task 2 description.
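
The breakdown is consistent with the per-pair table if the matched and NIL subsets partition each pair's queries (194 + 23 = 217 and 38 + 35 = 73). Under that assumption, the overall score is the query-weighted mean of the two subset scores. A small Python check using the rounded table values follows; the helper name is hypothetical.

```python
def combine(matched: tuple, nil: tuple) -> float:
    """Each argument is (n_queries, score); returns the query-weighted mean."""
    (n_m, s_m), (n_n, s_n) = matched, nil
    return (n_m * s_m + n_n * s_n) / (n_m + n_n)


# uco-stix, lexical baseline: Hits@1 and MRR from the breakdown table.
uco_lexical_hits1 = combine((194, 0.954), (23, 0.609))
uco_lexical_mrr = combine((194, 0.977), (23, 0.719))

# stix-d3fend, MiniLM baseline: Hits@1 from the breakdown table.
d3fend_minilm_hits1 = combine((38, 0.789), (35, 0.400))
```

The results match the per-pair rows (0.917, 0.950, and 0.603) up to rounding in the third decimal.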