Validate spec_urls based on webref ids #23958

Elchi3 · 2024-07-26T10:21:08Z

Draft testing PR for @tidoust :)

Based on w3c/webref#1198 (comment), I wrote a quick test to see if webref ids could be used to (deeply) validate BCD's spec_urls. That is, we want to check that the fragment ids are valid as well, not just the spec hosts.

It spits out a lot of errors and I would be interested to hear if BCD should be using different fragment ids, or if webref is missing these fragment ids, or if something else is going on. Please see the CI failure for the results.

(This is a draft PR that removes our dependency on web-specs and instead fetches raw webref JSON files. We might not want to fetch the data this way, so consider this PR just a test for now.)
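The core of the check described above could be sketched as follows. This is only an illustration, not the code in lint/linter/test-spec-urls.ts: the function name `findUnknownSpecUrls` and the `knownIds` set are assumptions, and the ids extracts are assumed to boil down to a set of absolute URLs with fragments.

```typescript
// Hypothetical helper, not the linter code from this PR.
// `knownIds` stands for the URLs (with fragments) collected from Webref ids extracts.
function findUnknownSpecUrls(
  specUrls: string[],
  knownIds: Set<string>,
): string[] {
  return specUrls.filter((url) => {
    // URLs without a fragment are out of scope for this deep check.
    if (!url.includes("#")) {
      return false;
    }
    // Anything with a fragment must match a known id exactly.
    return !knownIds.has(url);
  });
}
```

A set lookup keeps the check cheap even with tens of thousands of ids. Any normalization (decoding, filenames, series URLs) would have to happen before the lookup.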

tidoust · 2024-07-26T14:15:29Z

> It spits out a lot of errors and I would be interested to hear if BCD should be using different fragment ids, or if webref is missing these fragment ids, or if something else is going on. Please see the CI failure for the results.

I'd say that the good news is that, in most cases, it seems that "something else is going on" ;)

Main categories of errors I see:

- Fragments in ids extracts are percent-encoded. That does not seem to be the case for URLs used in BCD. If percent-encoding seems wrong, we could perhaps change that (I'm always confused as to when that is needed, or good practice). Otherwise, for comparison purposes, some preprocessing is going to be required, see inline.
- The URL used in BCD may have the filename, e.g., index.html in https://webassembly.github.io/spec/js-api/index.html#dom-globaldescriptor-mutable. URLs in the ids extracts don't have the filename (except for multipage specs!). The index.json file in Webref contains a nightly.filename property for each spec that could be used to create the URL variants with the filename if needed. Alternatively, it could perhaps be a good idea to drop that filename in BCD data?
- Many URLs in BCD use the series URL, whereas ids extracts in Webref are per specification. For example, you'll have https://drafts.csswg.org/css-logical/#position-properties in BCD, while you'll find https://drafts.csswg.org/css-logical-1/#position-properties in Webref. To find the right level in Webref, you'll need to look at entries in index.json in Webref with the same series.nightlyUrl as the URL (without fragment) used in BCD, then select the entry whose shortname is equal to series.currentSpecification.
- The code does not handle the case where a spec does not define any ID. That does happen for some WebGL extensions referenced by BCD, such as WebGL EXT_disjoint_timer_query.
- For IETF RFCs produced by the HTTP WG, Webref prefers the httpwg.org URL because its rendering is slightly more user-friendly, whereas BCD seems to use www.rfc-editor.org URLs. The latter URL appears as the canonical url in Webref, so it should be relatively easy to find what you want if BCD wants to keep using that origin.
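The first two categories (percent-encoded fragments and filenames) amount to a normalization step before comparison. A minimal sketch, assuming a hypothetical `normalizeSpecUrl` helper and assuming the caller passes the filename (for instance the `nightly.filename` value from index.json) only for single-page specs:

```typescript
// Hypothetical normalization step; the function name and signature are assumptions.
function normalizeSpecUrl(rawUrl: string, filename?: string): string {
  const url = new URL(rawUrl);
  // Compare decoded fragments, since ids extracts percent-encode them.
  let fragment = url.hash.slice(1);
  try {
    fragment = decodeURIComponent(fragment);
  } catch {
    // Malformed escape sequence: keep the fragment as it was.
  }
  // Drop the filename (e.g. "index.html") when the caller provides one.
  let path = url.pathname;
  if (filename && path.endsWith("/" + filename)) {
    path = path.slice(0, -filename.length);
  }
  // Query strings are ignored in this sketch.
  return `${url.origin}${path}${fragment ? "#" + fragment : ""}`;
}
```

For example, `normalizeSpecUrl("https://webassembly.github.io/spec/js-api/index.html#dom-globaldescriptor-mutable", "index.html")` yields the URL without `index.html`. Multipage specs keep their filename, so no filename should be passed for them.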

And then there are actual broken links in BCD, such as https://tc39.es/proposal-temporal/#sec-get-temporal.zoneddatetime.prototype.timezone. There are also "outdated" URLs, such as https://tc39.es/ecma262/multipage/additional-ecmascript-features-for-web-browsers.html#sec-object.prototype.__defineGetter__, which redirects to https://tc39.es/ecma262/multipage/fundamental-objects.html#sec-object.prototype.__defineGetter__ that appears in Webref.

There may be a few other error cases to dig into.
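For the series URL category listed above, the lookup in index.json could be sketched like this. The `SpecEntry` shape is limited to the properties named in the comment, plus `nightly.url`, which is assumed here to hold the per-level URL. The function name `resolveSeriesUrl` is hypothetical.

```typescript
// Minimal, assumed shape of an index.json entry.
interface SpecEntry {
  shortname: string;
  series: { nightlyUrl?: string; currentSpecification: string };
  nightly?: { url: string };
}

// Maps a series URL to the URL of the current specification in that series.
function resolveSeriesUrl(rawUrl: string, specs: SpecEntry[]): string {
  const hashIndex = rawUrl.indexOf("#");
  const base = hashIndex === -1 ? rawUrl : rawUrl.slice(0, hashIndex);
  const fragment = hashIndex === -1 ? "" : rawUrl.slice(hashIndex);
  // Same series URL, and the entry is the current specification of the series.
  const current = specs.find(
    (spec) =>
      spec.series.nightlyUrl === base &&
      spec.shortname === spec.series.currentSpecification,
  );
  // Not a series URL (or no nightly URL known): leave the input untouched.
  if (!current?.nightly) {
    return rawUrl;
  }
  return current.nightly.url + fragment;
}
```

URLs that do not match any series are returned unchanged, so the function can be applied to every spec_url before validation.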

lint/linter/test-spec-urls.ts

Elchi3 · 2024-07-27T08:50:40Z

Fantastique François!! 🎉
Thanks for the very useful review comment! I've updated the script :) Now we're down to just 269 problems found! :)

What I see now:

- We should change rfc-editor urls to httpwg urls
- Filenames should be omitted
- Quite a few legit broken fragment links that need to be fixed in BCD (yay, these are the ones I want to chase with this exercise)

Something I would like you to take a look at:

There are about 22 links to HTML multipage fragments and upon spot checking they work. Maybe these are missing in webref or what am I missing?

tidoust · 2024-07-27T15:45:18Z

> There are about 22 links to HTML multipage fragments and upon spot checking they work. Maybe these are missing in webref or what am I missing?

As far as I can tell, all of them are examples of what I called outdated links: they work, but that's because the HTML spec has logic in place to redirect past fragments to their new page. Each time, the content referenced by the link moved to another page of the HTML spec and would better be targeted using the new fragment to avoid a redirect.

For example, clicking on https://html.spec.whatwg.org/multipage/browsing-the-web.html#dom-beforeunloadevent-returnvalue makes you load the browsing-the-web.html page, which includes some JavaScript that detects the fragment, knows it no longer exists in that page, and redirects you to the nav-history-apis.html page where the content was moved. The final URL is https://html.spec.whatwg.org/multipage/nav-history-apis.html#dom-beforeunloadevent-returnvalue. That final URL appears in Webref. Ideally, BCD would always use such final URLs to avoid redirects that consume a bit of time, bandwidth and energy.
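One way to flag such outdated links automatically is to look for the same fragment on another page of the same multipage spec. The sketch below is an assumption about how that could be done, not something the linter is known to do. `findMovedFragment` is a hypothetical name and `knownIds` stands for the URLs found in Webref.

```typescript
// Hypothetical helper: finds the same fragment on another page in the same directory.
function findMovedFragment(
  rawUrl: string,
  knownIds: string[],
): string | undefined {
  const hashIndex = rawUrl.indexOf("#");
  if (hashIndex === -1) {
    return undefined;
  }
  // Includes the leading "#", so endsWith() below matches the whole fragment.
  const fragment = rawUrl.slice(hashIndex);
  // Directory holding the pages, e.g. "https://html.spec.whatwg.org/multipage/".
  const directory = rawUrl.slice(0, rawUrl.lastIndexOf("/", hashIndex) + 1);
  return knownIds.find(
    (id) => id !== rawUrl && id.startsWith(directory) && id.endsWith(fragment),
  );
}
```

A match is only a suggestion for the final URL. It still needs a human check, since the same fragment could in principle exist on a different page for unrelated reasons.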

For context, see discussion starting at: mdn/browser-compat-data#23958 (comment)

Some specs such as DOM, Encoding, HTML contain sections targeted at web developers. These sections re-define terms normatively defined elsewhere in a more developer-friendly way. Terms re-defined in these sections are good targets for documentation but did not appear in definitions extracts.

This update makes Reffy parse "for web developers" sections and extract the definitions they contain. This is a prerequisite to publishing a package with definitions that could be used to validate URLs in BCD and web-features, as envisioned in: w3c/webref#1198 (comment)

Worth noting:

- Ideally, spec authoring tools would provide better support for this pattern, giving definitions more stable IDs than `ref-for-[foo][number]` and creating proper dfns themselves. If they do that, the custom processing introduced here would become moot. Going through tools and specs will take time though. The custom processing done here makes it possible to add the definitions right away. It does not solve the "unstable" IDs issue, but at least provides a theoretical way to identify situations where the ID of a dev dfn changes.
- To keep the cross-references database useful, newly extracted definitions need to be in a separate dfn namespace, i.e., have their own dfn type. Problem is that they also have a "natural" dfn type such as `interface`, `method` or `attribute`. The solution implemented here is to prefix their type with `dev-`. That duplicates dfn types. A cleaner solution would record the "dev" namespace in another property. But that would surprise spec authoring tools. An alternative approach would be to give all of these dfns a `dev` dfn type, but we'd then lose information that could turn out to be useful.
- The key marker for sections targeted at web developers is the use of a `domintro` class. Now, a few specs do use `domintro` in normative definition lists (shape-detection-api, image-capture, mediastream-recording). That's probably unintentional. I'll look into fixing the specs. The code skips `domintro` sections that look suspicious.
- This would add **2815 definitions** to the dfns extracts (which currently contain ~50000 definitions)

For context, see discussion starting at: mdn/browser-compat-data#23958 (comment)

Some specs such as DOM, Encoding, HTML contain sections targeted at web developers. These sections re-define terms normatively defined elsewhere in a more developer-friendly way. Terms re-defined in these sections are good targets for documentation but did not appear in definitions extracts.

This update makes Reffy parse "for web developers" sections and extract the links that complete definitions they contain. This is a prerequisite to publishing a package with definitions that could be used to validate URLs in BCD and web-features, as envisioned in: w3c/webref#1198 (comment)

The links are recorded in a `links` property attached to the base definition that the link completes. The `links` property is an array of objects, each object featuring `id`, `href`, `type`, `name` and `heading` properties. The `type` property is always set to `"dev"`. The `name` property contains the text content of the enclosing `<dt>`. The `heading` property contains the heading of the section where the anchor is defined (it may be different from the heading of the section where the underlying definition appears).

There may be more than one dev link per definition. That's normal. It typically happens when the underlying definition is for a mixin included in multiple interfaces, as for `TextDecoderCommon` attributes in the Encoding spec.

Some links for developers target definitions in external specs. They are ignored for now.

Worth noting:

- Ideally, spec authoring tools would provide better support for this pattern, giving these links more stable IDs than `ref-for-[foo][number]` and possibly creating proper dfns themselves. If they do that, processing may need to be adjusted. Updating tools and specs will take time though.
- The key marker for sections targeted at web developers is the use of a `domintro` class. Now, a few specs do use `domintro` in normative definition lists (shape-detection-api, image-capture, mediastream-recording). That's probably unintentional. I'll look into fixing the specs. The code skips `domintro` sections that look suspicious.
- This would add **2815 links** to the dfns extracts (for ~50000 definitions)
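A consumer that validates URLs against the dfns extracts would need to accept both the definition anchors and the anchors recorded under `links`. A minimal sketch, assuming each dfn also carries its own `href` (the interfaces below only model the properties needed here, and `collectAnchors` is a hypothetical name):

```typescript
// Assumed shapes, reduced to what the sketch needs.
interface DevLink {
  id: string;
  href: string;
  type: string; // always "dev" per the description above
  name: string;
}

interface Dfn {
  id: string;
  href: string;
  links?: DevLink[];
}

// Gathers every URL that a spec_url may legitimately target.
function collectAnchors(dfns: Dfn[]): Set<string> {
  const anchors = new Set<string>();
  for (const dfn of dfns) {
    anchors.add(dfn.href);
    // A definition may have zero, one, or several dev links.
    for (const link of dfn.links ?? []) {
      anchors.add(link.href);
    }
  }
  return anchors;
}
```

Treating `links` as optional keeps the sketch working with extracts produced before the property existed.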

tidoust

Dfns in Webref now contain production rules from the ECMAScript spec, and links that are "for web developers". The latter requires minor adjustments to the code, see inline.

Running the linter locally, this gets us down to 405 links that fail validation and that target 319 different fragments.

This includes 29 new links to ECMAScript proposals that got integrated in the main ECMAScript spec. We've flagged the specs as "discontinued" in browser-specs accordingly and links should be updated in BCD. I guess that shows the merit of running such a validation ;)

I'll look into the 111 #ref-for links which, as far as I can tell, all target IDL blocks. I may be able to extend the newly introduced links mechanism with these.

~~I noticed a few RFCs missing from browser-specs and that could probably be added.~~ (Edit: They're not missing in practice but extraction of headings seems to fail for some reason. Anyway, I'll look into it)

I also note that api.Element.attachShadow links to https://dom.spec.whatwg.org/#ref-for-dom-element-attachshadow① and... that seems to be a good example of a link that changed without anyone noticing? I suspect the link should rather target https://dom.spec.whatwg.org/#ref-for-dom-element-attachshadow②

lint/linter/test-spec-urls.ts

Co-authored-by: François Daoust <fd@tidoust.net>

tidoust · 2025-07-11T15:04:54Z

For links to IDL terms, specs may follow two distinct patterns:

- Definition in the IDL block, referenced by the prose. For example, the definition of pictureInPictureEnabled appears in an IDL block, while the "definition" in prose is actually a reference.
- Definition in prose, referenced by the IDL block. For example, the definition of vendorId appears in prose, while the IDL block contains a reference.

I don't think there's broad agreement on which approach is the right one. I note that BCD does not seem to be consistent either, and contains links that target either the IDL block or the prose. Typically, among links that fail validation, there are both:

- links to the prose when the first pattern is used, e.g., api.Document.exitPictureInPicture targets https://w3c.github.io/picture-in-picture/#ref-for-dom-document-exitpictureinpicture
- links to the IDL block when the second pattern is used, e.g., api.DeviceOrientationEvent.absolute targets https://w3c.github.io/deviceorientation/#ref-for-dom-deviceorientationevent-absolute

Question is: from a BCD perspective, what would you like links to target for IDL terms? The IDL block or the prose?

Practically speaking, I can easily extract the reference in the IDL block when the second pattern is used. I do not see an easy way to extract the "main" reference in prose when the first pattern is used. Perhaps a rule such as "first normative occurrence right after the IDL block" could work though.

As usual, targeting references that are not qualified in any way is not fantastic: api.ShadowRoot.pictureInPictureElement currently targets https://w3c.github.io/picture-in-picture/#ref-for-dom-documentorshadowroot-pictureinpictureelement①⑤ and I suspect the intended link is rather to https://w3c.github.io/picture-in-picture/#ref-for-dom-documentorshadowroot-pictureinpictureelement①③ (no doubt the spec got updated in the meantime and added two more references before this one). That's not the only case: on top of attachShadow, which I reported in my previous comment, links to deviceorientation seem mostly bogus at first sight. Did I mention already that these reference links are brittle and dangerous? ;)

lint/linter/test-spec-urls.ts

Co-authored-by: François Daoust <fd@tidoust.net>

Elchi3 · 2025-07-14T16:36:05Z

We usually use anchors in the form of #dom-interface-member and not #ref-for-interface-member(number).

So, #23958 (comment) and ~25% of the failures should get fixed by #27293.
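As an illustration of that convention, a `#ref-for-` fragment can be mapped back to a candidate definition anchor by dropping the prefix and the trailing counter. This sketch assumes the counter is always made of circled digits, as in the examples quoted in this thread. `toDefinitionAnchor` is a hypothetical name, and the result still has to be checked against Webref.

```typescript
// Circled digits (⓪ and ① to ⑨) used as counters at the end of #ref-for- anchors.
const CIRCLED_DIGITS = /[\u24EA\u2460-\u2468]+$/u;

// Returns the candidate definition anchor, or undefined for other fragments.
function toDefinitionAnchor(fragment: string): string | undefined {
  if (!fragment.startsWith("ref-for-")) {
    return undefined;
  }
  return fragment.slice("ref-for-".length).replace(CIRCLED_DIGITS, "");
}
```

With this, `ref-for-dom-element-attachshadow①` maps to `dom-element-attachshadow`, which matches the #dom-interface-member form.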

tidoust · 2025-07-15T07:57:48Z

> We usually use anchors in the form of #dom-interface-member and not #ref-for-interface-member(number).

Ah, that certainly works for me ;) I assumed that you would want to treat these anchors like the ones that point at "for web developers" sections, meaning that you were looking to point at a specific place rather than always follow what the underlying spec uses as main definition anchor.

Elchi3 · 2025-07-15T08:08:38Z

I think it is fine, but I would certainly welcome it if spec authors could agree on consistency for #dom-interface-member anchors and recognize the importance of these main definition anchors for projects like BCD.

Elchi3 · 2025-07-21T12:47:53Z

Down to 207 issues! Seems like lots of problems with SVG and WebGL at this point.

tidoust · 2025-07-23T13:26:36Z

WebGL1 problems should disappear once a new version of Reffy gets released (that's pending release of a new version of webidl2.js with support for the new async_iterable generic type).

For SVG links, the specs need an overhaul. I'm reluctant to spend time in Reffy to try to make sense of IDs in these specs in the meantime (for example, for IDL term definitions, we would need to parse the IDL to make sense of the definitions, but the IDL in SVG Paths is currently invalid, so we would need to patch it first). May I suggest skipping checks for links that target https://svgwg.org/? The links look good and the SVG specs don't really change for now.
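Skipping an origin could be as small as the sketch below. The exception list the PR actually uses is not shown in this thread, so `SKIPPED_ORIGINS` and `shouldValidate` are assumptions for illustration.

```typescript
// Hypothetical exception list: origins whose links are not validated.
const SKIPPED_ORIGINS = ["https://svgwg.org"];

// Returns false for URLs hosted on a skipped origin.
function shouldValidate(rawUrl: string): boolean {
  const origin = new URL(rawUrl).origin;
  return !SKIPPED_ORIGINS.includes(origin);
}
```

Comparing origins rather than string prefixes avoids accidentally skipping look-alike hosts such as `svgwg.org.example`.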

@Elchi3, the remaining ~100 links seem to be mainly things that need to be reviewed one by one. I see links where specs could perhaps be updated. Others where BCD probably should. There will no doubt be a few that need to be handled as exceptions to the rule. I'd be happy to hop on the phone and go through them with you. Feel free to ping me if you would fancy that!

Elchi3 · 2025-07-24T14:48:38Z

Thanks @tidoust! Now we're down to 53! I will look through them and will probably have questions for you :)

Validate spec_urls based on webref ids

f64e360

github-actions bot added the infra and linter labels Jul 26, 2024

tidoust reviewed Jul 26, 2024

Address feedback from François

7116139

decode hashes as suggested

b0c0149

Merge branch 'main' into spec-url-validator

480d503

This was referenced Jul 31, 2024

Update Temporal data #24005

Merged

Fix various CSS spec_urls #24008

Merged

Merge branch 'main' into spec-url-validator

7be9f82

Elchi3 mentioned this pull request Sep 16, 2024

RTCPeerConnection.createDTMFSender is non-standard #24442

Merged

tidoust mentioned this pull request Sep 18, 2024

Consistent guidelines spec links, especially in CSS web-platform-dx/web-features#1785

Open

Merge branch 'main' into spec-url-validator

a512cd7

github-actions bot added the size:m label Dec 10, 2024

Merge branch 'main' into spec-url-validator

fea6ce6

This was referenced Dec 17, 2024

Update spec urls for Login Status API #25446

Merged

Fix spec urls for DedicatedWorkerGlobalScope message events #25447

Merged

GPUSupportedLimits.maxInterStageShaderComponents no longer standard #25448

Merged

Remove MLContext.compute #25449

Merged

Elchi3 mentioned this pull request Jan 24, 2025

AggregateError serialization is standard track #25744

Merged

Elchi3 mentioned this pull request Feb 19, 2025

Allow standard-track sub-features without implementation / intent to implement #25954

Open

tidoust mentioned this pull request Jul 7, 2025

Add anchors for web developers to dfns extracts w3c/reffy#1875

Merged

Feedback per tidoust

bc405aa

Elchi3 mentioned this pull request Jul 7, 2025

Use more stable headings/dfns for ECMAScript spec_urls #27246

Merged

tidoust reviewed Jul 11, 2025

Take into account links properties

d4b2e9a

Co-authored-by: François Daoust <fd@tidoust.net>

github-actions bot removed the size:m label Jul 11, 2025

Elchi3 added 2 commits July 11, 2025 17:00

Fix code style

67d8ffc

Merge branch 'main' into spec-url-validator

61dbcae

tidoust mentioned this pull request Jul 11, 2025

Extract headings from www.rfc-editor.org RFCs w3c/reffy#1883

Merged

tidoust reviewed Jul 11, 2025

Elchi3 and others added 3 commits July 14, 2025 14:40

Add RFC alternateIds

41e9839

Co-authored-by: François Daoust <fd@tidoust.net>

Fix linter and max callstack

211abb9

Merge branch 'main' into spec-url-validator

831cbfe

Elchi3 mentioned this pull request Jul 14, 2025

Update api/ spec_urls #27293

Merged

Merge branch 'main' into spec-url-validator

a3d8e93

Elchi3 mentioned this pull request Jul 22, 2025

Update more spec_urls for validation #27365

Merged

tidoust mentioned this pull request Jul 22, 2025

Add custom dfns/headings extraction logic for WebGL1 w3c/reffy#1894

Merged

Elchi3 added 2 commits July 24, 2025 16:27

Merge branch 'main' into spec-url-validator

7312a08

Update exception list

1ac5c7b

hellotherechaunce approved these changes Aug 6, 2025
