✅ No Issues Expected - Protocol Links Should Be Ignored

Non-HTTP Protocol Links Test Page

This page contains various non-HTTP(S) protocol links that should NOT be crawled as regular URLs. These links use special URI schemes like mailto:, tel:, javascript:, and data:.

Email Links (mailto:)

Telephone Links (tel:)

Other Protocol Links

What should happen?

Expected Detection:

No issues should be detected: all of these protocol links should be filtered out during the link extraction phase, before they are ever queued for crawling.

Code location: apps/worker/src/crawler.ts:1604-1643 (extractLinks method)

The fix reads the raw attribute value with getAttribute('href') and, before URL normalization, discards any link whose scheme is not http: or https:. Filtering the raw value first means links such as mailto: and tel: never reach normalization.
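
The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the actual extractLinks code in crawler.ts: the helper name filterCrawlableHrefs, the constants, and the assumption that raw href strings have already been collected (for example via getAttribute('href')) are all hypothetical.

```typescript
// Hypothetical sketch; names and structure are assumptions, not the real crawler.ts code.
const ALLOWED_SCHEMES = new Set(["http", "https"]);

// Matches a leading URI scheme such as "mailto:" or "https:" (RFC 3986 scheme syntax).
const SCHEME_PATTERN = /^([a-zA-Z][a-zA-Z0-9+.-]*):/;

function filterCrawlableHrefs(rawHrefs: string[], baseUrl: string): string[] {
  const crawlable: string[] = [];
  for (const raw of rawHrefs) {
    const href = raw.trim();
    if (href === "") continue;

    // Reject explicit non-HTTP(S) schemes BEFORE any normalization happens.
    const match = SCHEME_PATTERN.exec(href);
    if (match && !ALLOWED_SCHEMES.has(match[1].toLowerCase())) continue;

    try {
      // Relative and absolute HTTP(S) hrefs are resolved against the page URL.
      crawlable.push(new URL(href, baseUrl).href);
    } catch {
      // Unparseable hrefs are dropped in this sketch.
    }
  }
  return crawlable;
}

const kept = filterCrawlableHrefs(
  ["mailto:a@example.com", "tel:+15551234567", "javascript:void(0)", "/about"],
  "http://localhost:3030/page",
);
console.log(kept); // ["http://localhost:3030/about"]
```

Checking the scheme on the raw string, rather than on the resolved URL, is what keeps these links from being treated as relative paths during normalization.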

Test Verification:

To verify the fix is working correctly:

  1. Run a crawl on http://localhost:3030
  2. Check that this page is crawled successfully
  3. Verify that NO link_targets or link_relationships are created for the non-HTTP(S) protocol links above
  4. Check that no skipped_urls entries exist with reasons like "invalid URL" or "relative path resolution failed" for these protocol links
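
Steps 3 and 4 can be partly automated. The sketch below assumes the URL values from link_targets and skipped_urls have already been exported into a plain string array; how they are queried depends on the storage layer, which this page does not describe. The function name findNonHttpUrls is hypothetical.

```typescript
// Hypothetical check: returns stored URL values that are not absolute http(s) URLs.
// An empty result means no protocol link leaked into the stored data.
function findNonHttpUrls(storedUrls: string[]): string[] {
  return storedUrls.filter((value) => {
    try {
      const { protocol } = new URL(value);
      return protocol !== "http:" && protocol !== "https:";
    } catch {
      // Not parseable as an absolute URL, so flag it for inspection.
      return true;
    }
  });
}

const leaks = findNonHttpUrls([
  "http://localhost:3030/about",
  "mailto:someone@example.com",
]);
console.log(leaks); // ["mailto:someone@example.com"]
```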

Normal HTTP Links (control group):

These should be crawled normally: