
I Built 77 Tests for My Security Scanner and Found a Production Bug in 10 Minutes

Benji · 8 min read

Zero tests. For a security tool.

Yesterday I published how I rebuilt my scoring from scratch. Users confirmed the fixes worked. Beta testers validated the improvements.

But I had no automated tests. 87 finding types, 14 analyzer functions, and the only validation was "scan a site and eyeball the results."

That's not how you ship a tool that tells people their site is secure. Today I fixed that.

77 tests in one session

Unit tests for every analyzer

My scanner has 14 analyzer functions. Each one takes raw HTTP data (headers, cookies, file probes) and returns findings with severities. They're pure functions. No network, no database, no Cloudflare Worker. Perfect for unit testing.

I wrote 56 tests covering every analyzer.

Every test verifies both detection and severity. If I ever accidentally change CSP from low to medium, a test fails. Every severity is traceable to CVSS v3.1 and Bugcrowd VRT ratings.

A vulnerable server for integration testing

Unit tests verify logic. Integration tests verify the full chain.

I built an HTTP server with four vulnerability profiles:

| Profile | What it simulates | Key tests |
| --- | --- | --- |
| insecure | No headers, exposed .env, bad cookies, stack traces, CORS wildcard + credentials | 10 tests: every finding type fires correctly |
| secure | All headers, secure cookies, security.txt | 4 tests: zero false positives |
| typical-spa | React SPA, catch-all routing, every path returns 200 + HTML | 4 tests: .env and /graphql return HTML, proving SPA detection works |
| wordpress | PHP headers, WP user enumeration, session cookies | 3 tests: WP-specific findings detected |

The SPA profile is the most important. It proves that a React app on Netlify or Vercel won't get phantom .env and /graphql findings. If the SPA detection ever regresses, this test catches it before it reaches production.
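One way such a fixture stays deterministic is to model the server's routing as a pure function and assert on it directly, only wrapping it in a real HTTP listener at the edges. The profile names below come from the article; the header sets, paths, and bodies are my illustrative assumptions:

```typescript
type Profile = "insecure" | "secure" | "typical-spa" | "wordpress";
type Res = { status: number; headers: Record<string, string>; body: string };

// Pure routing function: no sockets, no timers, fully deterministic.
function respond(profile: Profile, path: string): Res {
  switch (profile) {
    case "insecure":
      // Exposed secrets and zero security headers.
      if (path === "/.env")
        return { status: 200, headers: { "content-type": "text/plain" }, body: "DB_PASSWORD=hunter2" };
      return { status: 200, headers: {}, body: "<html>no security headers</html>" };
    case "secure":
      return {
        status: 200,
        headers: {
          "strict-transport-security": "max-age=63072000",
          "content-security-policy": "default-src 'self'",
          "x-content-type-options": "nosniff",
        },
        body: "<html>ok</html>",
      };
    case "typical-spa":
      // Catch-all routing: EVERY path, including /.env and /graphql,
      // returns 200 with the HTML shell.
      return { status: 200, headers: { "content-type": "text/html" }, body: '<div id="root"></div>' };
    case "wordpress":
      if (path === "/wp-json/wp/v2/users")
        return { status: 200, headers: { "content-type": "application/json" }, body: '[{"slug":"admin"}]' };
      return { status: 200, headers: { "x-powered-by": "PHP/8.2" }, body: "<html>wp</html>" };
  }
}
```

The typical-spa case is the regression guard: an integration test can assert that /.env on that profile comes back as HTML and must not produce an exposed-file finding.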

All 77 tests run in 262ms. No flakiness, no network dependencies, fully deterministic.

The bug I found in 10 minutes

Before writing tests, I re-scanned the five sites from my SPA false positive report.

Three user-reported sites: clean. A fourth site, ski-club-bons-en-chablais.netlify.app, went from 0/F with 35 false findings to 8.5/A with 8 real findings.

Then I hit pickmynews-dashboard-private.vercel.app. Still 0/F. 16 false positive API endpoints.

The root cause: my SPA detection uses HEAD requests for API endpoint probing. Vercel returns an empty content-type for HEAD requests on catch-all routes. My filter checked "is this HTML?" which was false (because the content-type was empty, not text/html). So the SPA filter never fired for API endpoints.

The fix was one line. Instead of checking "is the response HTML?", I now check "is the response NOT a real API format?" Real GraphQL endpoints return application/json. Real Swagger docs return application/json. If we've already confirmed SPA catch-all routing exists and the response isn't JSON or XML, it's the SPA shell.
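A sketch of what that inverted check can look like. The function and constant names are mine, not the scanner's actual code:

```typescript
// Content types a real API endpoint would answer with.
const API_TYPES = ["application/json", "application/xml", "text/xml"];

// Old check: "is the response HTML?" — breaks when Vercel returns an
// empty content-type for HEAD requests on catch-all routes.
// New check: "is the response NOT a real API format?"
function isSpaShell(contentType: string, spaCatchAllConfirmed: boolean): boolean {
  if (!spaCatchAllConfirmed) return false;
  const ct = contentType.toLowerCase().split(";")[0].trim();
  // text/html AND the empty content-type both land here — which is
  // exactly what the old HTML-only check missed.
  return !API_TYPES.includes(ct);
}
```

With this shape, the Vercel HEAD edge case passes for free: an empty string is not a real API format, so the filter fires and the phantom /graphql finding is suppressed.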

This is why you test edge cases. The fix from yesterday handled 103 out of 108 SPA sites correctly. The 5 that failed shared a specific Vercel HEAD request behavior that only shows up if you actually probe it.

Benchmark: AmIHackable vs Observatory, finding by finding

Comparing grades is easy. AmIHackable says B, Observatory says C. Who's right? The letter grades alone tell you nothing.

I built a benchmark that compares at the individual test level. For 19 real sites, I mapped each Observatory test to our corresponding finding and checked: do we agree?

Results: 68% overall agreement. But the per-test breakdown is where it gets useful:

| Test | Agreement | Analysis |
| --- | --- | --- |
| HSTS | 100% | Perfect alignment. We detect this identically. |
| X-Content-Type-Options | 100% | Perfect. |
| Clickjacking protection | 100% | Perfect. |
| HTTPS redirect | 89% | Near-perfect. Minor edge cases. |
| CSP | 84% | Good. Disagreements are about CSP policy quality, not presence. |
| Referrer-Policy | 42% | Gap. Observatory evaluates the policy value; we check presence/absence. |
| Cookie security | 26% | Different approaches. Observatory tests all flags together; we test each flag individually. |
| Subresource Integrity | 5% | Our biggest blind spot. Mapping issue between their test and our finding structure. |
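To make the per-test numbers concrete, here's how such a tally can be computed. The `Verdict` shape is my assumption, not the benchmark's actual data model:

```typescript
// One record per (test, site): did each scanner flag the issue?
type Verdict = { test: string; ours: boolean; observatory: boolean };

// Agreement = both flagged it, or both didn't, per test across all sites.
function agreementByTest(verdicts: Verdict[]): Map<string, number> {
  const totals = new Map<string, { agree: number; n: number }>();
  for (const v of verdicts) {
    const t = totals.get(v.test) ?? { agree: 0, n: 0 };
    t.n += 1;
    if (v.ours === v.observatory) t.agree += 1;
    totals.set(v.test, t);
  }
  return new Map([...totals].map(([test, t]) => [test, t.agree / t.n]));
}
```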

Three takeaways:

1. Core header detection is solid. 100% agreement on three foundational checks. The plumbing works.

2. We test 21 things Observatory doesn't. Exposed endpoints, CORS misconfigurations, JavaScript secrets, Supabase RLS bypasses, open redirects. These appeared as "extra" findings on 10 of 19 sites. A site can get A+ from Observatory and still have an exposed .env file. We catch that.

3. The cookie and SRI gaps aren't about accuracy. Observatory treats cookies as one binary test (pass/fail). We test Secure, HttpOnly, and SameSite independently. A site that's missing SameSite but has Secure and HttpOnly passes Observatory's test but fails three of ours. Neither is wrong. They measure different granularities.
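A sketch of the per-flag approach, as a hypothetical helper rather than the scanner's actual code:

```typescript
// Observatory-style: one binary pass/fail per cookie.
// Our-style: Secure, HttpOnly, and SameSite each produce their own finding.
function missingCookieFlags(setCookie: string): string[] {
  const attrs = setCookie.toLowerCase();
  const missing: string[] = [];
  if (!/;\s*secure/.test(attrs)) missing.push("Secure");
  if (!/;\s*httponly/.test(attrs)) missing.push("HttpOnly");
  if (!/;\s*samesite=/.test(attrs)) missing.push("SameSite");
  return missing;
}
```

A cookie like `sid=abc; Secure; HttpOnly` yields one finding here (missing SameSite) while passing a single aggregate check, which is exactly how the 26% agreement figure arises without either tool being wrong.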

Contextual scoring: the data says it works

In yesterday's article, I described contextual scoring as "what's next." Today I built the prototype and ran it on real data.

The idea: A missing CSP on a static portfolio with zero JavaScript is not the same risk as a missing CSP on a site loading Google Tag Manager, Stripe.js, and Intercom. The first is theoretical. The second has real attack surface.

The formula:
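A sketch of the shape such a surface-weighted adjustment can take. The multipliers below are my illustrative assumptions; the article only confirms that 1.0 and 1.2 factors exist for higher-surface sites and that static sites are penalized less:

```typescript
type Surface = "static" | "moderate" | "high";

// Illustrative factors — NOT the shipped values.
const SURFACE_FACTOR: Record<Surface, number> = {
  static: 0.5, // assumption: zero JS means header gaps are largely theoretical
  moderate: 1.0, // baseline
  high: 1.2, // e.g. 6+ third-party scripts, eval patterns
};

// Scale a finding's score penalty by the site's actual attack surface.
function contextualPenalty(basePenalty: number, surface: Surface): number {
  return basePenalty * SURFACE_FACTOR[surface];
}
```

The key property is that a moderate site is unaffected, a static portfolio sheds penalties for threats its stack can't express, and a script-heavy site is penalized harder for the same missing header.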

Results on 153 real scans:

The most dramatic: a static site going from 5.9/C to 8.9/A. It was penalized for missing headers that protect against threats that don't exist on its stack.

What this means: Our audience is indie devs building simple sites. Penalizing them for missing mitigations against non-existent threats makes the score feel unfair. Contextual scoring fixes this without weakening the signal for complex sites.

What's missing: Zero sites in our database have high injection surfaces (6+ scripts, eval patterns). I need to scan e-commerce sites and SaaS dashboards to validate the upgrade path before deploying.

What comes next

  1. Deploy the Vercel HEAD fix. Re-scan pickmynews to confirm it works.
  2. Store injection surface data. Script counts and inline JS presence need to be persisted in scan results for contextual scoring.
  3. Scan 20 high-traffic sites. Shopify stores, marketing pages with trackers, SaaS apps. Test the 1.0 and 1.2 surface factors.
  4. Fix the SRI benchmark mapping. This is inflating our disagreement rate with Observatory.
  5. Run the benchmark on 100+ sites. 19 is enough to spot patterns. 100 gives statistical confidence.

The testing infrastructure is in place. Every improvement from here is measurable.


Frequently Asked Questions

How many tests does AmIHackable have?
77 automated tests covering all 14 analyzer functions. 56 unit tests verify detection and severity calibration. 21 integration tests run against a local vulnerable server with four profiles (insecure, secure, SPA, WordPress). All tests run in under 300ms.
How does AmIHackable compare to Mozilla Observatory?
We compared 19 sites at the finding level. 100% agreement on HSTS, X-Content-Type-Options, and clickjacking protection. 84% on CSP. The main gaps are in cookie testing approach (26% agreement) and SRI (5%). We test 21 additional things Observatory doesn't cover, including exposed files, JavaScript secrets, and CORS misconfigurations.
What is contextual security scoring?
Contextual scoring adjusts finding severity based on your actual attack surface. A missing CSP on a static portfolio with zero JavaScript is effectively informational. The same missing CSP on a site with 12 third-party scripts is a real gap. We're building toward a score that understands your tech stack.
Why does my Vercel SPA get false positive API endpoint findings?
Vercel's SPA catch-all routing returns different HTTP headers for HEAD vs GET requests. Our scanner now detects this pattern. If you scanned before March 23, 2026, try rescanning.
