
I Built 77 Tests for My Security Scanner and Found a Production Bug in 10 Minutes

Benji · 8 min read

Zero tests. For a security tool.

Yesterday I published how I rebuilt my scoring from scratch. Users confirmed the fixes worked. Beta testers validated the improvements.

But I had no automated tests. 87 finding types, 14 analyzer functions, and the only validation was "scan a site and eyeball the results."

That's not how you ship a tool that tells people their site is secure. Today I fixed that.

77 tests in one session

Unit tests for every analyzer

My scanner has 14 analyzer functions. Each one takes raw HTTP data (headers, cookies, file probes) and returns findings with severities. They're pure functions. No network, no database, no Cloudflare Worker. Perfect for unit testing.

I wrote 56 tests covering every analyzer.

Every test verifies both detection and severity. If I ever accidentally change CSP from low to medium, a test fails. Every severity is traceable to CVSS v3.1 and Bugcrowd VRT ratings.

A vulnerable server for integration testing

Unit tests verify logic. Integration tests verify the full chain.

I built an HTTP server with four vulnerability profiles:

| Profile | What it simulates | Key tests |
| --- | --- | --- |
| insecure | No headers, exposed .env, bad cookies, stack traces, CORS wildcard + credentials | 10 tests: every finding type fires correctly |
| secure | All headers, secure cookies, security.txt | 4 tests: zero false positives |
| typical-spa | React SPA, catch-all routing, every path returns 200 + HTML | 4 tests: .env and /graphql return HTML, proving SPA detection works |
| wordpress | PHP headers, WP user enumeration, session cookies | 3 tests: WP-specific findings detected |

The SPA profile is the most important. It proves that a React app on Netlify or Vercel won't get phantom .env and /graphql findings. If the SPA detection ever regresses, this test catches it before it reaches production.
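One way such a fixture stays deterministic is to model the server's routing as a pure function and assert on it directly, only wrapping it in a real HTTP listener at the edges. The profile names below come from the article; the header sets, paths, and bodies are my illustrative assumptions:

```typescript
type Profile = "insecure" | "secure" | "typical-spa" | "wordpress";
type Res = { status: number; headers: Record<string, string>; body: string };

// Pure routing function: no sockets, no timers, fully deterministic.
function respond(profile: Profile, path: string): Res {
  switch (profile) {
    case "insecure":
      // Exposed secrets and zero security headers.
      if (path === "/.env")
        return { status: 200, headers: { "content-type": "text/plain" }, body: "DB_PASSWORD=hunter2" };
      return { status: 200, headers: {}, body: "<html>no security headers</html>" };
    case "secure":
      return {
        status: 200,
        headers: {
          "strict-transport-security": "max-age=63072000",
          "content-security-policy": "default-src 'self'",
          "x-content-type-options": "nosniff",
        },
        body: "<html>ok</html>",
      };
    case "typical-spa":
      // Catch-all routing: EVERY path, including /.env and /graphql,
      // returns 200 with the HTML shell.
      return { status: 200, headers: { "content-type": "text/html" }, body: '<div id="root"></div>' };
    case "wordpress":
      if (path === "/wp-json/wp/v2/users")
        return { status: 200, headers: { "content-type": "application/json" }, body: '[{"slug":"admin"}]' };
      return { status: 200, headers: { "x-powered-by": "PHP/8.2" }, body: "<html>wp</html>" };
  }
}
```

The typical-spa case is the regression guard: an integration test can assert that /.env on that profile comes back as HTML and must not produce an exposed-file finding.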

All 77 tests run in 262ms. No flakiness, no network dependencies, fully deterministic.

The bug I found in 10 minutes

Before writing tests, I re-scanned the five sites from my SPA false positive report.

Three user-reported sites: clean. A fourth site, ski-club-bons-en-chablais.netlify.app, went from 0/F with 35 false findings to 8.5/A with 8 real findings.

Then I hit pickmynews-dashboard-private.vercel.app. Still 0/F. 16 false positive API endpoints.

The root cause: my SPA detection uses HEAD requests for API endpoint probing. Vercel returns an empty content-type for HEAD requests on catch-all routes. My filter checked "is this HTML?" which was false (because the content-type was empty, not text/html). So the SPA filter never fired for API endpoints.

The fix was one line. Instead of checking "is the response HTML?", I now check "is the response NOT a real API format?" Real GraphQL endpoints return application/json. Real Swagger docs return application/json. If we've already confirmed SPA catch-all routing exists and the response isn't JSON or XML, it's the SPA shell.
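A sketch of what that inverted check can look like. The function and constant names are mine, not the scanner's actual code:

```typescript
// Content types a real API endpoint would answer with.
const API_TYPES = ["application/json", "application/xml", "text/xml"];

// Old check: "is the response HTML?" — breaks when Vercel returns an
// empty content-type for HEAD requests on catch-all routes.
// New check: "is the response NOT a real API format?"
function isSpaShell(contentType: string, spaCatchAllConfirmed: boolean): boolean {
  if (!spaCatchAllConfirmed) return false;
  const ct = contentType.toLowerCase().split(";")[0].trim();
  // text/html AND the empty content-type both land here — which is
  // exactly what the old HTML-only check missed.
  return !API_TYPES.includes(ct);
}
```

With this shape, the Vercel HEAD edge case passes for free: an empty string is not a real API format, so the filter fires and the phantom /graphql finding is suppressed.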

This is why you test edge cases. The fix from yesterday handled 103 out of 108 SPA sites correctly. The 5 that failed shared a specific Vercel HEAD request behavior that only shows up if you actually probe it.

Benchmark: AmIHackable vs Observatory, finding by finding

Comparing grades is easy. AmIHackable says B, Observatory says C. Who's right? The letter grades alone tell you nothing.

I built a benchmark that compares at the individual test level. For 19 real sites, I mapped each Observatory test to our corresponding finding and checked: do we agree?

Results: 68% overall agreement. But the per-test breakdown is where it gets useful:

| Test | Agreement | Analysis |
| --- | --- | --- |
| HSTS | 100% | Perfect alignment. We detect this identically. |
| X-Content-Type-Options | 100% | Perfect. |
| Clickjacking protection | 100% | Perfect. |
| HTTPS redirect | 89% | Near-perfect. Minor edge cases. |
| CSP | 84% | Good. Disagreements are about CSP policy quality, not presence. |
| Referrer-Policy | 42% | Gap. Observatory evaluates the policy value; we check presence/absence. |
| Cookie security | 26% | Different approaches. Observatory tests all flags together; we test each flag individually. |
| Subresource Integrity | 5% | Our biggest blind spot. Mapping issue between their test and our finding structure. |
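To make the per-test numbers concrete, here's how such a tally can be computed. The `Verdict` shape is my assumption, not the benchmark's actual data model:

```typescript
// One record per (test, site): did each scanner flag the issue?
type Verdict = { test: string; ours: boolean; observatory: boolean };

// Agreement = both flagged it, or both didn't, per test across all sites.
function agreementByTest(verdicts: Verdict[]): Map<string, number> {
  const totals = new Map<string, { agree: number; n: number }>();
  for (const v of verdicts) {
    const t = totals.get(v.test) ?? { agree: 0, n: 0 };
    t.n += 1;
    if (v.ours === v.observatory) t.agree += 1;
    totals.set(v.test, t);
  }
  return new Map([...totals].map(([test, t]) => [test, t.agree / t.n]));
}
```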

Three takeaways:

1. Core header detection is solid. 100% agreement on three foundational checks. The plumbing works.

2. We test 21 things Observatory doesn't. Exposed endpoints, CORS misconfigurations, JavaScript secrets, Supabase RLS bypasses, open redirects. These appeared as "extra" findings on 10 of 19 sites. A site can get A+ from Observatory and still have an exposed .env file. We catch that.

3. The cookie and SRI gaps aren't about accuracy. Observatory treats cookies as one binary test (pass/fail). We test Secure, HttpOnly, and SameSite independently. A site that's missing SameSite but has Secure and HttpOnly passes Observatory's test but fails three of ours. Neither is wrong. They measure different granularities.
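A sketch of the per-flag approach, as a hypothetical helper rather than the scanner's actual code:

```typescript
// Observatory-style: one binary pass/fail per cookie.
// Our-style: Secure, HttpOnly, and SameSite each produce their own finding.
function missingCookieFlags(setCookie: string): string[] {
  const attrs = setCookie.toLowerCase();
  const missing: string[] = [];
  if (!/;\s*secure/.test(attrs)) missing.push("Secure");
  if (!/;\s*httponly/.test(attrs)) missing.push("HttpOnly");
  if (!/;\s*samesite=/.test(attrs)) missing.push("SameSite");
  return missing;
}
```

A cookie like `sid=abc; Secure; HttpOnly` yields one finding here (missing SameSite) while passing a single aggregate check, which is exactly how the 26% agreement figure arises without either tool being wrong.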

Contextual scoring: the data says it works

In yesterday's article, I described contextual scoring as "what's next." Today I built the prototype and ran it on real data.

The idea: A missing CSP on a static portfolio with zero JavaScript is not the same risk as a missing CSP on a site loading Google Tag Manager, Stripe.js, and Intercom. The first is theoretical. The second has real attack surface.

The formula:
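A sketch of the shape such a surface-weighted adjustment can take. The multipliers below are my illustrative assumptions; the article only confirms that 1.0 and 1.2 factors exist for higher-surface sites and that static sites are penalized less:

```typescript
type Surface = "static" | "moderate" | "high";

// Illustrative factors — NOT the shipped values.
const SURFACE_FACTOR: Record<Surface, number> = {
  static: 0.5, // assumption: zero JS means header gaps are largely theoretical
  moderate: 1.0, // baseline
  high: 1.2, // e.g. 6+ third-party scripts, eval patterns
};

// Scale a finding's score penalty by the site's actual attack surface.
function contextualPenalty(basePenalty: number, surface: Surface): number {
  return basePenalty * SURFACE_FACTOR[surface];
}
```

The key property is that a moderate site is unaffected, a static portfolio sheds penalties for threats its stack can't express, and a script-heavy site is penalized harder for the same missing header.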

Results on 153 real scans:

The most dramatic: a static site going from 5.9/C to 8.9/A. It was penalized for missing headers that protect against threats that don't exist on its stack.

What this means: Our audience is indie devs building simple sites. Penalizing them for missing mitigations against non-existent threats makes the score feel unfair. Contextual scoring fixes this without weakening the signal for complex sites.

What's missing: Zero sites in our database have high injection surfaces (6+ scripts, eval patterns). I need to scan e-commerce sites and SaaS dashboards to validate the upgrade path before deploying.

What comes next

  1. Deploy the Vercel HEAD fix. Re-scan pickmynews to confirm it works.
  2. Store injection surface data. Script counts and inline JS presence need to be persisted in scan results for contextual scoring.
  3. Scan 20 high-traffic sites. Shopify stores, marketing pages with trackers, SaaS apps. Test the 1.0 and 1.2 surface factors.
  4. Fix the SRI benchmark mapping. This is inflating our disagreement rate with Observatory.
  5. Run the benchmark on 100+ sites. 19 is enough to spot patterns. 100 gives statistical confidence.

The testing infrastructure is in place. Every improvement from here is measurable.


Frequently Asked Questions

How many tests does AmIHackable have?
77 automated tests covering all 14 analyzer functions. 56 unit tests verify detection and severity calibration. 21 integration tests run against a local vulnerable server with four profiles (insecure, secure, SPA, WordPress). All tests run in under 300ms.
How does AmIHackable compare to Mozilla Observatory?
We compared 19 sites at the finding level. 100% agreement on HSTS, X-Content-Type-Options, and clickjacking protection. 84% on CSP. The main gaps are in cookie testing approach (26% agreement) and SRI (5%). We test 21 additional things Observatory doesn't cover, including exposed files, JavaScript secrets, and CORS misconfigurations.
What is contextual security scoring?
Contextual scoring adjusts finding severity based on your actual attack surface. A missing CSP on a static portfolio with zero JavaScript is effectively informational. The same missing CSP on a site with 12 third-party scripts is a real gap. We're building toward a score that understands your tech stack.
Why does my Vercel SPA get false positive API endpoint findings?
Vercel's SPA catch-all routing returns different HTTP headers for HEAD vs GET requests. Our scanner now detects this pattern. If you scanned before March 23, 2026, try rescanning.
