We begin with a massive list of URLs. No quality signal — just raw data. Most of these sites will be irrelevant: parked domains, large corporations, wrong industry, dead links.
Each dot is a website. The highlighted dots are sites that would match our target profile. Without a systematic method, finding them means manually reviewing thousands of sites.
The core challenge isn't identifying what industry a business is in — it's determining whether a website belongs to the kind of small, simple operation we're looking for. A 3-page Wix site for a family-owned roofer looks nothing like a 50-page corporate site for a national roofing chain, even though both say "roofing." Rules can't capture that difference. Industry identification is a useful bonus, but website simplicity is what makes or breaks lead quality.
Hard-code keywords and thresholds. This approach fails at the thing that matters most: how simple the site is.
pages < 5 — misses a simple site with 8 pages of photo galleries
template = "Wix" — misses GoDaddy, Squarespace, hand-coded sites
word count < 500 — catches parked domains alongside real businesses
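Concretely, a rules-based filter tends to end up as a pile of hard-coded checks like the sketch below. The field names and thresholds are illustrative, not from any real system, and each check breaks in exactly the ways listed above.

```python
# Hypothetical rules-based filter. Field names and thresholds are
# illustrative only; each check fails in the ways listed above.
def looks_like_small_operator(site: dict) -> bool:
    return (
        site["page_count"] < 5            # misses a simple site with 8 pages of photo galleries
        and site["builder"] == "Wix"      # misses GoDaddy, Squarespace, hand-coded sites
        and site["word_count"] < 500      # also matches parked domains alongside real businesses
    )
```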
Give the system a handful of sites you know are the right kind of lead. It finds every site that looks and feels like them — same simplicity, same structure, same vibe.
The system learns what a "small operator" site looks like from examples — structure, complexity, tone — not from rules anyone had to write.
The question shifts from "does this site match our rules?" to "does this site resemble the sites we already know are good?" The rest of this walkthrough explains how that resemblance is measured.
We feed each website's text into a neural network, and it outputs a list of 1,536 numbers — called an embedding. Think of it as a GPS coordinate, but instead of 2 dimensions (latitude and longitude) you get 1,536. The network learned, from training on enormous amounts of text, to place sites that are about similar things at nearby coordinates.
"Family-owned roofing company serving the greater Atlanta area. We specialize in residential roof repair, storm damage restoration, and new shingle installation. Free estimates."
"Professional cloud hosting solutions for enterprise. Our platform delivers 99.9% uptime with global CDN, automated scaling, and 24/7 support."
No single number means "roofing" or "hosting." The meaning is spread across all 1,536 numbers together. But two roofing sites will produce similar patterns, while a roofing site and a hosting site will produce very different ones. That's the property we exploit.
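The walkthrough doesn't name a specific model, but the 1,536-dimension size matches OpenAI's text-embedding-3-small, so here is a minimal sketch of generating an embedding with that API. Any sentence-embedding model with a similar interface would work the same way.

```python
# Minimal sketch: turn a site's text into an embedding.
# Assumes OpenAI's embeddings API; text-embedding-3-small returns
# 1,536 floats per input. Any comparable embedding model works.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding  # a list of 1,536 numbers

roofer = embed("Family-owned roofing company serving the greater Atlanta area...")
hosting = embed("Professional cloud hosting solutions for enterprise...")
```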
When we embed thousands of sites and plot them, businesses in the same trade naturally cluster together. Roofers land near roofers. Plumbers near plumbers. We didn't program this — the network figured it out from the content. With 1,536 dimensions (compressed to 2D below), it can even distinguish a small residential roofer from a national commercial chain.
Hover over any dot. This is a simplified illustration with 40 points — a real dataset has thousands, with more overlapping clusters.
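The 2D view is only a projection for plotting. The walkthrough doesn't say which technique it uses, but PCA is one common choice (t-SNE and UMAP are others). A sketch, reusing the `embed` helper above and a hypothetical `site_texts` list:

```python
# Compress 1,536-dimensional embeddings down to 2D for plotting.
# PCA is one common choice; t-SNE or UMAP are alternatives.
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.array([embed(text) for text in site_texts])  # shape: (n_sites, 1536)
coords_2d = PCA(n_components=2).fit_transform(embeddings)    # shape: (n_sites, 2)
```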
Once every site has an embedding, we can score how similar any two sites are. The result is a number from 0 (nothing in common) to 1 (nearly identical). The score captures both what the business does and how the site is built — so two small residential roofers score high even if they use different words.
The score focuses on what a site is about, not how much content it has. A 3-page site and a 30-page site about the same trade will still score high.
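Under the hood, that score is typically the cosine similarity of the two embedding vectors. The walkthrough doesn't name the metric, so treat this as one standard choice; the sketch reuses the `embed` helper from earlier.

```python
import numpy as np

def similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embeddings: ~0 means unrelated, ~1 means nearly identical."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two small roofers score high even with different wording...
similarity(embed("Residential roof repair and shingle installation"),
           embed("Storm damage roof restoration, free estimates"))
# ...while a roofer and a hosting company score low.
similarity(embed("Residential roof repair and shingle installation"),
           embed("Enterprise cloud hosting with global CDN and 24/7 support"))
```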
We generate two separate embeddings for each website. One from the text content, one from the raw HTML structure. They capture different signals, and we weight them differently.
Based on the readable content — the words on the page.
Captures what the business does: industry, services, language patterns. Good at distinguishing roofers from plumbers.
Based on the page's source code — layout, tags, structure.
Captures how the site was built: template, layout, builder platform. Small service businesses tend to use the same builders (Wix, GoDaddy, Squarespace) and the same templates — so structurally similar sites are likely similar businesses.
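Combining the two signals can be as simple as a weighted average of the two similarity scores. The weights below are placeholders for illustration, not values from this walkthrough; `similarity` is the cosine helper from the previous sketch.

```python
# Hypothetical blend of the text and HTML signals.
# The 0.6 / 0.4 weights are placeholders, not production values.
TEXT_WEIGHT = 0.6
HTML_WEIGHT = 0.4

def combined_similarity(site_a: dict, site_b: dict) -> float:
    text_sim = similarity(site_a["text_embedding"], site_b["text_embedding"])
    html_sim = similarity(site_a["html_embedding"], site_b["html_embedding"])
    return TEXT_WEIGHT * text_sim + HTML_WEIGHT * html_sim
```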
This is the core operation. We take known good leads, average their embeddings into a single point (the centroid), measure every site's distance from it, and sort. The closest sites are the most similar to your best customers. Press play to watch it unfold.
The centroid averages out noise from individual examples. Sorting by distance produces a ranked list where the top entries are the sites most structurally and semantically similar to known good leads — without writing any rules about what a "good lead" looks like.
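Put together, the ranking step fits in a few lines. This sketch assumes every site already has an embedding from the steps above and scores by cosine similarity to the centroid, which (after normalization) gives the same ordering as sorting by distance; variable names are illustrative.

```python
import numpy as np

def rank_by_centroid(known_good: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """known_good: (k, 1536) embeddings of known good leads.
    candidates: (n, 1536) embeddings of every site to score.
    Returns candidate indices ordered from most to least similar."""
    centroid = known_good.mean(axis=0)                 # average the good leads into one point
    centroid /= np.linalg.norm(centroid)
    normalized = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = normalized @ centroid                     # cosine similarity to the centroid
    return np.argsort(-scores)                         # closest sites first

# ranked = rank_by_centroid(good_lead_embeddings, all_site_embeddings)
```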