How similarity search finds qualified leads

01

The starting point: a large, unqualified list

We begin with a massive list of URLs. No quality signal — just raw data. Most of these sites will be irrelevant: parked domains, large corporations, wrong industry, dead links.

Each dot is a website. The highlighted dots are sites that would match our target profile. Without a systematic method, finding them means manually reviewing thousands of sites.

02

Finding leads by resemblance, not rules

The core challenge isn't identifying what industry a business is in — it's determining whether a website belongs to the kind of small, simple operation we're looking for. A 3-page Wix site for a family-owned roofer looks nothing like a 50-page corporate site for a national roofing chain, even though both say "roofing." Rules can't capture that difference. Industry identification is a useful bonus, but website simplicity is what makes or breaks lead quality.

Rule-based filtering

Write rules, hope they work

Hard-code keywords and thresholds. Doesn't work for the thing that matters most — how simple the site is.

  • pages < 5 — misses a simple site with 8 pages of photo galleries
  • template = "Wix" — misses GoDaddy, Squarespace, hand-coded sites
  • word count < 500 — catches parked domains alongside real businesses
  • No rule for "looks like an owner-operator site" — that's a vibe, not a field
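The brittleness is easy to show in code. A minimal sketch, using the thresholds from the list above with invented site records (nothing here is a real implementation):

```python
# Hypothetical rule-based filter using the thresholds listed above.
# Site records and field values are invented for illustration.
def rule_filter(site: dict) -> bool:
    """Return True only if the site passes every hard-coded rule."""
    return (
        site["pages"] < 5
        and site["builder"] == "Wix"
        and site["word_count"] < 500
    )

# A genuinely good lead: small family roofer, 8 gallery pages, built on GoDaddy.
good_lead = {"pages": 8, "builder": "GoDaddy", "word_count": 420}
# A worthless parked page that happens to satisfy every rule.
parked = {"pages": 1, "builder": "Wix", "word_count": 40}

print(rule_filter(good_lead))  # False: rejected despite being exactly what we want
print(rule_filter(parked))     # True: accepted despite being worthless
```

Tightening or loosening any one threshold just trades one failure mode for another; no combination of these fields encodes "owner-operator site."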

Similarity search

Show examples, find lookalikes

Give the system a handful of sites you know are the right kind of lead. It finds every site that looks and feels like them — same simplicity, same structure, same vibe.

  • StormShield Exteriors: 0.94
  • Heritage Roof Co: 0.91
  • Pinnacle Roof Systems: 0.88
  • Metro Plumbing LLC: 0.61
  • NationalRoof.com (corporate): 0.31

The system learns what a "small operator" site looks like from examples — structure, complexity, tone — not from rules anyone had to write.

The question shifts from "does this site match our rules?" to "does this site resemble the sites we already know are good?" The rest of this walkthrough explains how that resemblance is measured.

03

Turning a website into a list of numbers

We feed each website's text into a neural network, and it outputs a list of 1,536 numbers — called an embedding. Think of it as a GPS coordinate, but instead of 2 dimensions (latitude and longitude) you get 1,536. The network learned, from training on enormous amounts of text, to place sites that are about similar things at nearby coordinates.

Website text

"Family-owned roofing company serving the greater Atlanta area. We specialize in residential roof repair, storm damage restoration, and new shingle installation. Free estimates."

Embedding: 1,536 numbers

0.023 -0.118 0.041 0.087 -0.032 0.156 -0.071 0.009 0.133 -0.064 0.095 ... ×1,525 more

A different website

"Professional cloud hosting solutions for enterprise. Our platform delivers 99.9% uptime with global CDN, automated scaling, and 24/7 support."

A different embedding

-0.091 0.044 0.167 -0.053 0.112 -0.028 0.083 -0.145 0.006 0.139 -0.077 ... ×1,525 more

No single number means "roofing" or "hosting." The meaning is spread across all 1,536 numbers together. But two roofing sites will produce similar patterns, while a roofing site and a hosting site will produce very different ones. That's the property we exploit.
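To make the shape of this concrete, here is a toy stand-in: a deterministic function that maps any text to 1,536 numbers in [-1, 1]. It has none of the semantics of a real model (a production pipeline would call a trained embedding model instead); it only illustrates the text-in, fixed-length-vector-out contract:

```python
import hashlib

DIM = 1536  # embedding size used throughout this walkthrough

def pseudo_embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashes the text into DIM floats
    in [-1.0, 1.0]. Deterministic, but carries no meaning; a real trained
    model is what places similar sites near each other."""
    out: list[float] = []
    counter = 0
    while len(out) < DIM:
        digest = hashlib.sha256(f"{text}:{counter}".encode()).digest()
        out.extend(b / 127.5 - 1.0 for b in digest)  # map each byte to [-1, 1]
        counter += 1
    return out[:DIM]

vec = pseudo_embed("Family-owned roofing company serving the greater Atlanta area.")
print(len(vec))  # 1536
```

The one contractual property that matters downstream is the fixed length: every site, whatever its content, becomes a point in the same 1,536-dimensional space, so any two sites can be compared.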

04

Similar businesses land near each other

When we embed thousands of sites and plot them, businesses in the same trade naturally cluster together. Roofers land near roofers. Plumbers near plumbers. We didn't program this — the network figured it out from the content. With 1,536 dimensions (compressed to 2D below), it can even distinguish a small residential roofer from a national commercial chain.

Roofers
Plumbers
Landscapers
Other industries

This is a simplified illustration with 40 points — a real dataset has thousands, with more overlapping clusters.
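The 2D positions in a plot like this come from dimensionality reduction. The article doesn't say which method it uses; a PCA projection, one common choice, can be sketched with NumPy (toy clusters invented for illustration):

```python
import numpy as np

def project_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project high-dimensional vectors onto their top two principal
    components (PCA). A common choice for 2D plots; t-SNE and UMAP are
    popular alternatives that better preserve local cluster shape."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Toy data: two synthetic "industries" as clusters in 50 dimensions
rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 0.1, size=(20, 50))
cluster_b = rng.normal(1.0, 0.1, size=(20, 50))
points_2d = project_2d(np.vstack([cluster_a, cluster_b]))
print(points_2d.shape)  # (40, 2); the two clusters separate along the first axis
```

Note that the projection is for human eyes only: similarity scores are always computed in the full 1,536-dimensional space, where distinctions the 2D view flattens away are preserved.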

05

Measuring how alike two sites are

Once every site has an embedding, we can score how similar any two sites are. The result is a number from 0 (nothing in common) to 1 (nearly identical). The score captures both what the business does and how the site is built — so two small residential roofers score high even if they use different words.

  • Apex Roofing Co vs StormShield Roofing (both small residential roofers): 0.92 — high match
  • Apex Roofing Co vs Metro Plumbing (both home services, different trade): 0.61 — moderate match
  • Apex Roofing Co vs CloudNet Hosting (completely different industry): 0.18 — low match

The score focuses on what a site is about, not how much content it has. A 3-page site and a 30-page site about the same trade will still score high.
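A standard way to turn two embeddings into a score like this is cosine similarity: the cosine of the angle between the two vectors, which ignores vector length (and hence content volume) and compares direction only. A sketch in plain Python, with tiny 4-dimensional toy vectors standing in for real 1,536-dimensional embeddings (all values invented for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means the directions
    match exactly, values near 0 mean the vectors are unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy 4-dimensional "embeddings" (real ones have 1,536 dimensions)
apex = [0.8, 0.5, 0.1, 0.0]         # small residential roofer
stormshield = [0.7, 0.6, 0.2, 0.1]  # another small residential roofer
cloudnet = [0.0, 0.1, 0.9, 0.8]     # enterprise hosting

print(round(cosine_similarity(apex, stormshield), 2))  # 0.98: high match
print(round(cosine_similarity(apex, cloudnet), 2))     # 0.12: low match
```

Because the dot product is divided by both vectors' lengths, a 30-page site with ten times the text still points in roughly the same direction as a 3-page site about the same trade, which is why content volume doesn't swamp the score.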

06

Two embeddings per site, not one

We generate two separate embeddings for each website: one from the text content, one from the raw HTML structure. They capture different signals, and we weight them differently.

Text embedding

Based on the readable content — the words on the page.

Weight in final score: 30%

Sample text signals: roofing, shingles, free estimate, licensed, residential, storm damage

Captures what the business does: industry, services, language patterns. Good at distinguishing roofers from plumbers.

HTML embedding

Based on the page's source code — layout, tags, structure.

Weight in final score: 70%

Captures how the site was built: template, layout, builder platform. Small service businesses tend to use the same builders (Wix, GoDaddy, Squarespace) and the same templates — so structurally similar sites are likely similar businesses.
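A sketch of how the two signals could combine at the 30/70 split described above. Whether the blend is applied to per-pair similarity scores or to the embeddings themselves is an implementation detail the walkthrough doesn't specify; this assumes the simpler score-level blend:

```python
TEXT_WEIGHT = 0.3  # weight for text-content similarity
HTML_WEIGHT = 0.7  # weight for HTML-structure similarity

def combined_score(text_sim: float, html_sim: float) -> float:
    """Weighted blend of the two similarity scores; structure dominates."""
    return TEXT_WEIGHT * text_sim + HTML_WEIGHT * html_sim

# Two sites that read differently but share a builder template:
print(combined_score(text_sim=0.55, html_sim=0.92))  # about 0.81: structure wins
```

With these weights, a strong structural match can lift a pair well above a pair that merely talks about the same trade, which is the intended behavior: template and layout are the better proxy for "small operator."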

How weighting affects position

Each dot below has two positions — one from its text embedding, one from its HTML embedding. The slider blends between them. At 30/70 (our production setting), the structure signal dominates.

07

From examples to a ranked list

This is the core operation. We take known good leads, average their embeddings into a single point (the centroid), measure every site's distance from it, and sort. The closest sites are the most similar to your best customers.

The centroid averages out noise from individual examples. Sorting by distance produces a ranked list where the top entries are the sites most structurally and semantically similar to known good leads — without writing any rules about what a "good lead" looks like.
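The whole step fits in a few lines of plain Python, using cosine similarity as the closeness measure (the walkthrough doesn't pin down the exact metric) and toy 3-dimensional vectors in place of real embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def centroid(vectors: list[list[float]]) -> list[float]:
    """Average the example embeddings component-wise into one point."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def rank_by_similarity(known_good, candidates):
    """Score every candidate against the centroid of the known-good
    examples and return (name, score) pairs, best first."""
    c = centroid(known_good)
    scored = [(name, cosine_similarity(vec, c)) for name, vec in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional embeddings (invented for illustration)
known_good = [[0.9, 0.4, 0.1], [0.8, 0.5, 0.2]]
candidates = [
    ("StormShield Exteriors", [0.85, 0.45, 0.15]),
    ("Metro Plumbing LLC", [0.40, 0.80, 0.30]),
    ("CloudNet Hosting", [0.05, 0.10, 0.95]),
]
for name, score in rank_by_similarity(known_good, candidates):
    print(name, round(score, 2))
```

Averaging first and scoring each candidate once against the centroid also keeps the cost linear in the number of candidate sites, no matter how many examples you supply.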