
How we measure topical drift (and internal link mismatch) — in plain English

Topical Drift Analyzer combines sitemap coverage, GSC query reality, cleaned main content, embeddings, clustering, UMAP projection, and internal link context analysis to produce an actionable checklist. Scan as often as you need to track improvements.

Beta includes
  • Unlimited scans (fair use)
  • Up to 500 pages per site
  • Main-content extraction included
  • Interactive radial map + UMAP
  • Linking opportunity detection
  • CSV exports + action plan

What we mean by "topical drift"

Topical drift is when a page slowly stops representing the topic + intent it used to win for. This can happen even if the page is "high quality."

  • Content drift: edits introduce a different topic neighborhood.
  • Intent drift: the page begins ranking for queries it wasn't built to satisfy (GSC signal).
  • Link drift: internal link context pushes the page toward a different cluster.
  • Structural drift: the page moves away from its natural semantic neighbors.

Focus, outliers, and the 3 signals that drive drift

We explain drift using a simple framework:

1) Site Focus
How tightly your site stays centered on its core topic(s). Measured by actual semantic distance from your topical center.
2) Outlier Pages (Site "Radius")
Pages that drift far from your core topic and weaken overall topical coherence. Default review threshold: distance ≥ 0.700.

3) The "ABC" signals (what we triangulate)
A — Anchors
Internal link anchors + link context describe what your pages mean to search engines.
B — Body
Your headings + main content represent what the page is actually about.
C — Clicks & queries (GSC reality)
GSC shows what the page is earning traffic/impressions for (intent in the wild).
Drift shows up when these signals disagree (e.g., anchors imply Topic A, content is Topic B, and GSC queries trend Topic C).
Where “ABC” comes from (and how we use it): “ABC” (Anchors, Body, Clicks) appears in public reporting and DOJ trial exhibits discussing Google’s topicality / base relevance components. We borrow the high-level mnemonic to explain our framework, but we adapt C to mean your connected Google Search Console queries + clicks (first-party performance data), not Google’s internal click-satisfaction signals.

Common drift patterns we detect

Query/intent shift
GSC trends toward a different intent than the page was built for.
Link-context mismatch
Anchor + surrounding text describe one topic, destination is another.
Cluster overlap / cannibalization
Multiple pages compete inside the same intent space.
Outlier pages
Pages far from your core topic that "pull" the site away from focus.
Semantic isolation
Pages with few close semantic neighbors (highlighted by the UMAP layout and confirmed by nearest-neighbor distances).
Missing internal links
Similar pages (close in semantic space) that should link but don't.
The report doesn't just flag problems — it tells you the fastest fix path (change anchors/context, re-center content, consolidate pages, add internal links, or improve structure).
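As a rough sketch of how the "missing internal links" check can work (pure-Python cosine similarity; the 0.8 threshold and the toy vectors below are illustrative, not the tool's actual values):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def linking_opportunities(embeddings, existing_links, threshold=0.8):
    """Return page pairs that are semantically close but not yet linked.

    embeddings: {url: vector}; existing_links: set of (src, dst) tuples.
    """
    pages = sorted(embeddings)
    pairs = []
    for i, a in enumerate(pages):
        for b in pages[i + 1:]:
            sim = cosine(embeddings[a], embeddings[b])
            linked = (a, b) in existing_links or (b, a) in existing_links
            if sim >= threshold and not linked:
                pairs.append((a, b, round(sim, 3)))
    # Strongest candidates first
    return sorted(pairs, key=lambda t: -t[2])
```

The real pipeline works over hundreds of embedded pages with vectorized math, but the core idea is the same: high similarity plus no existing link equals an opportunity.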

What data we use

Sitemap + crawl

URL inventory, status, canonical behavior, and content fetch (up to 500 pages during Beta).

Google Search Console

Queries, clicks, impressions, CTR, position — used to ground drift in real demand.

Page embeddings

A high-dimensional vector representation of each page's main content, used for similarity and clustering.

Internal link context

Anchor text + surrounding text + container/heading context (embedded separately).

Main-content extraction (why it matters)

Raw HTML includes navigation, footers, "related posts," and template noise. If you embed all of that, your vectors become a representation of your site template — not your topic.

We aim to keep
Primary article/body, headings (H1-H6), lists, key supporting text, main containers.
We aim to reduce
Nav/footer blocks, boilerplate, repeated CTAs, unrelated "widgets," template repetition.
Result: Embeddings reflect page meaning, not layout repetition. This makes clustering more accurate and drift detection more precise.

How embeddings are used

An embedding is a high-dimensional numeric vector that represents semantic meaning. We use embeddings to:

  • Measure similarity between pages (semantic neighborhoods)
  • Group pages into clusters (topic hubs via K-means)
  • Project pages into 2D space for visualization (via UMAP)
  • Compare internal-link context meaning vs destination page meaning
  • Calculate actual semantic distance from your site's topical center
similarity(pageA, pageB) = cosine(embeddingA, embeddingB)
distance = 1 - similarity
Embedding model: We generate embeddings using a modern semantic embedding model (model choice may evolve; scoring remains cosine similarity in embedding space). For the same input and model version, embeddings are generally consistent, and all similarity/distance scoring is computed in the original embedding space. You don't need the math to use the output — the report translates it into actions.
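The two formulas above can be written out directly (a minimal pure-Python sketch; production pipelines typically use a vectorized library such as NumPy, but the math is identical):

```python
import math

def cosine_similarity(a, b):
    """cosine(A, B) = (A · B) / (|A| * |B|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantic_distance(a, b):
    """distance = 1 - similarity; 0.0 = same direction, 2.0 = opposite."""
    return 1.0 - cosine_similarity(a, b)
```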

Actual vs normalized distances

We provide two ways to measure and visualize semantic distance:

Actual distances (default)
Raw cosine distance from your site's topical center (0.0 = perfect alignment, ~1.0 = weakly related/neutral, up to 2.0 = strongly dissimilar).
distance = 1 - cosine(page, global_centroid)
Fixed thresholds:
  • Core: ≤ 0.300 (excellent focus)
  • Focus: 0.300-0.500 (on-topic)
  • Expansion: 0.500-0.700 (moderate drift)
  • Peripheral: ≥ 0.700 (needs review)
Normalized (within-site 0–1)
Distances scaled to a 0.0–1.0 range within this scan for easier ranking and UI filtering. This is a relative measure for your site, not a universal scale.
normalized = (distance - min) / (max - min)
Implementation note: if max == min (rare), normalized distance is set to 0 to avoid divide-by-zero.
Relative buckets (default):
  • Core: 0.0–0.3 (closest pages)
  • Focus: 0.3–0.5
  • Expansion: 0.5–0.7
  • Peripheral: 0.7–1.0 (furthest pages)
When to use each:
  • Actual distances: Best for topical focus tracking across scans. Distances are computed the same way each run, and are generally more comparable over time than normalized values.
  • Normalized: Best for ranking pages within this scan and building UI filters or composite scores. This is a relative scale and will change if your site’s min/max distances change.
Note: your “topical center” (centroid) is typically recalculated each scan based on the pages analyzed, so the absolute numbers can still shift as your content changes.
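Putting the two modes together, here is a hedged sketch: `normalize` implements the min/max scaling with the divide-by-zero guard from the implementation note, and `zone_for` applies the fixed thresholds listed above (treating exactly 0.700 as Expansion is our illustrative choice for the boundary):

```python
def normalize(distances):
    """Scale raw distances to 0-1 within this scan; all 0 if max == min."""
    lo, hi = min(distances), max(distances)
    if hi == lo:  # guard against divide-by-zero
        return [0.0 for _ in distances]
    return [(d - lo) / (hi - lo) for d in distances]

def zone_for(distance):
    """Map an actual cosine distance to its fixed zone."""
    if distance <= 0.300:
        return "Core"
    if distance <= 0.500:
        return "Focus"
    if distance <= 0.700:
        return "Expansion"
    return "Peripheral"
```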

How clustering reveals topical structure

We use K-means clustering on page embeddings to group semantically similar pages. This reveals:

Topical hubs
Your site's "neighborhoods" of related content (e.g., "Interior Painting," "Exterior Services").
Overlap / cannibalization
Multiple pages competing inside the same intent space.
Content gaps
Clusters with only 1-2 pages indicate under-developed topics.
Drift reference frame
Clusters define "normal" — pages far from their cluster centroid are drifting.
cluster_drift = 1 - cosine(page, cluster_centroid)
Automatic optimization: The tool can evaluate multiple cluster counts (e.g., 3–10) and select a strong option using separation metrics (such as silhouette score), depending on your configuration.
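The cluster_drift formula above can be sketched in pure Python (centroid = element-wise mean of member vectors; real implementations typically use a library like scikit-learn for K-means itself):

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cluster_drift(page, members):
    """cluster_drift = 1 - cosine(page, cluster_centroid)"""
    c = centroid(members)
    dot = sum(x * y for x, y in zip(page, c))
    norm_page = math.sqrt(sum(x * x for x in page))
    norm_c = math.sqrt(sum(x * x for x in c))
    return 1.0 - dot / (norm_page * norm_c)
```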

Semantic similarity visualization

UMAP (Uniform Manifold Approximation and Projection) projects your high-dimensional embeddings into 2D so you can see semantic neighborhoods at a glance. We use this projection to help position pages around the radial map (angle ordering). UMAP does not change drift scoring — scores are computed using cosine distance in the original embedding space.

How it works
UMAP assigns each page a 2D coordinate (x, y). For the radial map, we convert that into an angle using angle = atan2(y, x) so pages that are semantically similar tend to appear near each other around the circle.
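The angle conversion is a one-liner with `math.atan2` (sketch; converting to degrees and wrapping into 0-360 is for radial-map placement):

```python
import math

def radial_angle(x, y):
    """Convert a UMAP (x, y) coordinate to an angle in degrees [0, 360)."""
    deg = math.degrees(math.atan2(y, x))
    return deg % 360.0  # atan2 returns -180..180; wrap negatives
```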
What it reveals
“Neighborhoods” of related pages — including relationships that can cross formal cluster boundaries. This makes it easier to spot hub structure, near-duplicates/cannibalization, and strong internal linking candidates.
Linking opportunities: Pages that appear close together around the circle are often semantically related and may benefit from contextual internal links. Angle proximity is a discovery aid; final recommendations still use similarity/distance in the original embedding space.
Technical note: UMAP is commonly used for embedding visualization because it often preserves local neighborhoods better than PCA, and it’s typically faster/more scalable than t-SNE for hundreds of pages. Layout can shift slightly between runs unless random seeds are fixed.

How we calculate drift (conceptually)

Drift isn't one number from one signal. We triangulate multiple signals, then summarize them into a score and a "why."

Semantic drift (primary)
Distance from page → cluster centroid (or global centroid in actual mode).
drift_sem = 1 - cosine(page, centroid)
Intent drift (GSC)
Queries the page earns vs the cluster's dominant intent set.
drift_intent = divergence(queries, cluster_queries)
Link structure penalty
Pages with few internal links get penalized (orphan detection).
link_penalty = 1 - (total_links / max_links)
Engagement penalty
Low-traffic pages may indicate poor targeting or drift.
engagement_penalty = 1 - (clicks / max_clicks)
About SDI: SDI is a prioritization score. By default it blends semantic drift with structural signals (internal linking + engagement). Intent drift from GSC is reported as a separate diagnostic (“why this page is drifting”) and can be optionally weighted in custom scoring.
SDI = (60% × semantic_drift) + (30% × link_penalty) + (10% × engagement_penalty)
The report always includes why the score is high (e.g., "ranking shifted toward X intent" or "link context pushes toward Y cluster" or "isolated from semantic neighbors").
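The default SDI blend above, as a small sketch (the 60/30/10 weights and max-normalized penalties follow the formulas in this section; the input values are illustrative):

```python
def sdi(semantic_drift, total_links, max_links, clicks, max_clicks):
    """SDI = 0.6*semantic_drift + 0.3*link_penalty + 0.1*engagement_penalty."""
    link_penalty = 1.0 - (total_links / max_links) if max_links else 0.0
    engagement_penalty = 1.0 - (clicks / max_clicks) if max_clicks else 0.0
    return 0.6 * semantic_drift + 0.3 * link_penalty + 0.1 * engagement_penalty
```

A well-linked, well-trafficked page scores only its semantic component; an orphaned zero-click page takes the full structural penalties on top.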

Interactive radial site map

The radial map is the centerpiece of the tool — it makes site structure, topic neighborhoods, and semantic relationships immediately visible.

Radius (distance from center)
Shows drift as cosine distance from your topical center (computed in the original embedding space). Farther from the center = more off-topic. You can toggle between actual distance (stable over time) or normalized (0–1 within this scan).
Angle (position around circle)
Shows semantic neighborhood ordering for easier exploration. By default, angle is derived from a UMAP 2D projection (layout-only). Pages near each other angularly are often related — but similarity and drift scores are still computed using cosine distance in the original embedding space. You can also group by cluster, distribute evenly, or group by problem type.
Interactive features:
  • Zoom & pan: Explore large sites (scroll wheel / pinch to zoom)
  • Toggle clusters: Show/hide specific topic clusters
  • Color by: Cluster, zone, SDI score, or drift severity
  • Node size: Reflects traffic (GSC clicks), adjustable 0.5-3x
  • Hover tooltips: See full page details, metrics, and drift reasons
  • Click to open: Pages open in new tab for quick review
What it reveals:
Isolated clusters
Topics with weak internal connectivity (fix: add hub pages, strengthen links)
Bridge pages
Pages connecting multiple topics (often drift because they serve multiple intents)
Linking opportunities
Pages close angularly that should link but don't (revealed by UMAP)
Perfect for communication: Export the map as PNG to share with clients, executives, or your team. It explains complex semantic relationships in seconds without requiring technical SEO knowledge.

Unlimited scans (fair use) — scan when it makes sense

Drift is time-based and change-driven. You need flexibility to scan when you publish major updates, test fixes, or track progress — not when an arbitrary calendar limit says so.

Common scanning patterns:
Monthly maintenance (most common)
Scan once per month to catch drift, monitor cluster health, and identify new issues. Good for stable sites with regular content updates.
Weekly during projects
Scan weekly when doing major content updates, site redesigns, or fixing drift issues. Track improvements in real time with actual distance mode.
Before/after testing
Run baseline scan → implement fixes → rescan days later to measure impact. See if content changes reduce semantic drift (lower actual distances).
Quarterly audits
For mature, stable sites, quarterly scans may be enough. Good for tracking long-term trends and catching slow drift.
Track improvements over time:

With actual distance mode, you can measure whether your fixes reduce semantic drift:

  • Run baseline scan → note distances for problematic pages (e.g., page at 0.680 distance)
  • Implement fixes (content updates, link changes, heading adjustments)
  • Rescan after changes are indexed (7-14 days)
  • Compare distances → page now at 0.485 ≈ 29% improvement in topical alignment
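The improvement figure in that example is just a relative change in actual distance, e.g.:

```python
def drift_improvement(before, after):
    """Percent reduction in actual distance (positive = closer to center)."""
    return (before - after) / before * 100.0

# (0.680 - 0.485) / 0.680 * 100 ≈ 28.7%, i.e. about a 29% improvement
```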
What "fair use" means:
Fair use includes
  • Scanning whenever you publish major content
  • Weekly scans during site updates/redesigns
  • Before/after testing to measure fixes
  • Ad-hoc scans when traffic drops unexpectedly
  • Comparing different internal linking strategies
Not fair use
  • Automated high-frequency scanning (hourly/daily)
  • Reselling or redistributing scan access
  • Abuse or attempts to overload the system
  • Commercial scraping or data harvesting
  • Running scans on sites you don't own/manage
Bottom line: If you're using the tool for its intended purpose (improving your site's topical focus), scan as often as your workflow requires. We don't police normal usage — the fair use policy exists to prevent abuse, not to restrict legitimate analysis.

What you get after a scan

Checklist + action plan

A prioritized list of what to fix first (drift pages, mismatched links, overlap/cannibalization, linking opportunities from UMAP).

Exports

CSV exports for drift pages, cluster membership, internal link mismatches, and semantic angle data.

Cluster report

Hubs, overlaps, and "what belongs together" — used to guide consolidation and internal linking.

Interactive radial map

Zoom, filter, and explore your site's topical structure. Export as PNG for sharing.

Linking opportunities

Pairs of semantically similar pages (close in embedding space) that should link to strengthen structure.

Zone distribution dashboard

See how many pages fall into Core/Focus/Expansion/Peripheral zones using actual distance thresholds.

Want me to review your report and map the fixes?
I’ll turn the drift signals into a practical plan: which pages to re-center, which internal links to clean up, where clusters overlap, and which semantic neighbors should link. Then we’ll rescan to confirm improvement.

What this is (and isn't)

  • Not a rank tracker: we use GSC performance data, not daily SERP scraping.
  • Not "AI content writing": we provide diagnosis + actions; you implement edits.
  • Extraction is best-effort: heavily scripted sites or unusual templates may require tuning.
  • Embeddings are a model: they approximate meaning; we reduce noise with main-content extraction and multiple signals.
  • UMAP stability: When a fixed random seed is used, layouts are repeatable; otherwise, the 2D layout (and angles) can shift slightly between runs even if the underlying page similarities are similar.
  • No "Google score claims": we use observable site data + public ML concepts; we do not claim access to internal Google scoring systems.
  • 500-page limit in beta: larger sites can be analyzed in phases or with custom deployment.
Start Free Beta
Up to 500 pages/site • unlimited scans (fair use)