
How we measure topical drift (and internal link mismatch) — in plain English

Topical Drift Analyzer combines sitemap coverage, GSC query reality, cleaned main content, embeddings, clustering, UMAP projection, and internal link context analysis to produce an actionable checklist. Scan as often as you need to track improvements.

Beta includes
  • Unlimited scans (fair use)
  • Up to 500 pages per site
  • Main-content extraction included
  • Interactive radial map + UMAP
  • Linking opportunity detection
  • CSV exports + action plan

What we mean by "topical drift"

Topical drift is when a page slowly stops representing the topic + intent it used to win for. This can happen even if the page is "high quality."

  • Content drift: edits introduce a different topic neighborhood.
  • Intent drift: the page begins ranking for queries it wasn't built to satisfy (GSC signal).
  • Link drift: internal link context pushes the page toward a different cluster.
  • Structural drift: the page moves away from its natural semantic neighbors.

Focus, outliers, and the 3 signals that drive drift

We explain drift using a simple framework:

1) Site Focus
How tightly your site stays centered on its core topic(s). Measured by actual semantic distance from your topical center.
2) Outlier Pages (Site "Radius")
Pages that drift far from your core topic and weaken overall topical coherence. Default review threshold: distance ≥ 0.700.

3) The "ABC" signals (what we triangulate)
A — Anchors
Internal link anchors + link context describe what your pages mean to search engines.
B — Body
Your headings + main content represent what the page is actually about.
C — Clicks & queries (GSC reality)
GSC shows what the page is earning traffic/impressions for (intent in the wild).
Drift shows up when these signals disagree (e.g., anchors imply Topic A, content is Topic B, and GSC queries trend Topic C).
Where “ABC” comes from (and how we use it): “ABC” (Anchors, Body, Clicks) appears in public reporting and DOJ trial exhibits discussing Google’s topicality / base relevance components. We borrow the high-level mnemonic to explain our framework, but we adapt C to mean your connected Google Search Console queries + clicks (first-party performance data), not Google’s internal click-satisfaction signals.

Common drift patterns we detect

Query/intent shift
GSC trends toward a different intent than the page was built for.
Link-context mismatch
Anchor + surrounding text describe one topic, destination is another.
Cluster overlap / cannibalization
Multiple pages compete inside the same intent space.
Outlier pages
Pages far from your core topic that "pull" the site away from focus.
Semantic isolation
Pages with few close semantic neighbors (highlighted by the UMAP layout and confirmed by nearest-neighbor distances).
Missing internal links
Similar pages (close in semantic space) that should link but don't.
The report doesn't just flag problems — it tells you the fastest fix path (change anchors/context, re-center content, consolidate pages, add internal links, or improve structure).
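As a rough sketch of how the "missing internal links" check can work (pure-Python cosine similarity; the 0.8 threshold and the toy vectors below are illustrative, not the tool's actual values):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def linking_opportunities(embeddings, existing_links, threshold=0.8):
    """Return page pairs that are semantically close but not yet linked.

    embeddings: {url: vector}; existing_links: set of (src, dst) tuples.
    """
    pages = sorted(embeddings)
    pairs = []
    for i, a in enumerate(pages):
        for b in pages[i + 1:]:
            sim = cosine(embeddings[a], embeddings[b])
            linked = (a, b) in existing_links or (b, a) in existing_links
            if sim >= threshold and not linked:
                pairs.append((a, b, round(sim, 3)))
    # Strongest candidates first
    return sorted(pairs, key=lambda t: -t[2])
```

The real pipeline works over hundreds of embedded pages with vectorized math, but the core idea is the same: high similarity plus no existing link equals an opportunity.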

What data we use

Sitemap + crawl

URL inventory, status, canonical behavior, and content fetch (up to 500 pages during Beta).

Google Search Console

Queries, clicks, impressions, CTR, position — used to ground drift in real demand.

Page embeddings

A high-dimensional vector representation of each page's main content, used for similarity and clustering.

Internal link context

Anchor text + surrounding text + container/heading context (embedded separately).

Main-content extraction (why it matters)

Raw HTML includes navigation, footers, "related posts," and template noise. If you embed all of that, your vectors become a representation of your site template — not your topic.

We aim to keep
Primary article/body, headings (H1-H6), lists, key supporting text, main containers.
We aim to reduce
Nav/footer blocks, boilerplate, repeated CTAs, unrelated "widgets," template repetition.
Result: Embeddings reflect page meaning, not layout repetition. This makes clustering more accurate and drift detection more precise.

How embeddings are used

An embedding is a high-dimensional numeric vector that represents semantic meaning. We use embeddings to:

  • Measure similarity between pages (semantic neighborhoods)
  • Group pages into clusters (topic hubs via K-means)
  • Project pages into 2D space for visualization (via UMAP)
  • Compare internal-link context meaning vs destination page meaning
  • Calculate actual semantic distance from your site's topical center
similarity(pageA, pageB) = cosine(embeddingA, embeddingB)
distance = 1 - similarity
Embedding model: We generate embeddings using a modern semantic embedding model (model choice may evolve; scoring remains cosine similarity in embedding space). For the same input and model version, embeddings are generally consistent, and all similarity/distance scoring is computed in the original embedding space. You don't need the math to use the output — the report translates it into actions.
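The two formulas above can be written out directly (a minimal pure-Python sketch; production pipelines typically use a vectorized library such as NumPy, but the math is identical):

```python
import math

def cosine_similarity(a, b):
    """cosine(A, B) = (A · B) / (|A| * |B|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantic_distance(a, b):
    """distance = 1 - similarity; 0.0 = same direction, 2.0 = opposite."""
    return 1.0 - cosine_similarity(a, b)
```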

Actual vs normalized distances

We provide two ways to measure and visualize semantic distance:

Actual distances (default)
Raw cosine distance from your site's topical center (0.0 = perfect alignment, ~1.0 = weakly related/neutral, up to 2.0 = strongly dissimilar).
distance = 1 - cosine(page, global_centroid)
Fixed thresholds:
  • Core: ≤ 0.300 (excellent focus)
  • Focus: 0.300-0.500 (on-topic)
  • Expansion: 0.500-0.700 (moderate drift)
  • Peripheral: ≥ 0.700 (needs review)
Normalized (within-site 0–1)
Distances scaled to a 0.0–1.0 range within this scan for easier ranking and UI filtering. This is a relative measure for your site, not a universal scale.
normalized = (distance - min) / (max - min)
Implementation note: if max == min (rare), normalized distance is set to 0 to avoid divide-by-zero.
Relative buckets (default):
  • Core: 0.0–0.3 (closest pages)
  • Focus: 0.3–0.5
  • Expansion: 0.5–0.7
  • Peripheral: 0.7–1.0 (furthest pages)
When to use each:
  • Actual distances: Best for topical focus tracking across scans. Distances are computed the same way each run, and are generally more comparable over time than normalized values.
  • Normalized: Best for ranking pages within this scan and building UI filters or composite scores. This is a relative scale and will change if your site’s min/max distances change.
Note: your “topical center” (centroid) is typically recalculated each scan based on the pages analyzed, so the absolute numbers can still shift as your content changes.
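Putting the two modes together, here is a hedged sketch: `normalize` implements the min/max scaling with the divide-by-zero guard from the implementation note, and `zone_for` applies the fixed thresholds listed above (treating exactly 0.700 as Expansion is our illustrative choice for the boundary):

```python
def normalize(distances):
    """Scale raw distances to 0-1 within this scan; all 0 if max == min."""
    lo, hi = min(distances), max(distances)
    if hi == lo:  # guard against divide-by-zero
        return [0.0 for _ in distances]
    return [(d - lo) / (hi - lo) for d in distances]

def zone_for(distance):
    """Map an actual cosine distance to its fixed zone."""
    if distance <= 0.300:
        return "Core"
    if distance <= 0.500:
        return "Focus"
    if distance <= 0.700:
        return "Expansion"
    return "Peripheral"
```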

How clustering reveals topical structure

We use K-means clustering on page embeddings to group semantically similar pages. This reveals:

Topical hubs
Your site's "neighborhoods" of related content (e.g., "Interior Painting," "Exterior Services").
Overlap / cannibalization
Multiple pages competing inside the same intent space.
Content gaps
Clusters with only 1-2 pages indicate under-developed topics.
Drift reference frame
Clusters define "normal" — pages far from their cluster centroid are drifting.
cluster_drift = 1 - cosine(page, cluster_centroid)
Automatic optimization: The tool can evaluate multiple cluster counts (e.g., 3–10) and select a strong option using separation metrics (such as silhouette score), depending on your configuration.
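The cluster_drift formula above can be sketched in pure Python (centroid = element-wise mean of member vectors; real implementations typically use a library like scikit-learn for K-means itself):

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cluster_drift(page, members):
    """cluster_drift = 1 - cosine(page, cluster_centroid)"""
    c = centroid(members)
    dot = sum(x * y for x, y in zip(page, c))
    norm_page = math.sqrt(sum(x * x for x in page))
    norm_c = math.sqrt(sum(x * x for x in c))
    return 1.0 - dot / (norm_page * norm_c)
```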

Semantic similarity visualization

UMAP (Uniform Manifold Approximation and Projection) projects your high-dimensional embeddings into 2D so you can see semantic neighborhoods at a glance. We use this projection to help position pages around the radial map (angle ordering). UMAP does not change drift scoring — scores are computed using cosine distance in the original embedding space.

How it works
UMAP assigns each page a 2D coordinate (x, y). For the radial map, we convert that into an angle using angle = atan2(y, x) so pages that are semantically similar tend to appear near each other around the circle.
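The angle conversion is a one-liner with `math.atan2` (sketch; converting to degrees and wrapping into 0-360 is for radial-map placement):

```python
import math

def radial_angle(x, y):
    """Convert a UMAP (x, y) coordinate to an angle in degrees [0, 360)."""
    deg = math.degrees(math.atan2(y, x))
    return deg % 360.0  # atan2 returns -180..180; wrap negatives
```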
What it reveals
“Neighborhoods” of related pages — including relationships that can cross formal cluster boundaries. This makes it easier to spot hub structure, near-duplicates/cannibalization, and strong internal linking candidates.
Linking opportunities: Pages that appear close together around the circle are often semantically related and may benefit from contextual internal links. Angle proximity is a discovery aid; final recommendations still use similarity/distance in the original embedding space.
Technical note: UMAP is commonly used for embedding visualization because it often preserves local neighborhoods better than PCA, and it’s typically faster/more scalable than t-SNE for hundreds of pages. Layout can shift slightly between runs unless random seeds are fixed.

How we calculate drift (conceptually)

Drift isn't one number from one signal. We triangulate multiple signals, then summarize them into a score and a "why."

Semantic drift (primary)
Distance from page → cluster centroid (or global centroid in actual mode).
drift_sem = 1 - cosine(page, centroid)
Intent drift (GSC)
Queries the page earns vs the cluster's dominant intent set.
drift_intent = divergence(queries, cluster_queries)
Link structure penalty
Pages with few internal links get penalized (orphan detection).
link_penalty = 1 - (total_links / max_links)
Engagement penalty
Low-traffic pages may indicate poor targeting or drift.
engagement_penalty = 1 - (clicks / max_clicks)
About SDI: SDI is a prioritization score. By default it blends semantic drift with structural signals (internal linking + engagement). Intent drift from GSC is reported as a separate diagnostic (“why this page is drifting”) and can be optionally weighted in custom scoring.
SDI = (60% × semantic_drift) + (30% × link_penalty) + (10% × engagement_penalty)
The report always includes why the score is high (e.g., "ranking shifted toward X intent" or "link context pushes toward Y cluster" or "isolated from semantic neighbors").
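The default SDI blend above, as a small sketch (the 60/30/10 weights and max-normalized penalties follow the formulas in this section; the input values are illustrative):

```python
def sdi(semantic_drift, total_links, max_links, clicks, max_clicks):
    """SDI = 0.6*semantic_drift + 0.3*link_penalty + 0.1*engagement_penalty."""
    link_penalty = 1.0 - (total_links / max_links) if max_links else 0.0
    engagement_penalty = 1.0 - (clicks / max_clicks) if max_clicks else 0.0
    return 0.6 * semantic_drift + 0.3 * link_penalty + 0.1 * engagement_penalty
```

A well-linked, well-trafficked page scores only its semantic component; an orphaned zero-click page takes the full structural penalties on top.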

Interactive radial site map

The radial map is the centerpiece of the tool — it makes site structure, topic neighborhoods, and semantic relationships immediately visible.

Radius (distance from center)
Shows drift as cosine distance from your topical center (computed in the original embedding space). Farther from the center = more off-topic. You can toggle between actual distance (stable over time) or normalized (0–1 within this scan).
Angle (position around circle)
Shows semantic neighborhood ordering for easier exploration. By default, angle is derived from a UMAP 2D projection (layout-only). Pages near each other angularly are often related — but similarity and drift scores are still computed using cosine distance in the original embedding space. You can also group by cluster, distribute evenly, or group by problem type.
Interactive features:
  • Zoom & pan: Explore large sites (scroll wheel / pinch to zoom)
  • Toggle clusters: Show/hide specific topic clusters
  • Color by: Cluster, zone, SDI score, or drift severity
  • Node size: Reflects traffic (GSC clicks), adjustable 0.5-3x
  • Hover tooltips: See full page details, metrics, and drift reasons
  • Click to open: Pages open in new tab for quick review
What it reveals:
Isolated clusters
Topics with weak internal connectivity (fix: add hub pages, strengthen links)
Bridge pages
Pages connecting multiple topics (often drift because they serve multiple intents)
Linking opportunities
Pages close angularly that should link but don't (revealed by UMAP)
Perfect for communication: Export the map as PNG to share with clients, executives, or your team. It explains complex semantic relationships in seconds without requiring technical SEO knowledge.

Unlimited scans (fair use) — scan when it makes sense

Drift is time-based and change-driven. You need flexibility to scan when you publish major updates, test fixes, or track progress — not when an arbitrary calendar limit says so.

Common scanning patterns:
Monthly maintenance (most common)
Scan once per month to catch drift, monitor cluster health, and identify new issues. Good for stable sites with regular content updates.
Weekly during projects
Scan weekly when doing major content updates, site redesigns, or fixing drift issues. Track improvements in real time with actual distance mode.
Before/after testing
Run baseline scan → implement fixes → rescan days later to measure impact. See if content changes reduce semantic drift (lower actual distances).
Quarterly audits
For mature, stable sites, quarterly scans may be enough. Good for tracking long-term trends and catching slow drift.
Track improvements over time:

With actual distance mode, you can measure whether your fixes reduce semantic drift:

  • Run baseline scan → note distances for problematic pages (e.g., page at 0.680 distance)
  • Implement fixes (content updates, link changes, heading adjustments)
  • Rescan after changes are indexed (7-14 days)
  • Compare distances → page now at 0.485 ≈ 29% improvement in topical alignment
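The improvement figure in that example is just a relative change in actual distance, e.g.:

```python
def drift_improvement(before, after):
    """Percent reduction in actual distance (positive = closer to center)."""
    return (before - after) / before * 100.0

# (0.680 - 0.485) / 0.680 * 100 ≈ 28.7%, i.e. about a 29% improvement
```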
What "fair use" means:
Fair use includes
  • Scanning whenever you publish major content
  • Weekly scans during site updates/redesigns
  • Before/after testing to measure fixes
  • Ad-hoc scans when traffic drops unexpectedly
  • Comparing different internal linking strategies
Not fair use
  • Automated high-frequency scanning (hourly/daily)
  • Reselling or redistributing scan access
  • Abuse or attempts to overload the system
  • Commercial scraping or data harvesting
  • Running scans on sites you don't own/manage
Bottom line: If you're using the tool for its intended purpose (improving your site's topical focus), scan as often as your workflow requires. We don't police normal usage — the fair use policy exists to prevent abuse, not to restrict legitimate analysis.

What you get after a scan

Checklist + action plan

A prioritized list of what to fix first (drift pages, mismatched links, overlap/cannibalization, linking opportunities from UMAP).

Exports

CSV exports for drift pages, cluster membership, internal link mismatches, and semantic angle data.

Cluster report

Hubs, overlaps, and "what belongs together" — used to guide consolidation and internal linking.

Interactive radial map

Zoom, filter, and explore your site's topical structure. Export as PNG for sharing.

Linking opportunities

Pairs of semantically similar pages (close in embedding space) that should link to strengthen structure.

Zone distribution dashboard

See how many pages fall into Core/Focus/Expansion/Peripheral zones using actual distance thresholds.

Want me to review your report and map the fixes?
I’ll turn the drift signals into a practical plan: which pages to re-center, which internal links to clean up, where clusters overlap, and which semantic neighbors should link. Then we’ll rescan to confirm improvement.

What this is (and isn't)

  • Not a rank tracker: we use GSC performance data, not daily SERP scraping.
  • Not "AI content writing": we provide diagnosis + actions; you implement edits.
  • Extraction is best-effort: heavily scripted sites or unusual templates may require tuning.
  • Embeddings are a model: they approximate meaning; we reduce noise with main-content extraction and multiple signals.
  • UMAP stability: When a fixed random seed is used, layouts are repeatable; otherwise, the 2D layout (and angles) can shift slightly between runs even if the underlying page similarities are similar.
  • No "Google score claims": we use observable site data + public ML concepts; we do not claim access to internal Google scoring systems.
  • 500-page limit in beta: larger sites can be analyzed in phases or with custom deployment.
Start Free Beta
Up to 500 pages/site • unlimited scans (fair use)