Main-content extraction (why it matters)
Raw HTML includes navigation, footers, "related posts," and template noise. If you embed all of that, your vectors become a representation of your site template — not your topic.
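To make the idea concrete, here is a minimal sketch of stripping template noise before embedding. It is not the tool's extractor: the `NOISE_TAGS` set, the `MainContentExtractor` class, and `extract_main_text` are hypothetical names chosen for illustration, and real pages usually need more heuristics than a tag blocklist.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as template noise (an assumption for this sketch).
NOISE_TAGS = {"nav", "header", "footer", "aside", "script", "style", "form"}


class MainContentExtractor(HTMLParser):
    """Collects text that is not nested inside any NOISE_TAGS element."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # greater than 0 while inside a noise element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the visible text outside navigation, footer, and similar blocks."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = """
<html><body>
  <nav><a href="/">Home</a> <a href="/blog">Blog</a></nav>
  <main><h1>Topical drift</h1><p>Drift is a slow change in topic.</p></main>
  <footer>Related posts | Privacy</footer>
</body></html>
"""
print(extract_main_text(sample))  # Topical drift Drift is a slow change in topic.
```

Only the text from the `main` block survives, so an embedding of this string reflects the page's topic rather than the shared navigation and footer.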
Topical Drift Analyzer combines sitemap coverage, GSC query reality, cleaned main content, embeddings, clustering, UMAP projection, and internal link context analysis to produce an actionable checklist. Scan as often as you need to track improvements.
Topical drift is when a page slowly stops representing the topic + intent it used to win for. This can happen even if the page is "high quality."
We explain drift using a simple framework:
URL inventory, status, canonical behavior, and content fetch (up to 500 pages during Beta).
Queries, clicks, impressions, CTR, position — used to ground drift in real demand.
A high-dimensional vector representation of each page's main content, used for similarity and clustering.
Anchor text + surrounding text + container/heading context (embedded separately).
An embedding is a high-dimensional numeric vector that represents semantic meaning. We use embeddings to:
We provide two ways to measure and visualize semantic distance:
If max == min (rare), the normalized distance is set to 0 to avoid divide-by-zero.
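The divide-by-zero guard can be sketched as follows. This assumes the normalized mode is a min-max rescaling of cosine distances, which the max/min wording suggests but does not confirm. The function names `cosine_distance` and `normalize_distances` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def normalize_distances(distances):
    """Min-max rescale distances to the range 0..1 (assumed formula)."""
    lo, hi = min(distances), max(distances)
    if hi == lo:
        # Every page is equally far away, so there is no spread to rescale.
        return [0.0 for _ in distances]
    return [(d - lo) / (hi - lo) for d in distances]


print(normalize_distances([0.25, 0.5, 0.75]))  # [0.0, 0.5, 1.0]
print(normalize_distances([0.3, 0.3]))         # [0.0, 0.0]
```

Normalized values are relative to the current scan, so they are useful for comparing pages within one site but not for comparing one scan against another.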
We use K-means clustering on page embeddings to group semantically similar pages. This reveals:
UMAP (Uniform Manifold Approximation and Projection) projects your high-dimensional embeddings into 2D so you can see semantic neighborhoods at a glance. We use this projection to help position pages around the radial map (angle ordering). UMAP does not change drift scoring — scores are computed using cosine distance in the original embedding space.
UMAP gives each page a 2D coordinate (x, y). For the radial map, we convert that into an angle using angle = atan2(y, x), so pages that are semantically similar tend to appear near each other around the circle.
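The angle conversion can be sketched in a few lines. The coordinates below are made-up stand-ins for UMAP output, and the helper names `radial_angle` and `order_by_angle` are hypothetical. Wrapping the result into the range 0 to 2π is a presentation choice made here, not something the formula above requires.

```python
import math


def radial_angle(x, y):
    """Angle of the 2D point (x, y) in radians, wrapped to the range 0..2*pi."""
    return math.atan2(y, x) % (2 * math.pi)


def order_by_angle(points):
    """points maps url -> (x, y); returns the urls sorted by angle."""
    return sorted(points, key=lambda url: radial_angle(*points[url]))


# Made-up 2D coordinates standing in for a UMAP projection.
points = {
    "/pricing": (0.0, -1.0),
    "/guide": (1.0, 0.0),
    "/blog/drift": (0.0, 1.0),
    "/about": (-1.0, 0.0),
}
print(order_by_angle(points))
# ['/guide', '/blog/drift', '/about', '/pricing']
```

Note that `atan2` takes `y` first and `x` second, and that it only sets the angular position. Distance from the center of the map still comes from cosine distance in the original embedding space.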
Drift isn't one number from one signal. We triangulate multiple signals, then summarize them into a score and a "why."
Most internal link tools only look at anchor text. We also embed: surrounding text (2-3 sentences) and the container/heading context.
The radial map is the centerpiece of the tool — it makes site structure, topic neighborhoods, and semantic relationships immediately visible.
Drift is time-based and change-driven. You need flexibility to scan when you publish major updates, test fixes, or track progress — not when an arbitrary calendar limit says so.
With actual distance mode, you can measure whether your fixes reduce semantic drift:
A prioritized list of what to fix first (drift pages, mismatched links, overlap/cannibalization, linking opportunities from UMAP).
CSV exports for drift pages, cluster membership, internal link mismatches, and semantic angle data.
Hubs, overlaps, and "what belongs together" — used to guide consolidation and internal linking.
Zoom, filter, and explore your site's topical structure. Export as PNG for sharing.
Pairs of semantically similar pages (close in embedding space) that should link to strengthen structure.
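One way to picture this is a pairwise scan over page embeddings that keeps close pairs with no existing link between them. The sketch below is an illustration only: the `suggest_links` function, the `min_similarity` default of 0.8, and the toy two-dimensional vectors are all assumptions, not the tool's actual logic or thresholds.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def suggest_links(embeddings, existing_links, min_similarity=0.8):
    """Return (url_a, url_b, similarity) for close pairs that are not yet linked.

    embeddings: dict of url -> vector
    existing_links: set of (source_url, target_url) tuples
    min_similarity: hypothetical cutoff, chosen only for this example
    """
    suggestions = []
    for u, v in combinations(sorted(embeddings), 2):
        sim = cosine_similarity(embeddings[u], embeddings[v])
        linked = (u, v) in existing_links or (v, u) in existing_links
        if sim >= min_similarity and not linked:
            suggestions.append((u, v, sim))
    # Most similar pairs first.
    return sorted(suggestions, key=lambda s: s[2], reverse=True)


# Toy vectors; real embeddings have hundreds of dimensions.
embeddings = {"/a": [1.0, 0.0], "/b": [0.9, 0.1], "/c": [0.0, 1.0]}
for u, v, sim in suggest_links(embeddings, existing_links=set()):
    print(u, v, round(sim, 3))
```

The scan compares every pair, so its cost grows quadratically with the number of pages. That is acceptable for a few hundred pages, while larger sites would likely need an approximate nearest-neighbor index.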
See how many pages fall into Core/Focus/Expansion/Peripheral zones using actual distance thresholds.
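As a rough illustration, zone counting reduces to bucketing each page's actual distance against fixed cutoffs. The cutoff values in `ZONE_THRESHOLDS` below are hypothetical placeholders, as are the function names. The ordering from Core (closest) to Peripheral (farthest) is inferred from the zone names rather than stated in the source.

```python
from collections import Counter

# Hypothetical upper bounds on cosine distance for each zone (placeholders only).
ZONE_THRESHOLDS = [("Core", 0.15), ("Focus", 0.30), ("Expansion", 0.45)]
ZONE_ORDER = ["Core", "Focus", "Expansion", "Peripheral"]


def zone_for(distance):
    """Return the first zone whose upper bound covers the distance."""
    for name, upper in ZONE_THRESHOLDS:
        if distance <= upper:
            return name
    return "Peripheral"  # anything beyond the last threshold


def zone_counts(distances):
    """Count pages per zone, always reporting all four zones."""
    counts = Counter(zone_for(d) for d in distances)
    return {zone: counts.get(zone, 0) for zone in ZONE_ORDER}


print(zone_counts([0.05, 0.10, 0.20, 0.40, 0.60, 0.90]))
# {'Core': 2, 'Focus': 1, 'Expansion': 1, 'Peripheral': 2}
```

Because the cutoffs apply to actual distances rather than normalized ones, counts from two scans of the same site can be compared directly.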