Sitemap, robots.txt, canonical tags: three files that run your site's SEO.
Sitemap, robots.txt, and canonical tags are three small files that control what Google crawls, indexes, and treats as the authoritative version of your pages. Here's what each one does.
Three files govern most of what search engines do on your site. They are not glamorous. They rarely come up in client meetings. But if any one of them is wrong, Google can crawl the wrong pages, index duplicates, or miss your most important content entirely. Understanding them is not optional — it is the floor beneath every other SEO decision you make.
What is a sitemap and why does it matter?
A sitemap is a file that tells search engines which pages on your site exist and where to find them. It does not guarantee that Google will index every page you list, but it removes the guesswork. Without one, Google discovers your pages by following links — internal links, external links, whatever it can find. That process is slow and incomplete, especially on newer sites or sites with pages that are not well-linked internally.
The most common format is XML. It looks like structured data, not a web page. Each entry in the file contains a URL, and optionally a last-modified date. The last-modified date matters: if you update a service page or a landing page, a fresh date signals to Google that the content has changed and is worth re-crawling.
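To make the format concrete, here is a minimal sketch in Python that builds a two-entry XML sitemap with the standard library. The example.com URLs, the date, and the build_sitemap helper are placeholders invented for illustration. Most CMS platforms generate this file for you.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML for (url, last_modified) pairs.

    last_modified is an ISO date string such as "2024-05-01", or None.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:  # the last-modified date is optional
            ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical pages for a small firm's site.
sitemap_xml = build_sitemap([
    ("https://www.example.com/", "2024-05-01"),
    ("https://www.example.com/services/family-law", None),
])
print(sitemap_xml)
```

Each url element carries one loc and, where you have a trustworthy date, one lastmod. Leaving lastmod out is safer than stamping every page with today's date.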
For a small professional services site — say, a family law firm with fifteen pages — the sitemap is simple. For a multi-location practice with separate pages for each office, each service, and each team member, the sitemap becomes a navigation tool for the crawler. It tells Google: these are the pages we care about. Come here first.
Common mistakes include submitting a sitemap that contains redirected URLs, pages that return 404 errors, or pages blocked by robots.txt. Those are contradictions. You are telling Google to visit a page and simultaneously telling it not to, or sending it somewhere that no longer exists. Google does not penalise you for these directly, but it learns to trust your sitemap less. Fix them whenever you audit.
You submit your sitemap through Google Search Console under the Sitemaps section. That also gives you data: how many URLs you submitted, how many Google has indexed, and whether there are errors. That gap between submitted and indexed is one of the first numbers I check on any new site.
What does robots.txt do, and what does it not do?
Robots.txt is a plain text file that lives at the root of your domain. It tells search engine crawlers which parts of your site they are allowed to access. It is not a security measure. It is a set of polite instructions that well-behaved crawlers follow and bad actors ignore entirely.
The file uses a simple syntax. "User-agent" identifies the crawler — Googlebot, Bingbot, or a wildcard asterisk for all crawlers. "Disallow" marks paths the crawler should skip. "Allow" can carve out exceptions within a blocked directory. That is the whole grammar.
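The grammar is small enough to test by hand. The sketch below feeds a made-up robots.txt to Python's standard-library parser and asks whether a crawler may fetch a few paths. The rules, the paths, and the allowed helper are assumptions for illustration. One caveat: Python's parser applies the first rule that matches, while Google applies the most specific one, so the Allow line is listed first here to get the same answer from both.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt. The Allow exception comes before the
# broader Disallow because Python's parser uses the first matching rule.
ROBOTS_TXT = """\
User-agent: *
Allow: /admin/help/
Disallow: /admin/
Disallow: /search
"""


def allowed(robots_txt, url, user_agent="Googlebot"):
    """Return True if user_agent may crawl url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for path in ("/services/family-law", "/admin/", "/admin/help/faq", "/search?q=divorce"):
    print(path, allowed(ROBOTS_TXT, "https://www.example.com" + path))

# A blanket rule, such as one left over from staging, blocks everything.
STAGING_LEFTOVER = "User-agent: *\nDisallow: /\n"
print("/", allowed(STAGING_LEFTOVER, "https://www.example.com/"))
```

The last check is the one worth running after every launch: if the homepage comes back as not allowed, nothing else on the site is crawlable either.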
Where robots.txt causes real damage is when it accidentally blocks things it should not. A developer sets up a staging environment and adds a broad disallow rule. The site goes live, the rule stays. Suddenly Google cannot crawl the entire site. This happens more often than you would expect. It is also difficult to catch because the site looks fine to you — you are not a crawler.
The other mistake is using robots.txt as a way to hide pages you do not want indexed. Blocking a page in robots.txt does not remove it from the index if Google has already found it from an external link. For that, you need a noindex directive or a canonical tag, and the page has to stay crawlable so Google can read it. Robots.txt only controls access, not indexation.
The right things to block with robots.txt: admin directories, internal search result pages, session ID URLs, and any staging or preview paths that should not appear in search results. The wrong things to block: your service pages, your blog, your contact page, anything you want Google to find.
Check your robots.txt file at yourdomain.com/robots.txt. Read every disallow rule and ask: do I actually want Google to skip this? If you cannot answer that with confidence, you need a technical audit.
What are canonical tags and when do you need them?
A canonical tag is an HTML element that tells search engines which version of a page is the authoritative one. It lives in the head section of a page and points to a URL — either the page itself, or the page you want to be treated as the original.
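For reference, the tag is a single link element. The sketch below, which assumes a made-up page on example.com and a hypothetical find_canonicals helper, pulls every canonical URL out of a page's HTML using Python's standard library. A healthy page should return exactly one.

```python
from html.parser import HTMLParser

# Hypothetical page source with one canonical tag in the head.
PAGE_HTML = """
<html>
  <head>
    <title>Family Law</title>
    <link rel="canonical" href="https://www.example.com/services/family-law">
  </head>
  <body><h1>Family Law</h1></body>
</html>
"""


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonicals.append(attr["href"])


def find_canonicals(html):
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals


print(find_canonicals(PAGE_HTML))
```

An empty list means the page declares no canonical. Two or more entries usually point to a plugin or template conflict worth investigating.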
Duplicate content is more common than most site owners realise. A single page might be accessible at up to eight different URLs: with and without www, with and without a trailing slash, over HTTP and over HTTPS. To you, those are the same page. To a crawler, they can look like eight separate pages with identical content. Google has to decide which one to rank. If you do not tell it, it guesses, and it sometimes guesses wrong.
Canonical tags solve this by declaring the winner. Every version of the page points to the canonical URL. Google consolidates the signals — the links, the engagement data, the crawl history — around that one URL.
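The arithmetic is easy to see in code. Three two-way choices (scheme, www, trailing slash) multiply out to eight variants of one page, and a canonical rule collapses them to one. This is a rough sketch: the preferred scheme and host, the /services path, and the canonical_url helper are assumptions for illustration, not a recommendation for any particular site.

```python
from itertools import product
from urllib.parse import urlsplit, urlunsplit

# Assumed house preference: HTTPS, www, no trailing slash.
PREFERRED_SCHEME = "https"
PREFERRED_HOST = "www.example.com"


def canonical_url(url):
    """Map any variant of a page on this site to its preferred URL."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"  # keep "/" for the homepage
    return urlunsplit((PREFERRED_SCHEME, PREFERRED_HOST, path, parts.query, ""))


# Every combination of scheme, host, and trailing slash for one page.
variants = [
    f"{scheme}://{host}{path}"
    for scheme, host, path in product(
        ("http", "https"),
        ("example.com", "www.example.com"),
        ("/services", "/services/"),
    )
]

winners = {canonical_url(v) for v in variants}
print(len(variants), "variants ->", winners)
```

In practice the non-preferred variants should also redirect to the winner. The canonical tag is the fallback for cases a redirect does not cover.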
This also applies to paginated content, filtered product or service listings, and syndicated blog posts. If your practice area page has a URL parameter for sorting or filtering, those filtered versions are technically different URLs with largely the same content. A canonical tag on each filtered version pointing back to the base URL keeps Google from fragmenting your authority across dozens of URL variants.
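A sketch of that rule for filtered listings: drop the parameters that only reorder or narrow the same content, and what remains is the URL the canonical tag should point to. The parameter names sort and filter, the example URLs, and the canonical_for helper are assumptions. Your CMS may use different names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names of parameters that do not change what the page is about.
IGNORED_PARAMS = {"sort", "filter"}


def canonical_for(url):
    """Return url without sorting or filtering parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(canonical_for("https://www.example.com/family-law?sort=newest&filter=divorce"))
print(canonical_for("https://www.example.com/family-law?lang=en&sort=newest"))
```

Parameters that genuinely change the content, such as a language switch, are kept, so those pages keep their own canonical.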
The most damaging canonical mistake is a canonical tag that is meant to be self-referencing but points to the wrong URL. This happens when a CMS generates canonical tags automatically based on a template and the template has an error. Every page on the site canonicalises to the homepage, or to a URL with a typo. Google can follow the canonical, consolidate every page into that one URL, and your site effectively disappears from search. I have seen this on sites that had ranked well for years before a CMS migration broke the template.
For most professional services sites, canonical tags should be self-referencing on all primary pages — each page declares itself as its own canonical. That is clean, unambiguous, and leaves no room for consolidation errors.
How the three files work together
These three files operate as a system. Configured correctly, they reinforce each other. Configured carelessly, they contradict each other.
Here is the logic chain. Robots.txt controls whether a crawler can visit a URL. If a page is blocked in robots.txt, Google cannot read its canonical tag. So a canonical tag on a blocked page is invisible — it does nothing. Similarly, if a page in your sitemap is blocked by robots.txt, you are creating noise: Google sees the page listed as important but cannot access it.
The sitemap should contain only canonical URLs — the definitive versions of each page. Non-canonical pages, redirects, and blocked pages have no place in a sitemap. When your sitemap, robots.txt, and canonical tags all point in the same direction, Google gets a consistent signal and trusts it. When they contradict each other, Google spends crawl budget resolving the conflict instead of indexing your content.
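The consistency rule can be checked mechanically. The sketch below takes a list of sitemap URLs, the text of a robots.txt, and a mapping from each page to the canonical it declares, then reports any sitemap entry that is blocked or that canonicalises somewhere else. All of the data and the find_conflicts helper are invented for illustration. A real audit would collect these inputs by crawling the site.

```python
from urllib.robotparser import RobotFileParser


def find_conflicts(sitemap_urls, robots_txt, canonicals, user_agent="Googlebot"):
    """Return (url, reason) pairs for sitemap entries that send mixed signals.

    canonicals maps a page URL to the canonical URL that page declares.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    conflicts = []
    for url in sitemap_urls:
        if not parser.can_fetch(user_agent, url):
            conflicts.append((url, "listed in sitemap but blocked by robots.txt"))
        declared = canonicals.get(url)
        if declared is not None and declared != url:
            conflicts.append((url, f"listed in sitemap but canonicalises to {declared}"))
    return conflicts


# Hypothetical inputs.
robots_txt = "User-agent: *\nDisallow: /admin/\n"
sitemap_urls = [
    "https://www.example.com/",
    "https://www.example.com/admin/login",
    "https://www.example.com/family-law?sort=newest",
]
canonicals = {
    "https://www.example.com/": "https://www.example.com/",
    "https://www.example.com/family-law?sort=newest": "https://www.example.com/family-law",
}

conflicts = find_conflicts(sitemap_urls, robots_txt, canonicals)
for url, reason in conflicts:
    print(url, "->", reason)
```

An empty result is the goal: every URL in the sitemap is crawlable and names itself as canonical.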
For a firm like McShanes Solicitors, getting these three files aligned was part of the foundation work before anything else moved. You cannot build authority on top of a crawl structure that confuses the search engine about which pages exist and which ones matter.
If you want to understand how this fits into a complete site health picture, Search Foundations covers the full diagnostic process — from crawl configuration through to structured data and page-level signals.
How to audit these files yourself
You do not need specialist tools to start. Open your sitemap (usually at yourdomain.com/sitemap.xml), read your robots.txt (yourdomain.com/robots.txt), and check the canonical tag on three or four key pages using your browser's View Source option — look in the head section for rel="canonical".
Ask these questions as you go. Does your sitemap load without errors? Are all the URLs in it currently live and returning a 200 status? Does your robots.txt have any disallow rules that could be blocking important pages? On each key page, does the canonical tag point to the correct URL — no typos, no HTTP instead of HTTPS, no www versus non-www mismatch?
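The canonical questions in that checklist translate into a few string comparisons. This is a minimal sketch, assuming you have already copied the page URL and the canonical href out of View Source. The canonical_problems helper and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def _bare_host(host):
    """Lower-case a host name and drop a leading 'www.'."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def canonical_problems(page_url, canonical_href):
    """Return a list of likely mistakes in a page's canonical tag."""
    page, canon = urlsplit(page_url), urlsplit(canonical_href)
    if not canon.scheme or not canon.netloc:
        return ["canonical is not an absolute URL"]
    problems = []
    if canon.scheme != "https":
        problems.append("canonical uses HTTP instead of HTTPS")
    if canon.netloc.lower() != page.netloc.lower():
        if _bare_host(canon.netloc) == _bare_host(page.netloc):
            problems.append("www versus non-www mismatch")
        else:
            problems.append("canonical points to a different host")
    if canon.path != page.path:
        problems.append("canonical path differs from the page path")
    return problems


print(canonical_problems("https://www.example.com/contact",
                         "https://www.example.com/contact"))
print(canonical_problems("https://www.example.com/contact",
                         "http://example.com/contact"))
```

A path difference is not always an error, since filtered or duplicate pages point elsewhere on purpose. On a primary page, though, it usually means a typo or a broken template.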
For a more thorough picture, tools like Screaming Frog crawl your site the way Google does and flag contradictions between your sitemap, robots.txt, and canonical tags. Google Search Console's Page indexing report (formerly called Coverage) shows you which pages are indexed, which are excluded, and why. Those two sources together give you enough to act on.
Technical SEO intersects with performance in ways that matter. A site that loads slowly compounds crawl budget problems, because Google crawls fewer pages per session on a slow site. "Core Web Vitals: the three numbers that decide if Google bothers" covers that connection in detail. And if you want to understand why site speed ties directly to revenue, not just rankings, "Why your slow site is a sales problem, not an IT problem" is worth reading alongside this one.
Where this breaks down
Getting these three files right does not rank your site. It removes the technical obstacles that prevent your site from being ranked. If your content is thin, your pages are poorly structured, or your site has no external signals pointing to it, fixing robots.txt will not move the needle. These files are necessary, not sufficient. Think of them as the plumbing — invisible when working, catastrophic when broken.
Things readers usually ask.
- Do I need a sitemap if my site is small?
- Yes, even a five-page site benefits from a sitemap. It removes ambiguity about which pages exist and tells Google when they were last updated, which speeds up re-indexing after changes.
- Can robots.txt block Google from indexing a page?
- Robots.txt blocks Google from crawling a page, not from indexing it. If Google has already found the page from an external link, it can still appear in search results even with a disallow rule. To prevent indexation, use a noindex directive on a page Google is allowed to crawl, because a blocked page's noindex is never read.
- What happens if my canonical tag points to the wrong URL?
- Google follows the canonical and consolidates all signals around the URL you specified, even if it is wrong. The page you actually want ranked loses authority and may not appear in search results at the right URL.
- How often should I update my sitemap?
- Your sitemap should update automatically whenever you publish or significantly change a page — most CMS platforms handle this. If your sitemap is static and manually maintained, review it whenever you add or remove pages.
- Is it bad to have the same page accessible at multiple URLs?
- It creates a duplicate content problem if left uncorrected. Canonical tags resolve this by designating one URL as authoritative, which consolidates ranking signals and prevents Google from splitting attention across multiple versions of the same page.