Google primarily crawls through a website’s internal link structure, not the sitemap. This is why it’s important for your website to be properly linked internally and why reliance on the sitemap for crawling should be kept to a minimum. Let’s find out more.
If you’re an SEO strategist, a crawling strategy needs to be part of your primary goals in the first few weeks of a project. It is essential to confirm that Google can properly access your website’s pages and code, and that critical sections of each page are being rendered correctly.
Google renders all the code and content of a page, analyzes headings and then assesses what the page is about. It is imperative for SEO strategists and web developers to make sure that this is happening properly.
Primary crawling: through internal links
Contrary to the belief of many “SEO experts,” Google’s primary source of crawling is not your sitemap but your website’s live structure. The bot follows your internal links and attempts to discover every page it can reach, prioritizing links in the main navigation.
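To make this concrete, here is a minimal sketch of link-based discovery using Python’s standard library. The start URL and page limit are placeholder assumptions, and a real crawler like Googlebot also weighs robots.txt rules, canonicals and crawl budget, so treat this as an illustration of the principle rather than a model of Googlebot itself.

```python
# Minimal sketch of link-based page discovery: start at the home page,
# follow internal links and collect every same-domain URL that can be
# reached. The start URL and page limit are placeholder assumptions.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

START_URL = "https://www.example.com/"   # placeholder start page
MAX_PAGES = 200                          # safety limit for the sketch

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def discover(start_url, max_pages=MAX_PAGES):
    domain = urlparse(start_url).netloc
    seen, queue = {start_url}, deque([start_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue  # skip pages that fail to load
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href).split("#")[0]
            # Only follow links that stay on the same domain
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen

if __name__ == "__main__":
    pages = discover(START_URL)
    print(f"Discovered {len(pages)} internally linked pages")
```

If a page never shows up in a discovery pass like this, it is an orphan page: nothing links to it, so a link-following crawler has no path to find it.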
So how can you figure out whether crawling is occurring correctly?
- If the option is available in your SEO crawler, either disable sitemap crawling or do not add a sitemap link (in SiteBulb, this option is under “crawl sources” in the audit settings). Then initiate a crawl and see whether the total number of crawled pages matches the total number of pages you know to exist on the website. If the totals do not match, some pages are not reachable through internal links alone, which means your website is not linked correctly through its internal linking network.
- In Google Search Console, paste newly launched URLs into the URL inspection bar at the top to see whether Googlebot has discovered them. Run this test on a sample of pages, and disregard URLs reported as discovered through the sitemap, since you are testing discovery through internal links.
- You can check whether Googlebot is actually visiting your website by reviewing your server access logs, either manually or with an automated check (see the log check sketch after this list).
- You also need to check that critical content on the page is being rendered properly. You can do this by taking a sentence from the front end and searching for it in the page’s “view source” code (see the source check sketch after this list).
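For the log check mentioned above, the simplest signal is requests in your server access logs whose user agent identifies as Googlebot. The sketch below assumes a combined-format access log at a placeholder path; because the user agent string can be spoofed, treat matches as a first pass and verify important cases with a reverse DNS lookup, as Google recommends.

```python
# Count recent requests whose user agent identifies as Googlebot.
# The log path and combined log format are assumptions; adjust both to
# match your server. User agents can be spoofed, so treat this as a
# first check rather than proof of a genuine Googlebot visit.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # placeholder path

googlebot_hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="ignore") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        # In the combined log format, the requested path sits inside the
        # quoted request string, e.g. "GET /page HTTP/1.1".
        match = re.search(r'"[A-Z]+ (\S+) HTTP/', line)
        if match:
            googlebot_hits[match.group(1)] += 1

for path, hits in googlebot_hits.most_common(20):
    print(f"{hits:5d}  {path}")
```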
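For the source check, the sketch below fetches a page’s raw HTML (what “view source” shows, before any JavaScript runs) and reports whether a chosen sentence appears in it. The URL and sentence are placeholders; if the sentence is missing from the raw source, the content likely depends on client-side rendering, so verify how Googlebot sees it in the URL Inspection tool.

```python
# Check whether a critical sentence from the rendered page is present
# in the raw HTML source, before any JavaScript runs.
# The URL and sentence below are placeholder assumptions.
from urllib.request import Request, urlopen

PAGE_URL = "https://www.example.com/pricing"   # placeholder URL
SENTENCE = "Plans start at $49 per month"      # placeholder sentence

request = Request(PAGE_URL, headers={"User-Agent": "source-check-script"})
raw_html = urlopen(request, timeout=10).read().decode("utf-8", "ignore")

if SENTENCE.lower() in raw_html.lower():
    print("Sentence found in the raw HTML source.")
else:
    print("Sentence NOT found in the raw source; it may only appear after "
          "JavaScript rendering, so double-check how Googlebot sees the page.")
```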
Secondary crawling: sitemap.xml
While primary crawling occurs through the website’s internal links, Google also uses the sitemap to find new URLs that need to be crawled but may have been missed during internal link discovery.
According to Google, sitemap optimization is less important for websites that are properly internally linked and have a very small number of pages than it is for websites with a significant number of pages (think of websites with 10,000+ pages, such as ecommerce or news sites).
Remember that if essential pages are included in the sitemap but not linked through the internal linking structure, the crawl rate for those URLs may be reduced: Google might conclude that you are not internally linking to them because you do not consider them very valuable for users (remember that Google is heavily focused on helpful content).
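One way to catch such sitemap-only URLs is to compare the sitemap against the pages your SEO crawler reached through internal links alone. The sketch below is a minimal example of that comparison; the sitemap URL, the exported crawl file and the flat `<urlset>` format are all assumptions about your setup.

```python
# Flag URLs that appear in sitemap.xml but were not reached through
# internal links. Assumes a plain <urlset> sitemap (not a sitemap index)
# and a text file of internally crawled URLs exported from your SEO
# crawler, one URL per line; both locations are placeholders.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

SITEMAP_URL = "https://www.example.com/sitemap.xml"  # placeholder sitemap
CRAWL_EXPORT = "crawled_urls.txt"                    # placeholder crawl export
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urlopen(SITEMAP_URL, timeout=10) as response:
    tree = ET.parse(response)
sitemap_urls = {loc.text.strip() for loc in tree.findall(".//sm:loc", NS) if loc.text}

with open(CRAWL_EXPORT, encoding="utf-8") as export:
    crawled_urls = {line.strip() for line in export if line.strip()}

orphaned = sitemap_urls - crawled_urls
print(f"{len(orphaned)} sitemap URLs were not found through internal links:")
for url in sorted(orphaned):
    print(" ", url)
```

Any URL this flags is worth either linking from a relevant page or removing from the sitemap if it no longer deserves to rank.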
How the robots.txt file fits into crawling
The primary purpose of the robots.txt file is to manage crawl requests to a website. It can also be used to keep media files out of search engines and to block specific JavaScript that can cause trouble for a website’s SEO, such as scripts that power announcement bars.
It must be noted that disallowing necessary JavaScript or CSS files in the robots.txt file can prevent a page from rendering properly for Googlebot, which can quite possibly hurt it in search results.
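To sanity-check your rules before shipping them, Python’s built-in robots.txt parser can report whether a given user agent is allowed to fetch a given URL under your live robots.txt. The site and sample paths below are placeholder assumptions.

```python
# Check which paths Googlebot is allowed to fetch under the site's
# robots.txt rules, using Python's built-in parser.
# The site and the sample paths are placeholder assumptions.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://www.example.com/robots.txt")
parser.read()

sample_paths = [
    "https://www.example.com/",
    "https://www.example.com/assets/app.js",      # critical JavaScript
    "https://www.example.com/assets/styles.css",  # critical CSS
    "https://www.example.com/scripts/announcement-bar.js",
]

for url in sample_paths:
    allowed = parser.can_fetch("Googlebot", url)
    status = "allowed" if allowed else "BLOCKED"
    print(f"{status:8s} {url}")
```

If a critical JavaScript or CSS file shows up as blocked, fix the rule before Googlebot’s next render of the page.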
Need help with improving your site health and linking logic? Start with an SEO audit.
Frequently asked questions
Are your pages being crawled and discovered but not being indexed?
Google does not guarantee the indexation of any URL. Websites that publish a lot of thin content very quickly will find that many of their pages are not indexed. Only a percentage of a website’s pages will be indexed, a share determined by Google’s algorithms.
How can you increase the likelihood of a page being crawled?
You can increase a page’s likelihood of being crawled by adding more internal links to it. You should have links to these pages from the index (home page), which is the most important page of the website. Both internal and external signals pointing to a page increase its probability of being indexed.
What is crawl rate?
Crawl rate is the frequency at which the Googlebot crawler visits the pages of your website. Google’s algorithms determine it by analyzing each website individually.
Can Google crawl Yoast sitemaps?
Yoast is a wonderful plugin for WordPress that generates and updates sitemaps, and it automatically splits large sitemaps into smaller ones. Google can fully crawl these sitemaps and the URLs listed inside them.
Can Productive Shop Inc. help me audit my site for correct internal linking?
Yes, we provide an internal linking service. We can fully analyze your website’s code and how Googlebot sees your website, helping you align better with standard Google guidelines. Reach out today.