John Mueller of Google wrote a very detailed and honest explanation of why Google (and third-party SEO tools) don’t crawl and index every URL or link on the web. He explained that crawling isn’t objective, it’s expensive, it can be inefficient, the web changes a lot, there’s spam and junk, and all of that has to be taken into account.

John wrote this detailed response on Reddit in answer to the question “Why SEO tools don’t show all backlinks?” But he answered it from a Google Search perspective. He said:

There’s no objective way to crawl the web properly.

It’s theoretically impossible to crawl it all, since the number of actual URLs is effectively infinite. Since nobody can afford to keep an infinite number of URLs in a database, all web crawlers make assumptions, simplifications, and guesses about what is realistically worth crawling.

And even then, for practical purposes, you can’t crawl all of that all the time, the internet doesn’t have enough connectivity & bandwidth for that, and it costs a lot of money if you want to access a lot of pages regularly (for the crawler, and for the site’s owner).

Past that, some pages change quickly, others haven’t changed for 10 years, so crawlers try to save effort by focusing more on the pages that they expect to change, rather than those that they expect not to change.

And then, we touch on the part where crawlers try to figure out which pages are actually useful. The web is filled with junk that nobody cares about, pages that have been spammed into uselessness. These pages may still regularly change, they may have reasonable URLs, but they’re just destined for the landfill, and any search engine that cares about their users will ignore them. Sometimes it’s not just obvious junk either. More & more, sites are technically ok, but just don’t reach “the bar” from a quality point of view to merit being crawled more.

Therefore, all crawlers (including SEO tools) work on a very simplified set of URLs, they have to work out how often to crawl, which URLs to crawl more often, and which parts of the web to ignore. There are no fixed rules for any of this, so every tool will have to make their own decisions along the way. That’s why search engines have different content indexed, why SEO tools list different links, why any metrics built on top of these are so different.
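
To make the trade-offs John describes a bit more concrete, here is a minimal toy sketch of a crawl planner working under a fixed budget. Everything in it is an assumption for illustration: the `Page` fields, the `plan_crawl` function, the `min_quality` threshold, and the priority formula are hypothetical and do not reflect how Google or any SEO tool actually works.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    change_rate: float  # hypothetical estimate: expected changes per day
    quality: float      # hypothetical score from 0.0 (junk) to 1.0 (useful)


def plan_crawl(pages, budget, min_quality=0.3):
    """Pick at most `budget` URLs to fetch, favoring pages likely to have changed."""
    # Pages below the quality bar are never scheduled, however often they change.
    candidates = [p for p in pages if p.quality >= min_quality]
    # Illustrative heuristic only: expected change rate weighted by quality.
    top = heapq.nlargest(budget, candidates, key=lambda p: p.change_rate * p.quality)
    return [p.url for p in top]


pages = [
    Page("https://example.com/news", change_rate=5.0, quality=0.9),
    Page("https://example.com/about", change_rate=0.01, quality=0.8),
    Page("https://example.com/spam", change_rate=9.0, quality=0.1),
    Page("https://example.com/blog", change_rate=0.5, quality=0.7),
]

print(plan_crawl(pages, budget=2))
# ['https://example.com/news', 'https://example.com/blog']
```

With the default threshold, the frequently changing spam page is never crawled. A tool that set `min_quality=0.0` would pick the spam page instead of the blog, which is one simple way two crawlers can end up with different sets of pages and links.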

I thought it would be good to highlight this because it is useful for SEOs to read this and understand it.

Forum discussion at Reddit.