How Google’s Site Crawlers Index Your Site – Google Search
To help users quickly find the information they need, our crawlers gather information from hundreds of billions of web pages and organize it in the Search index.
Google Search basics
Each new crawl uses the list of web addresses gathered during previous crawls along with the Sitemap files provided by site owners. As the crawler visits sites, it follows the links on them to other pages. It pays special attention to new and changed sites as well as dead links, and it determines on its own which sites to crawl, how often to do so, and how many pages to fetch from each one.
With Search Console, site owners can specify exactly how their sites should be crawled: they can provide detailed instructions for processing pages, request a recrawl, or block crawling altogether using a robots.txt file. Google does not accept payment to crawl any site more frequently; to keep search results as useful as possible, every site owner gets the same tools.
Finding information by crawling
The web is like a library that holds billions of volumes and keeps growing, but has no central filing system. To find publicly available pages, we use software known as web crawlers. Crawlers analyze pages and follow the links on them, just as regular users do, and then send data about those pages back to Google’s servers.
Organizing information by indexing
During a crawl, our systems render page content just as a browser does, take note of key signals such as keywords and content freshness, and use that information to build the Search index.
The Google Search index contains hundreds of billions of pages and is well over 100 million gigabytes in size. It works like the index at the back of a book, with an entry for every word on every page we have indexed: when a page is indexed, it is added to the entries for each word it contains.
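That “index at the back of a book” analogy corresponds to an inverted index. As a toy illustration only (not a description of Google’s actual systems), an inverted index can be sketched in a few lines of Python:

```python
from collections import defaultdict

# Toy inverted index: each word maps to the set of page URLs containing it.
# The pages below are made-up examples for illustration.
pages = {
    "https://example.com/a": "web crawlers fetch pages",
    "https://example.com/b": "crawlers index pages for search",
}

index = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

# Looking up a word returns every page whose entry contains it.
print(sorted(index["crawlers"]))
# ['https://example.com/a', 'https://example.com/b']
```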
Building the Knowledge Graph is a more modern way of understanding what users are interested in than simple keyword matching. To do this, we organize not only page data but other types of information as well. Today, Google Search can find a passage of text in millions of books from major libraries, look up public transit schedules, and surface data from public sources such as the World Bank website.
Website Crawling: A Guide on Everything You Need to Know – Sovrn
Understanding website crawling and how search engines crawl and index websites can be a confusing topic. Everyone does it a little bit differently, but the overall concepts are the same. Here is a quick breakdown of things you should know about how search engines crawl your website. (I’m not getting into the algorithms, keywords, or any of that stuff, simply how search engines crawl sites.)
So what is website crawling?
Website Crawling is the automated fetching of web pages by a software process, the purpose of which is to index the content of websites so they can be searched. The crawler analyzes the content of a page looking for links to the next pages to fetch and index.
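To make that definition concrete, here is a minimal, hypothetical crawler sketch in Python. It assumes the third-party requests and beautifulsoup4 packages and skips the politeness, deduplication, and scale concerns a real crawler needs:

```python
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=50):
    """Fetch pages breadth-first, collect their text, and follow their links."""
    seen, queue, indexed = set(), [start_url], {}
    domain = urlparse(start_url).netloc
    while queue and len(indexed) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip pages that fail to load
        soup = BeautifulSoup(response.text, "html.parser")
        indexed[url] = soup.get_text(" ", strip=True)   # text to be indexed
        for link in soup.find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            if urlparse(absolute).netloc == domain:     # stay on the same site
                queue.append(absolute)
    return indexed
```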
What types of crawls are there?
Two of the most common types of crawls that get content from a website are:
Site crawls are an attempt to crawl an entire site at one time, starting with the home page. The crawler grabs links from that page and follows them to the rest of the site’s content. This is often called “spidering”.
Page crawls, which are the attempt by a crawler to crawl a single page or blog post.
Are there different types of crawlers?
There definitely are different types of crawlers. But one of the most important questions is, “What is a crawler?” A crawler is a software process that goes out to websites and requests their content just as a browser would. After that, an indexing process picks out the content it wants to save. Typically, the content that is indexed is any text visible on the page.
Different search engines and technologies have different methods of getting a website’s content with crawlers:
Crawls can get a snapshot of a site at a specific point in time, and then periodically recrawl the entire site. This is typically considered a “brute force” approach as the crawler is trying to recrawl the entire site each time. This is very inefficient for obvious reasons. It does, though, allow the search engine to have an up-to-date copy of pages, so if the content of a particular page changes, this will eventually allow those changes to be searchable.
Single page crawls allow you to only crawl or recrawl new or updated content. There are many ways to find new or updated content. These can include sitemaps, RSS feeds, syndication and ping services, or crawling algorithms that can detect new content without crawling the entire site.
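As a rough sketch of the “find new or updated content” idea, a recrawler can compare the lastmod dates in an XML sitemap against the time of its previous crawl. The sitemap URL and dates below are placeholder assumptions:

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
import requests

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_changed_since(sitemap_url, last_crawl):
    """Return sitemap URLs whose <lastmod> is newer than the previous crawl."""
    root = ET.fromstring(requests.get(sitemap_url, timeout=10).content)
    changed = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if not lastmod:
            continue
        modified = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
        if modified.tzinfo is None:
            modified = modified.replace(tzinfo=timezone.utc)  # date-only entries
        if modified > last_crawl:
            changed.append(loc)
    return changed

# Hypothetical usage: recrawl only what changed since the last crawl.
# urls_changed_since("https://example.com/sitemap.xml",
#                    datetime(2024, 1, 1, tzinfo=timezone.utc))
```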
Can crawlers always crawl my site?
That’s what we strive for at sovrn, but it isn’t always possible. Typically, any difficulty crawling a website has more to do with the site itself and less with the crawler attempting to crawl it. The following issues could cause a crawler to fail:
The site owner denies indexing and/or crawling using a robots.txt file.
The page itself may indicate that it’s not to be indexed and that its links should not be followed (directives embedded in the page code). These directives are “meta” tags that tell the crawler how it is allowed to interact with the site.
The site owner blocked a specific crawler IP address or “user agent”.
All of these methods are usually employed to save bandwidth for the owner of the website, or to prevent malicious crawler processes from accessing content. Some site owners simply don’t want their content to be searchable. One would do this kind of thing, for example, if the site was primarily a personal site, and not really intended for a general audience.
I think it is also important to note here that robots.txt and meta directives are really just a “gentlemen’s agreement”, and there’s nothing to prevent a truly impolite crawler from crawling anyway. sovrn’s crawlers are polite, and will not request pages that have been blocked by robots.txt or meta directives.
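As an example of how a polite crawler honors that agreement, Python’s standard library ships a robots.txt parser; the crawler name and URLs below are placeholders:

```python
from urllib.robotparser import RobotFileParser

# A polite crawler checks robots.txt before requesting a page.
# "MyCrawler" and the example.com URLs are placeholder values.
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

if robots.can_fetch("MyCrawler", "https://example.com/private/page.html"):
    print("Allowed to fetch")
else:
    print("Blocked by robots.txt - a polite crawler skips this URL")
```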
How do I optimize my website so it is easy to crawl?
There are steps you can take to build your website in such a way that it is easier for search engines to crawl it and provide better search results. The end result will be more traffic to your site and readers who can find your content more effectively.
Search Engine Accessibility Tips:
Having an RSS feed (or feeds), so that when you create new content, search software can recognize it and crawl it faster. sovrn uses the feeds on your site as an indicator that you have new content available.
Be selective when blocking crawlers using robots.txt files or meta tag directives in your content. Most blog platforms allow you to customize this feature in some way. A good strategy is to let in the search engines you trust, and block those you don’t.
Building a consistent document structure. This means constructing your HTML pages so that the content you want crawled is consistently in the same place, under the same content section.
Having content and not just images on a page. Search engines can’t find an image unless you provide text or alt tag descriptions for that image.
Try (within the limits of your site design) to have links between pages so the crawler can quickly learn that those pages exist. If you’re running a blog, you might, for example, have an archive page with links to every post. Most blogging platforms provide such a page. A sitemap page is another way to let a crawler know about lots of pages at once.
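A sitemap page can be as simple as a generated XML file that lists every URL you want discovered. Here is a minimal sketch using Python’s standard library; the URLs are hypothetical:

```python
import xml.etree.ElementTree as ET

def write_sitemap(urls, path="sitemap.xml"):
    """Write a minimal XML sitemap so crawlers can discover many pages at once."""
    urlset = ET.Element("urlset",
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    ET.ElementTree(urlset).write(path, encoding="utf-8", xml_declaration=True)

# Hypothetical post URLs for illustration.
write_sitemap(["https://example.com/", "https://example.com/post-1"])
```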
To learn more about configuring robots.txt and how to manage it for your site, contact us here at sovrn. We want you to be a successful blogger, and understanding website crawling is one of the most important steps.
How to Crawl a Website with DeepCrawl
Running frequent and targeted crawls of your website is a key part of improving its technical health and improving rankings in organic search. In this guide, you’ll learn how to crawl a website efficiently and effectively with DeepCrawl. The six steps to crawling a website include:
Configuring the URL sources
Understanding the domain structure
Running a test crawl
Adding crawl restrictions
Testing your changes
Running your crawl
Step 1: Configuring the URL sources
There are six types of URL sources you can include in your DeepCrawl projects.
Including each one strategically is the key to an efficient and comprehensive crawl:
Web crawl: Crawl only the site by following its links to deeper levels.
Sitemaps: Crawl a set of sitemaps, and the URLs in those sitemaps. Links on these pages will not be followed or crawled.
Analytics: Upload analytics source data, and crawl the URLs, to discover additional landing pages on your site which may not be linked. The analytics data will be available in various reports.
Backlinks: Upload backlink source data, and crawl the URLs, to discover additional URLs with backlinks on your site. The backlink data will be available in various reports.
URL lists: Crawl a fixed list of URLs. Links on these pages will not be followed or crawled.
Log files: Upload log file summary data from log file analyser tools, such as Splunk.
Ideally, a website should be crawled in full (including every linked URL on the site). However, very large websites, or sites with many architectural problems, may not be able to be fully crawled immediately. It may be necessary to restrict the crawl to certain sections of the site, or limit specific URL patterns (we’ll cover how to do this below).
Step 2: Understanding the Domain Structure
Before starting a crawl, it’s a good idea to get a better understanding of your site’s domain structure:
Check the www/non-www and http/https configuration of the domain when you add the domain.
Identify whether the site is using sub-domains.
If you are not sure about sub-domains, check the DeepCrawl “Crawl Subdomains” option and they will automatically be discovered if they are linked.
Step 3: Running a Test Crawl
Start with a small “Web Crawl” to look for signs that the site is uncrawlable.
Before starting the crawl, ensure that you have set the “Crawl Limit” to a low quantity. This will make your first checks more efficient, as you won’t have to wait very long to see the results.
Problems to watch for include:
A high number of URLs returning error codes, such as 401 access denied
URLs returned that are not of the correct subdomain – check that the base domain is correct under “Project Settings”.
A very low number of URLs found.
A large number of failed URLs (502, 504, etc).
A large number of canonicalized URLs.
A large number of duplicate pages.
A significant increase in the number of pages found at each level.
To save time and check for obvious problems immediately, download the URLs during the crawl.
Step 4: Adding Crawl Restrictions
Next, reduce the size of the crawl by identifying anything that can be excluded. Adding restrictions ensures you are not wasting time (or credits) crawling URLs that are not important to you. All the following restrictions can be added within the “Advanced Settings” tab.
Remove Parameters
If you have excluded any parameters from search engine crawls with URL parameter tools like Google Search Console, enter these in the “Remove Parameters” field under “Advanced Settings.”
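Conceptually, removing a parameter means treating URLs that differ only by that query argument as the same page. A hedged Python illustration (the parameter names are made up, not a DeepCrawl setting):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

IGNORED_PARAMS = {"utm_source", "utm_medium", "sessionid"}  # hypothetical list

def normalize(url):
    """Drop ignored query parameters so parameter-only variants collapse together."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(normalize("https://example.com/page?utm_source=news&id=7"))
# https://example.com/page?id=7
```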
Add Custom Settings
DeepCrawl’s “Robots Overwrite” feature allows you to identify additional URLs that can be excluded using a custom robots.txt file – allowing you to test the impact of pushing a new robots.txt file to a live environment.
Upload the alternative version of your robots.txt file under “Advanced Settings” and select “Use Robots Override” when starting the crawl.
Filter URLs and URL Paths
Use the “Included/Excluded” URL fields under “Advanced Settings” to limit the crawl to specific areas of interest.
Add Crawl Limits for Groups of Pages
Use the “Page Grouping” feature, under “Advanced Settings,” to restrict the number of URLs crawled for groups of pages based on their URL patterns.
Here, you can add a name for each group of pages.
In the “Page URL Match” column you can add a regular expression.
Add a maximum number of URLs to crawl in the “Crawl Limit” column.
URLs matching the designated path are counted. When the limit has been reached, all further matching URLs go into the “Page Group Restrictions” report and are not crawled.
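In principle, page grouping works like the following sketch: URLs matching a pattern count against a limit, and overflow URLs are reported rather than crawled. The pattern and limit are invented for illustration and say nothing about DeepCrawl’s internals:

```python
import re

# Hypothetical group: crawl at most 100 URLs matching the blog pattern.
page_group = {"name": "Blog posts",
              "pattern": re.compile(r"/blog/.*"),
              "crawl_limit": 100}

crawled, restricted = 0, []
for url in ["https://example.com/blog/post-1", "https://example.com/about"]:
    if page_group["pattern"].search(url):
        if crawled < page_group["crawl_limit"]:
            crawled += 1            # URL is crawled and counted against the limit
        else:
            restricted.append(url)  # over the limit: reported, not crawled
```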
Step 5: Testing Your Changes
Run test “Web Crawls” to ensure your configuration is correct and you’re ready to run a full crawl.
Step 6: Running your Crawl
Ensure you’ve increased the “Crawl Limit” before running a more in-depth crawl.
Consider running a crawl with as many URL sources as possible, to supplement your linked URLs with XML Sitemap, Google Analytics, and other data.
If you have specified the www subdomain within the “Base Domain” setting, other subdomains such as blog or default will not be crawled.
To include subdomains select “Crawl Subdomains” within the “Project Settings” tab.
Set “Scheduling” for your crawls and track your progress.
Handy Tips
Settings for Specific Requirements
If you have a test/sandbox site, you can run a “Comparison Crawl” by adding your test site domain and authentication details in “Advanced Settings.”
For more about the Test vs Live feature, check out our guide to Comparing a Test Website to a Live Website.
To crawl an AJAX-style website, with an escaped fragment solution, use the “URL Rewrite” function to modify all linked URLs to the escaped fragment format.
Read more about our testing features – Testing Development Changes Before Putting Them Live.
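For reference, the escaped-fragment convention mentioned above rewrites hash-bang (#!) URLs into a crawlable query-string form. A minimal sketch of that rewrite in Python:

```python
from urllib.parse import quote

def to_escaped_fragment(url):
    """Rewrite an AJAX hash-bang URL (#!) to the _escaped_fragment_ form."""
    if "#!" in url:
        base, fragment = url.split("#!", 1)
        separator = "&" if "?" in base else "?"
        return f"{base}{separator}_escaped_fragment_={quote(fragment, safe='')}"
    return url

print(to_escaped_fragment("https://example.com/page#!section=news"))
# https://example.com/page?_escaped_fragment_=section%3Dnews
```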
Changing Crawl Rate
Watch for performance issues caused by the crawler while running a crawl.
If you see connection errors, or multiple 502/503 type errors, you may need to reduce the crawl rate under “Advanced Settings.”
If you have a robust hosting solution, you may be able to crawl the site at a faster rate.
The crawl rate can be increased at times when the site load is reduced – 4 a.m., for example.
Head to “Advanced Settings” > “Crawl Rate” > “Add Rate Restriction.”
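Outside DeepCrawl’s UI, lowering the crawl rate simply means spacing requests out. A generic sketch of that throttling idea (the rate value is arbitrary):

```python
import time
import requests

def fetch_all(urls, requests_per_second=2.0):
    """Fetch URLs no faster than the given rate to avoid overloading the server."""
    delay = 1.0 / requests_per_second
    responses = []
    for url in urls:
        responses.append(requests.get(url, timeout=10))
        time.sleep(delay)  # throttle: lower the rate if 502/503 errors appear
    return responses
```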
Analyze Outbound Links
Sites with a large quantity of external links may want to ensure that users are not directed to dead links.
To check this, select “Crawl External Links” under “Project Settings,” which adds an HTTP status code next to external links within your report.
Read more on outbound link audits to learn about analyzing and cleaning up external links.
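The same outbound-link check can be sketched with plain HTTP requests against each external link; the URLs below are placeholders:

```python
import requests

def check_external_links(links):
    """Return the HTTP status code for each external link to spot dead ones."""
    statuses = {}
    for url in links:
        try:
            # HEAD keeps the check lightweight; some servers require GET instead.
            statuses[url] = requests.head(url, allow_redirects=True,
                                          timeout=10).status_code
        except requests.RequestException:
            statuses[url] = None  # connection failure: treat as a dead link
    return statuses

print(check_external_links(["https://example.com/", "https://example.com/missing"]))
```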
Change User Agent
See your site through a variety of crawlers’ eyes (Facebook, Bingbot, etc.) by changing the user agent in “Advanced Settings.”
Add a custom user agent to determine how your website responds.
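With a plain HTTP client, switching the user agent is just a request header. The strings below are illustrative examples, not DeepCrawl settings:

```python
import requests

# Illustrative user-agent strings; substitute whichever crawler you want to emulate.
USER_AGENTS = {
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "custom": "MySiteAuditBot/1.0",
}

response = requests.get("https://example.com/",
                        headers={"User-Agent": USER_AGENTS["googlebot"]},
                        timeout=10)
print(response.status_code)
```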
After The Crawl
Reset your “Project Settings” after the crawl, so you can continue to crawl with ‘real-world’ settings applied.
Remember, the more you experiment and crawl, the closer you get to becoming an expert crawler.
Start your journey with DeepCrawl
If you’re interested in running a crawl with DeepCrawl, discover our range of flexible plans. Or, if you want to find out more about our platform, simply drop us a message and we’ll get back to you as soon as possible.
Author
Sam Marsden
Sam Marsden is DeepCrawl’s former SEO & Content Manager. Sam speaks regularly at marketing conferences, like SMX and BrightonSEO, and is a contributor to industry publications such as Search Engine Journal and State of Digital.