
Tuesday, February 28, 2017

Closing down for a day

Note: This post is specific to Google's organic web-search. For Google's other services, please check with the appropriate help center (e.g., for Google Shopping) or help forum.

Even in today's "always-on" world, sometimes businesses want to take a break. There are times when even their online presence needs to be paused. This blog post covers some of the available options so that a site's search presence isn't affected.

Option: Block cart functionality

If a site only needs to block users from buying things, the simplest approach is to disable that specific functionality. In most cases, shopping cart pages can either be blocked from crawling through the robots.txt file, or blocked from indexing with a robots meta tag. Since search engines either won't see or index that content, you can communicate this to users in an appropriate way. For example, you may disable the link to the cart, add a relevant message, or display an informational page instead of the cart.
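
As a rough illustration (assuming a hypothetical /cart/ path), you could block crawling of the cart in robots.txt:

User-agent: *
Disallow: /cart/

or, alternatively, block indexing of the cart pages themselves with a robots meta tag in their <head>:

<meta name="robots" content="noindex">

Pick one of the two approaches for a given page: if the page is blocked in robots.txt, search engines won't be able to see a robots meta tag on it.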

Option: Always show interstitial or pop-up

If you need to block the whole site from users, be it with a "temporarily unavailable" message, informational page, or popup, the server should return a 503 HTTP result code ("Service Unavailable"). The 503 result code makes sure that Google doesn't index the temporary content that's shown to users. Without the 503 result code, the interstitial would be indexed as your website's content.

Googlebot will retry pages that return 503 for up to about a week before treating them as permanent errors, which can result in those pages being dropped from the search results. You can also include a "Retry-After" header to indicate how long the site will be unavailable. Blocking a site for longer than a week can have negative effects on the site's search results regardless of the method that you use.
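
For example, a minimal 503 response for the temporarily closed pages might look like this (the Retry-After value is in seconds; an HTTP date is also allowed):

HTTP/1.1 503 Service Unavailable
Retry-After: 86400
Content-Type: text/html

<html>
  <body>We're closed today and will be back tomorrow!</body>
</html>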

Option: Switch whole website off

Turning the server off completely is another option. You might also do this if you're physically moving your server to a different data center. For this, have a temporary server available to serve a 503 HTTP result code for all URLs (with an appropriate informational page for users), and switch your DNS to point to that server during that time.

  1. Set your DNS TTL to a low time (such as 5 minutes) a few days in advance.
  2. Change the DNS to the temporary server's IP address.
  3. Take your main server offline once all requests go to the temporary server.
  4. … your server is now offline …
  5. When ready, bring your main server online again.
  6. Switch DNS back to the main server's IP address.
  7. Change the DNS TTL back to normal.
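
As a sketch of what the temporary server in step 2 could serve, here is a hypothetical nginx configuration (nginx itself, the paths, and the Retry-After value are assumptions for illustration, not part of the original guidance):

server {
    listen 80 default_server;
    server_name _;
    root /var/www/maintenance;

    # Hint at how long the site will be unavailable (1 day, in seconds).
    add_header Retry-After 86400 always;

    # Use the informational page as the body of every 503 response.
    error_page 503 /index.html;

    # The maintenance page itself, reused as the 503 body.
    location = /index.html { }

    # Everything else returns 503 Service Unavailable.
    location / {
        return 503;
    }
}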

We hope these options cover the common situations where you'd need to disable your website temporarily. If you have any questions, feel free to drop by our webmaster help forums!

PS: If your business is active locally, make sure to reflect these closures in the opening hours of your local listings too!


Monday, January 16, 2017

What Crawl Budget Means for Googlebot

Recently, we've heard a number of definitions for "crawl budget"; however, externally we don't have a single term that describes everything "crawl budget" stands for. With this post, we'll clarify what we actually have and what it means for Googlebot.

First, we'd like to emphasize that crawl budget, as described below, is not something most publishers have to worry about. If new pages tend to be crawled the same day they're published, crawl budget is not something webmasters need to focus on. Likewise, if a site has fewer than a few thousand URLs, most of the time it will be crawled efficiently.

Prioritizing what to crawl, when, and how much resource the server hosting the site can allocate to crawling is more important for bigger sites, or those that auto-generate pages based on URL parameters, for example.

Crawl rate limit

Googlebot is designed to be a good citizen of the web. Crawling is its main priority, but it also makes sure crawling doesn't degrade the experience of users visiting the site. We call this the "crawl rate limit," which limits the maximum fetching rate for a given site.


Simply put, this represents the number of simultaneous parallel connections Googlebot may use to crawl the site, as well as the time it has to wait between the fetches. The crawl rate can go up and down based on a couple of factors:
  • Crawl health: if the site responds really quickly for a while, the limit goes up, meaning more connections can be used to crawl. If the site slows down or responds with server errors, the limit goes down and Googlebot crawls less.
  • Limit set in Search Console: website owners can reduce Googlebot's crawling of their site. Note that setting higher limits doesn't automatically increase crawling.


Crawl demand

Even if the crawl rate limit isn't reached, if there's no demand from indexing, there will be low activity from Googlebot. The two factors that play a significant role in determining crawl demand are:
  • Popularity: URLs that are more popular on the Internet tend to be crawled more often to keep them fresher in our index.
  • Staleness: our systems attempt to prevent URLs from becoming stale in the index.
Additionally, site-wide events like site moves may trigger an increase in crawl demand in order to reindex the content under the new URLs.

Taking crawl rate and crawl demand together, we define crawl budget as the number of URLs Googlebot can and wants to crawl.


Factors affecting crawl budget

According to our analysis, having many low-value-add URLs can negatively affect a site's crawling and indexing. We found that the low-value-add URLs fall into these categories, in order of significance:
  • Faceted navigation and session identifiers
  • On-site duplicate content
  • Soft error pages
  • Hacked pages
  • Infinite spaces and proxies
  • Low quality and spam content
Wasting server resources on pages like these will drain crawl activity from pages that do actually have value, which may cause a significant delay in discovering great content on a site.


Top questions

Crawling is the entry point for sites into Google's search results. Efficient crawling of a website helps with its indexing in Google Search.

Q: Does site speed affect my crawl budget? How about errors?
A: Making a site faster improves the users' experience while also increasing crawl rate. For Googlebot a speedy site is a sign of healthy servers, so it can get more content over the same number of connections. On the flip side, a significant number of 5xx errors or connection timeouts signal the opposite, and crawling slows down.
We recommend paying attention to the Crawl Errors report in Search Console and keeping the number of server errors low.

Q: Is crawling a ranking factor?
A: An increased crawl rate will not necessarily lead to better positions in Search results. Google uses hundreds of signals to rank the results, and while crawling is necessary for being in the results, it's not a ranking signal.

Q: Do alternate URLs and embedded content count in the crawl budget?
A: Generally, any URL that Googlebot crawls will count towards a site's crawl budget. Alternate URLs, like AMP or hreflang, as well as embedded content, such as CSS and JavaScript, may have to be crawled and will consume a site's crawl budget. Similarly, long redirect chains may have a negative effect on crawling.

Q: Can I control Googlebot with the "crawl-delay" directive?
A: The non-standard "crawl-delay" robots.txt directive is not processed by Googlebot.

Q: Does the nofollow directive affect crawl budget?
A: It depends. Any URL that is crawled affects crawl budget, so even if your page marks a URL as nofollow it can still be crawled if another page on your site, or any page on the web, doesn't label the link as nofollow.

For information on how to optimize crawling of your site, take a look at our blog post on optimizing crawling from 2009, which is still applicable. If you have questions, ask in the forums!

Posted by Gary, Crawling and Indexing teams

Wednesday, November 30, 2016

An update on Google's feature-phone crawling & indexing

Limited mobile devices, "feature-phones", require a special form of markup or a transcoder for web content. Most websites don't provide feature-phone-compatible content in WAP/WML any more. Given these developments, we've made changes in how we crawl feature-phone content (note: these changes don't affect smartphone content):

1. We've retired the feature-phone Googlebot

We won't be using the feature-phone user-agents for crawling for search going forward.

2. Use "handheld" link annotations for dynamic serving of feature-phone content.

Some sites provide content for feature-phones through dynamic serving, based on the user's user-agent. To understand this configuration, make sure your desktop and smartphone pages have a self-referential alternate URL link for handheld (feature-phone) devices:

<link rel="alternate" media="handheld" href="[current page URL]" />

This is a change from our previous guidance of only using the "vary: user-agent" HTTP header. We've updated our documentation on making feature-phone pages accordingly. We hope adding this link element is possible on your side, and thank you for your help in this regard. We'll continue to show feature-phone URLs in search when we can recognize them, and when they're appropriate for users.

3. We're retiring feature-phone tools in Search Console

Without the feature-phone Googlebot, special sitemaps extensions for feature-phone, the Fetch as Google feature-phone options, and feature-phone crawl errors are no longer needed. We continue to support sitemaps and other sitemaps extensions (such as for videos or Google News), as well as the other Fetch as Google options in Search Console.


We've worked to make these changes as minimal as possible. Most websites don't serve feature-phone content, and wouldn't be affected. If your site has been providing feature-phone content, we thank you for your help in bringing the Internet to feature-phone users worldwide!

For any questions, feel free to drop by our Webmaster Help Forums!

Wednesday, October 14, 2015

Deprecating our AJAX crawling scheme

tl;dr: We are no longer recommending the AJAX crawling proposal we made back in 2009.

In 2009, we made a proposal to make AJAX pages crawlable. Back then, our systems were not able to render and understand pages that use JavaScript to present content to users. Because "crawlers … [were] not able to see any content … created dynamically," we proposed a set of practices that webmasters can follow in order to ensure that their AJAX-based applications are indexed by search engines.

Times have changed. Today, as long as you're not blocking Googlebot from crawling your JavaScript or CSS files, we are generally able to render and understand your web pages like modern browsers. To reflect this improvement, we recently updated our technical Webmaster Guidelines to recommend against disallowing Googlebot from crawling your site's CSS or JS files.

Since the assumptions for our 2009 proposal are no longer valid, we recommend following the principles of progressive enhancement. For example, you can use the History API pushState() to ensure accessibility for a wider range of browsers (and our systems).
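
As a minimal sketch of that idea (the #content container and the use of fetch() are assumptions for illustration): ordinary <a href> links keep working as plain page loads, and JavaScript only enhances them when the History API is available.

// Enhance normal, crawlable links only when the History API is available.
document.addEventListener('click', function (event) {
  var link = event.target.closest('a');
  if (!link || !window.history || !window.history.pushState) {
    return; // fall back to a regular full page load
  }
  event.preventDefault();
  fetch(link.href)
    .then(function (response) { return response.text(); })
    .then(function (html) {
      var doc = new DOMParser().parseFromString(html, 'text/html');
      // Swap in the new content and keep a real, crawlable URL in the address bar.
      document.getElementById('content').innerHTML =
          doc.getElementById('content').innerHTML;
      history.pushState(null, '', link.href);
    });
});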

Questions and answers

Q: My site currently follows your recommendation and supports _escaped_fragment_. Would my site stop getting indexed now that you've deprecated your recommendation?
A: No, the site would still be indexed. In general, however, we recommend you implement industry best practices when you're making the next update for your site. Instead of the _escaped_fragment_ URLs, we'll generally crawl, render, and index the #! URLs.
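
For reference, the 2009 scheme mapped #! URLs to their _escaped_fragment_ equivalents roughly like this (URLs are placeholders):

URL users see and share:      http://example.com/page#!key=value
URL crawled under the scheme: http://example.com/page?_escaped_fragment_=key=value

With the deprecation, it's the #! URL on the left that we'll generally crawl, render, and index.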

Q: Is moving away from the AJAX crawling proposal to industry best practices considered a site move? Do I need to implement redirects?
A: If your current setup is working fine, you should not have to immediately change anything. If you're building a new site or restructuring an already existing site, simply avoid introducing _escaped_fragment_ URLs.

Q: I use a JavaScript framework and my webserver serves a pre-rendered page. Is that still ok?
A: In general, websites shouldn't pre-render pages only for Google -- we expect that you might pre-render pages for performance benefits for users and that you would follow progressive enhancement guidelines. If you pre-render pages, make sure that the content served to Googlebot matches the user's experience, both how it looks and how it interacts. Serving Googlebot different content than a normal user would see is considered cloaking, and would be against our Webmaster Guidelines.

If you have any questions, feel free to post them here, or in the webmaster help forum.


Wednesday, March 11, 2015

Unblocking resources with Webmaster Tools

Webmasters often use linked images, CSS, and JavaScript files in web pages to make them pretty and functional. If these resources are blocked from crawling, then Googlebot can't use them when it renders those pages for search. Google Webmaster Tools now includes a Blocked Resources Report to help you find and resolve these kinds of issues.

This report starts with the names of the hosts from which your site is using blocked resources such as JavaScript, CSS, and images. Clicking on the rows gives you the list of blocked resources and then the pages that embed them, guiding you through the steps to diagnose and resolve how we're able to crawl and index the page's content.

An update to Fetch and Render shows how these blocked resources matter. When you request a URL be fetched and rendered, it now shows screenshots rendered both as Googlebot and as a typical user. This makes it easier to recognize the issues that significantly influence why your pages are seen differently by Googlebot.

Webmaster Tools attempts to show you only the hosts that you might have influence over, so at the moment, we won't show hosts that are used by many different sites (such as popular analytics services). Because it can be time-consuming (usually not for technical reasons!) to update all robots.txt files, we recommend starting with the resources that make the most important visual difference when blocked. Our Help Center article has more information on the steps involved.

We hope this new feature makes it easier for you to spot and then unblock resources used by your website! Should you have any questions, feel free to drop by our webmaster help forums.

Tuesday, December 9, 2014

The four steps to appiness

Webmaster Level: intermediate to advanced

App deep links are the new kid on the block in organic search, and they’re picking up speed faster than you can say “schema.org ViewAction”! For signed-in users, 15% of Google searches on Android now return deep links to apps through App Indexing. And over just the past quarter, we've seen the number of clicks on app deep links jump by 10x.

We’ve gotten a lot of feedback from developers and seen a lot of implementations gone right and others that were good learning experiences since we opened up App Indexing back in June. We’d like to share with you four key steps to monitor app performance and drive user engagement:

1. Give your app developer access to Webmaster Tools

App indexing is a team effort between you (as a webmaster) and your app development team. We show information in Webmaster Tools that is key for your app developers to do their job well. Here’s what’s available right now:

  • Errors in indexed pages within apps
  • Weekly clicks and impressions from app deep links via Google Search
  • Stats on your sitemap (if that’s how you implemented the app deep links)
...and we plan to add a lot more in the coming months!

We’ve noticed that very few developers have access to Webmaster Tools. So if you want your app development team to get all of the information they need to fix app-related issues, it’s essential for them to have access to Webmaster Tools.

Any verified site owner can add a new user. Pick restricted or full permissions, depending on the level of access you’d like to give:

2. Understand how your app is doing in search results

How are users engaging with your app from search results? We’ve introduced two new ways for you to track performance for your app deep links:

  • We now send a weekly clicks and impressions update to the Message center in your Webmaster Tools account.
  • You can now track how much traffic app deep links drive to your app using referrer information - specifically, the referrer extra in the ACTION_VIEW intent. We're working to integrate this information with Google Analytics for even easier access. Learn how to track referrer information on our Developer site.

3. Make sure key app resources can be crawled

Blocked resources are one of the top reasons for the “content mismatch” errors you see in Webmaster Tools’ Crawl Errors report. We need access to all the resources necessary to render your app page. This allows us to assess whether your associated web page has the same content as your app page.

To help you find and fix these issues, we now show you the specific resources we can’t access that are critical for rendering your app page. If you see a content mismatch error for your app, look out for the list of blocked resources in “Step 5” of the details dialog:

4. Watch out for Android App errors

To help you identify errors when indexing your app, we’ll send you messages for all app errors we detect, and will also display most of them in the “Android apps” tab of the Crawl errors report.

In addition to the currently available “Content mismatch” and “Intent URI not supported” error alerts, we’re introducing three new error types:

  • APK not found: we can’t find the package corresponding to the app.
  • No first-click free: the link to your app does not lead directly to the content, but requires login to access.
  • Back button violation: after following the link to your app, the back button did not return to search results.

In our experience, the majority of errors are usually caused by a general setting in your app (e.g. a blocked resource, or a region picker that pops up when the user tries to open the app from search). Taking care of that generally resolves it for all involved URIs.

Good luck in the pursuit of appiness! As always, if you have questions, feel free to drop by our Webmaster help forum.

Monday, October 27, 2014

Updating our technical Webmaster Guidelines

Webmaster level: All

We recently announced that our indexing system has been rendering web pages more like a typical modern browser, with CSS and JavaScript turned on. Today, we're updating one of our technical Webmaster Guidelines in light of this announcement.

For optimal rendering and indexing, our new guideline specifies that you should allow Googlebot access to the JavaScript, CSS, and image files that your pages use. Disallowing crawling of JavaScript or CSS files in your site’s robots.txt directly harms how well our algorithms render and index your content, and can result in suboptimal rankings.

Updated advice for optimal indexing

Historically, Google indexing systems resembled old text-only browsers, such as Lynx, and that’s what our Webmaster Guidelines said. Now, with indexing based on page rendering, it's no longer accurate to see our indexing systems as a text-only browser. Instead, a more accurate approximation is a modern web browser. With that new perspective, keep the following in mind:

  • Just like modern browsers, our rendering engine might not support all of the technologies a page uses. Make sure your web design adheres to the principles of progressive enhancement as this helps our systems (and a wider range of browsers) see usable content and basic functionality when certain web design features are not yet supported.
  • Pages that render quickly not only help users get to your content more easily, but also make indexing of those pages more efficient. We advise you to follow the best practices for page performance optimization.
  • Make sure your server can handle the additional load for serving of JavaScript and CSS files to Googlebot.

Testing and troubleshooting

In conjunction with the launch of our rendering-based indexing, we also updated the Fetch and Render as Google feature in Webmaster Tools so webmasters could see how our systems render the page. With it, you'll be able to identify a number of indexing issues: improper robots.txt restrictions, redirects that Googlebot cannot follow, and more.

And, as always, if you have any comments or questions, please ask in our Webmaster Help forum.

Thursday, October 16, 2014

Best practices for XML sitemaps & RSS/Atom feeds

Webmaster level: intermediate-advanced

Submitting sitemaps can be an important part of optimizing websites. Sitemaps enable search engines to discover all pages on a site and to download them quickly when they change. This blog post explains which fields in sitemaps are important, when to use XML sitemaps and RSS/Atom feeds, and how to optimize them for Google.

Sitemaps and feeds

Sitemaps can be in XML sitemap, RSS, or Atom formats. The important difference between these formats is that XML sitemaps describe the whole set of URLs within a site, while RSS/Atom feeds describe recent changes. This has important implications:

  • XML sitemaps are usually large; RSS/Atom feeds are small, containing only the most recent updates to your site.
  • XML sitemaps are downloaded less frequently than RSS/Atom feeds.

For optimal crawling, we recommend using both XML sitemaps and RSS/Atom feeds. XML sitemaps will give Google information about all of the pages on your site. RSS/Atom feeds will provide all updates on your site, helping Google to keep your content fresher in its index. Note that submitting sitemaps or feeds does not guarantee the indexing of those URLs.

Example of an XML sitemap:

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url>
   <loc>http://example.com/mypage</loc>
   <lastmod>2011-06-27T19:34:00+01:00</lastmod>
   <!-- optional additional tags -->
 </url>
 <url>
   ...
 </url>
</urlset>

Example of an RSS feed:

<?xml version="1.0" encoding="utf-8"?>
<rss>
 <channel>
   <!-- other tags -->
   <item>
     <!-- other tags -->
     <link>http://example.com/mypage</link>
     <pubDate>Mon, 27 Jun 2011 19:34:00 +0100</pubDate>
   </item>
   <item>
     ...
   </item>
 </channel>
</rss>

Example of an Atom feed:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
 <!-- other tags -->
 <entry>
   <link href="http://example.com/mypage" />
   <updated>2011-06-27T19:34:00+01:00</updated>
   <!-- other tags -->
 </entry>
 <entry>
   ...
 </entry>
</feed>

“Other tags” refers to both the optional and the required tags defined by the respective standards. We recommend that you specify the required tags for Atom/RSS, as they will help your content appear on other properties that might use these feeds, in addition to Google Search.

Best practices

Important fields

XML sitemaps and RSS/Atom feeds, in their core, are lists of URLs with metadata attached to them. The two most important pieces of information for Google are the URL itself and its last modification time:

URLs

URLs in XML sitemaps and RSS/Atom feeds should adhere to the following guidelines:

  • Only include URLs that can be fetched by Googlebot. A common mistake is including URLs disallowed by robots.txt (which cannot be fetched by Googlebot) or URLs of pages that don't exist.
  • Only include canonical URLs. A common mistake is to include URLs of duplicate pages. This increases the load on your server without improving indexing.
Last modification time

Specify a last modification time for each URL in an XML sitemap and RSS/Atom feed. The last modification time should be the last time the content of the page changed meaningfully. If a change is meant to be visible in the search results, then the last modification time should be the time of this change.

  • XML sitemaps use <lastmod>
  • RSS uses <pubDate>
  • Atom uses <updated>

Be sure to set or update last modification time correctly:

  • Specify the time in the correct format: W3C Datetime for XML sitemaps, RFC3339 for Atom and RFC822 for RSS.
  • Only update modification time when the content changed meaningfully.
  • Don’t set the last modification time to the current time whenever the sitemap or feed is served.

XML sitemaps

XML sitemaps should contain URLs of all pages on your site. They are often large and update infrequently. Follow these guidelines:

  • For a single XML sitemap: update it at least once a day (if your site changes regularly) and ping Google after you update it.
  • For a set of XML sitemaps: maximize the number of URLs in each XML sitemap. The limit is 50,000 URLs or a maximum size of 10MB uncompressed, whichever is reached first. Ping Google for each updated XML sitemap (or once for the sitemap index, if that's used) every time it is updated. A common mistake is to put only a handful of URLs into each XML sitemap file, which usually makes it harder for Google to download all of these XML sitemaps in a reasonable time.
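
For example, a hypothetical sitemap index at http://example.com/sitemap-index.xml referencing two sitemap files, which you could then ping once after each update (the file names are placeholders):

<?xml version="1.0" encoding="utf-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <sitemap>
   <loc>http://example.com/sitemap-products.xml</loc>
   <lastmod>2011-06-27T19:34:00+01:00</lastmod>
 </sitemap>
 <sitemap>
   <loc>http://example.com/sitemap-articles.xml</loc>
   <lastmod>2011-06-27T19:34:00+01:00</lastmod>
 </sitemap>
</sitemapindex>

The ping itself is a simple GET request, for example:

http://www.google.com/ping?sitemap=http://example.com/sitemap-index.xml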

RSS/Atom

RSS/Atom feeds should convey recent updates of your site. They are usually small and updated frequently. For these feeds, we recommend:

  • When a new page is added or an existing page meaningfully changed, add the URL and the modification time to the feed.
  • In order for Google to not miss updates, the RSS/Atom feed should have all updates in it since at least the last time Google downloaded it. The best way to achieve this is by using PubSubHubbub. The hub will propagate the content of your feed to all interested parties (RSS readers, search engines, etc.) in the fastest and most efficient way possible.

Generating both XML sitemaps and Atom/RSS feeds is a great way to optimize crawling of a site for Google and other search engines. The key information in these files is the canonical URL and the time of the last modification of pages within the website. Setting these properly, and notifying Google and other search engines through sitemaps pings and PubSubHubbub, will allow your website to be crawled optimally, and represented accordingly in search results.

If you have any questions, feel free to post them here, or to join other webmasters in the webmaster help forum section on sitemaps.

Friday, September 5, 2014

An improved search box within the search results

Webmaster level: All
Today you’ll see a new and improved sitelinks search box. When shown, it will make it easier for users to reach specific content on your site, directly through your own site-search pages.

What’s this search box and when does it appear for my site?

When users search for a company by name—for example, [Megadodo Publications] or [Dunder Mifflin]—they may actually be looking for something specific on that website. In the past, when our algorithms recognized this, they'd display a larger set of sitelinks and an additional search box below that search result, which let users do site: searches over the site straight from the results, for example [site:example.com hitchhiker guides].
This search box is now more prominent (above the sitelinks), supports Autocomplete, and—if you use the right markup—will send the user directly to your website's own search pages.

How can I mark up my site?

You need to have a working site-specific search engine for your site. If you already have one, you can let us know by marking up your homepage as a schema.org/WebSite entity with the potentialAction property of the schema.org/SearchAction markup. You can use JSON-LD, microdata, or RDFa to do this; check out the full implementation details on our developer site.
If you implement the markup on your site, users will have the ability to jump directly from the sitelinks search box to your site’s search results page. If we don’t find any markup, we’ll show them a Google search results page for the corresponding site: query, as we’ve done until now.
As always, if you have questions, feel free to ask in our Webmaster Help forum.

Update (16:30h CET, September 12th): We're noticing an enthusiastic uptick in the markup implementation after the initial announcement last week! Here are the two main issues we've observed so far, and what you need to do to fix them:

  1. Make sure that when you replace the curly braces (and everything inside them) with a search term, it leads to a valid URL on your site.
    For example: if your "target" value is "http://www.example.com/search?q={searchTerm}", ensure that "http://www.example.com/search?q=foo" and "http://www.example.com/search?q=bar" both lead to search result pages about "foo" and "bar".
  2. Make sure that the "query-input" field points to the same string that's inside the curly braces in the "target" field.
    For example: if your "target" value is "http://www.example.com/search?q={searchTerm}", you must use "searchTerm" as the "name" within "query-input".
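
Putting the markup and both fixes together, a hypothetical JSON-LD implementation on the homepage might look like this (the domain and search URL pattern are placeholders):

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "WebSite",
  "url": "http://www.example.com/",
  "potentialAction": {
    "@type": "SearchAction",
    "target": "http://www.example.com/search?q={searchTerm}",
    "query-input": "required name=searchTerm"
  }
}
</script>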

Wednesday, July 16, 2014

Testing robots.txt files made easier

Webmaster level: intermediate-advanced

To crawl, or not to crawl, that is the robots.txt question.

Making and maintaining correct robots.txt files can sometimes be difficult. While most sites have it easy (tip: they often don't even need a robots.txt file!), finding the directives within a large robots.txt file that are or were blocking individual URLs can be quite tricky. To make that easier, we're now announcing an updated robots.txt testing tool in Webmaster Tools.

You can find the updated testing tool in Webmaster Tools within the Crawl section:

Here you'll see the current robots.txt file, and can test new URLs to see whether they're disallowed for crawling. To guide your way through complicated directives, it will highlight the specific one that led to the final decision. You can make changes in the file and test those too, you'll just need to upload the new version of the file to your server afterwards to make the changes take effect. Our developers site has more about robots.txt directives and how the files are processed.

Additionally, you'll be able to review older versions of your robots.txt file, and see when access issues block us from crawling. For example, if Googlebot sees a 500 server error for the robots.txt file, we'll generally pause further crawling of the website.

Since there may be some errors or warnings shown for your existing sites, we recommend double-checking their robots.txt files. You can also combine it with other parts of Webmaster Tools: for example, you might use the updated Fetch as Google tool to render important pages on your website. If any blocked URLs are reported, you can use this robots.txt tester to find the directive that's blocking them, and, of course, then improve that. A common problem we've seen comes from old robots.txt files that block CSS, JavaScript, or mobile content — fixing that is often trivial once you've seen it.
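
For instance, a site might have an old, overly broad rule that also blocks its page resources; the tester will point at that rule, and a narrower set of directives (the paths here are hypothetical) can then unblock the CSS and JavaScript while keeping the rest disallowed:

User-agent: *
# Old rule that also blocked page resources:
Disallow: /assets/
# More specific rules win for Googlebot, so these unblock the resources:
Allow: /assets/css/
Allow: /assets/js/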

We hope this updated tool makes it easier for you to test & maintain the robots.txt file. Should you have any questions, or need help with crafting a good set of directives, feel free to drop by our webmaster's help forum!

Wednesday, June 25, 2014

Android app indexing is now open for everyone!

Webmaster level: All

Do you have an Android app in addition to your website? You can now connect the two so that users searching from their smartphones and tablets can easily find and reach your app content.

App deep links in search results help your users find your content more easily and re-engage with your app after they’ve installed it. As a site owner, you can show your users the right content at the right time — by connecting pages of your website to the relevant parts of your app you control when your users are directed to your app and when they go to your website.


Hundreds of apps have already implemented app indexing. This week at Google I/O, we’re announcing a set of new features that will make it even easier to set up deep links in your app, connect your site to your app, and keep track of performance and potential errors.

Getting started is easy

We’ve greatly simplified the process to get your app deep links indexed. If your app supports HTTP deep linking schemes, here’s what you need to do:

  1. Add deep link support to your app
  2. Connect your site and your app
  3. There is no step 3 (:

As we index your URLs, we’ll discover and index the app / site connections and may begin to surface app deep links in search results.

We can discover and index your app deep links on our own, but we recommend that you publish the deep links yourself, either in your sitemap or directly on your website. This is also the case if your app only supports a custom deep link scheme.
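
For example, a hypothetical sitemap entry pointing to the app version of a page (the package name com.example.android and the paths are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
 <url>
   <loc>http://example.com/gizmos</loc>
   <xhtml:link rel="alternate"
               href="android-app://com.example.android/http/example.com/gizmos" />
 </url>
</urlset>

The same deep link can instead be published with a link element in the <head> of the corresponding web page:

<link rel="alternate" href="android-app://com.example.android/http/example.com/gizmos" />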

There’s one more thing: we’ve added a new feature in Webmaster Tools to help you debug any issues that might arise during app indexing. It will show you what type of errors we’ve detected for the app page-web page pairs, together with example app URIs so you can debug:



We’ll also give you detailed instructions on how to debug each issue, including a QR code for the app deep links, so you can easily open them on your phone or tablet. We’ll send you Webmaster Tools error notifications as well, so you can keep up to date.



Give app indexing a spin, and as always, if you need more help ask questions on the Webmaster help forum.

Wednesday, June 4, 2014

Directing smartphone users to the page they actually wanted

Webmaster level: all

Have you ever used Google Search on your smartphone and clicked on a promising-looking result, only to end up on the mobile site’s homepage, with no idea why the page you were hoping to see vanished? This is such a common annoyance that we’ve even seen comics about it. Usually this happens because the website is not properly set up to handle requests from smartphones and sends you to its smartphone homepage—we call this a “faulty redirect”.

We’d like to spare users the frustration of landing on irrelevant pages and help webmasters fix the faulty redirects. Starting today in our English search results in the US, whenever we detect that smartphone users are redirected to a homepage instead of the page they asked for, we may note it below the result. If you still wish to proceed to the page, you can click “Try anyway”:

This search label has been deprecated. 


And we’re providing advice and resources to help you direct your audience to the pages they want. Here’s a quick rundown:

1. Do a few searches on your own phone (or with a browser set up to act like a smartphone) and see how your site behaves. Simple but effective. :)

2. Check out Webmaster Tools—we’ll send you a message if we detect that any of your site’s pages are redirecting smartphone users to the homepage. We’ll also show you any faulty redirects we detect in the Smartphone Crawl Errors section of Webmaster Tools:


3. Investigate any faulty redirects and fix them. Here’s what you can do:
  • Use the example URLs we provide in Webmaster Tools as a starting point to debug exactly where the problem is with your server configuration.
  • Set up your server so that it redirects smartphone users to the equivalent URL on your smartphone site.
  • If a page on your site doesn’t have a smartphone equivalent, keep users on the desktop page, rather than redirecting them to the smartphone site’s homepage. Doing nothing is better than doing something wrong in this case.
  • Try using responsive web design, which serves the same content for desktop and smartphone users.
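
For the second point in the list above, here is a deliberately simplified sketch of an nginx rule for a hypothetical separate mobile site on m.example.com (real user-agent detection is more involved; the point is to preserve the requested path rather than redirecting to the homepage):

# Send smartphone users to the *equivalent* mobile URL, not to the mobile homepage.
if ($http_user_agent ~* "Mobile") {
    return 302 http://m.example.com$request_uri;
}
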
If you’d like to know more about building smartphone-friendly sites, read our full recommendations. And, as always, if you need more help you can ask a question in our webmaster forum.

Tuesday, May 27, 2014

Rendering pages with Fetch as Google

Webmaster level: all

The Fetch as Google feature in Webmaster Tools provides webmasters with the results of Googlebot attempting to fetch their pages. The server headers and HTML shown are useful to diagnose technical problems and hacking side-effects, but sometimes make double-checking the response hard: Help! What do all of these codes mean? Is this really the same page as I see it in my browser? Where shall we have lunch? We can't help with that last one, but for the rest, we've recently expanded this tool to also show how Googlebot would be able to render the page.

Viewing the rendered page

In order to render the page, Googlebot will try to find all the external files involved, and fetch them as well. Those files frequently include images, CSS and JavaScript files, as well as other files that might be indirectly embedded through the CSS or JavaScript. These are then used to render a preview image that shows Googlebot's view of the page.

You can find the Fetch as Google feature in the Crawl section of Google Webmaster Tools. After submitting a URL with "Fetch and render," wait for it to be processed (this might take a moment for some pages). Once it's ready, just click on the response row to see the results.

Fetch as Google

Handling resources blocked by robots.txt

Googlebot follows the robots.txt directives for all files that it fetches. If you are disallowing crawling of some of these files (or if they are embedded from a third-party server that's disallowing Googlebot's crawling of them), we won't be able to show them to you in the rendered view. Similarly, if the server fails to respond or returns errors, then we won't be able to use those either (you can find similar issues in the Crawl Errors section of Webmaster Tools). If we run across either of these issues, we'll show them below the preview image.

We recommend making sure Googlebot can access any embedded resource that meaningfully contributes to your site's visible content, or to its layout. That will make Fetch as Google easier for you to use, and will make it possible for Googlebot to find and index that content as well. Some types of content – such as social media buttons, fonts or website-analytics scripts – tend not to meaningfully contribute to the visible content or layout, and can be left disallowed from crawling. For more information, please see our previous blog post on how Google is working to understand the web better.

We hope this update makes it easier for you to diagnose these kinds of issues, and to discover content that's accidentally blocked from crawling. If you have any comments or questions, let us know here or drop by in the webmaster help forum.

Friday, May 23, 2014

Understanding web pages better

In 1998, when our servers were running in Susan Wojcicki’s garage, we didn't really have to worry about JavaScript or CSS. They weren't used much, and when they were, JavaScript was mostly used to make page elements... blink! A lot has changed since then. The web is full of rich, dynamic, amazing websites that make heavy use of JavaScript. Today, we'll talk about our capability to render richer websites, meaning we see your content more like modern web browsers do: we include the external resources, execute JavaScript, and apply CSS.

Traditionally, we were only looking at the raw textual content that we’d get in the HTTP response body and didn't really interpret what a typical browser running JavaScript would see. When pages that have valuable content rendered by JavaScript started showing up, we weren’t able to let searchers know about it, which is a sad outcome for both searchers and webmasters.

In order to solve this problem, we decided to try to understand pages by executing JavaScript. It’s hard to do that at the scale of the current web, but we decided that it’s worth it. We have been gradually improving how we do this for some time. In the past few months, our indexing system has been rendering a substantial number of web pages more like an average user’s browser with JavaScript turned on.

Sometimes things don't go perfectly during rendering, which may negatively impact search results for your site. Here are a few potential issues and, where possible, how you can help prevent them from occurring:
  • If resources like JavaScript or CSS in separate files are blocked (say, with robots.txt) so that Googlebot can’t retrieve them, our indexing systems won’t be able to see your site like an average user. We recommend allowing Googlebot to retrieve JavaScript and CSS so that  your content can be indexed better. This is especially important for mobile websites, where external resources like CSS and JavaScript help our algorithms understand that the pages are optimized for mobile.
  • If your web server is unable to handle the volume of crawl requests for resources, it may have a negative impact on our capability to render your pages. If you’d like to ensure that your pages can be rendered by Google, make sure your servers are able to handle crawl requests for resources.
  • It's always a good idea to have your site degrade gracefully. This will help users enjoy your content even if their browser doesn't have a compatible JavaScript implementation. It will also help visitors who have JavaScript turned off, as well as search engines that can't execute JavaScript yet.
  • Sometimes the JavaScript may be too complex or arcane for us to execute, in which case we can’t render the page fully and accurately.
  • Some JavaScript removes content from the page rather than adding, which prevents us from indexing the content.

To make things easier to debug, we're currently working on a tool to help webmasters better understand how Google renders their site. We look forward to making it available to you in Webmaster Tools in the coming days.

If you have any questions, please feel free to visit our help forum.

Monday, May 12, 2014

Creating the Right Homepage for your International Users


If you are doing business in more than one country or targeting different languages, we recommend having separate sites or sections, each with its own URL and content targeted at an individual country or language. For instance, one page for US and English-speaking visitors, and a different page for France and French-speaking users. While we have information on handling multi-regional and multilingual sites, the homepage can be a bit special. This post will help you create the right homepage on your website to serve the appropriate content to users depending on their language and location.

There are three ways to configure your homepage / landing page when your users access it:
  • Show everyone the same content.
  • Let users choose.
  • Serve content depending on users’ location and language.
Let’s have a look at each in detail.

Show users worldwide the same content 

In this scenario, you decide to serve specific content for one given country and language on your homepage / generic URL (http://www.example.com). This content will be available to anyone who accesses that URL directly in their browser or those who search for that URL specifically. As mentioned above, all country & language versions should also be accessible on their own unique URLs.


Note: You can show a banner on your page to suggest a more appropriate version to users from other locations or with different language settings.

Let users choose which local version and which language they want 

In this configuration, you decide to serve a country selector page on your homepage / generic URL and to let users choose which content they want to see depending on country and language. All users who type in that URL can access the same page.

If you implement this scenario on your international site, remember to use the x-default rel-alternate-hreflang annotation for the country selector page, which was specifically created for these kinds of pages. The x-default value helps us recognize pages that are not specific to one language or region.

Automatically redirect users or dynamically serve the appropriate HTML content depending on users’ location and language settings

A third scenario would be to automatically serve the appropriate HTML content to your users depending on their location and language settings. You will either do that by using server-side 302 redirects or by dynamically serving the right HTML content.

Remember to use x-default rel-alternate-hreflang annotation on the homepage / generic page even if the latter is a redirect page that is not accessible directly for users.

Note: Think about where to redirect users for whom you do not have a specific version (for instance, French-speaking users of a website that has only English, Spanish, and Chinese versions), and show them the content that you consider most appropriate.

Whatever configuration you decide to go with, you should make sure all the pages – including country and language selector pages:
  • Have rel-alternate-hreflang annotations.
  • Are accessible for Googlebot's crawling and indexing: do not block the crawling or indexing of your localized pages.
  • Always allow users to switch local version or language: you can do that using a drop down menu for instance.
Reminder: As mentioned in the beginning, remember that you must have separate URLs for each country and language version. 

About rel-alternate-hreflang annotations

Remember to annotate all your pages - whatever method you choose. This will greatly help search engines to show the right results to your users.

Country selector pages and redirecting or dynamically serving homepages should all use the x-default hreflang, which was specifically designed for auto-redirecting homepages and country selectors. 

Finally, here are a few useful reminders about rel-alternate-hreflang annotations in general:
  • Your annotations must be confirmed by the other pages. If page A links to page B, page B must link back to page A; otherwise, your annotations may not be interpreted correctly.
  • Your annotations should be self-referential. Page A should use rel-alternate-hreflang annotation linking to itself.
  • You can specify the rel-alternate-hreflang annotations in the HTTP header, in the head section of the HTML, or in a sitemap file. We strongly recommend that you choose only one way to implement the annotations, in order to avoid inconsistent signals and errors.
  • The value of the hreflang attribute must be in ISO 639-1 format for the language, and in ISO 3166-1 Alpha 2 format for the region. Specifying only the region is not supported. If you wish to configure your site only for a country, use the geotargeting feature in Webmaster Tools.
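
As a concrete illustration of these reminders, the <head> of each version of a hypothetical homepage (the URLs are placeholders) would carry the same set of annotations, including a self-referential one and the x-default for the country selector or auto-redirecting generic URL:

<link rel="alternate" hreflang="x-default" href="http://www.example.com/" />
<link rel="alternate" hreflang="en" href="http://www.example.com/en/" />
<link rel="alternate" hreflang="fr" href="http://www.example.com/fr/" />
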
Following these recommendations will help us better understand your localized content and serve more relevant results to your users in our search results. As always, if you have any questions or feedback, please tell us in the internationalization Webmaster Help Forum.

Thursday, April 3, 2014

App Indexing updates

Webmaster Level: Advanced

In October, we announced guidelines for App Indexing for deep linking directly from Google Search results to your Android app. Thanks to all of you that have expressed interest. We’ve just enabled 20+ additional applications that users will soon see app deep links for in Search Results, and starting today we’re making app deep links to English content available globally.


We’re continuing to onboard more publishers in all languages. If you haven’t added deep link support to your Android app or specified these links on your website or in your Sitemaps, please do so and then notify us by filling out this form.

Here are some best practices to consider when adding deep links to your sitemap or website:
  • App deep links should only be included for canonical web URLs.
  • Remember to specify an app deep link for your homepage.
  • Not all website URLs in a Sitemap need to have a corresponding app deep link. Do not include app deep links that aren't supported by your app.
  • If you are a news site and use News Sitemaps, be sure to include your deep link annotations in the News Sitemaps, as well as your general Sitemaps.
  • Don't provide annotations for deep links that execute native ARM code. This enables app indexing to work for all platforms.
When Google indexes content from your app, your app will need to make HTTP requests that it usually makes under normal operation. These requests will appear to your servers as originating from Googlebot. Therefore, your server's robots.txt file must be configured properly to allow these requests.

Finally, please make sure the back button behavior of your app leads directly back to the search results page.

For more details on implementation, visit our updated developer guidelines. And, as always, you can ask questions on the mobile section of our webmaster forum.

Monday, March 31, 2014

More Precise Index Status Data for Your Site Variations

Webmaster Level: Intermediate

The Google Webmaster Tools Index Status feature reports how many pages on your site are indexed by Google. In the past, we didn’t show index status data for HTTPS websites independently, but rather we included everything in the HTTP site’s report. In the last months, we’ve heard from you that you’d like to use Webmaster Tools to track your indexed URLs for sections of your website, including the parts that use HTTPS.

We’ve seen that nearly 10% of all URLs already use a secure connection to transfer data via HTTPS, and we hope to see more webmasters move their websites from HTTP to HTTPS in the future. We’re happy to announce a refinement in the way your site’s index status data is displayed in Webmaster Tools: the Index Status feature now tracks your site’s indexed URLs for each protocol (HTTP and HTTPS) as well as for verified subdirectories.

This makes it easy for you to monitor different sections of your site. For example, the HTTP and HTTPS variants of your site, or a verified subdirectory such as https://example.com/folder/, will each show their own data in the Webmaster Tools Index Status report, provided they are verified separately.

The refined data will be visible for webmasters whose site's URLs are on HTTPS or who have subdirectories verified, such as https://example.com/folder/. Data for subdirectories will be included in the higher-level verified sites on the same hostname and protocol.

If you have a website on HTTPS or if some of your content is indexed under different subdomains, you will see a change in the corresponding Index Status reports. The screenshots below illustrate the changes that you may see on your HTTP and HTTPS sites’ Index Status graphs for instance:

HTTP site’s Index Status showing drop

HTTPS site’s Index Status showing increase

An “Update” annotation has been added to the Index Status graph for March 9th, showing when we started collecting this data. This change does not affect the way we index your URLs, nor does it have an impact on the overall number of URLs indexed on your domain. It is a change that only affects the reporting of data in Webmaster Tools user interface.

In order to see your data correctly, you will need to verify all existing variants of your site (www., non-www., HTTPS, subdirectories, subdomains) in Google Webmaster Tools. We recommend that your preferred domains and canonical URLs are configured accordingly.

Note that if you wish to submit a Sitemap, you will need to do so for the preferred variant of your website, using the corresponding URLs. Robots.txt files are also read separately for each protocol and hostname.

We hope that you’ll find this update useful, and that it’ll help you monitor, identify and fix indexing problems with your website. You can find additional details in our Index Status Help Center article. As usual, if you have any questions, don’t hesitate to ask in our webmaster Help Forum.


Thursday, February 13, 2014

Infinite scroll search-friendly recommendations

Webmaster Level: Advanced

Your site’s news feed or pinboard might use infinite scroll—much to your users’ delight! When it comes to delighting Googlebot, however, that can be another story. With infinite scroll, crawlers cannot always emulate manual user behavior, like scrolling or clicking a button to load more items, so they don't always access all individual items in the feed or gallery. If crawlers can’t access your content, it’s unlikely to surface in search results.

To make sure that search engines can crawl individual items linked from an infinite scroll page, make sure that you or your content management system produces a paginated series (component pages) to go along with your infinite scroll.


An infinite scroll page is made “search-friendly” when converted to a paginated series: each component page has a similar <title>, with rel=next/prev values declared in the <head>.

You can see this type of behavior in action in the infinite scroll with pagination demo created by Webmaster Trends Analyst, John Mueller. The demo illustrates some key search-engine friendly points:
  • Coverage: All individual items are accessible. With traditional infinite scroll, individual items displayed after the initial page load aren’t discoverable to crawlers.
  • No overlap: Each item is listed only once in the paginated series (i.e., no duplication of items).
Search-friendly recommendations for infinite scroll
  1. Before you start:
    • Chunk your infinite-scroll page content into component pages that can be accessed when JavaScript is disabled.
    • Determine how much content to include on each page.
      • Be sure that if a searcher came directly to this page, they could easily find the exact item they wanted (e.g., without lots of scrolling before locating the desired content).
      • Maintain reasonable page load time.
    • Divide content so that there’s no overlap between component pages in the series (with the exception of buffering).


  2. The example on the left is search-friendly; the example on the right isn't, as it would cause crawling and indexing of duplicative content.

  3. Structure URLs for infinite scroll search engine processing.
    • Each component page contains a full URL. We recommend full URLs in this situation to minimize potential for configuration error.
      • Good: example.com/category?name=fun-items&page=1
      • Good: example.com/fun-items?lastid=567
      • Less optimal: example.com/fun-items#1
      • Test that each component page (the URL) works to take anyone directly to the content and is accessible/referenceable in a browser without the same cookie or user history.
    • Any key/value URL parameters should follow these recommendations:
      • Be sure the URL shows conceptually the same content two weeks from now.
        • Avoid relative-time based URL parameters:
          example.com/category/page.php?name=fun-items&days-ago=3
      • Create parameters that can surface valuable content to searchers.
        • Avoid parameters that offer no value to searchers as the primary method to access content:
          example.com/fun-places?radius=5&lat=40.71&long=-73.40

  4. Configure pagination with each component page containing rel=next and rel=prev values in the <head>. Pagination values in the <body> will be ignored for Google indexing purposes because they could be created with user-generated content (not intended by the webmaster).

  5. Implement replaceState/pushState on the infinite scroll page. (The decision to use one or both is up to you and your site’s user behavior). That said, we recommend including pushState (by itself, or in conjunction with replaceState) for the following:
    • Any user action that resembles a click or actively turning a page.
    • To provide users with the ability to serially backup through the most recently paginated content.

  6. Test!
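
As a rough sketch of steps 4 and 5, using the component URLs from step 3 as placeholders: component page 2 would declare its neighbors in the <head>,

<link rel="prev" href="http://example.com/category?name=fun-items&amp;page=1">
<link rel="next" href="http://example.com/category?name=fun-items&amp;page=3">

while the infinite scroll page itself would update the address bar as each chunk is appended (the element ID and function name are hypothetical):

// Called after the next chunk of items has been fetched.
function appendNextPage(pageNumber, html) {
  document.getElementById('item-list').insertAdjacentHTML('beforeend', html);
  // Reflect the corresponding component page in the URL for back/forward navigation.
  history.pushState({page: pageNumber}, '',
      '/category?name=fun-items&page=' + pageNumber);
}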

Wednesday, February 12, 2014

Faceted navigation best (and 5 of the worst) practices

Webmaster Level: Advanced

Faceted navigation, such as filtering by color or price range, can be helpful for your visitors, but it’s often not search-friendly since it creates many combinations of URLs with duplicative content. With duplicative URLs, search engines may not crawl new or updated unique content as quickly, and/or they may not index a page accurately because indexing signals are diluted between the duplicate versions. To reduce these issues and help faceted navigation sites become as search-friendly as possible, we’d like to review some background on the problem and then walk through the worst (and best) practices we’ve seen.



Selecting filters with faceted navigation can cause many URL combinations, such as http://www.example.com/category.php?category=gummy-candies&price=5-10&price=over-10

Background

In an ideal state, unique content (whether an individual product/article or a category of products/articles) would have only one accessible URL. This URL would have a clear click path, or route to the content from within the site, accessible by clicking from the homepage or a category page.

Ideal for searchers and Google Search
  • Clear path that reaches all individual product/article pages


    On the left is potential user navigation on the site (i.e., the click path), on the right are the pages accessed.

  • One representative URL for category page
    http://www.example.com/category.php?category=gummy-candies


    Category page for gummy candies

  • One representative URL for individual product page
    http://www.example.com/product.php?item=swedish-fish

    Product page for swedish fish

Undesirable duplication caused with faceted navigation
  • Numerous URLs for the same article/product

    Canonical: example.com/product.php?item=swedish-fish
    Duplicate: example.com/product.php?item=swedish-fish&category=gummy-candies&price=5-10
    Same product page for swedish fish can be available on multiple URLs.

  • Numerous category pages that provide little or no value to searchers and search engines

    URLs:
    • example.com/category.php?category=gummy-candies&taste=sour&price=5-10
    • example.com/category.php?category=gummy-candies&taste=sour&price=over-10
    Issues
    • No added value to Google searchers given users rarely search for [sour gummy candy price five to ten dollars].
    • No added value for search engine crawlers that discover the same item (“fruit salad”) from parent category pages (either “gummy candies” or “sour gummy candies”).
    • Negative value to site owner who may have indexing signals diluted between numerous versions of the same category.
    • Negative value to site owner with respect to serving bandwidth and losing crawler capacity to duplicative content rather than new or updated pages.
    • No value for search engines (should have 404 response code).
    • Negative value to searchers.

Worst (search un-friendly) practices for faceted navigation

Worst practice #1: Non-standard URL encoding for parameters, like commas or brackets, instead of “key=value&” pairs.
Worst practices:
  • example.com/category?[category:gummy-candy][sort:price-low-to-high][sid:789]
    • key=value pairs marked with : rather than =
    • multiple parameters appended with [ ] rather than &
  • example.com/category?category,gummy-candy,,sort,lowtohigh,,sid,789
    • key=value pairs marked with a , rather than =
    • multiple parameters appended with ,, rather than &
Best practice:
example.com/category?category=gummy-candy&sort=low-to-high&sid=789

While humans may be able to decode odd URL parameters, such as “,,”, crawlers have difficulty interpreting URL parameters when they’re implemented in a non-standard fashion. Mehmet Aktuna, a software engineer on Google’s Crawling Team, says “Using non-standard encoding is just asking for trouble.” Instead, connect key=value pairs with an equal sign (=) and append multiple parameters with an ampersand (&).
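
For illustration, here’s a minimal JavaScript sketch (the parameter names mirror the best-practice example above): URLSearchParams produces standard key=value pairs joined with &, and handles escaping, so there’s no need for hand-rolled separators like “[ ]” or “,,”.

  const params = new URLSearchParams({
    category: 'gummy-candy',
    sort: 'low-to-high',
    sid: '789',
  });
  const url = 'https://example.com/category?' + params.toString();
  // => https://example.com/category?category=gummy-candy&sort=low-to-high&sid=789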

Worst practice #2: Using directories or file paths rather than parameters to list values that don’t change page content.
Worst practice:
example.com/c123/s789/product?swedish-fish
(where /c123/ is a category, /s789/ is a sessionID that doesn’t change page content)
Good practice:
example.com/gummy-candy/product?item=swedish-fish&sid=789 (the directory, /gummy-candy/, changes the page content in a meaningful way)

Best practice:

example.com/product?item=swedish-fish&category=gummy-candy&sid=789 (URL parameters allow more flexibility for search engines to determine how to crawl efficiently)

It’s difficult for automated programs, like search engine crawlers, to differentiate useful values (e.g., “gummy-candy”) from the useless ones (e.g., “sessionID”) when values are placed directly in the path. On the other hand, URL parameters provide flexibility for search engines to quickly test and determine when a given value doesn’t require the crawler to access all variations.

Common values that don’t change page content and should be listed as URL parameters include:

  • Session IDs
  • Tracking IDs
  • Referrer IDs
  • Timestamp
Worst practice #3: Converting user-generated values into (possibly infinite) URL parameters that are crawlable and indexable, but not useful in search results.

Worst practices (e.g., user-generated values like longitude/latitude or “days ago” as crawlable and indexable URLs):

  • example.com/find-a-doctor?radius=15&latitude=40.7565068&longitude=-73.9668408

  • example.com/article?category=health&days-ago=7

Best practices:

  • example.com/find-a-doctor?city=san-francisco&neighborhood=soma

  • example.com/articles?category=health&date=january-10-2014

Rather than allow user-generated values to create crawlable URLs  -- which leads to infinite possibilities with very little value to searchers -- perhaps publish category pages for the most popular values, then include additional information so the page provides more value than an ordinary search results page. Alternatively, consider placing user-generated values in a separate directory and then robots.txt disallow crawling of that directory.

  • example.com/filtering/find-a-doctor?radius=15&latitude=40.7565068&longitude=-73.9668408
  • example.com/filtering/articles?category=health&days-ago=7

with robots.txt:

User-agent: *
Disallow: /filtering/
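
If you go the robots.txt route, a small URL-building helper can keep user-generated refinements on the disallowed path. This is only a sketch in JavaScript; the list of parameters treated as user-generated is an assumption based on the examples above.

  const USER_GENERATED_PARAMS = new Set(['radius', 'latitude', 'longitude', 'days-ago']);

  function buildUrl(path, params) {
    const query = new URLSearchParams(params);
    const userGenerated = [...query.keys()].some((key) => USER_GENERATED_PARAMS.has(key));
    // User-generated refinements go under /filtering/, which robots.txt disallows.
    const prefix = userGenerated ? '/filtering' : '';
    return `https://example.com${prefix}${path}?${query}`;
  }

  buildUrl('/find-a-doctor', { city: 'san-francisco', neighborhood: 'soma' });
  // => https://example.com/find-a-doctor?city=san-francisco&neighborhood=soma
  buildUrl('/find-a-doctor', { radius: '15', latitude: '40.75', longitude: '-73.96' });
  // => https://example.com/filtering/find-a-doctor?radius=15&latitude=40.75&longitude=-73.96
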
Worst practice #4: Appending URL parameters without logic.

Worst practices:

  • example.com/gummy-candy/lollipops/gummy-candy/gummy-candy/product?swedish-fish
  • example.com/product?cat=gummy-candy&cat=lollipops&cat=gummy-candy&cat=gummy-candy&item=swedish-fish

Better practice:

example.com/gummy-candy/product?item=swedish-fish

Best practice:

example.com/product?item=swedish-fish&category=gummy-candy

Extraneous URL parameters only increase duplication, causing less efficient crawling and indexing. Therefore, consider stripping unnecessary URL parameters and performing your site’s “internal housekeeping”  before generating the URL. If many parameters are required for the user session, perhaps hide the information in a cookie rather than continually append values like cat=gummy-candy&cat=lollipops&cat=gummy-candy& ...
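
As a sketch of that “internal housekeeping” in JavaScript: keep only the parameters the page actually needs (the allow-list below is an assumption for illustration) and drop repeated values before the link is generated.

  const ALLOWED_PARAMS = ['item', 'cat'];

  function cleanProductUrl(rawUrl) {
    const url = new URL(rawUrl);
    const cleaned = new URLSearchParams();
    for (const name of ALLOWED_PARAMS) {
      const value = url.searchParams.get(name); // keeps only the first value of repeated parameters
      if (value !== null) cleaned.set(name, value);
    }
    url.search = cleaned.toString();
    return url.toString();
  }

  cleanProductUrl('https://example.com/product?cat=gummy-candy&cat=lollipops&cat=gummy-candy&item=swedish-fish');
  // => https://example.com/product?item=swedish-fish&cat=gummy-candy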

Worst practice #5: Offering further refinement (filtering) when there are zero results.

Worst practice:

Allowing users to select filters when zero items exist for the refinement.

Refinement to a page with zero results (e.g., price=over-10) is allowed even though it frustrates users and causes unnecessary issues for search engines.

Best practice:

Only create links/URLs when it’s a valid user-selection (items exist). With zero items, grey out filtering options. To further improve usability, consider adding item counts next to each filter.


Refinement to a page with zero results (e.g., price=over-10) isn’t allowed, preventing users from making an unnecessary click and search engine crawlers from accessing a non-useful page.

Prevent useless URLs and minimize the crawl space by only creating URLs when products exist. This helps users to stay engaged on your site (fewer clicks on the back button when no products exist), and helps minimize potential URLs known to crawlers. Furthermore, if a page isn’t just temporarily out-of-stock, but is unlikely to ever contain useful content, consider returning a 404 status code. With the 404 response, you can include a helpful message to users with more navigation options or a search box to find related products.
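
The sketch below puts both ideas together on the server side. It uses Express purely as an example framework, and findItems(), countItems(), isPermanentlyEmpty(), and renderCategoryPage() are hypothetical helpers standing in for your own data and templating layers.

  const express = require('express');
  const app = express();

  app.get('/category', (req, res) => {
    const { category, price } = req.query;
    const items = findItems(category, price); // hypothetical data lookup

    if (items.length === 0 && isPermanentlyEmpty(category, price)) {
      // 404 with a helpful message and navigation options, as described above.
      return res
        .status(404)
        .send('No products match this filter. Try <a href="/category?category=gummy-candies">all gummy candies</a>.');
    }

    // Only render filter links when items exist; otherwise grey the option out
    // and show the zero count so users aren't tempted to click.
    const priceFilters = ['5-10', 'over-10'].map((range) => {
      const count = countItems(category, range); // hypothetical count lookup
      return count > 0
        ? `<a href="/category?category=${category}&price=${range}">${range} (${count})</a>`
        : `<span class="disabled">${range} (0)</span>`;
    });

    res.send(renderCategoryPage(items, priceFilters)); // hypothetical template
  });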

Best practices for new faceted navigation implementations or redesigns

New sites that are considering implementing faceted navigation have several options to optimize the “crawl space” (the totality of URLs on your site known to Googlebot) for unique content pages, reduce crawling of duplicative pages, and consolidate indexing signals.

  • Determine which URL parameters are required for search engines to crawl every individual content page (i.e., determine what parameters are required to create at least one click-path to each item). Required parameters may include item-id, category-id, page, etc.
  • Determine which parameters would be valuable to searchers and their queries, and which would likely only cause duplication with unnecessary crawling or indexing. In the candy store example, I may find the URL parameter “taste” to be valuable to searchers for queries like [sour gummy candies] which could show the result example.com/category.php?category=gummy-candies&taste=sour. However, I may consider the parameter “price” to only cause duplication, such as category=gummy-candies&taste=sour&price=over-10. Other common examples:
    • Valuable parameters to searchers: item-id, category-id, name, brand...
    • Unnecessary parameters: session-id, price-range...
  • Consider implementing one of several configuration options for URLs that contain unnecessary parameters. Just make sure that the unnecessary URL parameters are never required in a crawler or user’s click path to reach each individual product!

    • Option 1: rel="nofollow" internal links
      Add rel="nofollow" to all internal links pointing to unnecessary URLs (a minimal sketch follows this list). This option minimizes the crawler’s discovery of unnecessary URLs and therefore reduces the potentially explosive crawl space (URLs known to the crawler) that can occur with faceted navigation. rel="nofollow" doesn’t prevent the unnecessary URLs from being crawled (only a robots.txt disallow prevents crawling). By allowing them to be crawled, however, you can consolidate indexing signals from the unnecessary URLs with a searcher-valuable URL by adding rel="canonical" from the unnecessary URL to a superset URL (e.g. example.com/category.php?category=gummy-candies&taste=sour&price=5-10 can specify a rel="canonical" to the superset sour gummy candies view-all page at example.com/category.php?category=gummy-candies&taste=sour&page=all).
    • Option 2: Robots.txt disallow
      For URLs with unnecessary parameters, include a /filtering/ directory that is disallowed in robots.txt. This lets all search engines freely crawl good content, but prevents crawling of the unwanted URLs. For instance, if my valuable parameters were item, category, and taste, and my unnecessary parameters were session-id and price, I might have the URL:
      example.com/category.php?category=gummy-candies
      which could link to another URL with a valuable parameter, such as taste:
      example.com/category.php?category=gummy-candies&taste=sour
      but for the unnecessary parameters, such as price, the URL includes a predefined directory, /filtering/:
      example.com/filtering/category.php?category=gummy-candies&price=5-10
      which is then disallowed in robots.txt:
      User-agent: *
      Disallow: /filtering/
    • Option 3: Separate hosts
      If you’re not using a CDN (sites using CDNs don’t have this flexibility easily available in Webmaster Tools), consider placing any URLs with unnecessary parameters on a separate host -- for example, creating a main host, www.example.com, and a secondary host, www2.example.com. On the secondary host (www2), set the Crawl rate in Webmaster Tools to “low” while keeping the main host’s crawl rate as high as possible. This allows more complete crawling of the main host’s URLs and reduces Googlebot’s focus on your unnecessary URLs.
      • Be sure there remains at least one click path to all items on the main host.
      • If you’d like to consolidate indexing signals, consider adding rel=”canonical” from the secondary host to a superset URL on the main host (e.g. www2.example.com/category.php?category=gummy-candies&taste=sour&price=5-10 may specify a rel=”canonical” to the superset “sour gummy candies” view-all page, www.example.com/category.php?category=gummy-candies&taste=sour&page=all).
  • Prevent clickable links when no products exist for the category/filter.
  • Add logic to the display of URL parameters.
    • Remove unnecessary parameters rather than continuously append values.
      • Avoid
        example.com/product?cat=gummy-candy&cat=lollipops&cat=gummy-candy&item=swedish-fish
    • Help the searcher experience by keeping a consistent parameter order, with searcher-valuable parameters listed first (as the URL may be visible in search results) and searcher-irrelevant parameters (e.g., session IDs) listed last.
      • Avoid
        example.com/category.php?session-id=123&tracking-id=456&category=gummy-candies&taste=sour
  • Improve indexing of individual content pages with rel=”canonical” to the preferred version of a page. rel=”canonical” can be used across hostnames or domains.
  • Improve indexing of paginated content (such as page=1 and page=2 of the category “gummy candies”) by either:
    • Adding rel=”canonical” from individual component pages in the series to the category’s “view-all” page (e.g. page=1, page=2, and page=3 of “gummy candies” with rel=”canonical” to category=gummy-candies&page=all) while making sure that it’s still a good searcher experience (e.g., the page loads quickly).
    • Using pagination markup with rel=”next” and rel=”prev” to consolidate indexing properties, such as links, from the component pages/URLs to the series as a whole.
  • Be sure that if you use JavaScript to dynamically sort/filter/hide content without updating the URL, there still exist URLs on your site that searchers would find valuable, such as main category and product pages that can be crawled and indexed. For instance, avoid using only the homepage (i.e., one URL) for your entire site with JavaScript to dynamically change content with user navigation -- this would unfortunately provide searchers with only one URL to reach all of your content. Also, check that performance isn’t negatively affected by dynamic filtering, as this could undermine the user experience.
  • Include only canonical URLs in Sitemaps.
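
Returning to Option 1 above, here is a minimal JavaScript sketch of generating rel="nofollow" links for unnecessary-parameter URLs and a rel="canonical" tag that points filtered views at their superset “view-all” page. The parameter names follow the gummy-candies example; the helpers themselves are illustrative only.

  const UNNECESSARY_PARAMS = new Set(['price', 'session-id']);

  function renderFilterLink(href, label) {
    const params = new URL(href, 'https://example.com').searchParams;
    const needsNofollow = [...params.keys()].some((key) => UNNECESSARY_PARAMS.has(key));
    return needsNofollow
      ? `<a href="${href}" rel="nofollow">${label}</a>`
      : `<a href="${href}">${label}</a>`;
  }

  function canonicalLinkTag(currentUrl) {
    // Point the filtered view back at the superset "view-all" page.
    const url = new URL(currentUrl);
    for (const param of UNNECESSARY_PARAMS) url.searchParams.delete(param);
    url.searchParams.set('page', 'all');
    return `<link rel="canonical" href="${url}">`;
  }

  renderFilterLink('/category.php?category=gummy-candies&taste=sour&price=5-10', '$5 - $10');
  // => <a href="..." rel="nofollow">$5 - $10</a>
  canonicalLinkTag('https://example.com/category.php?category=gummy-candies&taste=sour&price=5-10');
  // => <link rel="canonical" href="https://example.com/category.php?category=gummy-candies&taste=sour&page=all">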

Best practices for existing sites with faceted navigation

First, know that the best practices listed above (e.g., rel=”nofollow” for unnecessary URLs) still apply if/when you’re able to implement a larger redesign. Otherwise, with existing faceted navigation, it’s likely that a large crawl space was already discovered by search engines. Therefore, focus on reducing further growth of unnecessary pages crawled by Googlebot and consolidating indexing signals.

  • Use parameters (when possible) with standard encoding and key=value pairs.
  • Verify that values that don’t change page content, such as session IDs, are implemented as standard key=value pairs, not directories.
  • Prevent clickable anchors when no products exist for the category/filter (i.e., don’t allow clicks or URLs to be created when no items exist for the filter).
  • Add logic to the display of URL parameters.
    • Remove unnecessary parameters rather than continuously append values (e.g., avoid example.com/product?cat=gummy-candy&cat=lollipops&cat=gummy-candy&item=swedish-fish)
  • Help the searcher experience by keeping a consistent parameter order, with searcher-valuable parameters listed first (as the URL may be visible in search results) and searcher-irrelevant parameters listed last -- e.g., prefer example.com/category.php?category=gummy-candies&taste=sour&session-id=123&tracking-id=456 over example.com/category.php?session-id=123&tracking-id=456&category=gummy-candies&taste=sour (a small ordering sketch follows this list).
  • Configure Webmaster Tools URL Parameters if you have a strong understanding of the URL parameter behavior on your site (make sure that there is still a clear click path to each individual item/article). For instance, with URL Parameters in Webmaster Tools, you can list the parameter name, the parameter’s effect on the page content, and how you’d like Googlebot to crawl URLs containing the parameter.


    URL Parameters in Webmaster Tools allows the site owner to provide information about the site’s parameters and recommendations for Googlebot’s behavior.

  • Be sure that if you use JavaScript to dynamically sort/filter/hide content without updating the URL, there still exist URLs on your site that searchers would find valuable, such as main category and product pages that can be crawled and indexed. For instance, avoid using only the homepage (i.e., one URL) for your entire site with JavaScript to dynamically change content with user navigation -- this would unfortunately provide searchers with only one URL to reach all of your content. Also, check that performance isn’t negatively affected by dynamic filtering, as this could undermine the user experience.
  • Improve indexing of individual content pages with rel=”canonical” to the preferred version of a page. rel=”canonical” can be used across hostnames or domains.
  • Improve indexing of paginated content (such as page=1 and page=2 of the category “gummy candies”) by either:
    • Adding rel=”canonical” from individual component pages in the series to the category’s “view-all” page (e.g. page=1, page=2, and page=3 of “gummy candies” with rel=”canonical” to category=gummy-candies&page=all) while making sure that it’s still a good searcher experience (e.g., the page loads quickly).
    • Using pagination markup with rel=”next” and rel=”prev” to consolidate indexing properties, such as links, from the component pages/URLs to the series as a whole.
  • Include only canonical URLs in Sitemaps.
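
As mentioned above, here is a tiny sketch for enforcing a consistent parameter order, with searcher-valuable parameters first and session/tracking IDs last (the ordering list is an assumption for illustration):

  const PARAM_ORDER = ['category', 'taste', 'price', 'session-id', 'tracking-id'];

  function orderedQueryString(params) {
    const ordered = new URLSearchParams();
    for (const name of PARAM_ORDER) {
      if (name in params) ordered.set(name, params[name]);
    }
    return ordered.toString();
  }

  orderedQueryString({ 'session-id': '123', 'tracking-id': '456', category: 'gummy-candies', taste: 'sour' });
  // => category=gummy-candies&taste=sour&session-id=123&tracking-id=456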

Remember that, in general, the simpler you can keep it, the better. Questions? Please ask in our Webmaster discussion forum.