Delete your pages and rank higher in search - Index bloat and technical optimization 2019 - Search Engine Watch

If you are looking for a way to optimize your site for technical SEO and rank better, consider deleting your pages.

I know, crazy, right? But hear me out.

We all know that Google can be slow to index content, especially on new websites. But every now and then it can aggressively index anything and everything it can get its robot hands on, whether you want it to or not. This can cause terrible headaches and hours of cleanup and subsequent maintenance, especially on large sites and/or e-commerce sites.

It is our job as search engine optimization experts to ensure that Google and other search engines can first find our content so that they can then understand, index and rank it appropriately. When we have too many indexed pages, we are not being clear about how we want search engines to treat our pages. As a result, they take whatever action they deem best, which sometimes translates into indexing more pages than necessary.

Before you know it, you are dealing with index bloat.

What is index bloat?

Simply put, index bloat is when you have too many low quality pages on your site indexed in search engines. Similar to a bloated feeling in the human digestive system (disclaimer: I am not a doctor), the result of processing this excess content can be seen in search engine indices when their information retrieval process becomes less efficient.

Index bloat can even make your life difficult without your knowledge. In this swollen and uncomfortable situation, Google has to go through much more content than necessary (usually low-quality and internal duplicate content) before it can reach the pages that you want indexed.

Look at it this way: Google visits your XML sitemap and finds 5,000 pages, then crawls all of your pages, finds even more via internal links, and finally decides to index 30,000 URLs. This results in an indexation surplus of around 500% or even more.

But don't worry, diagnosing your indexation rate to measure index bloat can be a very simple and straightforward check. All you have to do is compare the pages you want indexed against the pages that Google is actually indexing (more on this later).
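To make the arithmetic behind that surplus figure concrete, here is a minimal sketch. The function name `indexation_surplus` is my own, not part of any SEO tool; it simply expresses the excess of indexed pages over intended pages as a percentage.

```python
def indexation_surplus(intended_pages: int, indexed_pages: int) -> float:
    """Return indexed pages in excess of the intended count, as a percentage."""
    if intended_pages <= 0:
        raise ValueError("intended_pages must be positive")
    return (indexed_pages - intended_pages) / intended_pages * 100


# Figures from the example above: 5,000 sitemap URLs, 30,000 indexed URLs.
surplus = indexation_surplus(5_000, 30_000)
print(f"Indexation surplus: {surplus:.0f}%")  # Indexation surplus: 500%
```

A result of 0% would mean Google indexes exactly as many pages as you intended; a negative result would point to under-indexation instead.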

The goal is to find that disparity and take the most appropriate action. We have two options:

  1. Content is of good quality = Maintain indexability
  2. Content is of low quality (thin, duplicate or paginated) = noindex

You will notice that resolving index bloat usually means removing a relatively large number of pages from the index by adding a "noindex" meta tag. However, this indexation analysis can also surface pages that were missed when your XML sitemap(s) were created; these can then be added to your sitemap(s) for better indexing.

Why index bloat is harmful to SEO

Index bloat can slow down processing time, consume more resources, and open up avenues outside of your control in which search engines can get stuck. One of the goals of SEO is to remove the roadblocks, often technical in nature, that keep great content from ranking in search engines. Examples include slow load speeds, noindex or nofollow meta tags used where they shouldn't be, and the lack of a proper internal linking strategy.

Ideally, you would have 100% indexation. This means that every quality page on your site is indexed: no pollution, no unwanted material, no bloat. For the sake of this analysis, let's consider anything above 100% to be bloat. Index bloat forces search engines to spend more of their (limited) resources than necessary to process the pages they have in their database.

In the best case, index bloat causes inefficient crawling and indexing, which hampers your ability to rank. In the worst case, it can lead to keyword cannibalization across many pages on your site, limiting your ability to rank in top positions and potentially hurting the user experience by sending searchers to low-quality pages.

In summary, index bloat causes the following problems:

  1. Uses the limited resources that Google allocates for a specific site
  2. Creates orphaned content (sending Googlebot to dead ends)
  3. Has a negative influence on the ranking possibilities of the website
  4. Lowers the quality evaluation of the domain in the eyes of search engines

Sources of index bloat

1. Internal duplicate content

Unintended duplicate content is one of the most common sources of index bloat. This is because most sources of internal duplicate content revolve around technical errors that generate a large number of URL combinations, which eventually get indexed. One example is using URL parameters to control the content on your site without proper canonicalization.
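As a rough illustration of how parameterized URLs collapse into the same page, the sketch below groups URL variants by a normalized form. The `IGNORED_PARAMS` set is a hypothetical list of parameters that do not change page content; you would need to decide for your own site which parameters belong in it.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters that do not change page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "sort"}


def canonical_form(url: str) -> str:
    """Drop ignored parameters and sort the rest so equivalent URLs compare equal."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # The fragment is dropped because it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), ""))


def group_duplicates(urls):
    """Map each canonical form to the URL variants that collapse into it."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_form(url)].append(url)
    return {canon: variants for canon, variants in groups.items() if len(variants) > 1}
```

Running `group_duplicates` over a crawl export gives a quick list of URL clusters that probably need a canonical tag pointing at a single version.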

Faceted navigation is also one of the "thorniest SEO challenges" for large e-commerce sites, as Portent describes, and has the potential to generate billions of duplicate content pages by overlooking a simple feature.
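To see how quickly facets multiply, here is a small back-of-the-envelope sketch. The facet names and sizes are made up for illustration, and the count assumes each facet is either unset or set to exactly one value.

```python
from math import prod

# Hypothetical facets for a single category page: facet name -> number of values.
facets = {"color": 12, "size": 8, "brand": 40, "price_range": 6, "material": 10}


def facet_url_count(facet_sizes):
    """Count URLs when each facet is either unset or set to one of its values."""
    return prod(size + 1 for size in facet_sizes)


print(facet_url_count(facets.values()))  # 369369
```

That is for one category page, before multi-select filters or different parameter orderings are taken into account, both of which push the number much higher.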

2. Thin content

It is important to mention an issue introduced by version 7.0 of the Yoast SEO plug-in around attachment pages. This WordPress plug-in bug led to "Panda-like problems" in March 2018, causing severe ranking drops for affected sites because Google deemed these sites to be lower in the overall quality they offered searchers. In short, there is a setting in the Yoast plug-in to remove attachment pages in WordPress (a page created for every image in your library, with minimal content), which are the epitome of thin content for most sites. For some users, updating to the latest version (then 7.0) caused the plug-in to overwrite their previous choice to remove these pages and to index all attachment pages by default.

This meant that having five images per blog post would multiply the number of indexed pages roughly fivefold, leaving only about 16% of the indexed URLs (one post out of six) with actual quality content, which would cause a huge drop in domain value.

3. Pagination

Pagination refers to the concept of splitting content into a series of pages to make it more accessible and improve the user experience. This means that if you have 30 blog posts on your site, you may show ten blog posts per page, going three pages deep.


You often see this on, among other things, store pages, press releases and news sites.

Within the scope of SEO, the pages beyond the first in the series will often share the same page title and meta description, along with very similar (near-duplicate) content, introducing keyword cannibalization into the mix. In addition, the purpose of these pages is to provide a better browsing experience for users who are already on your site. It makes no sense to send search engine visitors to the third page of your blog.

4. Poorly performing content

If there is content on your site that does not generate traffic, has not resulted in any conversions and has no backlinks, you may want to consider changing your strategy. Repurposing content is a great way to salvage any value from underperforming pages and use it to create stronger, more authoritative pages.

Remember that it is our job as SEO experts to help improve the overall quality and value that a domain offers, and improving content is one of the best ways to do this. For this you need a content audit to evaluate your own individual situation and to determine the best course of action.

Even a 404 page that returns a 200 (live) HTTP status code is a thin, low-quality page that should not be indexed.
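A "soft 404" like this can be flagged with a simple heuristic. The sketch below is only one possible approach: the phrases in `NOT_FOUND_PATTERN` and the 50-word threshold are assumptions you would tune for your own error template, and the function expects you to have already fetched the status code, title and visible text.

```python
import re

# Hypothetical phrases that suggest an error template; tune for your own site.
NOT_FOUND_PATTERN = re.compile(r"page not found|not be found|404", re.IGNORECASE)


def looks_like_soft_404(status_code, title, body_text, min_words=50):
    """Flag pages that return 200 but read like an error page."""
    if status_code != 200:
        return False  # a real 404 or 410 already tells search engines the page is gone
    error_title = bool(NOT_FOUND_PATTERN.search(title))
    thin_body = len(body_text.split()) < min_words
    return error_title and thin_body
```

Anything this flags should still be checked by hand before you change the server response or add a noindex.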

Common problems with index bloat

One of the first things I do when auditing a site is open its XML sitemap. If the site runs on WordPress with a plug-in such as Yoast SEO or All in One SEO, you can very quickly find page types that do not need to be indexed. Check for the following:

  • Custom post types
  • Testimonial pages
  • Case study pages
  • Team pages
  • Author pages
  • Blog category pages
  • Blog tag pages
  • Thank you pages
  • Test pages

Whether the pages in your XML sitemap are low quality and should be removed from search really depends on the purpose they serve on your site. For example, some sites do not use author pages in their blog but still have those pages live, which is unnecessary. "Thank you" pages should not be indexed at all, as this can cause conversion tracking anomalies. Test pages usually mean that there is a duplicate somewhere else. Similarly, some plug-ins or developers build custom features on web builds and create many pages that do not need to be indexed. For example, if you find an XML sitemap such as the one below, it probably doesn't need to be indexed:


Different methods to diagnose index bloat

Remember that our goal here is to find the largest contributors of low-quality pages that bloat the index. It is usually very easy to find these pages at scale because many thin-content pages follow a pattern.

This is a quantitative analysis of your content, looking for volume differences based on the number of pages you have, the number of pages you link to, and the number of pages that Google indexes. Any difference between these numbers means that there is room for technical optimization, which often results in an increase in organic rankings once it has been resolved. You want to make these numbers as similar as possible.

While going through the different methods of diagnosing index bloat below, look for patterns in URLs by viewing the following:

  • URLs with /dev/
  • URLs with "test"
  • Subdomains that should not be indexed
  • Subdirectories that do not need to be indexed
  • A large number of PDF files that do not need to be indexed
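The checklist above lends itself to a quick pattern scan over a list of URLs. This is a minimal sketch: the subdomain and subdirectory names are hypothetical placeholders, and the plain "test" substring match is deliberately broad, so it will also catch harmless URLs such as /testimonials/ and needs a manual review afterwards.

```python
from urllib.parse import urlsplit

# Hypothetical values; replace with the subdomains and folders that apply to your site.
SUSPECT_SUBDOMAINS = {"staging.example.com"}
SUSPECT_DIRECTORIES = ("/internal/",)


def flag_url(url):
    """Return the labels of every suspect pattern the URL matches."""
    parts = urlsplit(url)
    path = parts.path.lower()
    labels = []
    if parts.netloc.lower() in SUSPECT_SUBDOMAINS:
        labels.append("subdomain")
    if any(directory in path for directory in SUSPECT_DIRECTORIES):
        labels.append("subdirectory")
    if "/dev/" in path:
        labels.append("dev path")
    if "test" in url.lower():
        labels.append("test")
    if path.endswith(".pdf"):
        labels.append("pdf")
    return labels
```

Feeding a crawl export through `flag_url` and keeping only the URLs with a non-empty result gives you a first shortlist for the spreadsheet.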

Next, I will guide you through a few simple steps that you can take yourself using some of the most basic tools available for SEO. These are the tools you need:

  • Paid version of Screaming Frog
  • Verified Google Search Console
  • The XML sitemap of your website
  • Editor access to your Content Management System (CMS)

As you start finding anomalies, add them to a spreadsheet so that they can be manually reviewed for quality.

1. Screaming Frog crawl

Under Configuration > Spider > Basic, configure Screaming Frog to perform a thorough scan of your site's pages: check "crawl all subdomains" and "crawl outside of start folder", and manually add your XML sitemap(s) if you have any. After the crawl is complete, review all indexable pages listed. You will find these in the "Self Referencing" report on the Canonicals tab.

screenshot example of using Screaming Frog to scan through XML sitemaps

Take a look at the number that you see. Are you surprised? Do you have more or fewer pages than you thought? Make a note of the number. We will come back to this.

2. Google's Search Console

Open your Google Search Console (GSC) property and go to the Index > Coverage report. Take a look at the valid pages. In this report, Google tells you how many total URLs it has found on your site. Review the other reports as well; GSC can be a great tool for evaluating what Googlebot finds when it visits your site.

screenshot example of the Google Search Console coverage report

How many pages does Google say it is indexing? Make a note of the number.

3. Your XML sitemaps

This is a simple check. Go to your XML sitemap and count the number of URLs included. Is the number off? Are there unnecessary pages? Are there not enough pages?
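If the sitemap is too long to count by eye, a few lines of code can do it. This sketch assumes a standard URL sitemap (a `<urlset>` in the sitemaps.org namespace) that you have already downloaded as text; it does not follow sitemap index files, and the sample URLs are placeholders.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> value listed in a URL sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/</loc></url>
  <url><loc>https://example.com/thank-you/</loc></url>
</urlset>"""

urls = sitemap_urls(sample)
print(len(urls))  # 3
```

Having the URLs as a list rather than just a count also makes it easy to spot page types, such as the thank-you page in the sample, that should not be there.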

Perform a crawl with Screaming Frog, add your XML sitemap to the configuration and perform a crawl analysis. When it is finished, you can visit the Sitemaps tab to see which specific pages are included in your XML Sitemap and which are not.
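If you prefer to do this comparison outside of Screaming Frog, the same idea can be sketched with plain sets. This assumes you have exported two lists of URLs yourself (one from the sitemap, one of the indexable pages from a crawl); the function name is my own.

```python
def compare_sitemap_to_crawl(sitemap_list, crawled_indexable_list):
    """Split URLs into those only in the sitemap, only in the crawl, or in both."""
    in_sitemap = set(sitemap_list)
    in_crawl = set(crawled_indexable_list)
    return {
        "sitemap_only": sorted(in_sitemap - in_crawl),  # listed but not found indexable
        "crawl_only": sorted(in_crawl - in_sitemap),    # indexable but missing from the sitemap
        "both": sorted(in_sitemap & in_crawl),
    }
```

The "crawl_only" bucket is where index bloat candidates and pages missed from the sitemap both tend to show up, so it deserves the closest look.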

Example of using Screaming Frog to perform an XML Sitemap crawl analysis

Make a note of the number of indexable pages.

4. Your own Content Management System (CMS)

This is also a simple check, so don't overthink it. How many pages do you have on your site? How many blog posts do you have? Add them up. We are looking for high-quality content that offers value, but in a more quantitative way. It does not have to be exact, because the actual quality of a piece of content can be measured through a content audit.

Note the number that you see.

5. Google

Finally, we come to the last check in our series. Sometimes Google throws a number at you and you have no idea where it comes from, but try to be as objective as possible. Run a "site:" search on Google and check how many results Google returns from its index. Remember that this is purely a numeric value and does not really determine the quality of your pages.

screenshot example of using Google search results to recognize inefficient indexation

Make a note of the number that you see and compare it with the other numbers you found. Any differences you encounter are symptoms of inefficient indexation. Completing a simple quantitative analysis can lead you to areas that may not meet minimum qualitative criteria. In other words, comparing numeric values from multiple sources helps you find pages on your site that have a low value.

The quality criteria that we evaluate against can be found in Google's Webmaster Guidelines.
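Once you have all five numbers, it helps to put them side by side. The sketch below expresses each source as a percentage difference from a baseline; the counts are invented for illustration, and using the CMS count as the baseline is my assumption, not a rule.

```python
def count_discrepancies(counts, baseline="cms"):
    """Express each source's page count as a percentage difference from the baseline."""
    base = counts[baseline]
    if base <= 0:
        raise ValueError("baseline count must be positive")
    return {
        source: round((value - base) / base * 100, 1)
        for source, value in counts.items()
        if source != baseline
    }


# Hypothetical numbers noted during the five checks above.
counts = {
    "cms": 1000,
    "xml_sitemap": 950,
    "screaming_frog": 1800,
    "search_console": 2600,
    "site_search": 3100,
}
print(count_discrepancies(counts))
# {'xml_sitemap': -5.0, 'screaming_frog': 80.0, 'search_console': 160.0, 'site_search': 210.0}
```

In this made-up case the sitemap is slightly short of the CMS count, while the crawl and both Google numbers are far above it, which is the pattern you would expect from index bloat.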

How to resolve index bloat

Resolving index bloat is a slow and tedious process, but you must trust the optimizations that you perform on the site and be patient, because the results may be slow to become noticeable.

1. Delete pages (ideal)

In an ideal scenario, low-quality pages would not exist on your site and would therefore not consume search engines' limited resources. If you have a large number of obsolete pages that you no longer use, cleaning them up (deleting them) can often lead to other benefits, such as fewer redirects and 404s, fewer thin-content pages, and less room for errors and misinterpretation by search engines, to name a few.

The less control you give search engines by limiting their options over which action to take, the more control you will have over your site and your SEO.

This is of course not always realistic. So here are a few alternatives.

2. Use Noindex (alternative)

Using this method at the page level (please don't add a site-wide noindex, which happens more often than we would like) or across a set of pages is probably the most efficient option, because it can be completed very quickly on most platforms.

  • Do you use all those testimonials pages on your site?
  • Do you have a proper blog tag/category structure in place, or are those pages just bloating the index?
  • Does it make sense for your company to have all those blog author's pages indexed?

All of the above can be noindexed on WordPress with a few clicks and removed from your XML sitemap(s) if you use Yoast SEO or All in One SEO.
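After changing those settings, it is worth confirming that the pages really carry the directive. Here is a minimal sketch that looks for a robots meta tag in a page's HTML, assuming you have already fetched the HTML yourself. It only checks the meta tag, not the `X-Robots-Tag` HTTP header, and the class and function names are my own.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for directive in attributes.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html):
    """Return True if the page's robots meta tag includes noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))  # True
```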

3. Using robots.txt (alternative)

Using the robots.txt file to disallow sections or pages of your site is not recommended for most websites, unless it has been explicitly recommended by an SEO expert after auditing your website. It is incredibly important to look at the specific environment your site is in and how disallowing certain pages would affect the indexation of the rest of the site. Making a careless change here can have unintended consequences.

Now that we have that disclaimer out of the way: disallowing certain sections of your site means that you are blocking search engines from even reading those pages. So if you have added a noindex to a page and also disallowed it, Google cannot read the noindex tag on that page or follow your directive, because you have blocked its access. The order of operations is absolutely crucial here if you want Google to follow your directives.
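This conflict is easy to check for. The sketch below uses Python's standard `urllib.robotparser` to list noindexed URLs that robots.txt would stop a crawler from fetching; the robots.txt content and URLs are made-up examples, and Python's parser may not match Googlebot's own rule matching in every edge case.

```python
from urllib.robotparser import RobotFileParser


def blocked_noindex_urls(robots_txt, noindexed_urls, user_agent="Googlebot"):
    """Return noindexed URLs that robots.txt prevents the crawler from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindexed_urls if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt and a list of pages that carry a noindex tag.
robots_txt = """User-agent: *
Disallow: /tag/
"""
noindexed = ["https://example.com/tag/seo/", "https://example.com/author/jane/"]

print(blocked_noindex_urls(robots_txt, noindexed))
# ['https://example.com/tag/seo/']
```

Any URL in the output has a noindex that the crawler is not allowed to fetch and therefore cannot see, so the disallow rule would need to be lifted until the page has dropped out of the index.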

4. Use the Google Search Console removal tool (temporarily)

As a last resort, an action item that does not require developer resources is the manual removal tool in the old Google Search Console. Using this method to remove pages, entire subdirectories, and entire subdomains from Google Search is only temporary. It can be done very quickly, with just a few clicks. Just be careful what you ask Google to de-index.

A successful removal request lasts only about 90 days, but it can be revoked manually. This option can also be combined with a noindex meta tag to get URLs out of the index as quickly as possible.


Search engines despise thin content and try very hard to filter out all the spam on the web, hence the never-ending search quality updates that happen almost daily. In order to appease search engines and show them all the great content that we have spent so much time creating, webmasters need to make sure their technical SEO is buttoned up as early as possible in the life of the site, before index bloat becomes a nightmare.

Using the various methods described above, you can diagnose index bloat affecting your site and find out which pages should be removed. By doing this, you can improve the overall quality evaluation of your site in search engines, rank better, and get a cleaner index so that Google can find the pages you're trying to rank quickly and efficiently.

Pablo Villalpando is a bilingual SEO strategist for Victorious. He can be found on Twitter.
