Huge Google Leak: Documents Expose Hidden Search Ranking Factors - WinBuzzer (2024)

2,500 internal documents from Google's Content Warehouse API have been leaked, providing a rare glimpse into the company's search algorithms. The leak, which was shared with Rand Fishkin, includes information on data storage for content, links, and user interactions. It lacks details on scoring functions but offers significant insights into Google's ranking mechanisms.

Rand Fishkin is a digital marketing expert known for co-founding Moz, an SEO software company, and creating the “Whiteboard Friday” video series. He authored “Lost and Founder” and later founded SparkToro, a market research and audience intelligence platform.

Google search is one of the most secretive, closely-guarded black boxes in the world. Well, maybe not anymore.

In the last quarter century, no leak of this magnitude or detail has ever been reported from Google's search division. If you're in #SEO, you should probably see this. pic.twitter.com/JxEs55IV21

— Rand Fishkin (follow @randderuiter on Threads) (@randfish) May 28, 2024

As Fishkin writes on the SparkToro blog, the leaked documentation outlines an extensive array of 2,596 modules with 14,014 attributes connected to various Google services, including YouTube, Assistant, and web documents. These modules are part of a monolithic repository, meaning all code is stored in one centralized location and accessible by any machine on the network.

“On Sunday, May 5th, I received an email from a person claiming to have access to a massive leak of API documentation from inside Google's Search division. The email further claimed that these leaked documents were confirmed as authentic by ex-Google employees, and that those ex-employees and others had shared additional, private information about Google's search operations.

Many of their claims directly contradict public statements made by Googlers over the years, in particular the company's repeated denial that click-centric user signals are employed, denial that subdomains are considered separately in rankings, denials of a sandbox for newer websites, denials that a domain's age is collected or considered, and more.”

Behind the leak is Erfan Azimi, CEO and Director of SEO at EA Eagle Digital, a digital marketing agency. He says he got the documents from former Google employees and has no financial motives for leaking the information. “The main motive is, […], if there is something we need to know, that we've been lied to. The truth needs to come out. The truth needs to come out,” Azimi states.

A few weeks ago, Azimi had already published a leaked Google document, most probably part of the bigger leak, that showcases “shadow-ban” penalties on sites that don't match Google's political views.

Clicks Influence Rankings

According to the documents shared with Rand Fishkin, Google's claim that clicks do not influence rankings is contradicted by the existence of the NavBoost system, which employs click-driven measures to adjust rankings. This system has been around since 2005 and uses click data to reinforce or demote rankings.

The sources behind the leak reportedly say that Google's search team recognized the need for full clickstream data in their early years to improve search result quality. This data includes every URL visited by a browser. NavBoost, which initially gathered data from Google's Toolbar PageRank, was a key motivation for creating the Chrome browser. The system identifies trending search demand by analyzing the number of searches for a given keyword and the number of clicks on a search result, and by differentiating between long clicks and short clicks.
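The distinction between long and short clicks can be illustrated with a minimal sketch. This is not Google's implementation: the leaked documentation does not include scoring functions, so the `Click` record, the `summarize_clicks` helper, and the 30-second cutoff are all hypothetical names and values chosen purely for illustration.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the leaked documents do not say how long a
# "long click" is, so 30 seconds is an illustrative assumption only.
LONG_CLICK_SECONDS = 30.0


@dataclass
class Click:
    query: str
    url: str
    dwell_seconds: float  # time spent before returning to the results page


def summarize_clicks(clicks):
    """Count long and short clicks for each (query, url) pair."""
    summary = {}
    for click in clicks:
        key = (click.query, click.url)
        counts = summary.setdefault(key, {"long": 0, "short": 0})
        kind = "long" if click.dwell_seconds >= LONG_CLICK_SECONDS else "short"
        counts[kind] += 1
    return summary


clicks = [
    Click("google leak", "https://example.com/a", 95.0),
    Click("google leak", "https://example.com/a", 4.0),
    Click("google leak", "https://example.com/b", 2.5),
]
summary = summarize_clicks(clicks)
```

In this toy data, the first URL collects one long and one short click, while the second collects only a short click. A system along these lines could treat a high share of long clicks as a sign of satisfied searchers.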

The leak has several implications for SEO practices. Google's Panda algorithm, for instance, uses a scoring modifier based on user behavior and external links, applied at various levels such as domain, subdomain, and subdirectory. Google also stores author information, highlighting the importance of authorship in rankings. Various demotions are applied for factors like anchor mismatch, search engine results page (SERP) dissatisfaction, and exact match domains. Links remain crucial, with metrics like sourceType indicating the value of links based on their indexing location. Google also measures the originality of short content and counts tokens, emphasizing the importance of placing key content early in the text. The following algorithmic demotions are used by Google, according to the leak:

  • Anchor Mismatch: Links with irrelevant anchor text are demoted.
  • SERP Demotion: Pages showing poor user satisfaction in the SERP are demoted.
  • Exact Match Domains: These receive less value in rankings.
  • Product Review Demotion: Likely related to the recent product reviews update.
  • Location Demotions: “Global” and “super global” pages can be demoted to favor locally relevant content.

Misleading Public Statements

Contrary to Google's public statements, the documents reveal several features that the company has previously denied. For instance, a feature called “siteAuthority” indicates that Google does measure sitewide authority, despite claims to the contrary. Systems like NavBoost use click data to influence rankings, contradicting Google's denials about clicks affecting search results. The documentation also mentions a “hostAge” attribute used to sandbox new sites, a practice Google has publicly denied. According to the documentation, this “sandbox” segregates new or untrusted sites to prevent fresh spam from ranking highly in search results.

To analyze the shared material, Fishkin worked with Mike King from iPullRank, who published a detailed analysis of what they have found so far. According to King, there are “2,596 modules represented in the API documentation with 14,014 attributes” in Google's ranking system. The leaked documentation outlines each module of the API and breaks them down into summaries, types, functions, and attributes.

Despite Google's public statements denying the use of domain authority, the documentation confirms “siteAuthority” as being used in the “Q* ranking system”. This indicates that Google does calculate and use a measure of sitewide authority. The leak seems to uncover several lies from Google's side about how ranking works. King writes:

“Google spokespeople have said numerous times that they don't use “domain authority.” I've always assumed that this was a lie by omission and obfuscation.

By saying they don't use domain authority, they could be saying they specifically don't use Moz's metric called “Domain Authority” (obviously 🙄). They could also be saying they don't measure the authority or importance for a specific subject matter (or domain) as it relates to a website. This confusion-by-way-of-semantics allows them to never directly answer the question as to whether they calculate or use sitewide authority metrics.”

Google's ranking system is described as a series of microservices rather than a single algorithm. Key systems include “Trawler” for crawling, “Alexandria” for indexing, “Mustang” for ranking, and “SuperRoot” for query processing. These microservices work in tandem to process and rank search results.

The Role of Twiddlers for Re-Ranking

Google uses so-called Twiddlers, which are re-ranking functions that adjust search results before they are presented to users. Examples of these functions include NavBoost, QualityBoost, and RealTimeBoost. These mechanisms fine-tune search results based on various factors, including user engagement and content quality.
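The general idea of a re-ranking step applied after initial scoring can be sketched as follows. This is a simplified illustration and not the leaked code: the `Result` fields, the two twiddler functions, and every multiplier and threshold are assumptions made up for this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Result:
    url: str
    score: float             # score from the initial ranking stage
    long_click_share: float  # hypothetical engagement signal in [0, 1]
    quality: float           # hypothetical quality signal in [0, 1]


def click_twiddler(result):
    # Hypothetical boost of up to 20% for results with many long clicks.
    return replace(result, score=result.score * (1.0 + 0.2 * result.long_click_share))


def quality_twiddler(result):
    # Hypothetical demotion: halve the score of low-quality results.
    if result.quality < 0.3:
        return replace(result, score=result.score * 0.5)
    return result


def rerank(results, twiddlers):
    """Apply each twiddler to every result, then sort by adjusted score."""
    for twiddle in twiddlers:
        results = [twiddle(r) for r in results]
    return sorted(results, key=lambda r: r.score, reverse=True)


initial = [
    Result("https://example.com/a", 1.00, 0.1, 0.2),
    Result("https://example.com/b", 0.90, 0.9, 0.8),
]
final = rerank(initial, [click_twiddler, quality_twiddler])
```

Here the second result overtakes the first after re-ranking, even though its initial score was lower, because it has stronger engagement and is not hit by the quality demotion. Keeping each adjustment as a separate function mirrors the description of Twiddlers as independent adjustments applied after the initial ranking.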

According to the leak, Google employs various methods to combat manual and automated click spam, including using cookie history, logged-in Chrome data, and pattern detection. NavBoost scores queries for user intent, triggering video or image features based on user engagement thresholds. The system also evaluates a site's overall quality at the host level, which can result in a boost or demotion. Although Google has stated that Chrome data is not used in search rankings, the leaked documents reveal that views from Chrome are considered in page quality scores and other ranking factors.

Google also uses geo-fencing for click data, considering factors like country, state/province levels, and mobile versus desktop usage. If data is lacking for certain regions, the process may be applied universally. During the Covid-19 pandemic, Google used whitelists for websites appearing in Covid-related searches. Similarly, during democratic elections, Google used whitelists for sites shown or demoted in election-related information.

King speculates in his analysis on whether the Helpful Content Update is related to what is called “Baby Panda” and what NSR (Neural Semantic Retrieval) might mean. “There are two references to something called “baby panda” in the Compressed Quality Signals. Baby Panda is a Twiddler which is a bolt on adjustment after initial ranking,” he writes.

“I think we are generally in agreement that the Helpful Content Update has many of the same behaviors of Panda. If it's built on top of a system using reference queries, links, and clicks those are the things you'll need to focus on after you improve your content.”

Takeaways

He concludes by saying that “we now have a much better understanding of many of the features that Google is using to build rankings. Through a combination of clickstream data and feature extraction, we can replicate more than we could previously.

“An important thing we can all take away from this is: SEOs know what they are doing. After years of being told we're wrong it's good to see behind the curtain and find out we have been right all along. And, while there are interesting nuances of how Google works in these documents there is nothing that is going to make me dramatically change course in how I strategically do SEO.

For those that dig in, these documents will primarily serve to validate what seasoned SEOs have long advocated. Understand your audience, identify what they want, make the best thing possible that aligns with that, make it technically accessible, and promote it until it ranks.”

With this background knowledge, this might be a good moment to revisit Nilay Patel's recent interview with Google CEO Sundar Pichai.

Markus Kasanmascheff

Markus is the founder of WinBuzzer and has been playing with Windows and technology for more than 25 years. He holds a Master's degree in International Economics and previously worked as Lead Windows Expert for Softonic.com.