Let's get one thing out of the way immediately: Google does not hate you, but it doesn't necessarily welcome you with open arms either. You're Google's guest, and if you break the house rules, you get served CAPTCHAs instead of the lovely data you came for!
This will be quite a lengthy article, but an important one, so grab a coffee and take the time to read it top to bottom!
In an ideal world you play by the book, and you will be fine. However, our world is far from ideal, and there are a few problems when accessing Google:
- When using out-of-the-box software like ScrapeBox, SEO PowerSuite or others, it's easy to screw up the settings, send a ton of requests and get blocked almost instantly.
- When you've created your own software/script, it's easy to slip up in the way you load proxies or access Google and get blocked as a result (more on this later).
- Lastly, and this is a big one: Google's rules aren't static!
It's important to really let that last one sink in, because chances are you've arrived on this page after triggering our ProxyGuard system and are now left wondering what went wrong! You've read some guides online, followed best practices, and even still, you've triggered the captcha system! What gives?
Well, what was written 2 years ago, or even 2 months ago, may no longer be relevant today. Google is constantly updating and improving its own system to ensure the best user experience at the least possible server load. That's perfectly fine, and to be expected.
So while you may have read that you can scrape a maximum of 8 keywords / hour, and 10 people confirmed that in the comments, if you get captchas scraping that many keywords, this truth may no longer hold water!
The proof is in the pudding, as they say. There's no rulebook on this; the only way to know if you're doing it right or wrong is whether you get your data or don't.
Knowing that, if you trigger captchas, it comes down to 1 of 2 high-level reasons:
- The IP range has been banned.
- You've crossed Google's threshold.
If you get captchas on all your proxies, #1 is the most likely scenario. Even though we use ProxyGuard to prevent this, there's never a guarantee. If this happens to you, contact us so we can work around that!
But in most cases it's #2: you've crossed some kind of threshold and Google blocked you, and in turn, we've blocked you from using the proxy for a while.
On to the meat of the matter: what you can do to find, and avoid crossing, the elusive threshold.
Things To Do In Order To Prevent Captchas
While this is written primarily for people that work with Google, a lot of these tips apply to Instagram, Facebook and other websites as well.
When Using Out-of-the-Box Software
- Trial and error, unfortunately! Start with long delays and very few requests and work your way up slowly.
- Sometimes the software itself has some kind of flaw where there's a footprint that's easy for the captcha system to pick up on. In that case, there's nothing you can do. Be sure to read customer experiences in forum threads to make sure you're not alone in this.
- Contact the creator of the software to ask for their advice on the best settings!
When Using Your Own Custom Script or Software
Thanks to our experience in coding and operating an API for Google's SERP data, and having helped numerous clients get their products optimized, here are some things we've learned through the years.
THE MOST IMPORTANT RULE:
Be natural! Remember the beginning of this article: Google's objective isn't to thwart you personally because they hate you - no, they want the best user experience at the least amount of server cost. This means that anything that's not a user will likely trigger captchas. Therefore the most important thing is to become that user and act natural!
When you build your own software, this is the number one thing to keep in mind. Here are some of the things that are, in our view, mandatory to ensure you look natural:
Random Delays.
A no-brainer: don't go from page to page without random, human-like delays. Time yourself if need be to get a range of timings.
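As a rough illustration of random delays, here is a minimal Python sketch. The function names and the 4 to 15 second range are placeholders we made up for the example, not values Google publishes; replace them with the timings you measured on yourself.

```python
import random
import time


def human_delay(low=4.0, high=15.0):
    """Return a random pause in seconds between low and high.

    The range is an assumption for illustration only. The draw is
    skewed towards the lower end, since most page views are short
    and only a few are long.
    """
    mode = low + (high - low) * 0.3
    return random.triangular(low, high, mode)


def pause(low=4.0, high=15.0):
    """Sleep for a random interval and return how long we slept."""
    delay = human_delay(low, high)
    time.sleep(delay)
    return delay
```

Call `pause()` between every page load rather than sleeping for a fixed number of seconds.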
Use a Headless, JS Enabled Browser.
Curl is ancient, and not loading JS is a tell-tale sign that a request isn't coming from an actual user. Tips: PhantomJS, CasperJS, Splinter, Selenium, and many more. Look them up to see which one is available in your programming language.
When you browse, Google places cookies to track you. Save these cookies, and save them for that specific proxy. Then when you use that proxy the next time, load those same cookies so you're consistent and natural.
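One possible way to keep cookies tied to a proxy is a small JSON file per proxy. The sketch below assumes cookies arrive as a list of dictionaries (the shape Selenium's `driver.get_cookies()` returns); the helper names and the `cookies` directory are hypothetical.

```python
import json
from pathlib import Path


def cookie_path(proxy, directory="cookies"):
    """Build a file name for this proxy (hypothetical naming scheme)."""
    safe = proxy
    for char in ("://", ":", "/", "@"):
        safe = safe.replace(char, "_")
    return Path(directory) / f"{safe}.json"


def save_cookies(proxy, cookies, directory="cookies"):
    """Store the cookie list collected while browsing through this proxy."""
    path = cookie_path(proxy, directory)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(cookies))


def load_cookies(proxy, directory="cookies"):
    """Return the cookies saved for this proxy, or an empty list."""
    path = cookie_path(proxy, directory)
    if not path.exists():
        return []
    return json.loads(path.read_text())
```

With Selenium, you would typically call `save_cookies(proxy, driver.get_cookies())` at the end of a session, and at the start of the next one open the site first and then pass each loaded cookie to `driver.add_cookie(...)`.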
Limit Your Activity Hours
Humans don't search 24/7 so neither should you. Limit your activity hours to regular, human hours to best mimic natural behaviour.
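A simple guard for activity hours could look like the sketch below. The 08:00 to 22:00 window is an assumption for illustration, and you would probably want to evaluate it in the time zone of the proxy's location rather than your server's.

```python
from datetime import datetime, time


def within_active_hours(now=None, start=time(8, 0), end=time(22, 0)):
    """Return True if `now` falls inside the allowed daily window.

    The default window is a made-up example; `now` defaults to the
    local time of the machine running the script.
    """
    if now is None:
        now = datetime.now()
    return start <= now.time() < end
```

Check `within_active_hours()` before each keyword and simply wait until the window opens again when it returns `False`.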
Avoid Un-Human Queries.
Have you ever searched Google yourself using num=100? Neither have we. While it's tempting to load more results that way, the fact of the matter is that Google is far more likely to block you. Start on page 1 and work your way up to page 10 using natural delays in between. Even though you request more pages, it still counts as just a single keyword requested.
Here's a quote straight from Google's Captcha page: "Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use"
Re-Use All URL Parameters or Click the Next Button
Here's a sample URL that Google returned when we entered something random in the browser search bar and pressed enter (emphasis mine):
That's 6 URL parameters (bolded) that are generated for you. When you visit Google, before visiting the next page, grab these parameters, and re-use them.
Alternatively (and even better), scroll to the bottom of the page (yes, you can do that with a headless browser) and then click the Next button.
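If you go the parameter route, the idea can be sketched with Python's standard `urllib.parse`. The function below keeps every query parameter from the URL you are currently on and only changes the result offset. It assumes the offset parameter is called `start` and that a page holds 10 results; verify both against the URLs you actually get back.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def next_page_url(current_url, page, per_page=10, offset_param="start"):
    """Rebuild current_url for another results page, reusing its parameters.

    `offset_param` and `per_page` are assumptions for illustration.
    """
    parts = urlsplit(current_url)
    # Keep everything Google generated, except a previous offset
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != offset_param
    ]
    if page > 1:
        params.append((offset_param, str((page - 1) * per_page)))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), parts.fragment)
    )
```

Feed it the URL of the page you just loaded, not a URL you assembled yourself, so the generated parameters travel along unchanged.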
Do Not Use Different Proxies For Deeper Pages
When you scrape page 2, 3, 4 etc. use the same proxy as you did for page 1. No person on the face of this planet has ever started with page 4, ever. No-one, they do not exist. If you claim they did I would need proof they're not alien before believing you because it's just not a thing humans do. Ever.
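In code, this mostly means picking the proxy once per keyword instead of once per request. A minimal sketch, where `fetch_page` is a hypothetical function of your own that loads one results page through the given proxy:

```python
def scrape_keyword(keyword, proxy, fetch_page, pages=10):
    """Fetch pages 1..pages for one keyword, all through the same proxy.

    `fetch_page(keyword, page, proxy)` is a placeholder for your own
    page-loading code; it is not part of any library.
    """
    results = []
    for page in range(1, pages + 1):
        # Same proxy on every page, always in order, starting at page 1
        results.append(fetch_page(keyword, page, proxy))
    return results
```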
When Unnatural, Act Natural!
Sometimes you have to force Google to do things, like load the correct TLD (Read more on that here: https://help.proxymillion.com/f-a-q/googles-location-and-the-proxy-location-dont-match). When that's the case, always load the homepage and use the search box to search, like any normal human would!
DO NOT USE A RANDOM FUNCTION TO LOAD YOUR PROXIES
Yes, all caps, it's that important. This will screw you over big time in the long run and if you have a basic knowledge of probability, you will know why. It's a matter of time before you start using the same proxy back to back.
It's easy to forget this - I know because it happened to us as well. So DO NOT use a random number generator to randomly pick a proxy from your list.
Here's what you should do: make sure you save and use an integer index that's increased by 1 every time you load a proxy. So let's say you have an array of 1,000 proxies. When you start, the index is 0, so you load element 0. Then on the next request it's 1, then 2, then 3, etc. until you hit 999 and then reset to 0. This is the only way to be 100% sure you're using the proxies with the most even spread.
Then every 24 hours, shuffle and save the proxy list.
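The index-based rotation and the daily shuffle described above could be sketched as follows. The class name, the JSON state file and the injectable `clock` are our own choices for the example; any storage that survives a restart works just as well.

```python
import json
import random
import time
from pathlib import Path

SHUFFLE_INTERVAL = 24 * 60 * 60  # seconds between shuffles


class ProxyRotator:
    """Hand out proxies strictly in order and persist the position."""

    def __init__(self, proxies, state_file="proxy_state.json", clock=time.time):
        self.state_file = Path(state_file)
        self.clock = clock
        if self.state_file.exists():
            # Resume where we left off; the saved list takes precedence
            state = json.loads(self.state_file.read_text())
            self.proxies = state["proxies"]
            self.index = state["index"]
            self.shuffled_at = state["shuffled_at"]
        else:
            self.proxies = list(proxies)
            self.index = 0
            self.shuffled_at = self.clock()
            self._save()

    def _save(self):
        state = {
            "proxies": self.proxies,
            "index": self.index,
            "shuffled_at": self.shuffled_at,
        }
        self.state_file.write_text(json.dumps(state))

    def next_proxy(self):
        # Shuffle the list once the interval has passed
        if self.clock() - self.shuffled_at >= SHUFFLE_INTERVAL:
            random.shuffle(self.proxies)
            self.shuffled_at = self.clock()
        proxy = self.proxies[self.index]
        # Increase by 1 and wrap back to 0 after the last element
        self.index = (self.index + 1) % len(self.proxies)
        self._save()
        return proxy
```

Because the index is written to disk after every call, restarting the script continues from the next proxy instead of starting again at element 0.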
Alright, that was it! If you have any additional info that might benefit us or others please get in touch and let us know.