Latest Inquiries - Data Extraction Software

Blocked Page

Submitted: 2/3/2016
Hi team,

The website tends to block VWR after we crawl a few pages. After a certain number of pages, the website "detects" the bot and displays an error page. We then need to fill in a CAPTCHA in order to continue crawling.

I tried to implement Semi-Automatic Data Extraction (refer to the manual: http://www.visualwebripper.com/Display.aspx?manual_id=992), but it does not work.

I have attached the project file and the log file. Please note that the project file does not include the implementation of Semi-Automatic Data Extraction.

Thank you.
logfile.log

Replied: 2/3/2016 7:14:24 AM

The first 3 templates show as yellow (not found) for me, so I have no way to diagnose your issue.

As for the bing.com site, in my past experience it does not seem to present a CAPTCHA. According to your log file, the last lines contain several identical HTTP status code errors (405), and the domain you scraped is

http://www.glassdoor.com/

Are you able to attach the problematic project so we can diagnose it?

If glassdoor.com is really blocking your IP address, I suggest that you try proxies in your project (we provide a free private proxy switch for you). If you don't mind the slower speed, you can also try setting a random page load delay in Project > Project options > Advanced tab.
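
As a rough illustration of what a random page load delay does (this is not how Visual Web Ripper implements it; the function names, the `fetch` callable, and the 5 to 60 second range below are assumptions made for the sketch), a crawler can wait a randomly chosen interval before each page load so that requests do not arrive at a fixed rhythm:

```python
import random
import time


def pick_delay(min_seconds, max_seconds, rng=random):
    """Return a random delay, in seconds, within [min_seconds, max_seconds]."""
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    return rng.uniform(min_seconds, max_seconds)


def crawl(urls, fetch, min_seconds=5.0, max_seconds=60.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval before each one.

    fetch is a hypothetical callable supplied by the caller; it stands in
    for whatever actually loads the page.
    """
    results = []
    for url in urls:
        # Wait a different, unpredictable amount of time before every page.
        sleep(pick_delay(min_seconds, max_seconds))
        results.append(fetch(url))
    return results
```

Longer delays slow the whole run down in proportion, which is the speed trade-off mentioned above.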

Replied: 2/19/2016 1:57:41 AM

It looks like that website blocks proxy IP addresses as well. There is no good way to avoid detection by the website; it all depends on what the website does, and it is hard to know how the website detects a bot and then raises a CAPTCHA. You may still configure a template to do decaptcha; please refer to the topic below (it requires a 3rd-party decaptcha service):

http://manual.visualwebripper.com/default.aspx?manual_id=992

The free private proxy switch does not have many IP addresses to rotate through in a short time, which might be why you get blocked so soon. You will need to find more proxies from a 3rd-party provider and then set those proxies in the project.
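
To make the point about pool size concrete, here is a minimal sketch, assuming simple round-robin rotation (the actual rotation scheme of the proxy switch is not documented here, and the function names are hypothetical). It shows how quickly the same address comes around again when the pool is small:

```python
from itertools import cycle


def reuse_interval(pool_size, seconds_between_requests):
    """Seconds before the same proxy is used again under round-robin rotation."""
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    return pool_size * seconds_between_requests


def assign_proxies(urls, proxies):
    """Pair each URL with the next proxy in round-robin order."""
    if not proxies:
        raise ValueError("proxies must not be empty")
    rotation = cycle(proxies)
    return [(url, next(rotation)) for url in urls]
```

Under this assumption, 3 proxies with 10 seconds between requests means each address is reused every 30 seconds, while 50 proxies would stretch that to 500 seconds.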

privateproxyswitch.com is not the actual domain for the private proxy IP addresses.

Replied: 2/18/2016 9:20:17 AM
Just to add on, sometimes I also get this error (as shown in the attached screenshot).


Replied: 2/5/2016 3:20:47 AM
Your IP address must have been blocked by this website; please try using proxies as proposed.
Replied: 2/5/2016 2:42:14 AM
Hi,

I have changed the page load random delay, extending the time to 1 minute. It worked yesterday, and the project ran without any problem. However, today the problem is back.

I ran CCleaner to clear the cache and cookies and restarted my PC, but it still doesn't work.

I have attached a project file almost identical to the previous one, along with a screenshot of the website when it encounters the problem.

Thank you.
logfile_glassdoor.log
screenshot_glassdoor.PNG
Glassdoor_SG.rip

Replied: 2/3/2016 5:03:32 AM
Hi team,

Sorry, I just realized that I forgot to attach the project file.
Glassdoor_PH.rip

Replied: 2/18/2016 8:05:09 AM
Hi team,

I have used Proxy Switch to crawl the website. Sometimes it works, but most of the time it either crawls only 1 page or fails to crawl because the website prompts us to enter a CAPTCHA.

I noticed that the website uses the third-party service Distil Networks (http://www.distilnetworks.com/) to block bots. Is there any way to bypass it?

Thank you.