Software Manual - Data Extraction Software

The topic below is from the Visual Web Ripper manual.

Extracting Data From CAPTCHA-Protected Websites

Visual Web Ripper supports both semi-automatic and full-automatic data extraction from websites that use CAPTCHA protection. Full-automatic data extraction requires an account with a third-party CAPTCHA recognition service, which charges a fee for each CAPTCHA image. Semi-automatic data extraction is free, but requires you to decode CAPTCHA images manually while the data extraction project is running.

Using Proxy Servers

Sometimes the easiest way to deal with CAPTCHA-protected websites is to use a list of proxy servers. This is especially true when CAPTCHA pages appear at random after you have browsed the website for a while. Proxy servers will not help if you must always pass a CAPTCHA page to enter a section of a website.

Semi-Automatic Data Extraction

To configure your data extraction project for semi-automatic CAPTCHA processing, you need to do the following:

  1. Add a content element that selects the CAPTCHA image. Then use the Misc options tab to uncheck the Save content option.
  2. Add a FormField element that selects the CAPTCHA input field. Then use the AdvancedOptions tab to select the image element as a CAPTCHA element.
  3. Add a FormSubmit template that submits the CAPTCHA form. You may need to set the Misc option Optional template if the CAPTCHA form is not always displayed.

When Visual Web Ripper encounters a CAPTCHA element, it will display the CAPTCHA image and request the CAPTCHA code.

Download CAPTCHA demo project

Full-Automatic Data Extraction

Full-automatic CAPTCHA processing requires an account with a third-party CAPTCHA recognition service. The recognition service must provide a .NET API, and you must create a Visual Web Ripper script that uses this API to call the service.

Visual Web Ripper includes the API and a standard script for calling the following CAPTCHA recognition service:

http://www.deathbycaptcha.com

This CAPTCHA recognition service currently charges US$1.39 per 1000 CAPTCHAs. We are not affiliated with this company and therefore don't charge any additional fees for this service.
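
To get a rough idea of what a project will cost, the per-thousand price can be turned into a small calculation. The sketch below is only an illustration: the class and method names are hypothetical, and the rate is the one quoted on this page, which may have changed since.

```csharp
using System;

// Hypothetical helper, not part of Visual Web Ripper.
public static class CaptchaCost
{
    // Price quoted on this page (US$ per 1000 CAPTCHAs); check the
    // service for its current rate before relying on this number.
    public const decimal UsdPerThousand = 1.39m;

    // Estimated fee in US dollars for decoding captchaCount images.
    public static decimal Estimate(int captchaCount)
    {
        if (captchaCount < 0)
            throw new ArgumentOutOfRangeException(nameof(captchaCount));
        return captchaCount * UsdPerThousand / 1000m;
    }
}
```

At the quoted rate, CaptchaCost.Estimate(25000) gives 34.75, so a project that passes 25,000 CAPTCHA pages would cost roughly US$34.75 in recognition fees.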

To configure your data extraction project for full-automatic CAPTCHA processing, you need to do the following:

  1. Add a content element that selects the CAPTCHA image. Then use the Misc options tab to uncheck the Save content option.
  2. Add a FormField element that selects the CAPTCHA input field. Then use the AdvancedOptions tab to select the image element as a CAPTCHA element.
  3. Use the AdvancedOptions tab to add a Decode CAPTCHA script to the FormField element that selects the CAPTCHA input field.
  4. Add a FormSubmit template that submits the CAPTCHA form. You may need to set the Misc option Optional template if the CAPTCHA form is not always displayed.

Decode CAPTCHA Script

A decode CAPTCHA script is used to call a CAPTCHA recognition service. The script receives the CAPTCHA image as an input parameter (as a file path in args.ImagePath) and should return the decoded CAPTCHA value as a string.

You can add a decode CAPTCHA script to a FormField element by clicking the Decode CAPTCHA script option button in Advanced Options.

The script editor opens after you click the Decode CAPTCHA script button.

The default decode CAPTCHA script is designed to work with the www.deathbycaptcha.com service. If you use this service, you only need to add your login name and password.

Visual Web Ripper also has built-in support for bypasscaptcha.com. If you use this CAPTCHA service, you can use the following code.

string captcha = BypassCaptchaService.DecodeCaptcha(args.ImagePath, "key");

A decode CAPTCHA script can be written in C# or VB.NET.

C# and VB.NET Scripts

A decode CAPTCHA script must have one method as shown below.

using System;
using mshtml;
using VisualWebRipper;

public class Script
{
    // See help for a definition of WrDecodeCaptchaArguments.
    public static string DecodeCaptcha(WrDecodeCaptchaArguments args)
    {
        try
        {
            // Send the CAPTCHA image to the DeathByCaptcha service.
            // Replace "login" and "password" with your account details.
            string captcha = DeathByCaptchaService.DecodeCaptcha(
                args.ImagePath, "login", "password");

            return captcha;
        }
        catch (Exception exp)
        {
            // Log the error and return an empty value.
            args.WriteDebug(exp.Message);
            return "";
        }
    }
}

public static string DecodeCaptcha(WrDecodeCaptchaArguments args)

The script method DecodeCaptcha must have this exact name and signature, so change only the method body, not the method signature. The method must return the decoded CAPTCHA value as a string.
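
As an illustration of changing only the method body, the sketch below retries the recognition call a few times before giving up. It is a minimal example under stated assumptions, not the product's own implementation: CaptchaRetry and DecodeWithRetry are hypothetical names, the recognition service is passed in as a delegate so the helper does not depend on any particular API, and whether the script editor accepts an additional class next to Script is an assumption.

```csharp
using System;

// Hypothetical helper, not part of Visual Web Ripper.
public static class CaptchaRetry
{
    // Calls decode(imagePath) up to maxAttempts times and returns the first
    // non-empty answer, trimmed. Returns "" if every attempt fails or throws.
    public static string DecodeWithRetry(
        Func<string, string> decode,
        string imagePath,
        int maxAttempts,
        Action<string> log)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                string captcha = decode(imagePath);
                if (!string.IsNullOrWhiteSpace(captcha))
                    return captcha.Trim();

                log("Attempt " + attempt + ": empty result.");
            }
            catch (Exception exp)
            {
                // Record the failure and move on to the next attempt.
                log("Attempt " + attempt + ": " + exp.Message);
            }
        }
        return "";
    }
}
```

Inside DecodeCaptcha, the body could then be reduced to a single statement such as: return CaptchaRetry.DecodeWithRetry(path => DeathByCaptchaService.DecodeCaptcha(path, "login", "password"), args.ImagePath, 3, msg => args.WriteDebug(msg)); Keep in mind that a paid recognition service is likely to charge for every attempt, so a small attempt limit is advisable.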

WrDecodeCaptchaArguments Properties

Name                    Type                Description
ImagePath               String              The CAPTCHA image path.
Project                 WrProject           The current Visual Web Ripper project.
DestinationDataSource   WrDataSource        Destination data source configuration.
InputDataSource         WrInputDataSource   Input data source configuration.
StartTemplate           WrTemplate          The first template in the project.
Database                WrSharedDatabase    An open database connection.
InputParameters         WrInputParameters   Input parameters for the current project.