Software Manual - Data Extraction Software

The topic below is from the Visual Web Ripper manual.

Your First Data Extraction Project

Data extraction can be easy or difficult, depending on the characteristics of the target website. We suggest you begin with an easy website: one that is fairly static and makes limited use of JavaScript and AJAX callbacks.

Before creating a data extraction project for a particular website, familiarize yourself with that website in a standard version of Internet Explorer. This will help you understand how Visual Web Ripper should navigate the website in order to extract the desired data.

First, you need to identify a good start URL. Sometimes the best start URL is simply the website's home page, but often the desired data is on a sub-page. Navigate to the best start URL in Internet Explorer and then copy the URL for use in Visual Web Ripper. Note that some websites allow navigation without changing the visible URL. In such cases, you may not have a start URL that points directly to your preferred start webpage, and will instead need to add templates to your data extraction project to navigate to that webpage.

Once you have the start URL, you can begin creating your data extraction project by following these common steps.

Step 1 - Loading the Start URL

Open Visual Web Ripper and copy the start URL into the address bar. Visual Web Ripper will load the start webpage.

Step 2 - Adding Templates and Content

You can now start adding templates and content elements to your project. We have listed a few common scenarios below.

Scenario 1 - Navigating to Your Preferred Start Page

If you were unable to obtain a direct start URL for your preferred start page, you may need Visual Web Ripper to follow a link in order to navigate to the webpage.

Click on the link in the web browser to select it, and then add a new Link template. Click the Open button to open the Link template. The web browser will navigate to the new webpage.

Now you can begin adding sub-templates and content elements to the new link template. When Visual Web Ripper runs the project, it will open the link template and thereby navigate to the new webpage before processing any sub-templates and content elements.
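
The processing order described above can be pictured as a walk over a tree of templates. The following Python sketch is only an illustration of that idea, not Visual Web Ripper's implementation or API. The Template class, the process function and the sample names are hypothetical, and the choice to handle content elements before sub-templates is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    """Hypothetical stand-in for a template; not a Visual Web Ripper class."""
    name: str
    content_elements: list = field(default_factory=list)
    sub_templates: list = field(default_factory=list)


def process(template, log=None):
    """Open the template first, then handle what is nested inside it."""
    if log is None:
        log = []
    # Opening the template corresponds to navigating to the new webpage.
    log.append(f"open {template.name}")
    # Assumed order: content elements first, then sub-templates.
    for element in template.content_elements:
        log.append(f"extract {element}")
    for child in template.sub_templates:
        process(child, log)
    return log


detail = Template("detail link", content_elements=["title", "price"])
start = Template("start page link", sub_templates=[detail])
steps = process(start)
print(steps)
```

The point of the sketch is that nothing nested inside a template is processed until that template has been opened.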

Scenario 2 - Submitting a Web Form

You may need to submit a web form, perhaps to submit a search query or a login form.

Click on the first form field in the web browser and then add a new FormField content element. Use the Capture Window to enter the value you want Visual Web Ripper to enter in the form field when it submits the web form. Repeat this process for all the form fields you want Visual Web Ripper to handle.

Now add the FormSubmit template that will submit the web form. Click on the form submit button in the web browser and then add a FormSubmit template. Click the Open button to open the FormSubmit template. The web browser will submit the form and thereby navigate to a new webpage.

Now you can begin adding sub-templates and content elements to the new FormSubmit template. When Visual Web Ripper runs the project, it will open the FormSubmit template and thereby submit the web form before processing any sub-templates and content elements.
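
For readers who want to see what a form submission amounts to outside the browser, the sketch below encodes a set of field values into a request URL, as a browser does for a simple GET form. It is not how Visual Web Ripper submits forms. The field names, values and the example.com address are placeholders invented for the illustration.

```python
from urllib.parse import urlencode

# Hypothetical field names and values, standing in for the values you
# would enter for each FormField content element.
form_fields = {"query": "used cars", "max_price": "5000"}

# Placeholder address for the form's target; not a real search page.
action_url = "https://example.com/search"

# For a GET form, the encoded fields are appended to the target URL.
request_url = f"{action_url}?{urlencode(form_fields)}"
print(request_url)
```

Each form field contributes one name and value pair to the request, which is why every field you want filled in needs its own FormField content element.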

Scenario 3 - Iterating Through a List of Links

Sometimes a webpage displays a list of links and you want to follow each link in the list. To do this, you need to select all the links and then add a Link Template.

To select all the links, select the first link and then right-click on the second link and choose Create List from the context menu. Then add a Link template and click the Open button to open it. The web browser will navigate to the first link in the list. If you want Visual Web Ripper to navigate to a different link when you open the template, simply click on the target link in the web browser before opening the template. When Visual Web Ripper runs the project, it will open and process the template once for each link in the list.
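
The idea of processing a template once for each link in a list can be sketched in Python as follows. This is a conceptual illustration using the standard library HTML parser, not a description of how Visual Web Ripper selects links. The sample markup, the LinkCollector class and the variable names are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every anchor tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical page fragment containing a list of links.
SAMPLE = (
    '<ul>'
    '<li><a href="/item/1">One</a></li>'
    '<li><a href="/item/2">Two</a></li>'
    '<li><a href="/item/3">Three</a></li>'
    '</ul>'
)

collector = LinkCollector()
collector.feed(SAMPLE)

visited = []
for href in collector.links:
    # A real run would load the linked page here; the sketch only records it.
    visited.append(href)
print(visited)
```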

Scenario 4 - Extracting Content

Once the web browser displays the webpage from which you want to extract content, you need to add a content element for each piece of content on the webpage you wish to extract.

To add a content element, click on the appropriate HTML element in the web browser and then click the New button to add the content element.

Scenario 5 - Iterating Through Search Results

If the web browser displays search results and you want to extract content for each row in the search results, you need to create a list selection that selects all the rows in the search results.

To select all the rows, select the first row in the web browser and then right-click anywhere on the second row and choose Create List from the context menu. Once you have created the list selection, add a new PageArea template and click the Open button to open the template.

You are now inside a PageArea template and you can select only those content elements in the web browser that are inside the page area. To add a content element, click on an HTML element in one of the rows in the web browser and then click the New button to add the content element. When you click on an HTML element in one row, the selection is repeated automatically for each row in the search results.
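
The way a selection made in one row is repeated for every row can be illustrated with a short Python sketch. It applies the same relative lookup to each row of a small, well-formed sample table. This is an assumption-laden illustration, not Visual Web Ripper's selection mechanism. The markup, class names and the extract_rows function are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed search result markup.
SAMPLE = """
<table>
  <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
  <tr><td class="name">Gadget</td><td class="price">19.50</td></tr>
</table>
""".strip()


def extract_rows(markup):
    """Apply the same relative selection to every row in the table."""
    root = ET.fromstring(markup)
    rows = []
    for tr in root.findall("tr"):
        # The lookups are relative to the current row, so they repeat per row.
        rows.append({
            "name": tr.find("td[@class='name']").text,
            "price": tr.find("td[@class='price']").text,
        })
    return rows


records = extract_rows(SAMPLE)
print(records)
```

Because each lookup is relative to the row rather than to the whole page, defining it once is enough to cover every row.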

Scenario 6 - Extracting Content From the Detail Pages

Consider a situation in which the web browser shows search results and you have added a PageArea template to iterate through the search result rows. You have opened the PageArea template and may have added some content elements. Each search result row has a link that opens a detail page with more information, and you want to extract more content from this detail page.

Select the link in one of the search result rows and add a Link template. Visual Web Ripper will automatically select the links in all the search result rows, because you are inside a PageArea list template.

Open the link template and begin adding content elements for all the content you want to extract from the detail page.

Scenario 7 - Processing All Pages in a Search Result

As in the previous scenario, consider a situation in which the web browser shows search results and you have created a PageArea template to iterate through the search result rows. The search results comprise many rows that are displayed on multiple webpages. Page navigation links are used to navigate through the search result pages.

Most webpages using page navigation have a Next Page link that opens the next page in the navigation set. If this is the case, select the Next Page link in the web browser and add a PageNavigation template. The PageNavigation template must be added to the project in the same location as the PageArea template that iterates through the search results. If you open the PageNavigation template, Visual Web Ripper will not actually open the template, but the web browser will navigate to the next page in the search results.

  • See Page Navigation for more information about page navigation, including other types of page navigation that can be used when there is no Next Page link.
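
Following a Next Page link until no further page exists can be sketched as a simple loop. The Python example below uses an in-memory dictionary in place of a real website, so it illustrates only the control flow and not Visual Web Ripper's page navigation. The PAGES data and the extract_all function are hypothetical.

```python
# Hypothetical in-memory "site": each page holds its rows and the key of
# the next page, or None when there is no Next Page link.
PAGES = {
    "page1": {"rows": ["a", "b"], "next": "page2"},
    "page2": {"rows": ["c", "d"], "next": "page3"},
    "page3": {"rows": ["e"], "next": None},
}


def extract_all(start, pages):
    """Collect rows from every page reachable through the next links."""
    results = []
    seen = set()
    current = start
    # Stop when there is no next page, or when a page repeats (a loop).
    while current is not None and current not in seen:
        seen.add(current)
        page = pages[current]
        results.extend(page["rows"])
        current = page["next"]
    return results


all_rows = extract_all("page1", PAGES)
print(all_rows)
```

The row extraction and the page navigation sit at the same level in the loop, which mirrors the requirement that the PageNavigation template be placed in the same location as the PageArea template.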

Step 3 - Setting the Destination Data Source

After you have added templates and content elements to your project, it is time to decide where to save the extracted data. Visual Web Ripper will export extracted data to Excel format if you do not select an export target. Click the toolbar button Data Export to change the export target. If you are exporting extracted data to a file format, such as Excel, the default output folder is My Documents\Visual Web Ripper\Output\[PROJECT_NAME].
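
If you need to locate the exported files from a script, the default output folder pattern above can be assembled as in the following Python sketch. The location of the My Documents folder varies by user and Windows version, so the documents path shown here is an assumed example, and default_output_folder is a hypothetical helper, not part of Visual Web Ripper.

```python
from pathlib import PureWindowsPath


def default_output_folder(documents_dir, project_name):
    """Build <documents>\\Visual Web Ripper\\Output\\<project name>."""
    return PureWindowsPath(documents_dir) / "Visual Web Ripper" / "Output" / project_name


# Assumed example values; substitute your own documents folder and project name.
folder = default_output_folder(r"C:\Users\alice\Documents", "MyProject")
print(folder)
```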

Step 4 - Running the Data Extraction Project

Now it is time to run your data extraction project. Click the toolbar button Run Project to bring up the screen where you choose how the project runs.

A data extraction project can run using a WebBrowser, InternetExplorer or WebCrawler agent. Choose the WebBrowser agent until you become more familiar with the software and you are ready to try to optimize your data extraction projects.

The first time you run a project, always check the View Browser and Debugging options. This allows you to observe how Visual Web Ripper is extracting data, which makes it much easier to correct any problems in the data extraction project.