Software Manual - Data Extraction Software

The topic below is from the Visual Web Ripper manual.

Data Extraction Example

In this example, we will extract company data from the Australian Yellow Pages website.

We will search for all Hotels and Apple Stores in the state NSW. To do this, we need to configure a data extraction project that can submit the search form and then extract data from the search results.

Follow these steps to create the data extraction project.

Step 1 - Enter the Start URL

The search form is on the website's homepage, so the best start URL is simply www.yellowpages.com.au. We enter the start URL in the Visual Web Ripper address bar and load the website.

Step 2 - Configure the Search Form

To configure a project to submit a web form, we need to add FormField content elements for each form field and a FormSubmit template for the form submit button.

First we add the FormSubmit template for the Find button.

  1. Click on the Template tab in the Captured Elements window. This ensures we are working on templates and not content elements.
  2. Click on the Find button in the web browser.
  3. Click the New button or right-click and select New Template from the context menu.

Next we add the FormField element for the What form field.

  1. Click on the Content tab in the Captured Elements window. This ensures we are working on content elements and not templates.
  2. Click on the What input field in the web browser.
  3. Click the New button or right-click and select New Content from the context menu.
  4. Enter the input values in the Capture Window. We want to submit the web form twice (once for the search term Hotels and once for the term Apple Stores), so we enter both search terms in the Capture Window.
  5. Visual Web Ripper automatically saves FormField input values to the output data. If you do not want to save the input values, you can reset the Save Content option.

Next we add the FormField element for the Where form field.

  1. Click on the Where input field in the web browser.
  2. Click the New button or right-click and select New Content from the context menu.
  3. Enter the input values in the Capture Window. We want to search only in the state NSW, so we enter only the single value NSW in the Capture Window.

Now we have finished configuring the web form, so we can open the FormSubmit template by clicking the Open button. This submits the web form in the web browser and opens the search results page.
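
Conceptually, the configured form submits once per What value, with the Where value held constant. The Python sketch below shows the equivalent pair of request URLs; the query parameter names (clue and locationClue) are hypothetical placeholders, not the site's real form field names:

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names; the site's real form
# fields may differ.
BASE_URL = "https://www.yellowpages.com.au/search/listings"

def build_search_urls(what_terms, where):
    """Build one search URL per 'What' term, mirroring how two
    FormField input values cause the form to be submitted twice."""
    urls = []
    for what in what_terms:
        query = urlencode({"clue": what, "locationClue": where})
        urls.append(f"{BASE_URL}?{query}")
    return urls

urls = build_search_urls(["Hotels", "Apple Stores"], "NSW")
for u in urls:
    print(u)
```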

Step 3 - Iterating Through the Search Results

Now we are on the search results page and we want to extract data for each company listed in the results. When you want to extract data from a list of web elements, you usually need to use a PageArea template to iterate through the list.

We create a PageArea list template by following these steps:

  1. Select the entire first row in the list.
  2. Right-click anywhere in the second row and select Create List from the context menu.
  3. Click the New button, or right-click and select New Template from the context menu.

Now we have finished configuring the PageArea template, so we can open the template by clicking the Open button. A PageArea template does not navigate to a new webpage, but limits all selections to a specific area of a webpage. When you open a PageArea template, the page area is colored light green in the web browser.
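
A PageArea template can be pictured as a two-level selection: an XPath that matches each row, and content selections evaluated relative to each matched row. A minimal sketch using Python's xml.etree.ElementTree on made-up sample markup (the structure and class names are illustrative, not the site's real markup):

```python
import xml.etree.ElementTree as ET

# Simplified, made-up search results markup.
html = """
<div>
  <UL id="localListings">
    <LI class="listing"><A href="/hotel-a">Hotel A</A></LI>
    <LI class="listing"><A href="/hotel-b">Hotel B</A></LI>
    <LI class="listing"><A href="/hotel-c">Hotel C</A></LI>
  </UL>
</div>
"""
root = ET.fromstring(html)

# The PageArea selection: one match per search result row.
rows = root.findall(".//UL[@id='localListings']/LI")

# Content selections run relative to each row, which is why marking
# the name in one row captures it in every row.
names = [row.find("A").text for row in rows]
print(names)  # ['Hotel A', 'Hotel B', 'Hotel C']
```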

Step 4 - Extracting Content in the PageArea Template

We will extract all the company names in the search results. We are inside a PageArea list template that selects all search result rows, so all the selections made in one row will be applied automatically to all the other rows in the search results. Follow these steps to configure the project to extract the company names:

  1. Click on the company name in the first row. The company name in all the other rows will be selected automatically.
  2. Click the New button or right-click and select New Content from the context menu.
  3. The title is a link, so Visual Web Ripper automatically sets the content type to Link, but we want to extract the text, so we change the content type to Element.
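
The difference between the two content types can be illustrated on the same kind of made-up markup: a Link-type selection captures the link target used for navigation, while an Element-type selection captures the element's visible text:

```python
import xml.etree.ElementTree as ET

# Made-up search result row containing a linked company name.
row = ET.fromstring('<LI><A href="/hotel-a">Hotel A</A></LI>')
anchor = row.find("A")

# Content type "Link": the interesting part is the href target.
link_value = anchor.get("href")

# Content type "Element": the interesting part is the visible text.
element_value = anchor.text

print(link_value, element_value)  # /hotel-a Hotel A
```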

Step 5 - Navigating to the Detail Pages

We are still within the PageArea template and we have configured the project to extract the company title content. We also want to extract some content that is available only once we click the company links and view the company detail pages.

Follow these steps to add a link template that navigates to the detail pages:

  1. Click on the Template tab in the Captured Elements window. This ensures we are working on templates and not content elements.
  2. Click on the company title in the first search result row. We are inside a PageArea list template that selects all the search result rows, so all selections made in one row are applied automatically to all rows in the search results.
  3. Click the New button or right-click and select New Template from the context menu.
  4. Make sure the template type is set to Link and not PageArea.
  5. Sometimes the Start New Web Browser option is useful when you want to follow a list of links. Normally, Visual Web Ripper follows a link and processes the next page before moving back to the previous page in order to follow the next link in the list. If you activate the Start New Web Browser option, Visual Web Ripper will open the link in a new web browser, so it does not have to move back to the previous webpage in order to follow the next link. This saves time, making your data extraction project run more quickly.

We have finished configuring the link template for the detail pages, so we can open the link template by clicking the Open button. The web browser navigates to the first company link in the list. If you want to open a different company link, you can simply select the link in the web browser before clicking the Open button.

Step 6 - Extracting Content From the Detail Pages

Now we are on a company detail page and we want to configure the project to extract some content from this webpage. The detail page has two information tabs, About Us and Product and Services. The information behind each tab is always on the webpage, but the webpage hides and shows information depending on the tab you select. Clicking on a tab does not trigger a new page load or an AJAX action, so we don't need a Link template to handle the tabs. Visual Web Ripper does not care whether content is visible or hidden; it can extract hidden content just as well as visible content. We can simply click on the tab displaying the information we are interested in and start marking the content we want to extract.

The Product and Services tab shows a table that contains information about the hotel. We want to extract the hotel rating. The hotels in the search results have different information displayed in the table; for one hotel the first table row may contain the hotel rating, but for another hotel the first row may contain completely different information, such as amenities. Simply extracting the content of the first row will not always provide the hotel rating. We want to always obtain the hotel rating, or nothing if a rating is not listed for a hotel.

Follow these steps to use a Text Filter when extracting the hotel rating content:

  1. Make sure you are on a detail page that has a hotel rating. If the hotel rating is not displayed for the first hotel in the search results, go back to the previous template, select a different hotel, and then open the link template again.
  2. Select the hotel rating in the web browser.
  3. Right-click on the Rating text, select Filter, and then select Must Have Text Rating from the context menu.
  4. Click the New button or right-click and select New Content from the context menu.
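
The effect of the Must Have Text filter can be sketched as a search over the table rows for the one whose label contains "Rating", rather than relying on row position. An illustrative Python sketch with made-up detail-page tables:

```python
import xml.etree.ElementTree as ET

def extract_rating(table_html):
    """Return the value from the row whose label contains 'Rating',
    or None if no such row exists -- mirroring the text filter."""
    table = ET.fromstring(table_html)
    for row in table.findall(".//TR"):
        cells = row.findall("TD")
        if len(cells) == 2 and "Rating" in (cells[0].text or ""):
            return cells[1].text
    return None

# Made-up tables: the rating row sits at a different position in
# each, and one table has no rating at all.
hotel_a = "<TABLE><TR><TD>Rating</TD><TD>4 stars</TD></TR></TABLE>"
hotel_b = ("<TABLE><TR><TD>Amenities</TD><TD>Pool</TD></TR>"
           "<TR><TD>Rating</TD><TD>3 stars</TD></TR></TABLE>")
hotel_c = "<TABLE><TR><TD>Amenities</TD><TD>Gym</TD></TR></TABLE>"

print(extract_rating(hotel_a))  # 4 stars
print(extract_rating(hotel_b))  # 3 stars
print(extract_rating(hotel_c))  # None
```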

Step 7 - Following All Links in the Page Navigation

The search result page shows only 40 companies per page, so the entire search results are spread over multiple pages. We want to extract data for all the companies on all the search result pages. The website uses a page navigation bar to allow users to navigate between search result pages.

We will add a PageNavigation template that will use the Next Page link in the webpage navigation bar to find all the pages in the search results.

  1. Move back to the template where the PageArea template is located. You can use the Back button in the Visual Web Ripper navigation pane to move to the previous template.
  2. Click on the Template tab in the Captured Elements window. This ensures we are working on templates and not content.
  3. Click on the Next Page link in the web browser. To make sure you select the link and not the arrow image, right-click on the arrow image and choose Select Nearest Link from the context menu.
  4. Click the New button or right-click and select New Template from the context menu.
  5. Make sure the template type is set to PageNavigation.
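
The PageNavigation template behaves like a loop that keeps following the Next Page link until there is none. A minimal sketch with an in-memory stand-in for the site (the page contents and links below are made up):

```python
# Made-up stand-in for the site: each page maps to its rows and the
# page its "Next Page" link points to (None on the last page).
PAGES = {
    "/search?page=1": (["Hotel A", "Hotel B"], "/search?page=2"),
    "/search?page=2": (["Hotel C", "Hotel D"], "/search?page=3"),
    "/search?page=3": (["Hotel E"], None),
}

def crawl_all_pages(start_url):
    """Follow Next Page links until none is found, collecting the
    rows from every page -- the PageNavigation pattern."""
    results = []
    url = start_url
    while url is not None:
        rows, next_url = PAGES[url]
        results.extend(rows)
        url = next_url
    return results

all_rows = crawl_all_pages("/search?page=1")
print(all_rows)
```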

Step 8 - Exporting Extracted Data

Now we have finished adding templates and content to the data extraction project, and we want to configure how extracted data should be saved. By default, Visual Web Ripper exports extracted data to XML Excel format, and we do not want to change this output format. XML Excel format works like a standard Excel file in Excel 2003 and later, but sometimes other applications, such as Dreamweaver, may hijack the file association. In these cases, you need to open the output file from within Excel or simply drag the file into Excel. Alternatively, you can right-click on the file, select Open With..., and then choose Excel.

Step 9 - Running the Data Extraction Project

Now we are ready to run the data extraction project. Click the Run Project toolbar button to run a project.

Click the Stop button anytime to stop the data extraction. You can view the data that has been extracted by clicking the View Data button.

The project has been configured to save output data in XML Excel format. By default, the Excel file is located in My Documents\Visual Web Ripper\Output\[PROJECT_NAME].

Advanced Considerations

When you run this data extraction project, you will notice that it works fine for the first couple of search result pages, but on later pages it suddenly starts failing. The debug window will display messages indicating that the PageArea template can no longer be found.

Web scraping becomes difficult if there are many different page layouts on a single website. The webpage that displays the first page of search results appears similar to the webpage displaying the third page of search results, but there are subtle differences that have a big impact on the data extraction project.

To identify the problem, we need to navigate to the third page of the search results. We do this by opening the PageNavigation template three times. You will notice that the status icon for the PageArea template on the third page of search results is yellow instead of green. This indicates the PageArea selection is no longer valid.

If you edit the PageArea template, nothing is selected in the web browser, because the selection is not valid.

To understand what is happening, re-create the PageArea selection in the web browser. Then compare the XPath of the new PageArea with the XPath of the old PageArea.

  • //UL[@id='localListings']/LI[@class='gold mappableListing listingContainer omnitureListing']
  • //UL[@id='localListings']/LI[@class='gold_entry mappableListing listingContainer omnitureListing']

The class attributes of the search result rows are not the same, so the first XPath will not work on page 3, and the second XPath will not work on page 1.

There are two ways to fix this problem. You can click the Set XPath Manually button to manually edit the selection XPath and remove the class attribute, so you end up with this XPath:

  • //UL[@id='localListings']/LI
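
You can verify that dropping the class predicate makes the selection match both row variants. A quick check with Python's ElementTree, using a made-up fragment that mixes the two class attributes seen on different result pages:

```python
import xml.etree.ElementTree as ET

# One row uses the page-1 class value, the other the page-3 value.
html = """
<div>
  <UL id="localListings">
    <LI class="gold mappableListing listingContainer omnitureListing">A</LI>
    <LI class="gold_entry mappableListing listingContainer omnitureListing">B</LI>
  </UL>
</div>
"""
root = ET.fromstring(html)

# The class-specific XPath matches only one of the two rows...
narrow = root.findall(
    ".//LI[@class='gold mappableListing listingContainer omnitureListing']")

# ...while the generalized XPath matches both.
broad = root.findall(".//UL[@id='localListings']/LI")

print(len(narrow), len(broad))  # 1 2
```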

Alternatively, you can create the PageArea list template using the List options instead of the context menu. Follow these steps to create the PageArea template using the List options:

  1. Select the first row of search results in the web browser.
  2. Click the List options tab.
  3. Check the Create list option.
  4. Click the OK button to update the selection in the web browser.

If you click on the XPath options tab, you will notice that Visual Web Ripper has created the following XPath selection automatically:

  • //UL[@id='localListings']/LI

Now you can run the data extraction project again. It should work on all the search result pages.