Software Manual - Data Extraction Software

The topic below is from the Visual Web Ripper manual.

Following Pagination Links

A website displaying search results often uses pagination to allow a user to move forward in the search results.

The following image shows the pagination for Google Search.

 
Visual Web Ripper can follow all pagination links by using PageNavigation templates. A PageNavigation template never contains any sub-content or sub-templates, but simply repeats its parent template for each pagination link.

 
The project configuration shown above continues processing the PageArea list template for each pagination link.

Visual Web Ripper supports four types of pagination:

  • Next Page navigation
  • Single Link navigation
  • List of Links navigation
  • Dynamic List of Links navigation

Next Page navigation is the most common type and should be used whenever a website has a Next Page link that moves you to the next page of results. The PageNavigation template should select the Next Page link on the webpage. A Next Page PageNavigation template automatically applies a filter to the selection, so the selection XPath often ends up looking something like this: //A[.='Next >>']. You do not need to know XPath syntax to use this feature, but such an XPath matches the link by its text rather than its position, which makes the selection robust and tolerant of future changes to the webpage.
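To illustrate what a text-based selection such as //A[.='Next >>'] matches, here is a minimal sketch using Python's standard xml.etree.ElementTree module. It is not how Visual Web Ripper evaluates selections internally. The markup, the URLs, and the find_next_link helper are assumptions made up for this example, and the tag is written in lowercase because ElementTree is case-sensitive.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed pagination markup used only for illustration.
PAGE = """
<div>
  <a href="/results?page=1">1</a>
  <a href="/results?page=2">2</a>
  <a href="/results?page=3">3</a>
  <a href="/results?page=2">Next &gt;&gt;</a>
</div>
"""


def find_next_link(markup):
    """Return the href of the link whose text is exactly 'Next >>', or None."""
    root = ET.fromstring(markup)
    # The predicate [.='Next >>'] compares the element's full text content,
    # so the match does not depend on where the link sits on the page.
    match = root.find(".//a[.='Next >>']")
    return None if match is None else match.get("href")


next_href = find_next_link(PAGE)
print(next_href)
```

Because the link is found by its text, reordering or adding other links in the pager would not change the result in this sketch.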

Single Link navigation is similar to Next Page navigation, but it does not apply the selection filter. Although rarely used, a Single Link PageNavigation template can be useful in scenarios where the selection filter applied automatically to a Next Page selection is inappropriate.

List of Links navigation is used where there is no Next Page link, but only a list of page links. The List of Links PageNavigation template must select each link in the navigation and should therefore be a list template.

Sometimes a website shows the first ten navigation pages and then has a Next link that goes to the next ten navigation pages. In this case, you must use two PageNavigation templates: the first is a List of Links PageNavigation template that selects the ten page links currently shown, and the second is a Next Page or Single Link PageNavigation template that selects the Next link.
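The traversal order produced by combining the two templates can be sketched as a small simulation. This is only a model of the navigation logic under stated assumptions, not Visual Web Ripper's implementation: the site is assumed to show page links in fixed blocks of ten, and the function names are hypothetical.

```python
def page_links(current, total, block=10):
    """Page numbers shown in the pager for the block that contains `current`."""
    start = ((current - 1) // block) * block + 1
    end = min(start + block - 1, total)
    return list(range(start, end + 1))


def next_block_start(current, total, block=10):
    """First page of the following block, or None when no Next link exists."""
    start = ((current - 1) // block) * block + 1
    following = start + block
    return following if following <= total else None


def crawl(total, block=10):
    """Visit every page: List of Links within a block, Next between blocks."""
    visited = []
    current = 1
    while current is not None:
        # Role of the List of Links template: each page link in the block.
        visited.extend(page_links(current, total, block))
        # Role of the Next Page or Single Link template: jump to the next block.
        current = next_block_start(current, total, block)
    return visited


print(crawl(25))
```

With 25 result pages the simulation visits pages 1 to 10, follows Next, visits 11 to 20, follows Next again, and finishes with 21 to 25.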

Dynamic List of Links navigation is used where there is no Next Page link, but only a list of page links, and the page numbers in the list change dynamically as you move forward in the navigation. Google Search offers a good example of dynamic pagination. Notice that Google Search also has a Next Page link. In that case, you should use Next Page navigation, but Dynamic List of Links navigation would work as well.
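One way to picture dynamic pagination is a pager that shows only a sliding window of page numbers around the current page, so new links appear as you advance. The sketch below is an assumption about how such a list can be followed without revisiting pages (by remembering which page numbers have already been seen). It does not describe Visual Web Ripper's internal algorithm, and the window size and function names are hypothetical.

```python
from collections import deque


def visible_pages(current, total, window=2):
    """Page numbers shown in the pager while `current` is displayed."""
    low = max(1, current - window)
    high = min(total, current + window)
    return list(range(low, high + 1))


def crawl_dynamic(total, window=2):
    """Follow a changing list of page links, visiting each page exactly once."""
    visited = []
    seen = {1}
    queue = deque([1])
    while queue:
        current = queue.popleft()
        visited.append(current)
        # New page numbers only become visible after moving forward.
        for page in visible_pages(current, total, window):
            if page not in seen:
                seen.add(page)
                queue.append(page)
    return visited


print(crawl_dynamic(7))
```

Tracking the pages already seen is what keeps the simulated crawl from looping back through links that remain visible in the window.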

Using the Start New Web Browser Option

The Start new web browser option can often be used to speed up data extraction in web browser mode, and the option is sometimes required when dealing with pagination on some websites. For example, you may have a search result with a list of detail links and standard page navigation, but every time you click on a detail link and then move back to the search result, the website automatically moves to page 1 instead of staying on the current page. This means Visual Web Ripper will go into an infinite loop and keep processing page 1. To avoid this problem, you need to make sure Visual Web Ripper never leaves the search result page and instead loads the detail pages in a new web browser. You do this by setting the Start new web browser option on the link template that opens the detail pages.
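The effect of this option can be illustrated with a toy simulation. The SearchSite class, its methods, and the crawl function are hypothetical and greatly simplified. They model only the website quirk described above, not Visual Web Ripper itself, and a step limit stands in for the infinite loop.

```python
class SearchSite:
    """Hypothetical site that resets to page 1 after a detail page is visited."""

    def __init__(self, total_pages):
        self.total_pages = total_pages
        self.page = 1

    def click_next(self):
        """Advance one page; return False when there is no Next link."""
        if self.page >= self.total_pages:
            return False
        self.page += 1
        return True

    def return_from_detail(self):
        """Going back from a detail page lands on page 1 (the website quirk)."""
        self.page = 1


def crawl(total_pages, new_browser, max_steps=12):
    """Return the result pages processed, in order, within the step limit."""
    site = SearchSite(total_pages)
    processed = []
    for _ in range(max_steps):
        processed.append(site.page)
        if not new_browser:
            # Detail page opened in the same browser: the result list resets.
            site.return_from_detail()
        # With a new browser the result page is never left, so it stays put.
        if not site.click_next():
            break
    return processed


print(crawl(5, new_browser=True))
print(crawl(5, new_browser=False))
```

In this model the crawl that opens detail pages in a separate browser walks through all five result pages and stops, while the same-browser crawl is repeatedly thrown back to the start of the result list and only ends because of the step limit.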

The WebCrawler collector always uses a new WebCrawler instance to extract data from a new webpage, so the Start new web browser option has no effect in WebCrawler mode. See the topic Data Collectors for more information.