Software Manual - Data Extraction Software

The topic below is from the Visual Web Ripper manual.

Understanding the Concept

Before you can begin using Visual Web Ripper, you need to understand the underlying template concept. Most web-scraping tools use a macro-style approach and follow a sequential list of commands. Visual Web Ripper works differently: a project is built from templates rather than from a command sequence. Viewing a data extraction project as a sequential list of commands leads to a poorly designed project.

In Visual Web Ripper, templates define how data is extracted from one type of webpage. For example, if you were extracting data from a product catalogue, the product detail pages would be defined by one template. A webpage listing all products in a category would be defined by another template. Each template can have an action that describes how the web browser should navigate to the webpage defined by the template. For example, the template defining the product detail pages would have an action telling the web browser to click on a product details link to navigate to a product details page. If the product details link is on the webpage listing all the products in a category, then the template defining the product details page must be a child template of the template defining the product list page.

Projects, Templates and Content

Visual Web Ripper projects define how to extract content from an entire website, not just from a single webpage. Projects comprise templates and content elements, where templates define how to navigate through a website and content elements define what information should be extracted from a webpage.

The hierarchy of templates and content is similar to folders and files in a file system. Templates can contain sub-templates and content, but content cannot contain any sub-items.
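The folder-and-file analogy can be sketched as a small tree structure. This is only a conceptual illustration in Python, not Visual Web Ripper's own project format or API; the class names `Template` and `Content`, the `outline` helper, and the example node names are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Content:
    """Leaf node: one piece of information to extract (like a file)."""
    name: str


@dataclass
class Template:
    """Container node: may hold sub-templates and content (like a folder)."""
    name: str
    children: List[Union["Template", Content]] = field(default_factory=list)


def outline(node, depth=0):
    """Return the hierarchy as indented lines, one line per node."""
    lines = ["  " * depth + node.name]
    # Only templates can have children; content is always a leaf.
    if isinstance(node, Template):
        for child in node.children:
            lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical project: a product list page with a child template
# for the product details pages.
project = Template("product_list", [
    Content("category_name"),
    Template("product_details", [Content("title"), Content("price")]),
])

print("\n".join(outline(project)))
```

Because `Content` has no `children` field, the rule that content cannot contain sub-items is enforced by the structure itself.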

Templates and content normally select one or more HTML elements on a webpage. When a template is opened, it usually performs an action on the selected HTML element, such as clicking on a link or a form submit button. Opening a template therefore usually opens a new webpage, and the content elements in that template define how information is extracted from the new webpage.

A special type of template, a PageArea template, does not open a new webpage, but rather defines a sub-area of the current webpage. All content elements in such a template select information from within the defined sub-area.
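The scoping effect of a PageArea template can be pictured as follows. This is a simplified sketch, not how Visual Web Ripper represents pages internally: the `PAGE` dictionary, its area names, and the `extract_in_area` function are hypothetical and stand in for a webpage that has already been divided into named sub-areas.

```python
# Hypothetical page, already split into named areas (illustration only).
PAGE = {
    "sidebar": {"title": "Related products"},
    "main": {"title": "Kettle", "price": "19.99"},
}


def extract_in_area(page, area_name, content_names):
    """Select content only from within one sub-area of the current page."""
    area = page[area_name]  # no navigation: same page, narrower scope
    return {name: area[name] for name in content_names}


print(extract_in_area(PAGE, "main", ["title", "price"]))
```

Both areas contain a `title`, but scoping the selection to `main` means the sidebar's title is never considered.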

A template is often used in combination with list options to select a list of links. If a template selects a list of elements, we call it a list template, although the list option is only a property of a template. If a template selects a list of links, Visual Web Ripper iterates through the links and processes the template once for each link. When you design a template that selects a list of links, the designer navigates to the first link in the list. If you prefer to design the template using a different URL, select a specific link before opening the template; the designer then uses the selected link as the design link.
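The way a list template is processed once per link can be sketched as a simple loop. The example below is a conceptual model only, under stated assumptions: `SITE` is a hypothetical in-memory stand-in for a website, and `run_list_template` is an invented name, not a Visual Web Ripper function.

```python
# Hypothetical in-memory "website": URL -> page data (illustration only).
SITE = {
    "/category": {"links": ["/product/1", "/product/2"]},
    "/product/1": {"title": "Kettle", "price": "19.99"},
    "/product/2": {"title": "Toaster", "price": "24.50"},
}


def run_list_template(list_page_url, content_names):
    """Process the template once for each link selected on the list page."""
    rows = []
    for link in SITE[list_page_url]["links"]:  # the template's list selection
        page = SITE[link]                      # the template's action: follow the link
        # The template's content elements extract data from the opened page.
        rows.append({name: page[name] for name in content_names})
    return rows


for row in run_list_template("/category", ["title", "price"]):
    print(row)
```

Note that the loop is a consequence of the template's list property rather than a command written by the project designer, which is the distinction the manual draws against macro-style tools.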