Software Manual - Data Extraction Software

The topic below is from the Visual Web Ripper manual.

Incremental Web Scraping and Avoiding Duplicate Data

When extracting data from websites such as forums, it is often desirable to extract only new data that has been posted since the last time data was extracted. This can be achieved by cancelling data extraction when duplicate data is detected.

Duplicate checks can be performed against data in the internal data store. Normally, duplicate checks are performed against data extracted in the current project run, but can also be performed against extracted data from previous project runs. Duplicate checks can only be performed against data from previous project runs if the project has been configured to add extracted data to existing data. Otherwise the internal data would not contain extracted data from previous project runs.
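Conceptually, the duplicate check compares each newly extracted row against keys already in the data store and stops the run when a match is found. The following is a generic Python sketch of that idea, not Visual Web Ripper's actual implementation; the row fields and key choice are assumptions for illustration:

```python
def extract_incrementally(new_rows, existing_keys):
    """Collect rows until a duplicate key is seen, then stop.

    new_rows      -- rows from the current scrape, newest first
    existing_keys -- set of (title, last_post_date) keys from earlier runs
    """
    collected = []
    for row in new_rows:
        key = (row["title"], row["last_post_date"])
        if key in existing_keys:
            break  # duplicate found: cancel the rest of the run
        collected.append(row)
        existing_keys.add(key)
    return collected
```

Because the rows are processed newest first, hitting a duplicate means everything after it was already extracted in an earlier run, so stopping is safe.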

Extracting Data from a Forum

Incremental web scraping is best illustrated with an example. The following example explains how to extract data from the forum at www.gaiaonline.com/forum/gaming-discussion/f.4/

We will extract data from all topics on the first 100 pages in the forum and update the data on an hourly basis. Extracting data from 100 pages of topics will take quite a while. If we extract data from all topics each time we update the data, we will extract a lot of duplicate data. We will use the duplicate feature to ensure that we extract only new or updated topics.

First we need to design the project. We create a PageArea template to iterate through all the topics on one page and a PageNavigation template to iterate through all the pages. The PageArea template contains the topic type, title and last post date. The PageArea template also contains a link template that links through to the topic detail page and extracts some data from there.

The topic title is displayed in two different locations, depending on the topic type. We use an alternative content element for the title to make sure both locations are covered.

Now we need to decide how to detect duplicate data. The topic title and last post date can be used to detect duplicate data, so we set the Duplicate Check content option on the "title" and "last post date" content elements.

Next, we need to decide which action to take when duplicate data is detected. The default action is to cancel the current template, which effectively removes the duplicate data. We need to cancel the entire project when duplicate data is detected, so we change the Duplicate cancel template option, found on the More Options tab. We set this option to the start template, so the project cancels the start template and thereby the entire project. The option must be changed on the template that contains the duplicate content, which in this case is the PageArea template.

The next problem is the sticky topics at the top of the forum. These topics remain at the top regardless of whether they have been updated. Because we have told Visual Web Ripper to cancel data extraction when the first duplicate row is detected, Visual Web Ripper will cancel data extraction immediately without checking to see whether new non-sticky topics have been posted.

If we know approximately how many sticky posts are at the top of the forum, we can use the Min. duplicate checks option. This option specifies the minimum number of rows Visual Web Ripper will process before cancelling the template. Duplicate data will still be removed, but Visual Web Ripper will continue to iterate through the data until it has processed this minimum number of data rows.
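The effect of such a minimum-checks threshold can be sketched as follows. This is again a generic Python illustration, not the product's code; min_checks stands in for the Min. duplicate checks option:

```python
def extract_with_min_checks(new_rows, existing_keys, min_checks):
    """Skip duplicates but keep iterating until min_checks rows have been
    processed; only after that does a duplicate cancel the run."""
    collected = []
    processed = 0
    for row in new_rows:
        processed += 1
        key = (row["title"], row["last_post_date"])
        if key in existing_keys:
            if processed > min_checks:
                break      # past the sticky zone: cancel on duplicate
            continue       # within the sticky zone: just drop the row
        collected.append(row)
        existing_keys.add(key)
    return collected
```

With min_checks set to the expected number of sticky topics, duplicate sticky rows at the top of the page are discarded without ending the run, while a duplicate further down still cancels it.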

If we do not know how many sticky topics the forum will have, or if we want our approach to be more exact and reliable than guessing the maximum number of sticky topics, we can use a script to decide whether Visual Web Ripper should cancel the template. In this case, we could create a script that checks whether the topic type contains the text "Announcement:" or "Sticky:". The duplicate script is executed only when duplicate data is detected, and the template is cancelled when the script returns true.

using System;
using mshtml;
using VisualWebRipper;

public class Script
{
    // See the help file for a definition of WrDuplicateActionArguments.
    public static bool IsCancelTemplateOnDuplicate(WrDuplicateActionArguments args)
    {
        try
        {
            // Sticky and announcement topics stay at the top of the forum
            // regardless of activity, so a duplicate here does not mean we
            // have reached previously extracted data. Do not cancel.
            string topicType = args.InternalDataRow["type"];
            if (topicType.Contains("Sticky:") || topicType.Contains("Announcement:"))
                return false;

            // An ordinary duplicate topic: cancel the template.
            return true;
        }
        catch (Exception exp)
        {
            args.WriteDebug(exp.Message);
            return false;
        }
    }
}

Duplicate checks can only be performed against data from previous project runs if the project has been configured to add extracted data to existing data, so we need to set the project option Data extraction mode to Add to existing data.

Exporting New Data Only

Visual Web Ripper exports all data by default, so if you are adding newly extracted data to existing data, the export will include the existing data as well. If you need to add newly extracted data to existing data so you can check for duplicates, but only want to export the newly extracted data, use the option Export last data segment only.
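The idea behind this option can be sketched generically in Python (the run_id field is an assumption for illustration, not a Visual Web Ripper field name): the full data set is retained for duplicate checking, but only the most recent segment is exported.

```python
def export_last_segment(all_rows):
    """Return only the rows from the most recent run.

    Each row carries a 'run_id'; the complete set stays in the store
    for duplicate checks, but only the newest segment is exported."""
    if not all_rows:
        return []
    latest = max(row["run_id"] for row in all_rows)
    return [row for row in all_rows if row["run_id"] == latest]
```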

Duplicate Checks on Export

By default, Visual Web Ripper checks for duplicates during a web scraping session, so it can perform actions such as stopping the session when it hits a duplicate. If you always extract all data and don't need to stop web scraping when a duplicate is found, you can set the advanced project option Duplicate check on export. This option instructs Visual Web Ripper to perform the duplicate checks on the export data instead, ensuring the exported data does not contain duplicates. No duplicate checks are performed during the web scraping session when Duplicate check on export is set to true.
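Deduplicating at export time amounts to filtering the finished data set rather than interrupting the scrape. A minimal Python sketch of this behaviour (the key fields are assumptions matching the forum example above):

```python
def deduplicate_on_export(rows, key_fields=("title", "last_post_date")):
    """Remove duplicate rows at export time, keeping the first occurrence
    of each key and preserving row order; scraping is never interrupted."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

The trade-off versus checking during the session is that every page is still scraped on every run; only the exported output is cleaned.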
