Latest Inquiries - Data Extraction Software

Issue with Posted Date

Submitted: 1/20/2016
Hi team,

For this project, I would like to crawl website content that was posted 1 day ago. However, in the middle of the page, there is one outlier date that disrupts the crawl.

For example, suppose I want to crawl any content that was posted one day ago (i.e., today is 20/1, so 1 day ago is 19/1). In the middle of the page, there is suddenly an item that was posted on 13/1. After that item, the rest of the content was posted on 20/1. (I have attached a screenshot of the website.)

Based on the screenshot, this means I can only crawl 2 items: when the crawler reaches 13 Jan 16, it stops, because I set it to crawl only content that was posted 1 day ago. There is still content posted on 19 Jan 16 from page 2 onwards, and I can't crawl it.

Is there any way I can skip the outlier date?

I have attached the project file and the logfile as well.

Thank you.


Replied: 1/21/2016 1:47:30 AM

See the attached new project.

You can try using a custom XPath function instead of condition scripts. That way the unwanted date is filtered out and the crawl resumes with the next items.
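
To illustrate the difference between stopping at the outlier and filtering it out, here is a rough sketch in plain Python. It is not the product's scripting API and not the contents of the attached project; the function names and the sample listing are made up for illustration, and the dates only mirror the ones in the question.

```python
from datetime import date, timedelta

def stop_at_first_mismatch(items, target):
    """Stop the loop at the first item whose date does not match."""
    kept = []
    for title, posted in items:
        if posted != target:
            break  # everything after the outlier is lost
        kept.append(title)
    return kept

def skip_mismatches(items, target):
    """Filter out non-matching items and carry on with the next ones."""
    kept = []
    for title, posted in items:
        if posted != target:
            continue  # ignore this item, keep going
        kept.append(title)
    return kept

today = date(2016, 1, 20)
target = today - timedelta(days=1)  # 19 Jan 2016

# Hypothetical listing: an outlier (13 Jan) sits in the middle,
# and a matching item appears again afterwards.
listing = [
    ("post A", date(2016, 1, 19)),
    ("post B", date(2016, 1, 19)),
    ("outlier", date(2016, 1, 13)),
    ("post C", date(2016, 1, 20)),
    ("post D", date(2016, 1, 19)),
]

stopped = stop_at_first_mismatch(listing, target)
filtered = skip_mismatches(listing, target)
print(stopped)
print(filtered)
```

With the sample data, the first function keeps only the two items before the outlier, while the second also picks up the later 19 Jan item. Note that a pure filter never stops on its own, so a separate limit (for example a maximum number of pages) would still be needed to end the crawl.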