Latest Inquiries - Data Extraction Software

Issue with Posted Date

Submitted: 1/20/2016
Hi team,

For this project, I would like to crawl only the content on the website that was posted 1 day ago. However, in the middle of the page there is one outlier date that disrupts the crawl.

Say I want to crawl any content that was posted one day ago (e.g. today is 20/1, so 1 day ago is 19/1). In the middle of the page, there is suddenly an item that was posted on 13/1. After that item, the rest of the content is posted on 20/1. (I have attached a screenshot of the website.)

Based on the attached screenshot, this means I can only crawl 2 items: when the crawler reaches the item dated 13 Jan 16, it stops, because I set it to crawl only content posted 1 day ago. There is still content posted on 19 Jan 16 from page 2 onwards, and I can't crawl it.

Is there any way to skip the outlier date?

I have attached the project file and the log file as well.

Thank you.

logfile.txt
JobOpenings_PH.rip
website_screenshot.PNG

Replied: 1/21/2016 1:47:30 AM

See the attached new project.

Try using a custom XPath function instead of condition scripts. That way the item with the unwanted date is filtered out, and the crawl resumes with the next items instead of stopping.
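
To illustrate the difference between stopping on a date and filtering by a date, here is a minimal Python sketch. It is not the extraction tool's own scripting API or the contents of the attached project; the function names and the sample listing are made up for illustration, loosely following the dates in the question (two items from 19 Jan, one outlier from 13 Jan, then more items).

```python
from datetime import date, timedelta

def stop_at_first_mismatch(posted_dates, target):
    """Mimics a stop condition: end the crawl at the first non-matching date."""
    kept = []
    for posted in posted_dates:
        if posted != target:
            break  # the outlier ends the whole crawl here
        kept.append(posted)
    return kept

def skip_mismatches(posted_dates, target):
    """Mimics a filter: drop non-matching items and keep going."""
    return [posted for posted in posted_dates if posted == target]

# Hypothetical listing order, assumed for this example only.
today = date(2016, 1, 20)
target = today - timedelta(days=1)  # 19 Jan 2016
listing = [
    date(2016, 1, 19),
    date(2016, 1, 19),
    date(2016, 1, 13),  # outlier in the middle of the page
    date(2016, 1, 20),
    date(2016, 1, 19),  # e.g. an item further down or on page 2
]

stopped = stop_at_first_mismatch(listing, target)
filtered = skip_mismatches(listing, target)
print(len(stopped), len(filtered))
```

With the stop condition, only the two items before the outlier are collected. With the filter, the outlier is skipped and the later 19 Jan item is still picked up.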

JobOpenings_PH.rip