Latest Inquiries - Data Extraction Software

IndiatimesArchive

Submitted: 5/25/2012

I need a project that collects all archived news from the mentioned link. I have attached a project that does this.

The problem is that I cannot filter the URLs by my own criteria.

For example, I want data only from URLs that contain the keyword 'energy', such as

http://economictimes.indiatimes.com/news/news-by-industry/energy/power/ntpc-will-get-coal-even-without-fsa-for-2012-13/articleshow/13314452.cms

The project I made gives me data from all the URLs.

The second problem is that I am unable to run this project in WebCrawler mode; only Web browser mode works, and it is very slow.

I look forward to subscribing to your product if these two problems are resolved or an alternative is provided.

Thanks.

IndiatimesArchive.rip

Replied: 5/27/2012 7:03:39 AM
Please check the attached project, in which I made the following changes:

1) The site is AJAX-based: some of its pages are loaded by AJAX / JavaScript, so the project as a whole cannot run under the WebCrawler agent. However, you can assign the WebCrawler agent to the few templates whose pages do load under it, which generally improves performance.

2) In the attached project, I configured it to execute JavaScript without loading other content such as ActiveX, Flash, or images. If you run into a problem and are not sure of the cause, try turning this setting off.

3) 'New Template 3' uses the following XPath:
//TBODY/TR[1]/TD/UL/LI/A[contains(@href, '/energy/')]
It selects only the URLs that contain '/energy/'.
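
To illustrate what the contains(@href, '/energy/') predicate does, here is a minimal sketch in plain Python using only the standard library. It is not how the product evaluates XPath; the class name HrefCollector, the function filter_links, and the SAMPLE markup are all hypothetical and only mimic the list structure the XPath above targets.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def filter_links(html, keyword="/energy/"):
    """Return the hrefs that contain keyword (a case-sensitive substring test,
    the same kind of check as contains(@href, '/energy/'))."""
    parser = HrefCollector()
    parser.feed(html)
    return [h for h in parser.hrefs if keyword in h]


# Hypothetical sample markup, not taken from the real archive page.
SAMPLE = """
<table><tbody><tr><td><ul>
<li><a href="/news/news-by-industry/energy/power/example-a/articleshow/1.cms">A</a></li>
<li><a href="/news/news-by-industry/banking/finance/example-b/articleshow/2.cms">B</a></li>
</ul></td></tr></tbody></table>
"""

if __name__ == "__main__":
    for link in filter_links(SAMPLE):
        print(link)
```

Matching on '/energy/' with both slashes, rather than on the bare word 'energy', avoids picking up article titles that merely mention the word in their slug.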


By the way, there are other settings that can improve performance, which you can also try.
FYI:
Improving performance & Reliability

IndiatimesArchive.rip