Latest Inquiries - Data Extraction Software

extraction stops abruptly

Submitted: 2/19/2016

Hi,

This project is supposed to go through a list of links. For some reason, at a seemingly random point, after it comes back from a details page it goes to the search template and then stops the extraction. It doesn't give any errors; it just ends the extraction.

It doesn't finish all the pages, and I've tried everything I know but can't get it to.

Please help.

PS: The captcha on this site appears in a pop-up window at random times, so I gave up trying to automate it; you'll have to enter the captcha manually.

Please find attached the project.

Thanks.

DE_PROD.rip

Replied: 2/22/2016 8:39:54 AM

I got a 'PageArea not found' error even on the very first run. To resolve this, I set the 'Javascript + Async' action for the page navigation template and enabled the wait script for the 'OffendersList' page area template. The 'OffendersList' page area template also has the 'wait for element' option checked and its wait scripts enabled.
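For context on what 'wait for element' buys you here: after an asynchronous navigation, the page area may simply not exist yet when extraction starts, which is exactly what produces 'PageArea not found'. The sketch below shows the same idea in Selenium for Python; Selenium, the URL, and both CSS selectors are stand-ins of mine, not anything taken from the project.

    # Illustrative sketch only (Selenium, not the tool's internals): after an
    # async 'Next' click, poll until the page area exists instead of failing.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get("https://example.com/search")  # placeholder URL

    driver.find_element(By.CSS_SELECTOR, "a.next").click()  # hypothetical selector

    # Equivalent of 'wait for element': block (up to 15s) until the list renders.
    rows = WebDriverWait(driver, 15).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#offendersList tr"))
    )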

On another note, if you check 'View browser' in the debugging window, you will see that after iterating through several pages a captcha sometimes appears on either the result list page or the detail page. This can cause the agent to stay on the previous page, which in turn causes repeated details. To resolve this, I added two extra decaptcha templates, similar to the first one, and marked them as optional.
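The 'optional' part is the key idea: check whether a captcha actually appeared before running the solve step, rather than requiring it on every page. A rough sketch of that logic, with a hypothetical selector (and, since this project solves captchas manually, the sketch just pauses):

    from selenium.webdriver.common.by import By

    def handle_captcha_if_present(driver):
        # find_elements returns [] when nothing matches, so this check is
        # optional by construction, unlike find_element, which raises.
        popups = driver.find_elements(By.CSS_SELECTOR, "div.captcha-popup")
        if popups:
            # The captcha appears at random, so solving stays manual here.
            input("Captcha detected - solve it in the browser, then press Enter")

Calling something like this after every navigation mirrors attaching a decaptcha template to each point where the popup can appear.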

By the way, you may still encounter repeated pages even after the captcha has been bypassed; the website itself may have a bug that sometimes keeps it on the previous page. But 'Visit each page only once' can avoid scraping duplicates (see the sketch below for the idea).
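Conceptually, 'Visit each page only once' amounts to URL-level deduplication, something like this plain-Python sketch (not the tool's actual mechanism):

    # Skip any detail URL that has already been scraped in this run.
    visited = set()

    def should_visit(url):
        if url in visited:
            return False   # duplicate: the site re-served an earlier page
        visited.add(url)
        return True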

See the attached revised project. I ran it for a long time and it seems to work through more than 8 pages properly, but decaptcha templates 2 and 3 don't seem to work correctly when a captcha is encountered. It's hard to reproduce the captcha in the VWR editor, so you may need to revise decaptcha templates 2 and 3 further.

DE_PROD.rip

Replied: 2/19/2016 7:00:29 AM

The 'OffendersList' link template (i.e., the detail page) can be opened in a new tab: check the 'start new web browser' option and set an empty link transformation to achieve this. That way you won't need a 'back' template, and you won't need to worry about the random behavior you described. The sketch below illustrates why this helps.
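Roughly, each detail page opens in its own browser, so the paginated result list is never navigated away from and there is no 'back' step to desynchronize. A hand-rolled equivalent in Selenium might look like the following; the URL and selector are placeholders of mine, not taken from the project:

    # Illustrative only: open each detail link in a new window so the list
    # page keeps its state, then close the window and return to the list.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example.com/search?zip=19701")  # placeholder URL

    list_handle = driver.current_window_handle
    links = [a.get_attribute("href")
             for a in driver.find_elements(By.CSS_SELECTOR, "td.name a")]

    for url in links:
        driver.execute_script("window.open(arguments[0]);", url)  # new tab
        new_handle = (set(driver.window_handles) - {list_handle}).pop()
        driver.switch_to.window(new_handle)
        # ... extract detail fields here ...
        driver.close()
        driver.switch_to.window(list_handle)  # list page was never touched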

If the issue persists, or a new issue appears, please attach your full log file and some sample data so we can diagnose further. Thanks!


DE_PROD.rip

Replied: 2/21/2016 10:51:49 AM

I did a quick test: I disabled the 'Details' link template and then ran your project. After it iterated through all 15 pages, the agent process stopped automatically, so I think the page navigation you configured has no problem.

I just noticed that you have the 'proxy switch' option enabled. Is it possible that a bad connection caused it to loop through the same pages again? This strange issue could be hard to diagnose, but you can at least avoid scraping the same detail page twice: set 'Visit each page only once' = true in the advanced options of the 'Details' link template.

DE_PROD.rip

Replied: 2/22/2016 6:09:44 AM

I took your suggestion and set the "Visit each page only once" option of the details template to true. It ran the first zip code fully; then, at the second zip code, when it reached page 8 it started skipping all the details pages for the rest of the pages until it reached the end of the zip code. It then skipped all the remaining zip codes until the end of the extraction, saying "pageArea not found".

I know the main issue with this website is going back and forth to and from the details page. Disabling the details template would definitely avoid the problem and make it run smoothly. So would you please try running it again, but this time with the details page enabled?

Or if you have any other suggestions, I would really appreciate it. Thanks.

Please find attached the log file.

DE_PROD_RETRY_info_16_02_21.log

Replied: 2/20/2016 4:36:29 AM

Just wanted to let you know that I changed the Next button action to javascript + ajax instead of synchronous and it works better now.

It doesn't jump around as much as before, but it still goes into an infinite loop after a certain number of pages. For a zip code like 19701, which has 7 pages, it worked perfectly fine, extracting everything in order without jumping back. But for a zip code like 19702, which has 15 pages, when it reaches page 13 it jumps back to page 1 and starts over. I thought maybe it was just a technical error, so I left it running; the second time it reached page 13, it started over again. I then thought maybe this particular page had a technical problem, but I checked the website and found that it was fine. Do you have any explanation for that, or have you seen something like this before? Do you know how to fix it, or have any suggestions?
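One way to confirm this kind of silent jump-back outside the tool is to fingerprint each result page and flag the first repeat. A diagnostic sketch along those lines (Selenium again, with placeholder URL and selectors, not the project's actual configuration):

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example.com/search?zip=19702")  # placeholder URL

    seen = set()
    while True:
        # Use the first row's text as a cheap fingerprint of the page.
        first_row = driver.find_element(By.CSS_SELECTOR, "#results tr td").text
        if first_row in seen:
            print("Jumped back to an already-scraped page: navigation desynced")
            break
        seen.add(first_row)
        next_links = driver.find_elements(By.CSS_SELECTOR, "a.next")
        if not next_links:
            break  # no Next link means the real last page was reached
        next_links[0].click()
        time.sleep(2)  # crude wait; a real run should wait for an element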

Please find attached the log file and the new project.

Thanks.

DE_PROD.rip
DE_PROD_info_16_02_19.log

Replied: 2/19/2016 4:41:21 PM

Thank you so much for your help. I really appreciate it.

I had tried opening it in a new tab before, but back then it didn't work for me. I think the trick was setting the empty link transformation.

Thanks to you, it no longer stops in the middle of the extraction like it used to.

However, I'm now facing a different problem. It keeps extracting the same pages over and over; it's like the navigation is not working properly. For example, the zip code 19702 has 15 pages. Starting the extraction, it extracts page 1 twice, then continues in order for a while, repeating some pages, and then goes back to page 1 and starts over. Although the page numbers are in order on the debugging screen, the pages actually being extracted don't match them. You can see in the log file the names repeating and the number of pages exceeding 15, which should be the total. I don't understand why it keeps jumping back and starting over, but there is definitely something wrong with the navigation once the extraction comes back from the details page.

I remember I added the back button in the first place because it used to extract only one page and then skip the navigation button altogether.

Please Please Please help. This project is driving me crazy.

Please find attached a copy of the log file.

Thank you.

DE_PROD_info_16_02_19.log