Latest Inquiries - Data Extraction Software

AAPG California Database

Submitted: 1/23/2013

I just spoke with Kyle at your 800 number, and he told me I could submit a request for help with a project. I have downloaded the trial software and attempted to extract data, but I am having problems. Here is what I need to accomplish.

1) Go to aapg.org and click on the "MembersOnly" link on the left side of the page.

2) Next, I need to provide a username and password. You can create a new account for free, or I can provide login credentials.

3) After logging in, I have to click on the "AAPG Directory" link on the left of the page.

4) A search page comes up. In here I need to type in a ZIP CODE, then hit ENTER or click "Search."

5) A results page comes up that has the following titles: "Customer Name/Company", "City, State, Prov, Society", and "Contact By". I wish to export these titles as column headers in a .csv file. The information below them, I wish to append as data rows.

6) These rows are displayed at a maximum of 10 contacts per page, so for some ZIP codes there will be multiple pages to click through to get the data.

7) I need this to be done for every zip code in California. I have created and provided an attachment with a .csv file containing all zip codes.

 

I appreciate the help, and I hope this is something your product can help with.

CAzipcodelist.csv
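
For reference, the loop described in steps 4 to 7 can be sketched in Python outside the tool. This is only a sketch under stated assumptions: `export_directory`, `fake_fetch`, and the `PageFetcher` interface are hypothetical names, the extra "ZIP" column is an addition for traceability, and the sample rows are made up. A real run would replace `fake_fetch` with code that logs in and submits the search form.

```python
import csv
import io
from typing import Callable, Iterable, List, Tuple

# Column titles as listed in step 5 of the request.
HEADERS = ["Customer Name/Company", "City, State, Prov, Society", "Contact By"]

# Hypothetical interface: given a ZIP code and a 1-based page number,
# return (rows on that page, True if another page follows).
PageFetcher = Callable[[str, int], Tuple[List[List[str]], bool]]


def export_directory(zip_codes: Iterable[str], fetch_page: PageFetcher, out) -> int:
    """Write one header row, then append every result row for every ZIP code."""
    writer = csv.writer(out)
    writer.writerow(["ZIP"] + HEADERS)
    total = 0
    for zip_code in zip_codes:
        page = 1
        while True:
            rows, has_next = fetch_page(zip_code, page)
            for row in rows:
                writer.writerow([zip_code] + row)
                total += 1
            if not has_next:
                break  # no further pages for this ZIP code
            page += 1
    return total


def fake_fetch(zip_code: str, page: int) -> Tuple[List[List[str]], bool]:
    """Stand-in for the real site: made-up rows, 10 per page (step 6)."""
    counts = {"90001": 12}  # invented result counts, for illustration only
    n = counts.get(zip_code, 0)
    start = (page - 1) * 10
    end = min(start + 10, n)
    rows = [[f"Person {i}", "City, CA", "(w) 555-0100"] for i in range(start, end)]
    return rows, end < n


if __name__ == "__main__":
    buffer = io.StringIO()
    written = export_directory(["90001", "90002"], fake_fetch, buffer)
    print(written, "rows written")
    print(buffer.getvalue())
```

Passing the fetcher in as a parameter keeps the paging loop testable without touching the live site.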

Replied: 1/23/2013 7:40:16 PM
Please check the attached demo project.

Place the input CSV file in the same directory as the .rip file, then run the project.

I could not find a search that returns more than one page of results. If you can get more pages, add a page navigation template following the 'Results' page area template, then select the 'next' link on the page.

F.Y.I:

PageNavigation templates
Aapg.rip

Replied: 1/27/2013 5:47:33 PM
Thanks, I got Xpath working. Now, upon extracting, I get a column that has 3 different types of data: work, fax, and email contact information. I would like to transform the content into its own elements. Can you please advise? Thanks. I have so far figured out that I can extract the email information with the following lines:

mailto:(.*?)"
$1

Using HTML as regex input. I have not figured out how to isolate the work and fax numbers though. Thanks again.
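
To illustrate the idea outside the tool, here is a minimal Python sketch that applies the same email pattern and adds two more for the work and fax numbers. It assumes the cell HTML marks the numbers with "(w)" and "(f)" and that each number runs up to the next tag. The sample HTML, the function name `split_contact`, and the work and fax patterns are assumptions for illustration, not taken from the actual page.

```python
import re

# Email pattern from the post above; the other two are assumed patterns.
EMAIL_RE = re.compile(r'mailto:(.*?)"')
WORK_RE = re.compile(r'\(w\)\s*([^<]+)')  # text after "(w)" up to the next tag
FAX_RE = re.compile(r'\(f\)\s*([^<]+)')   # text after "(f)" up to the next tag


def split_contact(cell_html: str) -> dict:
    """Split one contact cell into email, work, and fax fields."""
    def first(pattern: re.Pattern) -> str:
        match = pattern.search(cell_html)
        return match.group(1).strip() if match else ""

    return {
        "email": first(EMAIL_RE),
        "work": first(WORK_RE),
        "fax": first(FAX_RE),
    }


if __name__ == "__main__":
    # Made-up sample cell; the real markup may differ.
    sample = '(w) 555-0100<br>(f) 555-0101<br><a href="mailto:someone@example.com">email</a>'
    print(split_contact(sample))
```

Because each pattern keys on its own marker, the work and fax numbers can be pulled from the same HTML even though they are nested the same way.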
Replied: 1/27/2013 8:39:37 PM
Ok, I'll attach in 2 mins.
Replied: 1/24/2013 12:29:10 AM
Aapg_info_13_01_24.log
Aapg_error_13_01_23.log
Aapg_info_13_01_23.log

Replied: 1/24/2013 12:28:02 AM
Where is the log file?
Replied: 1/24/2013 2:39:24 AM
The last attached project and sample input data work fine for me, so I'm not sure exactly what's going on. Did you try the last attached project with the sample input data? Does that also not work for you?

Please attach your project and input data and log file.
Replied: 1/24/2013 2:19:19 AM
You need to wait for a while when opening the "AAPG Directory" link template. Please don't cancel the process: it takes some time to populate the input data in the 'group' template, since there are 2K input lines in the CSV file. It will then go to the next step and submit the form iteratively.

See the attached project. It is actually no different from the last one.

I have also attached a sample input CSV with only a few lines, which runs much faster.
Aapg.rip
CAzipcodelist.csv

Replied: 1/24/2013 12:16:24 AM
Thanks for the help. I have placed the .rip file and the .csv in the project directory, but running the project doesn't seem to be working. It seems to hang on the processing of the .csv file. Any thoughts?
Replied: 1/27/2013 8:42:42 PM
Ok, here it is. I managed to separate the email and (w) phone number from the TD[3] element. However, both (w) and (f) are nested the same way, so I am not sure how to specify separating the element.
AapgDraft.rip

Replied: 1/24/2013 12:27:06 AM
Thanks for the help. I have placed the .rip file and the .csv in the project directory, but running the project doesn't seem to be working. It seems to hang on the processing of the .csv file. Any thoughts?
Replied: 1/27/2013 4:42:50 PM
I assume that you activated the 'Navigate in browser' button in the toolbar, so you are now in navigation mode and won't be able to select anything on the page. Please make sure the button is not activated.

Regarding the XPath, you can read our online manual at the link below; it shouldn't be difficult in most cases.

Selection techniques
Replied: 1/27/2013 8:37:55 PM
You can make separate elements with the same XPath, then use a content transformation Regex script to filter out the specific content.

Please attach your project, and I will take a look at the separated elements.
Replied: 1/28/2013 2:06:53 AM
I've checked the project. The two elements (Contact Email & Contact Work Number) use content transformation on the same XPath, which is a good solution. You can also use XPath to filter the specific content. See the attached new project.

F.Y.I:
Using Filters

Regarding the 'memory leaks' issue, you can read the topic at the following link:


I'm not sure why it has been stuck for more than 20 minutes; it should auto-clear or restart the entire process in this case. In any case, if some of the pages (templates) cause memory leaks, you should try extracting the data under the Web Crawler agent. If that doesn't work properly, you should set it to restart the browser instance rather than restarting the entire process.

If the issue still cannot be resolved, please attach your log file so I can investigate further.

AapgDraft.rip

Replied: 1/27/2013 10:37:59 AM
Sorry for the late reply, there was a death in the family and I haven't been feeling well. Anyway, thanks for the reply. I was wondering, in comparing your demo to the one I made, how you were able to make the XPath work. When I set mine the same way as yours, it does not highlight anything red. What do I need to do to get that working? Thanks.
Replied: 1/28/2013 6:45:33 AM
Before you provided that response, I restarted my project, this time setting "Restart Entire Process" under the Advanced tab in project options. I'm not exactly sure what this did, but my project was able to finish successfully. I will attach my latest RIP file. Now, the program has created a .CSV file with the exported data. I have not yet attempted to filter out the (f) number from the content element. That said, when I look at the extracted data, there are many duplicated cells. The ZIP code column has no duplicates, which shows me the process is iterating correctly through the input data source. However, it's placing duplicate information in cells where there should be no content. For example:

Looking at the .CSV file, starting with ZIP code 90004, Brian Lee Clements is properly listed. A manual search shows that is the correct listing. However, Brian Lee Clements and all his other contact information also show up under 90005 and 90006. A manual search indicates 90005 and 90006 have no data. This same duplication occurs throughout the file. However, in the .CSV file, 90002 and 90003 have no data associated with the rows. That is how I want it. Is there a reason why some rows are left blank, while others are left with duplicate data? Thanks.
AapgPt2.csv
AapgDraft.rip
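
One way to locate the affected rows is to flag any ZIP code whose exported rows are identical to the rows of the ZIP code just before it. The following is a minimal Python sketch, assuming the export has a ZIP column and is ordered by ZIP code. The function name `find_stale_zips`, the column names, and the sample values are made up for illustration and are not taken from AapgPt2.csv.

```python
import csv
import io
from itertools import groupby
from typing import List


def find_stale_zips(csv_text: str, zip_column: str = "ZIP") -> List[str]:
    """Return ZIP codes whose non-empty rows repeat the previous ZIP's rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    suspects = []
    previous = None
    # groupby only merges consecutive rows, so the file must be ordered by ZIP.
    for zip_code, group in groupby(reader, key=lambda row: row[zip_column]):
        payload = [
            tuple((value or "").strip() for key, value in row.items() if key != zip_column)
            for row in group
        ]
        has_data = any(any(cell for cell in row) for row in payload)
        if has_data and payload == previous:
            suspects.append(zip_code)  # same contacts as the ZIP before it
        previous = payload
    return suspects


if __name__ == "__main__":
    # Invented sample: 90005 and 90006 repeat the rows of 90004.
    sample = (
        "ZIP,Name,City,Contact\n"
        "90002,,,\n"
        "90003,,,\n"
        '90004,Example Contact,"Los Angeles, CA",(w) 555-0100\n'
        '90005,Example Contact,"Los Angeles, CA",(w) 555-0100\n'
        '90006,Example Contact,"Los Angeles, CA",(w) 555-0100\n'
    )
    print(find_stale_zips(sample))
```

A flagged ZIP code is only a suspect: two neighbouring ZIP codes could legitimately list the same contact, so each one still needs a manual search to confirm.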

Replied: 1/24/2013 8:19:29 AM
Thanks for that. It looks as though those last two files iterate through the data. I see you changed the initial sign-in template to "Optional Template." Were there any other changes made? I was using the previous .RIP file and it would not produce any results, but it did not have "Optional Template" checked. Is that what made it work? Also, were any "Advanced" options changed to achieve these results? Regarding the page navigation template, I wasn't able to pinpoint a ZIP code with multiple pages of results. However, if the input is changed from ZIP codes to CITIES, and "los angeles" is an input, the results page has multiple pages. How can I create a template for instances like that which may occur? Thanks again, I'm beginning to feel confident that purchasing this software will do the trick.
Replied: 1/24/2013 1:43:03 AM
Here's a more detailed log.
Aapg_info_13_01_24.log

Replied: 1/24/2013 5:14:04 PM
There is no difference between the last project and the previous demo project; I just made the sign-in an 'Optional template'. If the last project is working, that's fine.

However, if the input is changed from ZIP codes to CITIES, and "los angeles" is an input, the results page has multiple pages. 
======================
You have figured it out already. Use CITIES instead of ZIP with 'los angeles' as input, then open the search form submit template; it will have more pages. Find the 'next' link and select it to create a new page navigation template. Afterwards, navigate back to the parent template and switch the input back to the ZIP form field.
Replied: 1/24/2013 2:26:50 AM
I thought that myself. But, I created my own CSV containing only 10 zip codes.
I waited almost 3-5 mins but it still didn't load. It hangs on the 4th process.
It doesn't appear to load even the first value 90001.


Thanks for the help. I'm eager to buy this ASAP if it works!
Replied: 1/28/2013 4:51:40 PM
The 'search' form submit template opens the page in a new tab to improve performance, but this also causes a rare issue: the result page is never refreshed when a search term has no results. Instead, an alert pops up in the VWR editor. At runtime, VWR doesn't detect this case, so it continues to extract the result page in the new tab, which still displays the previous results. The solution is to submit the form without a new tab.

See the attached new project.

By the way, I suggest that you create a 'group' template to feed in the input ZIP codes, as I did in the demo project for you. It takes some time initially, but it won't navigate back to the start page on every iteration, which improves performance.
AapgDraft.rip

Replied: 1/23/2013 4:57:42 PM

I created an account you can use to login, so that you may view the site.

 

Username: 10004837

Password: aapg06

 

Thanks again!

Replied: 1/24/2013 12:19:35 AM
Please attach your log file here.

Note: the trial version can only extract a maximum of 100 elements.
Replied: 1/27/2013 11:27:19 PM
Also, I just realized my RIP has been running and is stuck on "Waiting for memory leaks to clear." It's been like this for about 20 minutes now. Please advise. Thanks!