Latest Inquiries - Data Extraction Software

Data different positions

Submitted: 3/13/2017

Hello!

I am scraping course data from the following website: http://xstudy.eu/find-bachelor-master-programmes.0.html?


I have the following problem: on some pages, some information is missing (such as office hours), and the information listed below it (like Internet) is therefore moved up. As a result, in my final Excel sheet some courses will have Internet or other info under the wrong column.

Example course all info: http://xstudy.eu/bachelor-master-programme.0.html?&tx_assearchengine_pi7[program]=109303

Example course missing info: http://xstudy.eu/bachelor-master-programme.0.html?&tx_assearchengine_pi7[program]=80173


Also, can you please check whether the two page navigation templates are correct?

Because when I create the list for the "list of link" template, it includes all the numbers in the row, including the symbol for next page (>>) and the number of the last page (3398). I therefore selected 5 in list options - count. Is that correct?


Below you will find the project attached.


Thank you

Federica

Xstudy_2017_all.rip

Replied: 3/13/2017 2:12:43 PM

Hi Federica,

This is because some elements are in different positions from page to page. To avoid this, you can use filtering. For office hours, for example, select the data, then right-click on "Office Hours" and choose Add Filter > Must Have Text "Office Hours". This will solve the problem.
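The idea behind the filter is to match each value by its label rather than by its position on the page. Outside the tool, the same idea can be sketched in Python as below. This is only an illustration, not how the filter is implemented: the function name `row_from_pairs`, the `COLUMNS` list, and the sample values are assumptions made for the example.

```python
# Hypothetical output columns; only labels mentioned in this thread are used.
COLUMNS = ["Office Hours", "Internet"]


def row_from_pairs(pairs, columns=COLUMNS):
    """Build one spreadsheet row by matching labels, not positions.

    `pairs` is a list of (label, value) tuples in whatever order the page
    lists them. A label missing from the page leaves an empty cell instead
    of shifting the later values into the wrong column.
    """
    found = {}
    for label, value in pairs:
        # Normalise "Internet:" and " Internet " to "Internet".
        key = label.strip().rstrip(":").strip()
        found[key] = value.strip()
    return [found.get(column, "") for column in columns]


# A page with all info, and a page where office hours are missing.
full_page = [("Office Hours:", "Mon 9-11"), ("Internet:", "www.example.org")]
partial_page = [("Internet:", "www.example.org")]

print(row_from_pairs(full_page))
print(row_from_pairs(partial_page))
```

On the second page the "Office Hours" cell stays empty and "Internet" remains in its own column.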

For page navigation, I modified the XPath. It should work now.


Best regards,

Xstudy_2017_all.rip

Replied: 3/15/2017 5:22:19 PM

Hi,

You can scrape the URL list first across all the page navigation, without going to the individual pages. After you have the links from all the pagination pages, you can create a separate agent that scrapes the details pages and uses the URLs you collected as the input for List of Start URL in Project Options.
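If the collected URL list is still too large for one run, it can be split into smaller tranches before being used as start URLs. Below is a minimal Python sketch of that splitting step, done outside the tool. The function names, the batch size of 5000, and the output file naming are assumptions for illustration, and whether a plain text file with one URL per line is an accepted input format should be checked in Project Options.

```python
from pathlib import Path


def split_into_batches(urls, batch_size):
    """Split a list of URLs into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


def write_batches(urls, out_dir, batch_size=5000):
    """Write each batch to its own text file, one URL per line.

    Duplicate URLs are dropped first while keeping the original order.
    Returns the list of files that were written.
    """
    unique_urls = list(dict.fromkeys(urls))
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for number, batch in enumerate(split_into_batches(unique_urls, batch_size), start=1):
        batch_file = out_path / f"start_urls_{number:03d}.txt"
        batch_file.write_text("\n".join(batch) + "\n", encoding="utf-8")
        written.append(batch_file)
    return written
```

Each resulting file can then be fed to the details agent as a separate run, so a failure in one tranche does not affect the others.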

Best regards,

Replied: 3/15/2017 3:25:00 PM

Hello!


Thank you, I added the filter and it worked properly for almost 5000 entries, but then it just stops and extracts the data. However, I have to scrape almost 70,000 entries. Is there a way to fix this, or to do it in different tranches?


Thank you


Federica

Xstudy_2017_all_CS.rip