Latest Inquiries - Data Extraction Software

Property detailed description

Submitted: 8/15/2013

Hello,

This is a real estate website, and I was able to extract everything I want except:
- page navigation: clicking the next-page icon at the bottom of the page (I tried to use the PageNavigation template, but it doesn't work)
- the detailed description of the property (the text between "Dodatni opis:" and "Lastnosti"; see the attached picture or http://www.bolha.com/nepremicnine/stanovanja/maribor/stanovanje-podravska-maribor-center-2-sobno-505-m2-prodam-1288642595.html?aclct=1376502928), which can have one or more paragraphs

Thanks in advance.

propertyDetails.png

Replied: 8/16/2013 10:50:08 PM
See the attached new project.

I've added a "description" element that uses a content transformation Regex script to capture the specific content.
I've also added a "next page" page navigation template underneath the "nepremicnine" link template; it has a specific selection XPath and uses the Full page load action to open the next page.
Bolha.rip
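
The Regex script inside the attached project is not shown in this thread. As a rough sketch of the idea only, the following Python snippet captures everything between the two markers "Dodatni opis:" and "Lastnosti", including multi-paragraph descriptions. The sample text and the function name extract_description are assumptions made for illustration, and the tool's own regex dialect may differ from Python's.

```python
import re

# Hypothetical sample text; the real page content is not reproduced in this thread.
page_text = """Some other field
Dodatni opis:
First paragraph of the description.

Second paragraph of the description.
Lastnosti
Second floor"""

def extract_description(text):
    """Return the text between 'Dodatni opis:' and 'Lastnosti', or None if absent."""
    # re.DOTALL lets '.' match line breaks, so several paragraphs are captured;
    # the non-greedy '.*?' stops at the first 'Lastnosti' after the start marker.
    match = re.search(r"Dodatni opis:(.*?)Lastnosti", text, re.DOTALL)
    return match.group(1).strip() if match else None

description = extract_description(page_text)
print(description)
```

The non-greedy quantifier matters here: a greedy ".*" would run to the last occurrence of "Lastnosti" on the page instead of the first.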

Replied: 8/15/2013 5:35:39 PM
Please attach your project so I can revise the page navigation template and the description element. Thanks.
Replied: 8/17/2013 9:39:08 AM
Thank you so much for that. :)

But I found that something else I did does not work as I expected.

As you can see in the attached picture "property1.png", there can be five lists (with green dots). The first two lists, "Lastnosti" and "Objekt", usually have only one element, so I extracted each of them as an element. For "Oprema", "Prikljucki" and "Okolica" I created a PageArea template for each: I click on the first element of the list, right-click on the second and select "Create/Edit list" --> "Create list", then right-click on the name of the list (e.g. "Oprema") and select "Add Filter" --> "Must have Text Oprema".

For each PageArea I get XPath:
//TD[B[3]='Oprema']/UL[3]/LI
//TD[B[4]='Prikljucki']/UL[4]/LI
//TD[B[5]='Okolica']/UL[5]/LI

But when one or more lists are missing (as you can see in the attached picture "property2.png"), the ones that come after them are not extracted.

I would be really happy if you could also help me with that.



property2.png
property1.png

Replied: 8/17/2013 8:51:15 PM
Please check the attached new project.

The key point is the selection XPath, which uses the Span function.

F.Y.I:

The selection xpath
Bolha.rip
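
The selection XPath from the attached project is not reproduced above. As a general illustration, separate from the tool's Span function, the fixed indexes in an expression such as //TD[B[3]='Oprema']/UL[3]/LI fail because they depend on how many lists precede the wanted one. In plain XPath 1.0, pairing a heading with the list that follows it, for example //TD/B[.='Oprema']/following-sibling::UL[1]/LI, avoids that dependency; whether the tool accepts this form is an assumption. The Python sketch below applies the same pairing logic to a simplified, well-formed stand-in for a listing with some lists missing; the function name lists_by_heading is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for one listing.
# "Objekt", "Oprema" and "Prikljucki" are missing on purpose.
html = """<td id="box-oglas-levo">
<b>Lastnosti</b><br/>
<ul class="avto_oglas"><li>Second floor</li></ul><br/>
<b>Okolica</b><br/>
<ul class="avto_oglas"><li>post office</li><li>bank</li></ul>
</td>"""

def lists_by_heading(td):
    """Map each <b> heading to the <li> texts of the first <ul> that follows it."""
    result = {}
    current = None
    for child in td:  # direct children in document order
        if child.tag == "b":
            current = (child.text or "").strip()
            result[current] = []
        elif child.tag == "ul" and current is not None:
            result[current].extend((li.text or "").strip() for li in child.findall("li"))
            current = None  # only the first <ul> after a heading belongs to it
    return result

lists = lists_by_heading(ET.fromstring(html))
print(lists.get("Okolica", []))  # found even though earlier lists are missing
print(lists.get("Oprema", []))   # empty list when the heading is absent
```

Because each list is looked up by its heading text rather than by its index, a missing list no longer shifts the ones that come after it.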

Replied: 8/18/2013 9:13:39 AM

Just one more question about XPath. Let's say I have the following XML:

<td id="box-oglas-levo">
<b>Characteristics</b><br>
<ul class="avto_oglas">
<li>Second floor</li>
</ul><br>
<b>Building</b><br>
<ul class="avto_oglas">
<li>age of building: new</li>
</ul><br>
<b>Equipment</b><br>
<ul class="avto_oglas">
<li>satelite tv</li>
<li>telephone</li>
</ul><br>
<b>Nearby</b><br>
<ul class="avto_oglas">
<li>post office</li>
<li>bank</li>
<li>nursery school</li>
</ul>
</td>

How do I get the position of, e.g., <b>Equipment</b> if it is not always the third <b> node (it can be first, second, …)?

I know that one possibility is //TD[B[.='Equipment']], but I need the position number itself (an XPath that returns 1 if it is in the first position, 2 if it is in the second, …).


Replied: 8/18/2013 6:34:29 PM
I'm not quite sure why you need to extract the position number from the XPath. I think it's a little hard to do, but you can try the attribute @node-position passed to a custom XPath function, and then assign the number to another element.
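
For reference, standard XPath 1.0 can express such a position by counting the preceding sibling headings: count(//TD/B[.='Equipment']/preceding-sibling::B) + 1. Whether the tool evaluates numeric expressions like this is an assumption, and note that the expression yields 1 even when the heading is absent. The Python sketch below computes the same position on the sample markup from the question, rewritten with self-closing <br/> tags so that it parses as XML; the function name heading_position is hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample markup from the question, made well-formed for the XML parser.
html = """<td id="box-oglas-levo">
<b>Characteristics</b><br/>
<ul class="avto_oglas"><li>Second floor</li></ul><br/>
<b>Building</b><br/>
<ul class="avto_oglas"><li>age of building: new</li></ul><br/>
<b>Equipment</b><br/>
<ul class="avto_oglas"><li>satelite tv</li><li>telephone</li></ul><br/>
<b>Nearby</b><br/>
<ul class="avto_oglas"><li>post office</li><li>bank</li><li>nursery school</li></ul>
</td>"""

def heading_position(td, label):
    """Return the 1-based position of the <b> child whose text equals label, or 0 if absent."""
    headings = [(b.text or "").strip() for b in td.findall("b")]  # direct <b> children only
    return headings.index(label) + 1 if label in headings else 0

td = ET.fromstring(html)
print(heading_position(td, "Equipment"))
print(heading_position(td, "Nearby"))
```

Unlike the XPath expression, this sketch returns 0 for a missing heading, which makes the "not present" case distinguishable from "first position".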
Replied: 8/16/2013 10:36:41 AM
This is what I have so far. 
Bolha.rip