Latest Inquiries - Data Extraction Software

Content Transformation

Submitted: 1/14/2016
Hi,

I have a problem with content transformation.

I want to capture the "specialization" on a page (e.g., engineering, medical, business). The transformation is supposed to turn the text " 699 engineering Jobs on JobsCentral.com.sg", as shown in the picture below, into "engineering" so that I can extract a specific specialization.


I have run the content transformation and it works (I took a screenshot; please check the attached file, content_transformation.PNG). However, when I run the project, I notice that some values are not transformed correctly. Instead of the expected "specialization" value (e.g., accountant), the result is the word "matched".



I have attached the project file and the log file.

Thank you.


specialization.PNG
JC_log.txt
Jobscentral_SG.rip
content_transformation.PNG

Replied: 1/15/2016 1:35:33 AM

I got script compile errors at the SQL connection when opening the content transformation script editor for the 'Specialization' element. I then checked the transformation script you wrote: if you'd like to extract the word 'engineering', you can simply use a Regex script rather than C#.

(\w+) Jobs on
$1
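
For readers who want to try the pattern outside the tool, here is a minimal sketch in Python. The function name `extract_specialization` is made up for illustration and is not part of the data extraction software; Python's `re` module writes the `$1` replacement as group 1.

```python
import re

# Same pattern as above: capture the single word right before " Jobs on".
PATTERN = re.compile(r"(\w+) Jobs on")

def extract_specialization(header):
    """Return the word before 'Jobs on', or None if the header does not match.

    Hypothetical helper for illustration only.
    """
    match = PATTERN.search(header)
    return match.group(1) if match else None

# Header text quoted in the original question.
print(extract_specialization(" 699 engineering Jobs on JobsCentral.com.sg"))
```

Note that the pattern captures whatever word precedes "Jobs on", so it relies on the header always having the same format.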

Jobscentral_SG.rip

Replied: 1/18/2016 9:46:51 AM
Thank you for pointing that out. Now, I would like to know how I can obtain the start URL in a content transformation script.

Basically, since I am crawling this website from a list of start URLs, I want to obtain the "specialization" from the start URL (e.g., http://jobscentral.com.sg/jobs/engineering becomes engineering). Then, when I crawl the next page, if I find the word "matched", it is automatically replaced with the "specialization" obtained from the start URL.

Here is a screenshot of the incomplete content transformation script:



Thank you
Replied: 1/19/2016 1:58:34 AM

Please check the attached new project.

The 'Matched' logic can't work on the next pages: even if it finds the word 'Matched' there, the current URL doesn't include the 'specialization' field. You can check the example URL in my last reply.

I've created a new content element at the beginning of the project; it extracts the last section ('specialization') of the start URL, as you expect. Following it is a new 'group' template that contains the original templates for iterating through all pages. I hope it's helpful for you.
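
The idea of taking the last section of the start URL and using it whenever the header yields "matched" can be sketched in plain Python as follows. This is only an illustration of the logic, not the script API of the tool: both function names are hypothetical, and the start URL is assumed to be captured once up front and passed in, since the next-page URLs do not contain the specialization.

```python
from urllib.parse import urlparse

def specialization_from_url(start_url):
    """Return the last path section of the start URL (hypothetical helper)."""
    path = urlparse(start_url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]

def resolve_specialization(extracted, start_url):
    """Use the start URL's value when the header only gave 'matched'."""
    if extracted.strip().lower() == "matched":
        return specialization_from_url(start_url)
    return extracted

# Start URL taken from the question above.
start = "http://jobscentral.com.sg/jobs/engineering"
print(specialization_from_url(start))
print(resolve_specialization("Matched", start))
```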

By the way, I've removed the shared database connection from the export scripts so that this project works for me.

Jobscentral_SG.rip

Replied: 1/15/2016 4:28:56 AM
Hi,

I used the regex that you gave me. It works fine when I run the content transformation (as shown in cont_trans.PNG). However, after I crawl with the project file, it still shows "Matched" in the "Specialization" field (as shown in matched.PNG).
matched.PNG
cont_trans.PNG

Replied: 1/18/2016 2:00:28 AM

The SQL error persists at run time, but it's not a big problem for the diagnosis.

I found the reason: the page navigation template navigates to a next page where the header text has a different format from the usual one.

e.g, http://jobscentral.com.sg/jc/jobseeker/jobs/jobresults.aspx?excrit=st%3dA%3buse%3dALL%3bCID%3dSG%3bSID%3d%3f%3bTID%3d0%3bLOCCID%3dSG%3bENR%3dNO%3bDTP%3dDRNS%3bYDI%3dYES%3bIND%3dALL%3bPDQ%3dAll%3bJN%3dJN001%3bJN%3dJN093%3bJN%3dJN112%3bPAYL%3d0%3bPAYH%3dGT120%3bPOY%3dNO%3bETD%3dALL%3bRE%3dALL%3bMGT%3dDC%3bSUP%3dDC%3bFRE%3d30%3bCHL%3dIL%3bQS%3dSID_UNKNOWN%3bSS%3dNO%3bTITL%3d0%3bOB%3d-relv%3bRAD%3d30%3bJQT%3dRAD%3bJDV%3dFalse%3bHost%3dJC%3bSITEENT%3dJOBSCENTRALJOB%3bMaxLowExp%3d-1%3bRecsPerPage%3d25&pg=5&IPath=JRCVTV

It displays '380 matched Jobs on...', which is why you always get 'matched'.
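
To see why this happens, the regex can be run against that header text directly. The short Python sketch below uses the header exactly as quoted above (including the trailing "..."); the variable names are only for illustration.

```python
import re

# Header text as displayed on the later result pages (quoted from above).
header = "380 matched Jobs on..."

# The pattern captures the word before " Jobs on", which here is "matched",
# because this page format does not include the specialization at all.
word = re.search(r"(\w+) Jobs on", header).group(1)
print(word)
```

So the transformation itself behaves as written; the specialization simply is not present in this header.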

Replied: 1/15/2016 6:32:43 AM
Hi,

Just to add to my previous post, I have attached my latest project file. It should no longer have errors at the SQL connection.
Jobscentral_SG.rip