[ PHP Dom XPath->evaluate ]
A quick question for all you pros. I am trying to screen scrape the title of threads on this page http://forums.moneysavingexpert.com/forumdisplay.php?f=36
I am using DOMXPath. Looking at the source code for the above page, the title is contained in the following code:
<a href="showthread.php?t=...number representing thread..."
id="thread_title_...number representing thread..."
style="font-weight:bold">TITLE OF THREAD</a>
I started with this code:
$list3 = $xpath3->evaluate("//a[contains(@style, 'font-weight:bold')]");
However, there are multiple <a style="font-weight:bold"> elements. My question is: can you combine contains? For example, contains @style and @href?
If so, how can you do it with the above href, which has a number that changes depending on which thread it is? Can you do a [0-9] type of thing?
I would appreciate any help I can get!
Answer 1
Use the following expression to get the link whose href contains showthread.php?t=2:
//a[contains(@style, 'font-weight:bold') and
contains(@href, 'showthread.php?t=2')]
If you want to get any of those links (regardless of the number in t=<n>), then use the following expression:
//a[contains(@style, 'font-weight:bold') and
contains(translate(@href, '0123456789', ''), 'showthread.php?t=')]
Note that you could also use starts-with if these strings always appear at the start of the href.
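To show how a combined predicate plugs into DOMXPath, here is a minimal sketch. The sample markup is made up to resemble the snippet in the question (it is not the real forum page), and the variable names are only illustrative. It uses a plain contains on the href, which should be enough when you do not care which thread number follows t=.

```php
<?php
// Hypothetical sample markup modelled on the snippet in the question.
$html = <<<HTML
<html><body>
<a href="showthread.php?t=2" id="thread_title_2" style="font-weight:bold">First thread</a>
<a href="showthread.php?t=31" id="thread_title_31" style="font-weight:bold">Second thread</a>
<a href="member.php?u=7" style="font-weight:bold">Some user</a>
<a href="showthread.php?t=31&amp;page=2">Page 2</a>
</body></html>
HTML;

// Real-world pages are rarely valid HTML, so collect parser warnings quietly.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Both conditions must hold: a bold style AND an href pointing at a thread.
$links = $xpath->query(
    "//a[contains(@style, 'font-weight:bold') and contains(@href, 'showthread.php?t=')]"
);

$titles = [];
foreach ($links as $link) {
    $titles[] = trim($link->textContent);
}

print_r($titles);
```

With this sample input, only the first two links satisfy both conditions: the member link fails the href test, and the pagination link has no style attribute.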
Answer 2
I think you can do combinations, but in your case, I think it would be simpler to get the 3rd td of each tr and get the title inside. And again, try not to rely on the style; it is not very semantic.
You need to learn XPath, and you can learn it at w3schools, for instance. Also, if you use Firebug, you can right-click on any element in the HTML tab and get its XPath. Here is what I get for the first title: //*[@id="td_threadtitle_3499047"] ... not very good, since it contains the thread number.
For the thread table, I get this: //*[@id="threadslist"] ... this is better, there is no number.
Now let's get every 3rd td in it: //*[@id="threadslist"]//td[3]
And now the second link, which must correspond to the title: //*[@id="threadslist"]//td[3]/div/a[2]
Get it? Maybe I'm wrong, but I hope you got the idea...
Relying on the position is not very semantic either, but you don't seem to have very much choice in that matter...
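Here is a rough sketch of the positional approach in PHP. The table below is a made-up, heavily simplified stand-in for the real thread list (I am assuming the title is the second link inside a div in the third cell, as described above), so check the actual page structure before relying on it.

```php
<?php
// Hypothetical, simplified stand-in for the forum's thread table.
$html = <<<HTML
<table id="threadslist">
  <tr>
    <td>icon</td>
    <td>status</td>
    <td id="td_threadtitle_1"><div><a href="#new">new</a> <a href="showthread.php?t=1" id="thread_title_1">First thread</a></div></td>
  </tr>
  <tr>
    <td>icon</td>
    <td>status</td>
    <td id="td_threadtitle_2"><div><a href="#new">new</a> <a href="showthread.php?t=2" id="thread_title_2">Second thread</a></div></td>
  </tr>
</table>
HTML;

// Real-world pages are rarely valid HTML, so collect parser warnings quietly.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Third cell of every row inside the thread table, then the second link in its div.
$links = $xpath->query('//*[@id="threadslist"]//td[3]/div/a[2]');

$positionalTitles = [];
foreach ($links as $link) {
    $positionalTitles[] = trim($link->textContent);
}

print_r($positionalTitles);
```

Note that td[3] means "the third td child of its own row", so the query returns one link per row rather than just the third cell in the whole table.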