[solved] regex to scrape links - fails on a few sites
#1
Hi,

I'm trying to scrape the links for the posts on this page http://bustedcoverage.com/other-sports/

Here's my regex:
<div class='coedleftalign'><a href='(.*)' ><img src=

It should pull the URL, which it does for the first three links. After that it breaks and returns URLs plus a bunch of nonsense.

I tried using shortest match in another program and that seemed to help, but I don't know how to do it with findrx. I tried combining flags with | but I couldn't make it work. Here's my current findrx code:

if(0=findrx(s pattern 1 4 l)) end "not found"

The weirdest thing is that I've been able to use regex and the findrx code above to scrape links for posts on almost all sites except for this one.

Thanks.
#2
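Your .* is greedy: it matches as much as it can, so it runs on to the last ' ><img src= it can reach instead of stopping at the first one.
Use "non-greedy", e.g. .*?
Or "any characters except specified", e.g. [^>]*
Read more in Help.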
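
The difference is easy to see outside of findrx. Below is a minimal sketch in Python (not QM code), run on a made-up HTML string in which two post links sit on one line. The example.com URLs and the image names are assumptions for illustration only, not taken from the real page.

```python
import re

# Hypothetical sample: two post links on a single line (no newline between them).
html = (
    "<div class='coedleftalign'><a href='http://example.com/post-1' ><img src='a.jpg'></a></div>"
    "<div class='coedleftalign'><a href='http://example.com/post-2' ><img src='b.jpg'></a></div>"
)

prefix = r"<div class='coedleftalign'><a href='"
suffix = r"' ><img src="

# Greedy: (.*) runs to the LAST suffix on the line, swallowing the markup in between.
greedy = re.findall(prefix + r"(.*)" + suffix, html)

# Non-greedy: (.*?) stops at the FIRST suffix after each prefix.
lazy = re.findall(prefix + r"(.*?)" + suffix, html)

# Negated class: ([^>]*) cannot cross a '>', so it cannot leave the <a ...> tag.
negated = re.findall(prefix + r"([^>]*)" + suffix, html)

print(len(greedy), greedy)
print(len(lazy), lazy)
print(len(negated), negated)
```

With this sample the greedy pattern returns a single oversized match, while the other two each return the two URLs. Links that sit on separate lines are not affected, because . does not match a newline by default, which may be why the original pattern worked for some links and on other sites.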
#3
Non-greedy did the trick. Thank you!

