Thursday, September 29, 2011

Extract URLs

I have some content with a list of URLs contained in it.

I am trying to grab all the URLs out and put them in an array.

I have code like:

content = "Sample of URLs: http://www.google.com and http://www.google.com/index.html which I want to grab"

And I am trying to get the end result to be:

['http://www.google.com', 'http://www.google.com/index.html']

There are two ways you can extract the URLs:

1) urls = content.split(/\s+/).find_all { |u| u =~ /^https?:/ }

Or you can grab them using a regular expression:

2) urls = content.scan(/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix)

Note that the second approach won't match URLs whose host is a bare IP address (like http://127.0.0.1), because the pattern requires the host to end in a TLD of two to five letters (the [a-z]{2,5} part).
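
As a quick check, here is a minimal sketch that runs both approaches on the sample content and then tries the IP-address case. The variable names (by_split, by_scan, url_pattern, ip_only) are made up for illustration; the regular expression is copied unchanged from option 2.

```ruby
content = "Sample of URLs: http://www.google.com and http://www.google.com/index.html which I want to grab"

# Option 1: split on whitespace and keep the tokens that start with http: or https:
by_split = content.split(/\s+/).find_all { |u| u =~ /^https?:/ }

# Option 2: scan with the regular expression. All groups are non-capturing,
# so scan returns the full matches as plain strings.
url_pattern = /(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix
by_scan = content.scan(url_pattern)

p by_split  # => ["http://www.google.com", "http://www.google.com/index.html"]
p by_scan   # => ["http://www.google.com", "http://www.google.com/index.html"]

# The limitation: a bare IP host has no letters-only TLD, so nothing matches.
ip_only = "Local server at http://127.0.0.1 is up".scan(url_pattern)
p ip_only   # => []
```

One practical difference between the two: option 1 keeps whatever is attached to the token, so a URL followed directly by a comma or full stop would come back with that punctuation included, while option 2 only returns what the pattern matches.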