I have some content with a list of URLs contained in it.
I am trying to grab all the URLs out and put them in an array.
The content looks like this:
content = "Sample of URLs: http://www.google.com and http://www.google.com/index.html which I want to grab"
And I want the end result to be:
['http://www.google.com', 'http://www.google.com/index.html']
There are two ways you can extract the URLs.
1) urls = content.split(/\s+/).find_all { |u| u =~ /^https?:/ }
Or you can scan the content with a single regular expression:
2) urls = content.scan(/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix)
Note that the regex in 2) won't match pure IP-address URLs (like http://127.0.0.1), because the TLD part, [a-z]{2,5}, requires two to five letters.
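To compare the two approaches side by side, here is a minimal runnable sketch. The method names urls_by_split and urls_by_scan, the constant URL_PATTERN, and the extra test string with an IP address are made up for illustration; the regex itself is copied unchanged from 2).

```ruby
# Sample content from the question.
content = "Sample of URLs: http://www.google.com and " \
          "http://www.google.com/index.html which I want to grab"

# Approach 1: split on whitespace and keep tokens starting with http: or https:
# (method name is hypothetical, chosen for this example)
def urls_by_split(text)
  text.split(/\s+/).find_all { |u| u =~ /^https?:/ }
end

# Approach 2: scan with the hostname-based regex from 2), copied as-is.
URL_PATTERN = /(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix

def urls_by_scan(text)
  # The pattern has no capturing groups, so scan returns whole matches.
  text.scan(URL_PATTERN)
end

p urls_by_split(content)
# => ["http://www.google.com", "http://www.google.com/index.html"]
p urls_by_scan(content)
# => ["http://www.google.com", "http://www.google.com/index.html"]

# An IP-address URL (assumed example) is found by approach 1 only.
ip_text = "Local server: http://127.0.0.1/status is up"
p urls_by_split(ip_text)  # => ["http://127.0.0.1/status"]
p urls_by_scan(ip_text)   # => []
```

One trade-off worth knowing: approach 1 keeps whatever is attached to the token, so "See http://example.com." yields "http://example.com." with the trailing period, while the regex in approach 2 stops at "http://example.com".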
Thursday, September 29, 2011