Get all hyperlinks within a page using Nokogiri

Task: Create a Selenium script in Ruby that collects all the available links within a page.

In essence, we will create a method that parses the HTML source of the current page and collects every anchor element, selected with either the CSS selector 'a' or the XPath '//a'. First, let's try it in IRB.


1. Start your server and fire up IRB.

2. In your console, type

require 'nokogiri'
require 'open-uri'

3. Set the page we want to test. Say we want all the hyperlinks on the Google home page.

page = "http://www.google.com.ph"

4. Type the following commands:

doc = Nokogiri::HTML(URI.open(page)) # URI.open is provided by open-uri
links = doc.css('a')
hrefs = links.map {|link| link.attribute('href').to_s}.uniq.sort.delete_if {|href| href.empty?}

Of course, we would not want to repeat this procedure in the console every time, so we can save it as a method in our class, like the following:

require 'nokogiri'
require 'open-uri'

# Get all links using Nokogiri
def get_all_hrefs_nokogiri
  page = self.get_location()
  doc = Nokogiri::HTML(URI.open(page))
  links = doc.css('a')

  hrefs = links.map {|link| link.attribute('href').to_s}.uniq.sort.delete_if {|href| href.empty?}
  return hrefs
end

# Get all links without using Nokogiri
def get_all_hrefs
  hrefs = []
  self.get_xpath_count('//a').to_i.times do |i|
    # Only read the href when the i-th link actually exists
    if self.is_element_present("document.links[#{i}]")
      hrefs << self.get_attribute("document.links[#{i}]@href")
    end
  end
  return hrefs
end
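
Only get_all_hrefs_nokogiri de-duplicates, sorts, and drops empty values; get_all_hrefs returns the list as collected. If you want both methods to return comparable lists, one option is to move that clean-up into a shared helper. The sketch below is only an illustration: `clean_hrefs` is a hypothetical name, not something from Nokogiri or Selenium, and the sample array is made up.

```ruby
# Hypothetical helper (not part of the methods above): normalizes a list of
# raw href values the same way get_all_hrefs_nokogiri does.
def clean_hrefs(raw_hrefs)
  raw_hrefs
    .map { |href| href.to_s }       # nil (anchor without href) becomes ""
    .uniq                           # drop duplicates
    .sort                           # stable, predictable order
    .reject { |href| href.empty? }  # drop the empty entries
end

# Made-up sample input with a duplicate, a nil and an empty string.
raw = ['/b', nil, 'http://example.com/a', '/b', '']
puts clean_hrefs(raw).inspect
```

Each method could then end with `return clean_hrefs(hrefs)`.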
