BS4

In the following, I assume that you have already obtained the HTML source code of a web page as a string. Commonly that is accomplished by using the "httplib", "requests", or "urllib2" modules (in Python 3, "http.client" and "urllib.request" take the place of "httplib" and "urllib2"). This string is now in the "strPage" variable.
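As a rough sketch of that first step, the function below downloads a page with the Python 3 standard library and decodes it to a string. The name "fetch_page" and the fallback to UTF-8 when the server names no character set are assumptions made for this illustration, not part of bs4.

```python
from urllib.request import urlopen


def fetch_page(url):
    """Return the body found at url, decoded to a str (hypothetical helper)."""
    with urlopen(url) as response:
        # Use the character set announced by the server; assume UTF-8 otherwise.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Typical use (needs network access, so it is left commented out):
# strPage = fetch_page("https://example.com/")
```

Undecodable bytes are replaced rather than raising an error, which is usually acceptable for a first look at a page.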

Please note that I will show only one syntax for a particular task, even though bs4 typically allows several more variants.

In the instructions below, code is shown on lines of its own. Names such as tag, tag-name, attribute, and string are placeholders that you need to replace with your own values.

Getting started:

import bs4

soup = bs4.BeautifulSoup(strPage, 'html.parser')

The "soup" variable contains a parsed version of the page's HTML, organized as a tree of tags so that it is easy to search.
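Here is a minimal, self-contained version of the steps above. The page source is a made-up string standing in for a real download, and the parser is named explicitly as "html.parser" (Python's built-in one), since recent versions of bs4 print a warning when no parser is given.

```python
import bs4

# Hypothetical page source, standing in for a string fetched from the web.
strPage = """<html><head><title>Demo page</title></head>
<body><p><a href="index.html">Home</a></p></body></html>"""

# Parse the string into a searchable tree.
soup = bs4.BeautifulSoup(strPage, "html.parser")

print(type(soup).__name__)  # BeautifulSoup
```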

Extract a single tag

In a few cases, an HTML tag appears only once, or we only want the first one. Extract the tag using:

soup.tag

Examples

t = soup.title #sets "t" to the <title> tag of the web page

a = soup.a #sets "a" to the first link in the page.
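The short sketch below tries this on a made-up page with two links. It also shows one thing to watch for: asking for a tag that does not occur in the page gives None rather than an error.

```python
import bs4

# Hypothetical page source with a title and two links.
strPage = """<html><head><title>Demo page</title></head>
<body><p><a href="index.html">Home</a></p>
<p><a href="about.html">About</a></p></body></html>"""
soup = bs4.BeautifulSoup(strPage, "html.parser")

t = soup.title      # the <title> tag
a = soup.a          # only the first <a> tag, even though there are two
missing = soup.table  # no <table> in this page

print(t)        # <title>Demo page</title>
print(a)        # <a href="index.html">Home</a>
print(missing)  # None
```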

These variables are objects of the "Tag" type, so they offer various attributes and methods. Let's start with:

tag.contents

tag.text

The contents attribute returns a list of the items directly contained in the tag: nested tags and any strings of text between them. The text attribute extracts a single string containing all the text inside the tag, without any of the markup. For example:

p = soup.p

print(p)

<p><a href="index.html">Home</a></p>

So here the variable "p" contains a <p> tag, inside of which is an <a> tag.

print(p.contents)

[<a href="index.html">Home</a>]

print(p.text)

Home

  • The contents is a list with one item, the <a> tag. The text simply extracts the textual data inside the <p> tag and ignores everything else.

  • "text" returns a unicode string, so non-Latin characters can be handled. Python 2 shows such strings with a "u" prefix when they appear inside a list or dictionary, as in the dictionaries printed further down; Python 3 strings are always unicode and show no prefix.

  • The tags in the list returned by "contents" respond in turn to "contents" and "text", so you can descend into the page one level at a time.
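To see what contents and text return when a tag holds both plain text and a nested tag, here is a small sketch. The paragraph is made up for the illustration. Note that the list contains the surrounding strings as well as the <a> tag.

```python
import bs4

# Hypothetical paragraph mixing plain text with a link.
strPage = '<p>Go <a href="index.html">Home</a> now</p>'
soup = bs4.BeautifulSoup(strPage, "html.parser")
p = soup.p

# contents lists the direct children, strings and tags, in document order.
for child in p.contents:
    print(type(child).__name__, repr(str(child)))
# NavigableString 'Go '
# Tag '<a href="index.html">Home</a>'
# NavigableString ' now'

# text joins every string found inside the tag.
print(p.text)              # Go Home now

# The nested tag answers the same questions about itself.
print(p.contents[1].text)  # Home
```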

Another useful tool is:

tag.attrs

While contents and text get what is inside the tag, attrs gets the attributes of the tag itself. In our example the href attribute belongs to the <a> tag, not to the <p> tag around it, so we ask the inner tag for its attributes:

print(p.contents)

[<a href="index.html">Home</a>]

print(p.a.attrs)

{u'href': u'index.html'}

attrs is a dictionary with one entry for each attribute. To get the link in this case:

print(p.a.attrs['href'])

index.html
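The following sketch runs the attribute lookup on the same one-link paragraph. It reads href from the <a> tag, which is the tag that carries it; the <p> tag itself has no attributes. It also uses two conveniences that bs4 provides: indexing a tag directly, and get, which returns None for a missing attribute instead of raising an error. The output comments assume Python 3, where no "u" prefix appears.

```python
import bs4

soup = bs4.BeautifulSoup('<p><a href="index.html">Home</a></p>', "html.parser")
p = soup.p
a = p.a  # the first <a> tag inside the paragraph

print(p.attrs)         # {}  (the <p> tag has no attributes of its own)
print(a.attrs)         # {'href': 'index.html'}
print(a.attrs["href"])  # index.html
print(a["href"])       # index.html  (shorthand for the line above)
print(a.get("title"))  # None  (attribute not present)
```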

Here is an example that is a bit more complex:

print(d)

<div class="sites-status" id="sites-status" style="display:none;" xmlns="http://www.w3.org/1999/xhtml">

<div aria-live="assertive" class="sites-notice" id="sites-notice" role="status">

</div>

</div>

Here we have one "div" tag nested inside another. Contents will again return the inner tag (the div):

print(d.contents)

[<div aria-live="assertive" class="sites-notice" id="sites-notice" role="status"> </div>]

The attributes dictionary this time contains more entries:

print(d.attrs)

{u'style': u'display:none;', u'xmlns': u'http://www.w3.org/1999/xhtml', u'id': u'sites-status', u'class': [u'sites-status']}

Each attribute which appears on the left of an equal sign in the tag is now a key in the dictionary. Note that most, but not all, of the values in the dictionary are strings. The value of the "class" key is a list, since it is possible to have more than one class name in a tag.
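A small sketch of the difference between ordinary attributes and "class". The second class name, "warning", is invented for the example to show a tag with more than one class.

```python
import bs4

# Hypothetical tag carrying two class names.
strPage = '<div class="sites-notice warning" id="sites-notice"></div>'
soup = bs4.BeautifulSoup(strPage, "html.parser")
d = soup.div

print(d["id"])     # sites-notice  (a plain string)
print(d["class"])  # ['sites-notice', 'warning']  (a list of strings)

# Because the value is a list, test for one class name with "in".
print("warning" in d["class"])  # True
```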

Creating a List of Tags

soup.find_all(tag-name)

We often want to get all tags of a particular kind. The "find_all" command does that. "find_all" in its simplest form takes one string argument which is the name of the tag to search for.

listLinks = soup.find_all('a')

creates a variable "listLinks" which contains a list of all "a" tags. One element of the list, such as listLinks[1], is a tag and will take all the method calls described above. You can loop over the elements with the usual "for item in listLinks:" syntax.
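For instance, the loop might look like the sketch below, which prints the text and the target of every link in a made-up page.

```python
import bs4

# Hypothetical page source with two links.
strPage = """<p><a href="index.html">Home</a></p>
<p><a href="about.html">About</a></p>"""
soup = bs4.BeautifulSoup(strPage, "html.parser")

listLinks = soup.find_all("a")
print(len(listLinks))  # 2

# Each item is a tag, so text and attribute lookups work as before.
for item in listLinks:
    print(item.text, item["href"])
# Home index.html
# About about.html
```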

"find_all" can carry out more complex searches. Here, we add only one of them: adding an attribute to the search.

soup.find_all(tag-name,attribute=string)

listDivs = soup.find_all('div',id='sites-status')

will create a list of "div" tags with the id attribute "sites-status".

There is one special case: the "class" attribute which we saw earlier. You cannot use the word "class" as a name in Python, because it is a reserved word, like "import" or "for". When you need to narrow a search using the class attribute, use the name "class_" instead, as in soup.find_all('div',class_='sites-attachments-name')
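Both kinds of narrowed search can be tried on a small made-up fragment. The file names and the extra <span> are assumptions added to show that the tag name and the attribute must both match.

```python
import bs4

# Hypothetical fragment: one status div, two attachment divs, one span.
strPage = """
<div id="sites-status" class="sites-status">status</div>
<div class="sites-attachments-name">report.pdf</div>
<div class="sites-attachments-name">notes.txt</div>
<span class="sites-attachments-name">not a div</span>
"""
soup = bs4.BeautifulSoup(strPage, "html.parser")

# Narrow by id.
listDivs = soup.find_all("div", id="sites-status")
print(len(listDivs))  # 1

# Narrow by class, using class_ because "class" is reserved in Python.
listNames = soup.find_all("div", class_="sites-attachments-name")
print([d.text for d in listNames])  # ['report.pdf', 'notes.txt']
```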

Basic Flow for a Web Scraping Task

This page only shows the most basic commands in BeautifulSoup. There are many shortcuts which you will want to learn if you do this often. However, the information on this page will get you started. Here are the basic steps if you find a web page that contains data to be scraped:

    • Create your soup object.

    • Print soup.prettify() to see the structure of the web page, and find out what tags contain the information you want.

    • Use find_all to create a list of tags of the desired type.

    • Loop over this list and print each tag.

    • Determine how you will identify the tags that you are interested in.

    • Use "if" statements in your loop to print only those tags of interest.

    • If the list looks correct, use "text" to get at the data, and manipulate the resulting strings (with things like strip and split) to produce the output you want.
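The steps above can be sketched end to end as follows. The page source, the "item" class, and the names and prices are all invented sample data; a real page will need its own test inside the "if" statement.

```python
import bs4

# Hypothetical page source: a list whose "item" entries hold the data.
strPage = """<ul>
<li class="heading">Name: Price</li>
<li class="item"> Widget : 4.50 </li>
<li class="item"> Gadget : 12.00 </li>
</ul>"""

# Create the soup object and look at the structure.
soup = bs4.BeautifulSoup(strPage, "html.parser")
print(soup.prettify())

# Collect every <li>, keep only the ones of interest, then clean the text.
rows = []
for li in soup.find_all("li"):
    # get returns the list of class names, or [] when there is no class.
    if "item" in li.get("class", []):
        name, price = li.text.split(":")
        rows.append((name.strip(), float(price.strip())))

print(rows)  # [('Widget', 4.5), ('Gadget', 12.0)]
```

Checking the class with "in" rather than comparing strings keeps the test working if a tag later gains a second class name.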