Web Scraping In Python Using Beautifulsoup

The Internet is filled with digital data that we might need for research or for personal interest. To get this data, we need some web scraping skills.

Python has powerful tools for carrying out web scraping tasks easily and effectively, even on large amounts of data.

In this tutorial, we are going to use two third-party Python libraries: requests and beautifulsoup4.

What is web scraping?

Web scraping, or web data extraction, is the process of gathering information from the Internet. It can be as simple as copying and pasting data from specific websites, or as advanced as automated data collection from websites that have real-time data.

Some websites don’t mind having their data extracted, while others strictly prohibit it.

If you are scraping websites for educational purposes, you are unlikely to have any problem, but if you are starting a large-scale project, be sure to check the website’s Terms of Service.

Why do we need it?

Not all websites have APIs for fetching content, so in those cases we are left with only one option: scraping the content.

Steps for web scraping

  • Inspecting the source of data
  • Getting the HTML content
  • Parsing the HTML with Beautifulsoup

Now let’s move ahead and install the dependencies we’ll need for this tutorial.

Installing the dependencies

We are going to install the requests library, which fetches the HTML content of the website, and beautifulsoup4, which parses that HTML. Both can be installed with pip: pip install requests beautifulsoup4

Scraping the website

We are going to scrape the Wikipedia article on Python Programming Language. This webpage contains almost all types of HTML tags which will be good for us to test all aspects of BeautifulSoup.

1. Inspecting the source of data

Before writing any Python code, take a good look at the website you are going to scrape.

You need to understand the structure of the website to extract the relevant information for the project.

Go through the website thoroughly, perform basic actions, understand how the website works, and check the URLs, routes, query parameters, etc.

Inspecting the webpage using Developer Tools

Now, it’s time to inspect the DOM (Document Object Model) of the website using Developer Tools.

Developer Tools help in understanding the structure of the website. They can do a range of things, from inspecting the loaded HTML, CSS, and JavaScript to showing the assets the page has requested and how long they took to load. All modern browsers come with Developer Tools installed.

To open Developer Tools in Chrome on Windows, right-click on the webpage and click the Inspect option, or use the following keyboard shortcut:

Ctrl + Shift + I

On macOS, the shortcut is:

Cmd + Option + I

Now it’s time to look at the DOM of our webpage that we are going to scrape.

DOM View

We can see the HTML on the right that represents the structure of the page which we can see on the left side.

2. Get the HTML content

We need the requests library, which we have already installed, to fetch the HTML content of the website.

Next, open up your favorite IDE or Code Editor and retrieve the site’s HTML in just a few lines of Python code.
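A minimal sketch of this step, assuming the English Wikipedia URL for the Python article. The fetch_page helper, the HEADERS dictionary, and the User-Agent string are illustrative names rather than part of the original tutorial. The call itself is left commented out because it needs network access.

```python
import requests

# Assumed URL of the Wikipedia article on the Python programming language.
URL = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# Hypothetical User-Agent; some sites reject clients that do not identify themselves.
HEADERS = {"User-Agent": "bs4-tutorial-example/0.1"}


def fetch_page(url=URL, timeout=10):
    """Send a GET request and return the response, raising on HTTP errors."""
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()
    return r


# Usage (requires network access):
# r = fetch_page()
# print(r.text)  # the page's HTML as a string
```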

If we print the r.text we’ll get the same output as the HTML we inspected earlier with the browser’s developer tools. Now we have access to the site’s HTML in our Python script.

Now let’s parse the HTML using Beautiful Soup

3. Parse the HTML with Beautifulsoup

We have successfully fetched the HTML of the website, but there is a problem: if we look at it, the HTML elements, attributes, and tags are scattered all over the place. So we need to parse that lengthy response with Python code to make it more readable and accessible.

Beautiful Soup helps us to parse the structured data. It is a Python library for pulling out data from HTML and XML files.

Here we added some lines to our previous code. We added an import statement for Beautiful Soup and then created a Beautiful Soup object from r.content, the raw content of the response.

The second argument we passed to the Beautiful Soup object is html.parser, the parser built into Python. You must choose the right parser for the HTML content.
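The parsing step described above might look like the following sketch. To keep it runnable offline, a short hand-written HTML snippet stands in for r.content; the snippet is an assumption, not the real page.

```python
from bs4 import BeautifulSoup

# Stand-in for r.content (the bytes returned by requests); hypothetical markup.
content = b"<html><head><title>Python</title></head><body><p>Hello</p></body></html>"

# The second argument names the parser; "html.parser" ships with Python.
soup = BeautifulSoup(content, "html.parser")

print(soup.title.string)  # Python
```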

Find elements by ID

Elements in an HTML webpage can have an id attribute assigned to them. It makes an element in the page uniquely identifiable.

Beautiful Soup allows us to find a specific HTML element by its ID.

We can call .prettify() on any Beautiful Soup object to format the HTML for easier viewing. Here we called .prettify() on the id_content variable from above.
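As a sketch of both steps, the snippet below looks up an element by ID and pretty-prints it. The sample markup is hypothetical; bodyContent is used as the ID because the tutorial refers to it later.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
html = """
<html><body>
<div id="bodyContent"><p>Python is a programming language.</p></div>
<div id="footer"><p>Footer text</p></div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first element whose id attribute matches, or None.
id_content = soup.find(id="bodyContent")

# prettify() returns the element's HTML as an indented string.
print(id_content.prettify())
```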

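Note: We cannot call .prettify() directly on the result of .find_all(), because it returns a list of elements rather than a single element. Call .prettify() on each element instead.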

Find elements by Tag

In an HTML webpage, we encounter lots of HTML tags, and we might want the data that resides in those tags. For example, we might want the hyperlinks that reside in "a" (anchor) tags or the description from a "p" (paragraph) tag.
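A short sketch of searching by tag name, using a hypothetical two-paragraph snippet in place of the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = """
<div id="bodyContent">
  <p>Python is a <a href="/wiki/High-level">high-level</a> language.</p>
  <p>It was created by <a href="/wiki/Guido_van_Rossum">Guido van Rossum</a>.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")       # every <a> tag
paragraphs = soup.find_all("p")  # every <p> tag
first_link = soup.find("a")      # only the first <a> tag

print(len(links), len(paragraphs))  # 2 2
```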

Find elements by HTML Class Name

An HTML webpage can contain hundreds of elements like <div>, <p>, or <a> with classes assigned to them, and through these classes we can access the whole content inside a specific element.

Beautiful Soup provides a class_ argument to find the content present inside an element with a specified class name.
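A minimal sketch of a class search. The class names hatnote and thumb are placeholders chosen for this example, not necessarily the ones used in the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the class names are placeholders.
html = """
<div class="hatnote">For other uses, see Python.</div>
<div class="thumb">Image box</div>
<p class="hatnote">A paragraph with the same class.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because "class" is a reserved word in Python.
notes = soup.find_all("div", class_="hatnote")

print(len(notes))           # 1
print(notes[0].get_text())  # For other uses, see Python.
```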

The first argument we passed to the method is the element name and the second argument is the class name.

Find elements by Text Content and Class name

Beautiful Soup provides a string argument that allows us to search for a string instead of a tag. We can pass in a string, a regular expression, a list, a function, or the value True.

We can also find the tags whose value matches the specified value for the string argument.

Here we are looking for a <p> tag whose text is “Python”. But if we move ahead and try to print the result, we’ll get an empty result.

This is because string= looks for exactly the value we provide. Any extra text, whitespace, difference in spelling, or difference in capitalization will prevent the element from matching.

If we provide the exact value then the program will run successfully.
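The exact-match behavior can be seen in a small sketch. The two sample paragraphs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = "<div><p>Python is a programming language.</p><p>Guido van Rossum created it.</p></div>"
soup = BeautifulSoup(html, "html.parser")

# No <p> consists of exactly "Python", so nothing matches.
partial = soup.find_all("p", string="Python")

# The full text of the first paragraph matches exactly.
exact = soup.find_all("p", string="Python is a programming language.")

print(partial)     # []
print(len(exact))  # 1
```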

Passing a Function

In the above section, when we tried to find the <p> tag containing the string “Python”, we came up empty.

But Beautiful Soup allows us to pass a function as an argument. With a function, we can modify the above code so that it works.

Here we created a function called has_python, which takes text as an argument and checks whether that text is present in a <p> tag.

Next, we passed that function to the string argument along with the string “Python”. Then we printed the number of occurrences of “Python” in the <p> tags.
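One way to write such a function is sketched below. This is an illustrative version, not necessarily the original code: Beautiful Soup calls the function with each tag's string and keeps the tags for which it returns True. The sample markup is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = """
<div>
<p>Python</p>
<p>Python is a programming language.</p>
<p>Java is a different language.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def has_python(text):
    """Return True when the tag's string contains the word "Python"."""
    # text can be None for tags without a single string child, so guard first.
    return text is not None and "Python" in text


matches = soup.find_all("p", string=has_python)
print(len(matches))  # 2
```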

Extract Text from HTML elements

What if we do not want the content with the HTML tags attached to them? What if we want clean and simple text data from the elements and tags?

We can use .text or .get_text() to return only the text content of the HTML elements that we pass in the Beautiful Soup object.

We’ll get the whole table as output in text format. But there will be a lot of whitespace around the text, so we need to remove it by calling the .strip() method.
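A small sketch of text extraction from a table. The table markup and the infobox class name are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table with stray whitespace.
html = """
<table class="infobox">
  <tr><th>Paradigm</th><td>  Multi-paradigm  </td></tr>
  <tr><th>First appeared</th><td>1991</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="infobox")

raw = table.get_text()  # same result as table.text
clean = raw.strip()     # removes leading and trailing whitespace only

# get_text(strip=True) strips each piece of text as it is collected.
cells = [td.get_text(strip=True) for td in table.find_all("td")]
print(cells)  # ['Multi-paradigm', '1991']
```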

There are other ways also to remove whitespaces. Check it out here.

Extract Attributes from HTML elements

An HTML page has numerous attributes like href, src, style, title, and more. Since an HTML webpage contains a large number of <a> tags with href attributes, we are going to scrape all the href attributes present on our website.

We cannot scrape the attributes as we did in the above examples.

We simply looped over all the <a> tags in the main content of the HTML page and then used .get('href') to get each href attribute.
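The loop described above can be sketched as follows, again with hypothetical markup. Note that .get() returns None when a tag has no such attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for the main content area.
html = (
    '<div id="bodyContent">'
    '<a href="/wiki/Guido_van_Rossum">Guido</a>'
    '<a href="https://www.python.org/">python.org</a>'
    '<a name="top">anchor without href</a>'
    '<img src="/logo.png">'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
main = soup.find(id="bodyContent")

hrefs = [a.get("href") for a in main.find_all("a")]
srcs = [img.get("src") for img in main.find_all("img")]

print(hrefs)  # ['/wiki/Guido_van_Rossum', 'https://www.python.org/', None]
print(srcs)   # ['/logo.png']
```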

You can do the same for the src attributes also.

Access Parent and Sibling elements

Beautiful Soup allows us to access an element’s parent through the .parent attribute.

By chaining .parent, we can also reach the grandparent or great-grandparent of a specific element.
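A brief sketch of walking up the tree one step at a time, using made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = '<html><body><div id="bodyContent"><p><a href="#x">link</a></p></div></body></html>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")

print(link.parent.name)                # p
print(link.parent.parent.name)         # div
print(link.parent.parent.parent.name)  # body
```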

Beautiful Soup also provides .parents, which lets us iterate over all of an element’s parents.
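Iterating over .parents might look like this sketch. The markup is hypothetical; on a real page the list of ancestors would be longer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = '<html><body><div id="bodyContent"><p><a href="#x">link</a></p></div></body></html>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")

# .parents yields each ancestor in turn, ending with the BeautifulSoup object,
# whose name is "[document]".
names = [elem.name for elem in link.parents]
print(names)  # ['p', 'div', 'body', 'html', '[document]']
```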

Note: This program might take a little time to complete so wait until the program is finished.

Output for elem.name would be

Similarly, we can access an element’s next and previous siblings by using .next_sibling and .previous_sibling respectively.
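A minimal sketch of stepping to the neighboring elements. The list markup is an assumption, written without whitespace between tags because newlines between tags would themselves count as siblings.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with no whitespace between the <li> tags.
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

second = soup.find_all("li")[1]

print(second.next_sibling.get_text())      # three
print(second.previous_sibling.get_text())  # one
```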

We can also iterate over a tag’s siblings using .next_siblings or .previous_siblings.

Iterating over all the next siblings
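As a sketch, starting from the first item of a hypothetical list:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with no whitespace between the <li> tags.
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")

# .next_siblings yields every sibling that comes after the element.
following = [sibling.get_text() for sibling in first.next_siblings]
print(following)  # ['two', 'three']
```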

Iterating over all the previous siblings
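A matching sketch that starts from the last item of the same hypothetical list. The siblings come back nearest first, so the order is reversed relative to the document.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with no whitespace between the <li> tags.
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

last = soup.find_all("li")[-1]

# .previous_siblings yields every sibling before the element, nearest first.
preceding = [sibling.get_text() for sibling in last.previous_siblings]
print(preceding)  # ['two', 'one']
```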

Using Regular Expression

Last but not least, we can use regular expressions to search for an element, tag, text, etc., in the HTML tree.
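A sketch of a regular-expression search on tag names. The markup is hypothetical; bodyContent is the ID the tutorial refers to.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = '<div id="bodyContent"><p>text</p><pre>code</pre><span>s</span><b>bold</b></div>'
soup = BeautifulSoup(html, "html.parser")

body = soup.find(id="bodyContent")

# A compiled pattern passed as the tag name is tested against each tag's name.
p_like = [tag.name for tag in body.find_all(re.compile("^p"))]
print(p_like)  # ['p', 'pre']
```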

This code will find all the tags whose names start with p inside the HTML element that has id=bodyContent.
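A second sketch uses the pattern \w on the same kind of hypothetical markup:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = '<div id="bodyContent"><p>text</p><pre>code</pre><span>s</span><b>bold</b></div>'
soup = BeautifulSoup(html, "html.parser")

body = soup.find(id="bodyContent")

# \w matches a single word character: a letter, a digit, or an underscore.
all_tags = [tag.name for tag in body.find_all(re.compile(r"\w"))]
print(all_tags)  # ['p', 'pre', 'span', 'b']
```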

This pattern matches any alphanumeric character, which means a-z, A-Z, and 0-9. It also matches the underscore, _. Since every tag name contains at least one such character, it will return all the tags inside the element we searched.

Conclusion

We learned how to scrape a static website. Things can be different for dynamic websites, which return different data on different requests, or for hidden websites that require authentication. There are more powerful scraping tools available for these types of websites, like Selenium, Scrapy, etc.

The requests library gives us access to the site’s HTML, which we can then pull data out of using Beautiful Soup.

There are many methods and functions that we haven’t seen, but we discussed the key ones that are used most commonly.


That’s all for now

Keep Coding✌✌