The Internet is full of digital data that we might need for research or for personal interest. To collect that data, we need some web scraping skills.
Python has powerful tools for carrying out web scraping tasks easily and effectively, even on large amounts of data.
In this tutorial, we are going to use the requests and beautifulsoup libraries, two third-party packages from the Python ecosystem.
What is web scraping?
Web scraping, or web data extraction, is the process of gathering information from the Internet. It can be a simple copy-paste of data from specific websites, or it can be an advanced data collection from websites that have real-time data.
Some websites don’t mind having their data extracted, while others strictly prohibit data extraction.
If you are scraping websites for educational purposes, you are unlikely to have any problem, but if you are starting a large-scale project, be sure to check the website’s Terms of Service.
Why do we need it?
Not all websites have APIs for fetching content, so in those cases we are left with only one option: scraping the content.
Steps for web scraping
- Inspecting the source of data
- Getting the HTML content
- Parsing the HTML with Beautifulsoup
Now let’s move ahead and install the dependencies we’ll need for this tutorial.
Installing the dependencies
We are going to install the requests library, which helps us get the HTML content of the website, and beautifulsoup4, which parses the HTML.
    pip install requests beautifulsoup4
Scraping the website
We are going to scrape the Wikipedia article on the Python programming language. This webpage contains a wide variety of HTML tags, which makes it a good test for many aspects of Beautiful Soup.
1. Inspecting the source of data
Before writing any Python code, take a good look at the website you are going to scrape.
You need to understand the structure of the website to extract the relevant information for the project.
Go through the website thoroughly: perform basic actions, understand how the website works, and check the URLs, routes, query parameters, etc.
Inspecting the webpage using Developer Tools
Now, it’s time to inspect the DOM (Document Object Model) of the website using Developer Tools.
Developer Tools help in understanding the structure of the website. They can do a range of things, from inspecting the loaded HTML, CSS, and JavaScript to showing the assets the page has requested and how long they took to load. All modern browsers come with Developer Tools installed.
To open Developer Tools in Chrome on Windows, right-click on the webpage and choose the Inspect option, or use the following keyboard shortcut:
Ctrl + Shift + I
On macOS, the shortcut is:
⌘ + ⌥ + I
Now it’s time to look at the DOM of our webpage that we are going to scrape.
We can see the HTML on the right that represents the structure of the page which we can see on the left side.
2. Get the HTML content
We need the requests library, which we already installed, to download the HTML content of the website.
Next, open up your favorite IDE or code editor and retrieve the site’s HTML in just a few lines of Python code.
    import requests

    url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

    # Step 1: Get the HTML
    r = requests.get(url)
    htmlContent = r.content

    # Getting the content as raw bytes
    print(htmlContent)

    # Getting the content decoded as text
    print(r.text)
If we print r.text, we’ll get the same output as the HTML we inspected earlier with the browser’s developer tools. Now we have access to the site’s HTML in our Python script.
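The snippet above assumes the request always succeeds. As a hedged sketch, the download step could be wrapped in a small helper that sets a timeout and raises on HTTP errors. The function name fetch_html and the User-Agent value are illustrative assumptions, not part of the original tutorial.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded HTML of `url`, raising on HTTP or network errors."""
    # Placeholder User-Agent (an assumption); identify your own script here.
    headers = {"User-Agent": "scraping-tutorial-example/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently handing an error page to the parser.
    response.raise_for_status()
    return response.text


# Example call (performs a real network request):
# html = fetch_html("https://en.wikipedia.org/wiki/Python_(programming_language)")
```

Without a timeout, requests.get can wait indefinitely on an unresponsive server, so passing one is a sensible default for scraping scripts.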
Now let’s parse the HTML using Beautiful Soup.
3. Parse the HTML with Beautifulsoup
We have successfully scraped the HTML of the website, but there is a problem: the raw response is a long string in which elements, attributes, and tags are hard to tell apart. So we need to parse that lengthy response with Python code to make it more readable and accessible.
Beautiful Soup helps us to parse the structured data. It is a Python library for pulling out data from HTML and XML files.
    import requests
    from bs4 import BeautifulSoup

    url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

    # Step 1: Get the HTML
    r = requests.get(url)
    content = r.content

    # Step 2: Parse the HTML
    soup = BeautifulSoup(content, 'html.parser')
    print(soup)
Here we added some lines to our previous code: an import statement for Beautiful Soup, and a Beautiful Soup object that takes content, which holds the value of r.content.
The second argument we passed to the Beautiful Soup object is html.parser, the parser built into Python’s standard library. You must choose the right parser for the HTML content.
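To see what parsing gives us without making a network request, here is a minimal sketch that parses a small, made-up HTML string (an assumption standing in for a downloaded page). Beautiful Soup can also use other parsers such as lxml or html5lib if they are installed separately; this sketch sticks to html.parser.

```python
from bs4 import BeautifulSoup

# A small hypothetical document standing in for a downloaded page.
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="firstHeading">Python</h1>
    <p>First paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string        # text inside <title>
first_paragraph = soup.p.get_text()   # text of the first <p>

print(title_text)
print(first_paragraph)
```

Once the document is parsed, tags become attributes of the soup object, which is what the following sections build on.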
Find elements by ID
Elements in an HTML webpage can have an id attribute assigned to them. It makes an element in the page uniquely identifiable.
Beautiful Soup allows us to find a specific HTML element by its ID.
    import requests
    from bs4 import BeautifulSoup

    url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

    r = requests.get(url)
    content = r.content

    soup = BeautifulSoup(content, 'html.parser')

    id_content = soup.find(id="firstHeading")
We can call .prettify() on any Beautiful Soup object to format the HTML for easier viewing. Here we call .prettify() on the id_content variable from above.
    print(id_content.prettify())
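Note: We cannot call .prettify() directly on the result of .find_all(), because it returns a list of elements rather than a single element. Call .prettify() on each element instead.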
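As a small sketch of that note, the following uses a made-up HTML snippet (an assumption for illustration) and prettifies each element returned by .find_all() individually.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = "<ul><li>one</li><li>two</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like ResultSet, so prettify each item in turn.
items = soup.find_all("li")
pretty_items = [item.prettify() for item in items]

for pretty in pretty_items:
    print(pretty)
```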
Find elements by Tag
In an HTML webpage, we encounter lots of HTML tags, and we might want the data that resides in them. For example, we might want the hyperlinks in the "a" (anchor) tags, or the description in a "p" (paragraph) tag.
    import requests
    from bs4 import BeautifulSoup

    url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

    r = requests.get(url)
    content = r.content

    soup = BeautifulSoup(content, 'html.parser')

    # Getting the first <code> tag
    find_tag = soup.find("code")
    print(find_tag.prettify())

    # Getting all the <pre> tags
    all_pre_tag = soup.find_all("pre")

    for pre_tag in all_pre_tag:
        print(pre_tag)
Find elements by HTML Class Name
An HTML webpage can contain hundreds of elements such as <div>, <p>, or <a> with classes assigned to them, and through these classes we can access the whole content inside a specific element.
Beautiful Soup provides a class_ argument to find an element with a specified class name.
    import requests
    from bs4 import BeautifulSoup

    url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

    r = requests.get(url)
    content = r.content

    soup = BeautifulSoup(content, 'html.parser')

    # Getting the "div" element with class name "mw-highlight"
    class_elem = soup.find("div", class_="mw-highlight")
    print(class_elem.prettify())
The first argument we passed to .find() is the tag name, and the class_ argument is the class name.
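Two details are worth a quick illustration: class_ matches an element even when it carries several classes, and .find() returns None when nothing matches. The sketch below uses made-up markup (the class names other than mw-highlight are assumptions).

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the first <div> carries two classes.
html = """
<div class="mw-highlight mw-content-ltr"><pre>print('hi')</pre></div>
<div class="note">plain note</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches because "mw-highlight" is one of the element's classes.
highlight = soup.find("div", class_="mw-highlight")

# No element has this class, so find() returns None.
missing = soup.find("div", class_="does-not-exist")

print(highlight["class"])
print(missing)
```

Because .find() can return None, it is safer to check the result before calling methods such as .prettify() on it.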
Find elements by Text Content and Class name
Beautiful Soup provides a string argument that allows us to search for a string instead of a tag. We can pass in a string, a regular expression, a list, a function, or the value True.
    # Getting all the strings whose value is "Python"
    find_str = soup.find_all(string="Python")
    print(find_str)
Output:
    ['Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python']
We can also find the tags whose value matches the specified value for the string argument.
    find_str_tag = soup.find_all("p", string="Python")
Here we are looking for <p> tags whose value is “Python”. But if we move ahead and try to print the result, we’ll get an empty list.
    print(find_str_tag)
Output:
    []
This is because string= looks for exactly the value we provide. Any extra text, whitespace, difference in spelling, or difference in capitalization will prevent the element from matching.
If we provide the exact value, the search succeeds.
    find_str_tag = soup.find_all("span", string="Typing")
    print(find_str_tag)
Output:
    [<span class="toctext">Typing</span>, <span class="mw-headline" id="Typing">Typing</span>]
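Since the string argument also accepts a regular expression, a partial match is possible. The sketch below contrasts an exact match with a regex match on made-up markup (the paragraphs are assumptions for illustration).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical paragraphs used only for this illustration.
html = """
<p>Python</p>
<p>Python is dynamically typed.</p>
<p>Typing</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact match: only the paragraph whose whole text is "Python".
exact = soup.find_all("p", string="Python")

# Regex match: any paragraph whose text contains "Python".
partial = soup.find_all("p", string=re.compile("Python"))

print(len(exact))
print(len(partial))
```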
Passing a Function
In the above section, when we tried to find the <p> tags containing the string “Python”, we came up empty.
But Beautiful Soup allows us to pass a function as an argument. We can modify the above code to work by using a function.
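    # Creating a function that receives the .string of each <p> tag
    def has_python(text):
        # text is None when the tag has no single string (e.g. nested tags)
        return text is not None and "Python" in text

    find_str_tag = soup.find_all("p", string=has_python)
    print(len(find_str_tag))
Here we created a function called has_python, which takes text as an argument and returns True when that text contains “Python”.
Next, we passed the function itself, without calling it, to the string argument, so Beautiful Soup calls it with the .string of every <p> tag. Then we printed the number of <p> tags that matched.
The count depends on the current content of the page. A <p> tag whose text is split up by nested tags, such as links, has no single .string and is therefore not counted.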
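A function can also be passed as the tag filter itself, in which case it receives each tag and can inspect the tag's full text. The sketch below compares both approaches on made-up markup; the function names and the HTML are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraphs; the second one contains a nested <a> tag.
html = """
<p>Python</p>
<p><a href="/wiki/Python">Python</a> is a programming language.</p>
<p>Nothing relevant here.</p>
"""
soup = BeautifulSoup(html, "html.parser")


def string_has_python(text):
    # Receives the tag's .string, which is None for tags with mixed content.
    return text is not None and "Python" in text


def paragraph_mentions_python(tag):
    # Receives each tag; get_text() includes text inside nested tags.
    return tag.name == "p" and "Python" in tag.get_text()


by_string = soup.find_all("p", string=string_has_python)
by_text = soup.find_all(paragraph_mentions_python)

print(len(by_string))
print(len(by_text))
```

The tag-filter version also finds the paragraph whose text is split by a link, which the string-based version skips.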
Extract Text from HTML elements
What if we do not want the content with the HTML tags attached to them? What if we want clean and simple text data from the elements and tags?
We can use .text or .get_text() on an element to return only its text content, without the HTML tags.
    import requests
    from bs4 import BeautifulSoup

    url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

    r = requests.get(url)
    content = r.content

    soup = BeautifulSoup(content, 'html.parser')

    table_elements = soup.find_all("table", class_="wikitable")

    for table_data in table_elements:
        table_body = table_data.find("tbody")
        print(table_body.text)
        # or
        print(table_body.get_text())
We’ll get the whole table as text output. But the text will be surrounded by a lot of whitespace, so we can strip the leading and trailing whitespace with the .strip() method.
    print(table_body.text.strip())
There are other ways to remove whitespace as well.
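One such way is built into Beautiful Soup: .get_text() accepts separator and strip arguments, which clean up whitespace between pieces of text as well as around them. The sketch below uses a small made-up table (an assumption standing in for a real wikitable).

```python
from bs4 import BeautifulSoup

# Hypothetical table with padded cell contents.
html = """
<table class="wikitable">
  <tbody>
    <tr><th> Type </th><th> Mutable </th></tr>
    <tr><td> list </td><td> yes </td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table_body = soup.find("table", class_="wikitable").find("tbody")

# Collect each row as a list of cleaned cell strings.
rows = []
for row in table_body.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    rows.append(cells)

# Or flatten the whole table body into one delimited string.
compact = table_body.get_text(separator=" | ", strip=True)

print(rows)
print(compact)
```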
Extract Attributes from HTML elements
An HTML page has numerous attributes like href, src, style, title, and more. Since an HTML webpage contains a large number of <a> tags with href attributes, we are going to scrape all the href attributes present on our page.
We cannot scrape attributes the same way we scraped elements in the above examples.
    # Accessing href in the main content of the HTML page
    anchor_in_body_content = soup.find(id="bodyContent")

    # Finding all the anchor tags
    anchors = anchor_in_body_content.find_all("a")

    # Looping over all the anchor tags to get the href attribute
    for link in anchors:
        links = link.get('href')
        print(links)
We simply looped over all the <a> tags in the main content of the HTML page and then used .get('href') to get each href attribute.
You can do the same for the src attributes.
    # Accessing src in body of the HTML page
    img_in_body_content = soup.find(id="bodyContent")

    # Finding all the img tags
    media = img_in_body_content.find_all("img")

    # Looping over all the img tags to get the src attribute
    for img in media:
        images = img.get('src')
        print(images)
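Scraped href values are often relative (for example /wiki/...), and some <a> tags have no href at all, in which case .get('href') returns None. As a hedged sketch, the standard library's urljoin can turn them into absolute URLs. The markup below is made up for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# Hypothetical markup mixing relative, fragment, missing, and absolute links.
html = """
<div id="bodyContent">
  <a href="/wiki/Guido_van_Rossum">Guido van Rossum</a>
  <a href="#History">History</a>
  <a name="anchor-without-href">no link</a>
  <a href="https://www.python.org/">python.org</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
body = soup.find(id="bodyContent")

absolute_links = []
for link in body.find_all("a"):
    href = link.get("href")  # None when the attribute is missing
    if href is None:
        continue
    # Resolve the link against the URL of the page it came from.
    absolute_links.append(urljoin(base_url, href))

for url in absolute_links:
    print(url)
```

An alternative is body.find_all("a", href=True), which only returns anchors that actually have an href attribute.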
Access Parent and Sibling elements
Beautiful Soup allows us to access an element’s parent by using the .parent attribute.
    import requests
    from bs4 import BeautifulSoup

    url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

    r = requests.get(url)
    content = r.content

    soup = BeautifulSoup(content, 'html.parser')

    id_content = soup.find(id="cite_ref-123")

    parent_elem = id_content.parent
    print(parent_elem)
We can also find the grandparent or great-grandparent of a specific element by chaining .parent.
    id_content = soup.find(id="cite_ref-123")

    grandparent_elem = id_content.parent.parent
    print(grandparent_elem)
Beautiful Soup also provides the .parents attribute, which lets us iterate over all of an element’s parents.
    id_content = soup.find(id="cite_ref-123")

    for elem in id_content.parents:
        print(elem)  # to print the elements

        print(elem.name)  # to print only the names of elements
Note: This program might take a little time to complete, because each parent prints a large chunk of the page, so wait until it has finished.
The output for elem.name would be:
    p
    div
    div
    div
    div
    body
    html
    [document]
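When only one particular ancestor is needed, .find_parent() jumps straight to the nearest matching one instead of iterating over every parent. The sketch below uses made-up markup; the ids are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a citation marker nested inside a paragraph.
html = """
<div id="content">
  <p>See the reference<sup id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup>.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
cite = soup.find(id="cite_ref-1")

# Names of every ancestor, from the closest one up to the document itself.
parent_names = [parent.name for parent in cite.parents]

# The closest enclosing <div>, skipping the <p> in between.
nearest_div = cite.find_parent("div")

print(parent_names)
print(nearest_div["id"])
```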
Similarly, we can access an element’s next and previous siblings by using .next_sibling and .previous_sibling respectively.
    id_content = soup.find(id="cite_ref-123")

    # To print the next sibling of an element
    next_sibling_elem = id_content.next_sibling
    print(next_sibling_elem)
    id_content = soup.find(id="cite_ref-123")

    # To print the previous sibling of an element
    previous_sibling_elem = id_content.previous_sibling
    print(previous_sibling_elem)
We can also iterate over a tag’s siblings using .next_siblings or .previous_siblings.
Iterating over all the next siblings
    id_content = soup.find(id="cite_ref-123")

    for next_elem in id_content.next_siblings:
        print(next_elem)
Iterating over all the previous siblings
    id_content = soup.find(id="cite_ref-123")

    for previous_elem in id_content.previous_siblings:
        print(previous_elem)
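One thing to keep in mind is that siblings include text nodes, so .next_sibling is often just the whitespace between two tags. The sketch below, on a made-up list (an assumption for illustration), shows the difference from .find_next_sibling(), which only returns tags.

```python
from bs4 import BeautifulSoup

# Hypothetical list; the line breaks between <li> tags are text nodes.
html = """
<ul>
  <li id="first">one</li>
  <li id="second">two</li>
  <li id="third">three</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="first")

# The immediate sibling is the whitespace between the two <li> tags.
raw_next = first.next_sibling

# find_next_sibling() skips text nodes and returns the next matching tag.
tag_next = first.find_next_sibling("li")

# find_next_siblings() returns all later matching tags.
later_ids = [li["id"] for li in first.find_next_siblings("li")]

print(repr(raw_next))
print(tag_next["id"])
print(later_ids)
```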
Using Regular Expression
Last but not least, we can use regular expressions to search for an element, tag, text, etc., in the HTML tree.
This code will find all the tags whose names start with p inside the HTML element that has id=bodyContent.
    import requests
    from bs4 import BeautifulSoup
    import re

    url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

    r = requests.get(url)
    content = r.content

    soup = BeautifulSoup(content, 'html.parser')

    id_content = soup.find(id="bodyContent")

    for tag in id_content.find_all(re.compile("^p")):
        print(tag.name)
The next snippet uses \w, which matches any alphanumeric character (a-z, A-Z, and 0-9) as well as the underscore, _. Since every tag name contains at least one such character, it returns all the tags inside the element we searched.
    id_content = soup.find(id="bodyContent")

    # A raw string keeps the backslash in \w from being treated as an escape
    for tag in id_content.find_all(re.compile(r"\w")):
        print(tag.name)
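A more selective pattern is often more useful than matching everything. As a sketch on made-up markup (an assumption for illustration), the pattern below matches only the heading tags h1 to h6 and is compared with the "starts with p" pattern.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical content block with a mix of tags.
html = """
<div id="bodyContent">
  <h2>History</h2>
  <p>Some text.</p>
  <pre>code</pre>
  <h3>Typing</h3>
  <header>not a heading level</header>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
body = soup.find(id="bodyContent")

# Tag names beginning with "p": matches <p> and <pre>.
starts_with_p = [tag.name for tag in body.find_all(re.compile(r"^p"))]

# Exactly h1..h6; the trailing $ keeps <header> from matching.
headings = [tag.name for tag in body.find_all(re.compile(r"^h[1-6]$"))]

print(starts_with_p)
print(headings)
```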
Conclusion
Well, we learned how to scrape a static website. Things can be different for dynamic websites, which return different data on different requests, or for websites that sit behind authentication. There are more powerful scraping tools available for these types of websites, such as Selenium and Scrapy.
The requests library gives us access to the site’s HTML, which we can then pull data out of using Beautiful Soup.
There are many methods and functions still available that we haven’t seen but we discussed some key functions and methods that are used most commonly.
That’s all for now
Keep Coding✌✌