What is Web Scraping? — Python

Burak Vural
4 min read · Sep 25, 2023


Web Scraping HTML

Web scraping, also called data scraping, is the process of extracting data or collecting information from websites. In this article we do it with the Python programming language.

Web scraping can be used in a wide variety of projects, and Python offers many different libraries and tools for it.

Now let’s take a look at web scraping methods in Python.
Points to consider about web scraping:

Requests: Web scraping starts with sending HTTP requests to websites. The Python library “requests” makes this easier: you send a request to a URL and read the page from the response.

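import requests

url = "https://www.google.com/search?q=python"

# A timeout keeps the script from hanging if the server never answers
response = requests.get(url, timeout=10)

if response.status_code == 200:
    # Success: print the raw page content (bytes)
    print(response.content)
else:
    # Failure: print the HTTP status code
    print(response.status_code)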

Beautiful Soup: Beautiful Soup is a Python library used to parse HTML and XML documents and extract data from them. It simplifies data extraction and offers capabilities to search, filter and organize documents.

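from bs4 import BeautifulSoup

# "response" comes from the requests example above
soup = BeautifulSoup(response.content, "html.parser")

# Class names such as "g" depend on the page's current markup and may change
results = soup.find_all("div", class_="g")

for result in results:
    print(result.text)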

Or, as a complete example:

import requests
from bs4 import BeautifulSoup

url = "https://www.amazon.com/s?k=python+book"

response = requests.get(url, timeout=10)

if response.status_code == 200:
    # Success: parse the page
    soup = BeautifulSoup(response.content, "html.parser")

    products = soup.find_all("div", class_="s-result-item")

    for product in products:
        # Product title and price; find() returns None when nothing matches
        title = product.find("h2", class_="a-size-base a-color-base s-title")
        price = product.find("span", class_="a-price")
        if title is None or price is None:
            continue  # skip items that lack a title or a price

        print(f"Product Title: {title.text.strip()}")
        print(f"Product Price: {price.text.strip()}")
else:
    # Failure: print the HTTP status code
    print(response.status_code)

Example output:

Product Title: Python Crash Course: A Hands-On, Project-Based Introduction to Programming (2nd Edition)
Product Price: 29,99 USD

Product Title: Python for Beginners: A Practical Introduction to the Python Programming Language
Product Price: 19,99 USD

Product Title: Python Programming: An Introduction to Computer Science
Product Price: 12,99 USD

Scrapy: Scrapy is a Python framework for larger and more complex web scraping projects. It is customizable and supports multi-page crawls, data storage, and automation.

Selenium: Selenium automates a real web browser to load pages and retrieve their content. This is especially useful for dynamic websites, because the browser runs the JavaScript that generates the content.

Data Cleansing and Analysis: Data captured by web scraping usually needs to be cleaned and analyzed. Python libraries like Pandas are useful for organizing, analyzing and visualizing data.
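As a rough illustration of the cleaning step, here is a minimal sketch that uses only the standard library rather than Pandas, so it runs without installing anything. The helper name parse_price and the sample rows are hypothetical. It assumes the last comma or dot in a price is the decimal mark, which fits values like "29,99 USD" but would misread a thousands separator such as "1,299".

```python
from decimal import Decimal, InvalidOperation
import re


def parse_price(raw):
    """Turn a scraped price string such as '29,99 USD' into a Decimal, or None."""
    # Keep only digits and the two common separators.
    digits = re.sub(r"[^\d.,]", "", raw)
    if not digits:
        return None
    # Assumption: the last separator is the decimal mark; drop earlier ones.
    last_sep = max(digits.rfind(","), digits.rfind("."))
    if last_sep != -1:
        whole = re.sub(r"[.,]", "", digits[:last_sep])
        digits = f"{whole}.{digits[last_sep + 1:]}"
    try:
        return Decimal(digits)
    except InvalidOperation:
        return None


# Sample rows standing in for scraped data (illustrative only).
rows = [
    {"title": "  Python Crash Course ", "price": "29,99 USD"},
    {"title": "Python for Beginners", "price": "19,99 USD"},
    {"title": "Python for Beginners", "price": "19,99 USD"},  # duplicate row
    {"title": "No price listed", "price": ""},
]

cleaned = []
seen = set()
for row in rows:
    title = row["title"].strip()
    price = parse_price(row["price"])
    # Drop rows without a usable price and rows already seen.
    if price is None or title in seen:
        continue
    seen.add(title)
    cleaned.append({"title": title, "price": price})

print(cleaned)
```

With Pandas the same idea would typically be expressed on a DataFrame, but the steps stay the same: trim whitespace, normalize values, drop incomplete rows, and remove duplicates.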

Robots.txt: When web scraping, it is important to check the “robots.txt” file of websites. This file specifies which pages can be crawled and which cannot. It is important to follow robots.txt rules.
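The check can be automated with urllib.robotparser from the standard library. The sketch below parses a hard-coded, hypothetical robots.txt so that it runs offline; the bot name "my-scraper" is also an assumption. A real scraper would load the file from the site, for example with set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it
# from the site's /robots.txt path instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "my-scraper"  # hypothetical bot name


def is_allowed(url):
    """Return True if the parsed rules let USER_AGENT fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


print(is_allowed("https://www.xyz.com/ny-espana-tickets-all"))  # True
print(is_allowed("https://www.xyz.com/private/accounts"))       # False
print(parser.crawl_delay(USER_AGENT))                           # 5
```

Calling is_allowed() before every request, and waiting at least the crawl delay between requests, is one simple way to respect the rules.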

Legal and Ethical Issues: When web scraping, you should take into account the website’s terms of use and legal regulations. You may need to obtain consent before collecting business data or personal information.

Web scraping calls for care, and it is important to follow ethical rules.

Ethical rules and laws vary by country and region, so be careful to avoid illegal or unauthorized methods.

In web scraping projects, the data is usually stored in a database, either a SQL database or a NoSQL one such as MongoDB, running locally or on a server. You can also keep data in Excel, Word, or plain-text files.
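As one possible sketch of the SQL option, the example below stores a few rows in SQLite through the standard-library sqlite3 module. The table name, column names, and sample values are illustrative assumptions.

```python
import sqlite3

# Sample rows standing in for scraped data (values are illustrative).
tickets = [
    {"ticket_name": "Front row", "ticket_price": "120.00"},
    {"ticket_name": "Balcony", "ticket_price": "45.50"},
]

# ":memory:" keeps the database in RAM; pass a file path such as
# "tickets.db" to persist it on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS tickets ("
    "ticket_name TEXT PRIMARY KEY, ticket_price TEXT)"
)

# Named placeholders map onto the dict keys; INSERT OR REPLACE avoids
# duplicate rows when the same page is scraped twice.
conn.executemany(
    "INSERT OR REPLACE INTO tickets VALUES (:ticket_name, :ticket_price)",
    tickets,
)
conn.commit()

stored = conn.execute(
    "SELECT ticket_name, ticket_price FROM tickets ORDER BY ticket_name"
).fetchall()
print(stored)
conn.close()
```

SQLite needs no server, which makes it convenient for small projects; a server database or MongoDB is the more usual choice once several processes write at the same time.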

Let’s examine the web scraping stages:

Web platform and code analysis
Security precautions (CAPTCHA, IP blocking, location restrictions)
Algorithm control
Correct HTTP requests
HTML, CSS, and JS code analysis
Data scraping
Data storage

What are the Types of Web Scraping?

The commonly known types of web scraping are:

HTML parsing, DOM parsing, browser extensions, and headless (background) browsers
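To make the HTML parsing approach concrete, here is a minimal offline sketch. It uses html.parser from the standard library instead of Beautiful Soup, purely so that it runs without extra packages. The inline markup, the class names, and the TicketParser class are hypothetical.

```python
from html.parser import HTMLParser

# Inline sample markup (hypothetical) so the example runs offline.
HTML = """
<div class="ticket"><h3>Front row</h3><span class="price">120.00</span></div>
<div class="ticket"><h3>Balcony</h3><span class="price">45.50</span></div>
"""


class TicketParser(HTMLParser):
    """Collect the text of <h3> and <span class="price"> for each ticket div."""

    def __init__(self):
        super().__init__()
        self.tickets = []
        self._field = None  # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "ticket" in classes:
            # A new ticket starts: add an empty record to fill in.
            self.tickets.append({"ticket_name": "", "ticket_price": ""})
        elif tag == "h3" and self.tickets:
            self._field = "ticket_name"
        elif tag == "span" and "price" in classes and self.tickets:
            self._field = "ticket_price"

    def handle_endtag(self, tag):
        if tag in ("h3", "span"):
            self._field = None

    def handle_data(self, data):
        # Text is only kept while inside one of the tracked tags.
        if self._field:
            self.tickets[-1][self._field] += data.strip()


parser = TicketParser()
parser.feed(HTML)
print(parser.tickets)
```

This event-based parser does not track nesting, so it is only suitable for simple, regular markup. A library such as Beautiful Soup builds a full tree and is usually easier to work with on real pages.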
Web Scraping with Python

Libraries matter a great deal here.

The main ones are Requests, BeautifulSoup, Selenium, and Pandas.

Let’s look at a sample project and its code.

Let’s save some ticket information from the xyz website to MongoDB:

import requests
from bs4 import BeautifulSoup
import pymongo

# Create connection to MongoDB
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["ticket_org_db"]
collection = db["ny_espana_ticket_info"]

# Pull data from website
url = "https://www.xyz.com/ny-espana-tickets-all"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop here on an HTTP error
soup = BeautifulSoup(response.content, "html.parser")

# Process and retrieve data
tickets = []

for ticket in soup.find_all("div", class_="ticket"):
    name_tag = ticket.find("h3")
    price_tag = ticket.find("span", class_="price")
    if name_tag is None or price_tag is None:
        continue  # skip incomplete entries

    ticket_dict = {
        "ticket_name": name_tag.text.strip(),
        "ticket_price": price_tag.text.strip()
    }

    tickets.append(ticket_dict)

# Save data to MongoDB; insert_many() rejects an empty list
if tickets:
    collection.insert_many(tickets)

# Close the connection
client.close()

Some Details:

Popular Web Scraping Frameworks and Libraries:

LXML

Selenium

Requests

Mechanize

Beautiful Soup 4

Scrapy

Robots.txt and Platform Rules:

Some platforms publish crawling rules in their robots.txt files. Some also apply countermeasures such as IP blocking, rate limiting, data limits, time limits, and bot blockers. Examine the platform you plan to scrape carefully, analyze its code properly, and make sure that what you do is legal.
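One common way to stay within such limits is to slow down and retry when the server signals that requests are arriving too fast. The sketch below shows exponential backoff under stated assumptions: fetch_with_backoff is a hypothetical helper, and the fetch argument is a stand-in for a real HTTP call that returns a (status_code, body) pair.

```python
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds, waiting longer after each failure.

    `fetch` is any callable returning a (status_code, body) pair; it is a
    hypothetical stand-in for a real HTTP call such as requests.get.
    """
    for attempt in range(retries + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        # 429 (Too Many Requests) and 503 (Service Unavailable) usually
        # mean "slow down"; other errors are not retried.
        if status not in (429, 503) or attempt == retries:
            raise RuntimeError(f"Giving up on {url}: HTTP {status}")
        sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...


# Usage with a fake fetcher that is rate-limited twice, then succeeds.
responses = iter([(429, ""), (429, ""), (200, "<html>ok</html>")])
waits = []  # records the delays instead of actually sleeping
body = fetch_with_backoff(
    lambda url: next(responses),
    "https://www.xyz.com/ny-espana-tickets-all",
    sleep=waits.append,
)
print(body, waits)
```

Passing the sleep function as a parameter keeps the example fast to run and easy to test; in a real scraper the default time.sleep would be used.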

Project and Platform Analysis:

For web scraping to succeed, you may sometimes need proxy or VPN services and a custom User-Agent configuration. You may also need to work with spiders (crawlers), region-specific data, and APIs. Checking whether the platform offers an API, and which browser infrastructure it relies on, makes scraping more reliable.
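As a small sketch of what a User-Agent and proxy configuration can look like, the example below builds a request with the standard-library urllib.request module without sending it. The header values, the proxy address, and the helper names build_request and make_opener are all hypothetical.

```python
import urllib.request

URL = "https://www.xyz.com/ny-espana-tickets-all"

# Hypothetical values: identify your scraper honestly and give a contact.
HEADERS = {
    "User-Agent": "my-scraper/0.1 (+mailto:you@example.com)",
    "Accept-Language": "en-US,en;q=0.9",
}
# Hypothetical proxy address; pass an empty dict to connect directly.
PROXIES = {"https": "http://127.0.0.1:8080"}


def build_request(url):
    """Return a Request carrying the custom headers (nothing is sent yet)."""
    return urllib.request.Request(url, headers=HEADERS)


def make_opener(proxies):
    """Return an opener that routes traffic through the given proxies."""
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


request = build_request(URL)
opener = make_opener(PROXIES)
# To actually send it: opener.open(request, timeout=10).read()

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

The requests library accepts the same kind of settings through the headers, proxies, and timeout arguments of requests.get().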

Current

Study YouTube videos, Udemy and free courses, and sample code, and look closely at the code structures and algorithms they use. Using platforms such as GitHub and Stack Overflow effectively will help you succeed. Review projects and work through courses.

