What is Web Scraping? — Python
Web scraping (also called data scraping) is the process of extracting data or collecting information from websites, and it is commonly performed with the Python programming language.
Web scraping can be used in a wide variety of projects, and there are many different libraries and tools for it.
Now let's take a look at web scraping methods in Python;
Points to consider about web scraping:
Requests: The basis of web scraping is sending HTTP requests to websites. The Python library "requests" makes this process easy: you can retrieve and inspect web pages by sending requests.
import requests

url = "https://www.google.com/search?q=python"
response = requests.get(url)

if response.status_code == 200:
    # Request succeeded: print the raw HTML of the page
    print(response.content)
else:
    # Request failed: print the HTTP status code for debugging
    print(response.status_code)
Beautiful Soup: Beautiful Soup is a Python library used to parse HTML and XML documents and extract data from them. It simplifies data extraction and offers capabilities to search, filter, and navigate the parsed document tree.
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, "html.parser")
# Each search result is expected to be a div with the "g" class
results = soup.find_all("div", class_="g")
for result in results:
    print(result.text)
################# or ############################
import requests
from bs4 import BeautifulSoup

url = "https://www.amazon.com/s?k=python+book"
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    # Each search result is expected to be a div with the "s-result-item" class
    products = soup.find_all("div", class_="s-result-item")
    for product in products:
        # Product title and price
        title = product.find("h2", class_="a-size-base a-color-base s-title")
        price = product.find("span", class_="a-price")
        # Skip result blocks that do not contain both a title and a price
        if title and price:
            print(f"Product Name: {title.text}")
            print(f"Product Price: {price.text}")
else:
    # Request failed: print the HTTP status code for debugging
    print(response.status_code)
###################### Print ########################
Product Name: Python Crash Course: A Hands-On, Project-Based Introduction to Programming (2nd Edition)
Product Price: 29,99 USD
Product Name: Python for Beginners: A Practical Introduction to the Python Programming Language
Product Price: 19,99 USD
Product Name: Python Programming: An Introduction to Computer Science
Product Price: 12,99 USD
Scrapy: Scrapy is a Python framework used for larger and more complex web scraping projects. Scrapy is highly customizable and offers multi-page crawling, data storage, and automation capabilities.
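As a rough illustration, here is a minimal sketch of a Scrapy spider. It targets quotes.toscrape.com, a public practice site; the spider name, selectors, and output fields are chosen only for this example.
import scrapy

class QuotesSpider(scrapy.Spider):
    # Example spider: scrapes quotes and follows pagination links
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "Next" link, if present, to crawl additional pages
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
Saved as quotes_spider.py, this can be run with "scrapy runspider quotes_spider.py -o quotes.json" to write the scraped items to a JSON file.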
Selenium: Selenium automates a real web browser, so pages are retrieved exactly as a user would see them. This is especially useful for dynamic websites, because it can handle content rendered with JavaScript.
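A minimal sketch of this idea is below; the URL and the "div.item" selector are hypothetical placeholders, and it assumes Chrome and a compatible driver are installed.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Start a Chrome session (Chrome and a compatible driver must be available)
driver = webdriver.Chrome()
try:
    driver.get("https://www.example.com/dynamic-page")  # hypothetical URL
    # Wait up to 10 seconds for the JavaScript-rendered items to appear
    items = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.item"))
    )
    for item in items:
        print(item.text)
finally:
    driver.quit()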
Data Cleansing and Analysis: Data collected by web scraping usually needs to be cleaned and analyzed. Python libraries such as Pandas are useful for organizing, analyzing, and visualizing that data.
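For instance, here is a small sketch of cleaning price strings like the ones scraped above with Pandas; the sample records are made up for illustration.
import pandas as pd

# Hypothetical records, similar to the title/price pairs scraped above
records = [
    {"title": "Python Crash Course", "price": "29,99 USD"},
    {"title": "Python for Beginners", "price": "19,99 USD"},
    {"title": "Python for Beginners", "price": "19,99 USD"},  # duplicate row
]

df = pd.DataFrame(records)
# Drop duplicate rows, which often appear when scraping paginated lists
df = df.drop_duplicates()
# Turn "29,99 USD" style strings into numeric values for analysis
df["price"] = (
    df["price"]
    .str.replace(" USD", "", regex=False)
    .str.replace(",", ".", regex=False)
    .astype(float)
)
print(df.describe())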
Robots.txt: When web scraping, it is important to check the “robots.txt” file of websites. This file specifies which pages can be crawled and which cannot. It is important to follow robots.txt rules.
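Python's standard library can check robots.txt rules for you; a minimal sketch with urllib.robotparser is below (the site URL and user-agent name are placeholders).
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")  # placeholder site
parser.read()

# Ask whether the given user agent may fetch a specific path
if parser.can_fetch("MyScraperBot", "https://www.example.com/some/page"):
    print("Crawling this path is allowed.")
else:
    print("Crawling this path is disallowed by robots.txt.")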
Legal and Ethical Issues: When web scraping, you should take into account the website’s terms of use and legal regulations. You may need to obtain consent before collecting business data or personal information.
There are some important issues you need to pay attention to when doing web scraping, and it is important to follow ethical rules.
Ethics rules and laws may vary depending on your country and region, so be sensitive and careful, and avoid illegal or unofficial methods.
In web scraping projects, the collected data is usually stored in a database such as a SQL database, MongoDB, or another local or NoSQL store. You can also keep the data in Excel, Word, or plain-text files.
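As a simple illustration, here is a sketch of storing scraped rows in a local SQLite database; the database file, table name, and sample rows are invented for the example (a MongoDB example appears later in this article).
import sqlite3

# Hypothetical rows scraped earlier: (title, price) pairs
rows = [("Python Crash Course", 29.99), ("Python for Beginners", 19.99)]

conn = sqlite3.connect("scraped_data.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS books (title TEXT, price REAL)")
cur.executemany("INSERT INTO books (title, price) VALUES (?, ?)", rows)
conn.commit()
conn.close()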
Let’s examine the Web Scraping stages;
Web Platform and Code Analysis
Security — Precaution (CAPTCHA — IP BLOCK — LOCATION)
Algorithm Control
Correct Requests — HTTP
HTML — CSS — JS Codes Analysis
Data — Data Scraping
Data Storage
What are the Types of Web Scraping?
Known types of web scraping are generally;
HTML — DOM — Browser Extensions — Headless background browser
Web Scraping with Python
Libraries are very important in this regard.
We can list them as Requests, BeautifulSoup, Selenium, and Pandas.
Let's look at some sample studies and code;
Let's write some ticket information from the xyz website to MongoDB:
import requests
from bs4 import BeautifulSoup
import pymongo

# Create connection to MongoDB
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["ticket_org_db"]
collection = db["ny_espana_ticket_info"]

# Pull data from website
url = "https://www.xyz.com/ny-espana-tickets-all"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Process and retrieve data
tickets = []
for ticket in soup.find_all("div", class_="ticket"):
    ticket_name = ticket.find("h3").text
    ticket_price = ticket.find("span", class_="price").text
    tickets.append({
        "ticket_name": ticket_name,
        "ticket_price": ticket_price,
    })

# Saving data to MongoDB (insert_many fails on an empty list)
if tickets:
    collection.insert_many(tickets)

# Close the connection
client.close()
Some Details;
Popular Web Scraping Frameworks and Libraries;
Scrapy | A Fast and Powerful Scraping and Web Crawling Framework
Robots.txt and Platform Rules;
Some platforms leave notes for crawlers in their robots.txt files. Platforms may also take measures such as IP blocking, IP or data rate limiting, time limits, and bot blockers. You should carefully examine the platform you intend to scrape and analyze its pages properly, and you should always pay attention to what is legal and what is not.
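One common, polite way to deal with rate limits is to pace your requests and back off when the server pushes back; here is a rough sketch (the URLs, delays, and retry counts are placeholder values).
import time
import requests

# Placeholder URLs; replace with pages you are actually allowed to scrape
urls = ["https://www.example.com/page1", "https://www.example.com/page2"]

for url in urls:
    for attempt in range(3):
        response = requests.get(url, timeout=10)
        if response.status_code == 429:
            # The server is rate limiting us: wait longer on each retry
            time.sleep(5 * (attempt + 1))
            continue
        break
    print(url, response.status_code)
    # Pause between requests so we do not overload the server
    time.sleep(2)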
Project and Platform Analysis;
To complete the web scraping process more successfully, you may sometimes need to use proxy or VPN services and custom user-agent configurations. You may also need to work with spider-style structures, regional data analysis, and API infrastructure. Paying attention to the site's API and the browser infrastructure it uses is a good way to achieve more successful web scraping.
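A minimal sketch of sending a request through a proxy with a custom user agent is below; the proxy address and user-agent string are example values, not recommendations.
import requests

# Example values only: point these at a proxy you actually control or rent
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}

response = requests.get("https://www.example.com", headers=headers, proxies=proxies, timeout=10)
print(response.status_code)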
Staying Current;
Be sure to examine YouTube, Udemy, and free courses, as well as sample coding projects, and study their code structures and algorithms in detail. Using platforms such as GitHub and Stack Overflow effectively will contribute to your success. Review projects and analyze courses.
Some Training Resources and Web Scraping Videos;