A structured, hands-on learning journey through web scraping and browser automation with Python, built alongside a Udemy course and extended with real-world projects.
This repository documents my self-directed learning of web scraping and browser automation using Python. It is organized into two sections:
- Lessons: Core concepts and techniques covered through guided course material
- Projects: Independent, real-world scraping projects that apply and extend those concepts
The goal is to build both a strong conceptual foundation and a practical portfolio of applied scraping work.
Web Automation&Scraping using Python/
│
├── .venv/                          # Virtual environment (not tracked)
│
├── Lessons/                        # Guided course lessons
│   ├── intro.py                    # BeautifulSoup fundamentals
│   ├── lecture_selenium.py         # Selenium basics & browser control
│   ├── lecture_selenium1.py        # Cookie saving with pickle
│   └── lecture_selenium2.py        # Executing JavaScript via Selenium
│
├── Projects/                       # Applied real-world scraping projects
│   ├── Project1_ConsumerReportsWebsite.py
│   └── Project2_CraigslistWebsite.py
│
├── data/                           # Output data files (CSV, etc.)
│   └── accra_craigslist.csv
│
├── cookies.pkl                     # Saved browser session (generated at runtime)
├── .gitignore
├── README.md
└── requirements.txt
Concepts: requests, html5lib, BeautifulSoup
Scrapes the Wikipedia page on Logical Fallacies and extracts the table of contents. Demonstrates:
- Making HTTP GET requests with custom headers to mimic a real browser
- Parsing raw HTML with the html5lib parser
- Navigating the DOM with .find() and .find_all()
- Cleaning and formatting extracted text
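The parsing steps above can be sketched on an inline HTML string, so no network request is needed. This is a minimal illustration, not the code in intro.py: the sample markup, the `toc` id, and the `extract_toc` helper are all assumptions made for the example. In the lesson itself the HTML would come from a `requests.get(url, headers=...)` response.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
# The <li> tags are left unclosed on purpose: html5lib repairs them.
SAMPLE_HTML = """
<div id="toc">
  <ul>
    <li><a href="#Formal">  Formal   fallacies </a>
    <li><a href="#Informal">Informal fallacies</a>
  </ul>
</div>
"""


def extract_toc(html):
    """Return the cleaned link texts found inside the element with id="toc"."""
    soup = BeautifulSoup(html, "html5lib")
    toc = soup.find("div", id="toc")  # assumed container id
    if toc is None:
        return []
    # " ".join(text.split()) collapses runs of whitespace and trims the ends.
    return [" ".join(a.get_text().split()) for a in toc.find_all("a")]


entries = extract_toc(SAMPLE_HTML)
print(entries)
```

Returning an empty list when the container is missing keeps the caller from crashing if the page layout changes.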
Concepts: webdriver, ChromeDriverManager, element interaction
Opens a real Chrome browser, navigates to Wikipedia, and performs a live search for Bayern Munich. Demonstrates:
- Launching a Chrome browser session with Selenium
- Auto-managing ChromeDriver with webdriver-manager
- Finding elements by ID and XPATH
- Simulating user input (.send_keys()) and clicks (.click())
Concepts: Session persistence, pickle
Logs into a practice test login page and saves the authenticated browser session to a .pkl file. Demonstrates:
- Filling and submitting login forms via Selenium
- Capturing and persisting cookies with Python's pickle module
- The concept of reusable authenticated sessions (to avoid repeated logins)
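The save-and-reuse idea can be sketched without opening a browser. Selenium's `driver.get_cookies()` returns a list of dictionaries, so the example below round-trips a hypothetical cookie list through pickle. The helper names `save_cookies` and `load_cookies` and the sample cookie are assumptions for illustration, not the contents of lecture_selenium1.py.

```python
import pickle
import tempfile
from pathlib import Path


def save_cookies(cookies, path):
    """Serialize a list of cookie dicts to disk."""
    with open(path, "wb") as fh:
        pickle.dump(cookies, fh)


def load_cookies(path):
    """Return the saved cookie list, or an empty list if no file exists yet."""
    path = Path(path)
    if not path.exists():
        return []
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical cookie, in the dict shape that driver.get_cookies() returns.
example_cookies = [{"name": "sessionid", "value": "abc123", "domain": "example.com"}]

with tempfile.TemporaryDirectory() as tmp:
    cookie_file = Path(tmp) / "cookies.pkl"
    save_cookies(example_cookies, cookie_file)
    restored = load_cookies(cookie_file)
    missing = load_cookies(Path(tmp) / "not_saved_yet.pkl")

# With a live Selenium driver, the same helpers would be used roughly like this:
#   save_cookies(driver.get_cookies(), "cookies.pkl")
#   driver.get(login_url)                      # visit the cookie's domain first
#   for cookie in load_cookies("cookies.pkl"):
#       driver.add_cookie(cookie)
```

Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.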
Concepts: execute_script(), browser-side JavaScript
Navigates to Google and runs a raw JavaScript snippet directly in the browser. Demonstrates:
- Using driver.execute_script() to interact with page elements at the JS level
- Why JS execution is useful when standard Selenium selectors fall short on dynamic pages
File: Projects/Project1_ConsumerReportsWebsite.py
Libraries: requests, BeautifulSoup, pandas
Scrapes article cards from consumerreports.org and follows each article link to extract its "In This Article" section items.
What it does:
- Fetches the Consumer Reports homepage and identifies article cards
- Extracts article titles and their full URLs
- Crawls each article page to extract its sub-topic links and anchor text
- Stores the aggregated data in a pandas DataFrame
Key techniques: multi-page crawling, link resolution, dictionary aggregation, DataFrame output
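Link resolution and dictionary aggregation can be sketched with the standard library alone. The hrefs, titles, and the `resolve_links` helper below are hypothetical placeholders, not values scraped from the site. The sketch only shows one way relative links could be turned into full URLs and collected into a dict of equal-length lists, which is a shape `pandas.DataFrame` accepts.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.consumerreports.org/"


def resolve_links(base_url, hrefs):
    """Turn relative hrefs into absolute URLs, skipping empty values and duplicates."""
    seen = set()
    resolved = []
    for href in hrefs:
        if not href:
            continue  # cards without a link are ignored
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            resolved.append(url)
    return resolved


# Hypothetical href values as they might appear on article cards.
card_hrefs = [
    "/cars/example-article/",
    "appliances/another-article/",
    None,
    "/cars/example-article/",  # duplicate card
    "https://www.consumerreports.org/health/third-article/",
]
article_urls = resolve_links(BASE_URL, card_hrefs)

# Hypothetical titles, aggregated column by column alongside the URLs.
titles = ["Example article", "Another article", "Third article"]
records = {"title": [], "url": []}
for title, url in zip(titles, article_urls):
    records["title"].append(title)
    records["url"].append(url)

print(records)
```

Deduplicating before crawling avoids requesting the same article page twice.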
⚠️ Note: Consumer Reports is a dynamic (JavaScript-rendered) site. Some content may not be fully accessible via static requests-based scraping. This project intentionally explores the limits of static scraping on modern websites.
File: Projects/Project2_CraigslistWebsite.py
Libraries: requests, BeautifulSoup, pandas, pathlib
Scrapes real estate listings from the Accra section of Craigslist and exports a clean dataset as a CSV file.
What it does:
- Fetches gallery-view listing results from accra.craigslist.org
- Extracts listing title, price, location, and direct link for each result
- Saves the structured data to data/accra_craigslist.csv using pathlib for cross-platform path resolution
Key techniques: structured data extraction, multi-field parsing, CSV export, Path(__file__).resolve() for portable paths
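The portable-path technique can be sketched as follows, assuming the repository layout shown earlier (a script inside Projects/ and a data/ folder at the repository root). The `output_csv_path` helper and the example location are hypothetical. Inside the real script the argument would be `__file__`.

```python
from pathlib import Path


def output_csv_path(script_file, filename="accra_craigslist.csv"):
    """Build <repo root>/data/<filename> for a script located in <repo root>/Projects/."""
    # resolve() makes the path absolute, so the result does not depend on
    # the directory the script was launched from.
    repo_root = Path(script_file).resolve().parent.parent
    return repo_root / "data" / filename


# Hypothetical script location used only for demonstration.
example = output_csv_path("/home/user/repo/Projects/Project2_CraigslistWebsite.py")
print(example)

# Inside the project script this would look roughly like:
#   csv_path = output_csv_path(__file__)
#   csv_path.parent.mkdir(parents=True, exist_ok=True)  # create data/ if missing
#   df.to_csv(csv_path, index=False)                    # df is a pandas DataFrame
```

Anchoring the path to the script file rather than the current working directory means the CSV lands in data/ whether the script is run from the repository root or from inside Projects/.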
Sample output:
| Name | Link | Location | Price |
|---|---|---|---|
| 3 Bedroom House | https://accra.craigslist.org/... | East Legon | GH₵3,500 |
| ... | ... | ... | ... |
- Python 3.11 or higher
- Google Chrome browser installed
- Git
git clone https://github.com/IntentionedReflex35/Web-Automation-Scraping-using-Python.git
cd Web-Automation-Scraping-using-Python
# Windows
python -m venv .venv
.venv\Scripts\activate
# macOS / Linux
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
webdriver-manager automatically downloads and manages the correct ChromeDriver version for your installed Chrome browser, so no manual setup is needed.
# Run a lesson
python Lessons/intro.py
# Run a project
python Projects/Project2_CraigslistWebsite.py
Output CSV files are saved to the data/ directory.
| Tool | Purpose |
|---|---|
| requests | HTTP requests for static page content |
| html5lib | Lenient HTML parser (handles malformed HTML well) |
| BeautifulSoup4 | HTML parsing and DOM navigation |
| selenium | Browser automation and dynamic content interaction |
| webdriver-manager | Automatic ChromeDriver version management |
| pandas | Data structuring and CSV export |
| pickle | Serializing and saving browser session cookies |
| pathlib | Cross-platform file path management |
- HTTP requests and response handling
- Static HTML parsing with BeautifulSoup
- Custom request headers that mimic a real browser to reduce the chance of bot detection
- DOM traversal (.find(), .find_all(), .get_text())
- Browser automation with Selenium WebDriver
- Locating elements by ID, Name, and XPATH
- Simulating user interactions (typing, clicking)
- Cookie persistence with pickle
- JavaScript execution in the browser
- Multi-page crawling
- Structured data extraction and CSV export
- Virtual environment setup and dependency management
- Handling dynamic/JavaScript-rendered pages (Selenium + scraping combined) (upcoming)
- Pagination and infinite scroll (upcoming)
- Proxy rotation and rate limiting (upcoming)
- Scrapy framework (upcoming)
This repository is for educational purposes only. All scraping is performed on publicly accessible pages. Always review and comply with a website's robots.txt file and Terms of Service before scraping. The author is not responsible for any misuse of the techniques demonstrated here.
MIT License, © 2026 Jeshurun Nana Kojo Ansah
Permission is hereby granted, free of charge, to any person obtaining a copy of this repository and its associated files, to use, copy, modify, merge, publish, distribute, and/or sublicense them, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the software.
This project is provided as-is, without warranty of any kind. The author is not liable for any damages or misuse arising from its use.
Jeshurun Nana Kojo Ansah, Geomatic Engineering student | Aspiring Data Analyst
GitHub: IntentionedReflex35
"Move stealthy, execute in silence."