Web Automation & Scraping using Python

A structured, hands-on learning journey through web scraping and browser automation with Python, built alongside a Udemy course and extended with real-world projects.


📖 About This Repository

This repository documents my self-directed learning of web scraping and browser automation using Python. It is organized into two sections:

  • Lessons: Core concepts and techniques covered through guided course material
  • Projects: Independent, real-world scraping projects that apply and extend those concepts

The goal is to build both a strong conceptual foundation and a practical portfolio of applied scraping work.


🗂️ Project Structure

Web Automation&Scraping using Python/
│
├── .venv/                        # Virtual environment (not tracked)
│
├── Lessons/                      # Guided course lessons
│   ├── intro.py                  # BeautifulSoup fundamentals
│   ├── lecture_selenium.py       # Selenium basics & browser control
│   ├── lecture_selenium1.py      # Cookie saving with pickle
│   └── lecture_selenium2.py      # Executing JavaScript via Selenium
│
├── Projects/                     # Applied real-world scraping projects
│   ├── Project1_ConsumerReportsWebsite.py
│   └── Project2_CraigslistWebsite.py
│
├── data/                         # Output data files (CSV, etc.)
│   └── accra_craigslist.csv
│
├── cookies.pkl                   # Saved browser session (generated at runtime)
├── .gitignore
├── README.md
└── requirements.txt

📚 Lessons Overview

intro.py: BeautifulSoup Fundamentals

Concepts: requests, html5lib, BeautifulSoup

Scrapes the Wikipedia page on Logical Fallacies and extracts the table of contents. Demonstrates:

  • Making HTTP GET requests with custom headers to mimic a real browser
  • Parsing raw HTML with the html5lib parser
  • Navigating the DOM with .find() and .find_all()
  • Cleaning and formatting extracted text
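
The steps above can be sketched roughly as follows. This is an illustration, not the contents of intro.py: the User-Agent value, the function names, and the sample HTML (including its `toc` id) are assumptions made for the example. The parser defaults to html5lib as in the lesson, but any parser Beautiful Soup supports can be passed in.

```python
from bs4 import BeautifulSoup

# Assumed header value; any realistic browser User-Agent string would do.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}


def fetch_html(url: str) -> str:
    """Download a page with browser-like headers and return its HTML."""
    import requests  # imported here so the parsing helper also works offline

    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_list_items(html: str, container_id: str, parser: str = "html5lib") -> list[str]:
    """Return the cleaned text of every <li> inside the element with the given id."""
    soup = BeautifulSoup(html, parser)
    container = soup.find(id=container_id)
    if container is None:
        return []
    # get_text(" ", strip=True) joins nested text nodes and trims whitespace
    return [li.get_text(" ", strip=True) for li in container.find_all("li")]


# Hypothetical markup standing in for a table of contents.
SAMPLE_HTML = """
<div id="toc">
  <ul>
    <li><span>1</span> <span>Formal fallacies</span></li>
    <li><span>2</span> <span>Informal fallacies</span></li>
  </ul>
</div>
"""
```

Keeping the download and the parsing in separate functions means the parsing logic can be tried against a saved or hand-written HTML string without touching the network.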

lecture_selenium.py: Selenium Basics & Browser Control

Concepts: webdriver, ChromeDriverManager, element interaction

Opens a real Chrome browser, navigates to Wikipedia, and performs a live search for Bayern Munich. Demonstrates:

  • Launching a Chrome browser session with Selenium
  • Auto-managing ChromeDriver with webdriver-manager
  • Finding elements by ID and XPATH
  • Simulating user input (.send_keys()) and clicks (.click())
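
A rough sketch of that flow is shown below. It is not the lesson file itself: the function name, the `searchInput` element ID, and the submit-button XPath are assumptions about Wikipedia's markup and may need adjusting. The imports sit inside the function so that merely defining it does not require Selenium or a browser.

```python
def search_wikipedia(
    query: str,
    search_box_id: str = "searchInput",               # assumed element id
    submit_xpath: str = "//button[@type='submit']",   # assumed XPath
) -> str:
    """Search Wikipedia in a real Chrome window and return the URL it lands on."""
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
    from webdriver_manager.chrome import ChromeDriverManager

    # webdriver-manager downloads a ChromeDriver that matches the installed Chrome
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service)
    try:
        driver.get("https://www.wikipedia.org/")
        start_url = driver.current_url

        # Locate the search box by ID and type the query
        driver.find_element(By.ID, search_box_id).send_keys(query)
        # Locate the submit button by XPath and click it
        driver.find_element(By.XPATH, submit_xpath).click()

        # Wait until the browser has navigated away from the start page
        WebDriverWait(driver, 10).until(EC.url_changes(start_url))
        return driver.current_url
    finally:
        driver.quit()  # always close the browser, even if a step fails
```

Calling `search_wikipedia("Bayern Munich")` would open a visible Chrome window, so it is best run interactively rather than in an automated job.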

lecture_selenium1.py: Saving Browser Sessions as Cookies

Concepts: Session persistence, pickle

Logs into a practice test login page and saves the authenticated session's cookies to a .pkl file. Demonstrates:

  • Filling and submitting login forms via Selenium
  • Capturing and persisting cookies with Python's pickle module
  • The concept of reusable authenticated sessions (to avoid repeated logins)
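
The save-and-reuse idea can be sketched as two small helpers. The function names are hypothetical; `get_cookies()` and `add_cookie()` are the Selenium WebDriver methods they rely on, and `driver` is assumed to be an already-created WebDriver instance.

```python
import pickle
from pathlib import Path


def save_cookies(driver, path: str = "cookies.pkl") -> int:
    """Pickle the driver's current cookies to disk and return how many were saved."""
    cookies = driver.get_cookies()  # Selenium returns a list of dicts
    with Path(path).open("wb") as fh:
        pickle.dump(cookies, fh)
    return len(cookies)


def load_cookies(driver, path: str = "cookies.pkl") -> int:
    """Add previously saved cookies to the driver and return how many were loaded."""
    with Path(path).open("rb") as fh:
        cookies = pickle.load(fh)  # only unpickle files you created yourself
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)
```

When restoring a session, the browser generally needs to be on the same domain the cookies belong to before `add_cookie()` is called, followed by a page refresh so the site picks up the restored session.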

lecture_selenium2.py: Executing JavaScript via Selenium

Concepts: execute_script(), browser-side JavaScript

Navigates to Google and runs a raw JavaScript snippet directly in the browser. Demonstrates:

  • Using driver.execute_script() to interact with page elements at the JS level
  • Why JS execution is useful when standard Selenium selectors fall short on dynamic pages
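
As a small illustration of the idea (not the snippet used in the lesson), the two hypothetical helpers below wrap `driver.execute_script()`. `driver` is assumed to be an existing WebDriver and `element` a WebElement found beforehand.

```python
def page_title_via_js(driver) -> str:
    """Read the document title by running JavaScript inside the page."""
    # A script's `return` value is passed back to Python
    return driver.execute_script("return document.title;")


def click_via_js(driver, element) -> None:
    """Click an element from JavaScript instead of through Selenium's .click()."""
    # Extra arguments are exposed to the script as arguments[0], arguments[1], ...
    driver.execute_script("arguments[0].click();", element)
```

A JavaScript click like this is one common fallback when an overlay or animation prevents a normal Selenium click from reaching the element.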

🚀 Projects

Project 1: Consumer Reports Website

File: Projects/Project1_ConsumerReportsWebsite.py
Libraries: requests, BeautifulSoup, pandas

Scrapes article cards from consumerreports.org and follows each article link to extract its "In This Article" section items.

What it does:

  • Fetches the Consumer Reports homepage and identifies article cards
  • Extracts article titles and their full URLs
  • Crawls each article page to extract its sub-topic links and anchor text
  • Stores the aggregated data in a pandas DataFrame

Key techniques: multi-page crawling, link resolution, dictionary aggregation, DataFrame output
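
Link resolution and dictionary aggregation might look something like the sketch below. The function names, the row keys, and the shape of the inputs are assumptions for illustration; in the project the (title, href) pairs would come from the parsed article cards.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.consumerreports.org/"


def resolve_links(cards: list[tuple[str, str]], base_url: str = BASE_URL) -> dict[str, str]:
    """Map each article title to an absolute URL, resolving relative hrefs."""
    # urljoin leaves absolute hrefs untouched and completes relative ones
    return {title.strip(): urljoin(base_url, href) for title, href in cards}


def to_rows(sections: dict[str, list[tuple[str, str]]]) -> list[dict[str, str]]:
    """Flatten {article title: [(anchor text, link), ...]} into one row per sub-topic."""
    return [
        {"article": title, "section": text, "link": link}
        for title, items in sections.items()
        for text, link in items
    ]
```

The list returned by `to_rows()` can be passed straight to `pandas.DataFrame(...)`, which turns each dictionary into one row.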

⚠️ Note: Consumer Reports is a dynamic (JavaScript-rendered) site. Some content may not be fully accessible via static requests-based scraping. This project intentionally explores the limits of static scraping on modern websites.


Project 2: Accra Craigslist Real Estate Listings

File: Projects/Project2_CraigslistWebsite.py
Libraries: requests, BeautifulSoup, pandas, pathlib

Scrapes real estate listings from the Accra section of Craigslist and exports a clean dataset as a CSV file.

What it does:

  • Fetches gallery-view listing results from accra.craigslist.org
  • Extracts listing title, price, location, and direct link for each result
  • Saves the structured data to data/accra_craigslist.csv using pathlib for cross-platform path resolution

Key techniques: structured data extraction, multi-field parsing, CSV export, Path(__file__).resolve() for portable paths
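
The portable-path idea can be sketched as below. The project itself exports with pandas; this sketch swaps in the standard-library csv module purely to stay dependency-free. The function names are hypothetical, and the column names follow the sample output in this README.

```python
import csv
from pathlib import Path

FIELDS = ["Name", "Link", "Location", "Price"]


def output_path(script_file: str, filename: str = "accra_craigslist.csv") -> Path:
    """Build <repo>/data/<filename> for a script that lives in <repo>/Projects/."""
    # resolve() makes the path absolute, so the result does not depend on
    # the directory the script happens to be launched from
    project_root = Path(script_file).resolve().parent.parent
    return project_root / "data" / filename


def write_listings(rows: list[dict[str, str]], destination: Path) -> Path:
    """Write listing rows to CSV, creating the data/ directory if needed."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    with destination.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return destination
```

Inside a project script this would be called as `output_path(__file__)`; with pandas, the equivalent export would be `df.to_csv(output_path(__file__), index=False)`.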

Sample output:

| Name | Link | Location | Price |
|------|------|----------|-------|
| 3 Bedroom House | https://accra.craigslist.org/... | East Legon | GH₵3,500 |
| ... | ... | ... | ... |

⚙️ Setup & Installation

Prerequisites

  • Python 3.11 or higher
  • Google Chrome browser installed
  • Git

1. Clone the repository

git clone https://github.com/IntentionedReflex35/Web-Automation-Scraping-using-Python.git
cd Web-Automation-Scraping-using-Python

2. Create and activate a virtual environment

# Windows
python -m venv .venv
.venv\Scripts\activate

# macOS / Linux
python3 -m venv .venv
source .venv/bin/activate

3. Install dependencies

pip install -r requirements.txt

webdriver-manager automatically downloads and manages the correct ChromeDriver version for your installed Chrome browser, so no manual driver setup is needed.


▶️ Running the Scripts

# Run a lesson
python Lessons/intro.py

# Run a project
python Projects/Project2_CraigslistWebsite.py

Output CSV files are saved to the data/ directory.


🛠️ Tech Stack

| Tool | Purpose |
|------|---------|
| requests | HTTP requests for static page content |
| html5lib | Lenient HTML parser (handles malformed HTML well) |
| BeautifulSoup4 | HTML parsing and DOM navigation |
| selenium | Browser automation and dynamic content interaction |
| webdriver-manager | Automatic ChromeDriver version management |
| pandas | Data structuring and CSV export |
| pickle | Serializing and saving browser session cookies |
| pathlib | Cross-platform file path management |

🧠 Concepts Covered

  • HTTP requests and response handling
  • Static HTML parsing with BeautifulSoup
  • Custom request headers (such as User-Agent) so requests resemble those of a real browser
  • DOM traversal (.find(), .find_all(), .get_text())
  • Browser automation with Selenium WebDriver
  • Locating elements by ID, Name, and XPATH
  • Simulating user interactions (typing, clicking)
  • Cookie persistence with pickle
  • JavaScript execution in the browser
  • Multi-page crawling
  • Structured data extraction and CSV export
  • Virtual environment setup and dependency management
  • Upcoming: Handling dynamic/JavaScript-rendered pages (Selenium + scraping combined)
  • Upcoming: Pagination and infinite scroll
  • Upcoming: Proxy rotation and rate limiting
  • Upcoming: Scrapy framework

⚠️ Disclaimer

This repository is for educational purposes only. All scraping is performed on publicly accessible pages. Always review and comply with a website's robots.txt file and Terms of Service before scraping. The author is not responsible for any misuse of the techniques demonstrated here.
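
One way to check a site's robots.txt rules before scraping is Python's built-in `urllib.robotparser`. The sketch below is illustrative only: the helper name and the sample rules are made up, and a real check would use the target site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules used only to demonstrate the helper.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""
```

To check a live site instead of a string, `RobotFileParser` can download the file itself via `set_url(...)` followed by `read()`. A robots.txt check does not replace reading the site's Terms of Service.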


📄 License

MIT License, © 2026 Jeshurun Nana Kojo Ansah

Permission is hereby granted, free of charge, to any person obtaining a copy of this repository and its associated files, to use, copy, modify, merge, publish, distribute, and/or sublicense them, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the software.

This project is provided as-is, without warranty of any kind. The author is not liable for any damages or misuse arising from its use.


👤 Author

Jeshurun Nana Kojo Ansah, Geomatic Engineering student | Aspiring Data Analyst
🔗 GitHub: IntentionedReflex35

"Move stealthy, execute in silence."