khafidhteer/href2csv

href2csv

Extract anchor tag text and href URLs from HTML/HTML-fragment files and convert to CSV format.

Features

  • Supports .html and .txt input files (treats them as plain text)
  • Handles full HTML documents and HTML fragments
  • Extracts <a href="...">...</a> links
  • Sanitizes text by removing commas, quotes, and newlines
  • Outputs CSV with a text_value,url header row
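The extraction and sanitizing features above can be sketched with the standard library's `html.parser`. This is a minimal illustration, not the code in `scripts/extract_hrefs.py`: the names `sanitize`, `AnchorCollector`, and `extract_links` are hypothetical, and the exact sanitizing rules (commas and quotes deleted, newlines turned into spaces) are an assumption based on the feature list.

```python
from html.parser import HTMLParser


def sanitize(text):
    """Drop commas and quotes, turn newlines into spaces, collapse whitespace.

    Assumption: the real script may sanitize differently.
    """
    for ch in (",", '"', "'"):
        text = text.replace(ch, "")
    # str.split() with no arguments splits on any whitespace, including
    # newlines, so joining with a single space also removes line breaks.
    return " ".join(text.split())


class AnchorCollector(HTMLParser):
    """Collect (text, href) pairs from <a href="...">...</a> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._parts = []    # text chunks seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = sanitize("".join(self._parts))
            self.links.append((text, self._href.strip()))
            self._href = None


def extract_links(html_text):
    """Return a list of (text, url) tuples; works on full pages or fragments."""
    parser = AnchorCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

Because `HTMLParser` does not require a surrounding `<html>` element, the same function handles both full documents and fragments. Anchors without an `href` attribute are skipped.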

Installation

No external dependencies required. Uses Python 3 standard library only.

Usage

python scripts/extract_hrefs.py

The script will:

  1. Scan input/ for .html and .txt files
  2. Extract anchor tags from each file
  3. Generate CSV files in output/
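The three steps above can be sketched as a small driver function. This is an illustration under stated assumptions, not the actual script: `convert_directory` is a hypothetical name, `extract` stands for any callable that takes the file text and returns `(text, url)` pairs, and the `<name>_hrefs.csv` output naming is inferred from the `sra_hrefs.csv` example.

```python
import csv
from pathlib import Path


def convert_directory(input_dir, output_dir, extract):
    """Write one CSV per .html/.txt file found directly in input_dir.

    `extract` is a caller-supplied function: str -> iterable of (text, url).
    Returns the list of CSV paths that were written.
    """
    in_path = Path(input_dir)
    out_path = Path(output_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    written = []
    for source in sorted(in_path.iterdir()):
        # Step 1: only .html and .txt files are considered.
        if not source.is_file() or source.suffix.lower() not in {".html", ".txt"}:
            continue

        # Step 2: read the file as plain text and extract the anchors.
        rows = extract(source.read_text(encoding="utf-8", errors="replace"))

        # Step 3: write the CSV (output file name pattern is an assumption).
        target = out_path / f"{source.stem}_hrefs.csv"
        with target.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["text_value", "url"])
            writer.writerows(rows)
        written.append(target)
    return written
```

Using `csv.writer` rather than joining strings by hand means any comma or quote that survives sanitizing (for example, inside a URL) is still quoted correctly.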

Example output (output/sra_hrefs.csv):

text_value,url
Venus Beauty Pte Ltd,http://www.venusbeauty.com.sg/
Vision Lab Eyewear Premium Pte Ltd,http://www.visionlabeyewear.com.sg

Project Structure

href2csv/
├── input/              # Source HTML/TXT files
├── output/             # Generated CSV files (gitignored)
├── scripts/
│   └── extract_hrefs.py # Main extraction script
├── memory-bank/        # Project context and decisions
├── clinerules.md       # Project rules and conventions
└── README.md
