Extract anchor tag text and href URLs from HTML/HTML-fragment files and convert to CSV format.
- Supports .html and .txt input files (treats them as plain text)
- Handles full HTML documents and HTML fragments
- Extracts <a href="...">...</a> links
- Sanitizes text by removing commas, quotes, and newlines
- Outputs properly formatted CSV
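The extraction and sanitization steps listed above can be sketched with the standard library's html.parser module. This is a minimal illustration under stated assumptions, not the contents of scripts/extract_hrefs.py: the names sanitize, AnchorExtractor, and extract_links are hypothetical, and the exact replacement rules (commas and newlines become spaces, quotes are dropped) are a guess at one reasonable reading of "removing commas, quotes, and newlines".

```python
from html.parser import HTMLParser


def sanitize(text):
    """Strip commas, quotes, and newlines, then collapse runs of whitespace.

    Assumption: commas and newlines become spaces, quotes are dropped.
    """
    for ch in (",", "\r", "\n"):
        text = text.replace(ch, " ")
    for ch in ('"', "'"):
        text = text.replace(ch, "")
    return " ".join(text.split())


class AnchorExtractor(HTMLParser):
    """Collect (text, href) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self._href = None    # href of the anchor currently open, if any
        self._chunks = []    # text pieces seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:  # anchors without an href value are skipped
                self._href = href
                self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((sanitize("".join(self._chunks)), self._href))
            self._href = None


def extract_links(html_text):
    """Return a list of (text, url) tuples; works on full pages or fragments."""
    parser = AnchorExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

Because HTMLParser does not require a surrounding html or body element, the same function accepts a bare fragment such as a single paragraph containing one link.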
No external dependencies required. Uses Python 3 standard library only.
python scripts/extract_hrefs.py
The script will:
- Scan input/ for .html and .txt files
- Extract anchor tags from each file
- Generate CSV files in output/
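The three steps above could be wired together roughly as follows. This is a sketch, not the actual script: convert_directory is a hypothetical name, extract_links stands for any caller-supplied function that turns a file's text into (text, url) pairs, and the <stem>_hrefs.csv naming and text_value,url header are inferred from the example output in this README.

```python
import csv
from pathlib import Path


def convert_directory(input_dir, output_dir, extract_links):
    """Write one CSV per .html/.txt file found directly in input_dir.

    extract_links is a caller-supplied function (hypothetical here) that
    takes the file's text and returns an iterable of (text, url) pairs.
    Returns the list of CSV paths that were written.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(input_dir.iterdir()):
        # Only regular .html and .txt files are processed.
        if not src.is_file() or src.suffix.lower() not in {".html", ".txt"}:
            continue
        # Both file types are read as plain text.
        text = src.read_text(encoding="utf-8", errors="replace")
        # Assumed naming scheme: input/sra.html -> output/sra_hrefs.csv
        dest = output_dir / f"{src.stem}_hrefs.csv"
        # newline="" lets the csv module control line endings itself.
        with dest.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(["text_value", "url"])
            writer.writerows(extract_links(text))
        written.append(dest)
    return written
```

Using csv.writer rather than joining fields by hand means any comma or quote that survives sanitization (for example inside a URL) is still quoted correctly in the output.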
Example output (output/sra_hrefs.csv):
text_value,url
Venus Beauty Pte Ltd,http://www.venusbeauty.com.sg/
Vision Lab Eyewear Premium Pte Ltd,http://www.visionlabeyewear.com.sg
href2csv/
├── input/ # Source HTML/TXT files
├── output/ # Generated CSV files (gitignored)
├── scripts/
│ └── extract_hrefs.py # Main extraction script
├── memory-bank/ # Project context and decisions
├── clinerules.md # Project rules and conventions
└── README.md