A simple, production-friendly Streamlit web app that crawls links breadth-first from a start page up to a configurable maximum depth.
It obeys robots.txt, filters out non-HTML resources, and exports results as CSV.
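The breadth-first traversal with depth limiting and URL deduplication can be sketched roughly as follows. This is an illustrative standalone sketch, not the actual `app.py` code: the function names and the `fetch_links` callback (which would wrap `requests` + BeautifulSoup in the real app) are assumptions.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse

def normalize_url(url: str) -> str:
    # Drop fragments and trailing slashes so duplicates collapse to one entry.
    url, _fragment = urldefrag(url)
    return url.rstrip("/")

def in_scope(url: str, root_netloc: str, include_subdomains: bool = False) -> bool:
    # Domain scoping: same host only, or optionally any subdomain of it.
    netloc = urlparse(url).netloc
    if include_subdomains:
        return netloc == root_netloc or netloc.endswith("." + root_netloc)
    return netloc == root_netloc

def bfs_crawl(start_url: str, fetch_links, max_depth: int = 2):
    # fetch_links(url) -> iterable of href strings found on that page.
    root = urlparse(start_url).netloc
    seen = {normalize_url(start_url)}
    queue = deque([(start_url, 0)])
    results = []
    while queue:
        url, depth = queue.popleft()
        results.append((url, depth))
        if depth >= max_depth:
            continue  # depth limit reached: record the page but don't expand it
        for href in fetch_links(url):
            absolute = normalize_url(urljoin(url, href))
            if absolute not in seen and in_scope(absolute, root):
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return results
```

Passing the fetcher in as a callable keeps the traversal logic easy to test without touching the network.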
## Features

| Feature | Description |
| --- | --- |
| Recursive BFS crawling | Crawl links up to a configurable max depth |
| Domain scoping | Restrict to same domain or include subdomains |
| robots.txt awareness | Toggle robots.txt compliance on or off |
| Rate limiting & timeouts | Polite crawling with delays and request timeouts |
| Smart URL normalization | Normalize and deduplicate URLs |
| Binary content filtering | Skip images, videos, docs, and other non-HTML resources |
| Live results table | View depth, status, content type, and notes in real time |
| Download results as CSV | Export crawl results |
| Configurable via sidebar | All options adjustable in the Streamlit sidebar |
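The robots.txt toggle could be built on the standard library's `urllib.robotparser`. A minimal sketch, assuming a hypothetical `make_robots_checker` helper (in the app, the robots.txt body would be fetched from each host with `requests` rather than passed in as lines):

```python
from urllib.robotparser import RobotFileParser

def make_robots_checker(robots_lines, user_agent="*"):
    """Return a callable reporting whether `url` may be fetched.

    `robots_lines` is the robots.txt body split into lines; in the app
    this would come from https://<host>/robots.txt.
    """
    rp = RobotFileParser()
    rp.parse(robots_lines)  # parse() also marks the file as read
    return lambda url: rp.can_fetch(user_agent, url)
```

When compliance is toggled off, the crawler would simply skip this check.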
## Project Structure

| File/Folder | Description |
| --- | --- |
| `app.py` | Main Streamlit application |
| `requirements.txt` | Python dependencies |
## Requirements

| Requirement | Details |
| --- | --- |
| Python | 3.9+ |
| streamlit | >=1.32 |
| requests | >=2.31 |
| beautifulsoup4 | >=4.12 |
| pandas | >=2.0 |
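Given the version pins in the table, `requirements.txt` would contain roughly:

```text
streamlit>=1.32
requests>=2.31
beautifulsoup4>=4.12
pandas>=2.0
```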
## How to Run

```bash
# 1. Clone or download the repository
git clone https://github.com/your-username/recursive-link-crawler.git
cd recursive-link-crawler

# 2. (Optional) Create and activate a virtual environment
python -m venv .venv
# On Windows:
.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Run the Streamlit app
streamlit run app.py
```