
πŸ•·οΈ Recursive Link Crawler (Streamlit)

*Crawler screenshot*

A simple, production-friendly Streamlit web app that recursively crawls links from a start page up to a configurable maximum depth.
It crawls breadth-first, can respect robots.txt, filters out non-HTML resources, and exports results as CSV.
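
The crawl itself is a standard breadth-first traversal of the link graph with a depth cap. The sketch below illustrates that core loop; it is not the code in `app.py`, and the function and variable names are illustrative assumptions.

```python
import time
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup


def bfs_crawl(start_url: str, max_depth: int, delay: float = 1.0, timeout: float = 10.0):
    """Illustrative BFS crawl: visit pages level by level up to max_depth."""
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth) pairs
    results = []                     # rows for the results table / CSV export

    while queue:
        url, depth = queue.popleft()
        try:
            resp = requests.get(url, timeout=timeout)
            content_type = resp.headers.get("Content-Type", "")
        except requests.RequestException as exc:
            results.append({"url": url, "depth": depth, "status": None, "note": str(exc)})
            continue

        results.append({"url": url, "depth": depth, "status": resp.status_code, "type": content_type})

        # Only parse HTML pages, and only go deeper while under the depth cap.
        if depth < max_depth and "text/html" in content_type:
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                link, _frag = urldefrag(urljoin(url, a["href"]))  # resolve relative links, drop #fragment
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))

        time.sleep(delay)  # polite rate limiting between requests

    return results
```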


## ✨ Features

| Feature | Description |
| --- | --- |
| 🌐 Recursive BFS crawling | Crawl links breadth-first up to a configurable max depth |
| 🧭 Domain scoping | Restrict to the same domain or include subdomains |
| 🚫 robots.txt awareness | Toggle robots.txt compliance on or off |
| 🕰️ Rate limiting & timeouts | Polite crawling with delays and request timeouts |
| 🧹 Smart URL normalization | Normalize and deduplicate URLs (see the sketch after this table) |
| 🗂️ Binary content filtering | Skip images, videos, docs, and other non-HTML resources |
| 📊 Live results table | View depth, status, content type, and notes in real time |
| 💾 Download results as CSV | Export crawl results to a CSV file |
| ⚙️ Configurable via sidebar | All options adjustable in the Streamlit sidebar |

πŸ“ Project Structure

File/Folder Description
app.py Main Streamlit application
requirements.txt Python dependencies

## 🧰 Requirements

| Requirement | Details |
| --- | --- |
| Python | 3.9+ |
| streamlit | >=1.32 |
| requests | >=2.31 |
| beautifulsoup4 | >=4.12 |
| pandas | >=2.0 |
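
Assuming the bounds above, a matching `requirements.txt` would look roughly like this:

```text
streamlit>=1.32
requests>=2.31
beautifulsoup4>=4.12
pandas>=2.0
```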

## 🚀 How to Run

```bash
# 1. Clone or download the repository
git clone https://github.com/your-username/recursive-link-crawler.git
cd recursive-link-crawler

# 2. (Optional) Create and activate a virtual environment
python -m venv .venv
# On Windows:
.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Run the Streamlit app
streamlit run app.py
```

βš™οΈ Configuration Options

Setting Description
Start URL Initial page to start crawling
Max Depth Maximum recursion level
Restrict to Same Domain Limit crawling to the same host
Include Subdomains Include links from subdomains
Delay Between Requests Delay (seconds) between requests
Request Timeout Maximum wait time for a response
Respect Robots.txt Skip URLs blocked by robots.txt
User-Agent Identify your crawler politely
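
These settings map directly onto Streamlit sidebar widgets. The snippet below is a minimal sketch of how such a sidebar could be wired; the labels follow the table, but the variable names, defaults, and ranges are illustrative rather than taken from `app.py`.

```python
import streamlit as st

# Illustrative sidebar wiring; the real app may use different names and defaults.
with st.sidebar:
    start_url = st.text_input("Start URL", "https://example.com")
    max_depth = st.number_input("Max Depth", min_value=0, max_value=10, value=2)
    same_domain = st.checkbox("Restrict to Same Domain", value=True)
    include_subdomains = st.checkbox("Include Subdomains", value=False)
    delay = st.slider("Delay Between Requests (s)", 0.0, 5.0, 1.0)
    timeout = st.number_input("Request Timeout (s)", min_value=1, max_value=60, value=10)
    respect_robots = st.checkbox("Respect Robots.txt", value=True)
    user_agent = st.text_input("User-Agent", "RecursiveLinkCrawler/1.0")

if st.button("Start Crawl"):
    st.write(f"Crawling {start_url} to depth {max_depth}...")
```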