# 🕷️ Recursive Link Crawler (Streamlit)

*Crawler screenshot*

A simple, production-friendly Streamlit web app that recursively crawls links from a start page up to a configurable maximum depth.
It crawls breadth-first (BFS), obeys robots.txt, filters out non-HTML resources, and exports the results as CSV.


## ✨ Features

| Feature | Description |
| --- | --- |
| 🌐 Recursive BFS crawling | Crawl links up to a configurable max depth |
| 🧭 Domain scoping | Restrict to the same domain or include subdomains |
| 🚫 robots.txt awareness | Toggle robots.txt compliance on or off |
| 🕰️ Rate limiting & timeouts | Polite crawling with delays and request timeouts |
| 🧹 Smart URL normalization | Normalize and deduplicate URLs |
| 🗂️ Binary content filtering | Skip images, videos, docs, and other non-HTML resources |
| 📊 Live results table | View depth, status, content type, and notes in real time |
| 💾 CSV export | Download crawl results as a CSV file |
| ⚙️ Configurable via sidebar | All options adjustable in the Streamlit sidebar |
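The BFS crawl with URL normalization and deduplication can be sketched roughly as follows. This is a simplified illustration, not the app's actual code: `fetch_links()` is stubbed with an in-memory link graph (`FAKE_SITE` is made up) so the example runs offline, whereas the real app issues HTTP requests and parses pages with BeautifulSoup.

```python
# Simplified sketch of a breadth-first crawl with normalization + dedup.
from collections import deque
from urllib.parse import urldefrag, urlparse

# Hypothetical site: each URL maps to the links found on that page.
FAKE_SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b#top"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": [],
}

def normalize(url: str) -> str:
    """Drop the #fragment and lowercase the host so duplicates collapse."""
    url, _fragment = urldefrag(url)
    parts = urlparse(url)
    return parts._replace(netloc=parts.netloc.lower()).geturl()

def fetch_links(url: str) -> list[str]:
    # Stub: the real app would fetch the page and extract <a href> links.
    return FAKE_SITE.get(url, [])

def bfs_crawl(start_url: str, max_depth: int) -> list[tuple[str, int]]:
    """Crawl breadth-first up to max_depth; return (url, depth) pairs."""
    start = normalize(start_url)
    seen = {start}
    queue = deque([(start, 0)])
    results = []
    while queue:
        url, depth = queue.popleft()
        results.append((url, depth))
        if depth >= max_depth:
            continue  # depth limit reached: record the page, don't expand it
        for link in fetch_links(url):
            link = normalize(link)
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return results

print(bfs_crawl("https://example.com/", 2))
```

Note that `https://example.com/b#top` and `https://example.com/b` collapse to one entry after normalization, so each page is visited once.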

## 📁 Project Structure

| File/Folder | Description |
| --- | --- |
| `app.py` | Main Streamlit application |
| `requirements.txt` | Python dependencies |

## 🧰 Requirements

| Requirement | Details |
| --- | --- |
| Python | 3.9+ |
| streamlit | >=1.32 |
| requests | >=2.31 |
| beautifulsoup4 | >=4.12 |
| pandas | >=2.0 |

## 🚀 How to Run

```bash
# 1. Clone or download the repository
git clone https://github.com/your-username/recursive-link-crawler.git
cd recursive-link-crawler

# 2. (Optional) Create and activate a virtual environment
python -m venv .venv
# On Windows:
.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Run the Streamlit app
streamlit run app.py
```

## ⚙️ Configuration Options

| Setting | Description |
| --- | --- |
| Start URL | Initial page to start crawling from |
| Max Depth | Maximum recursion level |
| Restrict to Same Domain | Limit crawling to the same host |
| Include Subdomains | Also follow links on subdomains |
| Delay Between Requests | Delay (seconds) between requests |
| Request Timeout | Maximum wait time (seconds) for a response |
| Respect Robots.txt | Skip URLs blocked by robots.txt |
| User-Agent | Identify your crawler politely |
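The robots.txt toggle can be implemented with the standard library's `urllib.robotparser`; the sketch below is one plausible approach, not necessarily how `app.py` does it. The rules and the `MyCrawler/1.0` User-Agent string are illustrative, not from a real site; the real app would load rules from the target host's live `/robots.txt`.

```python
# Minimal sketch: check whether a URL is allowed before crawling it.
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content (a live crawler would fetch this
# from https://<host>/robots.txt instead of hard-coding it).
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

ua = "MyCrawler/1.0"  # hypothetical User-Agent string
print(rp.can_fetch(ua, "https://example.com/public/page"))   # True
print(rp.can_fetch(ua, "https://example.com/private/page"))  # False
```

With the "Respect Robots.txt" option enabled, any URL for which `can_fetch()` returns `False` would be recorded as skipped rather than requested.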
