
πŸ•·οΈ Recursive Link Crawler (Streamlit)

*Crawler screenshot*

A simple, production-friendly Streamlit web app that recursively crawls links from a start page up to a configurable maximum depth.
It crawls breadth-first, can respect robots.txt, filters out non-HTML resources, and exports results as CSV.
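
The crawl itself is a standard breadth-first traversal of the link graph with a depth cap. The sketch below illustrates that core loop; it is not the code in `app.py`, and the function and variable names are illustrative assumptions.

```python
import time
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup


def bfs_crawl(start_url: str, max_depth: int, delay: float = 1.0, timeout: float = 10.0):
    """Illustrative BFS crawl: visit pages level by level up to max_depth."""
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth) pairs
    results = []                     # rows for the results table / CSV export

    while queue:
        url, depth = queue.popleft()
        try:
            resp = requests.get(url, timeout=timeout)
            content_type = resp.headers.get("Content-Type", "")
        except requests.RequestException as exc:
            results.append({"url": url, "depth": depth, "status": None, "note": str(exc)})
            continue

        results.append({"url": url, "depth": depth, "status": resp.status_code, "type": content_type})

        # Only parse HTML pages, and only go deeper while under the depth cap.
        if depth < max_depth and "text/html" in content_type:
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                link, _frag = urldefrag(urljoin(url, a["href"]))  # resolve relative links, drop #fragment
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))

        time.sleep(delay)  # polite rate limiting between requests

    return results
```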


## ✨ Features

| Feature | Description |
| --- | --- |
| 🌐 Recursive BFS crawling | Crawl links breadth-first up to a configurable max depth |
| 🧭 Domain scoping | Restrict to the same domain or include subdomains |
| 🚫 robots.txt awareness | Toggle robots.txt compliance on or off |
| 🕰️ Rate limiting & timeouts | Polite crawling with delays and request timeouts |
| 🧹 Smart URL normalization | Normalize and deduplicate URLs (see the sketch after this table) |
| 🗂️ Binary content filtering | Skip images, videos, docs, and other non-HTML resources |
| 📊 Live results table | View depth, status, content type, and notes in real time |
| 💾 Download results as CSV | Export crawl results to a CSV file |
| ⚙️ Configurable via sidebar | All options adjustable in the Streamlit sidebar |

πŸ“ Project Structure

File/Folder Description
app.py Main Streamlit application
requirements.txt Python dependencies

## 🧰 Requirements

| Requirement | Details |
| --- | --- |
| Python | 3.9+ |
| streamlit | >=1.32 |
| requests | >=2.31 |
| beautifulsoup4 | >=4.12 |
| pandas | >=2.0 |
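
Assuming the bounds above, a matching `requirements.txt` would look roughly like this:

```text
streamlit>=1.32
requests>=2.31
beautifulsoup4>=4.12
pandas>=2.0
```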

## 🚀 How to Run

```bash
# 1. Clone or download the repository
git clone https://github.com/your-username/recursive-link-crawler.git
cd recursive-link-crawler

# 2. (Optional) Create and activate a virtual environment
python -m venv .venv
# On Windows:
.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Run the Streamlit app
streamlit run app.py
```

βš™οΈ Configuration Options

Setting Description
Start URL Initial page to start crawling
Max Depth Maximum recursion level
Restrict to Same Domain Limit crawling to the same host
Include Subdomains Include links from subdomains
Delay Between Requests Delay (seconds) between requests
Request Timeout Maximum wait time for a response
Respect Robots.txt Skip URLs blocked by robots.txt
User-Agent Identify your crawler politely
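
These settings map directly onto Streamlit sidebar widgets. The snippet below is a minimal sketch of how such a sidebar could be wired; the labels follow the table, but the variable names, defaults, and ranges are illustrative rather than taken from `app.py`.

```python
import streamlit as st

# Illustrative sidebar wiring; the real app may use different names and defaults.
with st.sidebar:
    start_url = st.text_input("Start URL", "https://example.com")
    max_depth = st.number_input("Max Depth", min_value=0, max_value=10, value=2)
    same_domain = st.checkbox("Restrict to Same Domain", value=True)
    include_subdomains = st.checkbox("Include Subdomains", value=False)
    delay = st.slider("Delay Between Requests (s)", 0.0, 5.0, 1.0)
    timeout = st.number_input("Request Timeout (s)", min_value=1, max_value=60, value=10)
    respect_robots = st.checkbox("Respect Robots.txt", value=True)
    user_agent = st.text_input("User-Agent", "RecursiveLinkCrawler/1.0")

if st.button("Start Crawl"):
    st.write(f"Crawling {start_url} to depth {max_depth}...")
```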