# 🕷️ Recursive Link Crawler (Streamlit)

*Crawler screenshot*

A simple, production-friendly Streamlit web app that recursively crawls links from a start page up to a configurable maximum depth.
It crawls breadth-first (BFS), obeys robots.txt, filters out non-HTML resources, and exports the results as CSV.


## ✨ Features

| Feature | Description |
| --- | --- |
| 🌐 Recursive BFS crawling | Crawl links up to a configurable max depth |
| 🧭 Domain scoping | Restrict to the same domain or include subdomains |
| 🚫 robots.txt awareness | Toggle robots.txt compliance on or off |
| 🕰️ Rate limiting & timeouts | Polite crawling with delays and request timeouts |
| 🧹 Smart URL normalization | Normalize and deduplicate URLs |
| 🗂️ Binary content filtering | Skip images, videos, docs, and other non-HTML resources |
| 📊 Live results table | View depth, status, content type, and notes in real time |
| 💾 CSV export | Download crawl results as a CSV file |
| ⚙️ Configurable via sidebar | All options adjustable in the Streamlit sidebar |
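The BFS crawl with URL normalization and deduplication can be sketched roughly as follows. This is a simplified illustration, not the app's actual code: `fetch_links()` is stubbed with an in-memory link graph (`FAKE_SITE` is made up) so the example runs offline, whereas the real app issues HTTP requests and parses pages with BeautifulSoup.

```python
# Simplified sketch of a breadth-first crawl with normalization + dedup.
from collections import deque
from urllib.parse import urldefrag, urlparse

# Hypothetical site: each URL maps to the links found on that page.
FAKE_SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b#top"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": [],
}

def normalize(url: str) -> str:
    """Drop the #fragment and lowercase the host so duplicates collapse."""
    url, _fragment = urldefrag(url)
    parts = urlparse(url)
    return parts._replace(netloc=parts.netloc.lower()).geturl()

def fetch_links(url: str) -> list[str]:
    # Stub: the real app would fetch the page and extract <a href> links.
    return FAKE_SITE.get(url, [])

def bfs_crawl(start_url: str, max_depth: int) -> list[tuple[str, int]]:
    """Crawl breadth-first up to max_depth; return (url, depth) pairs."""
    start = normalize(start_url)
    seen = {start}
    queue = deque([(start, 0)])
    results = []
    while queue:
        url, depth = queue.popleft()
        results.append((url, depth))
        if depth >= max_depth:
            continue  # depth limit reached: record the page, don't expand it
        for link in fetch_links(url):
            link = normalize(link)
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return results

print(bfs_crawl("https://example.com/", 2))
```

Note that `https://example.com/b#top` and `https://example.com/b` collapse to one entry after normalization, so each page is visited once.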

## 📁 Project Structure

| File/Folder | Description |
| --- | --- |
| `app.py` | Main Streamlit application |
| `requirements.txt` | Python dependencies |

## 🧰 Requirements

| Requirement | Details |
| --- | --- |
| Python | 3.9+ |
| streamlit | >=1.32 |
| requests | >=2.31 |
| beautifulsoup4 | >=4.12 |
| pandas | >=2.0 |

## 🚀 How to Run

```bash
# 1. Clone or download the repository
git clone https://github.com/your-username/recursive-link-crawler.git
cd recursive-link-crawler

# 2. (Optional) Create and activate a virtual environment
python -m venv .venv
# On Windows:
.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Run the Streamlit app
streamlit run app.py
```

## ⚙️ Configuration Options

| Setting | Description |
| --- | --- |
| Start URL | Initial page to start crawling from |
| Max Depth | Maximum recursion level |
| Restrict to Same Domain | Limit crawling to the same host |
| Include Subdomains | Also follow links on subdomains |
| Delay Between Requests | Delay (seconds) between requests |
| Request Timeout | Maximum wait time (seconds) for a response |
| Respect Robots.txt | Skip URLs blocked by robots.txt |
| User-Agent | Identify your crawler politely |
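The robots.txt toggle can be implemented with the standard library's `urllib.robotparser`; the sketch below is one plausible approach, not necessarily how `app.py` does it. The rules and the `MyCrawler/1.0` User-Agent string are illustrative, not from a real site; the real app would load rules from the target host's live `/robots.txt`.

```python
# Minimal sketch: check whether a URL is allowed before crawling it.
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content (a live crawler would fetch this
# from https://<host>/robots.txt instead of hard-coding it).
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

ua = "MyCrawler/1.0"  # hypothetical User-Agent string
print(rp.can_fetch(ua, "https://example.com/public/page"))   # True
print(rp.can_fetch(ua, "https://example.com/private/page"))  # False
```

With the "Respect Robots.txt" option enabled, any URL for which `can_fetch()` returns `False` would be recorded as skipped rather than requested.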
