Implement new crawler based on wpull #81

Merged
chosak merged 9 commits into main from feature/wpull on Nov 2, 2023

chosak commented on Nov 2, 2023

This PR adds an alternate method of crawling a website based on wpull.

The current approach uses 2 steps:

  1. Use wget to crawl a website, generating a WARC file
  2. Run a Django management command (warc_to_db) to convert the WARC to a queryable SQLite database

The new approach uses only a single step:

  1. Use wpull plus a custom plugin to crawl a website directly into a queryable SQLite database

This can be done using a new Django management command:

% ./manage.py crawl --help
Usage: manage.py crawl [OPTIONS] START_URL DB_FILENAME

  Crawl a website to a SQLite database.

Options:
  --max-pages INTEGER            Maximum number of pages to crawl
  --depth INTEGER                Maximum crawl depth
  --recreate                     Overwrite SQLite database if it already
                                 exists  [default: False]
  --resume
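
The help output above suggests a command-line interface along the following lines. This is only a minimal sketch using Python's standard `argparse`; the actual command is a Django management command whose implementation isn't shown here, and the `build_parser` name is hypothetical. Option names and help strings are copied from the help text. The description of `--resume` is cut off in that output, so it is modelled here as a plain flag with no help text.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options listed in the help output.

    Hypothetical helper: the real command's implementation is not shown
    in this PR description.
    """
    parser = argparse.ArgumentParser(
        prog="manage.py crawl",
        description="Crawl a website to a SQLite database.",
    )
    # Positional arguments, in the order given by the usage line.
    parser.add_argument("start_url", metavar="START_URL")
    parser.add_argument("db_filename", metavar="DB_FILENAME")
    # Optional limits; None means no limit was given on the command line.
    parser.add_argument(
        "--max-pages", type=int, help="Maximum number of pages to crawl"
    )
    parser.add_argument("--depth", type=int, help="Maximum crawl depth")
    parser.add_argument(
        "--recreate",
        action="store_true",
        help="Overwrite SQLite database if it already exists",
    )
    # The help text for --resume is truncated in the source, so none is given.
    parser.add_argument("--resume", action="store_true")
    return parser


# Example invocation with placeholder values.
args = build_parser().parse_args(
    ["https://example.com/", "crawl.sqlite3", "--max-pages", "100", "--depth", "2"]
)
print(args.start_url, args.db_filename, args.max_pages, args.depth, args.recreate)
```

The URL and filename in the example invocation are placeholders, not values taken from this repository.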

Because wpull doesn't support Python versions newer than 3.6 (ArchiveTeam/wpull#426), this new approach requires downgrading this repo's runtime to Python 3.6. That in turn requires downgrading Django from version 4.x back to 3.2, since Django 4.0 dropped support for Python 3.6.

Unfortunately wpull only supports Python 3.6, see

- ArchiveTeam/wpull#404
- ArchiveTeam/wpull#451

Django 4.0 dropped support for Python 3.6, see

https://docs.djangoproject.com/en/4.2/releases/4.0/#python-compatibility

In order to integrate wpull with the viewer application, we need to
downgrade the viewer Django version from 4.0 to 3.2.

Unfortunately wpull only supports Python 3.6, see

- ArchiveTeam/wpull#404
- ArchiveTeam/wpull#451

In order to integrate wpull with the viewer application, we need to
downgrade Python from 3.8 to 3.6.

This change adds a new management command (manage.py crawl) that crawls
a website directly into a SQLite database, using the wpull package:

https://github.com/ArchiveTeam/wpull

Usage: manage.py crawl [OPTIONS] START_URL DB_FILENAME
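
To make "directly into a SQLite database" more concrete, here is a rough sketch of recording crawled pages with Python's standard `sqlite3` module. It does not use wpull or its plugin API, and the `pages` table, its columns, and the `open_db` and `record_page` helpers are all hypothetical; they are not the schema or code this PR introduces.

```python
import sqlite3


def open_db(filename):
    """Open (or create) the database and ensure the hypothetical table exists."""
    conn = sqlite3.connect(filename)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, status INTEGER, title TEXT, html TEXT)"
    )
    return conn


def record_page(conn, url, status, title, html):
    """Insert or update one crawled page.

    In a real crawler, a per-response hook would call something like this.
    """
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, status, title, html) "
        "VALUES (?, ?, ?, ?)",
        (url, status, title, html),
    )
    conn.commit()


# Use an in-memory database so the sketch leaves no file behind.
conn = open_db(":memory:")
record_page(conn, "https://example.com/", 200, "Home", "<html>...</html>")
record_page(conn, "https://example.com/about/", 200, "About", "<html>...</html>")

# The result is immediately queryable, with no separate conversion step.
rows = conn.execute("SELECT url, title FROM pages ORDER BY url").fetchall()
print(rows)
```

Writing each page as it is fetched is what removes the intermediate WARC file and the separate conversion step from the workflow.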
chosak merged commit bcd66f0 into main on Nov 2, 2023
chosak deleted the feature/wpull branch on November 2, 2023 at 13:22