7 changes: 7 additions & 0 deletions docs/README.md
@@ -0,0 +1,7 @@
# `wikiteam3` Documentation
Welcome to the (WIP) documentation of `wikiteam3`!

References:
- [WikiTeam wiki](https://github.com/WikiTeam/wikiteam/wiki)
- [WikiTeam](https://wiki.archiveteam.org/index.php/WikiTeam) at the ArchiveTeam Wiki
- [WikiTeam3 tutorial](https://meta.miraheze.org/wiki/Backups#WikiTeam3) at Miraheze Meta
9 changes: 9 additions & 0 deletions docs/database-api-xml-relation.md
@@ -0,0 +1,9 @@
# Database-API-XML Relation

WIP: Writing this reference requires testing on a **real** MediaWiki instance.

Most of a wiki's data is stored in the database, and part of that data is required to rebuild the wiki.

Without direct access to the database, this data can still be retrieved through the API.

MediaWiki also defines an XML format that holds this data.
65 changes: 65 additions & 0 deletions docs/dump_structure.md
@@ -0,0 +1,65 @@
# Dump Structure

Local directory structure:

```
<url>-<date>-wikidump
├── config.json
├── index.html
├── SpecialVersion.html
├── siteinfo.json
├── all_dumped.mark
├── uploaded_to_IA.mark
├── errors.log
├── <url>-<date>-titles.txt
├── <url>-<date>-history.xml
├── <url>-<date>-images.txt
├── images
│ └── ...
├── images_mismatch
│ └── ...
└── <url>-<date>-redirects.jsonl
```

Internet Archive item structure:

```
wiki-<url>-<date>
├── <url>-<date>-dumpMeta
│ ├── config.json
│ ├── index.html
│ ├── SpecialVersion.html
│ ├── siteinfo.json
│ ├── errors.log
│ ├── <url>-<date>-titles.txt.zst
│ ├── <url>-<date>-images.txt.zst
│ └── <url>-<date>-redirects.jsonl.zst
├── <url>-<date>-history.xml.zst
├── <url>-<date>-images.7z
├── <url>-<date>-images_mismatch.7z
└── <identifier>_logo.<suffix>
```
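
The relation between the two trees above can be sketched as a small lookup. This is only an illustration derived from the listings, not the uploader's actual code: the function name `ia_path` and the example prefix `example.org-20240101` (standing in for `<url>-<date>`) are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional

# Files that the trees above show, uncompressed, inside `<url>-<date>-dumpMeta`.
META_PLAIN = {"config.json", "index.html", "SpecialVersion.html", "siteinfo.json", "errors.log"}
# Files that appear inside `<url>-<date>-dumpMeta` with a `.zst` suffix.
META_ZST_SUFFIXES = ("-titles.txt", "-images.txt", "-redirects.jsonl")


def ia_path(local_name: str, prefix: str) -> Optional[str]:
    """Map a local dump entry to its path in the IA item, or None if it is not listed there."""
    meta_dir = PurePosixPath(f"{prefix}-dumpMeta")
    if local_name in META_PLAIN:
        return str(meta_dir / local_name)
    if local_name.startswith(prefix) and local_name[len(prefix):] in META_ZST_SUFFIXES:
        return str(meta_dir / f"{local_name}.zst")
    if local_name == f"{prefix}-history.xml":
        return f"{local_name}.zst"
    if local_name in ("images", "images_mismatch"):
        return f"{prefix}-{local_name}.7z"
    # The *.mark files are not listed in the IA item tree.
    return None
```

For example, `ia_path("images", "example.org-20240101")` gives `example.org-20240101-images.7z`.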

## General
- `config.json`: Dump configuration. Used when [resuming an incomplete dump](https://github.com/saveweb/wikiteam3/blob/v4-main/README.md#resuming-an-incomplete-dump).
- `index.html`: Archive of `index.php` (the main page).
- `SpecialVersion.html`: Archive of `[[Special:Version]]`.
- `siteinfo.json`: Archive of Siteinfo API response.
- `all_dumped.mark`: Marks the end of the dump. Content: `<time>:<msg>`
- `uploaded_to_IA.mark`: Marks a successful upload to IA. Content: `<time>: identifier: <identifier>`
- `errors.log`: Error log. Please check this file after the dump is finished.
- `<identifier>_logo.<suffix>`: Logo. This is downloaded when uploading to IA and is not stored locally.
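
As a rough illustration of how the marker files and the error log can be used, the sketch below summarizes the state of a local dump directory. The helper name `dump_status` is hypothetical, and treating a non-empty `errors.log` as "errors were recorded" is an assumption rather than documented behavior.

```python
from pathlib import Path


def dump_status(dump_dir: str) -> dict:
    """Summarize the marker files and error log of a local dump directory."""
    root = Path(dump_dir)
    errors = root / "errors.log"
    return {
        # all_dumped.mark marks the end of the dump.
        "finished": (root / "all_dumped.mark").is_file(),
        # uploaded_to_IA.mark marks a successful upload to IA.
        "uploaded": (root / "uploaded_to_IA.mark").is_file(),
        # Assumption: a non-empty errors.log means something needs attention.
        "has_errors": errors.is_file() and errors.stat().st_size > 0,
    }
```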

## XML Dump
- `<url>-<date>-titles.txt`: List of titles.
- `<url>-<date>-history.xml`: The XML dump. See [Manual:Importing XML dumps](https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps) for importing.

## Image Dump
- `<url>-<date>-images.txt`: Image metadata in TSV (Tab-Separated Values) format.
- `images` (directory): The image dump, i.e. the dump of all uploaded files.
- `<url>-<date>-images.7z`: Compression of the `images` directory.
- `images_mismatch` (directory): Images whose actual size or SHA1 doesn't match API responses. Please contact the webmaster.
- `<url>-<date>-images_mismatch.7z`: Compression of the `images_mismatch` directory.

## Redirects Dump
- `<url>-<date>-redirects.jsonl`: The redirects dump. Each line contains one redirection.
66 changes: 66 additions & 0 deletions docs/dump_types.md
@@ -0,0 +1,66 @@
# Dump Types
There are three types of backups that can be made with `wikiteam3dumpgenerator`: **XML dumps**, **image dumps**, and **redirect dumps**.

## XML dump
An XML dump contains the entire history or the latest revision of all pages. To generate an XML dump, use the `--xml` option.

### Revisions
You can export all revisions (default) or the current revision only.

| Revisions | Option |
|-----------|--------|
| All | *None* |
| Current | `--curonly` |

### API
List of export APIs supported by wikiteam3:
- [Special:Export](https://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export) (default)
  - First, get the list of page titles to export.
  - Then, send POST requests to Special:Export to retrieve all/current revisions of pages in the list. The responses are in XML format.
- [API:Allrevisions](https://www.mediawiki.org/wiki/API:Allrevisions)
  - Send GET requests to `api.php` with `action=query&list=allrevisions` to retrieve all/current revisions of all pages. The responses are then converted from JSON to XML.
  - This API is significantly faster because it doesn't rely on the list of titles. You may disable the delay between requests (`--delay 0`) if you are using this API.
- [API:Revisions](https://www.mediawiki.org/wiki/API:Revisions)
  - First, get the list of page titles to export.
  - Then, send GET requests to `api.php` with `action=query&format=xml&prop=revisions&titles=<title>` to retrieve all/current revisions of pages in the list. The responses are then converted from JSON to XML.
- [API:Query](https://www.mediawiki.org/wiki/API:Query) (DEVELOPMENT ONLY)
  - First, get the list of page titles to export.
  - Then, send GET requests to `api.php` with `action=query&titles=<title>&export=1` to retrieve the **current** revision of pages in the list. The exported data is in XML format.
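
To make the API:Allrevisions request above more concrete, here is a minimal sketch that builds such a query URL. It is not wikiteam3's implementation: the function name and the example host are hypothetical, and `format=json` and `arvlimit` are standard MediaWiki API parameters chosen for illustration, not necessarily the exact ones wikiteam3 sends.

```python
from urllib.parse import urlencode


def allrevisions_url(api_url: str, limit: int = 50) -> str:
    """Build a GET URL for one batch of revisions via list=allrevisions."""
    params = {
        "action": "query",
        "list": "allrevisions",
        "format": "json",   # the JSON response is later converted to XML
        "arvlimit": limit,  # assumption: batch size picked for this example
    }
    return f"{api_url}?{urlencode(params)}"


print(allrevisions_url("https://wiki.example.org/api.php"))
```

A real client would also need to follow the continuation values returned by the API until all revisions have been fetched.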

If the list of page titles is needed, wikiteam3 tries [API:Allpages](https://www.mediawiki.org/wiki/API:Allpages) (1.8+) first. If it fails, wikiteam3 then tries to extract page titles from Special:Allpages.

**Limitations**: XML dumps produced using API:Allrevisions or API:Revisions are missing `<redirect>` tags, because these APIs don't return redirect information. This doesn't matter, since redirects can be parsed from wikitext. To retrieve redirect information, see [redirect dump](#redirect-dump) below.

Here is a table for comparison. Legend:
- **MW version**: Supported MediaWiki versions. The use of old APIs enables wikiteam3 to create dumps for wikis running older versions of MediaWiki software.
- **Titles**: Requires a list of page titles to export before exporting.

| API | Option | MW version | Titles |
|-----|--------|------------|--------|
| Special:Export | *None* | 1.16+ (?) | Yes |
| API:Allrevisions | `--xmlrevisions` | 1.27+ | No |
| API:Revisions | `--xmlapiexport` | 1.8+ | Yes |
| API:Query | `--xmlrevisions_page` | 1.8+ | Yes |
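
The options in the table can be combined with `--xml` and `--curonly` as in the sketch below, which only assembles an argument list. The helper `xml_dump_args` is hypothetical, and passing the wiki URL as a positional argument is an assumption; check the tool's `--help` output for the actual invocation.

```python
from typing import List, Optional

# Option for each export API, taken from the table above (None = no extra option).
XML_API_OPTIONS = {
    "Special:Export": None,
    "API:Allrevisions": "--xmlrevisions",
    "API:Revisions": "--xmlapiexport",
    "API:Query": "--xmlrevisions_page",
}


def xml_dump_args(wiki_url: str, api: str = "Special:Export", curonly: bool = False) -> List[str]:
    """Assemble a command line for an XML dump using the chosen export API."""
    # Assumption: the wiki URL is given as a positional argument.
    args = ["wikiteam3dumpgenerator", wiki_url, "--xml"]
    option: Optional[str] = XML_API_OPTIONS[api]
    if option:
        args.append(option)
    if curonly:
        args.append("--curonly")
    return args


print(" ".join(xml_dump_args("https://wiki.example.org", "API:Allrevisions")))
```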

TODO: Figure out the exact version for Special:Export

## Image dump
An image dump contains all files along with their metadata. It is called an *image* dump for historical reasons. To generate an image dump, use the `--images` option.

It takes three steps for the program to create an image dump:
1. Get file names and metadata.
   - If the API is available, try [API:Allimages](https://www.mediawiki.org/wiki/API:Allimages) (MW 1.13+) first. If it fails, use [API:Allpages](https://www.mediawiki.org/wiki/API:Allpages) (MW 1.8+).
   - Otherwise, scrape and parse Special:Imagelist.
2. Save file names and metadata to `<url>-<date>-images.txt`. The file format is documented in [`DEV.md`](https://github.com/saveweb/wikiteam3/blob/v4-main/DEV.md).
3. Download the files. For each file, the actual size and SHA1 are checked against the API responses:
   - If they match, the file is saved in the `images` directory.
   - Otherwise, the file is saved in the `images_mismatch` directory, and an error message is written to `errors.log`.
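
The check in step 3 can be sketched as follows. This is an illustration of the idea, not wikiteam3's code: the function names are hypothetical, and the expected size and SHA1 are assumed to be available as an integer byte count and a hexadecimal string.

```python
import hashlib
from pathlib import Path


def matches_api(path: str, expected_size: int, expected_sha1: str) -> bool:
    """Return True if the file's size and SHA1 equal the values reported by the API."""
    data = Path(path).read_bytes()  # fine for a sketch; large files should be hashed in chunks
    actual_sha1 = hashlib.sha1(data).hexdigest()
    return len(data) == expected_size and actual_sha1 == expected_sha1.lower()


def target_directory(path: str, expected_size: int, expected_sha1: str) -> str:
    """Pick the directory a downloaded file belongs in, following step 3."""
    return "images" if matches_api(path, expected_size, expected_sha1) else "images_mismatch"
```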

After creating an image dump, please check the `images_mismatch` directory. Any file in this directory indicates a problem: the downloaded file differs from what the API reported. The probable cause is that the webmaster has turned on image compression for server responses. Please contact the webmaster.

TODO: File name limitations

## Redirect dump
A redirect dump contains a list of all redirects. The output file `<url>-<date>-redirects.jsonl` is in JSONL format: each line contains the information for one redirect, taken from the response of [API:Allredirects](https://www.mediawiki.org/wiki/API:Allredirects).
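
A JSONL file like this can be read one line at a time. The sketch below makes no assumption about which fields each redirect record contains; the function name `read_redirects` is hypothetical.

```python
import json
from typing import Any, Dict, List


def read_redirects(path: str) -> List[Dict[str, Any]]:
    """Load a redirects JSONL file: one JSON object per non-empty line."""
    redirects = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                redirects.append(json.loads(line))
    return redirects
```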

This feature was introduced in commit [`f901972`](https://github.com/saveweb/wikiteam3/commit/f901972ffc7525001f23cc20368d6437369c8953) because of the limitations of some export APIs. See the [XML dump](#xml-dump) section.