Open IRE

Open Institutional Repository Expansion (IRE) is a configurable crawler for collecting articles from open-access research repositories.

Installation

1. Clone the Git Repository

This repository contains everything needed to run the software, so you only need to clone it somewhere on your file system:

git clone https://github.com/uw-ssec/open-ire.git

2. Install Package Manager

This project uses the Pixi package manager to install and manage other prerequisites, including a compatible version of Python. The easiest way to install Pixi is with one of these commands:

macOS/Linux:

curl -fsSL https://pixi.sh/install.sh | sh

Windows:

powershell -ExecutionPolicy ByPass -c "irm -useb https://pixi.sh/install.ps1 | iex"

For other installation options, visit the Pixi installation guide.

3. Install the Prerequisites

To install the default Pixi environment, which lets you run all of the provided web-crawling spiders, run the following commands from the directory where you cloned the open-ire repository:

cd open-ire
pixi install

Once the installation finishes, you can run the pre-defined tasks with the pixi run <task> command from the open-ire directory.

4. Configure the Execution Environment

pixi run dotenv

This command creates the environment file template .env that you then need to edit to configure your settings. The only required setting is ENVIRONMENT, which needs to be set to either development or production.
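
As a quick sanity check after editing .env, a minimal sketch like the following can confirm that ENVIRONMENT holds one of the two accepted values. The parse_env and check_environment helpers are hypothetical and not part of Open IRE. The parser assumes plain KEY=VALUE lines and ignores comments, so it covers less syntax than a full dotenv parser.

```python
from pathlib import Path

# The two values the README names as valid for ENVIRONMENT.
ALLOWED_ENVIRONMENTS = {"development", "production"}


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of optional quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def check_environment(values: dict) -> str:
    """Return the ENVIRONMENT value, or raise if it is missing or invalid."""
    env = values.get("ENVIRONMENT", "")
    if env not in ALLOWED_ENVIRONMENTS:
        allowed = ", ".join(sorted(ALLOWED_ENVIRONMENTS))
        raise ValueError(f"ENVIRONMENT must be one of: {allowed} (got {env!r})")
    return env


def check_env_file(path: str = ".env") -> str:
    """Read an env file (hypothetical helper) and validate ENVIRONMENT."""
    return check_environment(parse_env(Path(path).read_text(encoding="utf-8")))
```

Calling check_env_file() from the open-ire directory would raise a ValueError if the setting is missing or misspelled.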

To store collected files in a Microsoft SharePoint Drive, you also need to set your SharePoint credentials:

SHAREPOINT_TENANT_ID=<your_application_tenant_id>
SHAREPOINT_CLIENT_ID=<your_application_id>
SHAREPOINT_SITE_ID=<your_sharepoint_site_id>
SHAREPOINT_CLIENT_SECRET=<your_application_client_secret>

Alternatively, you can disable the SharePointPipeline in src/open_ire/settings/.
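
If you keep the SharePoint pipeline enabled, it can help to confirm that all four credentials are filled in before starting a crawl. The sketch below is an illustration only: missing_sharepoint_settings is a hypothetical helper, and it assumes the variables are available in the process environment or passed in as a mapping. How Open IRE itself loads them is not shown here.

```python
import os

# The four variable names listed in the .env template above.
SHAREPOINT_KEYS = (
    "SHAREPOINT_TENANT_ID",
    "SHAREPOINT_CLIENT_ID",
    "SHAREPOINT_SITE_ID",
    "SHAREPOINT_CLIENT_SECRET",
)


def missing_sharepoint_settings(environ=None):
    """Return the SharePoint variable names that are unset, empty,
    or still hold a <placeholder> value from the template."""
    environ = os.environ if environ is None else environ
    missing = []
    for key in SHAREPOINT_KEYS:
        value = environ.get(key, "").strip()
        if not value or value.startswith("<"):
            missing.append(key)
    return missing
```

An empty list means every credential has a non-placeholder value. It does not verify that the credentials are accepted by SharePoint.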

Crawling the Repositories

This project includes spiders for crawling repositories using two main methods: a list of keywords or a CSV file of author names.

Search by Keyword

To run a spider with a custom list of search terms, use the search-terms command:

pixi run search-terms <spider_name> "term1,term2,..." [<page>]

For example, to search the eric repository:

pixi run search-terms eric "ocean acidification,coral bleaching"
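
When the search terms come from a script rather than the command line, the comma-separated argument can be assembled programmatically. The following is a minimal sketch under stated assumptions: build_search_terms_command is a hypothetical helper, it only builds the argument list for the command shown above, and it rejects terms containing a comma because the README does not describe any escaping for them. The optional page value is passed through unchanged.

```python
def build_search_terms_command(spider_name, terms, page=None):
    """Build the argument list for `pixi run search-terms`.

    `terms` is an iterable of strings; they are joined with commas into
    the single quoted argument the task expects.
    """
    cleaned = [term.strip() for term in terms if term.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty search term is required")
    for term in cleaned:
        if "," in term:
            # A comma inside a term would be read as a separator.
            raise ValueError(f"search term must not contain a comma: {term!r}")
    command = ["pixi", "run", "search-terms", spider_name, ",".join(cleaned)]
    if page is not None:
        command.append(str(page))
    return command
```

The resulting list can be handed to subprocess.run(command, check=True, cwd="open-ire"), which avoids shell quoting issues for terms that contain spaces.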

Search by Author

To search for a single author's name, use the search-author (singular) command:

pixi run search-author <spider_name> "<author's full name>"

For example:

pixi run search-author openalex "Michelle Habell-Pallán"

To run a spider that supports searching by author against a list of authors, use the search-authors (plural) command. This requires a CSV file with FirstName, MiddleNames (optional), LastName, and Email columns.

pixi run search-authors <spider_name> "<path_to_csv>"

For example, to search openalex for publications by any of the authors listed in a CSV file:

pixi run search-authors openalex "data/authors.csv"
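
An authors file with the required columns can be written with Python's standard csv module. This is a sketch rather than a project utility: write_authors_csv is a hypothetical helper, the column order is an assumption, and the sample author is a placeholder. MiddleNames defaults to an empty string because the README marks it as optional.

```python
import csv
import io

# Column names as given in the README; the order is an assumption.
AUTHOR_COLUMNS = ["FirstName", "MiddleNames", "LastName", "Email"]
REQUIRED_COLUMNS = ("FirstName", "LastName", "Email")


def write_authors_csv(authors, stream):
    """Write author dicts to `stream` as CSV with the expected header."""
    writer = csv.DictWriter(stream, fieldnames=AUTHOR_COLUMNS)
    writer.writeheader()
    for author in authors:
        for column in REQUIRED_COLUMNS:
            if column not in author:
                raise ValueError(f"author entry is missing {column}: {author!r}")
        # Fill the optional MiddleNames column with an empty string.
        writer.writerow({column: author.get(column, "") for column in AUTHOR_COLUMNS})


# Placeholder data for illustration only.
sample = [{"FirstName": "Jane", "LastName": "Doe", "Email": "jane.doe@example.org"}]
buffer = io.StringIO()
write_authors_csv(sample, buffer)
```

To write to disk instead of an in-memory buffer, open the target with open("data/authors.csv", "w", newline="", encoding="utf-8") and pass the file object as stream.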

Crawl and Resume

To run a spider with a persistent crawl state (Scrapy JOBDIR) and optionally skip already-known files, use the resume command:

pixi run resume <spider_name> [--skip-existing]

For example:

pixi run resume eric --skip-existing

Tracking Deleted Articles

To detect previously collected article metadata and downloaded files that are no longer available, run the unavailable_articles spider. It reads from OPEN_IRE_DATABASE_FILE and writes a CSV report under output/.

pixi run resume unavailable_articles
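
After the spider finishes, the report can be inspected with a few lines of Python. The sketch below makes two assumptions that the README does not confirm: that the report sits directly under output/ with a .csv extension, and that the most recently modified file is the one you want. Because the report's column names are not documented here, the hypothetical helpers latest_csv and summarize_csv only return the header row and the number of data rows.

```python
import csv
from pathlib import Path


def latest_csv(directory):
    """Return the most recently modified .csv file in `directory`, or None."""
    candidates = sorted(Path(directory).glob("*.csv"), key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None


def summarize_csv(path):
    """Return (header, row_count) for a CSV file without assuming column names."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        row_count = sum(1 for _ in reader)
    return header, row_count
```

For example, summarize_csv(latest_csv("output")) would give the column names and the number of reported articles, provided at least one CSV file exists in that directory.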

Notebooks

This project includes marimo notebooks for data analysis under notebooks/.

Notebook	Description
`metadata_analysis.py`	Collection stats, repository breakdowns, and text analysis
`unavailable_articles.py`	Re-checks URLs from an unavailable-articles CSV to identify which have recovered

To install the additional libraries needed for the notebooks:

pixi install -e notebooks

To open a notebook in the interactive editor:

pixi run -e dev marimo edit notebooks/metadata_analysis.py

To run a notebook as a read-only app:

pixi run -e dev marimo run notebooks/metadata_analysis.py

Contributing

We welcome contributions! For detailed development setup, including pre-commit hooks, please see CONTRIBUTING.md and our contribution guidelines.
