This repository contains everything needed to run the software, so you just need to clone it to somewhere in your file system:

    git clone https://github.com/uw-ssec/open-ire.git

This project uses the Pixi package manager to install and manage other prerequisites, including a compatible version of Python. The easiest way to install Pixi is with one of these commands:

macOS/Linux:

    curl -fsSL https://pixi.sh/install.sh | sh

Windows:

    powershell -ExecutionPolicy ByPass -c "irm -useb https://pixi.sh/install.ps1 | iex"

For other installation options, visit the Pixi installation guide.

To install the default Pixi environment, which will allow running all of the
provided web-crawling spiders, execute the following command in the directory
where you cloned the open-ire repository:

    cd open-ire
    pixi install

Once the installation finishes, you can run several pre-defined tasks using
the `pixi run <task>` command in the `open-ire` directory.

    pixi run dotenv

This command creates the environment file template `.env` that you then need to
edit to configure your settings. The only required setting is `ENVIRONMENT`,
which needs to be set to either `development` or `production`.

To store collected files in a Microsoft SharePoint Drive, you also need to set your SharePoint credentials:

    SHAREPOINT_TENANT_ID=<your_application_tenant_id>
    SHAREPOINT_CLIENT_ID=<your_application_id>
    SHAREPOINT_SITE_ID=<your_sharepoint_site_id>
    SHAREPOINT_CLIENT_SECRET=<your_application_client_secret>

Alternatively, you can disable the SharePointPipeline in
`src/open_ire/settings/`.
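
Before running a spider, it can help to sanity-check the edited `.env` file. The following is a minimal sketch, not part of open-ire: the helper names `parse_env` and `check_settings` are hypothetical, and treating a partially filled set of SharePoint variables as a mistake is an assumption based on the settings listed above.

```python
# Hypothetical helper, not part of open-ire: sanity-check .env contents.
SHAREPOINT_KEYS = (
    "SHAREPOINT_TENANT_ID",
    "SHAREPOINT_CLIENT_ID",
    "SHAREPOINT_SITE_ID",
    "SHAREPOINT_CLIENT_SECRET",
)
VALID_ENVIRONMENTS = {"development", "production"}


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def check_settings(settings):
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    if settings.get("ENVIRONMENT") not in VALID_ENVIRONMENTS:
        problems.append("ENVIRONMENT must be 'development' or 'production'")
    present = [key for key in SHAREPOINT_KEYS if settings.get(key)]
    # Assumption: the SharePoint variables are only useful as a complete set.
    if present and len(present) < len(SHAREPOINT_KEYS):
        missing = [key for key in SHAREPOINT_KEYS if key not in present]
        problems.append("missing SharePoint settings: " + ", ".join(missing))
    return problems


if __name__ == "__main__":
    sample = "ENVIRONMENT=development\nSHAREPOINT_TENANT_ID=abc\n"
    for problem in check_settings(parse_env(sample)):
        print(problem)
```

To check a real file, pass its contents to `parse_env`, for example `parse_env(open(".env").read())`. Note that this sketch does not detect placeholder values such as `<your_application_id>` that were left unedited.
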
This project includes spiders for crawling repositories using two main methods: a list of keywords or a CSV file of author names.
To run a spider with a custom list of search terms, use the search-terms
command:

    pixi run search-terms <spider_name> "term1,term2,..." [<page>]

For example, to search the `eric` repository:

    pixi run search-terms eric "ocean acidification,coral bleaching"

To search for a single author's name, use the `search-author` (singular)
command:

    pixi run search-author <spider_name> "<author's full name>"

For example:

    pixi run search-author openalex "Michelle Habell-Pallán"

To run a spider that supports searching by author against a list of authors, use
the `search-authors` (plural) command. This requires a CSV file with
`FirstName`, `MiddleNames` (optional), `LastName`, and `Email` columns.
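
As a minimal sketch of the CSV layout described above, the following Python snippet writes a file with the four named columns. The helper name `write_authors_csv`, the sample authors, and the choice to keep the MiddleNames column and leave it blank when an author has no middle name are assumptions for illustration, not part of open-ire.

```python
import csv
import io

# Column names taken from the description above; the order is an assumption.
COLUMNS = ["FirstName", "MiddleNames", "LastName", "Email"]


def write_authors_csv(authors, stream):
    """Write author dicts to a text stream as CSV with a header row."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS)
    writer.writeheader()
    for author in authors:
        # Fill any absent field (typically MiddleNames) with an empty string.
        writer.writerow({column: author.get(column, "") for column in COLUMNS})


if __name__ == "__main__":
    # Placeholder authors, made up for this example.
    authors = [
        {"FirstName": "Jane", "MiddleNames": "Q.", "LastName": "Doe",
         "Email": "jane.doe@example.org"},
        {"FirstName": "John", "LastName": "Roe",
         "Email": "john.roe@example.org"},
    ]
    buffer = io.StringIO()
    write_authors_csv(authors, buffer)
    print(buffer.getvalue())
```

To produce a file on disk instead, open it with `open(path, "w", newline="")` and pass the file object as `stream`; the resulting path is what the command expects as `<path_to_csv>`.
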

    pixi run search-authors <spider_name> "<path_to_csv>"

For example, to search `openalex` for publications by any of the authors in a
list:

    pixi run search-authors openalex "data/authors.csv"

To run a spider with a persistent crawl state (Scrapy `JOBDIR`) and optionally
skip already-known files, use the `resume` command:

    pixi run resume <spider_name> [--skip-existing]

For example:

    pixi run resume eric --skip-existing

To detect previously collected article metadata and downloaded files that are no
longer available, run the `unavailable_articles` spider. It reads from
`OPEN_IRE_DATABASE_FILE` and writes a CSV report under `output/`.

    pixi run resume unavailable_articles

This project includes marimo notebooks for data analysis
under `notebooks/`.

| Notebook | Description |
|---|---|
| `metadata_analysis.py` | Collection stats, repository breakdowns, and text analysis |
| `unavailable_articles.py` | Re-checks URLs from an unavailable-articles CSV to identify which have recovered |

To install the additional libraries needed for the notebooks:

    pixi install -e notebooks

To open a notebook in the interactive editor:

    pixi run -e dev marimo edit notebooks/metadata_analysis.py

To run a notebook as a read-only app:

    pixi run -e dev marimo run notebooks/metadata_analysis.py

We welcome contributions! For detailed development setup, including pre-commit hooks, please see CONTRIBUTING.md and our contribution guidelines: