Commit b048faf

Merge pull request #18 from bladeszasza/15-add-more-cli_options
More CLI options
2 parents 940eb06 + 2ce3f88 commit b048faf

34 files changed

Lines changed: 2210 additions & 510 deletions

.claude/settings.local.json

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+{
+  "permissions": {
+    "allow": [
+      "Bash(git add:*)",
+      "Bash(git reset:*)",
+      "Bash(pylint:*)",
+      "Bash(python -m py_compile:*)"
+    ],
+    "deny": []
+  }
+}

.github/workflows/pylint.yml

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ["3.10", "3.11"]
+        python-version: ["3.10", "3.11", "3.12", "3.13"]
     steps:
     - uses: actions/checkout@v4
     - name: Set up Python ${{ matrix.python-version }}

.gitignore

Whitespace-only changes.

CLAUDE.md

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+SOWLv2 is a command-line tool and Python library for text-prompted object segmentation that combines Google's OWLv2 (open-vocabulary object detector) with Meta's SAM 2 (Segment Anything Model V2) for precise pixel-level segmentation. The tool processes images, video frames, or videos based on natural language prompts.
+
+## Architecture
+
+### Core Pipeline Flow
+1. **Input Processing**: Images/videos → Frame extraction
+2. **Detection**: OWLv2 finds objects matching text prompts → Bounding boxes
+3. **Segmentation**: SAM 2 creates precise masks from bounding boxes
+4. **Output Generation**: Binary masks + visual overlays + videos
+
+### Key Components
+- `cli.py`: Command-line interface entry point
+- `pipeline.py`: Main orchestration pipeline
+- `image_pipeline.py`: Single image processing
+- `video_pipeline.py`: Video processing with temporal tracking
+- `models/owl.py`: OWLv2 wrapper for object detection
+- `models/sam2_wrapper.py`: SAM 2 wrapper for segmentation
+- `utils/`: File system, frame, pipeline, and video utilities
+
+## Development Commands
+
+### Installation
+```bash
+# Development install
+pip install -e .
+
+# Install dependencies
+pip install -r requirements.txt
+```
+
+### Code Quality
+```bash
+# Run linting (matches CI)
+pylint $(git ls-files '*.py')
+```
+
+### Testing
+- Primary testing through Google Colab notebook demonstrations
+- GitHub Actions CI runs pylint on Python 3.10-3.13
+
+### CLI Usage
+```bash
+# Basic usage
+sowlv2-detect --prompt "cat" --input image.jpg --output results/
+
+# Multiple objects
+sowlv2-detect --prompt "cat" "dog" --input video.mp4 --output results/
+
+# With config file
+sowlv2-detect --config config.yaml
+```
+
+## Configuration
+
+- YAML configuration files supported (see `config/config_example.yaml`)
+- Key parameters: `prompt`, `owl_model`, `sam_model`, `threshold`, `fps`, `device`
+- Command-line arguments override config file values
+
+## Output Structure
+
+```
+output_dir/
+├── binary/  # Binary mask images/videos
+│   ├── merged/  # Merged binary masks (all objects combined)
+│   │   ├── 000001_merged_mask.png
+│   │   ├── 000002_merged_mask.png
+│   │   └── ...
+│   └── frames/  # Individual binary masks per object
+│       ├── 000001_obj1_cat_mask.png
+│       ├── 000001_obj2_dog_mask.png
+│       └── ...
+├── overlay/  # RGB overlay images/videos
+│   ├── merged/  # Merged overlays (all objects combined)
+│   │   ├── 000001_merged_overlay.png
+│   │   ├── 000002_merged_overlay.png
+│   │   └── ...
+│   └── frames/  # Individual overlays per object
+│       ├── 000001_obj1_cat_overlay.png
+│       ├── 000001_obj2_dog_overlay.png
+│       └── ...
+└── video/  # Generated videos (for video input)
+    ├── binary/  # Binary mask videos
+    │   ├── merged_mask.mp4
+    │   ├── obj1_cat_mask.mp4
+    │   └── obj2_dog_mask.mp4
+    └── overlay/  # Overlay videos
+        ├── merged_overlay.mp4
+        ├── obj1_cat_overlay.mp4
+        └── obj2_dog_overlay.mp4
+```
+
+### File Naming Convention:
+- **Individual files**: `{frame_num}_obj{obj_id}_{prompt}_mask.png` / `{frame_num}_obj{obj_id}_{prompt}_overlay.png`
+- **Merged files**: `{frame_num}_merged_mask.png` / `{frame_num}_merged_overlay.png`
+- **Videos**: `obj{obj_id}_{prompt}_mask.mp4` / `merged_mask.mp4`
+
+## Dependencies
+
+- Core ML: `torch>=1.13.0`, `transformers>=4.32.1`, `sam2>=1.1.0`
+- Image/Video: `opencv-python>=4.5.5.64`, `Pillow>=9.0.0`
+- Utilities: `pyyaml>=6.0`, `huggingface_hub>=0.15.0`
+- GPU/CPU auto-detection with CUDA support
+
+## Entry Points
+
+- CLI tool: `sowlv2-detect` (defined in pyproject.toml)
+- Python API: Import `sowlv2` modules directly
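
The file naming convention documented in CLAUDE.md above can be sketched as two small helpers. This is a minimal illustration, not the project's actual code: `frame_filename` and `video_filename` are hypothetical names, the six-digit frame padding is inferred from the `000001` examples, and prompts are assumed to be single words (how prompts containing spaces end up in file names is not shown in this commit).

```python
from typing import Optional

VALID_KINDS = ("mask", "overlay")


def frame_filename(frame_num: int, kind: str,
                   obj_id: Optional[int] = None,
                   prompt: Optional[str] = None) -> str:
    """Build a per-frame PNG name (hypothetical helper).

    With obj_id=None the merged pattern is used; otherwise the
    per-object pattern {frame_num}_obj{obj_id}_{prompt}_{kind}.png.
    """
    if kind not in VALID_KINDS:
        raise ValueError(f"kind must be one of {VALID_KINDS}, got {kind!r}")
    frame = f"{frame_num:06d}"  # assumed 6-digit zero padding, e.g. 000001
    if obj_id is None:
        return f"{frame}_merged_{kind}.png"
    return f"{frame}_obj{obj_id}_{prompt}_{kind}.png"


def video_filename(kind: str,
                   obj_id: Optional[int] = None,
                   prompt: Optional[str] = None) -> str:
    """Build a video name; videos carry no frame number (hypothetical helper)."""
    if kind not in VALID_KINDS:
        raise ValueError(f"kind must be one of {VALID_KINDS}, got {kind!r}")
    if obj_id is None:
        return f"merged_{kind}.mp4"
    return f"obj{obj_id}_{prompt}_{kind}.mp4"
```

For example, `frame_filename(1, "mask", obj_id=1, prompt="cat")` produces the `000001_obj1_cat_mask.png` name listed in the output tree, and `video_filename("overlay")` produces `merged_overlay.mp4`.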

README.md

Lines changed: 58 additions & 9 deletions
@@ -23,7 +23,7 @@ TL;DR: SOWLv2: Text-prompted object segmentation using OWLv2 and SAM 2 -->
 <br>
 </p>
 
-SOWLv2 (**S**egmented**OWLv2**) is a powerful command-line tool for **text-prompted object segmentation**. It seamlessly integrates Googles [OWLv2](https://huggingface.co/docs/transformers/en/model_doc/owlv2) open-vocabulary object detector with Metas [SAM 2](https://github.com/facebookresearch/sam2) (Segment Anything Model V2) to precisely segment objects in images, image sequences (frames), or videos based on natural language descriptions.
+SOWLv2 (**S**egmented**OWLv2**) is a powerful command-line tool for **text-prompted object segmentation**. It seamlessly integrates Google's [OWLv2](https://huggingface.co/docs/transformers/en/model_doc/owlv2) open-vocabulary object detector with Meta's [SAM 2](https://github.com/facebookresearch/sam2) (Segment Anything Model V2) to precisely segment objects in images, image sequences (frames), or videos based on natural language descriptions.
 
 Given one or more text prompts (e.g., `"a red bicycle"`, or `"cat" "dog"`) and an input source, SOWLv2 will:
 1. Utilize **OWLv2** to detect bounding boxes for objects matching the text prompt(s), based on the principles from the paper [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683).
@@ -88,8 +88,7 @@ Note: If a single prompt contains spaces, it should be enclosed in quotes (e.g.,
 
 ### Command-Line Options:
 
-| Argument | Description | Default Value |
-|-----------------|------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------|
+/
 | `--prompt` | **(Required)** One or more text queries for object detection (e.g., `"cat"`, or `"dog" "person" "a red car"`). | `None` |
 | `--input` | **(Required)** Path to the input: a single image file, a directory of image frames, or a video file. | `None` |
 | `--output` | Directory where outputs (masks and overlays) will be saved. Created if it doesn't exist. | `output/` |
@@ -98,6 +97,9 @@ Note: If a single prompt contains spaces, it should be enclosed in quotes (e.g.,
 | `--threshold` | (Optional) Detection confidence threshold for OWLv2 (a float between 0 and 1). | `0.1` |
 | `--fps` | (Optional) Frame sampling rate (frames per second) for video inputs. | `24` |
 | `--device` | (Optional) Compute device (`"cuda"` or `"cpu"`). | Auto-detects GPU, else `cpu` |
+| `--no-merged` | (Optional) Disables merged mode. Merged mode (where all masks are combined into a single output [image/video] ) is enabled by default. | Enabled |
+| `--no-binary` | (Optional) Disables binary mask generation. Binary mask output is enabled by default. | Enabled |
+| `--no-overlay` | (Optional) Disables overlay image generation. Overlay image output (original image with masks) is enabled by default. | Enabled |
 | `--config` | (Optional) Path to a YAML configuration file to specify arguments (see [Configuration](#configuration)). Prompts can also be a list in YAML. | `None` |
 
 ### Examples:
@@ -126,13 +128,60 @@ Note: If a single prompt contains spaces, it should be enclosed in quotes (e.g.,
 
 ### Output Structure:
 
-The tool saves results in the specified output directory. For each detected object instance (corresponding to any of the given prompts), SOWLv2 generates:
-* A **binary mask** image (e.g., `imagename_object0_mask.png`): Grayscale PNG where foreground pixels are white (255) and background pixels are black (0). The filename includes a sequential object ID.
-* An **overlay image** (e.g., `imagename_object0_overlay.png`): The original image with the segmentation mask overlaid (typically colored with transparency).
+The tool saves results in the specified output directory with the following structure:
 
-Objects are numbered sequentially (e.g., `object0`, `object1`) in the order they are detected by OWLv2, regardless of which text prompt they matched. For video inputs, output filenames will also include frame identifiers, and separate videos for each object's masks and overlays will be generated (e.g., `obj0_mask_video.mp4`, `obj0_overlay_video.mp4`).
+```
+output_dir/
+├── binary/  # Binary mask images/videos
+│   ├── merged/  # Merged binary masks (all objects combined)
+│   │   ├── 000001_merged_mask.png
+│   │   ├── 000002_merged_mask.png
+│   │   └── ...
+│   └── frames/  # Individual binary masks per object
+│       ├── 000001_obj1_cat_mask.png
+│       ├── 000001_obj2_dog_mask.png
+│       ├── 000002_obj1_cat_mask.png
+│       └── ...
+├── overlay/  # RGB overlay images/videos
+│   ├── merged/  # Merged overlays (all objects combined)
+│   │   ├── 000001_merged_overlay.png
+│   │   ├── 000002_merged_overlay.png
+│   │   └── ...
+│   └── frames/  # Individual overlays per object
+│       ├── 000001_obj1_cat_overlay.png
+│       ├── 000001_obj2_dog_overlay.png
+│       ├── 000002_obj1_cat_overlay.png
+│       └── ...
+└── video/  # Generated videos (for video input)
+    ├── binary/  # Binary mask videos
+    │   ├── merged_mask.mp4  # Merged binary mask video
+    │   ├── obj1_cat_mask.mp4  # Individual object videos
+    │   └── obj2_dog_mask.mp4
+    └── overlay/  # Overlay videos
+        ├── merged_overlay.mp4  # Merged overlay video
+        ├── obj1_cat_overlay.mp4
+        └── obj2_dog_overlay.mp4
+```
+
+#### File Naming Convention:
+
+For each detected object instance, SOWLv2 generates files using the following patterns:
+
+**Individual Object Files:**
+* **Binary masks**: `{frame_num}_obj{obj_id}_{prompt}_mask.png` (e.g., `000001_obj1_cat_mask.png`)
+* **Overlay images**: `{frame_num}_obj{obj_id}_{prompt}_overlay.png` (e.g., `000001_obj1_cat_overlay.png`)
+
+**Merged Files (all objects combined):**
+* **Binary masks**: `{frame_num}_merged_mask.png` (e.g., `000001_merged_mask.png`)
+* **Overlay images**: `{frame_num}_merged_overlay.png` (e.g., `000001_merged_overlay.png`)
+
+**Video Files:**
+* **Individual object videos**: `obj{obj_id}_{prompt}_mask.mp4` / `obj{obj_id}_{prompt}_overlay.mp4`
+* **Merged videos**: `merged_mask.mp4` / `merged_overlay.mp4`
+
+Objects are numbered sequentially (`obj1`, `obj2`, etc.) in the order they are detected by OWLv2, regardless of which text prompt they matched. Frame numbers use 6-digit zero-padding (`000001`, `000002`, etc.).
 
-SOWLv2 automatically assigns a unique color to each detected OWLv2 label, making it easy to visually distinguish different object classes in the output overlays and merged results.
+SOWLv2 automatically assigns a unique color to each detected object class, making it easy to visually distinguish different object types in the output overlays and merged results.
 
 ### <a name="configuration"></a>Configuration File (Optional):
 
@@ -177,7 +226,7 @@ SOWLv2 follows a two-stage pipeline:
 SOWLv2 relies on the following major Python packages:
 * `torch` (PyTorch)
 * `transformers` (for OWLv2 models)
-* `sam2` (Metas SAM 2 package)
+* `sam2` (Meta's SAM 2 package)
 * `opencv-python` (for image and video processing)
 * `numpy`, `Pillow`, `pyyaml`, `huggingface_hub`
 
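
The three `--no-*` switches added to the README options table, together with the rule that command-line arguments override config file values, could be wired up roughly as follows. This is a hedged sketch using only the standard-library `argparse`, not code taken from `cli.py`: `build_parser`, `resolve_options`, and `DEFAULTS` are hypothetical names, the config is passed in as an already-loaded dict instead of being read from YAML, and only a subset of the documented options is included.

```python
import argparse

# Defaults as listed in the README options table (subset, for illustration).
DEFAULTS = {
    "threshold": 0.1,
    "fps": 24,
    "merged": True,
    "binary": True,
    "overlay": True,
}


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; unset options stay None so they can be detected."""
    parser = argparse.ArgumentParser(prog="sowlv2-detect")
    parser.add_argument("--prompt", nargs="+")
    parser.add_argument("--threshold", type=float)
    parser.add_argument("--fps", type=int)
    parser.add_argument("--config")
    # Each --no-* switch stores False into the positive option name.
    # default=None (instead of True) marks "not given on the command line".
    for name in ("merged", "binary", "overlay"):
        parser.add_argument(f"--no-{name}", dest=name,
                            action="store_false", default=None)
    return parser


def resolve_options(cli_args: argparse.Namespace, config: dict) -> dict:
    """Layer the sources: built-in defaults < config file < command line."""
    options = dict(DEFAULTS)
    options.update(config)
    # Only values the user actually passed override the config.
    options.update({k: v for k, v in vars(cli_args).items() if v is not None})
    return options
```

Calling `resolve_options(build_parser().parse_args(["--prompt", "cat", "--no-merged"]), {"threshold": 0.3})` would keep `threshold` at the config value `0.3` while setting `merged` to `False`. Leaving unset flags as `None` is one way to tell "not specified" apart from an explicit value; the real tool may use a different mechanism.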

assets/SOWLv2Multilabel.png

198 KB
