bladeszasza
diff --git a/.claude/settings.local.json b/.claude/settings.local.json
Lines changed: 11 additions & 0 deletions
diff --git a/.github/workflows/pylint.yml b/.github/workflows/pylint.yml
Lines changed: 1 addition & 1 deletion
diff --git a/.gitignore b/.gitignore
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 113 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 58 additions & 9 deletions
diff --git a/assets/SOWLv2Multilabel.png b/assets/SOWLv2Multilabel.png
198 KB
@@ -0,0 +1,11 @@
+{
+  "permissions": {
+    "allow": [
+      "Bash(git add:*)",
+      "Bash(git reset:*)",
+      "Bash(pylint:*)",
+      "Bash(python -m py_compile:*)"
+    ],
+    "deny": []
+  }
+}
@@ -10,7 +10,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ["3.10", "3.11"]
+        python-version: ["3.10", "3.11", "3.12", "3.13"]
     steps:
     - uses: actions/checkout@v4
     - name: Set up Python ${{ matrix.python-version }}
 
@@ -0,0 +1,113 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+SOWLv2 is a command-line tool and Python library for text-prompted object segmentation that combines Google's OWLv2 (open-vocabulary object detector) with Meta's SAM 2 (Segment Anything Model V2) for precise pixel-level segmentation. The tool processes images, video frames, or videos based on natural language prompts.
+
+## Architecture
+
+### Core Pipeline Flow
+1. **Input Processing**: Images/videos → Frame extraction  
+2. **Detection**: OWLv2 finds objects matching text prompts → Bounding boxes
+3. **Segmentation**: SAM 2 creates precise masks from bounding boxes  
+4. **Output Generation**: Binary masks + visual overlays + videos
+
+### Key Components
+- `cli.py`: Command-line interface entry point
+- `pipeline.py`: Main orchestration pipeline
+- `image_pipeline.py`: Single image processing
+- `video_pipeline.py`: Video processing with temporal tracking
+- `models/owl.py`: OWLv2 wrapper for object detection
+- `models/sam2_wrapper.py`: SAM 2 wrapper for segmentation
+- `utils/`: File system, frame, pipeline, and video utilities
+
+## Development Commands
+
+### Installation
+```bash
+# Development install
+pip install -e .
+
+# Install dependencies
+pip install -r requirements.txt
+```
+
+### Code Quality
+```bash
+# Run linting (matches CI)
+pylint $(git ls-files '*.py')
+```
+
+### Testing
+- Primary testing through Google Colab notebook demonstrations
+- GitHub Actions CI runs pylint on Python 3.10-3.13
+
+### CLI Usage
+```bash
+# Basic usage
+sowlv2-detect --prompt "cat" --input image.jpg --output results/
+
+# Multiple objects
+sowlv2-detect --prompt "cat" "dog" --input video.mp4 --output results/
+
+# With config file
+sowlv2-detect --config config.yaml
+```
+
+## Configuration
+
+- YAML configuration files supported (see `config/config_example.yaml`)
+- Key parameters: `prompt`, `owl_model`, `sam_model`, `threshold`, `fps`, `device`
+- Command-line arguments override config file values
+
+## Output Structure
+
+```
+output_dir/
+├── binary/                  # Binary mask images/videos
+│   ├── merged/             # Merged binary masks (all objects combined)
+│   │   ├── 000001_merged_mask.png
+│   │   ├── 000002_merged_mask.png
+│   │   └── ...
+│   └── frames/             # Individual binary masks per object
+│       ├── 000001_obj1_cat_mask.png
+│       ├── 000001_obj2_dog_mask.png
+│       └── ...
+├── overlay/                # RGB overlay images/videos
+│   ├── merged/            # Merged overlays (all objects combined)
+│   │   ├── 000001_merged_overlay.png
+│   │   ├── 000002_merged_overlay.png
+│   │   └── ...
+│   └── frames/            # Individual overlays per object
+│       ├── 000001_obj1_cat_overlay.png
+│       ├── 000001_obj2_dog_overlay.png
+│       └── ...
+└── video/                 # Generated videos (for video input)
+    ├── binary/            # Binary mask videos
+    │   ├── merged_mask.mp4
+    │   ├── obj1_cat_mask.mp4
+    │   └── obj2_dog_mask.mp4
+    └── overlay/           # Overlay videos
+        ├── merged_overlay.mp4
+        ├── obj1_cat_overlay.mp4
+        └── obj2_dog_overlay.mp4
+```
+
+### File Naming Convention:
+- **Individual files**: `{frame_num}_obj{obj_id}_{prompt}_mask.png` / `{frame_num}_obj{obj_id}_{prompt}_overlay.png`
+- **Merged files**: `{frame_num}_merged_mask.png` / `{frame_num}_merged_overlay.png`
+- **Videos**: `obj{obj_id}_{prompt}_mask.mp4` / `merged_mask.mp4`
+
+## Dependencies
+
+- Core ML: `torch>=1.13.0`, `transformers>=4.32.1`, `sam2>=1.1.0`
+- Image/Video: `opencv-python>=4.5.5.64`, `Pillow>=9.0.0`
+- Utilities: `pyyaml>=6.0`, `huggingface_hub>=0.15.0`
+- GPU/CPU auto-detection with CUDA support
+
+## Entry Points
+
+- CLI tool: `sowlv2-detect` (defined in pyproject.toml)
+- Python API: Import `sowlv2` modules directly
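
The file naming convention documented in the CLAUDE.md hunk above can be sketched in a few lines of Python. This is only an illustration of the documented patterns, not the project's actual code: the function names are hypothetical, and the `safe_prompt` sanitizer is an assumption, since the diff does not say how prompts containing spaces are mapped to filenames.

```python
import re
from typing import Optional


def frame_stem(frame_num: int) -> str:
    """Six-digit, zero-padded frame number, e.g. 1 -> '000001'."""
    return f"{frame_num:06d}"


def safe_prompt(prompt: str) -> str:
    """Hypothetical sanitizer (assumption): collapse non-alphanumerics to '_'."""
    return re.sub(r"[^A-Za-z0-9]+", "_", prompt.strip()).strip("_")


def object_filename(frame_num: int, obj_id: int, prompt: str, kind: str) -> str:
    """Per-object image: {frame_num}_obj{obj_id}_{prompt}_{kind}.png"""
    return f"{frame_stem(frame_num)}_obj{obj_id}_{safe_prompt(prompt)}_{kind}.png"


def merged_filename(frame_num: int, kind: str) -> str:
    """Merged image: {frame_num}_merged_{kind}.png"""
    return f"{frame_stem(frame_num)}_merged_{kind}.png"


def video_filename(kind: str, obj_id: Optional[int] = None,
                   prompt: Optional[str] = None) -> str:
    """Merged video when no object is given, otherwise a per-object video."""
    if obj_id is None:
        return f"merged_{kind}.mp4"
    return f"obj{obj_id}_{safe_prompt(prompt or '')}_{kind}.mp4"


print(object_filename(1, 1, "cat", "mask"))   # 000001_obj1_cat_mask.png
print(merged_filename(2, "overlay"))          # 000002_merged_overlay.png
print(video_filename("overlay", 2, "dog"))    # obj2_dog_overlay.mp4
```

Here `kind` is either `mask` or `overlay`, matching the two suffixes used throughout the documented output tree.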
@@ -23,7 +23,7 @@ TL;DR: SOWLv2: Text-prompted object segmentation using OWLv2 and SAM 2 -->
   <br>
 </p>
 
-SOWLv2 (**S**egmented**OWLv2**) is a powerful command-line tool for **text-prompted object segmentation**. It seamlessly integrates Google’s [OWLv2](https://huggingface.co/docs/transformers/en/model_doc/owlv2) open-vocabulary object detector with Meta’s [SAM 2](https://github.com/facebookresearch/sam2) (Segment Anything Model V2) to precisely segment objects in images, image sequences (frames), or videos based on natural language descriptions.
+SOWLv2 (**S**egmented**OWLv2**) is a powerful command-line tool for **text-prompted object segmentation**. It seamlessly integrates Google's [OWLv2](https://huggingface.co/docs/transformers/en/model_doc/owlv2) open-vocabulary object detector with Meta's [SAM 2](https://github.com/facebookresearch/sam2) (Segment Anything Model V2) to precisely segment objects in images, image sequences (frames), or videos based on natural language descriptions.
 
 Given one or more text prompts (e.g., `"a red bicycle"`, or `"cat" "dog"`) and an input source, SOWLv2 will:
 1.  Utilize **OWLv2** to detect bounding boxes for objects matching the text prompt(s), based on the principles from the paper [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683).
@@ -88,8 +88,7 @@ Note: If a single prompt contains spaces, it should be enclosed in quotes (e.g.,
 
 ### Command-Line Options:
 
-| Argument        | Description                                                                                                                        | Default Value                        |
-|-----------------|------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------|
+/
 | `--prompt`      | **(Required)** One or more text queries for object detection (e.g., `"cat"`, or `"dog" "person" "a red car"`).                       | `None`                               |
 | `--input`       | **(Required)** Path to the input: a single image file, a directory of image frames, or a video file.                                | `None`                               |
 | `--output`      | Directory where outputs (masks and overlays) will be saved. Created if it doesn't exist.                                           | `output/`                            |
@@ -98,6 +97,9 @@ Note: If a single prompt contains spaces, it should be enclosed in quotes (e.g.,
 | `--threshold`   | (Optional) Detection confidence threshold for OWLv2 (a float between 0 and 1).                                                     | `0.1`                                |
 | `--fps`         | (Optional) Frame sampling rate (frames per second) for video inputs.                                                               | `24`                                 |
 | `--device`      | (Optional) Compute device (`"cuda"` or `"cpu"`).                                                                                     | Auto-detects GPU, else `cpu`         |
+| `--no-merged`   | (Optional) Disables merged output. Merged output (all masks combined into a single image or video) is enabled by default.                   | Enabled                              |
+| `--no-binary`   | (Optional) Disables binary mask generation. Binary mask output is enabled by default.                                                  | Enabled                              |
+| `--no-overlay`  | (Optional) Disables overlay image generation. Overlay image output (original image with masks) is enabled by default.                  | Enabled                              |
 | `--config`      | (Optional) Path to a YAML configuration file to specify arguments (see [Configuration](#configuration)). Prompts can also be a list in YAML. | `None`                               |
 
 ### Examples:
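
The three `--no-*` switches added in the hunk above read naturally as "on by default, off when passed". One common way to get that behaviour with `argparse` is the `store_false` action. The sketch below is an assumption about how such flags could be wired, not a copy of the project's `cli.py`; the `dest` names `merged`, `binary`, and `overlay` are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with three output toggles that default to enabled."""
    parser = argparse.ArgumentParser(prog="sowlv2-detect")
    # store_false: the attribute defaults to True and flips to False
    # only when the flag is present on the command line.
    parser.add_argument("--no-merged", dest="merged", action="store_false",
                        help="Disable merged outputs (all objects combined).")
    parser.add_argument("--no-binary", dest="binary", action="store_false",
                        help="Disable binary mask outputs.")
    parser.add_argument("--no-overlay", dest="overlay", action="store_false",
                        help="Disable overlay outputs.")
    return parser


args = build_parser().parse_args(["--no-overlay"])
print(args.merged, args.binary, args.overlay)  # True True False
```

With this shape, the "Enabled" entries in the default column describe the output type rather than the flag itself: omitting `--no-binary` leaves binary masks on.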
@@ -126,13 +128,60 @@ Note: If a single prompt contains spaces, it should be enclosed in quotes (e.g.,
 
 ### Output Structure:
 
-The tool saves results in the specified output directory. For each detected object instance (corresponding to any of the given prompts), SOWLv2 generates:
-*   A **binary mask** image (e.g., `imagename_object0_mask.png`): Grayscale PNG where foreground pixels are white (255) and background pixels are black (0). The filename includes a sequential object ID.
-*   An **overlay image** (e.g., `imagename_object0_overlay.png`): The original image with the segmentation mask overlaid (typically colored with transparency).
+The tool saves results in the specified output directory with the following structure:
 
-Objects are numbered sequentially (e.g., `object0`, `object1`) in the order they are detected by OWLv2, regardless of which text prompt they matched. For video inputs, output filenames will also include frame identifiers, and separate videos for each object's masks and overlays will be generated (e.g., `obj0_mask_video.mp4`, `obj0_overlay_video.mp4`).
+```
+output_dir/
+├── binary/                  # Binary mask images/videos
+│   ├── merged/             # Merged binary masks (all objects combined)
+│   │   ├── 000001_merged_mask.png
+│   │   ├── 000002_merged_mask.png
+│   │   └── ...
+│   └── frames/             # Individual binary masks per object
+│       ├── 000001_obj1_cat_mask.png
+│       ├── 000001_obj2_dog_mask.png
+│       ├── 000002_obj1_cat_mask.png
+│       └── ...
+├── overlay/                # RGB overlay images/videos
+│   ├── merged/            # Merged overlays (all objects combined)
+│   │   ├── 000001_merged_overlay.png
+│   │   ├── 000002_merged_overlay.png
+│   │   └── ...
+│   └── frames/            # Individual overlays per object
+│       ├── 000001_obj1_cat_overlay.png
+│       ├── 000001_obj2_dog_overlay.png
+│       ├── 000002_obj1_cat_overlay.png
+│       └── ...
+└── video/                 # Generated videos (for video input)
+    ├── binary/            # Binary mask videos
+    │   ├── merged_mask.mp4     # Merged binary mask video
+    │   ├── obj1_cat_mask.mp4   # Individual object videos
+    │   └── obj2_dog_mask.mp4
+    └── overlay/           # Overlay videos
+        ├── merged_overlay.mp4  # Merged overlay video
+        ├── obj1_cat_overlay.mp4
+        └── obj2_dog_overlay.mp4
+```
+
+#### File Naming Convention:
+
+For each detected object instance, SOWLv2 generates files using the following patterns:
+
+**Individual Object Files:**
+*   **Binary masks**: `{frame_num}_obj{obj_id}_{prompt}_mask.png` (e.g., `000001_obj1_cat_mask.png`)
+*   **Overlay images**: `{frame_num}_obj{obj_id}_{prompt}_overlay.png` (e.g., `000001_obj1_cat_overlay.png`)
+
+**Merged Files (all objects combined):**
+*   **Binary masks**: `{frame_num}_merged_mask.png` (e.g., `000001_merged_mask.png`)
+*   **Overlay images**: `{frame_num}_merged_overlay.png` (e.g., `000001_merged_overlay.png`)
+
+**Video Files:**
+*   **Individual object videos**: `obj{obj_id}_{prompt}_mask.mp4` / `obj{obj_id}_{prompt}_overlay.mp4`
+*   **Merged videos**: `merged_mask.mp4` / `merged_overlay.mp4`
+
+Objects are numbered sequentially (`obj1`, `obj2`, etc.) in the order they are detected by OWLv2, regardless of which text prompt they matched. Frame numbers use 6-digit zero-padding (`000001`, `000002`, etc.).
 
-SOWLv2 automatically assigns a unique color to each detected OWLv2 label, making it easy to visually distinguish different object classes in the output overlays and merged results.
+SOWLv2 automatically assigns a unique color to each detected object class, making it easy to visually distinguish different object types in the output overlays and merged results.
 
 ### <a name="configuration"></a>Configuration File (Optional):
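
The CLAUDE.md added in this PR states that command-line arguments override config file values. A minimal sketch of that precedence (defaults, then config file, then CLI) follows. The function name `resolve_options` is hypothetical, plain dictionaries stand in for the parsed YAML and the parsed arguments, and treating `None` as "not given on the command line" is an assumption. The default values are taken from the options table.

```python
from typing import Any, Dict, Optional

# Defaults as listed in the README options table.
DEFAULTS: Dict[str, Any] = {"output": "output/", "threshold": 0.1, "fps": 24}


def resolve_options(cli: Dict[str, Any],
                    config: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge option sources so that CLI values win over config values."""
    resolved = dict(DEFAULTS)
    resolved.update(config or {})
    # Assumption: a CLI value of None means the flag was not passed.
    resolved.update({k: v for k, v in cli.items() if v is not None})
    # The YAML prompt may be a single string or a list; normalize to a list.
    prompt = resolved.get("prompt")
    if isinstance(prompt, str):
        resolved["prompt"] = [prompt]
    return resolved


config = {"prompt": "cat", "fps": 12, "threshold": 0.2}
cli = {"threshold": 0.3, "fps": None}
print(resolve_options(cli, config))
```

In this example the threshold comes from the command line, `fps` from the config file, and `output` from the defaults.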
 
@@ -177,7 +226,7 @@ SOWLv2 follows a two-stage pipeline:
 SOWLv2 relies on the following major Python packages:
 *   `torch` (PyTorch)
 *   `transformers` (for OWLv2 models)
-*   `sam2` (Meta’s SAM 2 package)
+*   `sam2` (Meta's SAM 2 package)
 *   `opencv-python` (for image and video processing)
 *   `numpy`, `Pillow`, `pyyaml`, `huggingface_hub`
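
The `torch` dependency listed above also backs the `--device` default described in the options table (auto-detect GPU, else `cpu`). A hedged sketch of that selection logic is shown here. `pick_device` is a hypothetical helper, and the fallback to `cpu` when `torch` is not installed is an assumption added only so the sketch runs anywhere.

```python
from typing import Optional


def pick_device(requested: Optional[str] = None) -> str:
    """Return the requested device, or auto-detect CUDA and fall back to CPU."""
    if requested is not None:
        return requested
    try:
        import torch  # imported lazily so the sketch works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device("cpu"))  # cpu
print(pick_device())
```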