SeregaCodit
diff --git a/README.MD b/README.MD
Lines changed: 68 additions & 58 deletions
diff --git a/const_utils/annotation.py b/const_utils/annotation.py
Lines changed: 51 additions & 51 deletions
diff --git a/const_utils/arguments.py b/const_utils/arguments.py
Lines changed: 2 additions & 1 deletion
diff --git a/const_utils/default_values.py b/const_utils/default_values.py
Lines changed: 15 additions & 3 deletions
diff --git a/const_utils/parser_help.py b/const_utils/parser_help.py
Lines changed: 4 additions & 1 deletion
diff --git a/docs/api/yolo_voc_converter.md b/docs/api/yolo_voc_converter.md
Lines changed: 1 addition & 0 deletions
@@ -1,58 +1,68 @@
-# DataForge
-
-A simple way to automate working with datasets. You can set a time delay for automatic execution of your command.
-
-if you don’t want the command works in a cycle, just don't use "-r" argument. And it will be executed for one time.
-
-
-## Available commands
-- **move** - move files from source directory to target directory
-- **slice** - slice video files to images from the source directory to the target directory. Also, you can set flag "--remove" or "-rm" for deleting a source video file after slicing
-
-- **delete** - delete files that match patterns from source directory
-- **dedup** - find duplicates in source directory that matches a pattern. An image means a duplicate if it's hash has lower 
-Hamming distance with comparing image hash than threshold value. The threshold value setups in percentage and must be in range [0, 100]. Pay attention to core_size parameter: the lower value makes details at photo less important, and the higher value makes details mach important while comparing information at images. It’s implemented only dHash comparing method for now.
-- **clean-annotations** - find annotation files in directory that doesn't have corresponding files
-- **convert-annotations** - converts annotations from source format to destination format
-
-#### to see command syntax and arguments use:
-    python data_forge.py <command> -h
-
-## How to use:
-clone git repository:
-
-    git clone https://github.com/SeregaCodit/AutoFileManager.git
-
-go to project directory:
-
-    cd path_to_project
-    
-create virtual environment and activate it:
-    
-    python -m venv .venv
-    
-install requirements :
-    
-    pip install -r requirements.txt
-    
-read the --help command for learn more about available commands and arguments:
-    
-for check available commands
-
-    python data_forge.py --help 
-
-for check the command usage and available arguments
-
-    python data_forge.py {command} --help
-
-
-## What else?
-
-For more comfortable using FileManager with multiple tasks you can create an .sh file or modify [strat_all_tasks.sh](https://github.com/SeregaCodit/DataForge/blob/main/start_all_tasks.sh) with list of your commands. And run all of them just by one simple command:
-
-    bash path_to_file/start_all_tasks.sh
-
-for stop executing of all commands use:
-
-    pkill -f data_forge.py
-
+# DataForge 
+**DataForge** is a high-performance CLI tool designed to automate the preparation and management of machine learning datasets. It helps you transform raw and dirty data (like videos and unsorted images) into clean, balanced, and ready-to-train datasets with minimal effort.
+
+📖 [Read the full documentation here](https://seregacodit.github.io/DataForge/)
+### Key Features
+* **Parallel Processing:** uses multiprocessing to handle thousands of files quickly.
+* **Vectorized Calculations:** employs NumPy for ultra-fast image comparison.
+* **Smart Caching:** incremental caching (MD5-based) allows working with large datasets on NAS without re-calculating everything.
+* **Config:** Built with Pydantic v2 for safe and flexible settings via JSON or CLI.
+
+---
+
+### Available Commands
+
+* **`move`** — move files from source to target directory based on patterns.
+* **`slice`** — convert video files into sequences of images. Use `--remove` to delete the source video after a successful slice.
+* **`delete`** — safely remove files matching specific patterns.
+* **`dedup`** — find and remove visual duplicates using **dHash**.
+    * *Threshold:* maximum Hamming distance between two image hashes, as a percentage (0-100); pairs below it count as duplicates.
+    * *Core Size:* higher values (e.g., 32, 64) detect small changes (like a moving car), lower values (e.g., 8) ignore noise.
+* **`clean-annotations`** — automatically find and delete "orphan" annotation files (XML/TXT) that no longer have a corresponding image.
+* **`convert-annotations`** — convert dataset labels between formats (e.g., **Pascal VOC** to **YOLO**).
+
+---
+
+### Automation & Intervals
+By default, commands run once. If you want to monitor a folder and process files as they appear, use the repeat flag:
+* Use **`-r`** to run the command in a cycle.
+* Set the delay between cycles with **`-s`** (seconds).
+
+---
+
+### Quick Start
+
+1. **Clone the repository:**
+```bash
+git clone https://github.com/SeregaCodit/DataForge.git
+cd DataForge
+```
+
+2. **Setup environment:**
+```bash
+python -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+```
+
+3. **Check usage:**
+```bash
+python data_forge.py --help             # See all commands
+python data_forge.py {command} --help   # See arguments for a specific command
+```
+
+---
+
+### Workflow
+For multiple tasks, you can modify `start_all_tasks.sh` and run them all in the background:
+```bash
+bash start_all_tasks.sh
+```
+To stop all running DataForge processes:
+```bash
+pkill -f data_forge.py
+```
+
+### Configuration
+You can manage all default settings in `config.json`. DataForge follows this priority:
+**CLI Arguments > config.json > Internal Defaults.**
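The priority order stated at the end of the new README can be sketched as a layered lookup. This is only an illustration, not DataForge's actual loader: the `resolve_settings` helper and the `INTERNAL_DEFAULTS` dictionary are hypothetical names, and only the ordering (CLI arguments, then `config.json`, then internal defaults) comes from the README.

```python
import json
from collections import ChainMap

# Hypothetical internal defaults; the values mirror the Field defaults in this diff.
INTERNAL_DEFAULTS = {"hash_threshold": 10, "core_size": 8, "n_jobs": 2}


def resolve_settings(cli_args: dict, config_text: str) -> dict:
    """Merge settings so that CLI arguments > config.json > internal defaults."""
    file_settings = json.loads(config_text) if config_text else {}
    # Drop CLI options the user did not pass (assumed here to be None),
    # otherwise a None would shadow the values from the lower layers.
    explicit_cli = {k: v for k, v in cli_args.items() if v is not None}
    # ChainMap looks keys up left to right, so the first mapping wins.
    return dict(ChainMap(explicit_cli, file_settings, INTERNAL_DEFAULTS))


settings = resolve_settings(
    cli_args={"core_size": 16, "n_jobs": None},
    config_text='{"core_size": 32, "hash_threshold": 5}',
)
print(settings)
```

Here `core_size` comes from the CLI, `hash_threshold` from the config text, and `n_jobs` falls back to the default because it was not passed explicitly.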
@@ -1,51 +1,51 @@
-from dataclasses import dataclass
-from pathlib import Path
-from typing import Tuple, Dict, Optional
-
-from logger.log_level_mapping import LevelMapping
-from logger.logger import LoggerConfigurator
-
-
-class ObjectAnnotation:
-    def __init__(self, log_level: str = LevelMapping.debug, log_path: Optional[Path] = None,  **kwargs):
-        self.imsize: Tuple[int, int] = kwargs.get("imsize")
-        self.name: str = kwargs.get("name")
-        self.pose: str = kwargs.get("pose", 'Unspecified')
-        self.truncated: int = kwargs.get("truncated", 0)
-        self.difficult: int = kwargs.get("difficult", 0)
-        self.bndbox: Dict[str, int] = kwargs.get("bndbox", {})
-        self.width: int = None
-        self.height: int = None
-        self.x_center: int = None
-        self.y_center: int = None
-        self.area: int = None
-        self.aspect_ratio: int = None
-        self.relative_area: float = None
-
-        self.logger = LoggerConfigurator.setup(
-            name=self.__class__.__name__,
-            log_level=log_level,
-            log_path=Path(log_path) / f"{self.__class__.__name__}.log" if log_path else None
-        )
-
-    @property
-    def area(self) -> int:
-        return self._area
-
-    @area.setter
-    def area(self, value: int) -> None:
-        if isinstance(value, int):
-            self._area = value
-        else:
-            try:
-                self._area = int(float(value))
-            except TypeError as e:
-                error_text = f"Area must be an integer, got {value}"
-                self.logger.warning(error_text)
-                raise TypeError(e)
-
-    @property
-    def width(self) -> int:
-        return self._width
-
-
+# from dataclasses import dataclass
+# from pathlib import Path
+# from typing import Tuple, Dict, Optional
+#
+# from logger.log_level_mapping import LevelMapping
+# from logger.logger import LoggerConfigurator
+#
+#
+# class ObjectAnnotation:
+#     def __init__(self, log_level: str = LevelMapping.debug, log_path: Optional[Path] = None,  **kwargs):
+#         self.imsize: Tuple[int, int] = kwargs.get("imsize")
+#         self.name: str = kwargs.get("name")
+#         self.pose: str = kwargs.get("pose", 'Unspecified')
+#         self.truncated: int = kwargs.get("truncated", 0)
+#         self.difficult: int = kwargs.get("difficult", 0)
+#         self.bndbox: Dict[str, int] = kwargs.get("bndbox", {})
+#         self.width: int = None
+#         self.height: int = None
+#         self.x_center: int = None
+#         self.y_center: int = None
+#         self.area: int = None
+#         self.aspect_ratio: int = None
+#         self.relative_area: float = None
+#
+#         self.logger = LoggerConfigurator.setup(
+#             name=self.__class__.__name__,
+#             log_level=log_level,
+#             log_path=Path(log_path) / f"{self.__class__.__name__}.log" if log_path else None
+#         )
+#
+#     @property
+#     def area(self) -> int:
+#         return self._area
+#
+#     @area.setter
+#     def area(self, value: int) -> None:
+#         if isinstance(value, int):
+#             self._area = value
+#         else:
+#             try:
+#                 self._area = int(float(value))
+#             except TypeError as e:
+#                 error_text = f"Area must be an integer, got {value}"
+#                 self.logger.warning(error_text)
+#                 raise TypeError(e)
+#
+#     @property
+#     def width(self) -> int:
+#         return self._width
+#
+#
@@ -5,7 +5,6 @@ class Arguments:
     """Command arguments"""
     src: str = "src"
     dst: str = "--dst"
-
     pattern: str = "--pattern"
     p: str = "-p"
     repeat: str = "--repeat"
@@ -32,3 +31,5 @@ class Arguments:
     a_suffix: str = "--a_suffix"
     a_source: str = "--a_source"
     destination_type: str = "--destination-type"
+    img_path: str = "--img_path"
+    extensions: str = "--ext"
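For orientation, the two new flag names could be wired into an `argparse` parser roughly as follows. This is a sketch under assumptions: the diff only shows the flag strings (`--img_path`, `--ext`) and their help texts, so the `build_parser` helper, the `nargs="+"` choice and the `dest` name are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Toy parser exposing only the two flags added in this diff."""
    parser = argparse.ArgumentParser(prog="data_forge.py")
    parser.add_argument(
        "--img_path",
        default=None,
        help="Path to dataset images directory",
    )
    parser.add_argument(
        "--ext",
        dest="extensions",
        nargs="+",  # assumption: several extensions may follow one flag
        default=None,
        help="File extensions used to build file whitelists",
    )
    return parser


args = build_parser().parse_args(["--img_path", "./images", "--ext", ".jpg", ".png"])
print(args.img_path, args.extensions)
```

Leaving both defaults as `None` lets a later layer (config file or internal defaults) supply the value when the flag is omitted.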
@@ -1,7 +1,7 @@
 import json
 import multiprocessing
 
-from typing import Union, Tuple, Optional
+from typing import Union, Tuple, Optional, List
 
 from pydantic import Field, field_validator
 from pydantic_settings import BaseSettings, SettingsConfigDict
@@ -33,14 +33,15 @@ class AppSettings(BaseSettings):
     filetype: str = Field(default=Constants.image)
     method: str = Field(default=Constants.dhash)
     hash_threshold: int = Field(default=10, ge=0, le=100)
-    confirm_choice: tuple = Field(default=("delete", "вудуеу"))
-    core_size: int = Field(default=8, ge=8) # can this be constrained to a power of two?
+    confirm_choice: tuple = Field(default=("delete",))
+    core_size: int = Field(default=8, ge=8)
     n_jobs: int = Field(default=2, ge=1, le=multiprocessing.cpu_count())
     cache_file_path: Path = Field(default=Path("./cache"))
     cache_name: Optional[Path] = Field(default=None)
     a_suffix: Tuple[str, ...] = Field(default_factory=tuple)
     a_source: Optional[Path] = Field(default=None)
     destination_type: Optional[str] = Field(default=None)
+    extensions: Tuple[str, ...] = Field(default=(".jpg", ".jpeg", ".png"))
 
     @field_validator('core_size')
     @classmethod
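To make the roles of `hash_threshold` and `core_size` concrete, here is a minimal pure-Python sketch of dHash comparison. It assumes the image has already been converted to grayscale and downscaled to `core_size` rows by `core_size + 1` columns (that step is omitted), and it assumes the threshold is a strict upper bound on the Hamming distance expressed as a percentage. The function names are hypothetical and this is not the project's implementation, which the README describes as NumPy-based.

```python
from typing import List, Sequence


def dhash_bits(gray: Sequence[Sequence[int]]) -> List[int]:
    """Difference hash: one bit per pair of horizontally adjacent pixels.

    A bit is 1 when a pixel is brighter than its right-hand neighbour.
    """
    return [
        1 if row[x] > row[x + 1] else 0
        for row in gray
        for x in range(len(row) - 1)
    ]


def hamming_percent(a: Sequence[int], b: Sequence[int]) -> float:
    """Share of differing bits between two equal-length hashes, in percent."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    differing = sum(x != y for x, y in zip(a, b))
    return 100.0 * differing / len(a)


def is_duplicate(a: Sequence[int], b: Sequence[int], hash_threshold: int = 10) -> bool:
    """Assumed rule: duplicates differ by less than hash_threshold percent."""
    return hamming_percent(a, b) < hash_threshold


# Toy 2x3 grids (a real run would use core_size >= 8).
first = dhash_bits([[10, 20, 30], [30, 20, 10]])
second = dhash_bits([[10, 20, 30], [30, 20, 25]])
distance = hamming_percent(first, second)
print(first, second, distance)
```

A larger `core_size` produces more bits per image, so small local changes flip a measurable share of the hash, which matches the README's description of higher values detecting small changes.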
@@ -68,6 +69,17 @@ def ensure_n_jobs(cls, value: Union[int, str]) -> int:
         else:
             return value
 
+    @field_validator("extensions")
+    @classmethod
+    def ensure_extensions(cls, value: Union[str, List[str]]) -> Tuple[str, ...]:
+        # Runs after field validation, so value is normally already a tuple.
+        if isinstance(value, tuple):
+            return value
+        # Wrap a bare string so tuple() does not split it into characters.
+        if isinstance(value, str):
+            return (value,)
+        return tuple(value)
+
     @classmethod
     def load_config(cls, config_path: Path = Constants.config_file) -> "AppSettings":
 
 
@@ -33,4 +33,7 @@ class HelpStrings:
     a_suffix: str = "A suffix pattern for annotations"
     a_source: str = ("A source directory to annotations. If None - that means annotations are in the same folder with"
                      " images")
-    destination_type: str = "A type of destination annotation format"
+    destination_type: str = "A type of destination annotation format"
+    img_path: str = "Path to dataset images directory"
+    extensions: str = ("A tuple of file extensions that will be used as pattern for building file whitelists for "
+                       "converting from yolo to other formats")
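The `extensions` help string describes building a file whitelist from a tuple of extensions. A minimal sketch of that idea follows; the `build_image_whitelist` helper is hypothetical, and the case-insensitive suffix match is an assumption rather than something the diff confirms.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def build_image_whitelist(img_path: Path, extensions: Iterable[str]) -> List[Path]:
    """Return files directly inside img_path whose suffix is in extensions."""
    allowed = {ext.lower() for ext in extensions}
    return sorted(
        p for p in Path(img_path).iterdir()
        if p.is_file() and p.suffix.lower() in allowed
    )


# Small demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpg", "b.PNG", "notes.txt"):
        (Path(tmp) / name).touch()
    found_names = [
        p.name for p in build_image_whitelist(Path(tmp), (".jpg", ".jpeg", ".png"))
    ]

print(found_names)
```

Because `Path.suffix` includes the leading dot, each entry in the extensions tuple must be written with the dot and without stray punctuation for the match to succeed.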
@@ -0,0 +1 @@
+::: tools.annotation_converter.converter.yolo_voc_converter.YoloVocConverter