Commit 0178a5e

Merge branch 'main' of github.com:SeregaCodit/AutoFileManager
2 parents 06f3192 + d41ea2b commit 0178a5e

18 files changed

Lines changed: 712 additions & 196 deletions

README.MD

Lines changed: 68 additions & 58 deletions
Original file line number | Diff line number | Diff line change
@@ -1,58 +1,68 @@
1-
# DataForge
2-
3-
A simple way to automate working with datasets. You can set a time delay for automatic execution of your command.
4-
5-
if you don’t want the command works in a cycle, just don't use "-r" argument. And it will be executed for one time.
6-
7-
8-
## Available commands
9-
- **move** - move files from source directory to target directory
10-
- **slice** - slice video files to images from the source directory to the target directory. Also, you can set flag "--remove" or "-rm" for deleting a source video file after slicing
11-
12-
- **delete** - delete files that match patterns from source directory
13-
- **dedup** - find duplicates in source directory that matches a pattern. An image means a duplicate if it's hash has lower
14-
Hamming distance with comparing image hash than threshold value. The threshold value setups in percentage and must be in range [0, 100]. Pay attention to core_size parameter: the lower value makes details at photo less important, and the higher value makes details mach important while comparing information at images. It’s implemented only dHash comparing method for now.
15-
- **clean-annotations** - find annotation files in directory that doesn't have corresponding files
16-
- **convert-annotations** - converts annotations from source format to destination format
17-
18-
#### to see command syntax and arguments use:
19-
python data_forge.py <command> -h
20-
21-
## How to use:
22-
clone git repository:
23-
24-
git clone https://github.com/SeregaCodit/AutoFileManager.git
25-
26-
go to project directory:
27-
28-
cd path_to_project
29-
30-
create virtual environment and activate it:
31-
32-
python -m venv .venv
33-
34-
install requirements :
35-
36-
pip install -r requirements.txt
37-
38-
read the --help command for learn more about available commands and arguments:
39-
40-
for check available commands
41-
42-
python data_forge.py --help
43-
44-
for check the command usage and available arguments
45-
46-
python data_forge.py {command} --help
47-
48-
49-
## What else?
50-
51-
For more comfortable using FileManager with multiple tasks you can create an .sh file or modify [strat_all_tasks.sh](https://github.com/SeregaCodit/DataForge/blob/main/start_all_tasks.sh) with list of your commands. And run all of them just by one simple command:
52-
53-
bash path_to_file/start_all_tasks.sh
54-
55-
for stop executing of all commands use:
56-
57-
pkill -f data_forge.py
58-
1+
# DataForge
2+
**DataForge** is a high-performance CLI tool designed to automate the preparation and management of machine learning datasets. It helps you transform raw and dirty data (like videos and unsorted images) into clean, balanced, and ready-to-train datasets with minimal effort.
3+
4+
📖 [Read the full documentation here](https://seregacodit.github.io/DataForge/)
5+
### Key Features
6+
* **Parallel Processing:** uses multiprocessing to handle thousands of files quickly.
7+
* **Vectorized Calculations:** employs NumPy for ultra-fast image comparison.
8+
* **Smart Caching:** incremental caching (MD5-based) allows working with large datasets on NAS without re-calculating everything.
9+
* **Config:** Built with Pydantic v2 for safe and flexible settings via JSON or CLI.
10+
11+
---
12+
13+
### Available Commands
14+
15+
* **`move`** — move files from source to target directory based on patterns.
16+
* **`slice`** — convert video files into sequences of images. Use `--remove` to delete the source video after a successful slice.
17+
* **`delete`** — safely remove files matching specific patterns.
18+
* **`dedup`** — find and remove visual duplicates using **dHash**.
19+
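* *Threshold:* Hamming-distance limit between two image hashes, as a percentage (0-100); images whose distance falls below it count as duplicates.
20+
* *Core Size:* higher values (e.g., 32, 64) pick up small changes (like a moving car); lower values (e.g., 8) ignore fine detail and noise.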
21+
* **`clean-annotations`** — automatically find and delete "orphan" annotation files (XML/TXT) that no longer have a corresponding image.
22+
* **`convert-annotations`** — convert dataset labels between formats (e.g., **Pascal VOC** to **YOLO**).
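
The `dedup` description above can be illustrated with a minimal dHash sketch. It is not the project's implementation: it uses plain Python lists instead of NumPy, nearest-neighbour sampling instead of a real image resize, and the names `dhash`, `hamming_percent` and `is_duplicate` are hypothetical. The earlier README wording says an image is a duplicate when the distance is lower than the threshold; this sketch uses `<=` so that a threshold of 0 still matches identical hashes, which is an assumption.

```python
from typing import List, Sequence


def dhash(gray: Sequence[Sequence[int]], core_size: int = 8) -> List[bool]:
    """Return a difference hash of core_size * core_size bits for a grayscale image."""
    rows, cols = len(gray), len(gray[0])
    # Nearest-neighbour sampling down to core_size rows by (core_size + 1) columns.
    row_idx = [round(i * (rows - 1) / (core_size - 1)) for i in range(core_size)]
    col_idx = [round(j * (cols - 1) / core_size) for j in range(core_size + 1)]
    small = [[gray[r][c] for c in col_idx] for r in row_idx]
    # Each bit records whether a pixel is brighter than its right-hand neighbour.
    return [row[j] > row[j + 1] for row in small for j in range(core_size)]


def hamming_percent(hash_a: Sequence[bool], hash_b: Sequence[bool]) -> float:
    """Share of differing bits between two equal-length hashes, in [0, 100]."""
    differing = sum(a != b for a, b in zip(hash_a, hash_b))
    return differing / len(hash_a) * 100.0


def is_duplicate(hash_a: Sequence[bool], hash_b: Sequence[bool], threshold: int = 10) -> bool:
    # Assumption: inclusive comparison, see the note above the code.
    return hamming_percent(hash_a, hash_b) <= threshold
```

A larger `core_size` produces more bits per image, so small local differences flip a measurable share of the hash; with `core_size=8` only coarse brightness structure survives.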
23+
24+
---
25+
26+
### Automation & Intervals
27+
By default, commands run once. If you want to monitor a folder and process files as they appear, use the repeat flag:
28+
* Use **`-r`** to run the command in a cycle.
29+
* Set the delay between cycles with **`-s`** (seconds).
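
As a rough illustration of the repeat behaviour described above, the loop below runs a task once or in a cycle with a delay. The function name `run_command`, its parameters and the `max_cycles` safety limit are hypothetical and only serve the example; the README only states that `-r` repeats the command and `-s` sets the delay in seconds.

```python
import time
from typing import Callable, Optional


def run_command(task: Callable[[], None], repeat: bool = False,
                sleep_seconds: float = 60.0,
                max_cycles: Optional[int] = None) -> int:
    """Run task once, or repeatedly with a pause when repeat is set.

    Returns the number of completed cycles. max_cycles is a hypothetical
    limit added here so the loop can be stopped in tests.
    """
    cycles = 0
    while True:
        task()
        cycles += 1
        if not repeat or (max_cycles is not None and cycles >= max_cycles):
            return cycles
        # Wait before the next pass over the source directory.
        time.sleep(sleep_seconds)
```

Without `repeat` the task runs exactly once, which matches the default described above.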
30+
31+
---
32+
33+
### Quick Start
34+
35+
1. **Clone the repository:**
36+
```bash
37+
git clone https://github.com/SeregaCodit/DataForge.git
38+
cd DataForge
39+
```
40+
41+
2. **Set up the environment:**
42+
```bash
43+
python -m venv .venv
44+
source .venv/bin/activate
45+
pip install -r requirements.txt
46+
```
47+
48+
3. **Check usage:**
49+
```bash
50+
python data_forge.py --help # See all commands
51+
python data_forge.py {command} --help # See arguments for a specific command
52+
```
53+
54+
---
55+
56+
### Workflow
57+
For multiple tasks, you can modify `start_all_tasks.sh` and run them all in the background:
58+
```bash
59+
bash start_all_tasks.sh
60+
```
61+
To stop all running DataForge processes:
62+
```bash
63+
pkill -f data_forge.py
64+
```
65+
66+
### Configuration
67+
You can manage all default settings in `config.json`. DataForge follows this priority:
68+
**CLI Arguments > config.json > Internal Defaults.**
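
The priority order can be sketched as a simple layered merge. This is an illustration only: the project builds its settings with Pydantic, while the function `resolve_settings` and the convention that `None` means "flag not passed" are assumptions made for this example. The default values are the ones visible in `const_utils/default_values.py`.

```python
from typing import Any, Dict, Optional

# Defaults taken from AppSettings in this commit.
INTERNAL_DEFAULTS: Dict[str, Any] = {"hash_threshold": 10, "core_size": 8, "n_jobs": 2}


def resolve_settings(cli_args: Dict[str, Any],
                     file_config: Dict[str, Any],
                     defaults: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge settings so that CLI arguments beat config.json, which beats defaults."""
    settings = dict(INTERNAL_DEFAULTS if defaults is None else defaults)
    # Values loaded from config.json override the internal defaults.
    settings.update(file_config)
    # Assumption: a CLI value of None means the flag was not given.
    settings.update({key: value for key, value in cli_args.items() if value is not None})
    return settings
```

For example, `resolve_settings({"core_size": 16, "n_jobs": None}, {"core_size": 32, "n_jobs": 4})` keeps `core_size=16` from the CLI, `n_jobs=4` from the file and `hash_threshold=10` from the defaults.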

const_utils/annotation.py

Lines changed: 51 additions & 51 deletions
@@ -1,51 +1,51 @@
1-
from dataclasses import dataclass
2-
from pathlib import Path
3-
from typing import Tuple, Dict, Optional
4-
5-
from logger.log_level_mapping import LevelMapping
6-
from logger.logger import LoggerConfigurator
7-
8-
9-
class ObjectAnnotation:
10-
def __init__(self, log_level: str = LevelMapping.debug, log_path: Optional[Path] = None, **kwargs):
11-
self.imsize: Tuple[int, int] = kwargs.get("imsize")
12-
self.name: str = kwargs.get("name")
13-
self.pose: str = kwargs.get("pose", 'Unspecified')
14-
self.truncated: int = kwargs.get("truncated", 0)
15-
self.difficult: int = kwargs.get("difficult", 0)
16-
self.bndbox: Dict[str, int] = kwargs.get("bndbox", {})
17-
self.width: int = None
18-
self.height: int = None
19-
self.x_center: int = None
20-
self.y_center: int = None
21-
self.area: int = None
22-
self.aspect_ratio: int = None
23-
self.relative_area: float = None
24-
25-
self.logger = LoggerConfigurator.setup(
26-
name=self.__class__.__name__,
27-
log_level=log_level,
28-
log_path=Path(log_path) / f"{self.__class__.__name__}.log" if log_path else None
29-
)
30-
31-
@property
32-
def area(self) -> int:
33-
return self._area
34-
35-
@area.setter
36-
def area(self, value: int) -> None:
37-
if isinstance(value, int):
38-
self._area = value
39-
else:
40-
try:
41-
self._area = int(float(value))
42-
except TypeError as e:
43-
error_text = f"Area must be an integer, got {value}"
44-
self.logger.warning(error_text)
45-
raise TypeError(e)
46-
47-
@property
48-
def width(self) -> int:
49-
return self._width
50-
51-
1+
# from dataclasses import dataclass
2+
# from pathlib import Path
3+
# from typing import Tuple, Dict, Optional
4+
#
5+
# from logger.log_level_mapping import LevelMapping
6+
# from logger.logger import LoggerConfigurator
7+
#
8+
#
9+
# class ObjectAnnotation:
10+
# def __init__(self, log_level: str = LevelMapping.debug, log_path: Optional[Path] = None, **kwargs):
11+
# self.imsize: Tuple[int, int] = kwargs.get("imsize")
12+
# self.name: str = kwargs.get("name")
13+
# self.pose: str = kwargs.get("pose", 'Unspecified')
14+
# self.truncated: int = kwargs.get("truncated", 0)
15+
# self.difficult: int = kwargs.get("difficult", 0)
16+
# self.bndbox: Dict[str, int] = kwargs.get("bndbox", {})
17+
# self.width: int = None
18+
# self.height: int = None
19+
# self.x_center: int = None
20+
# self.y_center: int = None
21+
# self.area: int = None
22+
# self.aspect_ratio: int = None
23+
# self.relative_area: float = None
24+
#
25+
# self.logger = LoggerConfigurator.setup(
26+
# name=self.__class__.__name__,
27+
# log_level=log_level,
28+
# log_path=Path(log_path) / f"{self.__class__.__name__}.log" if log_path else None
29+
# )
30+
#
31+
# @property
32+
# def area(self) -> int:
33+
# return self._area
34+
#
35+
# @area.setter
36+
# def area(self, value: int) -> None:
37+
# if isinstance(value, int):
38+
# self._area = value
39+
# else:
40+
# try:
41+
# self._area = int(float(value))
42+
# except TypeError as e:
43+
# error_text = f"Area must be an integer, got {value}"
44+
# self.logger.warning(error_text)
45+
# raise TypeError(e)
46+
#
47+
# @property
48+
# def width(self) -> int:
49+
# return self._width
50+
#
51+
#

const_utils/arguments.py

Lines changed: 2 additions & 1 deletion
@@ -5,7 +5,6 @@ class Arguments:
55
"""Command arguments"""
66
src: str = "src"
77
dst: str = "--dst"
8-
98
pattern: str = "--pattern"
109
p: str = "-p"
1110
repeat: str = "--repeat"
@@ -32,3 +31,5 @@ class Arguments:
3231
a_suffix: str = "--a_suffix"
3332
a_source: str = "--a_source"
3433
destination_type: str = "--destination-type"
34+
img_path: str = "--img_path"
35+
extensions: str = "--ext"
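
Review note: the two new constants could be wired into a parser roughly as follows. This is a sketch that assumes the CLI is built on `argparse`; the `build_parser` function, the `dest` name and the use of `nargs="+"` are illustrative choices, not taken from the commit. Only the flag strings `--img_path` and `--ext` and the help text for `--img_path` come from the diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the two flags added in this commit."""
    parser = argparse.ArgumentParser(prog="data_forge.py")
    parser.add_argument("src", help="Source directory")
    parser.add_argument("--img_path", default=None,
                        help="Path to dataset images directory")
    # nargs="+" collects one or more extensions into a list.
    parser.add_argument("--ext", dest="extensions", nargs="+",
                        default=[".jpg", ".jpeg", ".png"],
                        help="File extensions used to build the image whitelist")
    return parser
```

With this layout, `--ext .jpg .png` yields `args.extensions == [".jpg", ".png"]`.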

const_utils/default_values.py

Lines changed: 15 additions & 3 deletions
@@ -1,7 +1,7 @@
11
import json
22
import multiprocessing
33

4-
from typing import Union, Tuple, Optional
4+
from typing import Union, Tuple, Optional, List
55

66
from pydantic import Field, field_validator
77
from pydantic_settings import BaseSettings, SettingsConfigDict
@@ -33,14 +33,15 @@ class AppSettings(BaseSettings):
3333
filetype: str = Field(default=Constants.image)
3434
method: str = Field(default=Constants.dhash)
3535
hash_threshold: int = Field(default=10, ge=0, le=100)
36-
confirm_choice: tuple = Field(default=("delete", "вудуеу"))
37-
core_size: int = Field(default=8, ge=8) # can we specify that this must be a power of two?
36+
confirm_choice: tuple = Field(default=("delete",))
37+
core_size: int = Field(default=8, ge=8)
3838
n_jobs: int = Field(default=2, ge=1, le=multiprocessing.cpu_count())
3939
cache_file_path: Path = Field(default=Path("./cache"))
4040
cache_name: Optional[Path] = Field(default=None)
4141
a_suffix: Tuple[str, ...] = Field(default_factory=tuple)
4242
a_source: Optional[Path] = Field(default=None)
4343
destination_type: Optional[str] = Field(default=None)
44+
extensions: Tuple[str, ...] = Field(default=(".jpg", ".jpeg", ".png"))
4445

4546
@field_validator('core_size')
4647
@classmethod
@@ -68,6 +69,17 @@ def ensure_n_jobs(cls, value: Union[int, str]) -> int:
6869
else:
6970
return value
7071

72+
@field_validator("extensions")
73+
@classmethod
74+
def ensure_extensions(cls, value: Union[str, List[str]]) -> Tuple[str, ...]:
75+
if isinstance(value, tuple):
76+
return value
77+
else:
78+
try:
79+
return tuple(value)
80+
except TypeError as e:
81+
raise TypeError(e)
82+
7183
@classmethod
7284
def load_config(cls, config_path: Path = Constants.config_file) -> "AppSettings":
7385
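
Review note on `ensure_extensions`: the annotation allows a plain `str`, but `tuple("jpg")` would split it into `("j", "p", "g")`. A possible normalisation is sketched below as a standalone function so it runs without Pydantic. The name `normalize_extensions` and the cleanup rules (lower-casing, adding a leading dot, dropping stray commas) are assumptions, not behaviour of the committed validator.

```python
from typing import Iterable, List, Tuple, Union


def normalize_extensions(value: Union[str, Iterable[str]]) -> Tuple[str, ...]:
    """Return extensions as a tuple of lower-case strings with a leading dot."""
    # A bare string is treated as one extension rather than a sequence of characters.
    items: List[str] = [value] if isinstance(value, str) else list(value)
    cleaned: List[str] = []
    for item in items:
        # Remove surrounding whitespace and stray commas such as ".jpeg,".
        ext = item.strip().strip(",").lower()
        if not ext:
            continue
        cleaned.append(ext if ext.startswith(".") else f".{ext}")
    return tuple(cleaned)
```

A function like this could be called from the body of the `field_validator`; whether the validator should run before or after Pydantic's own coercion is left open here.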

const_utils/parser_help.py

Lines changed: 4 additions & 1 deletion
@@ -33,4 +33,7 @@ class HelpStrings:
3333
a_suffix: str = "A suffix pattern for annotations"
3434
a_source: str = ("A source directory to annotations. If None - that means annotations are in the same folder with"
3535
" images")
36-
destination_type: str = "A type of destination annotation format"
36+
destination_type: str = "A type of destination annotation format"
37+
img_path: str = "Path to dataset images directory"
38+
extensions: str = ("A tuple of file extensions that will be used as pattern for building file whitelists for "
39+
"converting from yolo to other formats")

docs/api/yolo_voc_converter.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
::: tools.annotation_converter.converter.yolo_voc_converter.YoloVocConverter
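
For readers of this API page, the core arithmetic of a YOLO to Pascal VOC conversion looks roughly like this. It is a sketch of the standard box formula, not the code of `YoloVocConverter`; the function name `yolo_to_voc_box` and the `(width, height)` order of `imsize` are assumptions.

```python
from typing import Dict, Tuple


def yolo_to_voc_box(x_center: float, y_center: float, width: float, height: float,
                    imsize: Tuple[int, int]) -> Dict[str, int]:
    """Convert a normalised YOLO box to absolute Pascal VOC corner coordinates.

    imsize is assumed to be (image_width, image_height) in pixels.
    """
    img_w, img_h = imsize
    # YOLO stores the box centre and size as fractions of the image size.
    xmin = round((x_center - width / 2) * img_w)
    ymin = round((y_center - height / 2) * img_h)
    xmax = round((x_center + width / 2) * img_w)
    ymax = round((y_center + height / 2) * img_h)
    # Clamp to the image so rounding cannot push a corner outside it.
    return {
        "xmin": max(0, min(img_w, xmin)),
        "ymin": max(0, min(img_h, ymin)),
        "xmax": max(0, min(img_w, xmax)),
        "ymax": max(0, min(img_h, ymax)),
    }
```

A centred box covering half of a 640x480 image in each dimension maps to `xmin=160, ymin=120, xmax=480, ymax=360`.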
