30 changes: 17 additions & 13 deletions DEMO/general_workflow_demo.ipynb
@@ -110,7 +110,9 @@
"metadata": {},
"source": [
"### The VectorStore class creates a vector database by converting a set of labelled texts to embeddings, using an associated Vectoriser.\n",
"#### Once created, it can be 'searched', using the vectoriser to embed queries as vectors and calculate their semantic similarity to the labelled texts in the VectorStore\n",
"#### Once created, it can be 'searched', using the vectoriser to embed queries as vectors and calculate their semantic similarity to the labelled texts in the VectorStore.\n",
"\n",
"#### By default, the vector database is persisted to a local directory named after the input filename. You can use the `output_dir` argument to change the location of the persisted vector database when creating the VectorStore. If the directory already exists, creation fails with a `ConfigurationError`; pass `overwrite=True` to allow an existing directory to be overwritten. If you don't want the vector database to be persisted at all, pass `skip_save=True`; note that this takes precedence over `output_dir` and `overwrite`.\n",
Contributor

Consider making this new paragraph normal text size. The two above it are shorter summary sentences made bigger for emphasis; the longer paragraph as a header looks slightly cluttered in the rendered notebook.

"![VectorStore_image](files/VectorStore.png)\n"
]
},
@@ -123,11 +125,13 @@
"from classifai.indexers import VectorStore\n",
"\n",
"my_vector_store = VectorStore(\n",
" file_name=\"data/testdata.csv\",\n",
" data_type=\"csv\",\n",
" vectoriser=vectoriser,\n",
" meta_data={\"colour\": str, \"language\": str},\n",
" overwrite=True,\n",
" file_name=\"data/testdata.csv\", # required\n",
" data_type=\"csv\", # required\n",
" vectoriser=vectoriser, # required\n",
" meta_data={\"colour\": str, \"language\": str}, # optional\n",
" output_dir=\"testdata\", # optional\n",
" overwrite=True, # optional\n",
" skip_save=False, # optional\n",
")"
]
},
@@ -165,12 +169,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In the above cell, we're building the object that the VectorStore takes in to do the search process. Our input expects two columns of data, id and query, as above. And this data can be passed to our VectorStoreSearchInput class, as a dictionary <b> or alreadt as a Pandas dataframe. </b>\n",
"In the above cell, we're building the object that the VectorStore takes in to do the search process. Our input expects two columns of data, id and query, as above. We can create a VectorStoreSearchInput object by passing in this data as a dictionary <b> or as a Pandas dataframe. </b>\n",
"\n",
"If you try to remove some of the data, say the 'id' column. Our data class object will inform you that you're missing some data. In this sense the data classes keep you right when working with the Package.\n",
"If you try to remove some of the data, say the 'id' column, the class constructor will inform you that you're missing some data. The data validation embedded within these data classes helps avoid unexpected behaviour when working with classifai.\n",
"\n",
"\n",
"Look at the type of the input_data object we created, notice that it is not of type Pandas, but our own custom type. Under the hood this is doing the additional work to validate the data your passing in."
"Look at the type of the input_data object we created; notice that it is not of type Pandas DataFrame, but our own custom datatype. You can think of these classes as dataframes with additional functionality to validate the data you pass in."
]
},
{
@@ -228,7 +232,7 @@
"metadata": {},
"source": [
"### With reverse search you can do partial matching!\n",
"use the `partial match` flag to check if the **ids/labels** start with our query id"
"Use the `partial match` flag to check if the returned **doc_labels** start with our query"
]
},
{
@@ -259,7 +263,7 @@
"source": [
"### VectorStore Embed method\n",
"\n",
"Its also possible to get the vector embeddings for each from some input text or queries by calling the VectorStore <i>.embed()</i> method.\n",
"It is also possible to get the vector embeddings for some input text or queries by calling the VectorStore <i>.embed()</i> method.\n",
"\n",
"Once again, this method has its own data class to interface with: `VectorStoreEmbedInput`\n"
]
@@ -445,7 +449,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "classifai",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -459,7 +463,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.7"
"version": "3.12.4"
}
},
"nbformat": 4,
95 changes: 60 additions & 35 deletions src/classifai/indexers/main.py
@@ -92,6 +92,7 @@ def __init__( # noqa: C901, PLR0912, PLR0913, PLR0915
output_dir: str | None = None,
overwrite: bool = False,
hooks: dict | None = None,
skip_save: bool = False,
Contributor

Can we move the new parameter above `hooks`, so that it's grouped with the other relevant parameters?
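
A sketch of what the reordering suggested above might look like. `VectorStoreSketch` is a hypothetical stand-in, not the real class: its leading parameters are taken from the notebook call in this PR, and the real signature may contain others. Only the relative order of `overwrite`, `skip_save` and `hooks` is the point.

```python
from __future__ import annotations

import inspect


class VectorStoreSketch:
    """Hypothetical stand-in for VectorStore; only the signature matters here."""

    def __init__(
        self,
        file_name: str,
        data_type: str,
        vectoriser: object,
        meta_data: dict | None = None,
        output_dir: str | None = None,
        overwrite: bool = False,
        skip_save: bool = False,  # grouped with the other persistence options
        hooks: dict | None = None,
    ):
        self.skip_save = skip_save
        self.hooks = {} if hooks is None else hooks


# Inspect the declared parameter order rather than relying on reading the source.
param_names = list(inspect.signature(VectorStoreSketch.__init__).parameters)
print(param_names)
```

If `hooks` can currently be passed positionally, inserting a parameter ahead of it shifts its position, so this ordering is safest when callers use keyword arguments, as the demo notebook does.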

):
"""Initializes the `VectorStore` object by processing the input CSV file and generating
vector embeddings.
@@ -107,9 +108,14 @@ def __init__( # noqa: C901, PLR0912, PLR0913, PLR0915
Defaults to `None`.
output_dir (str): [optional] The directory where the `VectorStore` will be saved.
Defaults to `None`, where input file name will be used.
Note: ignored if `skip_save=True`.
overwrite (bool): [optional] If `True`, allows overwriting existing folders with the same name.
Defaults to `False` to prevent accidental overwrites.
Note: ignored if `skip_save=True`.
hooks (dict): [optional] A dictionary of user-defined hooks for preprocessing and postprocessing. Defaults to `None`.
skip_save (bool): [optional] If `True`, the `VectorStore` is kept in memory only and is not saved to disk
(for testing or ephemeral use cases). If `False`, it is saved to disk after creation.
Defaults to `False`.


Raises:
@@ -160,28 +166,45 @@ def __init__( # noqa: C901, PLR0912, PLR0913, PLR0915
self.num_vectors = None
self.vectoriser_class = vectoriser.__class__.__name__
self.hooks = {} if hooks is None else hooks
self.skip_save = skip_save

# ---- Output directory handling (filesystem problems) -> ConfigurationError
try:
if self.output_dir is None:
logging.info("No output directory specified, attempting to use input file name as output folder name.")
normalized_file_name = os.path.basename(os.path.splitext(self.file_name)[0])
self.output_dir = os.path.join(normalized_file_name)

if os.path.isdir(self.output_dir):
if overwrite:
shutil.rmtree(self.output_dir)
else:
raise ConfigurationError(
"Output directory already exists. Pass overwrite=True to overwrite the folder.",
context={"output_dir": self.output_dir},
if self.output_dir is not None and self.skip_save:
logging.warning(
"VectorStore creation: output_dir is set to %s but skip_save is True, so the VectorStore will not be saved to disk. output_dir will be ignored.",
self.output_dir,
)

if self.output_dir is not None and not isinstance(self.output_dir, str):
raise DataValidationError(
"output_dir must be a string or None.", context={"output_dir_type": type(self.output_dir).__name__}
)

if not self.skip_save:
# ---- Output directory handling (filesystem problems) -> ConfigurationError
try:
if self.output_dir is None:
logging.info(
"No output directory specified, attempting to use input file name as output folder name."
)
os.makedirs(self.output_dir, exist_ok=True)
except Exception as e:
raise ConfigurationError(
"Failed to prepare output directory.",
context={"output_dir": self.output_dir},
) from e
normalized_file_name = os.path.basename(os.path.splitext(self.file_name)[0])
self.output_dir = os.path.join(normalized_file_name)

if os.path.isdir(self.output_dir):
if overwrite:
shutil.rmtree(self.output_dir)
else:
raise ConfigurationError(
"Output directory already exists. Pass overwrite=True to overwrite the folder.",
context={"output_dir": self.output_dir},
)
os.makedirs(self.output_dir, exist_ok=True)
except Exception as e:
raise ConfigurationError(
"Failed to prepare output directory.",
context={"output_dir": self.output_dir},
) from e
else:
logging.debug("skip_save is set to True, the VectorStore will not be saved to disk after creation.")

# ---- Build index (wrap every unexpected failure) -> IndexBuildError
try:
@@ -202,23 +225,25 @@ def __init__( # noqa: C901, PLR0912, PLR0913, PLR0915
) from e

# ---- Save + derived metadata (IO/format problems) -> IndexBuildError
try:
logging.info("Gathering metadata and saving vector store / metadata...")

self.vector_shape = self.vectors["embeddings"].to_numpy().shape[1]
self.num_vectors = len(self.vectors)
self.vector_shape = self.vectors["embeddings"].to_numpy().shape[1]
self.num_vectors = len(self.vectors)

self.vectors.write_parquet(os.path.join(self.output_dir, "vectors.parquet"))
self._save_metadata(os.path.join(self.output_dir, "metadata.json"))
if not self.skip_save:
try:
logging.info("Gathering metadata and saving vector store / metadata...")
self.vectors.write_parquet(os.path.join(self.output_dir, "vectors.parquet"))
self._save_metadata(os.path.join(self.output_dir, "metadata.json"))

logging.info("Vector Store created - files saved to %s", self.output_dir)
except ClassifaiError:
raise
except Exception as e:
raise IndexBuildError(
"Vector store was created but saving outputs failed.",
context={"cause_type": type(e).__name__, "cause_message": str(e)},
) from e
logging.info("Vector Store created - files saved to %s", self.output_dir)
except ClassifaiError:
raise
except Exception as e:
raise IndexBuildError(
"Vector store was created but saving outputs failed.",
context={"cause_type": type(e).__name__, "cause_message": str(e)},
) from e
else:
logging.debug("skip_save is True, skipping saving VectorStore to disk.")

def _save_metadata(self, path: str):
"""Saves metadata about the `VectorStore` to a JSON file.
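
The output-directory branching in the `main.py` diff can be sketched as a standalone function, which may help when checking the precedence of `skip_save` over `output_dir` and `overwrite`. This is a minimal sketch under stated assumptions: `resolve_output_dir` is a hypothetical helper that does not exist in classifai, and it raises the built-in `FileExistsError` where the package raises its own `ConfigurationError`.

```python
import logging
import os
import shutil
import tempfile


def resolve_output_dir(file_name, output_dir=None, overwrite=False, skip_save=False):
    """Return the directory to save into, or None when nothing is persisted.

    Hypothetical helper mirroring the branching in the diff.
    """
    if skip_save:
        if output_dir is not None:
            logging.warning("output_dir=%s is ignored because skip_save=True", output_dir)
        return None  # skip_save wins: the filesystem is never touched

    if output_dir is None:
        # Default: the input file name without its directory or extension.
        output_dir = os.path.basename(os.path.splitext(file_name)[0])

    if os.path.isdir(output_dir):
        if not overwrite:
            raise FileExistsError(f"{output_dir} already exists; pass overwrite=True")
        shutil.rmtree(output_dir)
    os.makedirs(output_dir, exist_ok=True)
    return output_dir


results = {}
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "testdata")

    # skip_save=True: output_dir is ignored and no directory is created.
    results["skip_save"] = resolve_output_dir("data/testdata.csv", output_dir=target, skip_save=True)
    results["dir_after_skip"] = os.path.isdir(target)

    # First real save creates the directory.
    results["first_save"] = resolve_output_dir("data/testdata.csv", output_dir=target) == target

    # Second save without overwrite is refused.
    try:
        resolve_output_dir("data/testdata.csv", output_dir=target)
        results["second_save"] = "allowed"
    except FileExistsError:
        results["second_save"] = "refused"

    # overwrite=True replaces the existing directory.
    results["overwrite"] = resolve_output_dir("data/testdata.csv", output_dir=target, overwrite=True) == target

print(results)
```

One thing the sketch sidesteps: in the diff, the "Output directory already exists" `ConfigurationError` is raised inside the `try` whose `except Exception` re-wraps it. If `ConfigurationError` is an ordinary `Exception` subclass, callers would see "Failed to prepare output directory." with the original error only as the chained cause. The save block avoids this with `except ClassifaiError: raise`, and the same guard may be worth adding here.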