datasciencecampus · lukeroantreeONS · May 19, 2026 · May 21, 2026 · May 21, 2026 · May 21, 2026
@@ -110,7 +110,9 @@
    "metadata": {},
    "source": [
     "### The VectorStore class creates a vector database by converting a set of labelled texts to embeddings, using an associated Vectoriser.\n",
-    "#### Once created, it can be 'searched', using the vectoriser to embed queries as vectors and calculate their semantic similarity to the labelled texts in the VectorStore\n",
+    "#### Once created, it can be 'searched', using the vectoriser to embed queries as vectors and calculate their semantic similarity to the labelled texts in the VectorStore.\n",
+    "\n",
+    "#### By default, the vector database is persisted to a local directory named after the input filename. You can use the `output_dir` argument to change the location of the persisted vector database when creating the VectorStore. If the directory already exists, creation fails with a `ConfigurationError`; pass `overwrite=True` to delete and replace the existing directory. If you don't want the vector database to be persisted at all, pass `skip_save=True`. Note that this takes precedence over `output_dir` and `overwrite`.\n",
     "![VectorStore_image](files/VectorStore.png)\n"
    ]
   },
@@ -123,11 +125,13 @@
     "from classifai.indexers import VectorStore\n",
     "\n",
     "my_vector_store = VectorStore(\n",
-    "    file_name=\"data/testdata.csv\",\n",
-    "    data_type=\"csv\",\n",
-    "    vectoriser=vectoriser,\n",
-    "    meta_data={\"colour\": str, \"language\": str},\n",
-    "    overwrite=True,\n",
+    "    file_name=\"data/testdata.csv\",  # required\n",
+    "    data_type=\"csv\",  # required\n",
+    "    vectoriser=vectoriser,  # required\n",
+    "    meta_data={\"colour\": str, \"language\": str},  # optional\n",
+    "    output_dir=\"testdata\",  # optional\n",
+    "    overwrite=True,  # optional\n",
+    "    skip_save=False,  # optional\n",
     ")"
    ]
   },
@@ -165,12 +169,12 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "In the above cell, we're building the object that the VectorStore takes in to do the search process. Our input expects two columns of data, id and query, as above. And this data can be passed to our VectorStoreSearchInput class, as a dictionary <b> or alreadt as a Pandas dataframe. </b>\n",
+    "In the above cell, we're building the object that the VectorStore takes in to do the search process. Our input expects two columns of data, id and query, as above. We can create a VectorStoreSearchInput object by passing in this data as a dictionary <b> or as a Pandas dataframe. </b>\n",
     "\n",
-    "If you try to remove some of the data, say the 'id' column. Our data class object will inform you that you're missing some data. In this sense the data classes keep you right when working with the Package.\n",
+    "If you try to remove some of the data, say the 'id' column, the class constructor will inform you that you're missing some data. The data validation embedded within these data classes helps avoid unexpected behaviour when working with classifai.\n",
     "\n",
     "\n",
-    "Look at the type of the input_data object we created, notice that it is not of type Pandas, but our own custom type. Under the hood this is doing the additional work to validate the data your passing in."
+    "Look at the type of the input_data object we created; notice that it is not of type Pandas DataFrame, but our own custom datatype. You can think of these classes as dataframes with additional functionality to validate the data you pass in."
    ]
   },
   {
@@ -228,7 +232,7 @@
    "metadata": {},
    "source": [
     "### With reverse search you can do partial matching!\n",
-    "use the `partial match` flag to check if the **ids/labels** start with our query id"
+    "Use the `partial match` flag to check if the returned **doc_labels** start with our query"
    ]
   },
   {
@@ -259,7 +263,7 @@
    "source": [
     "### VectorStore Embed method\n",
     "\n",
-    "Its also possible to get the vector embeddings for each from some input text or queries by calling the VectorStore <i>.embed()</i> method.\n",
+    "It is also possible to get the vector embeddings for some input texts or queries by calling the VectorStore <i>.embed()</i> method.\n",
     "\n",
     "Once again, this method has its own data class to inferace with: `VectorStoreEmbedInput`\n"
    ]
@@ -445,7 +449,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "classifai",
+   "display_name": "Python 3 (ipykernel)",
    "language": "python",
    "name": "python3"
   },
@@ -459,7 +463,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.13.7"
+   "version": "3.12.4"
   }
  },
  "nbformat": 4,

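The notebook text above describes how `output_dir`, `overwrite` and `skip_save` interact, and the source change below implements it. As a rough illustration only, the precedence rules can be sketched with the standard library. The helper names `default_output_dir` and `resolve_output_dir` are hypothetical and not part of classifai, and `FileExistsError` stands in for the package's `ConfigurationError`.

```python
import os
import shutil
import tempfile


def default_output_dir(file_name):
    """Hypothetical helper: derive the default folder name from the input file name."""
    # Same expression as in the diff: drop the extension, then the parent directories.
    return os.path.basename(os.path.splitext(file_name)[0])


def resolve_output_dir(file_name, output_dir=None, overwrite=False, skip_save=False):
    """Hypothetical helper mirroring the directory handling described above.

    Returns the directory that would be written to, or None when nothing is persisted.
    """
    if skip_save:
        # skip_save takes precedence: output_dir and overwrite are ignored.
        return None
    if output_dir is None:
        output_dir = default_output_dir(file_name)
    if os.path.isdir(output_dir):
        if not overwrite:
            # Stand-in for ConfigurationError in the real package (assumption).
            raise FileExistsError(f"{output_dir} already exists; pass overwrite=True")
        # overwrite=True deletes the existing folder before recreating it.
        shutil.rmtree(output_dir)
    os.makedirs(output_dir, exist_ok=True)
    return output_dir


# Small demonstration inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "testdata")
    print(default_output_dir("data/testdata.csv"))
    print(resolve_output_dir("data/testdata.csv", output_dir=target, skip_save=True))
    print(resolve_output_dir("data/testdata.csv", output_dir=target) == target)
```

With `skip_save=True` the helper returns before touching the filesystem, which is why an existing directory does not raise in that case.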
@@ -92,6 +92,7 @@ def __init__(  # noqa: C901, PLR0912, PLR0913, PLR0915
         output_dir: str | None = None,
         overwrite: bool = False,
         hooks: dict | None = None,
+        skip_save: bool = False,
     ):
         """Initializes the `VectorStore` object by processing the input CSV file and generating
         vector embeddings.
@@ -107,9 +108,14 @@ def __init__(  # noqa: C901, PLR0912, PLR0913, PLR0915
                                 Defaults to `None`.
             output_dir (str): [optional] The directory where the `VectorStore` will be saved.
                                 Defaults to `None`, where input file name will be used.
+                                Note: ignored if `skip_save=True`.
             overwrite (bool): [optional] If `True`, allows overwriting existing folders with the same name.
                                 Defaults to `False` to prevent accidental overwrites.
+                                Note: ignored if `skip_save=True`.
             hooks (dict): [optional] A dictionary of user-defined hooks for preprocessing and postprocessing. Defaults to `None`.
+            skip_save (bool): [optional] If `True`, the `VectorStore` is kept in memory only and is not saved
+                                to disk (for testing or ephemeral use cases). If `False`, it is saved to disk
+                                after creation. Defaults to `False`.
 
 
         Raises:
@@ -160,28 +166,45 @@ def __init__(  # noqa: C901, PLR0912, PLR0913, PLR0915
         self.num_vectors = None
         self.vectoriser_class = vectoriser.__class__.__name__
         self.hooks = {} if hooks is None else hooks
+        self.skip_save = skip_save
 
-        # ---- Output directory handling (filesystem problems) -> ConfigurationError
-        try:
-            if self.output_dir is None:
-                logging.info("No output directory specified, attempting to use input file name as output folder name.")
-                normalized_file_name = os.path.basename(os.path.splitext(self.file_name)[0])
-                self.output_dir = os.path.join(normalized_file_name)
-
-            if os.path.isdir(self.output_dir):
-                if overwrite:
-                    shutil.rmtree(self.output_dir)
-                else:
-                    raise ConfigurationError(
-                        "Output directory already exists. Pass overwrite=True to overwrite the folder.",
-                        context={"output_dir": self.output_dir},
+        if self.output_dir is not None and self.skip_save:
+            logging.warning(
+                "VectorStore creation: output_dir is set to %s but skip_save is True, so the VectorStore will not be saved to disk. output_dir will be ignored.",
+                self.output_dir,
+            )
+
+        if self.output_dir is not None and not isinstance(self.output_dir, str):
+            raise DataValidationError(
+                "output_dir must be a string or None.", context={"output_dir_type": type(self.output_dir).__name__}
+            )
+
+        if not self.skip_save:
+            # ---- Output directory handling (filesystem problems) -> ConfigurationError
+            try:
+                if self.output_dir is None:
+                    logging.info(
+                        "No output directory specified, attempting to use input file name as output folder name."
                     )
-            os.makedirs(self.output_dir, exist_ok=True)
-        except Exception as e:
-            raise ConfigurationError(
-                "Failed to prepare output directory.",
-                context={"output_dir": self.output_dir},
-            ) from e
+                    normalized_file_name = os.path.basename(os.path.splitext(self.file_name)[0])
+                    self.output_dir = os.path.join(normalized_file_name)
+
+                if os.path.isdir(self.output_dir):
+                    if overwrite:
+                        shutil.rmtree(self.output_dir)
+                    else:
+                        raise ConfigurationError(
+                            "Output directory already exists. Pass overwrite=True to overwrite the folder.",
+                            context={"output_dir": self.output_dir},
+                        )
+                os.makedirs(self.output_dir, exist_ok=True)
+            except Exception as e:
+                raise ConfigurationError(
+                    "Failed to prepare output directory.",
+                    context={"output_dir": self.output_dir},
+                ) from e
+        else:
+            logging.debug("skip_save is set to True, the VectorStore will not be saved to disk after creation.")
 
         # ---- Build index (wrap every unexpected failure) -> IndexBuildError
         try:
@@ -202,23 +225,25 @@ def __init__(  # noqa: C901, PLR0912, PLR0913, PLR0915
             ) from e
 
         # ---- Save + derived metadata (IO/format problems) -> IndexBuildError
-        try:
-            logging.info("Gathering metadata and saving vector store / metadata...")
-
-            self.vector_shape = self.vectors["embeddings"].to_numpy().shape[1]
-            self.num_vectors = len(self.vectors)
+        self.vector_shape = self.vectors["embeddings"].to_numpy().shape[1]
+        self.num_vectors = len(self.vectors)
 
-            self.vectors.write_parquet(os.path.join(self.output_dir, "vectors.parquet"))
-            self._save_metadata(os.path.join(self.output_dir, "metadata.json"))
+        if not self.skip_save:
+            try:
+                logging.info("Gathering metadata and saving vector store / metadata...")
+                self.vectors.write_parquet(os.path.join(self.output_dir, "vectors.parquet"))
+                self._save_metadata(os.path.join(self.output_dir, "metadata.json"))
 
-            logging.info("Vector Store created - files saved to %s", self.output_dir)
-        except ClassifaiError:
-            raise
-        except Exception as e:
-            raise IndexBuildError(
-                "Vector store was created but saving outputs failed.",
-                context={"cause_type": type(e).__name__, "cause_message": str(e)},
-            ) from e
+                logging.info("Vector Store created - files saved to %s", self.output_dir)
+            except ClassifaiError:
+                raise
+            except Exception as e:
+                raise IndexBuildError(
+                    "Vector store was created but saving outputs failed.",
+                    context={"cause_type": type(e).__name__, "cause_message": str(e)},
+                ) from e
+        else:
+            logging.debug("skip_save is True, skipping saving VectorStore to disk.")
 
     def _save_metadata(self, path: str):
         """Saves metadata about the `VectorStore` to a JSON file.