Conversation
| if y is None: | ||
| pass | ||
| elif isinstance(y, pd.DataFrame): | ||
| elif isinstance(y, pd.DataFrame) or isinstance(y, cudf.DataFrame): |
There was a problem hiding this comment.
@dcolinmorgan or (cudf is not None and isinstance(y, cudf.DataFrame)
There was a problem hiding this comment.
maybe same problem elsewhere?
There was a problem hiding this comment.
If cudf is None, I think the current would throw an exn
|
@tanmoyio can we close, or this is still live and needs review? |
|
|
Awesome - is the plan to start landing, or more first? And would it make sense to start reviewing any PRs? If a sequence, can you stack them & point out so clear? |
|
landing would be wonderful -- before end of july is my dream DT4 is latest cu_cat PR branch which passes many pytests + works as expected in every demo ive done in last few months |
|
ok @tanmoyio can you help double check tests, take for a testdrive, and land first in cu_cat and then here? After, can you help add to main graphistry (https://github.com/graphistry/graphistry/blob/master/compose/dockerfiles/base/05-nvidia.Dockerfile) ? I think we should keep default-off for now, and should test that it's truly default off -- that existence doesn't (yet) trigger it to be used, only explicit use. |
|
test-full-ai test L395 seems to be getting hung up by 1 of 3 features being exactly reproduced
|
| # return False, object | ||
| def check_cudf(): | ||
| try: | ||
| import cudf |
There was a problem hiding this comment.
-
Who calls this on import?
-
And can this be a a) cached call that b) checks module path vs an import?
There was a problem hiding this comment.
its only test_embed_utils#L14 that calls check_cudf, swapped out for lazy_cudf_import from umap_utils
There was a problem hiding this comment.
oh, so that shouldn't be the issue, right? test/* shouldn't get imported by import graphistry..
There was a problem hiding this comment.
sorry, no i wasnt clear, test_embed_utils is only OTHER place lazy_cudf_import was present. It was used in embed_utils and imported cudf to check df dtype, but I have swapped it out in place of just checking via getmodule e.g. if 'cudf' in str(getmodule(self._nodes)): , so I believe the problem is solved -- tuna looks much better

|
| SuperVectorizer = Any | ||
| GapEncoder = Any | ||
| SimilarityEncoder = Any | ||
| # SimilarityEncoder = Any |
There was a problem hiding this comment.
remove all these dead lines
| X = np.round(X, decimals=keep_n_decimals) # type: ignore # noqa | ||
| X = pd.DataFrame(X, columns=columns, index=index) | ||
| else: | ||
| X = transformer.fit_transform(X.to_numpy()) |
There was a problem hiding this comment.
how do we know if the transformer is cpu vs gpu? it seems to always assume cpu here, but if X is cudf and transformer is gpu, can't we keep X on gpu?
There was a problem hiding this comment.
a sometimes-ok soln would be checking transformer for being from cuml or maybe cu_cat, but that seems non-generalizable
There was a problem hiding this comment.
wow this nearly gave me a heart attack -- good thoughts i will work with... but this is an artifact, no .to_numpy needed
|
|
||
|
|
||
| def make_safe_gpu_dataframes(X, y, engine): | ||
| has_cudf_dependancy_, _, cudf = lazy_import_has_dependancy_cu_cat() |
There was a problem hiding this comment.
Add assert cudf is not None ?
There was a problem hiding this comment.
also probably good to switch from lazy_import...cu_cat to a cudf one
There was a problem hiding this comment.
ok -- after the if statement seems best here again like other assert you mentioned
| if c in xc: | ||
| remove_cols.append(c) | ||
| elif isinstance(y, pd.Series): | ||
| elif isinstance(y, pd.Series) or isinstance(y, cudf.Series): |
There was a problem hiding this comment.
handle non-cu_cat import returning None for cudf
| X = transformer.fit_transform(X.to_numpy()) | ||
| if keep_n_decimals: | ||
| X = np.round(X, decimals=keep_n_decimals) # type: ignore # noqa | ||
| _, _, cudf = lazy_import_has_dependancy_cu_cat() |
There was a problem hiding this comment.
good practice to assert even after if statement check? good to know, yay learning
There was a problem hiding this comment.
It's useful when you don't want the 'if' but can imagine future changes or misuses accidentally getting the assumption wrong
umap match transpose index type-spec concat type-spec concat dc for comp_cluster dirty_cat as default, cc passes most tests ;) source cu_cat from pypi source cu_cat from pypi remove cc tests, tested for in dc place remove cc tests, tested for in dc place init 1dc > 2cc init 1dc > 2cc use constants throughout revert from constants revert from constants init 1dc > 2cc better dc default better dc default
bab7c02 to
cdda3e7
Compare
| ./bin/test-umap-learn-core.sh | ||
|
|
||
|
|
||
| test-gpu-umap: # well cpu until get a github actions gpu node |
There was a problem hiding this comment.
There was a problem hiding this comment.
maybe we should add some sort of test to confirm the gpu is enabled + gpu is used, e.g.,
nvidia-smi || exit 1
python3 -c "import cudf; cudf.DataFrame({'x': [0,1,2]})['x'].sum()"There was a problem hiding this comment.
if that is overreach for this pr, we should comment this out
|
this seems to have drifted a bit from main, see merge conflict i'm unsure about the cu_cat bits here, but if this pr replaces a bunch of |
Starter script