Process "Killed" (OOM) due to RAM limits with custom dataset – How to implement lazy loading? #931
tllmmaster started this conversation in General
I am trying to train a GPT model from scratch on a custom dataset (approximately 100 MB to 1.5 GB of raw Turkmen-language text). My machine has 16 GB of RAM.

However, when I run the training script, the process gets "Killed" by the OS, presumably due to an out-of-memory (OOM) condition. It appears that loading and tokenizing the entire text file into memory at once (as done in the GPTDatasetV1 class) consumes all available RAM.

Could you please advise on how to modify the Dataset class or the data loading pipeline to handle larger datasets efficiently? Specifically, I am looking for a way to load and tokenize the text lazily, in smaller pieces, rather than reading and encoding the whole file up front.

Any code snippets or guidance on implementing a memory-efficient Dataset class for this project would be greatly appreciated.

Environment:
OS: Ubuntu
RAM: 16 GB
Python version: 3.10 (for example)
Dataset size: >100 MB

Thanks in advance!

Replies: 1 comment

I think the first step here would be to find out which exact step is causing the issue. 1.5 GB still sounds small, and I don't think it is necessarily the cause of the problem. I recommend running just the code from chapter 2, i.e., only loading the data and iterating over the dataloader, to see whether the dataset preparation (not the LLM training) is what causes the crash, and also to find out how much memory this step uses.
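To act on that suggestion, here is a minimal sketch (not from the book) for isolating the chapter 2 data-loading step and printing the process's peak memory after each stage. It assumes a GPTDatasetV1(txt, tokenizer, max_length, stride) constructor as in the chapter 2 code; the import path, the file path, and the max_length/stride values are placeholders to adjust.

```python
# Minimal sketch: run only the data-loading step and watch peak memory (Linux).
import resource

import tiktoken
from torch.utils.data import DataLoader

from chapter02 import GPTDatasetV1  # placeholder import; point this at your chapter 2 code


def peak_rss_mb():
    # Peak resident set size of this process; ru_maxrss is reported in KB on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


print(f"start: {peak_rss_mb():.0f} MB")

with open("turkmen_corpus.txt", "r", encoding="utf-8") as f:  # placeholder path
    raw_text = f.read()
print(f"after reading file: {peak_rss_mb():.0f} MB")

tokenizer = tiktoken.get_encoding("gpt2")
# (txt, tokenizer, max_length, stride) as in chapter 2; adjust if your version differs.
dataset = GPTDatasetV1(raw_text, tokenizer, 256, 128)
print(f"after building dataset: {peak_rss_mb():.0f} MB")

loader = DataLoader(dataset, batch_size=4, shuffle=False)
for x, y in loader:
    pass  # iterate once to confirm the dataloader alone fits in RAM
print(f"after one pass over dataloader: {peak_rss_mb():.0f} MB")
```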
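If those measurements confirm that building the dataset is what exhausts memory, one possible direction (a sketch, not the book's implementation) is a streaming dataset that reads the file in pieces and tokenizes as it goes, so the full token list never sits in RAM at once. The class name LazyGPTDataset, the file path, and the piece/window sizes below are made up for illustration.

```python
import tiktoken
import torch
from torch.utils.data import DataLoader, IterableDataset


class LazyGPTDataset(IterableDataset):  # hypothetical name, not part of the book's code
    def __init__(self, file_path, tokenizer, max_length=256, stride=128,
                 chars_per_read=1_000_000):
        self.file_path = file_path
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.stride = stride
        self.chars_per_read = chars_per_read

    def __iter__(self):
        # The token buffer only ever holds roughly one text piece plus a partial
        # window, instead of the tokens of the entire corpus.
        buffer = []
        with open(self.file_path, "r", encoding="utf-8") as f:
            while True:
                text = f.read(self.chars_per_read)
                if not text:
                    break
                # Note: cutting the text at an arbitrary character position can split
                # a word at the boundary and slightly change its tokenization there.
                buffer.extend(self.tokenizer.encode(text))
                # Emit sliding windows, then drop consumed tokens from the front.
                while len(buffer) >= self.max_length + 1:
                    window = buffer[: self.max_length + 1]
                    yield torch.tensor(window[:-1]), torch.tensor(window[1:])
                    buffer = buffer[self.stride:]


tokenizer = tiktoken.get_encoding("gpt2")
dataset = LazyGPTDataset("turkmen_corpus.txt", tokenizer)  # placeholder file path
# An IterableDataset cannot be shuffled by the DataLoader, and num_workers > 0 would
# need per-worker sharding, so keep num_workers=0 for a first test.
loader = DataLoader(dataset, batch_size=4, num_workers=0)
```

One trade-off: an IterableDataset gives up the DataLoader's shuffling and needs extra care with multiple workers. A common alternative is to pre-tokenize the corpus once into a binary file and read training windows from it via numpy.memmap, which keeps random access while staying memory-efficient.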