When you run your training script, you may get an Out Of Memory error. This can also happen in a Jupyter notebook inside your Workspace or in a CLI Job. In a notebook, the Jupyter kernel dies without returning any information useful for debugging your code; in most cases the cause is an Out Of Memory error. So why does this happen?
Every instance on FloydHub comes with a fixed amount of memory (RAM and GPU dedicated memory). You can learn about the specs of our instances here. When your training code uses all of this memory you will get the Out Of Memory error.
To fix this, reduce the batch size in your training code, which reduces the amount of memory used. If you still see the problem, the model itself may not fit in the instance memory; in that case, try switching to a different instance that has more memory. You can also consider moving some operations from the GPU (dedicated memory) to the CPU (RAM) or vice versa. Be careful with this last option, because it can introduce bottlenecks that degrade performance during training and testing!
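As a rough illustration of the batch-size advice, the sketch below retries a training step with a halved batch size each time it runs out of memory. The names run_with_smaller_batches, train_step and oom_errors are hypothetical and not part of FloydHub or of any framework. The sketch assumes that running out of memory surfaces as a Python exception; frameworks usually raise their own exception type when the GPU runs out of memory, and you would pass that type in through oom_errors.

```python
def run_with_smaller_batches(train_step, batch_size, min_batch_size=1,
                             oom_errors=(MemoryError,)):
    """Call train_step(batch_size), halving batch_size after each OOM error.

    train_step is a hypothetical callable that runs one training step.
    oom_errors lists the exception types that signal "out of memory".
    Returns the first batch size that ran without an OOM error.
    """
    while batch_size >= min_batch_size:
        try:
            train_step(batch_size)
            return batch_size
        except oom_errors:
            print(f"Out of memory at batch size {batch_size}, halving it")
            batch_size //= 2
    # Nothing fit: the model itself is probably too large for this instance.
    raise MemoryError("even the smallest batch size does not fit in memory")
```

If this still fails at the minimum batch size, the model itself probably does not fit, and a larger instance is the next thing to try.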
You can also convert your notebook into a script and run it as a CLI Job or from your Workspace's Terminal. This way you will be able to see the actual error message.
Is there an equation that relates the size of a tensor to the memory it requires?
Yes. Assume X is a tensor of shape (a, b, c, ..., z), where a through z are the sizes of its dimensions and every element has the same data type. You can compute the amount of memory like this: a * b * ... * z * dtype_in_bit (divide by 8 if you want the result in bytes).
e.g. tensor = (9600, 9600, 9600), dtype = 32 bit (float representation)
With the above equation you get 9600^3 * 4 bytes ≈ 3.5 TB.
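The same calculation can be written as a small Python helper, shown here as a minimal sketch. The function name tensor_bytes is made up for illustration. It only counts the raw tensor data, so it leaves out any overhead added by your framework, as well as gradients and optimizer state.

```python
import math

def tensor_bytes(shape, dtype_bits):
    """Memory needed for the raw data of a tensor, in bytes.

    shape      -- sizes of the dimensions, e.g. (9600, 9600, 9600)
    dtype_bits -- bits per element, e.g. 32 for a 32-bit float
    """
    return math.prod(shape) * dtype_bits // 8

size = tensor_bytes((9600, 9600, 9600), 32)
print(size)                        # 3538944000000
print(f"{size / 1e12:.2f} TB")     # 3.54 TB
```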
How can I monitor the resource usage more accurately?
If you are running inside a Workspace and want to see how much GPU memory you are using right now, we recommend running one of these commands from a terminal:
watch -n1 nvidia-smi
nvidia-smi -l 2
This is just a high-level monitoring solution. If you want to track resource usage accurately, you can profile the computation (we are investigating different integrations).
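If you would rather log GPU memory from inside your training script than watch a terminal, one option is to call nvidia-smi from Python and parse its CSV output. The sketch below assumes that nvidia-smi is on the PATH of the instance. The helper names parse_gpu_memory and gpu_memory are hypothetical. It is still a coarse snapshot, not a profiler.

```python
import subprocess

# Ask nvidia-smi for used and total memory per GPU, as bare numbers in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_gpu_memory(csv_text):
    """Turn lines like '1234, 11441' into (used_mib, total_mib) tuples."""
    usage = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        usage.append((used, total))
    return usage

def gpu_memory():
    """Run nvidia-smi once and return one (used_mib, total_mib) per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling gpu_memory() every few training steps and printing the result gives a simple record of how memory usage grows over time.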
Note: A memory leak may have been introduced or fixed in a particular version of a package you use for your computation, so consider using its latest stable release (for more, see installing extra-dependency).