Skip to content
This repository was archived by the owner on Aug 28, 2021. It is now read-only.

How much memory do I need to train #37

@HongweiQin

Description

@HongweiQin

Hi,

First of all, thanks to your nice work.

I was trying to run your go engine on my server, which has about 120 GiB memory.

It went all right until I tried to train with provided dataset.

The output are as followed:

[root@localhost darkforestGo]# ./train.sh
{
  nstep = 3,
  optim = "supervised",
  loss = "policy",
  progress = false,
  nthread = 4,
  model_name = "model-12-parallel-384-n-output-bn",
  data_augmentation = true,
  actor = "policy",
  nGPU = 1,
  sampling = "replay",
  intermediate_step = 50,
  userank = true,
  alpha = 0.05,
  num_forward_models = 2048,
  batchsize = 256,
  epoch_size_test = 128000,
  feature_type = "extended",
  epoch_size = 128000,
  datasource = "kgs"
}	
fm_init: function: 0x4076e7c8	
fm_gen: function: 0x410f4a58	
fm_postprocess: nil	
rl.Dataset.__init(): forward_model_init is set, run it
rl.Dataset.__init(): forward_model_init is set, run it
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-4547/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/nn/Container.lua:67: 
In 1 module of nn.Sequential:
In 9 module of nn.Sequential:
/root/torch/install/share/lua/5.1/nn/THNN.lua:110: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-4547/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
	[C]: in function 'v'
	/root/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'BatchNormalization_updateOutput'
	/root/torch/install/share/lua/5.1/nn/BatchNormalization.lua:124: in function </root/torch/install/share/lua/5.1/nn/BatchNormalization.lua:113>
	[C]: in function 'xpcall'
	/root/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </root/torch/install/share/lua/5.1/nn/Sequential.lua:41>
	[C]: in function 'xpcall'
	/root/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
	./train/rl_framework/infra/bundle.lua:161: in function 'forward'
	./train/rl_framework/infra/agent.lua:46: in function 'optimize'
	./train/rl_framework/infra/engine.lua:114: in function 'train'
	./train/rl_framework/infra/framework.lua:304: in function 'run_rl'
	train.lua:155: in main chunk
	[C]: in function 'dofile'
	/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x004064f0

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
	[C]: in function 'error'
	/root/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
	/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
	./train/rl_framework/infra/bundle.lua:161: in function 'forward'
	./train/rl_framework/infra/agent.lua:46: in function 'optimize'
	./train/rl_framework/infra/engine.lua:114: in function 'train'
	./train/rl_framework/infra/framework.lua:304: in function 'run_rl'
	train.lua:155: in main chunk
	[C]: in function 'dofile'
	/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x004064f0

I ran the "free" command before training. It turns out like this:

[root@localhost darkforestGo]# free
              total        used        free      shared  buff/cache   available
Mem:      115383448     1317128   112506528       10744     1559792   113786336
Swap:      67108860           0    67108860

It seems that I'm facing an "out of memory" issue.

May I ask how much memory do I need to train?

Or, is there anything wrong elsewere?

Thanks in advance

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions