
Comments (16)

davek44 avatar davek44 commented on August 18, 2024

batcher.lua is my script. I realized I wasn't giving proper advice for setting the LUA_PATH variable so that basset_train.lua will find it. I've fixed the README.

Try
export LUA_PATH="$BASSETDIR/src/?.lua;$LUA_PATH"
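
To confirm the path was picked up (a quick check; the exact entries will differ on your machine), start th and look at package.path, which Lua initializes from LUA_PATH:

-- inside th: the Basset src entry should now appear at the front of the search path
print(package.path)    -- should start with <BASSETDIR>/src/?.lua
require 'batcher'      -- should load Basset's src/batcher.lua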


vjcitn avatar vjcitn commented on August 18, 2024

I've set LUA_PATH as directed in README.md (the value is given below, after the traceback). I still get:

%vjcair> basset_train.lua -job models/pretrained_params.txt -stagnant_t 10 encode_roadmap.h5
{
conv_filter_sizes :
{
1 : 19
2 : 11
3 : 7
}
weight_norm : 7
momentum : 0.98
learning_rate : 0.002
hidden_units :
{
1 : 1000
2 : 1000
}
conv_filters :
{
1 : 300
2 : 200
3 : 200
}
hidden_dropouts :
{
1 : 0.3
2 : 0.3
}
pool_width :
{
1 : 3
2 : 4
3 : 4
}
}
/Users/stvjc/torch/install/bin/luajit: /Users/stvjc/Research/BASSET/Basset/src/basset_train.lua:99: attempt to index global 'ConvNet' (a nil value)
stack traceback:
/Users/stvjc/Research/BASSET/Basset/src/basset_train.lua:99: in main chunk
[C]: in function 'dofile'
...tvjc/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x010c67cad0

/Users/stvjc/Research/BASSET/Basset/src/?.lua;/Users/stvjc/.luarocks/share/lua/5.1/?.lua;/Users/stvjc/.luarocks/share/lua/5.1/?/init.lua;/Users/stvjc/torch/install/share/lua/5.1/?.lua;/Users/stvjc/torch/install/share/lua/5.1/?/init.lua;./?.lua;/Users/stvjc/torch/install/share/luajit-2.1.0-beta1/?.lua;/usr/local/share/lua/5.1/?.lua;/usr/local/share/lua/5.1/?/init.lua


davek44 avatar davek44 commented on August 18, 2024

What do you see when you start torch up and try to "require 'batcher'" or "require 'convnet'"?


vjcitn avatar vjcitn commented on August 18, 2024

th> require 'convnet'
true
[0.5321s]
th> require 'batcher'
{
  make_batch_iterators : function: 0x140ca6a0
  make_chunk_iterator : function: 0x140d6e70
  load_text : function: 0x140d46b8
  split_indices : function: 0x140d78c0
  make_chunk_iterators : function: 0x140d8390
  stack : function: 0x140d8df0
  make_batch_iterator : function: 0x140d9888
}


davek44 avatar davek44 commented on August 18, 2024

OK, after require 'convnet', try "convnet = ConvNet:__init()". That's the line that it's crashing on when you run the script, right?


davek44 avatar davek44 commented on August 18, 2024

Also, that's not the output that I get from requiring batcher. Is there a chance that there's another script in your path called batcher?
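
One quick way to check which files are actually being loaded (a minimal sketch to run inside th; it assumes LuaJIT's Lua 5.2 extensions, so package.searchpath is available):

-- print the file that require() resolves each module name to
print(package.searchpath('batcher', package.path))
print(package.searchpath('convnet', package.path))
-- after requiring the repo's convnet.lua, ConvNet should be a non-nil global,
-- and the constructor call from the previous comment should work
require 'convnet'
print(ConvNet)
convnet = ConvNet:__init()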


vjcitn avatar vjcitn commented on August 18, 2024

With a new checkout on an Amazon EC2 instance of type g2.2xlarge, basset_train.lua appears to be working on encode_roadmap.h5, running as indicated in the README.

top shows the load on this machine to be 430. I have no experience with GPUs; there are apparently 1536 cores. I don't see any messages after starting basset_train.lua, so I wonder if things are OK. What kind of checkpointing is done? Are there temporary files to check?

If this is working, I will register the AMI and make it public so that others can test and build off it.

On Wed, Jun 1, 2016 at 3:41 PM, Vikram Agarwal [email protected]
wrote:

I believe I'm also having a problem with dependencies. I've done all of the appropriate exports, including the new LUA_PATH you mentioned, and have run "install_dependencies.py". Then I tried "./basset_train.lua $FILE" on my h5 file, and it gave this error:

Hyper-parameters unspecified. Applying a small model architecture
/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/nn/Module.lua:178: attempt to perform arithmetic on local 'storageOffset' (a nil value)
stack traceback:
  /home/ubuntu/torch/install/share/lua/5.1/nn/Module.lua:178: in function 'flatten'
  /home/ubuntu/torch/install/share/lua/5.1/dpnn/Module.lua:198: in function 'getParameters'
  ./convnet.lua:249: in function 'build'
  ./basset_train.lua:111: in main chunk
  [C]: in function 'dofile'
  ...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
  [C]: at 0x00406670

I tried "luarocks install nn" to get the latest version of the 'nn' package, and it resulted in a different error, also associated with the 'nn' package:

/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/trepl/init.lua:363: /home/ubuntu/torch/install/share/lua/5.1/trepl/init.lua:363: /home/ubuntu/torch/install/share/lua/5.1/trepl/init.lua:363: /home/ubuntu/torch/install/share/lua/5.1/trepl/init.lua:363: /home/ubuntu/torch/install/share/lua/5.1/nn/test.lua:12: attempt to call field 'TestSuite' (a nil value)
stack traceback:
  [C]: in function 'error'
  /home/ubuntu/torch/install/share/lua/5.1/trepl/init.lua:363: in function 'require'
  ./basset_train.lua:37: in main chunk
  [C]: in function 'dofile'
  ...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
  [C]: at 0x00406670

Any idea how to fix this?



vjcitn avatar vjcitn commented on August 18, 2024

the EC2 instance got completely overwhelmed and I stopped it. I ran
without the -cuda option and got to

ubuntu@ip-10-152-71-81:~/Basset/data$ basset_train.lua -job pretrained_params.txt -stagnant_t 10 encode_roadmap.h5
{
conv_filter_sizes :
{
  1 : 19
  2 : 11
  3 : 7
}
weight_norm : 7
momentum : 0.98
learning_rate : 0.002
hidden_units :
{
  1 : 1000
  2 : 1000
}
conv_filters :
{
  1 : 300
  2 : 200
  3 : 200
}
hidden_dropouts :
{
  1 : 0.3
  2 : 0.3
}
pool_width :
{
  1 : 3
  2 : 4
  3 : 4
}
}
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> output]
(1): nn.SpatialConvolution(4 -> 300, 19x1)
(2): nn.SpatialBatchNormalization
(3): nn.ReLU
(4): nn.SpatialMaxPooling(3x1, 3,1)
(5): nn.SpatialConvolution(300 -> 200, 11x1)
(6): nn.SpatialBatchNormalization
(7): nn.ReLU
(8): nn.SpatialMaxPooling(4x1, 4,1)
(9): nn.SpatialConvolution(200 -> 200, 7x1)
(10): nn.SpatialBatchNormalization
(11): nn.ReLU
(12): nn.SpatialMaxPooling(4x1, 4,1)
(13): nn.Reshape(2000)
(14): nn.Linear(2000 -> 1000)
(15): nn.BatchNormalization
(16): nn.ReLU
(17): nn.Dropout(0.300000)
(18): nn.Linear(1000 -> 1000)
(19): nn.BatchNormalization
(20): nn.ReLU
(21): nn.Dropout(0.300000)
(22): nn.Linear(1000 -> 164)
(23): nn.Sigmoid
}
OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.

This last message happens a lot, and it looks like we need to rebuild OpenBLAS.

more later


vjcitn avatar vjcitn commented on August 18, 2024

export OMP_NUM_THREADS=1 solves the OpenBLAS warning.

Now it's just running with a 3 GB RAM image. How long will an epoch take?


vjcitn avatar vjcitn commented on August 18, 2024

I've noticed that when I use the -cuda option I do not get the diagnostic trace reported two messages ago. Would the g2.8xlarge instance type be more effective with the example settings?


vjcitn avatar vjcitn commented on August 18, 2024

I tried with the g2.8xlarge instance type. The program moved along but took up 85 GB of RAM according to top, on a 60 GB machine, so I stopped it.


davek44 avatar davek44 commented on August 18, 2024

I'm not familiar with EC2, so I can't help much with details there.

You asked how long epochs typically take, and it varies pretty widely based on the model, input data, and computing device. A large model trained on the full data on the Tesla K20s that I've been using can take up to 6 hours.

With respect to the memory leak, I had a similar problem at one point. I started towards a fix from the following thread: torch/torch7#229

Basically, on some Linux systems malloc holds onto any memory the program has allocated. The program can write over that memory, but it's not released back to the OS.

This different version of malloc solves the problem: http://www.canonware.com/jemalloc/index.html

Specifically, I installed jemalloc and now set:
export LD_PRELOAD=/mypath/jemalloc-4.0.1/lib/libjemalloc.so
in my .bashrc. Then Torch releases the memory, and everything works great.
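
If you want to verify the memory behavior from inside the Torch process, resident memory can be read from /proc (a Linux-only sketch; rss_kb is just a throwaway helper, not part of Basset or Torch):

-- report the process's resident set size (VmRSS) in kB, read from /proc/self/status
local function rss_kb()
  for line in io.lines('/proc/self/status') do
    local kb = line:match('^VmRSS:%s+(%d+)%s+kB')
    if kb then return tonumber(kb) end
  end
end
print('resident memory (kB): ' .. rss_kb())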


vjcitn avatar vjcitn commented on August 18, 2024

The large memory image no longer appears once libjemalloc is used.

However, with the -cuda option I don't see the diagnostic print of the model parameters.

top shows a load of 515 ... nothing going on on the main CPU. Does that seem right?

top - 01:34:54 up 26 min,  2 users,  load average: 515.08, 482.10, 302.66
Tasks: 1044 total,   3 running, 1041 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.1 sy,  0.0 ni, 93.7 id,  0.0 wa,  0.0 hi,  6.1 si,  0.0 st
KiB Mem:  61837044 total,  1023484 used, 60813560 free,    40480 buffers
KiB Swap:        0 total,        0 used,        0 free.   349672 cached Mem

  PID USER    PR  NI    VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
  345 root    20   0       0     0     0 R 100.0  0.0 14:01.17 ksoftirqd/16
    3 root    20   0       0     0     0 R  50.0  0.0  7:05.55 ksoftirqd/0
 4685 root    20   0       0     0     0 D  49.7  0.0  6:48.80 kworker/0:7
 5245 ubuntu  20   0   24548  2588  1168 R   0.7  0.0  0:00.98 top
 1772 root    20   0   19676  1248   552 S   0.3  0.0  0:00.41 irqbalance
    1 root    20   0   33864  3216  1492 S   0.0  0.0  0:09.84 init
    2 root    20   0       0     0     0 S   0.0  0.0  0:00.01 kthreadd
    4 root    20   0       0     0     0 D   0.0  0.0  0:00.00 kworker/0:0
    5 root     0 -20       0     0     0 S   0.0  0.0  0:00.00 kworker/0:0H
    6 root    20   0       0     0     0 S   0.0  0.0  0:00.00 kworker/u256:0
    7 root    20   0       0     0     0 S   0.0  0.0  0:00.01 kworker/u257:0
    8 root    20   0       0     0     0 S   0.0  0.0  0:00.38 rcu_sched
    9 root    20   0       0     0     0 S   0.0  0.0  0:00.06 rcuos/0
   10 root    20   0       0     0     0 S   0.0  0.0  0:00.02 rcuos/1
   11 root    20   0       0     0     0 S   0.0  0.0  0:00.05 rcuos/2
   12 root    20   0       0     0     0 S   0.0  0.0  0:00.02 rcuos/3
   13 root    20   0       0     0     0 S   0.0  0.0  0:00.01 rcuos/4
   14 root    20   0       0     0     0 S   0.0  0.0  0:00.01 rcuos/5
   15 root    20   0       0     0     0 S   0.0  0.0  0:00.01 rcuos/6
   16 root    20   0       0     0     0 S   0.0  0.0  0:00.00 rcuos/7


davek44 avatar davek44 commented on August 18, 2024

It's hard for me to tell if it's working for you or not. If not, let me know if there was any output and perhaps I can advise.


vjcitn avatar vjcitn commented on August 18, 2024

Hi, thanks for following up. There was no output after about an hour, with top listing 500+ jobs and no indication of memory consumption. I killed the job and have not revisited. I am still interested in getting it to run on the example data, but perhaps I need to target some smaller task.


yukatherin avatar yukatherin commented on August 18, 2024

Edit: Please disregard. My question was due to bash syntax errors. You just need to check that echo $LUA_PATH includes your $BASSETDIR/src/?.lua entry.

