This has been partially asked here and here with no follow-ups, so this may not be the right venue, but I've gathered a little more information that I hope will help get an answer.
I've been attempting to train object_detection on my own library of roughly 1k photos, using the provided pipeline config file "ssd_inception_v2_pets.config". I believe I've set up the training data properly: when the program couldn't read the data it raised an error, which I fixed, and training now appears to start just fine.
My train_config settings are as follows; I've changed a few of the numbers to try to get it to run with fewer resources.
train_config: {
  batch_size: 1000 # also tried 1, 10, and 100
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.04 # also tried .004
          decay_steps: 800 # also tried 800720. 80072
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "~/Downloads/ssd_inception_v2_coco_11_06_2017/model.ckpt" # using inception checkpoint
  from_detection_checkpoint: true
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}
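For a rough sense of scale, here is a back-of-the-envelope sketch of how much memory the input batch alone would need at each batch size I tried. It assumes 300x300 RGB inputs stored as float32; the image resizer settings are not shown above, so that shape is an assumption, and the helper name `batch_input_bytes` is made up for illustration. Activations, gradients, and any queued batches would come on top of this figure.

```python
# Assumed input shape: 300x300 RGB stored as float32.
# This is NOT confirmed by the train_config shown above.
HEIGHT, WIDTH, CHANNELS = 300, 300, 3
BYTES_PER_FLOAT32 = 4


def batch_input_bytes(batch_size):
    """Bytes needed to hold one batch of input images as float32."""
    return batch_size * HEIGHT * WIDTH * CHANNELS * BYTES_PER_FLOAT32


# The batch sizes mentioned in the config comment.
for batch_size in (1, 10, 100, 1000):
    gib = batch_input_bytes(batch_size) / 2**30
    print(f"batch_size={batch_size:>4}: {gib:.3f} GiB for the input tensor alone")
```

Under these assumptions the input tensor grows linearly with batch_size, so the batch size is the first number worth lowering when memory is tight.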
Basically, I think the computer is getting resource-starved very quickly. Does anyone know of an optimization that takes more time to run but uses fewer resources?
Or am I wrong about why the process is getting killed, and is there a way for me to get more information about that from the kernel?
This is the dmesg output that I get after the process is killed:
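Independently of the kernel, one way to see how memory grows from inside the process is to log its peak resident set size. The following is a minimal sketch using only Python's standard `resource` module (Unix only); the helper name `peak_rss_mib` is hypothetical, and where to call it inside the training script is left open, since I have not wired it into object_detection.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 2**20 if sys.platform == "darwin" else 2**10
    return peak / divisor


print(f"peak RSS so far: {peak_rss_mib():.1f} MiB")
```

Printing this every few steps would show whether memory climbs steadily or jumps as soon as the first batch is built.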
[711708.975215] Out of memory: Kill process 22087 (python) score 517 or sacrifice child
[711708.975221] Killed process 22087 (python) total-vm:9086536kB, anon-rss:6114136kB, file-rss:24kB, shmem-rss:0kB
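To make those numbers easier to read, here is a small sketch that pulls the memory fields out of the "Killed process" line and converts them from kB to GiB. The regular expression is my own guess at the field layout based on this one line, not a general dmesg parser, and `parse_oom_line` is a made-up helper name.

```python
import re

# The "Killed process" line copied from the dmesg output above.
LINE = ("[711708.975221] Killed process 22087 (python) "
        "total-vm:9086536kB, anon-rss:6114136kB, file-rss:24kB, shmem-rss:0kB")


def parse_oom_line(line):
    """Return a dict mapping each memory field name to its size in kB."""
    return {name: int(kb) for name, kb in re.findall(r"([\w-]+):(\d+)kB", line)}


fields = parse_oom_line(LINE)
for name, kb in fields.items():
    # Treat the kernel's kB as 1024 bytes, so 2**20 kB make one GiB.
    print(f"{name}: {kb / 2**20:.2f} GiB")
```

For this line, anon-rss works out to roughly 5.8 GiB of resident memory at the moment the process was killed.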