
Fine-tuning? (ctrl, 12 comments, closed)

saippuakauppias commented on August 22, 2024
Fine-tuning?

from ctrl.

Comments (12)

keskarnitish commented on August 22, 2024

An update on this: I'm working on adding the TPU training code but in parallel am also looking into porting the inference/weights to PyTorch. Are folks here interested in fine-tuning on GPUs or TPUs?

from ctrl.

keskarnitish commented on August 22, 2024

Thanks for the feedback everyone. I added PyTorch (w/ GPUs) support today, I'll be working on training code next. I'll keep you folks posted.

from ctrl.

minimaxir commented on August 22, 2024

If this model can barely fit into a GPU for normal generation, then finetuning on a GPU wouldn't remotely work.

You'd have to use a TPU, which isn't cheap. Might be interesting if there is Colaboratory compatibility.

from ctrl.

christophschuhmann commented on August 22, 2024

GPUs would be easier to get. How much GPU RAM is needed to tune the model?

from ctrl.
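
A rough back-of-the-envelope answer to the RAM question (not from the thread; it assumes fp32 storage for weights, gradients, and a single Adagrad accumulator, and ignores activations entirely):

# Rough fine-tuning memory estimate for CTRL; the figures below are assumptions, not measurements.
params = 1637964550              # total parameter count (1,637,964,550) from the model summary later in this thread
bytes_per_param = 4              # fp32

weights = params * bytes_per_param
gradients = params * bytes_per_param
adagrad = params * bytes_per_param     # Adagrad keeps one accumulator slot per parameter

total_gib = (weights + gradients + adagrad) / 1024.0 ** 3
print("~%.1f GiB before activations" % total_gib)    # ~18.3 GiB, already past a 16 GB card

Even before activations and the input pipeline, that puts fp32 fine-tuning well beyond a single 11-16 GB GPU, which is consistent with the suggestion above that a TPU (or model parallelism / reduced precision) would be needed.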

Stamenov commented on August 22, 2024

If smaller models (or code), like the 355M GPT-2 model, are released, fine-tuning could become possible.

from ctrl.

keskarnitish commented on August 22, 2024

I've added training code in the training_utils folder with an example on GPUs. Training on TPUs requires very few changes and I've added commentary accordingly. There are some snags getting it working on multiple GPUs but I'll look into that later if there is sufficient interest.

Reopen or file another issue as necessary :)

from ctrl.

alexbnewhouse commented on August 22, 2024

Very interested in this as well! Looks like they will release the training script soon—maybe then.

from ctrl.

hamletbatista commented on August 22, 2024

> An update on this: I'm working on adding the TPU training code but in parallel am also looking into porting the inference/weights to PyTorch. Are folks here interested in fine-tuning on GPUs or TPUs?

It'd be great to be able to fine-tune using the free TPU in Google Colab.

from ctrl.

alexbnewhouse commented on August 22, 2024

@keskarnitish I generally do fine-tuning with GPT-2 etc. on GPUs since my funds are limited, but Google Colab TPU fine-tuning would be awesome.

from ctrl.

hamletbatista commented on August 22, 2024

@keskarnitish great progress! I think gsutil is installed by default in Colab.

Can't wait for the training code :)

from ctrl.
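
On the free Colab TPU idea, an aside that is not from the thread but ties in with the gsutil comment above: a Cloud/Colab TPU worker reads and writes checkpoints through Google Cloud Storage rather than the local filesystem, so the estimator's model_dir has to be a gs:// path. A minimal sketch of wiring up a 2019-era Colab TPU runtime, assuming the COLAB_TPU_ADDR environment variable that Colab set at the time and a placeholder bucket name:

import os
import tensorflow as tf   # TF 1.x, matching the deprecation warnings in the logs below

# Colab runtimes of that era exposed the attached TPU's gRPC address in this variable.
tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_address)

run_config = tf.estimator.tpu.RunConfig(
    cluster=resolver,
    model_dir='gs://my-bucket/ctrl-finetune',   # placeholder bucket; the TPU worker needs GCS, not local disk
    tpu_config=tf.estimator.tpu.TPUConfig(iterations_per_loop=100),
)

The downloaded checkpoint would likewise need to be copied into the bucket (gsutil cp -r handles that) before pointing model_dir at it.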

saippuakauppias commented on August 22, 2024

@keskarnitish multi-GPU support is very interesting because it is usually cheaper (especially with 1080 Ti cards).

from ctrl.
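
On the multi-GPU question: CTRL's training.py goes through a TPUEstimator, so the snippet below is only a sketch of the generic TF 1.x pattern, not the repository's code; model_fn and input_fn stand in for whatever the training script builds:

import tensorflow as tf   # TF 1.x Estimator API

def train_on_local_gpus(model_fn, input_fn, model_dir='seqlen256_v1.ckpt/', steps=250):
    """Data-parallel training across all visible local GPUs via MirroredStrategy."""
    strategy = tf.distribute.MirroredStrategy()        # one replica per visible GPU
    run_config = tf.estimator.RunConfig(model_dir=model_dir, train_distribute=strategy)
    estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
    estimator.train(input_fn=input_fn, steps=steps)

Note that this is data parallelism: every GPU still has to hold the full ~1.6B-parameter model plus optimizer state, so it helps throughput but not the memory ceiling estimated earlier; 11 GB cards like the 1080 Ti would still come up short without model parallelism or reduced precision.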

GrahamboJangles commented on August 22, 2024

@keskarnitish - I'm following the README for fine-tuning and ran into this error when running !python2 training.py --model_dir seqlen256_v1.ckpt/ --iterations 250

WARNING: Logging before flag parsing goes to stderr.
W1112 23:02:45.336682 140485145098112 deprecation_wrapper.py:119] From training.py:8: The name tf.enable_eager_execution is deprecated. Please use tf.compat.v1.enable_eager_execution instead.

W1112 23:02:45.472126 140485145098112 deprecation_wrapper.py:119] From training.py:33: The name tf.random.set_random_seed is deprecated. Please use tf.compat.v1.random.set_random_seed instead.

246534 unique words
2019-11-12 23:02:45.829719: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-11-12 23:02:45.854701: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-12 23:02:45.855657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
2019-11-12 23:02:45.860013: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-11-12 23:02:45.869006: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-11-12 23:02:45.874573: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-11-12 23:02:45.884391: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-11-12 23:02:45.895278: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-11-12 23:02:45.901163: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-11-12 23:02:45.917312: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-11-12 23:02:45.917486: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-12 23:02:45.918313: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-12 23:02:45.919022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-11-12 23:02:45.919462: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-12 23:02:46.006427: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-12 23:02:46.007293: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55a0d0d1ed80 executing computations on platform CUDA. Devices:
2019-11-12 23:02:46.007328: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2019-11-12 23:02:46.009946: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-11-12 23:02:46.010212: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55a0d0d1f2c0 executing computations on platform Host. Devices:
2019-11-12 23:02:46.010246: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-11-12 23:02:46.010470: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-12 23:02:46.011178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
2019-11-12 23:02:46.011250: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-11-12 23:02:46.011285: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-11-12 23:02:46.011321: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-11-12 23:02:46.011355: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-11-12 23:02:46.011387: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-11-12 23:02:46.011425: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-11-12 23:02:46.011456: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-11-12 23:02:46.011555: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-12 23:02:46.012298: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-12 23:02:46.012991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-11-12 23:02:46.013057: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-11-12 23:02:46.014505: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-12 23:02:46.014536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-11-12 23:02:46.014552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2019-11-12 23:02:46.014690: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-12 23:02:46.015445: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-12 23:02:46.016134: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:40] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2019-11-12 23:02:46.016189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10805 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
W1112 23:03:07.695101 140485145098112 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W1112 23:03:07.695425 140485145098112 deprecation_wrapper.py:119] From training.py:136: The name tf.train.AdagradOptimizer is deprecated. Please use tf.compat.v1.train.AdagradOptimizer instead.

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            [(None, 256)]        0                                            
__________________________________________________________________________________________________
tied_embedding_softmax (TiedEmb multiple             315810054   input_1[0][0]                    
                                                                 encoder[0][0]                    
__________________________________________________________________________________________________
encoder (Encoder)               (None, 256, 1280)    1322154496  tied_embedding_softmax[0][0]     
==================================================================================================
Total params: 1,637,964,550
Trainable params: 1,637,964,550
Non-trainable params: 0
__________________________________________________________________________________________________
None
W1112 23:03:07.870990 140485145098112 deprecation_wrapper.py:119] From training.py:158: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W1112 23:03:07.871377 140485145098112 deprecation_wrapper.py:119] From training.py:160: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

W1112 23:03:07.871547 140485145098112 deprecation_wrapper.py:119] From training.py:160: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

I1112 23:03:07.871721 140485145098112 keras.py:424] Using the Keras model provided.
W1112 23:03:07.871866 140485145098112 keras.py:452] You are creating an Estimator from a Keras model manually subclassed from `Model`, that was already called on some inputs (and thus already had weights). We are currently unable to preserve the model's state (its weights) as part of the estimator in this case. Be warned that the estimator has been created using a freshly initialized version of your model.
Note that this doesn't affect the state of the model instance you passed as `keras_model` argument.
W1112 23:03:07.872073 140485145098112 estimator.py:1984] Estimator's model_fn (<function model_fn at 0x7fc486491230>) includes params argument, but params are not passed to Estimator.
I1112 23:03:07.882476 140485145098112 estimator.py:209] Using config: {'_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
log_device_placement: true
, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc486501950>, '_model_dir': 'seqlen256_v1.ckpt/', '_protocol': None, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=1, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=[[1, 1], [1, 1]], eval_training_input_configuration=2), '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_cluster': None, '_experimental_distribute': None, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': None, '_experimental_max_worker_delay_secs': None, '_evaluation_master': '', '_eval_distribute': None, '_global_id_in_cluster': 0, '_master': ''}
I1112 23:03:07.882745 140485145098112 tpu_context.py:209] _TPUContext: eval_on_tpu True
I1112 23:03:07.885145 140485145098112 tpu_system_metadata.py:78] Querying Tensorflow master () for TPU system metadata.
2019-11-12 23:03:07.894349: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-12 23:03:07.894909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
2019-11-12 23:03:07.895004: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-11-12 23:03:07.895042: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-11-12 23:03:07.895082: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-11-12 23:03:07.895114: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-11-12 23:03:07.895145: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-11-12 23:03:07.895181: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-11-12 23:03:07.895220: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-11-12 23:03:07.895330: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-12 23:03:07.895776: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-12 23:03:07.896153: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-11-12 23:03:07.896213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-12 23:03:07.896235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-11-12 23:03:07.896250: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2019-11-12 23:03:07.896374: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-12 23:03:07.896829: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-12 23:03:07.897206: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10805 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
E1112 23:03:07.898518 140485145098112 error_handling.py:70] Error recorded from training_loop: Cannot find any TPU cores in the system (master address ). This usually means the master address is incorrect or the TPU worker has some problems. Available devices: [_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 7255588207913172666), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184, 4055659955816854951), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 12314982218041109490), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:0, GPU, 11330115994, 12717145914221894758)]
I1112 23:03:07.898689 140485145098112 error_handling.py:96] training_loop marked as finished
W1112 23:03:07.898884 140485145098112 error_handling.py:130] Reraising captured error
Traceback (most recent call last):
  File "training.py", line 164, in <module>
    estimator_model.train(input_fn=input_fn, steps=args.iterations)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2876, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 131, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2871, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 364, in train
    hooks.extend(self._convert_train_steps_to_hooks(steps, max_steps))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2746, in _convert_train_steps_to_hooks
    if ctx.is_running_on_cpu():
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 442, in is_running_on_cpu
    self._validate_tpu_configuration()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 604, in _validate_tpu_configuration
    num_cores = self.num_cores
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 349, in num_cores
    metadata = self._get_tpu_system_metadata()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 274, in _get_tpu_system_metadata
    query_topology=self.model_parallelism_enabled))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/tpu/tpu_system_metadata.py", line 128, in _query_tpu_system_metadata
    master_address, devices))
RuntimeError: Cannot find any TPU cores in the system (master address ). This usually means the master address is incorrect or the TPU worker has some problems. Available devices: [_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 7255588207913172666), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184, 4055659955816854951), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 12314982218041109490), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:0, GPU, 11330115994, 12717145914221894758)]

I get this error when running with a GPU or a TPU. Here's my Colab notebook, edited from the low-memory Colab notebook.

I'm not sure if it's because I need to apply estimator.patch with use_tpu=True, but when I try to apply the patch I get an error:

patching file /usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/keras.py
Hunk #1 succeeded at 48 (offset 1 line).
Hunk #2 FAILED at 228.
Hunk #3 FAILED at 239.
Hunk #4 FAILED at 448.
Hunk #5 FAILED at 462.
4 out of 5 hunks FAILED -- saving rejects to file /usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/keras.py.rej

However, I checked my /usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/keras.py and it has use_tpu set to True, so it must have been patched anyway.

from ctrl.
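
The "Cannot find any TPU cores" error above is the TPUEstimator insisting on a TPU master address, which points at the effective use_tpu setting rather than at the GPU itself. Also, hunks that FAIL during patching can mean either that the patch did not apply or that those changes were already present, so inspecting keras.py directly, as done above, is the right instinct. As a sketch only (not the repository's actual patch; model_fn and the batch size are placeholders), the relevant knob looks like this:

import tensorflow as tf   # TF 1.x

def build_local_estimator(model_fn, model_dir='seqlen256_v1.ckpt/'):
    """Sketch: with use_tpu=False, TPUEstimator skips the TPU-core lookup and trains on the local GPU/CPU."""
    run_config = tf.estimator.tpu.RunConfig(model_dir=model_dir)
    return tf.estimator.tpu.TPUEstimator(
        model_fn=model_fn,        # placeholder for the model_fn the training script builds
        config=run_config,
        use_tpu=False,            # leaving this effectively True without a TPU cluster raises the error above
        train_batch_size=4,       # placeholder batch size
    )

For an actual Colab TPU run, the complementary change would be use_tpu=True together with the TPUClusterResolver and gs:// model_dir sketched earlier in the thread.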
