Comments (6)

TobyHFerguson commented on August 17, 2024

This might need a systemctl restart rpcbind; the start didn't seem to work :-(

from cdsw_install.

TobyHFerguson commented on August 17, 2024

Fixed in 3be91e2

TobyHFerguson commented on August 17, 2024

Still seeing this issue:

[root@ip-172-16-21-234 ~]# journalctl -xn | grep failed
[root@ip-172-16-21-234 ~]# systemctl status rpcbind
● rpcbind.service - RPC bind service
   Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; indirect; vendor preset: enabled)
   Active: active (running) since Wed 2017-05-24 20:03:20 EDT; 6min ago
 Main PID: 7909 (rpcbind)
   CGroup: /system.slice/rpcbind.service
           └─7909 /sbin/rpcbind -w

May 24 20:03:20 ip-172-16-21-234.us-east-2.compute.internal systemd[1]: Starting RPC bind service...
May 24 20:03:20 ip-172-16-21-234.us-east-2.compute.internal systemd[1]: Started RPC bind service.
May 24 20:03:20 ip-172-16-21-234.us-east-2.compute.internal systemd[1]: Dependency failed for RPC bind service.
May 24 20:03:20 ip-172-16-21-234.us-east-2.compute.internal systemd[1]: Job rpcbind.service/start failed with result 'dependency'.

even after restarting rpcbind shortly before doing the cdsw init. Looks like more needs to be done. This is intermittent :-(
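
A note on the first command above: journalctl -n shows only the most recent entries (10 by default), so journalctl -xn | grep failed can come back empty even when the unit's own log records a failure. A minimal sketch of a wider search follows. The helper name dependency_failures is made up for illustration, and the patterns are simply the two systemd messages quoted above.

```shell
# Hypothetical helper: reads journal text on stdin and keeps only the
# systemd dependency-failure lines seen in the status output above.
dependency_failures() {
  grep -E "Dependency failed for|failed with result 'dependency'"
}

# Self-contained demonstration on two lines shaped like that output;
# only the "Dependency failed" line passes the filter.
printf '%s\n' \
  "systemd[1]: Started RPC bind service." \
  "systemd[1]: Dependency failed for RPC bind service." |
  dependency_failures
```

On the node itself the input would come from the unit's full log, for example journalctl -u rpcbind --no-pager | dependency_failures.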

TobyHFerguson commented on August 17, 2024

I created a new cluster and cdsw was not running.

rpcbind reported itself as running:

[root@ip-10-0-0-66 ~]# systemctl status rpcbind
● rpcbind.service - RPC bind service
   Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; indirect; vendor preset: enabled)
   Active: active (running) since Thu 2017-05-25 15:54:16 EDT; 19min ago
 Main PID: 7698 (rpcbind)
   CGroup: /system.slice/rpcbind.service
           └─7698 /sbin/rpcbind -w

May 25 15:54:16 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Starting RPC bind service...
May 25 15:54:16 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Started RPC bind service.

cdsw reported itself as not running:

[root@ip-10-0-0-66 ~]# cdsw status
Cloudera Data Science Workbench Status

Service Status
docker: unknown
kubelet: unknown
nfs: inactive
Checking kernel parameters...

Node Status
Cloudera Data Science Workbench is not ready yet: kubectl command failed

When I tried to start cdsw I saw this:

[root@ip-10-0-0-66 ~]# cdsw init
Using user-specified config file: /etc/cdsw/config/cdsw.conf
Prechecking OS Version........[OK]
Prechecking scaling limits for processes........[OK]
Prechecking scaling limits for open files........
WARNING: Cloudera Data Science Workbench recommends that all users have a max-open-files limit set to 1048576.
It is currently set to [32768] as per 'ulimit -n'
Press enter to continue

Prechecking that iptables are not configured........[OK]
Prechecking that SELinux is disabled........[OK]
Prechecking configured block devices and mountpoints........[OK]
Prechecking kernel parameters........[OK]
Prechecking that docker block devices are of adequate size........[OK]
Prechecking that application block devices are of adequate size........[OK]
Prechecking size of root volume........[OK]
Prechecking that CDH gateway roles are configured........[OK]
Prechecking that /etc/krb5 file is not a placeholder........[OK]
Prechecking parcel paths........[OK]
Prechecking CDH client configurations........[OK]
Prechecking Java version........[OK]
Prechecking Java distribution........[OK]
Creating docker thinpool if it does not exist
  Volume group "docker" not found
  Cannot process volume group docker
Unmounting /dev/xvdg
umount: /dev/xvdg: not mounted
Removing Docker volume groups.
  Volume group "docker" not found
  Cannot process volume group docker
  Volume group "docker" not found
  Cannot process volume group docker
Cleaning up docker directories...
  Wiping ext4 signature on /dev/xvdg.
  Physical volume "/dev/xvdg" successfully created.
  Volume group "docker" successfully created
  Logical volume "thinpool" created.
  Logical volume "thinpoolmeta" created.
  WARNING: Converting logical volume docker/thinpool and docker/thinpoolmeta to thin pool's data and metadata volumes with metadata wiping.
  THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
  Converted docker/thinpool to thin pool.
  Logical volume docker/thinpool changed.
Initialize application storage at /var/lib/cdsw
Disabling node with IP [10.0.0.66]...
Node [10.0.0.66] removed from nfs export list successfully.
Stopping rpc-statd...
Stopping nfs-idmapd...
Stopping rpcbind...
Stopping nfs-server...
Removing entry from /etc/fstab...
Skipping format since volumes are already set correctly.
Adding entry to /etc/fstab...
Mounting [/var/lib/cdsw]...
Starting rpc-statd...
Enabling rpc-statd...
Starting nfs-idmapd...
Enabling nfs-idmapd...
Starting rpcbind...

ERROR:: Could not start rpcbind: 1

ERROR:: Unable to create application device: 1

And then rpcbind reported:

[root@ip-10-0-0-66 ~]# systemctl status rpcbind
● rpcbind.service - RPC bind service
   Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; indirect; vendor preset: enabled)
   Active: active (running) since Thu 2017-05-25 16:14:13 EDT; 13s ago
 Main PID: 10182 (rpcbind)
   CGroup: /system.slice/rpcbind.service
           └─10182 /sbin/rpcbind -w

May 25 16:14:13 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Starting RPC bind service...
May 25 16:14:13 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Started RPC bind service.
May 25 16:14:13 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Dependency failed for RPC bind service.
May 25 16:14:13 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Job rpcbind.service/start failed with result 'dependency'.

I looked into this further:

[root@ip-10-0-0-66 ~]# journalctl -xn
-- Logs begin at Thu 2017-05-25 15:35:45 EDT, end at Thu 2017-05-25 16:14:14 EDT. --
May 25 16:14:14 ip-10-0-0-66.us-west-2.compute.internal rpc.mountd[10249]: Version 1.3.0 starting
May 25 16:14:14 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Started NFS Mount Daemon.
-- Subject: Unit nfs-mountd.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nfs-mountd.service has finished starting up.
--
-- The start-up result is done.
May 25 16:14:14 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Starting NFS server and services...
-- Subject: Unit nfs-server.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nfs-server.service has begun starting up.
May 25 16:14:14 ip-10-0-0-66.us-west-2.compute.internal nfsdcltrack[10259]: Unable to prepare select statement: no such table: para
May 25 16:14:14 ip-10-0-0-66.us-west-2.compute.internal nfsdcltrack[10259]: Unable to prepare select statement: no such table: para
May 25 16:14:14 ip-10-0-0-66.us-west-2.compute.internal kernel: NFSD: starting 90-second grace period (net ffffffff81a25e00)
May 25 16:14:14 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Started NFS server and services.
-- Subject: Unit nfs-server.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nfs-server.service has finished starting up.
--
-- The start-up result is done.
May 25 16:14:14 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Starting Notify NFS peers of a restart...
-- Subject: Unit rpc-statd-notify.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit rpc-statd-notify.service has begun starting up.
May 25 16:14:14 ip-10-0-0-66.us-west-2.compute.internal sm-notify[10271]: Version 1.3.0 starting
May 25 16:14:14 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Started Notify NFS peers of a restart.
-- Subject: Unit rpc-statd-notify.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit rpc-statd-notify.service has finished starting up.
--
-- The start-up result is done.

and then tried to perform the cdsw init again:

[root@ip-10-0-0-66 ~]# cdsw init
Using user-specified config file: /etc/cdsw/config/cdsw.conf
Prechecking OS Version........[OK]
Prechecking scaling limits for processes........[OK]
Prechecking scaling limits for open files........
WARNING: Cloudera Data Science Workbench recommends that all users have a max-open-files limit set to 1048576.
It is currently set to [32768] as per 'ulimit -n'
Press enter to continue

Prechecking that iptables are not configured........[OK]
Prechecking that SELinux is disabled........[OK]
Prechecking configured block devices and mountpoints........[OK]
Prechecking kernel parameters........[OK]
Prechecking that docker block devices are of adequate size........[OK]
Prechecking that application block devices are of adequate size........[OK]
Prechecking size of root volume........[OK]
Prechecking that CDH gateway roles are configured........[OK]
Prechecking that /etc/krb5 file is not a placeholder........[OK]
Prechecking parcel paths........[OK]
Prechecking CDH client configurations........[OK]
Prechecking Java version........[OK]
Prechecking Java distribution........[OK]
State Transition [init-started => init-started] not allowed.

ERROR:: Please run 'cdsw reset' before running this command.: 1

ERROR:: Unable to set state to [init-started]: 1

Given the error messages I tried cdsw reset:

[root@ip-10-0-0-66 ~]# cdsw reset
Resetting state of Cloudera Data Science Workbench...
Stopping Cloudera Data Science Workbench Master Node...
Stopping Cloudera Data Science Workbench App ...
The connection to the server localhost:8080 was refused - did you specify the right host or port?
Failed to delete kubernetes cluster.
The connection to the server localhost:8080 was refused - did you specify the right host or port?
Failed to delete dangling kubernetes pods.
Cloudera Data Science Workbench App stopped.
Stopping kubelet.
Stopping weave.
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
See 'docker run --help'.
Stopping all docker containers.
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Stopping docker.
Shutting down NFS components.
Disabling docker, kubelet to prevent restart on reboot.
Reloading systemd configuration.
Cleaning up networking configuration.
Stopping interface docker0...
Deleting interface docker0...
Stopping interface cbr0...
Deleting interface cbr0...
Stopping interface cni0...
Deleting interface cni0...
Stopping interface flannel0...
Deleting interface flannel0...
Stopping interface flannel.1...
Deleting interface flannel.1...
Stopping interface weave...
Deleting interface weave...
Cloudera Data Science Workbench services stopped.
Cloudera Data Science Workbench stopped.
Unmounting mountpoints created by Kubernetes...
Resetting Master Node...
Running pre-flight checks
Stopping the kubelet service...
Unmounting directories in /var/lib/kubelet...
Deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf]
Deleting contents of stateful directories: [/var/lib/kubelet /var/lib/etcd]
docker doesn't seem to be running, skipping the removal of kubernetes containers
Removing role of node as [master]
Deleting directories related to Kubernetes...
[root@ip-10-0-0-66 ~]# cdsw init
Using user-specified config file: /etc/cdsw/config/cdsw.conf
Prechecking OS Version........[OK]
Prechecking scaling limits for processes........[OK]
Prechecking scaling limits for open files........
WARNING: Cloudera Data Science Workbench recommends that all users have a max-open-files limit set to 1048576.
It is currently set to [32768] as per 'ulimit -n'
Press enter to continue

Prechecking that iptables are not configured........[OK]
Prechecking that SELinux is disabled........[OK]
Prechecking configured block devices and mountpoints........[OK]
Prechecking kernel parameters........[OK]
Prechecking that docker block devices are of adequate size........[OK]
Prechecking that application block devices are of adequate size........[OK]
Prechecking size of root volume........[OK]
Prechecking that CDH gateway roles are configured........[OK]
Prechecking that /etc/krb5 file is not a placeholder........[OK]
Prechecking parcel paths........[OK]
Prechecking CDH client configurations........[OK]
Prechecking Java version........[OK]
Prechecking Java distribution........[OK]
Creating docker thinpool if it does not exist
  --- Logical volume ---
  LV Name                thinpool
  VG Name                docker
  LV UUID                iGqrlv-kwvC-lgBg-V6iK-QYQz-bIS4-Bwbor4
  LV Write Access        read/write
  LV Creation host, time ip-10-0-0-66.us-west-2.compute.internal, 2017-05-25 16:14:13 -0400
  LV Pool metadata       thinpool_tmeta
  LV Pool data           thinpool_tdata
  LV Status              available
  # open                 0
  LV Size                475.00 GiB
  Allocated pool data    0.00%
  Allocated metadata     0.01%
  Current LE             121599
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2

Docker thinpool already configured.
Initialize application storage at /var/lib/cdsw
Disabling node with IP [10.0.0.66]...
Node [10.0.0.66] removed from nfs export list successfully.
Stopping rpc-statd...
Stopping nfs-idmapd...
Stopping rpcbind...
Stopping nfs-server...
Removing entry from /etc/fstab...
Unmounting [/dev/xvdf]...
Skipping format since volumes are already set correctly.
Adding entry to /etc/fstab...
Mounting [/var/lib/cdsw]...
Starting rpc-statd...

ERROR:: Could not start rpc-statd: 1

ERROR:: Unable to create application device: 1

Clearly something is up - so I looked at the journal:

[root@ip-10-0-0-66 ~]# journalctl -xn
-- Logs begin at Thu 2017-05-25 15:35:45 EDT, end at Thu 2017-05-25 16:15:32 EDT. --
May 25 16:15:32 ip-10-0-0-66.us-west-2.compute.internal rpc.statd[11868]: Failed to register (statd, 1, udp): svc_reg() err: RPC: Remote system error - Connection refused
May 25 16:15:32 ip-10-0-0-66.us-west-2.compute.internal rpc.statd[11868]: Failed to register (statd, 1, tcp): svc_reg() err: RPC: Remote system error - Connection refused
May 25 16:15:32 ip-10-0-0-66.us-west-2.compute.internal rpc.statd[11868]: Failed to register (statd, 1, udp6): svc_reg() err: RPC: Remote system error - Connection refused
May 25 16:15:32 ip-10-0-0-66.us-west-2.compute.internal rpc.statd[11868]: Failed to register (statd, 1, tcp6): svc_reg() err: RPC: Remote system error - Connection refused
May 25 16:15:32 ip-10-0-0-66.us-west-2.compute.internal rpc.statd[11868]: failed to create RPC listeners, exiting
May 25 16:15:32 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: rpc-statd.service: control process exited, code=exited status=1
May 25 16:15:32 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Failed to start NFS status monitor for NFSv2/3 locking..
-- Subject: Unit rpc-statd.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit rpc-statd.service has failed.
--
-- The result is failed.

I tried looking at the status and then restarting the rpcbind service:

[root@ip-10-0-0-66 ~]# systemctl status rpcbind
● rpcbind.service - RPC bind service
   Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; indirect; vendor preset: enabled)
   Active: inactive (dead) since Thu 2017-05-25 16:15:27 EDT; 11min ago
 Main PID: 10182 (code=exited, status=0/SUCCESS)

May 25 16:14:13 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Starting RPC bind service...
May 25 16:14:13 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Started RPC bind service.
May 25 16:14:13 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Dependency failed for RPC bind service.
May 25 16:14:13 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Job rpcbind.service/start failed with result 'dependency'.
May 25 16:15:27 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Stopping RPC bind service...
May 25 16:15:27 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Stopped RPC bind service.
[root@ip-10-0-0-66 ~]# systemctl restart rpcbind
[root@ip-10-0-0-66 ~]# systemctl status rpcbind
● rpcbind.service - RPC bind service
   Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; indirect; vendor preset: enabled)
   Active: active (running) since Thu 2017-05-25 16:27:31 EDT; 4s ago
  Process: 12846 ExecStart=/sbin/rpcbind -w $RPCBIND_ARGS (code=exited, status=0/SUCCESS)
 Main PID: 12847 (rpcbind)
   Memory: 776.0K
   CGroup: /system.slice/rpcbind.service
           └─12847 /sbin/rpcbind -w

May 25 16:27:31 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Starting RPC bind service...
May 25 16:27:31 ip-10-0-0-66.us-west-2.compute.internal systemd[1]: Started RPC bind service.

I then ran the cdsw reset && cdsw init cycle, which succeeded.

Perhaps the trick is to do the following: systemctl restart rpcbind && cdsw reset && echo | cdsw init?
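
The rpc.statd errors above ("Connection refused" while registering) suggest rpcbind was not accepting connections at that moment. One way to make the proposed sequence less timing-sensitive might be to wait until rpcbind actually answers before calling cdsw. The sketch below is an assumption, not something tested in this thread: wait_until is a hypothetical helper, and the intended probe is rpcinfo -p, which lists the programs registered with the local rpcbind and fails if it cannot be reached.

```shell
# Hypothetical helper: run PROBE (a command string) up to TRIES times,
# sleeping DELAY seconds after each failed attempt. Returns 0 as soon
# as the probe succeeds and 1 if it never does.
wait_until() {
  local probe tries delay n
  probe=$1; tries=$2; delay=$3; n=1
  while [ "$n" -le "$tries" ]; do
    # Probe output is discarded; only its exit status matters.
    if eval "$probe" >/dev/null 2>&1; then
      return 0
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  return 1
}
```

With that in place the one-liner would become: systemctl restart rpcbind && wait_until 'rpcinfo -p' 10 1 && cdsw reset && echo | cdsw init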

TobyHFerguson commented on August 17, 2024

I'll try something like:

# Try reset + init up to 10 times; restart rpcbind after each failure.
for i in {1..10}
do
  if cdsw reset && cdsw init
  then
    break
  else
    systemctl restart rpcbind
    sleep 1
  fi
done
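
One property of a loop like this is that it ends silently if all ten attempts fail. Below is a sketch of the same idea wrapped in a function that reports the outcome through its exit status. The name retry_with_recovery and its argument order are made up for illustration; this is not necessarily what the eventual fix looks like.

```shell
# Hypothetical helper: run ATTEMPT (a command string) up to MAX times.
# After each failure run RECOVER, then sleep DELAY seconds.
# Returns 0 on the first success and 1 if every attempt failed.
retry_with_recovery() {
  local max delay attempt recover i
  max=$1; delay=$2; attempt=$3; recover=$4; i=1
  while [ "$i" -le "$max" ]; do
    if eval "$attempt"; then
      return 0
    fi
    echo "attempt $i of $max failed, running recovery step" >&2
    eval "$recover"
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

For this issue the call might look like: retry_with_recovery 10 1 'cdsw reset && echo | cdsw init' 'systemctl restart rpcbind' || echo "cdsw init did not succeed" >&2. The echo | part feeds the "Press enter to continue" prompt shown in the init output earlier.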

TobyHFerguson commented on August 17, 2024

Fixed in d97206b
