Apuntes de Solaris

In this article, my colleague Nicolas Morono and I describe a bug in the ldmd daemon and how we restored the previous configuration of the Logical Domains (LDOMs) using the ldom-db.xml file.

When we tried to assign a LUN to an LDOM, we ran into this problem:

# ldm list

Failed to connect to logical domain manager: Connection refused

We checked and found the ldmd service in the maintenance state:

# svcs -xv

svc:/ldoms/ldmd:default (Logical Domains Manager)

State: maintenance since June 2, 2016 06:36:16 PM ART

Reason: Start method exited with $SMF_EXIT_ERR_FATAL.

See: http://support.oracle.com/msg/SMF-8000-KS

See: /var/svc/log/ldoms-ldmd:default.log

Impact: This service is not running.

/var/adm/messages showed these errors:

Jun 2 18:36:16 m5-1-pdom2 svc.startd[33]: [ID 652011 daemon.warning] svc:/ldoms/ldmd:default: Method "/opt/SUNWldm/bin/ldmd_start" failed with exit status 95.

Jun 2 18:36:16 m5-1-pdom2 svc.startd[33]: [ID 748625 daemon.error] ldoms/ldmd:default failed fatally: transitioned to maintenance (see 'svcs -xv' for details)

Jun 2 18:36:16 m5-1-pdom2 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: SMF-8000-YX, TYPE: defect, VER: 1, SEVERITY: major

Jun 2 18:36:16 m5-1-pdom2 EVENT-TIME: Thu Jun 2 18:36:16 ART 2016

Jun 2 18:36:16 m5-1-pdom2 PLATFORM: SPARC-M5-32, CSN: AK00xx8x1, HOSTNAME: m5-1-pdom2

Jun 2 18:36:16 m5-1-pdom2 SOURCE: software-diagnosis, REV: 0.1

Jun 2 18:36:16 m5-1-pdom2 EVENT-ID: 889f64a0-0102-efd6-997f-8e83e7fba09a

Jun 2 18:36:16 m5-1-pdom2 DESC: A service failed - a start, stop or refresh method failed.

Jun 2 18:36:16 m5-1-pdom2 AUTO-RESPONSE: The service has been placed into the maintenance state.

Jun 2 18:36:16 m5-1-pdom2 IMPACT: svc:/ldoms/ldmd:default is unavailable.

Jun 2 18:36:16 m5-1-pdom2 REC-ACTION: Run 'svcs -xv svc:/ldoms/ldmd:default' to determine the generic reason why the service failed, the location of any logfiles, and a list of other services impacted. Please refer to the associated reference document at http://support.oracle.com/msg/SMF-8000-YX for the latest service procedures and policies regarding this diagnosis.

Jun 2 18:40:28 m5-1-pdom2 cmlb: [ID 107833 1

We then checked the SMF service log:

# cat /var/svc/log/ldoms-ldmd:default.log

Jun 02 18:35:16 timeout waiting for op HVctl_op_get_bulk_res_stat

Jun 02 18:35:16 fatal error: waiting for hv response timeout

[ Jun 2 18:35:16 Stopping because process dumped core. ]

[ Jun 2 18:35:16 Executing stop method (:kill). ]

[ Jun 2 18:35:16 Executing start method ("/opt/SUNWldm/bin/ldmd_start"). ]

Jun 02 18:36:16 timeout waiting for op HVctl_op_hello

Jun 02 18:36:16 fatal error: waiting for hv response timeout

[ Jun 2 18:36:16 Method "start" exited with status 95. ]

We looked at the Oracle documentation and concluded that we were hitting a bug in firmware versions below 1.14.2, which matched our environment.

We opened a service request to confirm our analysis, and the solution Oracle proposed was the same.

The bug affects Hypervisor versions lower than 1.14.2.

- The short-term solution is to power-cycle the system.

- The medium- to long-term solution is to update the system firmware to a recent version (Hypervisor 1.14.2 or higher).
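To check whether a given system falls in the affected range, a dotted version comparison can be sketched in shell. This is only an illustration: the function name `version_lt` and the sample value in `current` are hypothetical, the version string has to be obtained and typed in by hand, and the comparison assumes three-part versions such as 1.14.2.

```shell
# Return 0 (true) if dotted version $1 is strictly lower than $2.
# Assumes three numeric fields separated by dots, e.g. 1.14.2.
version_lt() {
    [ "$1" = "$2" ] && return 1
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n -k3,3n | head -1)
    [ "$lowest" = "$1" ]
}

current="1.14.1"   # hypothetical value, replace with your Hypervisor version
if version_lt "$current" "1.14.2"; then
    echo "Hypervisor $current is below 1.14.2: affected"
else
    echo "Hypervisor $current is 1.14.2 or higher: not affected"
fi
```

Each field is sorted numerically, so 1.9.0 is correctly treated as lower than 1.14.2, which a plain string comparison would get wrong.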

At this point we realized that both solutions require a power cycle, which stops every running LDOM and fully reboots the machine.

We decided to do the firmware upgrade and the power cycle, but then noticed that the last saved LDOM configuration was old, so we would lose six months of changes to the LDOM configuration (new LDOMs, disk assignments, network card allocations, etc.).

We solved the situation as follows:

Before rebooting the PDOM, we backed up the file ldom-db.xml located in /var/opt/SUNWldm. This file does the magic: it holds all the settings that are active in the PDOM, whether or not they were saved to the SP.

We copied this file (ldom-db.xml) to /usr/scripts so we could use it easily afterwards without restoring it from backup.
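Since everything that follows depends on this copy, it is worth confirming that it matches the original. A minimal sketch, assuming a POSIX shell; the helper name `backup_file` is hypothetical and not part of any Solaris tool:

```shell
# Copy a file into a backup directory, keeping mode and timestamps,
# and confirm that the copy is byte-for-byte identical to the original.
backup_file() {
    src=$1
    destdir=$2
    dest="$destdir/$(basename "$src")"
    cp -p "$src" "$dest" || return 1
    if cmp -s "$src" "$dest"; then
        echo "backup OK: $dest"
    else
        echo "backup MISMATCH: $dest" >&2
        return 1
    fi
}

# Intended use on the control domain (paths taken from the article):
# backup_file /var/opt/SUNWldm/ldom-db.xml /usr/scripts
```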

Here are the steps we used.

From the ILOM, we performed the power cycle:

stop Servers/PDomains/PDomain_2/HOST

Answer y to confirm, then:

start Servers/PDomains/PDomain_2/HOST

Once the PDOM had booted, we stopped and unbound the LDOMs, took a backup of the ldom-db.xml file, and disabled the ldmd service.

root@ # ldm ls

NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME

primary active -n-cv- UART 8 16G 0.2% 0.2% 8d 2h 38m

dnet1002 active -n---- 5002 8 8G 0.5% 0.5% 5d 2h 49m

dsunt100 active -n---- 5000 48 40G 0.0% 0.0% 8d 1h 34m

dsunt200 active -n---- 5001 48 40G 0.0% 0.0% 2m

root@#

root@ # ldm stop dsunt200

LDom dsunt200 stopped

root@ # ldm unbind dsunt200

root@ # ldm stop dsunt100

LDom dsunt100 stopped

root@ # ldm unbind dsunt100

root@ # ldm stop dnet1002

LDom dnet1002 stopped

root@ # ldm unbind dnet1002

root@ # ldm ls

NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME

primary active -n-cv- UART 8 16G 0.5% 0.5% 8d 2h 40m

dnet1002 inactive ------ 8 8G

dsunt100 inactive ------ 48 40G

dsunt200 inactive ------ 48 40G

root@ #
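With more than a handful of guests, the stop and unbind sequence above can be generated instead of typed. The sketch below only prints the commands so they can be reviewed first. It assumes the column layout of `ldm ls` shown in this article, and the function names `guest_domains` and `plan_shutdown` are hypothetical helpers.

```shell
# Print the names of all guest domains (everything except primary),
# reading "ldm ls" output in the column layout shown above from stdin.
guest_domains() {
    awk 'NR > 1 && NF > 0 && $1 != "primary" { print $1 }'
}

# Print the stop/unbind commands instead of running them,
# so the plan can be reviewed before anything is executed.
plan_shutdown() {
    guest_domains | while read -r dom; do
        echo "ldm stop $dom"
        echo "ldm unbind $dom"
    done
}

# On the control domain:
# ldm ls | plan_shutdown
```

Piping the reviewed output to `sh` would execute it; printing first is a deliberate choice, because stopping the wrong domain is expensive.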

cd /var/opt/SUNWldm

cp -p ldom-db.xml ldom-db.xml.orig

svcadm disable ldmd

##### Here we use the file stored previously in /usr/scripts/ to overwrite the one in /var/opt/SUNWldm

cp -p /usr/scripts/ldom-db.xml /var/opt/SUNWldm/ldom-db.xml

# Enable the ldmd service.

svcadm enable ldmd

### We checked the configuration to confirm that everything was OK.

Then we ran an init 6 and, after the reboot, bound and started all the LDOMs as shown next:

root@ # ldm bind dsunt200

root@ # ldm start dsunt200

LDom dsunt200 started

root@ # ldm bind dsunt100

root@ # ldm start dsunt100

LDom dsunt100 started

root@ # ldm bind dnet1002

root@ # ldm start dnet1002

LDom dnet1002 started

root@ # ldm ls

NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME

primary active -n-cv- UART 8 16G 3.7% 3.7% 8d 2h 55m

dnet1002 active -n---- 5002 8 8G 0.7% 0.7% 3s

dsunt100 active -n---- 5000 48 40G 0.0% 0.0% 2s

dsunt200 active -n---- 5001 48 40G 9.1% 1.0% 2s

root@ #

PS : Please forgive my english ;-)

Creating an Oracle database instance fails with the following error:

ORA-27125: unable to create shared memory segment

SVR4 Error: 22: Invalid argument

The server in question is a SPARC T5-2 with 8 GB of RAM and 3 zones: zone 1 will host the databases, and zone 2 and zone 3 the application.

The first thing I check, in the global zone, is whether there are errors in /var/adm/messages, and I see a message saying there is no swap space left:

Dec 12 11:30:02 net1002 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 29204 (discusge)

Since the OS uses ZFS, I check the space assigned to swap with zfs list, and then the swap space currently in use with df:

root@net1002 # zfs list

NAME USED AVAIL REFER MOUNTPOINT

rpool 16.0G 10.3G 106K /rpool

rpool/ROOT 6.35G 10.3G 31K legacy

rpool/ROOT/s10s_u11wos_24a 6.35G 10.3G 6.35G /

rpool/dump 1.50G 10.3G 1.50G -

rpool/export 73K 10.3G 36K /export

rpool/export/home 37K 10.3G 37K /export/home

rpool/swap 8.16G 10.6G 7.91G -

root@net1002 # df -h

Filesystem size used avail capacity Mounted on

rpool/ROOT/s10s_u11wos_24a

26G 6.3G 10G 39% /

/devices 0K 0K 0K 0% /devices

ctfs 0K 0K 0K 0% /system/contract

proc 0K 0K 0K 0% /proc

mnttab 0K 0K 0K 0% /etc/mnttab

swap 1.5G 448K 1.5G 1% /etc/svc/volatile

objfs 0K 0K 0K 0% /system/object

sharefs 0K 0K 0K 0% /etc/dfs/sharetab

/platform/sun4v/lib/libc_psr/libc_psr_hwcap3.so.1

17G 6.3G 10G 39% /platform/sun4v/lib/libc_psr.so.1

/platform/sun4v/lib/sparcv9/libc_psr/libc_psr_hwcap3.so.1

17G 6.3G 10G 39% /platform/sun4v/lib/sparcv9/libc_psr.so.1

fd 0K 0K 0K 0% /dev/fd

swap 1.5G 32K 1.5G 1% /tmp

swap 1.5G 88K 1.5G 1% /var/run

rpool/export 26G 36K 10G 1% /export

rpool/export/home 26G 37K 10G 1% /export/home

rpool 26G 106K 10G 1% /rpool

/dev/md/dsk/d300 30G 27G 2.7G 91% /export/zona3

/dev/md/dsk/d200 30G 25G 4.3G 86% /export/zona2

/dev/md/dsk/d100 30G 5.3G 24G 19% /export/zona1

Next I check how much free space the rpool pool has, to know how much I can add to the swap volume (it has 11 GB free):

root@net1002 # zpool get all rpool

NAME PROPERTY VALUE SOURCE

rpool size 26.8G -

rpool capacity 58% -

rpool altroot - default

rpool health ONLINE -

rpool guid 17834260759408459067 -

rpool version 32 default

rpool bootfs rpool/ROOT/s10s_u11wos_24a local

rpool delegation on default

rpool autoreplace off default

rpool cachefile - default

rpool failmode continue local

rpool listsnapshots on default

rpool autoexpand off default

rpool free 11.0G -

rpool allocated 15.8G -

rpool readonly off

With this command I see that the swap volume has 7.91 GB assigned:

root@net1002 # zfs get all rpool/swap

NAME PROPERTY VALUE SOURCE

rpool/swap type volume -

rpool/swap creation Thu Sep 8 13:47 2016 -

rpool/swap used 8.16G -

rpool/swap available 10.6G -

rpool/swap referenced 7.91G -

rpool/swap compressratio 1.00x -

rpool/swap reservation none default

rpool/swap volsize 7.91G local

rpool/swap volblocksize 1M -

rpool/swap checksum off local

rpool/swap compression off local

rpool/swap readonly off default

rpool/swap shareiscsi off default

rpool/swap copies 1 default

rpool/swap refreservation 8.16G local

rpool/swap primarycache metadata local

rpool/swap secondarycache all default

rpool/swap usedbysnapshots 0 -

rpool/swap usedbydataset 7.91G -

rpool/swap usedbychildren 0 -

rpool/swap usedbyrefreservation 255M -

rpool/swap logbias latency default

rpool/swap sync standard default

rpool/swap rekeydate
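Before resizing, the numbers above can be checked against each other: the volume is 7.91G, the target is 16G, and the pool reports 11.0G free. A rough sketch of that arithmetic, assuming sizes are given in GiB; the function name `fits_in_pool` is hypothetical, and the check ignores the refreservation overhead, which the output above puts at a few hundred MB.

```shell
# Report whether growing a volume from cur to new GiB fits in free GiB.
# Arguments: current size, new size, free pool space (all in GiB).
fits_in_pool() {
    awk -v cur="$1" -v new="$2" -v free="$3" 'BEGIN {
        need = new - cur
        if (need <= free) {
            printf "ok: need %.2fG of %.2fG free\n", need, free
            exit 0
        }
        printf "too big: need %.2fG, only %.2fG free\n", need, free
        exit 1
    }'
}

# Values taken from the zfs and zpool output in this article.
fits_in_pool 7.91 16 11.0
```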

Now I grow the swap volume from the 8 GB it had to 16 GB:

root@net1002 # zfs set volsize=16g rpool/swap

root@net1002 # zfs get all rpool/swap

NAME PROPERTY VALUE SOURCE

rpool/swap type volume -

rpool/swap creation Thu Sep 8 13:47 2016 -

rpool/swap used 16.5G -

rpool/swap available 2.48G -

rpool/swap referenced 16.0G -

rpool/swap compressratio 1.00x -

rpool/swap reservation none default

rpool/swap volsize 16G local

rpool/swap volblocksize 1M -

rpool/swap checksum off local

rpool/swap compression off local

rpool/swap readonly off default

rpool/swap shareiscsi off default

rpool/swap copies 1 default

rpool/swap refreservation 16.5G local

rpool/swap primarycache metadata local

rpool/swap secondarycache all default

rpool/swap usedbysnapshots 0 -

rpool/swap usedbydataset 16.0G -

rpool/swap usedbychildren 0 -

rpool/swap usedbyrefreservation 516M -

rpool/swap logbias latency default

rpool/swap sync standard default

rpool/swap rekeydate

With this it already works, but in case the DBA wants to add another instance, in zone 1 I also raise the shared memory parameter to 16 GB. (This is optional: if I don't set it and leave it at 8 GB, the size it had before, it works just the same.)

root@net1c12 # projmod -s -K "project.max-shm-memory=(priv,17179869184,deny)" user.oracle
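The value 17179869184 in the projmod command is 16 GB expressed in bytes. A small sketch of the conversion, assuming a shell with POSIX arithmetic expansion such as ksh or bash; the function name `gib_to_bytes` is hypothetical.

```shell
# Convert a size in GiB to bytes (1 GiB = 1024 * 1024 * 1024 bytes).
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 16    # prints 17179869184
```

The same helper gives the value to use for any other limit, for example `gib_to_bytes 8` for the previous 8 GB setting.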
