| Foreword | I recently bought a used Sun E3000 so that I could play around with its AP (Alternate Pathing) and DR (Dynamic Reconfiguration) capabilities. AP and DR are not well known, so this is a record of what I have found out (starting with DR; I'll work on AP later). |
| General Info | Sun mid-range servers can hot swap their CPU/Memory and I/O boards because of their architecture: they are chassis-type servers in which the major components (system boards) plug into a backplane (the Gigaplane). There are several types of system boards: Clock, CPU/Memory, SBus I/O, PCI I/O, and Graphics I/O. For example, a CPU/Memory board can hold up to 2 CPUs and 2 banks of memory, and an SBus I/O board has 3 SBus slots (for Sun expansion cards such as a quad fast ethernet or SCSI controller), a Fast-Wide SCSI controller, and a 10/100 ethernet port. |
| What servers can do DR? | The entry-level server that can do DR is the E3000, and everything above it in the range (E3000 - E6500) can as well. The E10000 (Starfire) can also do DR, but it works differently since it is almost a supercomputer. |
A pdf file on DR from Sun is located here.
First, the system that this documentation is based on is a Sun E3000 (it is a 4 slot server). It has:
Step 1: Make sure that the PROM (firmware) version on all of the boards supports DR. You can do this by typing ".version" at the OK prompt or by looking at "dmesg". My dmesg says:
May 12 13:28:48 bigmac sysctrl: [ID 979883 kern.info] NOTICE: Firmware supports Dynamic Reconfiguration of CPU/Memory boards.
May 12 13:28:48 bigmac sysctrl: [ID 787141 kern.info] NOTICE: Firmware supports Dynamic Reconfiguration of I/O board types 1, 4.
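The firmware check can be scripted. This is only a sketch: it assumes the sysctrl notices are worded exactly like the two quoted above, and the function name firmware_dr_boards is made up for this example.

```shell
#!/bin/sh
# firmware_dr_boards: read kernel log text on stdin and print what the
# firmware says it can dynamically reconfigure, one entry per line.
# Assumption: the notice ends with "Dynamic Reconfiguration of <what>."
firmware_dr_boards() {
    sed -n 's/.*Firmware supports Dynamic Reconfiguration of \(.*\)\.$/\1/p'
}

# Typical use on the live system:
#   dmesg | firmware_dr_boards
```

If nothing is printed, either the notices have scrolled out of the dmesg buffer or the firmware does not report DR support, so check ".version" at the OK prompt as well.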
Now, you need to make sure that the kernel is set for DR:
From /etc/system:
set soc:soc_enable_detach_suspend=1
set pln:pln_enable_detach_suspend=1
set kernel_cage_enable=1
Note: the kernel cage only needs to be enabled if you are going to DR CPU/Memory boards.
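A quick way to confirm that the three tunables are really in /etc/system is sketched here. It assumes each setting sits on its own "set name=value" line with no spaces around the "=", and a POSIX shell and awk; check_dr_settings is a name invented for this example.

```shell
#!/bin/sh
# check_dr_settings FILE: report which of the three DR tunables listed
# above are present in FILE. Returns 0 if all are set, 1 otherwise.
check_dr_settings() {
    file=$1
    missing=0
    for setting in \
        'soc:soc_enable_detach_suspend=1' \
        'pln:pln_enable_detach_suspend=1' \
        'kernel_cage_enable=1'
    do
        # Comment lines in /etc/system start with "*", so requiring the
        # first field to be "set" skips them.
        if awk -v want="$setting" \
            '$1 == "set" && $2 == want { found = 1 } END { exit !found }' "$file"
        then
            echo "ok:      set $setting"
        else
            echo "missing: set $setting"
            missing=1
        fi
    done
    return $missing
}

# Typical use:
#   check_dr_settings /etc/system
```

Changes to /etc/system are only read at boot, so a missing line means an edit and a reboot before trying DR.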
If you were going to DR CPU/Memory boards, you would also have to make sure that memory interleaving was set to "min". However, I won't be DR'ing a CPU/Memory board, so I will leave mine set to max. The interleave is set in the eeprom; my eeprom shows:
# eeprom
disabled-memory-list: data not available.
disabled-board-list: data not available.
memory-interleave=max
configuration-policy=component
scsi-initiator-id=7
keyboard-click?=false
keymap: data not available.
ttyb-rts-dtr-off=false
ttyb-ignore-cd=true
ttya-rts-dtr-off=false
ttya-ignore-cd=true
ttyb-mode=9600,8,n,1,-
ttya-mode=9600,8,n,1,-
sbus-specific-probe: data not available.
sbus-probe-default=d3120
mfg-mode=off
diag-level=min
powerfail-time=0
#power-cycles=1426063405
fcode-debug?=false
output-device=screen
input-device=keyboard
load-base=16384
boot-command=boot
auto-boot?=true
watchdog-reboot?=false
diag-file: data not available.
diag-device=disk diskbrd diskisp disksoc net
boot-file: data not available.
boot-device=diskbrd:a disk diskbrd diskisp disksoc net
local-mac-address?=false
ansi-terminal?=true
screen-#columns=80
screen-#rows=34
silent-mode?=false
use-nvramrc?=false
nvramrc: data not available.
security-mode=none
security-password: data not available.
security-#badlogins=2684354560
oem-logo: data not available.
oem-logo?=false
oem-banner: data not available.
oem-banner?=false
hardware-revision: data not available.
last-hardware-update=
diag-switch?=false
#
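The interleave requirement can be checked from that eeprom listing. A minimal sketch, assuming the variable is printed as "memory-interleave=value" as it is above; the function name is made up for this example.

```shell
#!/bin/sh
# interleave_ok_for_cpu_dr: read `eeprom` output on stdin and report
# whether memory-interleave is "min", which is what DR of CPU/Memory
# boards needs according to the text above.
interleave_ok_for_cpu_dr() {
    value=$(sed -n 's/^memory-interleave=//p')
    if [ "$value" = "min" ]; then
        echo "memory-interleave=min: interleave requirement met"
    else
        echo "memory-interleave=${value:-unknown}: must be min before DR'ing CPU/Memory boards"
        return 1
    fi
}

# Typical use:
#   eeprom | interleave_ok_for_cpu_dr
```

The value itself is changed with "eeprom memory-interleave=min"; as far as I know the new setting only applies after the next reset.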
And prtdiag shows:
System Configuration: Sun Microsystems sun4u 4-slot Sun Enterprise 3000
System clock frequency: 82 MHz
Memory size: 512Mb
========================= CPUs =========================
                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
 7    14     0      248     1.0   US-II    1.1
 7    15     1      248     1.0   US-II    1.1
========================= Memory =========================
                                              Intrlv.  Intrlv.
Brd   Bank   MB    Status   Condition  Speed   Factor   With
---  -----  ----  -------  ----------  -----  -------  -------
 7     0     256   Active      OK       60ns   2-way     A
 7     1     256   Active      OK       60ns   2-way     A
========================= IO Cards =========================
     Bus   Freq
Brd  Type  MHz   Slot  Name                              Model
---  ----  ----  ----  --------------------------------  ----------------------
 1   SBus   25     2   cgsix                             SUNW,501-2325
 1   SBus   25     3   SUNW,hme
 1   SBus   25     3   SUNW,fas/sd (block)
 1   SBus   25    13   SUNW,soc                          501-2069
 3   SBus   25     3   SUNW,hme
 3   SBus   25     3   SUNW,fas/sd (block)
 3   SBus   25    13   SUNW,soc                          501-2069
No failures found in System
===========================
No System Faults found
======================
And, cfgadm shows:
# cfgadm -l
Ap_Id            Type        Receptacle     Occupant       Condition
ac0:bank0        memory      connected      configured     ok
ac0:bank1        memory      connected      configured     ok
c0               scsi-bus    connected      configured     unknown
c1               scsi-bus    connected      unconfigured   unknown
sysctrl0:slot1   dual-sbus   connected      configured     ok
sysctrl0:slot3   dual-sbus   connected      configured     ok
sysctrl0:slot5   unknown     empty          unconfigured   unknown
sysctrl0:slot7   cpu/mem     connected      configured     ok
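Since the same listing is checked after every step, a small filter for one attachment point is handy. This sketch assumes cfgadm prints one row per attachment point with the five columns Ap_Id, Type, Receptacle, Occupant and Condition; slot_state is a name invented here.

```shell
#!/bin/sh
# slot_state AP_ID: read `cfgadm -l` output on stdin and print the
# receptacle, occupant and condition columns for one attachment point.
# Prints nothing if the Ap_Id is not in the listing.
slot_state() {
    awk -v id="$1" '$1 == id { print $3, $4, $5 }'
}

# Typical use:
#   cfgadm -l | slot_state sysctrl0:slot3
```

For the listing above, slot3 should come back as "connected configured ok", which is the state a board is in before it is unconfigured.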
A quiesce test shows that the machine can actually do DR:
# cfgadm -x quiesce-test sysctrl0:slot1
NOTE: The machine will freeze for up to 1 minute when you do this. However, since it came back I know that quiesce works!
Now, a little hardware reconfiguration. There are several restrictions (until I get AP working):
Ok, the main network connection (hostname: bigmac) has been moved to hme0 on the first I/O board (slot1). The second I/O board is now free (slot3).
Now, unconfigure slot3:
bigmac# cfgadm -c unconfigure sysctrl0:slot3
And, dmesg shows that slot3 is unconfigured:
May 15 20:38:15 bigmac pseudo: [ID 129642 kern.info] pseudo-device: devinfo0
May 15 20:38:15 bigmac genunix: [ID 936769 kern.info] devinfo0 is /pseudo/devinfo@0
May 15 20:38:15 bigmac sysctrl: [ID 523642 kern.notice] NOTICE: unconfiguring dual-sbus board in slot 3
May 15 20:38:15 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/SUNW,hme@3,8c00000 (hme1) offline
May 15 20:38:15 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/SUNW,fas@3,8800000/sd@0,0 (sd15) offline
May 15 20:38:15 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/SUNW,fas@3,8800000/sd@1,0 (sd16) offline
May 15 20:38:15 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/SUNW,fas@3,8800000/sd@2,0 (sd17) offline
May 15 20:38:15 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/SUNW,fas@3,8800000/sd@3,0 (sd18) offline
May 15 20:38:15 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/SUNW,fas@3,8800000/sd@4,0 (sd19) offline
May 15 20:38:15 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/SUNW,fas@3,8800000/sd@5,0 (sd20) offline
May 15 20:38:15 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/SUNW,fas@3,8800000/sd@6,0 (sd21) offline
May 15 20:38:15 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/SUNW,fas@3,8800000/sd@8,0 (sd22) offline
May 15 20:38:15 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/SUNW,fas@3,8800000/sd@9,0 (sd23) offline
May 15 20:38:15 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/SUNW,fas@3,8800000/sd@a,0 (sd24) offline
May 15 20:38:15 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/SUNW,fas@3,8800000/sd@b,0 (sd25) offline
May 15 20:38:15 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/SUNW,fas@3,8800000/sd@c,0 (sd26) offline
May 15 20:38:15 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/SUNW,fas@3,8800000/sd@d,0 (sd27) offline
May 15 20:38:15 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/SUNW,fas@3,8800000/sd@e,0 (sd28) offline
May 15 20:38:15 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/SUNW,fas@3,8800000/sd@f,0 (sd29) offline
May 15 20:38:15 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/SUNW,fas@3,8800000 (fas1) offline
May 15 20:38:15 bigmac genunix: [ID 408114 kern.info] /sbus@7,0 (sbus3) offline
May 15 20:38:15 bigmac genunix: [ID 408114 kern.info] /sbus@6,0 (sbus2) offline
May 15 20:38:15 bigmac sysctrl: [ID 549876 kern.notice] NOTICE: dual-sbus board in slot 3 is unconfigured
bigmac[barnesr]27:
Now, detach slot3:
bigmac# cfgadm -c disconnect sysctrl0:slot3
And, dmesg shows that slot3 is disconnected:
May 15 20:41:36 bigmac sysctrl: [ID 523642 kern.notice] NOTICE: disconnecting dual-sbus board in slot 3
May 15 20:41:36 bigmac genunix: [ID 408114 kern.info] /fhc@6,f8800000/ac@0,1000000 (ac2) offline
May 15 20:41:36 bigmac genunix: [ID 408114 kern.info] /fhc@6,f8800000/environment@0,400000 (environ2) offline
May 15 20:41:36 bigmac genunix: [ID 408114 kern.info] /fhc@6,f8800000 (fhc2) offline
May 15 20:41:37 bigmac sysctrl: [ID 549876 kern.notice] NOTICE: dual-sbus board in slot 3 is disconnected
May 15 20:41:37 bigmac sysctrl: [ID 258214 kern.notice] NOTICE: board 3 is ready to remove
May 15 20:41:37 bigmac sysctrl: [ID 404430 kern.notice] NOTICE: Redundant power available
And, cfgadm -l now shows:
bigmac# cfgadm -l
Ap_Id            Type        Receptacle     Occupant       Condition
ac0:bank0        memory      connected      configured     ok
ac0:bank1        memory      connected      configured     ok
c0               scsi-bus    connected      configured     unknown
sysctrl0:slot1   dual-sbus   connected      configured     ok
sysctrl0:slot3   dual-sbus   disconnected   unconfigured   unknown
sysctrl0:slot5   unknown     empty          unconfigured   unknown
sysctrl0:slot7   cpu/mem     connected      configured     ok
bigmac#
The status LEDs on board 3 (slot3) have now changed to off, on (orange), off, with orange indicating a hardware fault. The status LEDs on the other boards still show the normal pattern (on, off, on (blink)). The board in slot 3 can now safely be removed from the system WHILE THE SYSTEM CONTINUES TO RUN!!!! This is SOOO COOL!
Now, it's time to bring board 3 back online.
First, reconnect it:
bigmac# cfgadm -c connect sysctrl0:slot3
system will be temporarily suspended to connect a board: proceed (yes/no)? yes
bigmac#
And, dmesg shows:
May 15 20:53:24 bigmac sysctrl: [ID 523642 kern.notice] NOTICE: connecting dual-sbus board in slot 3
May 15 20:53:36 bigmac rootnex: [ID 349649 kern.info] fhc2 at root: UPA 0x6 0xf8800000
May 15 20:53:36 bigmac genunix: [ID 936769 kern.info] fhc2 is /fhc@6,f8800000
May 15 20:53:36 bigmac genunix: [ID 408114 kern.info] /fhc@6,f8800000 (fhc2) online
May 15 20:53:36 bigmac genunix: [ID 936769 kern.info] ac2 is /fhc@6,f8800000/ac@0,1000000
May 15 20:53:36 bigmac genunix: [ID 408114 kern.info] /fhc@6,f8800000/ac@0,1000000 (ac2) online
May 15 20:53:36 bigmac genunix: [ID 936769 kern.info] environ2 is /fhc@6,f8800000/environment@0,400000
May 15 20:53:36 bigmac genunix: [ID 408114 kern.info] /fhc@6,f8800000/environment@0,400000 (environ2) online
May 15 20:53:36 bigmac sysctrl: [ID 549876 kern.notice] NOTICE: dual-sbus board in slot 3 is connected
May 15 20:53:37 bigmac sysctrl: [ID 459609 kern.warning] WARNING: Redundant power lost
May 15 20:53:38 bigmac hme: [ID 517527 kern.info] SUNW,hme0 : Internal Transceiver Selected.
May 15 20:53:38 bigmac hme: [ID 517527 kern.info] SUNW,hme0 : Auto-Negotiated 100 Mbps Half-Duplex Link Up
Now, configure the board:
bigmac# cfgadm -c configure sysctrl0:slot3
bigmac#
And, dmesg shows:
May 15 20:55:37 bigmac sysctrl: [ID 523642 kern.notice] NOTICE: configuring dual-sbus board in slot 3
May 15 20:55:37 bigmac rootnex: [ID 349649 kern.info] sbus2 at root: UPA 0x6 0x0 ...
May 15 20:55:37 bigmac genunix: [ID 936769 kern.info] sbus2 is /sbus@6,0
May 15 20:55:37 bigmac genunix: [ID 408114 kern.info] /sbus@6,0 (sbus2) online
May 15 20:55:37 bigmac sbus: [ID 349649 kern.info] sbusmem0 at sbus0: SBus0 slot 0x1 offset 0x0
May 15 20:55:37 bigmac genunix: [ID 936769 kern.info] sbusmem0 is /sbus@2,0/sbusmem@1,0
May 15 20:55:37 bigmac sbus: [ID 349649 kern.info] sbusmem1 at sbus0: SBus0 slot 0x2 offset 0x0
May 15 20:55:37 bigmac genunix: [ID 936769 kern.info] sbusmem1 is /sbus@2,0/sbusmem@2,0
May 15 20:55:37 bigmac sbus: [ID 349649 kern.info] sbusmem2 at sbus0: SBus0 slot 0xd offset 0x0
May 15 20:55:37 bigmac genunix: [ID 936769 kern.info] sbusmem2 is /sbus@2,0/sbusmem@d,0
May 15 20:55:37 bigmac sbus: [ID 349649 kern.info] sbusmem3 at sbus1: SBus1 slot 0x0 offset 0x0
May 15 20:55:37 bigmac genunix: [ID 936769 kern.info] sbusmem3 is /sbus@3,0/sbusmem@0,0
May 15 20:55:37 bigmac sbus: [ID 349649 kern.info] sbusmem4 at sbus1: SBus1 slot 0x3 offset 0x0
May 15 20:55:37 bigmac genunix: [ID 936769 kern.info] sbusmem4 is /sbus@3,0/sbusmem@3,0
May 15 20:55:37 bigmac sbus: [ID 349649 kern.info] sbusmem5 at sbus2: SBus2 slot 0x1 offset 0x0
May 15 20:55:37 bigmac genunix: [ID 936769 kern.info] sbusmem5 is /sbus@6,0/sbusmem@1,0
May 15 20:55:37 bigmac sbus: [ID 349649 kern.info] sbusmem6 at sbus2: SBus2 slot 0x2 offset 0x0
May 15 20:55:37 bigmac genunix: [ID 936769 kern.info] sbusmem6 is /sbus@6,0/sbusmem@2,0
May 15 20:55:37 bigmac sbus: [ID 349649 kern.info] sbusmem7 at sbus2: SBus2 slot 0xd offset 0x0
May 15 20:55:37 bigmac genunix: [ID 936769 kern.info] sbusmem7 is /sbus@6,0/sbusmem@d,0
May 15 20:55:37 bigmac genunix: [ID 408114 kern.info] /sbus@2,0/sbusmem@1,0 (sbusmem0) online
May 15 20:55:37 bigmac genunix: [ID 408114 kern.info] /sbus@2,0/sbusmem@2,0 (sbusmem1) online
May 15 20:55:37 bigmac genunix: [ID 408114 kern.info] /sbus@2,0/sbusmem@d,0 (sbusmem2) online
May 15 20:55:37 bigmac genunix: [ID 408114 kern.info] /sbus@3,0/sbusmem@0,0 (sbusmem3) online
May 15 20:55:37 bigmac genunix: [ID 408114 kern.info] /sbus@3,0/sbusmem@3,0 (sbusmem4) online
May 15 20:55:37 bigmac genunix: [ID 408114 kern.info] /sbus@6,0/sbusmem@1,0 (sbusmem5) online
May 15 20:55:37 bigmac genunix: [ID 408114 kern.info] /sbus@6,0/sbusmem@2,0 (sbusmem6) online
May 15 20:55:37 bigmac genunix: [ID 408114 kern.info] /sbus@6,0/sbusmem@d,0 (sbusmem7) online
May 15 20:55:37 bigmac soc: [ID 854183 kern.info] ID[SUNWssa.soc.driver.1010] soc0:: host adapter fw date code: Wed Jan 17 20:34:59 1996
May 15 20:55:37 bigmac
May 15 20:55:37 bigmac sbus: [ID 349649 kern.info] soc0 at sbus0: SBus0 slot 0xd offset 0x10000 Onboard device sparc9 ipl 5
May 15 20:55:37 bigmac genunix: [ID 936769 kern.info] soc0 is /sbus@2,0/SUNW,soc@d,10000
May 15 20:55:37 bigmac soc: [ID 854183 kern.info] ID[SUNWssa.soc.driver.1010] soc1:: host adapter fw date code: Wed Jan 17 20:34:59 1996
May 15 20:55:37 bigmac
May 15 20:55:37 bigmac sbus: [ID 349649 kern.info] soc1 at sbus2: SBus2 slot 0xd offset 0x10000 Onboard device sparc9 ipl 5
May 15 20:55:37 bigmac genunix: [ID 936769 kern.info] soc1 is /sbus@6,0/SUNW,soc@d,10000
May 15 20:55:37 bigmac genunix: [ID 408114 kern.info] /sbus@6,0/SUNW,soc@d,10000 (soc1) online
May 15 20:55:48 bigmac rootnex: [ID 349649 kern.info] sbus3 at root: UPA 0x7 0x0 ...
May 15 20:55:48 bigmac genunix: [ID 936769 kern.info] sbus3 is /sbus@7,0
May 15 20:55:48 bigmac genunix: [ID 408114 kern.info] /sbus@7,0 (sbus3) online
May 15 20:55:48 bigmac sbus: [ID 349649 kern.info] sbusmem8 at sbus3: SBus3 slot 0x0 offset 0x0
May 15 20:55:48 bigmac genunix: [ID 936769 kern.info] sbusmem8 is /sbus@7,0/sbusmem@0,0
May 15 20:55:48 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/sbusmem@0,0 (sbusmem8) online
May 15 20:55:48 bigmac sbus: [ID 349649 kern.info] sbusmem9 at sbus3: SBus3 slot 0x3 offset 0x0
May 15 20:55:48 bigmac genunix: [ID 936769 kern.info] sbusmem9 is /sbus@7,0/sbusmem@3,0
May 15 20:55:48 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/sbusmem@3,0 (sbusmem9) online
May 15 20:55:48 bigmac hme: [ID 517527 kern.info] SUNW,hme1 : Sbus (Rev Id = 22) Found
May 15 20:55:48 bigmac sbus: [ID 349649 kern.info] hme1 at sbus3: SBus3 slot 0x3 offset 0x8c00000 and slot 0x3 offset 0x8c02000 and slot 0x3 offset 0x8c04000 and slot 0x3 offset 0x8c06000 and slot 0x3 offset 0x8c07000 SBus level 4 sparc9 ipl 7
May 15 20:55:48 bigmac genunix: [ID 936769 kern.info] hme1 is /sbus@7,0/SUNW,hme@3,8c00000
May 15 20:55:48 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/SUNW,hme@3,8c00000 (hme1) online
May 15 20:55:48 bigmac scsi: [ID 365881 kern.info] /sbus@7,0/SUNW,fas@3,8800000 (fas1):
May 15 20:55:48 bigmac rev 2.2 FEPS chip
May 15 20:55:48 bigmac sbus: [ID 349649 kern.info] fas1 at sbus3: SBus3 slot 0x3 offset 0x8800000 and slot 0x3 offset 0x8810000 SBus level 3 sparc9 ipl 5
May 15 20:55:48 bigmac genunix: [ID 936769 kern.info] fas1 is /sbus@7,0/SUNW,fas@3,8800000
May 15 20:55:48 bigmac genunix: [ID 408114 kern.info] /sbus@7,0/SUNW,fas@3,8800000 (fas1) online
May 15 20:55:58 bigmac sysctrl: [ID 549876 kern.notice] NOTICE: dual-sbus board in slot 3 is configured
And, cfgadm -l shows:
bigmac# cfgadm -l
Ap_Id            Type        Receptacle     Occupant       Condition
ac0:bank0        memory      connected      configured     ok
ac0:bank1        memory      connected      configured     ok
c0               scsi-bus    connected      configured     unknown
c1               scsi-bus    connected      unconfigured   unknown
sysctrl0:slot1   dual-sbus   connected      configured     ok
sysctrl0:slot3   dual-sbus   connected      configured     ok
sysctrl0:slot5   unknown     empty          unconfigured   unknown
sysctrl0:slot7   cpu/mem     connected      configured     ok
The LEDs on all of the boards have returned to normal (on, off, on (blink)) and the system is ready for business!
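To recap the order of operations, here is a small sketch that prints (or, if asked, runs) the cfgadm steps used above. The function name dr_steps and the DR_RUN switch are made up for this example; only the four "cfgadm -c" commands come from the walkthrough itself.

```shell
#!/bin/sh
# dr_steps remove|restore AP_ID
#   remove  : unconfigure, then disconnect (board can then be pulled)
#   restore : connect, then configure (board is back in service)
# By default the commands are only printed. Setting the hypothetical
# DR_RUN variable to "yes" runs them, stopping at the first failure.
dr_steps() {
    action=$1
    ap_id=$2
    case $action in
        remove)  steps="unconfigure disconnect" ;;
        restore) steps="connect configure" ;;
        *) echo "usage: dr_steps remove|restore ap_id" >&2; return 2 ;;
    esac
    for step in $steps; do
        if [ "${DR_RUN:-no}" = "yes" ]; then
            cfgadm -c "$step" "$ap_id" || return 1
        else
            echo "cfgadm -c $step $ap_id"
        fi
    done
}

# Typical use (dry run):
#   dr_steps remove sysctrl0:slot3
#   dr_steps restore sysctrl0:slot3
```

Keep in mind that the connect step stops to ask for confirmation because the system is briefly suspended, so even with DR_RUN=yes this is not a hands-off script.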
Next up, AP (Alternate Pathing) configuration... On another day...