Member "opensaf-5.21.09/src/amf/README_SC_ABSENCE" (31 May 2021, 9140 Bytes) of package /linux/misc/opensaf-5.21.09.tar.gz:


#
#      -*- OpenSAF  -*-
#
# (C) Copyright 2016 The OpenSAF Foundation
# Copyright (C) 2017, Oracle and/or its affiliates. All rights reserved.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
# or FITNESS FOR A PARTICULAR PURPOSE. This file and program are licensed
# under the GNU Lesser General Public License Version 2.1, February 1999.
# The complete license can be accessed from the following location:
# http://opensource.org/licenses/lgpl-license.php
# See the Copying file included with the OpenSAF distribution for full
# licensing terms.
#
# Author(s): Ericsson AB
#

GENERAL
-------

This document describes how the AMF service supports the SC absence feature,
which allows payloads to remain running while both SCs are absent and to
perform recovery after at least one SC comes back.

CONFIGURATION
-------------

AMF reads the "scAbsenceAllowed" attribute to determine whether the SC absence
feature is enabled. A positive integer gives the number of seconds AMF will
tolerate the absence of both SCs; a value of zero disables the feature.

Normally, the AMF Node Director (amfnd) will restart a node if there is no
active AMF Director (amfd). If this feature is enabled, the Node Director will
delay the restart for the duration specified in "scAbsenceAllowed". If a SC
returns during that period, the restart is aborted.
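
The restart rule above can be modelled as a small decision function. This is
only an illustrative sketch, not amfnd code: the type node_action_t, the
function decide_on_sc_loss() and its parameters are hypothetical names, and
the choice to restart exactly when the elapsed time reaches the configured
value is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical action names for this sketch; not OpenSAF identifiers. */
typedef enum {
  NODE_KEEP_RUNNING, /* stay up and wait for a SC */
  NODE_RESTART       /* restart the node */
} node_action_t;

/*
 * Decide what a node does while no active amfd is reachable.
 *   sc_absence_allowed: value of "scAbsenceAllowed" in seconds (0 = disabled)
 *   absent_seconds:     time elapsed since the last active SC was lost
 *   sc_returned:        true if a SC has come back in the meantime
 */
node_action_t decide_on_sc_loss(uint32_t sc_absence_allowed,
                                uint32_t absent_seconds, bool sc_returned) {
  if (sc_absence_allowed == 0) {
    return NODE_RESTART; /* feature disabled: normal behaviour */
  }
  if (sc_returned) {
    return NODE_KEEP_RUNNING; /* pending restart is aborted */
  }
  if (absent_seconds >= sc_absence_allowed) {
    return NODE_RESTART; /* tolerated period has expired (assumed boundary) */
  }
  return NODE_KEEP_RUNNING; /* still inside the tolerated period */
}
```

With scAbsenceAllowed set to 900, for example, this model keeps the node
running after 10 seconds without a SC and restarts it once 900 seconds have
passed without one.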

IMPLEMENTATION DETAILS
----------------------

* Amfnd detects absence of SCs:
Upon receiving the NCSMDS_DOWN event, which indicates that the last active SC
has gone, amfnd does not reboot the node and instead enters the SC absence
period (if scAbsenceAllowed is configured).

* Escalation and Recovery during SC absence period:
Component and SU restarts work as normal. Any fail-over or switch-over at
component, SU, or node level only cleans up the faulty components. Recovery is
delayed until a SC returns: the fail-over or switch-over of SI assignments is
initiated if saAmfSGAutoRepair is enabled, and the node is rebooted if
saAmfNodeAutoRepair, saAmfNodeFailfastOnTerminationFailure, or
saAmfNodeFailfastOnInstantiationFailure is enabled.
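
The deferred recovery described under "Escalation and Recovery during SC
absence period" can be sketched as a mapping from configuration flags to the
actions taken once a SC returns. This is a simplified model, not the AMF
implementation: recovery_cfg_t, deferred_recovery_t and
recovery_on_sc_return() are hypothetical names, and the sketch ignores which
kind of fault was actually deferred.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flags mirroring the configuration attributes named above. */
typedef struct {
  bool sg_auto_repair;                    /* saAmfSGAutoRepair */
  bool node_auto_repair;                  /* saAmfNodeAutoRepair */
  bool failfast_on_termination_failure;   /* saAmfNodeFailfastOnTerminationFailure */
  bool failfast_on_instantiation_failure; /* saAmfNodeFailfastOnInstantiationFailure */
} recovery_cfg_t;

/* Actions that were postponed during SC absence (hypothetical type). */
typedef struct {
  bool start_si_failover; /* initiate fail-over/switch-over of SI assignments */
  bool reboot_node;       /* reboot the node */
} deferred_recovery_t;

/*
 * Compute the recovery actions to run when a SC returns.
 * recovery_pending is true if a fail-over or switch-over was cleaned up
 * but not completed while both SCs were absent.
 */
deferred_recovery_t recovery_on_sc_return(const recovery_cfg_t *cfg,
                                          bool recovery_pending) {
  deferred_recovery_t actions = {false, false};
  assert(cfg != NULL);
  if (!recovery_pending) {
    return actions; /* nothing was deferred */
  }
  actions.start_si_failover = cfg->sg_auto_repair;
  actions.reboot_node = cfg->node_auto_repair ||
                        cfg->failfast_on_termination_failure ||
                        cfg->failfast_on_instantiation_failure;
  return actions;
}
```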

* Amfnd detects return of SCs:
NCSMDS_UP is the event that amfnd uses to detect the presence of an active amfd.

* New sync messages
New messages (state information messages) have been introduced to carry
assignments and states from all amfnd(s) to amfd. State information messages
also contain component and SU restart counts. These counter values are
updated in IMM after recovery. The operation in which amfnd(s) send state
information messages and amfd processes them is known as a *sync* operation.

* Admin operation continuation
If an admin operation on an AMF entity is still in progress when the cluster
loses both SCs, the operation will continue when a SC returns. To resume the
admin operation, the AMF internal states used by it must be restored. In a
normal cluster state, these states are *regularly* checkpointed to the standby
AMFD so that it can take over the active role if the active AMFD goes down.
Using a similar approach, new AMF runtime cached attributes are introduced to
store the states in IMM, as another means of restoring them for SC absence
recovery. The new attributes are:
- osafAmfSISUFsmState: SUSI FSM state
- osafAmfSGFsmState: SG FSM state
- osafAmfSGSuOperationList: SU operation list of the SG
- osafAmfSUSwitch: SU switch toggle

Only 2N SG is currently supported for admin operation continuation.

* Node reboot during SC absence period:
A node reboot initiated by the user during the SC absence period may lead to
a loss of SI assignments. When a SC returns, the AMF Director will detect
improper SI assignments and recover the HA states of the assignments.

LIMITATIONS
-----------

* Possible loss of RTA updates and SI assignment messages
If both SCs go down abruptly (for instance, the SCs are powered off
immediately), AMFD could fail to update RTA in IMM, and SI assignment messages
sent from AMFND might not reach AMFD, or vice versa. In such cases recovery
could be impossible, and applications may be left with inappropriate
assignment states.

* SI dependency tolerance timer
After a SC comes back, if an unassigned sponsor SI is detected, the
assignments of all its dependent SI(s) are removed regardless of the tolerance
duration. The time at which the sponsor SI became unassigned is not recorded,
so the new amfd cannot work out how much tolerance time the dependent SI(s)
have left.

* Proxy and Proxied components are not yet supported

* Alarms and notifications
During the SC absence period, notifications will not be sent because the
Director in charge of sending notifications is not available. For example, if
a component fails to instantiate during the SC absence period and its SU
becomes disabled, the state change of the SU from ENABLED to DISABLED will not
be sent.

List of possible missed notifications
=====================================
SA_AMF_PRESENCE_STATE of a SU
SA_AMF_OP_STATE of a SU
SA_AMF_HA_STATE of a SI
SA_AMF_ASSIGNMENT_STATE of a SI

After the SC absence period, some redundant alarms and notifications may be sent
from the Director. Initially the Director will think all PLs are down. But as
sync info is received from PLs, alarms will be cleared or set, and finally reflect
the current state of the cluster. For example, an alarm may initially be raised
for an unassigned SI, but later cleared as the Director learns of the SI assignment
on a PL that remained running.

Redundant notifications
=======================
SA_AMF_PRESENCE_STATE of a SU may change from SA_AMF_PRESENCE_UNINSTANTIATED to <<current state>>
SA_AMF_OP_STATE of a SU may change from SA_AMF_OPERATIONAL_DISABLED to <<current state>>
SA_AMF_HA_STATE of a SI may change from "" to <<current state>>
SA_AMF_ASSIGNMENT_STATE of a SI may change from SA_AMF_ASSIGNMENT_UNASSIGNED to <<current state>>

Redundant alarms
================
An unassigned SI alarm may be raised and then cleared shortly afterwards

Furthermore, some notifications may be slightly misleading. For example, if a SI
becomes PARTIALLY_ASSIGNED from FULLY_ASSIGNED because a component develops a fault
during the SC absence period, the SI change notification may describe the SI going
from UNASSIGNED to PARTIALLY_ASSIGNED. This is because the Director initially does
not know about the existence of the SIs assigned to PLs that remained running.

Limited notifications
=====================
SA_AMF_ASSIGNMENT_STATE of a SI may be reported as changing from
SA_AMF_ASSIGNMENT_UNASSIGNED to SA_AMF_ASSIGNMENT_PARTIALLY_ASSIGNED when the
actual change was from SA_AMF_ASSIGNMENT_FULLY_ASSIGNED to
SA_AMF_ASSIGNMENT_PARTIALLY_ASSIGNED.
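
Why the reported "from" state can be wrong may be easier to see in a toy
model. The sketch below assumes that a newly started Director begins with
every SI recorded as unassigned and emits a notification whenever sync info
differs from its record. All names (si_state_t, si_view_t, si_view_sync() and
so on) are hypothetical and do not correspond to amfd code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the SI assignment states. */
typedef enum {
  SI_UNASSIGNED,
  SI_PARTIALLY_ASSIGNED,
  SI_FULLY_ASSIGNED
} si_state_t;

/* What the Director currently believes about one SI. */
typedef struct {
  si_state_t known;
} si_view_t;

/* A state-change notification as the Director would describe it. */
typedef struct {
  si_state_t from;
  si_state_t to;
} si_notification_t;

/* A Director that has just started assumes nothing is assigned. */
void si_view_init(si_view_t *view) {
  assert(view != NULL);
  view->known = SI_UNASSIGNED;
}

/*
 * Apply the state reported by a payload during sync.
 * Returns true and fills *out if a notification would be generated.
 */
bool si_view_sync(si_view_t *view, si_state_t reported,
                  si_notification_t *out) {
  assert(view != NULL && out != NULL);
  if (view->known == reported) {
    return false; /* no change from the Director's point of view */
  }
  out->from = view->known; /* the Director's record, not the real history */
  out->to = reported;
  view->known = reported;
  return true;
}
```

In this model, a SI that really went from fully to partially assigned during
SC absence is reported as going from SI_UNASSIGNED to SI_PARTIALLY_ASSIGNED,
because the earlier fully assigned state was never recorded by the new
Director.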

* Some AMF API functions will be unavailable during the SC absence period
saAmfProtectionGroupTrack() and saAmfProtectionGroupTrackStop() return
SA_AIS_ERR_TRY_AGAIN.
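
Because these calls ask the caller to try again while the SCs are absent, an
application may want to wrap them in a bounded retry loop. The sketch below
shows one way to do that under stated assumptions: it uses stand-in result
codes and a generic function pointer so that it compiles without the AMF
headers, and every name in it (demo_rc_t, retry_while_try_again(),
fake_track_call()) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Stand-in result codes for a standalone build. A real program would use
 * SaAisErrorT with SA_AIS_OK and SA_AIS_ERR_TRY_AGAIN from saAis.h.
 */
typedef enum { DEMO_OK, DEMO_TRY_AGAIN, DEMO_FAILED } demo_rc_t;

/* The operation to retry, e.g. a wrapper around a tracking call. */
typedef demo_rc_t (*demo_call_t)(void *ctx);

/*
 * Call `call` until it stops reporting DEMO_TRY_AGAIN or max_attempts is
 * reached. wait_fn, if not NULL, is invoked between attempts (for example
 * to sleep). Returns the result of the last attempt.
 */
demo_rc_t retry_while_try_again(demo_call_t call, void *ctx,
                                unsigned max_attempts, void (*wait_fn)(void)) {
  demo_rc_t rc = DEMO_TRY_AGAIN;
  assert(call != NULL);
  for (unsigned i = 0; i < max_attempts && rc == DEMO_TRY_AGAIN; i++) {
    if (i > 0 && wait_fn != NULL) {
      wait_fn();
    }
    rc = call(ctx);
  }
  return rc;
}

/* Example callee: reports DEMO_TRY_AGAIN until *remaining reaches zero. */
demo_rc_t fake_track_call(void *ctx) {
  unsigned *remaining = ctx;
  if (*remaining > 0) {
    (*remaining)--;
    return DEMO_TRY_AGAIN;
  }
  return DEMO_OK;
}
```

Bounding the number of attempts matters here because the SC absence period
can last as long as scAbsenceAllowed permits, so an unbounded loop could block
the caller for that whole time.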

* One payload limitation

If the cluster is configured with one payload and without PBE, IMM will reload
from XML the second time the cluster experiences the absence of both SCs. This
causes amfd to lose all objects that were created before the SC absence, and
data inconsistency will occur between amfnd and amfd/IMM on the SC. To avoid
this inconsistency, the payload will be rebooted.

SC Status Change Callback
=========================
This enhancement provides two resources that let an application know when
SCs join or leave the cluster.

Information about the resources:
* A callback that is invoked by AMFA whenever a SC joins the cluster and
  whenever both SCs leave the cluster, if the SC absence feature is enabled.

  -Callback and its argument:

      typedef void (*OsafAmfSCStatusChangeCallbackT)(OsafAmfSCStatusT status);
      where OsafAmfSCStatusT is defined as:
        typedef enum {
          OSAF_AMF_SC_PRESENT = 1,
          OSAF_AMF_SC_ABSENT = 2,
        } OsafAmfSCStatusT;

  This callback can be used by a standard AMF component as well as by a
  legacy one.

* An API to register/install the above callback function:
   SaAisErrorT osafAmfInstallSCStatusChangeCallback(SaAmfHandleT amfHandle,
                                    OsafAmfSCStatusChangeCallbackT callback);
   If 0 is passed as amfHandle, the callback is invoked in the context of
   the MDS thread. If a valid amfHandle is passed, the callback is invoked
   in the context of the thread that calls saAmfDispatch() with this handle.

  -Return codes:
   SA_AIS_OK - The function returned successfully.
   SA_AIS_ERR_LIBRARY - An unexpected problem occurred in the library (such as
                        corruption). The library cannot be used anymore.
   SA_AIS_ERR_BAD_HANDLE - The handle amfHandle is invalid, since it is corrupted,
                           uninitialized, or has already been finalized.
   SA_AIS_ERR_INVALID_PARAM - A parameter is not set correctly (callback).

   Note: OsafAmfSCStatusT and the API are declared in saAmf.h.

Two sample applications, amf_sc_status_demo.c and amf_sc_status_dispatch_demo.c,
have been added in samples/amf/api_demo/ to demonstrate usage.
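
As a complement to those samples, the following is a minimal sketch of what a
callback handler might look like. It is not taken from the sample
applications. The enum and callback type are copied from the declarations
quoted in this document so that the sketch compiles on its own; a real
application would include saAmf.h instead. The names my_sc_status_cb() and
app_sc_available() are hypothetical. An atomic flag is used on the assumption
that the callback may run in a different thread (the MDS thread) from the code
that reads the flag.

```c
#include <stdatomic.h>

/* Copied from this document for a standalone build; normally from saAmf.h. */
typedef enum {
  OSAF_AMF_SC_PRESENT = 1,
  OSAF_AMF_SC_ABSENT = 2,
} OsafAmfSCStatusT;
typedef void (*OsafAmfSCStatusChangeCallbackT)(OsafAmfSCStatusT status);

/* Hypothetical application flag: non-zero while at least one SC is present. */
static atomic_int sc_available = 1;

/* Hypothetical handler matching OsafAmfSCStatusChangeCallbackT. */
void my_sc_status_cb(OsafAmfSCStatusT status) {
  switch (status) {
  case OSAF_AMF_SC_PRESENT:
    atomic_store(&sc_available, 1);
    break;
  case OSAF_AMF_SC_ABSENT:
    atomic_store(&sc_available, 0);
    break;
  default:
    break; /* ignore values this sketch does not know about */
  }
}

/* Lets the rest of the application check whether a SC is reachable. */
int app_sc_available(void) { return atomic_load(&sc_available); }
```

In an application linked against the AMF library, such a handler would be
registered by passing it to osafAmfInstallSCStatusChangeCallback(), with 0 as
amfHandle to have it invoked in the MDS thread, or with a valid handle to have
it invoked from the thread that calls saAmfDispatch().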