Member "opensaf-5.21.09/src/log/README-HEADLESS" (31 May 2021, 10531 Bytes) of package /linux/misc/opensaf-5.21.09.tar.gz:


#
#      -*- OpenSAF  -*-
#
# (C) Copyright 2015 The OpenSAF Foundation
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
# or FITNESS FOR A PARTICULAR PURPOSE. This file and program are licensed
# under the GNU Lesser General Public License Version 2.1, February 1999.
# The complete license can be accessed from the following location:
# http://opensource.org/licenses/lgpl-license.php
# See the Copying file included with the OpenSAF distribution for full
# licensing terms.
#
# Author(s): Ericsson AB
#

GENERAL
-------

This document describes how the Log service handles the headless state (SC
down) and how it recovers when the SC nodes are up again.
For the LOG service, headless means that all information that existed in the
server on the SC nodes is lost. To recover streams, the server uses the cached
runtime attributes left in the stream IMM runtime objects together with
information in the log files. To recover information about connected clients,
it uses information obtained from the agents.


CONFIGURATION
-------------

The Log service reads the "scAbsenceAllowed" attribute. If the attribute is not
empty, the Log service performs recovery when the SC nodes are up again after a
headless state. If the attribute is empty, the Log service can still restart
after a headless state, but all handles are invalidated, meaning that all APIs
except initialize return BAD HANDLE.


RECOVERY HANDLING IN SERVER
---------------------------

The active server does the following recovery handling:

* Search for all runtime objects and create a list of them (dn).
  If objects are found, it most likely means that the server has started
  after a headless state.
* Start a timeout timer if there are objects in the list. The timeout is set
  to a long time, 10 min, because recovery may be spread over a rather long
  period. Recovery for a specific client is not actually needed before the
  client sends a request. A typical use case is that a client from before the
  headless state wants to write a log record.
* The agent keeps track of which clients are not yet recovered. Before
  receiving a write request or a request to open a stream, the server expects
  the client to be initialized. The next request is expected to be a request to
  open a stream. If the request is to open an existing stream and the stream
  does not exist, the server looks in the list. If the stream is found there,
  it is recreated and then removed from the list. If it is not found in the
  list, normal error handling applies.
* A stream is recreated based on the cached runtime attributes in the stream
  runtime IMM object. Some information is not found there: the current log
  file, the size of the current log file and the record Id of the last written
  log record. This information is recreated from the log file that was open
  when the server went down. This log file can be found using the stream name,
  the relative path and the fact that the file does not have a close time
  stamp in its name.
* When the list is empty the timeout timer is stopped. If the timer expires,
  the remaining objects in the list are deleted. From then on the server works
  as before. Objects may be left in the list at timeout if the client that
  created a stream before the headless state no longer exists (e.g. if it was
  running on an SC node) and no other client that had opened the stream exists
  either.
* If recovery fails (the file cannot be found, there is some other file
  problem, there is a problem with the stream object, etc.) an error code is
  returned to the agent. The actual recovery takes place when the stream open
  request is received, so it is most likely this request that gets the error
  code. If the stream object exists in the list, it is deleted and removed
  from the list.
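
Stream recreation above depends on finding the log file that was open when
the server went down. The following is a minimal Python sketch of that lookup,
not the OpenSAF implementation. The file name layout (one time stamp while the
file is open, a second one appended on close, each of the form
YYYYMMDD_HHMMSS) and the name pick_open_log_file are assumptions made for
illustration.

```python
import re
from typing import Iterable, Optional

_STAMP = r"\d{8}_\d{6}"  # Assumed time stamp form: YYYYMMDD_HHMMSS.


def pick_open_log_file(entries: Iterable[str], file_name: str) -> Optional[str]:
    """Return the newest entry of the stream that has no close time stamp."""
    # Assumed names: <file_name>_<open>.log while open and
    # <file_name>_<open>_<close>.log once closed.
    open_pattern = re.compile(
        "^" + re.escape(file_name) + "_(" + _STAMP + r")\.log$"
    )
    candidates = []
    for entry in entries:
        match = open_pattern.match(entry)
        if match:
            candidates.append((match.group(1), entry))
    if not candidates:
        return None  # The caller would treat this as a failed recovery.
    # Time stamps of this form sort chronologically as plain strings.
    return max(candidates)[1]
```

A caller would pass the directory listing of the stream's relative path, for
example the result of os.listdir(), together with the stream's file name.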

The standby server does the following:

* Search for and create a list of all runtime objects (dn). See active.

* Start a timeout timer if there are objects in the list. See active.

* When a check-point event for stream open is received, the corresponding name
  is removed from the list if it exists there.

* When the timer expires, the list is deleted.

The list must be handled on standby so that a relevant list exists if the
standby becomes active.


States in the Log server
------------------------
Recovery state:
 Enter if runtime objects are found during startup
  - Start recovery timer
  - Handle recovery
 Exit when the recovery timer expires
  - Remove remaining runtime objects, if any
  - Go to Normal state

Normal state:
 Enter if no runtime object is found during startup or when exiting Recovery
 state
  - This is the normal state of operation.
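
The list handling and the two server states can be modelled as below. This is
an illustrative Python sketch only. All names (RecoveryList, open_existing,
on_timeout and the callbacks) are hypothetical. The text names timer expiry as
the exit from the Recovery state; going to Normal as soon as the list becomes
empty is an assumption of this sketch.

```python
import enum
from typing import Callable, Iterable, Set

RECOVERY_TIMEOUT_S = 10 * 60  # The 10 min server timeout from the text.


class ServerState(enum.Enum):
    RECOVERY = "recovery"
    NORMAL = "normal"


class RecoveryList:
    """Model of the list of runtime objects (dn) found at startup."""

    def __init__(self, found_dns: Iterable[str]) -> None:
        self.pending: Set[str] = set(found_dns)
        # Runtime objects found at startup imply a start after headless.
        self.state = ServerState.RECOVERY if self.pending else ServerState.NORMAL
        self.timer_running = bool(self.pending)

    def open_existing(self, dn: str, recreate: Callable[[str], bool]) -> bool:
        """Handle an open request for a stream the server does not know."""
        if dn not in self.pending:
            return False  # Not in the list: normal error handling applies.
        ok = recreate(dn)
        # The entry leaves the list on success and on failure. Deleting the
        # object itself on failure is left to the hypothetical callback.
        self.pending.discard(dn)
        if not self.pending:
            self.timer_running = False  # List empty: stop the timer.
            self.state = ServerState.NORMAL  # Assumption, see lead-in.
        return ok

    def on_timeout(self, delete_object: Callable[[str], None]) -> None:
        """Recovery timer expired: delete what is left, go to Normal."""
        for dn in sorted(self.pending):
            delete_object(dn)
        self.pending.clear()
        self.timer_running = False
        self.state = ServerState.NORMAL
```

In a running model, on_timeout would be scheduled RECOVERY_TIMEOUT_S seconds
after startup, for example with threading.Timer.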


RECOVERY HANDLING IN AGENT
--------------------------

General
-------
To spread recovery communication with the server out in time as much as
possible, recovery actions are not started automatically by all agents in the
cluster as soon as server up is detected. In the first place, recovery is done
when it is needed, that is when a client sends a request, most likely a write
request. It may also be a request to open a stream that is assumed to exist.
However, it is likely that some clients do not write to the log very often, so
the first time such a client wants to write may be well after the time when
recovery is no longer possible (see timeout handling in server). The agent
must therefore make sure that recovery is done for all clients before the
recovery time is up. This is done using a timeout timer. When the timer
expires, a recovery thread starts to recover all clients that are not already
recovered.

The agent does the following when server down is detected, during server down
(headless state) and when server up is detected:

States in the agent
-------------------
Server down detected:
* Mark all clients and their open streams as not recovered. Also remove id
  information received from the server (client id and stream ids)
* Stop the recovery timer if it is running and remove the recovery thread if it
  exists
* Set No server state

No server state:
* Return TRY AGAIN for all APIs except StreamClose, Finalize and Write
  - Finalize:
    Remove the client by freeing all its resources and removing it from the
    list (normal handling) but do not send a message to the server. Normal
    error handling and return codes apply
  - StreamClose:
    It is possible to call the saLogStreamClose() API when headless.
    When the LOG service is up again, all "abandoned" runtime streams are
    cleaned up. This includes removing the IMM object and renaming the cfg and
    log files by appending the close time to their names.

Server up detected:
Note: This is done in the MDS thread
* Start a timer and a recovery thread waiting for timeout. The timeout time is
  randomly selected within an interval, giving a timeout that is significantly
  shorter than the server timeout after which stream runtime objects are
  deleted
* Set Recovery state 1
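
The random timeout selection can be sketched as follows in Python. Only the
10 min server timeout comes from this document. The interval bounds and the
function name are hypothetical values chosen to show the idea that every agent
times out well before the server does, and that agents do not all time out at
the same moment.

```python
import random

SERVER_TIMEOUT_S = 10 * 60  # Server side timeout from the text (10 min).

# Hypothetical interval; the text only says "significantly shorter".
AGENT_TIMEOUT_MIN_S = 60
AGENT_TIMEOUT_MAX_S = 5 * 60


def pick_agent_timeout(rng: random.Random) -> float:
    """Pick the recovery timeout for one agent, in seconds."""
    return rng.uniform(AGENT_TIMEOUT_MIN_S, AGENT_TIMEOUT_MAX_S)
```

Passing the random generator in as a parameter keeps the sketch repeatable in
tests; an agent would normally use an unseeded generator.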

Recovery state 1:
* Before timeout, if the client is not recovered (client recovered flag is
  false), a client request to open an existing stream or to write a log record
  starts a recovery sequence. This recovery sequence is done in the client
  thread calling the API function.
  If the request is to close a stream that is not marked as recovered, the
  stream is just removed from the client's list of streams. No message is sent
  to the server.
  If the request is to finalize, the client is removed from the agent client
  list. No message is sent to the server.

  The recovery sequence is:
   - Send an Initialize request (if not already initialized) to get a client id
   - Send a server request to open an existing stream, for the stream in the
     client open request or write request, to get a stream id
   - Set stream as recovered
   - If all streams are recovered set the client as recovered

  If Fail:
   - Invalidate the client handle (delete the client) and return BAD HANDLE.
     The client and all its stream handles are lost and must be reinitialized

* If all clients are fully recovered:
   - Stop timer
   - Set Normal state

* If timeout:
   - Set Recovery state 2
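
The recovery sequence of Recovery state 1 can be sketched in Python as below.
This is a model under stated assumptions, not agent code. The data classes,
the BadHandle exception (standing in for the BAD HANDLE return code) and the
two send callbacks, which return None on failure, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


class BadHandle(Exception):
    """Stands in for the BAD HANDLE return code."""


@dataclass
class Stream:
    name: str
    stream_id: Optional[int] = None
    recovered: bool = False


@dataclass
class Client:
    client_id: Optional[int] = None
    recovered: bool = False
    streams: Dict[str, Stream] = field(default_factory=dict)


def recover_stream(
    clients: Dict[int, Client],
    handle: int,
    stream_name: str,
    send_initialize: Callable[[], Optional[int]],
    send_open_existing: Callable[[int, str], Optional[int]],
) -> None:
    """Recover one stream of one client in the calling client thread."""
    client = clients[handle]
    stream = client.streams[stream_name]
    if client.client_id is None:
        # Initialize only if the client has no client id yet.
        client.client_id = send_initialize()
    if client.client_id is not None:
        stream.stream_id = send_open_existing(client.client_id, stream_name)
    if client.client_id is None or stream.stream_id is None:
        # Any failure invalidates the whole client handle.
        del clients[handle]
        raise BadHandle(handle)
    stream.recovered = True
    client.recovered = all(s.recovered for s in client.streams.values())
```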

Recovery state 2:
* On timeout, a recovery sequence is started in a recovery thread to recover
  all clients that are registered with the agent and not already recovered.
  During this recovery all requests from the client are answered with TRY
  AGAIN. This is also the case for Finalize and Write.

  The sequence is for each client not already recovered:
   - Initialize the client if not already initialized
   - Open all streams registered with the client that are not already opened.
     An open request without parameters and with the create flag not set is
     used. The server checks if the stream already has an IMM object and if so
     restores the stream. See [RECOVERY HANDLING IN SERVER]
   - If success the client is marked as recovered
  If Fail:
   - Invalidate the client handle (delete the client). If the client later
     requests an operation other than initialize, BAD HANDLE is returned.
     The client and all its stream handles are lost and must be reinitialized

* All clients are recovered:
   - Terminate the recovery thread
   - Set Normal state

Normal state:
 Enter when server up during normal startup
  - This is the normal state of operation.
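
The agent states above form a small state machine. The Python sketch below
lists the transitions as read from this document. The enum and function names
are hypothetical, and the STARTING state, used to tell a normal startup from a
server up after headless, is an assumption of the sketch.

```python
import enum


class AgentState(enum.Enum):
    STARTING = enum.auto()  # Hypothetical: before the first server up.
    NO_SERVER = enum.auto()
    RECOVERY_1 = enum.auto()
    RECOVERY_2 = enum.auto()
    NORMAL = enum.auto()


class Event(enum.Enum):
    SERVER_DOWN = enum.auto()
    SERVER_UP = enum.auto()
    TIMEOUT = enum.auto()
    ALL_RECOVERED = enum.auto()


_TRANSITIONS = {
    (AgentState.STARTING, Event.SERVER_UP): AgentState.NORMAL,
    (AgentState.NO_SERVER, Event.SERVER_UP): AgentState.RECOVERY_1,
    (AgentState.RECOVERY_1, Event.ALL_RECOVERED): AgentState.NORMAL,
    (AgentState.RECOVERY_1, Event.TIMEOUT): AgentState.RECOVERY_2,
    (AgentState.RECOVERY_2, Event.ALL_RECOVERED): AgentState.NORMAL,
}


def next_state(state: AgentState, event: Event) -> AgentState:
    """Return the state after the event; unlisted pairs keep the state."""
    if event is Event.SERVER_DOWN:
        # Server down is handled the same way in every state.
        return AgentState.NO_SERVER
    return _TRANSITIONS.get((state, event), state)
```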


Limitations
-----------
There are some situations in which recovery, or a complete recovery, cannot be
done or is not done:

* If recovery of a stream fails, the client is invalidated.
  This is the case even if the client has more than one stream and one or more
  of them have already been recovered successfully. The reason is to avoid
  resource leaks. A leak could otherwise occur if the client's error handling
  is to re-initialize the log service when BAD HANDLE is received while opening
  a stream or writing to a stream. If this is done, the "old" client and its
  open streams would live on as a "zombie" client in the server.

* Recovery of the log record Id is done by parsing the latest log file that
  contains log records. The log record Id is normally a number at the
  beginning of each log record. This is always the case if the default format
  is used. The latest record Id for a stream is found by searching backwards
  from the end of the file until '\n' is found (or the start of the file is
  reached). The first characters after that character are assumed to be the Id
  number. This does not always work, e.g. if a log message contains a '\n'.
  In that case recovery of the stream does not fail, but record Id numbering
  restarts from 1.
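
The record Id search can be sketched in Python as below. It is an illustration
of the described technique, not the server code. The function names are
hypothetical, the file content is passed as bytes already read into memory,
and skipping the newline that terminates the last record is an assumption.

```python
from typing import Optional


def last_record_id(data: bytes) -> Optional[int]:
    """Return the Id at the start of the last line, or None if not found."""
    # Assumption: a newline that terminates the last record is skipped.
    end = len(data)
    while end > 0 and data[end - 1:end] == b"\n":
        end -= 1
    # Search backwards for '\n'; rfind gives -1 at the start of the file,
    # so start becomes 0 in that case.
    start = data.rfind(b"\n", 0, end) + 1
    digits = bytearray()
    for byte in data[start:end]:
        if byte not in b"0123456789":
            break
        digits.append(byte)
    return int(digits) if digits else None


def next_record_id(data: bytes) -> int:
    """Numbering continues after the last Id, or restarts from 1."""
    last = last_record_id(data)
    return 1 if last is None else last + 1
```

A log message that contains a '\n' makes the last line start with message text
instead of an Id, which is the case where numbering restarts from 1.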