"Fossies" - the Fresh Open Source Software Archive
Member "opensaf-5.21.09/src/log/README-HEADLESS" (31 May 2021, 10531 Bytes) of package /linux/misc/opensaf-5.21.09.tar.gz:
# -*- OpenSAF -*-
#
# (C) Copyright 2015 The OpenSAF Foundation
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
# or FITNESS FOR A PARTICULAR PURPOSE. This file and program are licensed
# under the GNU Lesser General Public License Version 2.1, February 1999.
# The complete license can be accessed from the following location:
# http://opensource.org/licenses/lgpl-license.php
# See the Copying file included with the OpenSAF distribution for full
# licensing terms.
#
# Author(s): Ericsson AB

This is a description of how the Log service handles headless state (SC down)
and recovery after SC up.
For the LOG service this means that all information that existed in the server
on the SC nodes is lost. The concept is that the server uses information left
in cached runtime attributes in stream IMM runtime objects, together with
information in log files, to recover streams, and information obtained from
agents to recover information about connected clients.

The Log service reads the "scAbsenceAllowed" attribute. If the attribute is not
empty, the Log service will perform recovery when the SC nodes are up after
headless. If the attribute is empty, the Log service will still be able to
restart after headless, but all handles are invalidated, meaning that all APIs
except initialize will return BAD HANDLE.

RECOVERY HANDLING IN SERVER

The active server will do the following recovery handling:

* Search for and create a list of all runtime objects (dn).
  If objects are found it most likely means that the server has started after
  a headless state.
* Start a timeout timer if there are objects in the list. The timeout is set
  to a long time, 10 min, because recovery may be spread over a rather long
  period. Recovery for a specific client is not actually needed before the
  client sends a request. A typical use case is that a client from before the
  headless state wants to write a log record.
* The agent keeps track of which clients are not yet recovered. Before
  receiving a write request or a request to open a stream, the server expects
  the client to be initialized. The next request is expected to be to open a
  stream. If the open request is for an existing stream and the stream does
  not exist, the server will look in the list. If the stream is found it will
  be recreated and then removed from the list. If it is not found in the list,
  normal error handling applies.
* A stream is recreated based on the cached runtime attributes in the stream
  runtime IMM object. Some information, however, is not found there: the
  current log file, the size of the current log file and the record Id of the
  last written log record. This information is recreated from the log file
  that was open when the server went down. This log file can be found using
  the stream name, the relative path and the fact that the file does not have
  a close time stamp in its name.
* When the list is empty the timeout timer is stopped. If the timeout expires,
  the remaining objects in the list are deleted. Now the server works as before.
  Objects may be left in the list at timeout if a client that existed before
  the headless state no longer exists (e.g. if it was running on an SC node),
  that client had created a stream, and no other client that had opened the
  stream exists either.
* If recovery fails (the file cannot be found, some other file problem, a
  problem with the stream object etc.) an error code is returned to the agent.
  The actual recovery takes place when the stream open request is received,
  so it is most likely this request that will get the error code.
  If the stream object exists in the list it will be deleted and removed from
  the list.
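
As an illustration of how the log file that was open at server down could be
located, here is a minimal Python sketch. The text above only states that the
file has no close time stamp in its name; the concrete naming pattern used
here (<file name>_<YYYYMMDD_HHMMSS>.log while open, with a second stamp
appended once closed) and the function name find_open_log_file are
assumptions made for the example, not taken from the Log service source.

```python
import os
import re


def find_open_log_file(root_dir, relative_path, file_name):
    """Return the path of the log file without a close time stamp, or None.

    Assumed (hypothetical) naming: <file_name>_<YYYYMMDD_HHMMSS>.log while the
    file is open, <file_name>_<open stamp>_<close stamp>.log once closed.
    """
    stamp = r"\d{8}_\d{6}"
    # Exactly one time stamp between the name and ".log" means "still open".
    open_pattern = re.compile(rf"^{re.escape(file_name)}_({stamp})\.log$")
    directory = os.path.join(root_dir, relative_path)
    candidates = []
    for entry in os.listdir(directory):
        match = open_pattern.match(entry)
        if match:
            candidates.append((match.group(1), entry))
    if not candidates:
        return None  # recovery of this stream would fail
    # If several files lack a close stamp, take the most recently opened one.
    newest = max(candidates)[1]
    return os.path.join(directory, newest)
```

With the path in hand, os.path.getsize(path) gives the size of the current log
file, which is one of the values that is not stored in the IMM object.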

The standby server will do the following:

* Search for and create a list of all runtime objects (dn). See active.
* Start a timeout timer if there are objects in the list. See active.
* When receiving a checkpoint event for stream open, the corresponding name is
  removed from the list if it exists.
* At timeout the list is deleted.

The list must be handled on standby in order to have a relevant list in case
the standby becomes active.

States in the Log server

Recovery state:
  Enter if runtime objects are found during startup
  - Start recovery timer
  - Handle recovery
  Exit when the recovery timer times out
  - Remove remaining runtime objects, if any
  - Go to Normal state

Normal state:
  Enter if no runtime object is found during startup or when exiting Recovery state
  - This is the normal state of operation.
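
The two server states, together with the list and timer handling described
for the active server, can be sketched roughly as follows. This is a Python
model for illustration only: the class RecoveryList, its method names and the
injectable clock are assumptions, and the sketch treats an empty list as
leaving the Recovery state, which is one reading of "the timeout timer is
stopped".

```python
import time


class RecoveryList:
    """Stream runtime objects (DNs) left over from before the headless state."""

    def __init__(self, found_dns, timeout_s=600.0, now=time.monotonic):
        self._now = now
        self._pending = set(found_dns)
        # Recovery state is entered only if runtime objects were found.
        self.state = "RECOVERY" if self._pending else "NORMAL"
        self._deadline = now() + timeout_s if self._pending else None

    def on_open_existing(self, dn):
        """Return True if the stream is in the list and should be recreated."""
        if self.state != "RECOVERY" or dn not in self._pending:
            return False  # normal error handling applies
        self._pending.discard(dn)
        if not self._pending:
            self._finish()  # list empty: stop the timer
        return True

    def poll_timer(self):
        """Return the DNs to delete if the recovery timer has expired."""
        if self.state == "RECOVERY" and self._now() >= self._deadline:
            leftovers = sorted(self._pending)
            self._pending.clear()
            self._finish()
            return leftovers
        return []

    def _finish(self):
        self._deadline = None
        self.state = "NORMAL"
```

Passing the clock in as a parameter keeps the 10 minute timeout testable
without waiting; a real server would use its own timer facility instead.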

RECOVERY HANDLING IN AGENT

To spread out recovery communication with the server as much as possible in
time, the recovery actions are not started automatically by all agents in the
cluster as soon as server up is detected. First, recovery is done on demand,
when a client sends a request, most likely a write request. It may also be a
request to open a stream that is assumed to exist. However, a client may write
to the log so rarely that its first write after server up comes well after the
time when recovery is no longer possible (see timeout handling in server). It
is therefore necessary for the agent to make sure that recovery is done for all
clients before the recovery time is up. This is done using a timeout timer; at
timeout a recovery thread starts to recover all clients that are not already
recovered.

The agent will do the following when detecting server down, during server down
(headless state) and when server up is detected:

States in the agent

Server down detected:
* Mark all clients and their open streams as not recovered. Also remove id
  information received from the server (client id and stream ids)
* Stop the recovery timer if running and remove the recovery thread if it exists
* Set No server state

No server state:
* Return TRY AGAIN for all APIs except StreamClose, Finalize and Write
  - Finalize:
    Remove the client by freeing all resources and removing it from the list
    (normal handling) but do not send a message to the server. Normal error
    handling and return codes apply
  - StreamClose:
    It is possible to call the saLogStreamClose() API when headless.
    When the LOG service is up, all "abandoned" runtime streams will be
    cleaned up, which includes removing the IMM object and renaming the
    cfg/log files by appending the close time to their names.

Server up detected:
Note: This is done in the MDS thread
* Start a timer and a recovery thread waiting for timeout. The timeout is
  randomly selected within an interval chosen so that it is significantly
  shorter than the server timeout after which stream runtime objects are
  deleted
* Set Recovery state 1
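
A small Python sketch of the random timeout selection follows. The 600 s
server timeout comes from the server description above; the 400 to 500 s
interval and the name pick_agent_timeout are hypothetical values chosen only
to show the constraint that the agent timeout must stay below the server
timeout.

```python
import random

SERVER_RECOVERY_TIMEOUT_S = 600.0  # 10 min, as stated for the server


def pick_agent_timeout(rng, low_s=400.0, high_s=500.0):
    """Pick a random agent recovery timeout in [low_s, high_s] seconds.

    The default interval is a hypothetical choice for illustration only.
    """
    if not 0 < low_s <= high_s < SERVER_RECOVERY_TIMEOUT_S:
        raise ValueError("interval must lie below the server timeout")
    # Randomizing spreads the agents' recovery traffic over time.
    return rng.uniform(low_s, high_s)
```

Each agent would call this once when server up is detected, for example with
rng = random.Random(), so that agents in the cluster do not all start their
recovery threads at the same moment.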

Recovery state 1:
* Before timeout, if the client is not recovered (client recovered flag is
  false), a client requesting to open an existing stream or write a log record
  starts a recovery sequence. This recovery sequence is done in the client
  thread calling the API function.
  If the request is to close a stream that is not marked as recovered, the
  stream will just be removed from the client's list of streams. No message is
  sent to the server.
  If the request is to finalize, the client will be removed from the agent
  client list. No message is sent to the server.

  The recovery sequence is:
  - Send an Initialize request (if not already initialized) to get a client id
  - Send a server request to open an existing stream, for the stream in the
    client open request or write request, to get a stream id
  - Set the stream as recovered
  - If all streams are recovered, set the client as recovered

  If Fail:
  - Invalidate the client handle (delete the client) and return BAD HANDLE.
    The client and all its stream handles are lost and must be reinitialized

* If all clients are fully recovered:
  - Stop timer
  - Set Normal state

* If timeout:
  - Set Recovery state 2
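
The on-demand recovery sequence of Recovery state 1 can be sketched as below.
This is a simplified Python model, not the agent implementation: the Client
and Stream records, the ServerError exception and the server object with
initialize() and open_existing() methods are all hypothetical stand-ins for
the real agent data and server messages.

```python
from dataclasses import dataclass, field
from typing import Optional


class ServerError(Exception):
    """Raised by the (hypothetical) server stand-in when a request fails."""


@dataclass
class Stream:
    name: str
    stream_id: Optional[int] = None
    recovered: bool = False


@dataclass
class Client:
    client_id: Optional[int] = None
    recovered: bool = False
    valid: bool = True
    streams: dict = field(default_factory=dict)  # stream name -> Stream


def recover_for_request(client, stream_name, server):
    """Run the recovery sequence for one open-existing or write request.

    stream_name must already be registered in client.streams.
    Returns "OK" or "BAD_HANDLE".
    """
    if not client.valid:
        return "BAD_HANDLE"
    if client.recovered:
        return "OK"
    try:
        if client.client_id is None:
            client.client_id = server.initialize()
        stream = client.streams[stream_name]
        if not stream.recovered:
            stream.stream_id = server.open_existing(client.client_id, stream_name)
            stream.recovered = True
    except ServerError:
        # The client and all its stream handles are lost.
        client.valid = False
        client.streams.clear()
        return "BAD_HANDLE"
    if all(s.recovered for s in client.streams.values()):
        client.recovered = True
    return "OK"
```

Only the stream named in the request is recovered here; the client's other
streams stay unrecovered until they are used or until the recovery thread of
Recovery state 2 picks them up.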

Recovery state 2:
* At timeout, a recovery sequence to recover all clients that are registered
  with the agent and not already recovered is started in a recovery thread.
  During this recovery all requests from the client will be answered with
  TRY AGAIN. This is also the case for Finalize and Write.

  The sequence is, for each client not already recovered:
  - Initialize the client if not already initialized
  - Open all not already opened streams registered with the client.
    An open request without parameters and with the create flag not set is
    used. The server will check if the stream already has an IMM object and
    if so restore the stream. See [RECOVERY HANDLING IN SERVER]
  - If success, the client is marked as recovered
  If Fail:
  - Invalidate the client handle (delete the client). If the client later
    requests an operation other than initialize, BAD HANDLE will be returned.
    The client and all its stream handles are lost and must be reinitialized

* All clients are recovered
  - Terminate the recovery thread
  - Set Normal state

Normal state:
  Enter when server up during normal startup
  - This is the normal state of operation.
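
The agent states and the transitions between them can be summarized in a
small table-driven Python sketch. The enum names, the STARTING state (used
here for "before the first server up") and the helper answers_try_again are
assumptions introduced for the example; only the transitions themselves are
taken from the description above.

```python
from enum import Enum, auto


class State(Enum):
    STARTING = auto()    # assumed initial state before the first server up
    NO_SERVER = auto()
    RECOVERY_1 = auto()
    RECOVERY_2 = auto()
    NORMAL = auto()


class Event(Enum):
    SERVER_DOWN = auto()
    SERVER_UP = auto()
    TIMEOUT = auto()
    ALL_RECOVERED = auto()


TRANSITIONS = {
    (State.STARTING, Event.SERVER_UP): State.NORMAL,
    (State.NO_SERVER, Event.SERVER_UP): State.RECOVERY_1,
    (State.RECOVERY_1, Event.ALL_RECOVERED): State.NORMAL,
    (State.RECOVERY_1, Event.TIMEOUT): State.RECOVERY_2,
    (State.RECOVERY_2, Event.ALL_RECOVERED): State.NORMAL,
}

# APIs that are not answered with TRY AGAIN in No server state.
EXEMPT_WHEN_NO_SERVER = {"StreamClose", "Finalize", "Write"}


def next_state(state, event):
    """Return the state after event; unknown combinations keep the state."""
    if event is Event.SERVER_DOWN:
        return State.NO_SERVER  # server down is handled the same in any state
    return TRANSITIONS.get((state, event), state)


def answers_try_again(state, api):
    """True if the agent answers the API call with TRY AGAIN in this state."""
    if state is State.NO_SERVER:
        return api not in EXEMPT_WHEN_NO_SERVER
    return state is State.RECOVERY_2  # includes Finalize and Write
```

Keeping the transitions in one table makes it easy to check that every path
from No server state back to Normal state goes through Recovery state 1.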

There are some situations when recovery, or a complete recovery, cannot be or
is not done:

* If recovery of a stream fails the client will be invalidated.
  This is the case also if the client has more than one stream and one or more
  streams have already been successfully recovered. The reason for this is to
  avoid resource leaks. A leak would occur if the client's error handling is
  to re-initialize the log service when BAD HANDLE is received while opening
  or writing to a stream. If this is done the "old" client and its open
  streams will continue living as a "zombie" client in the server.

* Recovery of the log record Id is done by parsing the latest log file that
  contains log records. The log record Id is normally a number at the
  beginning of each log record. This is always the case if the default format
  is used. The latest record Id for a stream is found by searching backwards
  from the end of the file until a '\n' is found (or the start of the file);
  the first characters after that character are assumed to be the Id number.
  This does not always work, e.g. if a log message contains a '\n'.
  In that case recovery of the stream does not fail, but record Id numbering
  restarts from 1.
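
The backward search for the last record Id can be sketched in Python as
follows. The function names are hypothetical, the sketch works on the file
content as bytes rather than seeking in an open file, and skipping leading
spaces before the digits is an assumption about padding in the record format.

```python
def last_record_id(data):
    """Return the Id at the start of the last line of data, or None."""
    end = len(data)
    # Ignore the line terminator(s) that end the final record.
    while end > 0 and data[end - 1:end] in (b"\n", b"\r"):
        end -= 1
    if end == 0:
        return None
    # Search backwards for '\n'; rfind returns -1 at start of file.
    start = data.rfind(b"\n", 0, end) + 1
    line = data[start:end].lstrip(b" ")
    digits = bytearray()
    for byte in line:
        if not 48 <= byte <= 57:  # stop at the first non-digit
            break
        digits.append(byte)
    return int(digits.decode("ascii")) if digits else None


def next_record_id(data):
    """Id for the next record; numbering restarts from 1 if parsing fails."""
    last = last_record_id(data)
    return 1 if last is None else last + 1
```

A record whose message contains a '\n' shows the limitation: for
b"3 line one\nline two\n" the last line does not start with a number, so
last_record_id returns None and numbering restarts from 1.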