"Fossies" - the Fresh Open Source Software Archive 
Member "weblog_files/docs.txt" (31 Dec 2002, 24103 Bytes) of package /linux/www/old/weblog_files.zip:
DOCUMENTATION

WebLog 2.53 by Darryl C. Burgdorf (burgdorf@awsd.com)

http://awsd.com/scripts/weblog/

WebLog is a comprehensive access log analysis tool. It allows you to
keep track of activity on your site by month, week, day and hour, to
monitor total hits, bytes transferred and page views, and to keep track
of your most popular pages. It can also print out secondary reports to
track "user sessions," showing the paths taken through your site by your
visitors and giving you a rough idea of how long they spent looking at
your pages, and to provide you with information on referring sites, the
search engine keywords which brought your visitors and the agents and
platforms they used while visiting. It can read NCSA common or
combined log files, as well as Microsoft extended format log files.

===========================================

I. THE REPORTS

The primary WebLog access report provides the following information:

  A. Long-Term Statistics

     1. Monthly Statistics: An overview of site activity (number
        of hits, number of bytes transferred, and approximate number
        of visitors) per month for each month since you started
        running WebLog.
     2. Daily Statistics (Past Five Weeks): An overview of site
        activity per day for the past five weeks.
     3. Day of Week Statistics: An overview of site activity by
        weekday.
     4. Hourly Statistics: An overview of site activity by hour of
        the day.
     5. "Record Book": A simple listing of the days on which your
        site had the most hits, transferred the most data and saw
        the most visitors.

     Each of the "Long-Term Statistics" reports (except the "record
     book," of course) lists four pieces of information: Hits,
     Bytes, Visits and PViews. The number of "hits" is the total
     number of files requested from the server. For example, if a
     visitor loads a page which includes four inline graphics, a
     total of five hits will be recorded in the access log. The
     number of bytes represents the total amount of information
     transferred by the server in filling those requests. (Note that
     WebLog automatically factors in a bit extra in its calculations
     to allow for the fact that "header" information -- which is not
     recorded in the server access log -- is sent by the server along
     with each file.) The number of "visits" is an approximation of
     the number of actual individual visitors to your Web site. This
     is only a *very* rough approximation, and should be regarded as
     such. The number of "pviews" is the number of Web pages
     viewed by your visitors. Each of the "Long-Term Statistics"
     reports also includes a simple "bar graph" representation; the
     graph can be configured to reflect whichever of the four items
     you're most interested in being able to track "at a glance."

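To make the difference between hits and page views concrete, here is a
minimal Perl sketch that counts both for the five-request example above.
It is not taken from weblog.pl: the request list is invented, and
treating every file that does not match the default $DetailsFilter
pattern as a page view is a simplification of whatever WebLog does
internally.

```perl
use strict;
use warnings;

# Hypothetical requests: one page plus its four inline graphics.
my @requests = qw(/index.html /img/logo.gif /img/a.gif /img/b.jpg /img/c.jpeg);

# Pattern for non-page files; mirrors the default $DetailsFilter value.
my $filter = '(\.gif|\.jpg|\.jpeg)';

my ($hits, $pviews) = (0, 0);
for my $file (@requests) {
    $hits++;                               # every request is a hit
    $pviews++ unless $file =~ /$filter/;   # only unfiltered files count as pages
}
print "Hits: $hits, PViews: $pviews\n";    # prints "Hits: 5, PViews: 1"
```
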
  B. Statistics for The Current Month

     1. Top N Files by Number of Hits (optional): A list of the
        pages most frequently requested.
     2. Top N Files by Volume (optional): A list of the pages which
        resulted in the greatest number of bytes transferred.
     3. Complete File Statistics (optional): A list of all pages
        accessed in the current calendar month, with the date of
        last access, number of times requested, and total number of
        bytes transferred.
     4. Top N Most Frequently Requested 404 Files (optional): A
        list of the pages people are requesting most often which
        don't actually exist on your site.
     5. Complete 404 File Not Found Statistics (optional): A
        complete list of those nonexistent files.
     6. Top 25 "Entrance" Pages (optional): A list of the pages
        most often seen first by visitors to your site.
     7. Complete "Entrance" Page List (optional): The complete
        version of the above list.
     8. Top 25 "Exit" Pages (optional): A list of the pages most
        often seen last by visitors to your site.
     9. Complete "Exit" Page List (optional): The complete version
        of the above list.
    10. User ID Statistics (optional): A complete list of user IDs
        (and the associated second-level domains) utilized by the
        visitors to your Web site. Note that this report can, of
        course, only be generated if at least part of your Web site
        is password protected through your server's default system.
    11. "Top Level" Domains: A breakdown of how many visits you've
        had from each type of domain (.com, .net, .edu, etc.)
    12. Top N Domains by Number of Hits (optional): A list of the
        IP addresses (domains) from which people have visited your
        site most often.
    13. Top N Domains by Volume (optional): A list of the IP
        addresses from which people have requested the greatest
        amount of information.
    14. Complete Domain Statistics (optional): A complete list of
        the IP addresses from which people have visited your site
        since the beginning of the current calendar month.

     Each of the "Current Month" reports resets automatically at the
     beginning of each month. This allows you to easily keep track
     of things while preventing the report file from reaching too
     ridiculous a size over time.

The optional access details report keeps track of "user sessions." It
will show you detailed "tracks" of the paths taken through your site by
visitors for however many days you specify, and will give you overview
information regarding how many unique visitors you've had each day and
how long they seem to be staying around. If logging of referring URLs
is enabled, it will also show you, where possible, where your visitors
came from. Please note that precise tracking of the number of visitors
is impossible; the information in this report is at best a reasonably
close approximation based on the information in your server access log.

The optional referring URL report logs the URLs reported by browsers as
the "referers" directing them to the various listed pages. You should
be aware that this information is far from perfect. Many browsers do
not provide any information on the referring page; even those that do
can at times provide false or misleading data. Of course, this report
is only available if your server log contains the necessary information.

The optional keywords report logs the keywords used by your visitors
to find you in the various Internet search engines and directories. The
major search engines are each listed individually. Again, this report
is only available if your server log contains the necessary information.

The optional agent and platform reports list the agents (browsers)
and platforms (operating systems) utilized by visitors to your pages.
Again, of course, this report is only available if your server log
contains the necessary information.

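Whether the "necessary information" is present depends on the log
format. As a rough illustration (this is not WebLog's own parser), the
Perl sketch below splits one line in NCSA combined format into its
fields. The first seven fields make up the common format; the referer
and agent fields exist only in the combined format. The sample line and
its values are invented.

```perl
use strict;
use warnings;

# Invented sample line in NCSA combined format.
my $line = '192.0.2.10 - - [31/Dec/2002:23:59:01 -0600] '
         . '"GET /scripts/index.html HTTP/1.0" 200 5120 '
         . '"http://www.example.com/links.html" '
         . '"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"';

# Common-format fields, followed by the two optional combined-format fields.
my ($host, $ident, $user, $date, $request, $status, $bytes, $referer, $agent) =
    $line =~ /^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)(?: "([^"]*)" "([^"]*)")?\s*$/
    or die "Unrecognized log line\n";

print "Host: $host\nUser ID: $user\nRequest: $request\nBytes: $bytes\n";
print defined $referer
    ? "Referer: $referer\nAgent: $agent\n"
    : "No referer or agent data (common format)\n";
```

A line in common format simply stops after the byte count, in which case
$referer and $agent stay undefined and the referring URL, keywords and
agent reports have nothing to work from.
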
(CAVEAT:

  (Like any log analysis software, WebLog is based squarely upon
  several unfortunately questionable assumptions. Chief among these
  is the assumption that any accesses from a specific IP address
  within a reasonably short period of time belong to a single user,
  and the assumption that analysis of access logs can actually tell
  you anything useful about site visitors, anyway.

  (It is possible for different users to access your site with the
  same IP address, so a single "user session" might actually reflect
  visits from multiple users. As well, thanks to the number of
  systems which now employ local caching, it is quite likely that some
  of the pages which seem to be accessed only once are in actuality
  viewed many times by many different users.

  (WebLog also assumes that the time between the loading of one page
  and the loading of the next, so long as it is less than 30 minutes,
  is actually spent looking at the first page. This is clearly not
  necessarily the case. The user could have gotten up to fix himself
  lunch or use the bathroom. He could have reloaded another page
  already in his browser's cache, or could even have gone to look at
  pages on other sites before returning to yours. There is no way of
  knowing.

  (Finally, WebLog assumes that the average length of time spent
  viewing the last -- or only -- page visited in a user session is 30
  seconds. Again, there is obviously no way to check the validity of
  this assumption.)

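The timing assumptions described in the caveat can be sketched in a few
lines of Perl. This is an illustration only, not code from weblog.pl:
the timestamps are invented, and starting a new session at a gap of
exactly 30 minutes or more is an assumption about how the cutoff is
applied.

```perl
use strict;
use warnings;

# Hypothetical page-view times for one IP address, in seconds since midnight.
my @times = (36000, 36120, 36420, 40000, 40090);

my $timeout   = 30 * 60;   # a gap of 30 minutes or more starts a new session
my $last_page = 30;        # assumed viewing time for the last page of a session

my ($sessions, $total) = (0, 0);
for my $i (0 .. $#times) {
    # A new session starts at the first view or after a long gap.
    $sessions++ if $i == 0 || $times[$i] - $times[$i - 1] >= $timeout;

    # Time until the next view counts only if it falls within the session.
    my $gap = $i < $#times ? $times[$i + 1] - $times[$i] : $timeout;
    $total += $gap < $timeout ? $gap : $last_page;
}
print "Sessions: $sessions, estimated seconds on site: $total\n";
```

With these sample times the sketch reports two sessions and 570 seconds:
gaps of 120, 300 and 90 seconds between pages, plus 30 seconds for the
last page of each session.
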
===========================================

II. SETTING UP AND RUNNING WEBLOG

The files that you need are as follows:

weblog.pl: This is the main program file. You don't actually need to
do anything to it; in fact, you don't even have to execute it.

config.pl: This is the configuration file. Everything you need to
change or modify is contained here. This is also the file that you
will execute. (Things are set up this way so that you can effectively
maintain multiple versions of the script, for example if you want to
run separate log analyses for different sites, just by keeping
separate config files for each.)

bar1.gif, bar2.gif, bar3.gif, bar4.gif, and bar5.gif: These five small
graphics files are used to create the bar graphs in the main access
report.

As noted above, the WebLog configuration file, and not the WebLog
program itself, should be executed. (And please note that it should
be executed from a command prompt, for example in a telnet session,
rather than from your browser; WebLog is *not* a CGI script, and most
likely won't run correctly if you try to access it from your browser.)
The configuration file should, of course, be set executable. Make sure
that the first line of the script matches the location of your system's
Perl interpreter. As well, the following variables need to be defined:

$LogFile: The path (not URL) of the Web site access log file from
which the log reports will be generated. Note that this file is
generated by your server; if you're not sure where to find it or what
it's called, check with your system administrators. It is possible,
though not likely, that you don't actually have access to log data.
If that is the case, then you won't be able to use WebLog at all.
The script can read both NCSA common ("standard") and combined
("extended") format log files, as well as Microsoft extended format
log files. You don't need to specify the type, as WebLog determines
it automatically when it reads the file. Obviously, if you're
dealing with NCSA standard log files, or with log files which for
whatever reason don't include agent and referer information, WebLog
won't be able to generate agent or referer reports. You can use an
asterisk in the variable definition as a "wildcard." For example, if
your log files come to you named "www.YYMMDD" you can simply define
$LogFile as "www.*". This will tell WebLog to analyze *any* file
whose name matches the pattern. This can also be useful if you have
several log files backed up, and want to run WebLog on all of them
at once.

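In config.pl, the wildcard form described above might look like the
following fragment. The directory is a placeholder, not a path taken
from the WebLog distribution; substitute the location of your own logs.

```perl
# Hypothetical location; your server's log directory will differ.
# The asterisk matches any suffix: www.021230, www.021231, and so on.
$LogFile = "/usr/local/apache/logs/www.*";
```
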
$IPLog: The path to an optional DBM (database) file in which resolved
IP/domain pairs will be stored. Logging this information will allow
WebLog to run much faster, especially if you're running multiple
reports from a single log file. However, especially on a busy site,
the IP log file could become *very* large. If you define an IP log
file, keep an eye on its size.

$FileDir: The path of the directory in which the various report files
will be created.

$ReportFile, $FullListFile, $DetailsFile, $RefsFile, $KeywordsFile and
$AgentsFile: The file names to be used for each of the reports WebLog
can generate. All but the first are optional; if you don't assign a
file name, the report simply won't be generated. The "full list" file
allows you to put the "full" file, user ID and domain lists on a page
of their own, while keeping the "top N" lists on the main report page;
this makes the most interesting data easy to see, without requiring
that the main report page be extremely large.

$AgentListFile: An optional DBM (database) file in which a *complete*
list of agents (browsers) visiting your site will be maintained. In
most cases, there's really no reason to maintain such a list.

$DBMType: This variable determines how the DBM (database) files used
for storage of IP and/or agent info will be accessed. Most users can
leave it set to 0. If the script can't open the database file,
though -- and especially if you receive "Inappropriate file type
or format" errors -- try setting it to 1. This will replace the
basic tie() command with a version of the command specific to the
DB_File module, the module which produces the above error message.
If all else fails, set $DBMType to 2, and instead of tie() commands,
the more generic (but less efficient) dbmopen() commands will be
used.

$PrintFullAgentLists: If this variable is set to 1, and if you have
a DBM file containing a full list of agents, instead of printing out
its normal reports, WebLog will print two lists, showing exactly which
agents fall into the various agent and platform categories listed on
your agents report. Again, this is of little or no interest to most
of those using WebLog, and can quite safely be set to 0 and forgotten.

$EOMFile: An optional file which WebLog can "spin off" at the close
of each month, containing a full record of file access, etc., for the
month. This makes it easier for those who wish to keep permanent
"archive" reports to do so.

$SystemName: The name or description which you want to appear at the
top of your reports (e.g., "WebScripts").

$OrgName and $OrgDomain: The name and domain of the "host" organization
(e.g., ISP and isp.com). If these variables are defined, accesses
from this organization/domain will be counted separately from other
accesses in the details report.

$GraphURL: The URL of the directory containing the bar graph images
(e.g., "http://awsd.com/graphs"). Do NOT include a trailing slash!

$GraphBase: This variable defines the information on which you want the
bar graphs in the main report to be based. It can be set to "hits",
"visits", "pviews" or "bytes"; if left undefined (or defined
incorrectly), graphs will be based on bytes transferred.

$IncludeOnlyRefsTo and $ExcludeRefsTo: Regexes specifying files or
directories to include or ignore in the files lists. For example, to
include only files in a "scripts" subdirectory, $IncludeOnlyRefsTo =
"^/scripts" would suffice. Multiple entries should be "OR"ed
(e.g., $IncludeOnlyRefsTo = "(^/dir1|^/dir2)").

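As a rough illustration of how such patterns select files, the Perl
sketch below applies the "OR"ed example above to three invented paths.
The $ExcludeRefsTo value is hypothetical, and applying both patterns
together (include first, then exclude) is an assumption about WebLog's
behavior rather than something this documentation states.

```perl
use strict;
use warnings;

my $IncludeOnlyRefsTo = '(^/dir1|^/dir2)';   # pattern from the example above
my $ExcludeRefsTo     = '\.gif$';            # hypothetical exclusion pattern

my @listed;
for my $file (qw(/dir1/page.html /dir2/logo.gif /dir3/page.html)) {
    # Keep a file only if it matches the include pattern and not the exclude one.
    my $keep = $file =~ /$IncludeOnlyRefsTo/ && $file !~ /$ExcludeRefsTo/;
    push @listed, $file if $keep;
    printf "%-18s %s\n", $file, $keep ? "listed" : "ignored";
}
```

Only /dir1/page.html is listed: /dir2/logo.gif is caught by the
exclusion pattern, and /dir3/page.html lies outside both included
directories.
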
$IncludeOnlyDomain and $ExcludeDomain: Regexes specifying domains to
include or ignore in the log file. If you want your log analysis to
ignore any visits by you to your own site, for example, set the
$ExcludeDomain variable to your own IP address. (Note that even if
you don't ignore your own visits completely, you can still track them
separately in the details report by using the $OrgName and $OrgDomain
variables.)

$IncludeQuery: If this variable is set to "0" any query information
contained in a URL will be stripped as the log file is processed. If
it is set to "1" the information will be retained.

$PrintFiles: A flag specifying whether the lists of accessed files
should be generated. (Normally, of course, you'd want to do so.
However, for example, if you generate a separate access report for
each site on a server, and also a report for the server as a whole,
you might want to suppress the files listings on the server-wide
report.) 0 = no; 1 = yes. As noted earlier, by defining a "full
list" report, you can put the full list "off to the side," to keep
your main report's size down, but still have WebLog generate the
"top N files" lists.

$Print404: A flag specifying whether the "Code 404" file lists should
be printed. 0 = no; 1 = yes.

$PrintDomains: A flag specifying whether or not to print lists of
visiting IP addresses. 0 = no; 1 = yes. This variable can also be
set to "2" to indicate that you want only second-level domains
tracked. (In other words, for example, one hit each from
user1.foo.com and user2.foo.com will show up simply as two hits
from foo.com, which can greatly reduce the size of your report file,
especially if your site is busy!)

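The effect of setting $PrintDomains to "2" can be pictured with a short
Perl sketch. It is not WebLog's own code: it simply keeps the last two
labels of each host name, which is a naive rule that would mishandle
names such as foo.co.uk and unresolved IP numbers.

```perl
use strict;
use warnings;

my %hits;
for my $host (qw(user1.foo.com user2.foo.com www.bar.net)) {
    # Keep only the last two labels of the host name.
    my ($domain) = $host =~ /([^.]+\.[^.]+)$/;
    $hits{$domain}++;
}
print "$_: $hits{$_}\n" for sort keys %hits;   # bar.net: 1, then foo.com: 2
```
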
$PrintUserIDs: A flag specifying whether the User ID list should be
generated. If no portion of your site is password protected, or if
you use a password system other than that which is integral to your
server software (.htaccess in the case of most UNIX systems), then
this list can be turned off, as your log file won't contain any user
IDs, anyway.

$PrintTopNFiles: The number of files to include in the "Top N Files"
lists. Set to 0 if you don't want to print the lists. The script
cannot generate the "top N" list if the full list isn't also being
stored.

$TopFileListFilter: Regex defining files to exclude from the "Top N
Files" lists. The default value of "(\.gif|\.jpg|\.jpeg|Code 404)"
will filter out most image files and any frequently-requested but
nonexistent files.

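How $PrintTopNFiles and $TopFileListFilter interact can be sketched as
follows. This is an illustration under stated assumptions, not code
from weblog.pl: the hit counts are invented, and the way a "Code 404"
entry is labeled in the hash is hypothetical.

```perl
use strict;
use warnings;

my $PrintTopNFiles    = 2;
my $TopFileListFilter = '(\.gif|\.jpg|\.jpeg|Code 404)';   # default quoted above

# Hypothetical hit counts per file.
my %hits = ('/index.html' => 40, '/logo.gif' => 95, '/faq.html' => 12,
            '/news.html'  => 27, 'Code 404 /old.html' => 30);

# Drop filtered entries, then sort by hits (highest first, ties by name).
my @top = grep { $_ !~ /$TopFileListFilter/ }
          sort { $hits{$b} <=> $hits{$a} || $a cmp $b } keys %hits;

# Keep only the first N entries.
@top = @top[0 .. $PrintTopNFiles - 1] if @top > $PrintTopNFiles;
print "$_ ($hits{$_} hits)\n" for @top;
```

Although /logo.gif has the most hits, the filter removes it along with
the 404 entry, so the list shows /index.html and /news.html.
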
$PrintTopNDomains: The number of domains to include in the "Top N
Domains" lists. (This, of course, is irrelevant if you're not
printing domain lists.)

$LogOnlyNew: Setting this variable to "1" will instruct WebLog to
ignore any entries in the log file being analyzed which date from
before the end of the last log file analyzed. If you're afraid that
you might accidentally run the script with the same log file twice in
a row, setting this to "1" will prevent any data duplication. If, on
the other hand, you won't necessarily be analyzing log files in strict
chronological order, you will want to keep this set to "0" so that all
information is parsed.

$NoSessions: If set to "1" this variable will instruct WebLog *not* to
include visitor counts on the monthly, daily and day-of-week lists.
It will also disable creation of the details report.

$NoResolve: By default, WebLog will attempt to resolve any IP numbers
in the log file to domain names. This can take a while, especially
with larger log files. If you don't want the script to bother -- if,
for example, you don't care whether visitors came from ".com", ".net"
or ".jp" sites, or if your log file already contains resolved domain
names wherever possible, anyway -- just set this variable to "1".

$HourOffset: If you are in one time zone and your Web host is in
another, you can use this variable to adjust the times shown in
the various reports. For example, if your server is located in the
Eastern time zone, but you're in the Pacific time zone, set it to
"-3".

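The arithmetic behind the Eastern/Pacific example can be checked with a
small Perl sketch. It only shifts the hour of day and is not WebLog's
own code; a real adjustment would also have to move the date when the
shifted hour crosses midnight.

```perl
use strict;
use warnings;

my $HourOffset = -3;   # server on Eastern time, reader on Pacific time

for my $server_hour (0, 2, 14) {
    # Perl's % returns a non-negative result for a positive right operand,
    # so hours before midnight wrap around to the previous evening.
    my $local_hour = ($server_hour + $HourOffset) % 24;
    printf "server %02d:00 -> report %02d:00\n", $server_hour, $local_hour;
}
```

A hit logged at 14:00 server time appears under 11:00, and one logged at
02:00 appears under 23:00.
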
$DetailsFilter: A regex defining files to exclude from the details
report. (It's also used to determine what qualifies as a "page view"
in the main report.) The default value of "(\.gif|\.jpg|\.jpeg)" will
filter out most image files, making it easier to follow which actual
pages were viewed, and allowing a (theoretically) more accurate
tracking of the time spent on each page.

$DetailsDays: The number of "days" past to include in the details
report. (This, of course, is only relevant if you're actually printing
the details report.) The number cannot be greater than 36. Note that
this only refers to literal days if you are in fact running the script
once per day (as most users would). Technically, this actually tells
the script the number of previous runs from which to still show info
on the report. So if you only generate a report once per week, and
this variable is set to 7, you'll actually end up with 7 *weeks*
of details data in your report. Of course, keeping that much info in
the report is not a good idea, and is likely to cause "out of memory"
errors when you try to run the script.

$DetailsSummaryDays: You can keep the "summary" data from the details
report longer, if you like, than you keep the actual detailed traffic
breakdown. The $DetailsDays variable, above, defines how many "days"
worth of detailed data you want in the report; this variable defines
the total number of "days" for which you want at least summary data.
For example, you might set $DetailsDays to 2, and $DetailsSummaryDays
to 30; that would give you a detailed look at the paths taken through
your site by visitors in the past two days, and general info about the
number of visitors and how long they spent on your site, for the
entire past month.

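Written out as a config.pl fragment, the example above would be:

```perl
$DetailsDays        = 2;    # full visitor paths for the two most recent runs
$DetailsSummaryDays = 30;   # summary figures for the thirty most recent runs
```

Both values count runs of the script, so they correspond to calendar
days only if WebLog is run once per day.
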
$refsexcludefrom and $refsexcludeto: If you want references to or from
certain files ignored in the referring URLs report, define them here.
You might want to exclude any references from within the same domain,
for example, so that you can more easily see what *outside* locations
are sending visitors to your site.

$RefsStripWWW: Setting this variable to "1" will instruct the script to
remove the "www" prefix from URLs. If you don't strip those, the same
URL could end up appearing twice in your referring URL list, both as
"www.foo.com" and as "foo.com"; if you *do* strip the prefix, though,
while the lists will be a bit easier to read and interpret, you'll end
up with some URLs which you can't actually follow unless you manually
put the "www" back. (On some systems, for whatever reason, it's
mandatory.)

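The trade-off is easy to see in a small Perl sketch. The substitution
below is one plausible way to strip the prefix and is not taken from
weblog.pl; the URLs are invented.

```perl
use strict;
use warnings;

my $RefsStripWWW = 1;

my %refs;
for my $url (qw(http://www.foo.com/links.html http://foo.com/links.html)) {
    my $ref = $url;
    # Remove a leading "www." from the host part, keeping the scheme.
    $ref =~ s{^(\w+://)www\.}{$1}i if $RefsStripWWW;
    $refs{$ref}++;
}
print "$_: $refs{$_}\n" for sort keys %refs;
```

With stripping enabled, both referers are counted under the single entry
http://foo.com/links.html; with $RefsStripWWW set to 0 they would appear
as two separate lines.
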
$RefsFilterLists: This variable determines whether or not *all*
referring URLs and/or keywords will be listed in the reports. If it's
set to 1, the reports will automatically "filter out" less significant
URLs and keywords. This will of course keep the size of the reports
down. If you have a very busy site, and just want to know where
*most* people are coming from, filter your reports. On the other
hand, if you have a fairly quiet site, or if you're interested in
tracking all accesses, set this variable to 0.

$TopNRefDoms: This variable tells WebLog how many domains, if any, to
include in the "top referers" list. (This is just a list of the
domains -- not the specific pages -- from which the majority of your
visitors seem to be coming.)

$TopNKeywords: This variable defines the number of entries you want
included in your "top keywords" listing. (As with the top referring
domains list above, defining the variable as "0" will disable the
creation of the list.)

$AgentsIgnore: If you wish to ignore references to particular files in
your agents/platforms report, list them here. Eliminating references
to graphic images, for example, will prevent your report from
indicating an overly-high percentage of graphical browsers, since
only hits to actual pages will be included.

$Verbose: Setting this variable to "1" will instruct the script to
provide you with "status" comments as it runs. Setting it to "0"
will disable the comments. Any error messages, of course, will still
be generated.

$bodyspec: This variable defines any traits to be assigned to reports'
BODY tags.

$headerfile and $footerfile: These variables define the locations of
text files containing HTML code and text to appear at the top and
bottom, respectively, of the reports.

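Pulling several of the variables above together, a pared-down set of
definitions in config.pl might look like the fragment below. Every path
and value is a placeholder chosen for illustration, not a default from
the WebLog distribution, and the fragment shows only the variable
definitions; the rest of the shipped config.pl should be left as
distributed.

```perl
#!/usr/bin/perl
# Hypothetical settings; adjust every path and value to your own site.

$LogFile           = "/usr/local/apache/logs/access_log";
$FileDir           = "/usr/local/apache/htdocs/stats";
$ReportFile        = "index.html";
$DetailsFile       = "details.html";   # optional; leave undefined to skip it
$SystemName        = "WebScripts";
$GraphURL          = "http://awsd.com/graphs";   # no trailing slash
$GraphBase         = "hits";
$PrintFiles        = 1;
$PrintTopNFiles    = 25;
$TopFileListFilter = '(\.gif|\.jpg|\.jpeg|Code 404)';
$LogOnlyNew        = 1;
$Verbose           = 1;
```

Once the file is marked executable (for example with "chmod 755
config.pl"), it can be run from the command prompt as "./config.pl".
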
===========================================

This documentation assumes that you have at least a general familiarity
with setting up Perl scripts. If you need more specific assistance,
check with your system administrators, consult the WebScripts FAQ
(frequently asked questions) files <http://awsd.com/scripts/faqs.shtml>,
or post your question on the WebScripts General Support Forum
<http://awsd.com/scripts/forum/general/>.

-- Darryl C. Burgdorf