"Fossies" - the Fresh Open Source Software Archive

Member "weblog_files/docs.txt" (31 Dec 2002, 24103 Bytes) of package /linux/www/old/weblog_files.zip:


                             DOCUMENTATION

         WebLog 2.53 by Darryl C. Burgdorf (burgdorf@awsd.com)

                    http://awsd.com/scripts/weblog/

WebLog is a comprehensive access log analysis tool.  It allows you to
keep track of activity on your site by month, week, day and hour, to
monitor total hits, bytes transferred and page views, and to keep track
of your most popular pages.  It can also print out secondary reports to
track "user sessions," showing the paths taken through your site by your
visitors and giving you a rough idea of how long they spent looking at
your pages, and to provide you with information on referring sites, the
search engine keywords which brought your visitors and the agents and
platforms they used while visiting.  It can read NCSA common or
combined log files, as well as Microsoft extended format log files.

              ===========================================

I.  THE REPORTS

The primary WebLog access report provides the following information:

    A.  Long-Term Statistics

        1.  Monthly Statistics:  An overview of site activity (number
            of hits, number of bytes transferred, and approximate number
            of visitors) per month for each month since you started
            running WebLog.
        2.  Daily Statistics (Past Five Weeks):  An overview of site
            activity per day for the past five weeks.
        3.  Day of Week Statistics:  An overview of site activity by
            weekday.
        4.  Hourly Statistics:  An overview of site activity by hour of
            the day.
        5.  "Record Book":  A simple listing of the days on which your
            site had the most hits, transferred the most data and saw
            the most visitors.

        Each of the "Long-Term Statistics" reports (except the "record
        book," of course) lists four pieces of information:  Hits,
        Bytes, Visits and PViews.  The number of "hits" is the total
        number of files requested from the server.  For example, if a
        visitor loads a page which includes four inline graphics, a
        total of five hits will be recorded in the access log.  The
        number of bytes represents the total amount of information
        transferred by the server in filling those requests.  (Note that
        WebLog automatically factors in a bit extra in its calculations
        to allow for the fact that "header" information -- which is not
        recorded in the server access log -- is sent by the server along
        with each file.)  The number of "visits" is an approximation of
        the number of actual individual visitors to your Web site.  This
        is only a *very* rough approximation, and should be regarded as
        such.  The number of "pviews" is the number of Web pages
        viewed by your visitors.  Each of the "Long-Term Statistics"
        reports also includes a simple "bar graph" representation; the
        graph can be configured to reflect whichever of the four items
        you're most interested in being able to track "at a glance."
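
As a rough illustration of how the hits, bytes and page-view totals
relate to the access log, here is a minimal Python sketch.  WebLog
itself is written in Perl, and this is not its code.  The sketch
assumes NCSA common log format lines, counts every matching line as a
hit, and treats any request that does not match the image pattern
"(\.gif|\.jpg|\.jpeg)" as a page view.  The names LINE_RE, IMAGE_RE and
tally are made up for the example, and the header allowance WebLog adds
to the byte count is left out because its size is not given here.

```python
import re
from collections import Counter

# NCSA common log format: host ident user [time] "request" status bytes
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(?:\S+) (\S+)[^"]*" (\d{3}) (\d+|-)'
)
# Assumption: anything that is not an image counts as a page view.
IMAGE_RE = re.compile(r"(\.gif|\.jpg|\.jpeg)")


def tally(lines):
    """Count hits, bytes and page views in an iterable of log lines."""
    totals = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that are not in common log format
        path, size = m.group(3), m.group(5)
        totals["hits"] += 1
        # A "-" in the size field means no body was sent.
        totals["bytes"] += 0 if size == "-" else int(size)
        if not IMAGE_RE.search(path):
            totals["pviews"] += 1
    return totals
```

Fed the five log lines produced by one page with four inline graphics,
tally() reports five hits and one page view, matching the example in
the paragraph above.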

    B.  Statistics for the Current Month

        1.  Top N Files by Number of Hits (optional):  A list of the
            pages most frequently requested.
        2.  Top N Files by Volume (optional):  A list of the pages which
            resulted in the greatest number of bytes transferred.
        3.  Complete File Statistics (optional):  A list of all pages
            accessed in the current calendar month, with the date of
            last access, number of times requested, and total number of
            bytes transferred.
        4.  Top N Most Frequently Requested 404 Files (optional):  A
            list of the pages people are requesting most often which
            don't actually exist on your site.
        5.  Complete 404 File Not Found Statistics (optional):  A
            complete list of those nonexistent files.
        6.  Top 25 "Entrance" Pages (optional):  A list of the pages
            most often seen first by visitors to your site.
        7.  Complete "Entrance" Page List (optional):  The complete
            version of the above list.
        8.  Top 25 "Exit" Pages (optional):  A list of the pages most
            often seen last by visitors to your site.
        9.  Complete "Exit" Page List (optional):  The complete version
            of the above list.
       10.  User ID Statistics (optional):  A complete list of user IDs
            (and the associated second-level domains) utilized by the
            visitors to your Web site.  Note that this report can, of
            course, only be generated if at least part of your Web site
            is password protected through your server's default system.
       11.  "Top Level" Domains:  A breakdown of how many visits you've
            had from each type of domain (.com, .net, .edu, etc.).
       12.  Top N Domains by Number of Hits (optional):  A list of the
            IP addresses (domains) from which people have visited your
            site most often.
       13.  Top N Domains by Volume (optional):  A list of the IP
            addresses from which people have requested the greatest
            amount of information.
       14.  Complete Domain Statistics (optional):  A complete list of
            the IP addresses from which people have visited your site
            since the beginning of the current calendar month.

        Each of the "Current Month" reports resets automatically at the
        beginning of each month.  This allows you to easily keep track
        of things while preventing the report file from reaching too
        ridiculous a size over time.

The optional access details report keeps track of "user sessions."  It
will show you detailed "tracks" of the paths taken through your site by
visitors for however many days you specify, and will give you overview
information regarding how many unique visitors you've had each day and
how long they seem to be staying around.  If logging of referring URLs
is enabled, it will also show you, where possible, where your visitors
came from.  Please note that precise tracking of the number of visitors
is impossible; the information in this report is at best a reasonably
close approximation based on the information in your server access log.

The optional referring URL report logs the URLs reported by browsers as
the "referers" directing them to the various listed pages.  You should
be aware that this information is far from perfect.  Many browsers do
not provide any information on the referring page; even those that do
can at times provide false or misleading data.  Of course, this report
is only available if your server log contains the necessary information.

The optional keywords report logs the keywords used by your visitors
to find you in the various Internet search engines and directories.  The
major search engines are each listed individually.  Again, this report
is only available if your server log contains the necessary information.
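
To give a feel for where such keywords come from, the following Python
sketch pulls search words out of a referring URL's query string.  It is
an illustration only, not WebLog's Perl code.  The parameter names in
QUERY_PARAMS are guesses at what search engines commonly use; the list
WebLog actually recognizes is not documented here, and the function
name is invented for the example.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical parameter names; real search engines vary.
QUERY_PARAMS = ("q", "query", "p")


def keywords_from_referer(referer):
    """Return the lower-cased search words found in a referring URL."""
    query = parse_qs(urlsplit(referer).query)
    for name in QUERY_PARAMS:
        if name in query:
            # parse_qs has already decoded "+" and %XX escapes.
            return query[name][0].lower().split()
    return []  # no recognizable search terms in this referer
```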

The optional agent and platform reports list the agents (browsers)
and platforms (operating systems) utilized by visitors to your pages.
Again, of course, these reports are only available if your server log
contains the necessary information.

    (CAVEAT:

    (Like any log analysis software, WebLog is based squarely upon
    several unfortunately questionable assumptions.  Chief among these
    is the assumption that any accesses from a specific IP address
    within a reasonably short period of time belong to a single user,
    and the assumption that analysis of access logs can actually tell
    you anything useful about site visitors, anyway.

    (It is possible for different users to access your site with the
    same IP address, so a single "user session" might actually reflect
    visits from multiple users.  As well, thanks to the number of
    systems which now employ local caching, it is quite likely that some
    of the pages which seem to be accessed only once are in actuality
    viewed many times by many different users.

    (WebLog also assumes that the time between the loading of one page
    and the loading of the next, so long as it is less than 30 minutes,
    is actually spent looking at the first page.  This is clearly not
    necessarily the case.  The user could have gotten up to fix himself
    lunch or use the bathroom.  He could have reloaded another page
    already in his browser's cache, or could even have gone to look at
    pages on other sites before returning to yours.  There is no way of
    knowing.

    (Finally, WebLog assumes that the average length of time spent
    viewing the last -- or only -- page visited in a user session is 30
    seconds.  Again, there is obviously no way to check the validity of
    this assumption.)
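
The two timing assumptions in the caveat can be made concrete with a
short Python sketch.  This is an illustration, not WebLog's own Perl
code.  It assumes the page-view times for a single IP address are
already sorted, starts a new session whenever the gap between two
views reaches 30 minutes, and credits the last page of each session
with 30 seconds.  The names SESSION_GAP, LAST_PAGE and session_lengths
are invented for the example.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # gaps this long start a new session
LAST_PAGE = timedelta(seconds=30)    # assumed time spent on the final page


def session_lengths(page_times):
    """Split one address's sorted page-view times into sessions and
    return the estimated duration of each session."""
    lengths = []
    start = prev = None
    for t in page_times:
        if prev is not None and t - prev >= SESSION_GAP:
            # Close the previous session and begin a new one at t.
            lengths.append(prev - start + LAST_PAGE)
            start = t
        elif start is None:
            start = t
        prev = t
    if prev is not None:
        lengths.append(prev - start + LAST_PAGE)
    return lengths
```

Views at 10:00, 10:05 and 10:20 followed by one at 11:30 come out as
two sessions: one of 20 minutes 30 seconds, and one of 30 seconds.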

              ===========================================

II.  SETTING UP AND RUNNING WEBLOG

The files that you need are as follows:

weblog.pl:  This is the main program file.  You don't actually need to
  do anything to it; in fact, you don't even have to execute it.

config.pl:  This is the configuration file.  Everything you need to
  change or modify is contained here.  This is also the file that you
  will execute.  (Things are set up this way so that you can effectively
  maintain multiple versions of the script, for example if you want to
  run separate log analyses for different sites, just by keeping
  separate config files for each.)

bar1.gif, bar2.gif, bar3.gif, bar4.gif, and bar5.gif:  These five small
  graphics files are used to create the bar graphs in the main access
  report.

As noted above, the WebLog configuration file, and not the WebLog
program itself, should be executed.  (And please note that it should
be executed from the telnet command prompt rather than your browser;
WebLog is *not* a CGI script, and most likely won't run correctly if you
try to access it from your browser.)  The configuration file should, of
course, be set executable.  Make sure that the first line of the script
matches the location of your system's Perl interpreter.  As well, the
following variables need to be defined:

$LogFile:  The path (not URL) of the Web site access log file from
  which the log reports will be generated.  Note that this file is
  generated by your server; if you're not sure where to find it or what
  it's called, check with your system administrators.  It is possible,
  though not likely, that you don't actually have access to log data.
  If that is the case, then you won't be able to use WebLog at all.
  The script can read both NCSA common ("standard") and combined
  ("extended") format log files, as well as Microsoft extended format
  log files.  You don't need to specify the type, as WebLog determines
  it automatically when it reads the file.  Obviously, if you're
  dealing with NCSA standard log files, or with log files which for
  whatever reason don't include agent and referer information, WebLog
  won't be able to generate agent or referer reports.  You can use an
  asterisk in the variable definition as a "wildcard."  For example, if
  your log files come to you named "www.YYMMDD" you can simply define
  $LogFile as "www.*".  This will tell WebLog to analyze *any* file
  whose name matches the pattern.  This can also be useful if you have
  several log files backed up, and want to run WebLog on all of them
  at once.
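
The effect of the wildcard can be sketched in Python with shell-style
pattern matching.  This illustrates the idea only; WebLog does its own
matching in Perl.  The function name is invented for the example, and
returning the names in sorted order is a choice made here, not
something the documentation promises.

```python
import fnmatch


def matching_logs(pattern, filenames):
    """Return the sorted file names that match a shell-style pattern."""
    return sorted(fnmatch.filter(filenames, pattern))
```

With a pattern of "www.*", files named www.021229, www.021230 and
www.021231 are all selected, while a file such as error.log is not.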

$IPLog:  The path to an optional DBM (database) file in which resolved
  IP/domain pairs will be stored.  Logging this information will allow
  WebLog to run much faster, especially if you're running multiple
  reports from a single log file.  However, especially on a busy site,
  the IP log file could become *very* large.  If you define an IP log
  file, keep an eye on its size.
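
Why a stored table of resolved addresses saves time can be seen in a
small Python sketch: each address is looked up at most once, and later
requests are answered from the stored copy.  This is not WebLog's Perl
code.  The function names are invented, and Python's dbm module stands
in for whichever DBM library Perl uses on your system.

```python
import dbm
import socket


def open_cache(path):
    """Open (or create) a DBM file for resolved IP/domain pairs."""
    return dbm.open(path, "c")


def system_resolver(ip):
    """Reverse-resolve an address; fall back to the address itself."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ip


def resolve_cached(ip, cache, resolver=system_resolver):
    """Return the host name for ip, consulting the cache first."""
    key = ip.encode()
    if key in cache:
        return cache[key].decode()  # already resolved on an earlier run
    name = resolver(ip)
    cache[key] = name.encode()      # this is why the file keeps growing
    return name
```

Any mapping with bytes keys works as the cache, so a plain dict can be
used to try the function out without touching the disk or the network.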

$FileDir:  The path of the directory in which the various report files
  will be created.

$ReportFile, $FullListFile, $DetailsFile, $RefsFile, $KeywordsFile and
  $AgentsFile:  The file names to be used for each of the reports WebLog
  can generate.  All but the first are optional; if you don't assign a
  file name, the report simply won't be generated.  The "full list" file
  allows you to put the "full" file, user ID and domain lists on a page
  of their own, while keeping the "top N" lists on the main report page;
  this makes the most interesting data easy to see, without requiring
  that the main report page be extremely large.

$AgentListFile:  An optional DBM (database) file in which a *complete*
  list of agents (browsers) visiting your site will be maintained.  In
  most cases, there's really no reason to maintain such a list.

$DBMType:  This variable determines how the DBM (database) files used
  for storage of IP and/or agent info will be accessed.  Most users can
  leave it set to 0.  If the script can't open the database file,
  though -- and especially if you receive "Inappropriate file type
  or format" errors -- try setting it to 1.  This will replace the
  basic tie() command with a version of the command specific to the
  DB_File module, which produces the above error message.  If all
  else fails, set $DBMType to 2, and instead of tie() commands, the
  more generic (but less efficient) dbmopen() commands will be
  used.

$PrintFullAgentLists:  If this variable is set to 1, and if you have
  a DBM file containing a full list of agents, instead of printing out
  its normal reports, WebLog will print two lists, showing exactly which
  agents fall into the various agent and platform categories listed on
  your agents report.  Again, this is of little or no interest to most
  of those using WebLog, and can quite safely be set to 0 and forgotten.

$EOMFile:  An optional file which WebLog can "spin off" at the close
  of each month, containing a full record of file access, etc., for the
  month.  This makes it easier for those who wish to keep permanent
  "archive" reports to do so.

$SystemName:  The name or description which you want to appear at the
  top of your reports (e.g., "WebScripts").

$OrgName and $OrgDomain:  The name and domain of the "host" organization
  (e.g., ISP and isp.com).  If these variables are defined, accesses
  from this organization/domain will be counted separately from other
  accesses in the details report.

$GraphURL:  The URL of the directory containing the bar graph images
  (e.g., "http://awsd.com/graphs").  Do NOT include a trailing slash!

$GraphBase:  This variable defines the information on which you want the
  bar graphs in the main report to be based.  It can be set to "hits",
  "visits", "pviews" or "bytes"; if left undefined (or defined
  incorrectly), graphs will be based on bytes transferred.

$IncludeOnlyRefsTo and $ExcludeRefsTo:  Regexes specifying files or
  directories to include or ignore in the file lists.  For example, to
  include only files in a "scripts" subdirectory, $IncludeOnlyRefsTo =
  "^/scripts" would suffice.  Multiple entries should be "OR"ed
  (e.g., $IncludeOnlyRefsTo = "(^/dir1|^/dir2)").
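
The way such include and exclude patterns behave can be tried out with
a few lines of Python.  This is a sketch, not WebLog's Perl code; for
patterns as simple as the ones above, Python and Perl regexes agree.
The function name is invented, and applying the include test before
the exclude test is an assumption about how the two settings combine.

```python
import re


def keep_path(path, include_only=None, exclude=None):
    """Decide whether a requested path belongs in the file lists."""
    if include_only and not re.search(include_only, path):
        return False  # an include pattern is set and does not match
    if exclude and re.search(exclude, path):
        return False  # explicitly excluded
    return True
```

For instance, keep_path("/scripts/weblog/", include_only="^/scripts")
is true, while the same call for "/images/logo.gif" is false.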

$IncludeOnlyDomain and $ExcludeDomain:  Regexes specifying domains to
  include or ignore in the log file.  If you want your log analysis to
  ignore any visits by you to your own site, for example, set the
  $ExcludeDomain variable to your own IP address.  (Note that even if
  you don't ignore your own visits completely, you can still track them
  separately in the details report by using the $OrgName and $OrgDomain
  variables.)

$IncludeQuery:  If this variable is set to "0" any query information
  contained in a URL will be stripped as the log file is processed.  If
  it is set to "1" the information will be retained.
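
In Python terms, stripping the query information amounts to dropping
everything from the first "?" onward.  This is only an illustration of
the effect; the function name is invented and WebLog's own handling is
in Perl.

```python
def request_path(url, include_query=False):
    """Return the URL with or without its query string."""
    return url if include_query else url.split("?", 1)[0]
```

With the query stripped, "/search.cgi?term=a" and "/search.cgi?term=b"
are both counted as "/search.cgi".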

$PrintFiles:  A flag specifying whether the lists of accessed files
  should be generated.  (Normally, of course, you'd want to do so.
  However, for example, if you generate a separate access report for
  each site on a server, and also a report for the server as a whole,
  you might want to suppress the files listings on the server-wide
  report.)  0 = no; 1 = yes.  As noted earlier, by defining a "full
  list" report, you can put the full list "off to the side," to keep
  your main report's size down, but still have WebLog generate the
  "top N files" lists.

$Print404:  A flag specifying whether the "Code 404" file lists should
  be printed.  0 = no; 1 = yes.

$PrintDomains:  A flag specifying whether or not to print lists of
  visiting IP addresses.  0 = no; 1 = yes.  This variable can also be
  set to "2" to indicate that you want only second-level domains
  tracked.  (In other words, for example, one hit each from
  user1.foo.com and user2.foo.com will show up simply as two hits
  from foo.com, which can greatly reduce the size of your log file,
  especially if your site is busy!)
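
The collapsing of host names to second-level domains can be sketched
in Python as follows.  This is not WebLog's Perl code and the function
name is invented.  The sketch simply keeps the last two labels of a
host name, so a name such as "host.foo.co.uk" would come out as
"co.uk"; whether WebLog treats country-code domains differently is not
stated here.

```python
from collections import Counter


def second_level(host):
    """Reduce a host name to its last two labels."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) < 2 or all(label.isdigit() for label in labels):
        return host  # bare names and unresolved IP addresses stay as-is
    return ".".join(labels[-2:])


def hits_by_domain(hosts):
    """Count hits per second-level domain."""
    return Counter(second_level(h) for h in hosts)
```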

$PrintUserIDs:  A flag specifying whether the User ID list should be
  generated.  If no portion of your site is password protected, or if
  you use a password system other than that which is integral to your
  server software (.htaccess in the case of most UNIX systems), then
  this list can be turned off, as your log file won't contain any user
  IDs, anyway.

$PrintTopNFiles:  The number of files to include in the "Top N Files"
  lists.  Set to 0 if you don't want to print the lists.  The script
  cannot generate the "top N" list if the full list isn't also being
  stored.

$TopFileListFilter:  Regex defining files to exclude from the "Top N
  Files" lists.  The default value of "(\.gif|\.jpg|\.jpeg|Code 404)"
  will filter out most image files and any frequently-requested but
  non-existing files.
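
How the top N count and the filter work together can be sketched in
Python.  This is an illustration rather than WebLog's Perl code.  The
function name is invented, and the sample entry containing the text
"Code 404" is a guess at how missing files are labeled in the stored
list; only the filter pattern itself comes from the default above.

```python
import re
from collections import Counter

# Same pattern as the documented default for $TopFileListFilter.
TOP_FILTER = re.compile(r"(\.gif|\.jpg|\.jpeg|Code 404)")


def top_n_files(hit_counts, n):
    """Return the n most requested entries that survive the filter."""
    kept = Counter({name: hits for name, hits in hit_counts.items()
                    if not TOP_FILTER.search(name)})
    return kept.most_common(n)


# Hypothetical hit counts for a small site.
sample = {
    "/index.html": 40,
    "/logo.gif": 90,
    "/faq.html": 25,
    "Code 404 /old.html": 30,
}
```

Here top_n_files(sample, 2) lists /index.html and /faq.html, even
though the image and the missing file were requested more often than
/faq.html.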

$PrintTopNDomains:  The number of domains to include in the "Top N
  Domains" lists.  (This, of course, is irrelevant if you're not
  printing domain lists.)

$LogOnlyNew:  Setting this variable to "1" will instruct WebLog to
  ignore any entries in the log file being analyzed which date from
  before the end of the last log file analyzed.  If you're afraid that
  you might accidentally run the script with the same log file twice in
  a row, setting this to "1" will prevent any data duplication.  If, on
  the other hand, you won't necessarily be analyzing log files in strict
  chronological order, you will want to keep this set to "0" so that all
  information is parsed.

$NoSessions:  If set to "1" this variable will instruct WebLog *not* to
  include visitor counts on the monthly, daily and day-of-week lists.
  It will also disable creation of the details report.

$NoResolve:  By default, WebLog will attempt to resolve any IP numbers
  in the log file to domain names.  This can take a while, especially
  with larger log files.  If you don't want the script to bother -- if,
  for example, you don't care whether visitors came from ".com", ".net"
  or ".jp" sites, or if your log file already contains resolved domain
  names wherever possible, anyway -- just set this variable to "1".

$HourOffset:  If you are in one time zone and your Web host is in
  another, you can use this variable to adjust the times shown in
  the various reports.  For example, if your server is located in the
  Eastern time zone, but you're in the Pacific time zone, set it to
  "-3".
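
The adjustment is plain hour arithmetic, as this small Python sketch
shows.  It is an illustration only, and the function name is invented.
Note that a negative offset can move an entry logged shortly after
midnight back onto the previous day.

```python
from datetime import datetime, timedelta


def shift_hour(stamp, hour_offset):
    """Apply an offset in hours to a logged timestamp."""
    return stamp + timedelta(hours=hour_offset)
```

With an offset of -3, a request logged at 01:30 on 31 December is
reported at 22:30 on 30 December.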

$DetailsFilter:  A regex defining files to exclude from the details
  report.  (It's also used to determine what qualifies as a "page view"
  in the main report.)  The default value of "(\.gif|\.jpg|\.jpeg)" will
  filter out most image files, making it easier to follow which actual
  pages were viewed, and allowing a (theoretically) more accurate
  tracking of the time spent on each page.

$DetailsDays:  The number of "days" past to include in the details
  report.  (This, of course, is only relevant if you're actually printing
  the details report.)  The number cannot be greater than 36.  Note that
  this only refers to literal days if you are in fact running the script
  once per day (as most users would).  Technically, this actually tells
  the script the number of previous runs from which to still show info
  on the report.  So if you only generate a report once per week, and
  this variable is set to 7, you'll actually end up with 7 *weeks*
  of details data in your report.  Of course, keeping that much info in
  the report is not a good idea, and is likely to cause "out of memory"
  errors when you try to run the script.

$DetailsSummaryDays:  You can keep the "summary" data from the details
  report longer, if you like, than you keep the actual detailed traffic
  breakdown.  The $DetailsDays variable, above, defines how many "days"
  worth of detailed data you want in the report; this variable defines
  the total number of "days" for which you want at least summary data.
  For example, you might set $DetailsDays to 2, and $DetailsSummaryDays
  to 30; that would give you a detailed look at the paths taken through
  your site by visitors in the past two days, and general info about the
  number of visitors and how long they spent on your site, for the
  entire past month.

$refsexcludefrom and $refsexcludeto:  If you want references to or from
  certain files ignored in the referring URLs report, define them here.
  You might want to exclude any references from within the same domain,
  for example, so that you can more easily see what *outside* locations
  are sending visitors to your site.

$RefsStripWWW:  Setting this variable to "1" will instruct the script to
  remove the "www" prefix from URLs.  If you don't strip those, the same
  URL could end up appearing twice in your referring URL list, both as
  "www.foo.com" and as "foo.com"; if you *do* strip the prefix, though,
  while the lists will be a bit easier to read and interpret, you'll end
  up with some URLs which you can't actually follow unless you manually
  put the "www" back.  (On some systems, for whatever reason, it's
  mandatory.)
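
The merging effect of stripping the prefix can be seen in a short
Python sketch.  This is not WebLog's Perl code; the function names are
invented, and the pattern only handles a leading "www." directly after
the "http://" or "https://" scheme.

```python
import re
from collections import Counter


def strip_www(url):
    """Remove a leading "www." from the host part of a URL."""
    return re.sub(r"^(https?://)www\.", r"\1", url, flags=re.IGNORECASE)


def count_referers(urls):
    """Count referring URLs after stripping the "www" prefix."""
    return Counter(strip_www(u) for u in urls)
```

Two referrals, one from http://www.foo.com/links.html and one from
http://foo.com/links.html, are then counted as a single entry with two
hits.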

$RefsFilterLists:  This variable determines whether or not *all*
  referring URLs and/or keywords will be listed in the reports.  If it's
  set to 1, the reports will automatically "filter out" less significant
  URLs and keywords.  This will of course keep the size of the reports
  down.  If you have a very busy site, and just want to know where
  *most* people are coming from, filter your reports.  On the other
  hand, if you have a fairly quiet site, or if you're interested in
  tracking all accesses, set this variable to 0.

$TopNRefDoms:  This variable tells WebLog how many domains, if any, to
  include in the "top referers" list.  (This is just a list of the
  domains -- not the specific pages -- from which the majority of your
  visitors seem to be coming.)

$TopNKeywords:  This variable defines the number of entries you want
  included in your "top keywords" listing.  (As with the top referring
  domains list above, defining the variable as "0" will disable the
  creation of the list.)

$AgentsIgnore:  If you wish to ignore references to particular files in
  your agents/platforms report, list them here.  Eliminating references
  to graphic images, for example, will prevent your report from
  indicating an overly-high percentage of graphical browsers, since
  only hits to actual pages will be included.

$Verbose:  Setting this variable to "1" will instruct the script to
  provide you with "status" comments as it runs.  Setting it to "0"
  will disable the comments.  Any error messages, of course, will still
  be generated.

$bodyspec:  This variable defines any traits to be assigned to reports'
  BODY tags.

$headerfile and $footerfile:  These variables define the locations of
  text files containing HTML code and text to appear at the top and
  bottom, respectively, of the reports.

              ===========================================

This documentation assumes that you have at least a general familiarity
with setting up Perl scripts.  If you need more specific assistance,
check with your system administrators, consult the WebScripts FAQ
(frequently asked questions) files <http://awsd.com/scripts/faqs.shtml>,
or post your question on the WebScripts General Support Forum
<http://awsd.com/scripts/forum/general/>.

-- Darryl C. Burgdorf
  450 -- Darryl C. Burgdorf