"Fossies" - the Fresh Open Source Software Archive

Member "libgcgi.a-0.9.5/doc/rfc2396.txt" (22 Jun 2002, 84503 Bytes) of package /linux/www/old/gcgi-0.9.5.tar.gz:



    7 Network Working Group                                     T. Berners-Lee
    8 Request for Comments: 2396                                       MIT/LCS
    9 Updates: 1808, 1738                                          R. Fielding
   10 Category: Standards Track                                    U.C. Irvine
   11                                                              L. Masinter
   12                                                        Xerox Corporation
   13                                                              August 1998
   14 
   15 
   16            Uniform Resource Identifiers (URI): Generic Syntax
   17 
   18 Status of this Memo
   19 
   20    This document specifies an Internet standards track protocol for the
   21    Internet community, and requests discussion and suggestions for
   22    improvements.  Please refer to the current edition of the "Internet
   23    Official Protocol Standards" (STD 1) for the standardization state
   24    and status of this protocol.  Distribution of this memo is unlimited.
   25 
   26 Copyright Notice
   27 
   28    Copyright (C) The Internet Society (1998).  All Rights Reserved.
   29 
   30 IESG Note
   31 
   32    This paper describes a "superset" of operations that can be applied
   33    to URI.  It consists of both a grammar and a description of basic
   34    functionality for URI.  To understand what is a valid URI, both the
   35    grammar and the associated description have to be studied.  Some of
   36    the functionality described is not applicable to all URI schemes, and
   37    some operations are only possible when certain media types are
   38    retrieved using the URI, regardless of the scheme used.
   39 
   40 Abstract
   41 
   42    A Uniform Resource Identifier (URI) is a compact string of characters
   43    for identifying an abstract or physical resource.  This document
   44    defines the generic syntax of URI, including both absolute and
   45    relative forms, and guidelines for their use; it revises and replaces
   46    the generic definitions in RFC 1738 and RFC 1808.
   47 
   48    This document defines a grammar that is a superset of all valid URI,
   49    such that an implementation can parse the common components of a URI
   50    reference without knowing the scheme-specific requirements of every
   51    possible identifier type.  This document does not define a generative
   52    grammar for URI; that task will be performed by the individual
   53    specifications of each URI scheme.
   54 
   55 
   56 
   57 
   58 Berners-Lee, et. al.        Standards Track                     [Page 1]
   59 
   60 RFC 2396                   URI Generic Syntax                August 1998
   61 
   62 
   63 1. Introduction
   64 
   65    Uniform Resource Identifiers (URI) provide a simple and extensible
   66    means for identifying a resource.  This specification of URI syntax
   67    and semantics is derived from concepts introduced by the World Wide
   68    Web global information initiative, whose use of such objects dates
   69    from 1990 and is described in "Universal Resource Identifiers in WWW"
   70    [RFC1630].  The specification of URI is designed to meet the
   71    recommendations laid out in "Functional Recommendations for Internet
   72    Resource Locators" [RFC1736] and "Functional Requirements for Uniform
   73    Resource Names" [RFC1737].
   74 
   75    This document updates and merges "Uniform Resource Locators"
   76    [RFC1738] and "Relative Uniform Resource Locators" [RFC1808] in order
   77    to define a single, generic syntax for all URI.  It excludes those
   78    portions of RFC 1738 that defined the specific syntax of individual
   79    URL schemes; those portions will be updated as separate documents, as
   80    will the process for registration of new URI schemes.  This document
   81    does not discuss the issues and recommendations for dealing with
   82    characters outside of the US-ASCII character set [ASCII]; those
   83    recommendations are discussed in a separate document.
   84 
   85    All significant changes from the prior RFCs are noted in Appendix G.
   86 
   87 1.1. Overview of URI
   88 
   89    URI are characterized by the following definitions:
   90 
   91       Uniform
   92          Uniformity provides several benefits: it allows different types
   93          of resource identifiers to be used in the same context, even
   94          when the mechanisms used to access those resources may differ;
   95          it allows uniform semantic interpretation of common syntactic
   96          conventions across different types of resource identifiers; it
   97          allows introduction of new types of resource identifiers
   98          without interfering with the way that existing identifiers are
   99          used; and, it allows the identifiers to be reused in many
  100          different contexts, thus permitting new applications or
  101          protocols to leverage a pre-existing, large, and widely-used
  102          set of resource identifiers.
  103 
  104       Resource
  105          A resource can be anything that has identity.  Familiar
  106          examples include an electronic document, an image, a service
  107          (e.g., "today's weather report for Los Angeles"), and a
  108          collection of other resources.  Not all resources are network
  109          "retrievable"; e.g., human beings, corporations, and bound
  110          books in a library can also be considered resources.
  111 
  119          The resource is the conceptual mapping to an entity or set of
  120          entities, not necessarily the entity which corresponds to that
  121          mapping at any particular instance in time.  Thus, a resource
  122          can remain constant even when its content---the entities to
  123          which it currently corresponds---changes over time, provided
  124          that the conceptual mapping is not changed in the process.
  125 
  126       Identifier
  127          An identifier is an object that can act as a reference to
  128          something that has identity.  In the case of URI, the object is
  129          a sequence of characters with a restricted syntax.
  130 
  131    Having identified a resource, a system may perform a variety of
  132    operations on the resource, as might be characterized by such words
  133    as `access', `update', `replace', or `find attributes'.
  134 
  135 1.2. URI, URL, and URN
  136 
  137    A URI can be further classified as a locator, a name, or both.  The
  138    term "Uniform Resource Locator" (URL) refers to the subset of URI
  139    that identify resources via a representation of their primary access
  140    mechanism (e.g., their network "location"), rather than identifying
  141    the resource by name or by some other attribute(s) of that resource.
  142    The term "Uniform Resource Name" (URN) refers to the subset of URI
  143    that are required to remain globally unique and persistent even when
  144    the resource ceases to exist or becomes unavailable.
  145 
  146    The URI scheme (Section 3.1) defines the namespace of the URI, and
  147    thus may further restrict the syntax and semantics of identifiers
  148    using that scheme.  This specification defines those elements of the
  149    URI syntax that are either required of all URI schemes or are common
  150    to many URI schemes.  It thus defines the syntax and semantics that
  151    are needed to implement a scheme-independent parsing mechanism for
  152    URI references, such that the scheme-dependent handling of a URI can
  153    be postponed until the scheme-dependent semantics are needed.  We use
  154    the term URL below when describing syntax or semantics that only
  155    apply to locators.
  156 
  157    Although many URL schemes are named after protocols, this does not
  158    imply that the only way to access the URL's resource is via the named
  159    protocol.  Gateways, proxies, caches, and name resolution services
  160    might be used to access some resources, independent of the protocol
  161    of their origin, and the resolution of some URL may require the use
  162    of more than one protocol (e.g., both DNS and HTTP are typically used
  163    to access an "http" URL's resource when it can't be found in a local
  164    cache).
  165 
  175    A URN differs from a URL in that its primary purpose is persistent
  176    labeling of a resource with an identifier.  That identifier is drawn
  177    from one of a set of defined namespaces, each of which has its own
  178    set name structure and assignment procedures.  The "urn" scheme has
  179    been reserved to establish the requirements for a standardized URN
  180    namespace, as defined in "URN Syntax" [RFC2141] and its related
  181    specifications.
  182 
  183    Most of the examples in this specification demonstrate URL, since
  184    they allow the most varied use of the syntax and often have a
  185    hierarchical namespace.  A parser of the URI syntax is capable of
  186    parsing both URL and URN references as a generic URI; once the scheme
  187    is determined, the scheme-specific parsing can be performed on the
  188    generic URI components.  In other words, the URI syntax is a superset
  189    of the syntax of all URI schemes.
  190 
  191 1.3. Example URI
  192 
  193    The following examples illustrate URI that are in common use.
  194 
  195    ftp://ftp.is.co.za/rfc/rfc1808.txt
  196       -- ftp scheme for File Transfer Protocol services
  197 
  198    gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles
  199       -- gopher scheme for Gopher and Gopher+ Protocol services
  200 
  201    http://www.math.uio.no/faq/compression-faq/part1.html
  202       -- http scheme for Hypertext Transfer Protocol services
  203 
  204    mailto:mduerst@ifi.unizh.ch
  205       -- mailto scheme for electronic mail addresses
  206 
  207    news:comp.infosystems.www.servers.unix
  208       -- news scheme for USENET news groups and articles
  209 
  210    telnet://melvyl.ucop.edu/
  211       -- telnet scheme for interactive services via the TELNET Protocol
  212 
  213 1.4. Hierarchical URI and Relative Forms
  214 
  215    An absolute identifier refers to a resource independent of the
  216    context in which the identifier is used.  In contrast, a relative
  217    identifier refers to a resource by describing the difference within a
  218    hierarchical namespace between the current context and an absolute
  219    identifier of the resource.
  220 
  231    Some URI schemes support a hierarchical naming system, where the
  232    hierarchy of the name is denoted by a "/" delimiter separating the
  233    components in the scheme. This document defines a scheme-independent
  234    `relative' form of URI reference that can be used in conjunction with
  235    a `base' URI (of a hierarchical scheme) to produce another URI. The
  236    syntax of hierarchical URI is described in Section 3; the relative
  237    URI calculation is described in Section 5.
  238 
  239 1.5. URI Transcribability
  240 
  241    The URI syntax was designed with global transcribability as one of
  242    its main concerns. A URI is a sequence of characters from a very
  243    limited set, i.e., the letters of the basic Latin alphabet, digits,
  244    and a few special characters.  A URI may be represented in a variety
  245    of ways: e.g., ink on paper, pixels on a screen, or a sequence of
  246    octets in a coded character set.  The interpretation of a URI depends
  247    only on the characters used and not how those characters are
  248    represented in a network protocol.
  249 
  250    The goal of transcribability can be described by a simple scenario.
  251    Imagine two colleagues, Sam and Kim, sitting in a pub at an
  252    international conference and exchanging research ideas.  Sam asks Kim
  253    for a location to get more information, so Kim writes the URI for the
  254    research site on a napkin.  Upon returning home, Sam takes out the
  255    napkin and types the URI into a computer, which then retrieves the
  256    information to which Kim referred.
  257 
  258    There are several design concerns revealed by the scenario:
  259 
  260       o  A URI is a sequence of characters, which is not always
  261          represented as a sequence of octets.
  262 
  263       o  A URI may be transcribed from a non-network source, and thus
  264          should consist of characters that are most likely to be able to
  265          be typed into a computer, within the constraints imposed by
  266          keyboards (and related input devices) across languages and
  267          locales.
  268 
  269       o  A URI often needs to be remembered by people, and it is easier
  270          for people to remember a URI when it consists of meaningful
  271          components.
  272 
  273    These design concerns are not always in alignment.  For example, it
  274    is often the case that the most meaningful name for a URI component
  275    would require characters that cannot be typed into some systems.  The
  276    ability to transcribe the resource identifier from one medium to
  277    another was considered more important than having its URI consist of
  278    the most meaningful of components.  In local and regional contexts
  287    and with improving technology, users might benefit from being able to
  288    use a wider range of characters; such use is not defined in this
  289    document.
  290 
  291 1.6. Syntax Notation and Common Elements
  292 
  293    This document uses two conventions to describe and define the syntax
  294    for URI.  The first, called the layout form, is a general description
  295    of the order of components and component separators, as in
  296 
  297       <first>/<second>;<third>?<fourth>
  298 
  299    The component names are enclosed in angle-brackets and any characters
  300    outside angle-brackets are literal separators.  Whitespace should be
  301    ignored.  These descriptions are used informally and do not define
  302    the syntax requirements.
  303 
  304    The second convention is a BNF-like grammar, used to define the
  305    formal URI syntax.  The grammar is that of [RFC822], except that "|"
  306    is used to designate alternatives.  Briefly, rules are separated from
  307    definitions by an equal "=", indentation is used to continue a rule
  308    definition over more than one line, literals are quoted with "",
  309    parentheses "(" and ")" are used to group elements, optional elements
  310    are enclosed in "[" and "]" brackets, and elements may be preceded
  311    with <n>* to designate n or more repetitions of the following
  312    element; n defaults to 0.
  313 
  314    Unlike many specifications that use a BNF-like grammar to define the
  315    bytes (octets) allowed by a protocol, the URI grammar is defined in
  316    terms of characters.  Each literal in the grammar corresponds to the
  317    character it represents, rather than to the octet encoding of that
  318    character in any particular coded character set.  How a URI is
  319    represented in terms of bits and bytes on the wire is dependent upon
  320    the character encoding of the protocol used to transport it, or the
  321    charset of the document which contains it.
  322 
  323    The following definitions are common to many elements:
  324 
  325       alpha    = lowalpha | upalpha
  326 
  327       lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
  328                  "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
  329                  "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
  330 
  331       upalpha  = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
  332                  "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
  333                  "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
  334 
  343       digit    = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
  344                  "8" | "9"
  345 
  346       alphanum = alpha | digit
  347 
  348    The complete URI syntax is collected in Appendix A.
  349 
  350 2. URI Characters and Escape Sequences
  351 
  352    URI consist of a restricted set of characters, primarily chosen to
  353    aid transcribability and usability both in computer systems and in
  354    non-computer communications. Characters used conventionally as
  355    delimiters around URI were excluded.  The restricted set of
  356    characters consists of digits, letters, and a few graphic symbols,
  357    chosen from those common to most of the character encodings and
  358    input facilities available to Internet users.
  359 
  360       uric          = reserved | unreserved | escaped
  361 
  362    Within a URI, characters are either used as delimiters, or to
  363    represent strings of data (octets) within the delimited portions.
  364    Octets are either represented directly by a character (using the US-
  365    ASCII character for that octet [ASCII]) or by an escape encoding.
  366    This representation is elaborated below.
  367 
  368 2.1. URI and Non-ASCII Characters
  369 
  370    The relationship between URI and characters has been a source of
  371    confusion for characters that are not part of US-ASCII. To describe
  372    the relationship, it is useful to distinguish between a "character"
  373    (as a distinguishable semantic entity) and an "octet" (an 8-bit
  374    byte). There are two mappings, one from URI characters to octets, and
  375    a second from octets to original characters:
  376 
  377    URI character sequence->octet sequence->original character sequence
  378 
  379    A URI is represented as a sequence of characters, not as a sequence
  380    of octets. That is because URI might be "transported" by means that
  381    are not through a computer network, e.g., printed on paper, read over
  382    the radio, etc.
  383 
  384    A URI scheme may define a mapping from URI characters to octets;
  385    whether this is done depends on the scheme. Commonly, within a
  386    delimited component of a URI, a sequence of characters may be used to
  387    represent a sequence of octets. For example, the character "a"
  388    represents the octet 97 (decimal), while the character sequence "%",
  389    "0", "a" represents the octet 10 (decimal).
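
   As a rough illustration of this first mapping, the following Python
   sketch converts the characters of one URI component into the octets
   they stand for.  It relies on unquote_to_bytes from the Python
   standard library module urllib.parse; the function name
   uri_chars_to_octets is an assumption made for this example and is
   not defined by this specification.

```python
from urllib.parse import unquote_to_bytes


def uri_chars_to_octets(component: str) -> bytes:
    """Map the characters of one URI component to the octets they represent."""
    return unquote_to_bytes(component)


# The two cases given in the text above.
print(list(uri_chars_to_octets("a")))    # [97]
print(list(uri_chars_to_octets("%0a")))  # [10]
```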
  390 
  399    There is a second translation for some resources: the sequence of
  400    octets defined by a component of the URI is subsequently used to
  401    represent a sequence of characters. A 'charset' defines this mapping.
  402    There are many charsets in use in Internet protocols. For example,
  403    UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences
  404    of characters in the repertoire of ISO 10646.
  405 
  406    In the simplest case, the original character sequence contains only
  407    characters that are defined in US-ASCII, and the two levels of
  408    mapping are simple and easily invertible: each 'original character'
  409    is represented as the octet for the US-ASCII code for it, which is,
  410    in turn, represented as either the US-ASCII character, or else the
  411    "%" escape sequence for that octet.
  412 
  413    For original character sequences that contain non-ASCII characters,
  414    however, the situation is more difficult. Internet protocols that
  415    transmit octet sequences intended to represent character sequences
  416    are expected to provide some way of identifying the charset used, if
  417    there might be more than one [RFC2277].  However, there is currently
  418    no provision within the generic URI syntax to accomplish this
  419    identification. An individual URI scheme may require a single
  420    charset, define a default charset, or provide a way to indicate the
  421    charset used.
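
   The two mappings, and the ambiguity that arises when the charset is
   not known, can be sketched in Python as follows.  This is only an
   illustration: decode_component and the sample string are
   hypothetical, and the choice of charsets is an assumption, since the
   generic syntax does not say which one applies.

```python
from urllib.parse import unquote_to_bytes


def decode_component(component: str, charset: str) -> str:
    """Apply both mappings to one URI component."""
    octets = unquote_to_bytes(component)  # URI characters -> octets
    return octets.decode(charset)         # octets -> original characters


escaped = "caf%C3%A9"  # hypothetical component value

# The same URI characters yield different original characters
# depending on the charset the receiver assumes.
print(decode_component(escaped, "utf-8"))       # café
print(decode_component(escaped, "iso-8859-1"))  # cafÃ©
```

   The octet sequence is identical in both calls; only the second
   mapping differs, which is why a scheme needs to fix or signal the
   charset.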
  422 
  423    It is expected that a systematic treatment of character encoding
  424    within URI will be developed as a future modification of this
  425    specification.
  426 
  427 2.2. Reserved Characters
  428 
  429    Many URI include components consisting of, or delimited by, certain
  430    special characters.  These characters are called "reserved", since
  431    their usage within the URI component is limited to their reserved
  432    purpose.  If the data for a URI component would conflict with the
  433    reserved purpose, then the conflicting data must be escaped before
  434    forming the URI.
  435 
  436       reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
  437                     "$" | ","
  438 
  439    The "reserved" syntax class above refers to those characters that are
  440    allowed within a URI, but which may not be allowed within a
  441    particular component of the generic URI syntax; they are used as
  442    delimiters of the components described in Section 3.
  443 
  455    Characters in the "reserved" set are not reserved in all contexts.
  456    The set of characters actually reserved within any given URI
  457    component is defined by that component. In general, a character is
  458    reserved if replacing it with its escaped US-ASCII encoding changes
  459    the semantics of the URI.
  460 
  461 2.3. Unreserved Characters
  462 
  463    Data characters that are allowed in a URI but do not have a reserved
  464    purpose are called unreserved.  These include upper and lower case
  465    letters, decimal digits, and a limited set of punctuation marks and
  466    symbols.
  467 
  468       unreserved  = alphanum | mark
  469 
  470       mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
  471 
  472    Unreserved characters can be escaped without changing the semantics
  473    of the URI, but this should not be done unless the URI is being used
  474    in a context that does not allow the unescaped character to appear.
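
   The "reserved", "mark", and "unreserved" classes translate directly
   into character sets.  The Python sketch below is one way to express
   them; the names RESERVED, MARK, UNRESERVED, and classify are
   assumptions made for this example.  Anything outside both classes
   (for instance "%" or a space) is reported as "other".

```python
import string

RESERVED = set(";/?:@&=+$,")
MARK = set("-_.!~*'()")
ALPHANUM = set(string.ascii_letters + string.digits)
UNRESERVED = ALPHANUM | MARK


def classify(ch: str) -> str:
    """Return the syntax class of a single character."""
    if ch in RESERVED:
        return "reserved"
    if ch in UNRESERVED:
        return "unreserved"
    return "other"


for ch in "/~a %":
    print(repr(ch), classify(ch))
```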
  475 
  476 2.4. Escape Sequences
  477 
  478    Data must be escaped if it does not have a representation using an
  479    unreserved character; this includes data that does not correspond to
  480    a printable character of the US-ASCII coded character set, or that
  481    corresponds to any US-ASCII character that is disallowed, as
  482    explained below.
  483 
  484 2.4.1. Escaped Encoding
  485 
  486    An escaped octet is encoded as a character triplet, consisting of the
  487    percent character "%" followed by the two hexadecimal digits
  488    representing the octet code. For example, "%20" is the escaped
  489    encoding for the US-ASCII space character.
  490 
  491       escaped     = "%" hex hex
  492       hex         = digit | "A" | "B" | "C" | "D" | "E" | "F" |
  493                             "a" | "b" | "c" | "d" | "e" | "f"
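
   A minimal Python sketch of the triplet form follows.  The helper
   names escape_octet and is_escaped are hypothetical.  The grammar
   accepts hexadecimal digits in either case; emitting upper case here
   is a choice made for the example, not a requirement stated above.

```python
HEX_DIGITS = "0123456789ABCDEFabcdef"


def escape_octet(octet: int) -> str:
    """Encode one octet as "%" followed by two hexadecimal digits."""
    if not 0 <= octet <= 255:
        raise ValueError("an octet must be in the range 0..255")
    return "%{:02X}".format(octet)


def is_escaped(triplet: str) -> bool:
    """Check a string against the rule: escaped = "%" hex hex."""
    return (
        len(triplet) == 3
        and triplet[0] == "%"
        and triplet[1] in HEX_DIGITS
        and triplet[2] in HEX_DIGITS
    )


print(escape_octet(0x20))  # %20, the space character
print(is_escaped("%7e"))   # True
print(is_escaped("%G0"))   # False
```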
  494 
  495 2.4.2. When to Escape and Unescape
  496 
  497    A URI is always in an "escaped" form, since escaping or unescaping a
  498    completed URI might change its semantics.  Normally, the only time
  499    escape encodings can safely be made is when the URI is being created
  500    from its component parts; each component may have its own set of
  501    characters that are reserved, so only the mechanism responsible for
  502    generating or interpreting that component can determine whether or
  511    not escaping a character will change its semantics. Likewise, a URI
  512    must be separated into its components before the escaped characters
  513    within those components can be safely decoded.
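
   The ordering matters in practice.  The Python sketch below builds a
   path from segments, one of which contains a "/" as data, and then
   recovers the segments in the right and in the wrong order.  The
   segment values are hypothetical.  Note also that quote from
   urllib.parse uses a slightly different unreserved set than the
   "mark" characters of this document (it would escape "!", "*", "'",
   "(" and ")"), which does not affect this example.

```python
from urllib.parse import quote, unquote

segments = ["reports", "2024/Q1", "summary.txt"]  # hypothetical data

# Escape each segment on its own, then join with the "/" delimiter.
path = "/" + "/".join(quote(s, safe="") for s in segments)
print(path)  # /reports/2024%2FQ1/summary.txt

# Right: separate into components first, then decode each one.
right = [unquote(s) for s in path.split("/")[1:]]

# Wrong: decoding the whole path first turns data into a delimiter.
wrong = unquote(path).split("/")[1:]

print(right)  # ['reports', '2024/Q1', 'summary.txt']
print(wrong)  # ['reports', '2024', 'Q1', 'summary.txt']
```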
  514 
  515    In some cases, data that could be represented by an unreserved
  516    character may appear escaped; for example, some of the unreserved
  517    "mark" characters are automatically escaped by some systems.  If the
  518    given URI scheme defines a canonicalization algorithm, then
  519    unreserved characters may be unescaped according to that algorithm.
  520    For example, "%7e" is sometimes used instead of "~" in an http URL
  521    path, but the two are equivalent for an http URL.
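
   This document does not itself define a canonicalization algorithm.
   Purely as an illustration of what a scheme-defined one might do, the
   Python sketch below replaces escape triplets that encode unreserved
   characters with the characters themselves and leaves every other
   triplet untouched.  The function name unescape_unreserved and the
   example.com URL are assumptions made for this example.

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")
ESCAPED = re.compile(r"%([0-9A-Fa-f]{2})")


def unescape_unreserved(uri: str) -> str:
    """Unescape only those triplets that encode an unreserved character."""

    def replace(match):
        ch = chr(int(match.group(1), 16))
        # Keep the triplet unless it stands for an unreserved character.
        return ch if ch in UNRESERVED else match.group(0)

    return ESCAPED.sub(replace, uri)


print(unescape_unreserved("http://example.com/%7euser/a%2Fb"))
# http://example.com/~user/a%2Fb
```

   The escaped slash "%2F" is preserved because "/" is reserved in a
   path, so unescaping it would change the meaning of the URI.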
  522 
  523    Because the percent "%" character always has the reserved purpose of
  524    being the escape indicator, it must be escaped as "%25" in order to
  525    be used as data within a URI.  Implementers should be careful not to
  526    escape or unescape the same string more than once, since unescaping
  527    an already unescaped string might lead to misinterpreting a percent
  528    data character as another escaped character, or vice versa in the
  529    case of escaping an already escaped string.
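
   Both failure modes are easy to reproduce.  The following Python
   sketch uses quote and unquote from urllib.parse; the sample strings
   are hypothetical.

```python
from urllib.parse import quote, unquote

# Escaping twice: the "%" produced by the first pass is escaped again.
data = "100%"
once = quote(data, safe="")
twice = quote(once, safe="")
print(once)   # 100%25
print(twice)  # 100%2525

# Unescaping twice: the data is the three characters "%41".
stored = "%2541"                 # correct escaped form of "%41"
print(unquote(stored))           # %41  (correct)
print(unquote(unquote(stored)))  # A    (wrong: data misread as an escape)
```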
  530 
  531 2.4.3. Excluded US-ASCII Characters
  532 
  533    Although they are disallowed within the URI syntax, we include here a
  534    description of those US-ASCII characters that have been excluded and
  535    the reasons for their exclusion.
  536 
  537    The control characters in the US-ASCII coded character set are not
  538    used within a URI, both because they are non-printable and because
  539    they are likely to be misinterpreted by some control mechanisms.
  540 
  541    control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
  542 
  543    The space character is excluded because significant spaces may
  544    disappear and insignificant spaces may be introduced when URI are
  545    transcribed or typeset or subjected to the treatment of word-
  546    processing programs.  Whitespace is also used to delimit URI in many
  547    contexts.
  548 
  549    space       = <US-ASCII coded character 20 hexadecimal>
  550 
  551    The angle-bracket "<" and ">" and double-quote (") characters are
  552    excluded because they are often used as the delimiters around URI in
  553    text documents and protocol fields.  The character "#" is excluded
  554    because it is used to delimit a URI from a fragment identifier in URI
  555    references (Section 4). The percent character "%" is excluded because
  556    it is used for the encoding of escaped characters.
  557 
  558    delims      = "<" | ">" | "#" | "%" | <">
  559 
  567    Other characters are excluded because gateways and other transport
  568    agents are known to sometimes modify such characters, or they are
  569    used as delimiters.
  570 
  571    unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
  572 
  573    Data corresponding to excluded characters must be escaped in order to
  574    be properly represented within a URI.
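
   The excluded classes can be collected into one set and applied to
   data before it is placed in a URI.  The Python sketch below is a
   simplified illustration: the names are assumptions, it handles only
   US-ASCII input, it does not escape reserved characters, and it must
   be applied to raw data exactly once (it escapes "%" as "%25").

```python
CONTROL = {chr(c) for c in range(0x00, 0x20)} | {chr(0x7F)}
SPACE = {" "}
DELIMS = set('<>#%"')
UNWISE = set("{}|\\^[]`")
EXCLUDED = CONTROL | SPACE | DELIMS | UNWISE


def escape_excluded(data: str) -> str:
    """Escape every excluded US-ASCII character; leave the rest as-is."""
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in EXCLUDED else ch for ch in data
    )


print(escape_excluded("Los Angeles"))  # Los%20Angeles
print(escape_excluded("a<b>{c}"))      # a%3Cb%3E%7Bc%7D
```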
  575 
  576 3. URI Syntactic Components
  577 
  578    The URI syntax is dependent upon the scheme.  In general, absolute
  579    URI are written as follows:
  580 
  581       <scheme>:<scheme-specific-part>
  582 
  583    An absolute URI contains the name of the scheme being used (<scheme>)
  584    followed by a colon (":") and then a string (the <scheme-specific-
  585    part>) whose interpretation depends on the scheme.
  586 
  587    The URI syntax does not require that the scheme-specific-part have
  588    any general structure or set of semantics which is common among all
  589    URI.  However, a subset of URI do share a common syntax for
  590    representing hierarchical relationships within the namespace.  This
  591    "generic URI" syntax consists of a sequence of four main components:
  592 
  593       <scheme>://<authority><path>?<query>
  594 
  595    each of which, except <scheme>, may be absent from a particular URI.
  596    For example, some URI schemes do not allow an <authority> component,
  597    and others do not use a <query> component.
  598 
  599       absoluteURI   = scheme ":" ( hier_part | opaque_part )
  600 
  601    URI that are hierarchical in nature use the slash "/" character for
  602    separating hierarchical components.  For some file systems, a "/"
  603    character (used to denote the hierarchical structure of a URI) is the
  604    delimiter used to construct a file name hierarchy, and thus the URI
  605    path will look similar to a file pathname.  This does NOT imply that
  606    the resource is a file or that the URI maps to an actual filesystem
  607    pathname.
  608 
  609       hier_part     = ( net_path | abs_path ) [ "?" query ]
  610 
  611       net_path      = "//" authority [ abs_path ]
  612 
  613       abs_path      = "/"  path_segments
  614 
  623    URI that do not make use of the slash "/" character for separating
  624    hierarchical components are considered opaque by the generic URI
  625    parser.
  626 
  627       opaque_part   = uric_no_slash *uric
  628 
  629       uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" |
  630                       "&" | "=" | "+" | "$" | ","
  631 
  632    We use the term <path> to refer to both the <abs_path> and
  633    <opaque_part> constructs, since they are mutually exclusive for any
  634    given URI and can be parsed as a single component.
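
   The split between hierarchical and opaque URI can be sketched as a
   small Python function.  This is an illustration under stated
   assumptions, not a conforming parser: the name parse_absolute and
   the dictionary keys are hypothetical, no characters are validated,
   and fragment identifiers are not handled.

```python
def parse_absolute(uri: str) -> dict:
    """Split an absolute URI into its generic components."""
    scheme, colon, rest = uri.partition(":")
    if not colon or not scheme:
        raise ValueError("an absolute URI starts with '<scheme>:'")

    if not rest.startswith("/"):
        # No leading slash: opaque to the generic parser.
        return {"scheme": scheme, "opaque_part": rest}

    body, qmark, query = rest.partition("?")
    authority = None
    path = body
    if body.startswith("//"):
        # net_path: the authority runs up to the next "/" or the end.
        end = body.find("/", 2)
        if end == -1:
            authority, path = body[2:], ""
        else:
            authority, path = body[2:end], body[end:]

    return {
        "scheme": scheme,
        "authority": authority,
        "path": path,
        "query": query if qmark else None,
    }


print(parse_absolute("ftp://ftp.is.co.za/rfc/rfc1808.txt"))
print(parse_absolute("mailto:mduerst@ifi.unizh.ch"))
```

   For the "ftp" example the result has the authority "ftp.is.co.za"
   and the path "/rfc/rfc1808.txt"; the "mailto" example is returned as
   a scheme plus an opaque part.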
  635 
  636 3.1. Scheme Component
  637 
  638    Just as there are many different methods of access to resources,
  639    there are a variety of schemes for identifying such resources.  The
  640    URI syntax consists of a sequence of components separated by reserved
  641    characters, with the first component defining the semantics for the
  642    remainder of the URI string.
  643 
  644    Scheme names consist of a sequence of characters beginning with a
  645    lower case letter and followed by any combination of lower case
  646    letters, digits, plus ("+"), period ("."), or hyphen ("-").  For
  647    resiliency, programs interpreting URI should treat upper case letters
  648    as equivalent to lower case in scheme names (e.g., allow "HTTP" as
  649    well as "http").
  650 
  651       scheme        = alpha *( alpha | digit | "+" | "-" | "." )
  652 
  653    Relative URI references are distinguished from absolute URI in that
  654    they do not begin with a scheme name.  Instead, the scheme is
  655    inherited from the base URI, as described in Section 5.2.
  656 
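As an illustration of the scheme rule, the following Python sketch
checks a candidate scheme name against the <scheme> production and
folds upper case to lower case, as recommended above for resiliency.
The names SCHEME_RE and normalize_scheme are hypothetical.

```python
import re

# scheme = alpha *( alpha | digit | "+" | "-" | "." )
# Upper case letters are accepted here and folded to lower case below.
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*\Z")


def normalize_scheme(name):
    """Validate a scheme name and return it in lower case."""
    if not SCHEME_RE.match(name):
        raise ValueError("not a valid scheme name: %r" % name)
    return name.lower()
```

With this sketch, "HTTP" and "http" both yield "http", while a name
starting with a digit is rejected.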
  657 3.2. Authority Component
  658 
  659    Many URI schemes include a top hierarchical element for a naming
  660    authority, such that the namespace defined by the remainder of the
  661    URI is governed by that authority.  This authority component is
  662    typically defined by an Internet-based server or a scheme-specific
  663    registry of naming authorities.
  664 
  665       authority     = server | reg_name
  666 
  667    The authority component is preceded by a double slash "//" and is
  668    terminated by the next slash "/", question-mark "?", or by the end of
  669    the URI.  Within the authority component, the characters ";", ":",
  670    "@", "?", and "/" are reserved.
  671 
  679    An authority component is not required for a URI scheme to make use
  680    of relative references.  A base URI without an authority component
  681    implies that any relative reference will also be without an authority
  682    component.
  683 
  684 3.2.1. Registry-based Naming Authority
  685 
  686    The structure of a registry-based naming authority is specific to the
  687    URI scheme, but constrained to the allowed characters for an
  688    authority component.
  689 
  690       reg_name      = 1*( unreserved | escaped | "$" | "," |
  691                           ";" | ":" | "@" | "&" | "=" | "+" )
  692 
  693 3.2.2. Server-based Naming Authority
  694 
  695    URL schemes that involve the direct use of an IP-based protocol to a
  696    specified server on the Internet use a common syntax for the server
  697    component of the URI's scheme-specific data:
  698 
  699       <userinfo>@<host>:<port>
  700 
  701    where <userinfo> may consist of a user name and, optionally, scheme-
  702    specific information about how to gain authorization to access the
  703    server.  The parts "<userinfo>@" and ":<port>" may be omitted.
  704 
  705       server        = [ [ userinfo "@" ] hostport ]
  706 
  707    The user information, if present, is followed by a commercial at-sign
  708    "@".
  709 
  710       userinfo      = *( unreserved | escaped |
  711                          ";" | ":" | "&" | "=" | "+" | "$" | "," )
  712 
  713    Some URL schemes use the format "user:password" in the userinfo
  714    field. This practice is NOT RECOMMENDED, because the passing of
  715    authentication information in clear text (such as URI) has proven to
  716    be a security risk in almost every case where it has been used.
  717 
  718    The host is a domain name of a network host, or its IPv4 address as a
  719    set of four decimal digit groups separated by ".".  Literal IPv6
  720    addresses are not supported.
  721 
  722       hostport      = host [ ":" port ]
  723       host          = hostname | IPv4address
  724       hostname      = *( domainlabel "." ) toplabel [ "." ]
  725       domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
  726       toplabel      = alpha | alpha *( alphanum | "-" ) alphanum
  735       IPv4address   = 1*digit "." 1*digit "." 1*digit "." 1*digit
  736       port          = *digit
  737 
  738    Hostnames take the form described in Section 3 of [RFC1034] and
  739    Section 2.1 of [RFC1123]: a sequence of domain labels separated by
  740    ".", each domain label starting and ending with an alphanumeric
  741    character and possibly also containing "-" characters.  The rightmost
  742    domain label of a fully qualified domain name will never start with a
  743    digit, thus syntactically distinguishing domain names from IPv4
  744    addresses, and may be followed by a single "." if it is necessary to
  745    distinguish between the complete domain name and any local domain.
  746    To actually be "Uniform" as a resource locator, a URL hostname should
  747    be a fully qualified domain name.  In practice, however, the host
  748    component may be a local domain literal.
  749 
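The syntactic distinction between domain names and IPv4 addresses
can be sketched in Python as follows.  The regular expressions are a
direct transcription of the <hostname> and <IPv4address> productions;
the names used (classify_host and the constants) are hypothetical.
Note that, like the grammar, the sketch does not limit each decimal
group to the range 0-255.

```python
import re

# domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
DOMAINLABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
# toplabel    = alpha | alpha *( alphanum | "-" ) alphanum
TOPLABEL = r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
# hostname    = *( domainlabel "." ) toplabel [ "." ]
HOSTNAME_RE = re.compile(r"(?:%s\.)*%s\.?\Z" % (DOMAINLABEL, TOPLABEL))
# IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit
IPV4_RE = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\Z")


def classify_host(host):
    """Return 'IPv4address', 'hostname', or None if neither matches."""
    if IPV4_RE.match(host):
        return "IPv4address"
    if HOSTNAME_RE.match(host):
        return "hostname"
    return None
```

Because the rightmost label of a hostname must start with a letter,
no string can match both patterns.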
  750       Note: A suitable representation for including a literal IPv6
  751       address as the host part of a URL is desired, but has not yet been
  752       determined or implemented in practice.
  753 
  754    The port is the network port number for the server.  Most schemes
  755    designate protocols that have a default port number.  Another port
  756    number may optionally be supplied, in decimal, separated from the
  757    host by a colon.  If the port is omitted, the default port number is
  758    assumed.
  759 
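A minimal Python sketch of splitting the <server> component into its
parts is shown below.  The function name split_server is
hypothetical.  The sketch returns None for a part whose delimiter is
absent, so that an omitted port (None) can be told apart from an
empty one (""), which the <port> production also allows.

```python
def split_server(server):
    """Split '<userinfo>@<host>:<port>' into (userinfo, host, port)."""
    userinfo = None
    hostport = server
    if "@" in server:
        # "@" is not allowed unescaped in userinfo or host, so the
        # first occurrence is the delimiter.
        userinfo, hostport = server.split("@", 1)
    host, colon, port = hostport.partition(":")
    if not colon:
        port = None          # no ":" present: the default port applies
    elif any(ch not in "0123456789" for ch in port):
        raise ValueError("port must consist of decimal digits")
    return userinfo, host, port
```

For example, "anonymous@ftp.example.org:21" yields the user
information "anonymous", the host "ftp.example.org", and the port
"21".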
  760 3.3. Path Component
  761 
  762    The path component contains data, specific to the authority (or the
  763    scheme if there is no authority component), identifying the resource
  764    within the scope of that scheme and authority.
  765 
  766       path          = [ abs_path | opaque_part ]
  767 
  768       path_segments = segment *( "/" segment )
  769       segment       = *pchar *( ";" param )
  770       param         = *pchar
  771 
  772       pchar         = unreserved | escaped |
  773                       ":" | "@" | "&" | "=" | "+" | "$" | ","
  774 
  775    The path may consist of a sequence of path segments separated by a
  776    single slash "/" character.  Within a path segment, the characters
  777    "/", ";", "=", and "?" are reserved.  Each path segment may include a
  778    sequence of parameters, indicated by the semicolon ";" character.
  779    The parameters are not significant to the parsing of relative
  780    references.
  781 
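The structure of <abs_path> can be made concrete with a small Python
sketch that separates each path segment from its parameters.  The
function name split_abs_path is hypothetical, and the sketch does not
check that each character is a valid <pchar>.

```python
def split_abs_path(abs_path):
    """Return a list of (segment, [params]) pairs for an abs_path."""
    if not abs_path.startswith("/"):
        raise ValueError("abs_path must begin with '/'")
    result = []
    # path_segments = segment *( "/" segment )
    for segment in abs_path[1:].split("/"):
        # segment = *pchar *( ";" param )
        name, *params = segment.split(";")
        result.append((name, params))
    return result
```

For instance, "/a/b;v=1;x/c" splits into the segments "a", "b" (with
parameters "v=1" and "x"), and "c".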
  791 3.4. Query Component
  792 
  793    The query component is a string of information to be interpreted by
  794    the resource.
  795 
  796       query         = *uric
  797 
  798    Within a query component, the characters ";", "/", "?", ":", "@",
  799    "&", "=", "+", ",", and "$" are reserved.
  800 
  801 4. URI References
  802 
  803    The term "URI-reference" is used here to denote the common usage of a
  804    resource identifier.  A URI reference may be absolute or relative,
  805    and may have additional information attached in the form of a
  806    fragment identifier.  However, "the URI" that results from such a
  807    reference includes only the absolute URI after the fragment
  808    identifier (if any) is removed and after any relative URI is resolved
  809    to its absolute form.  Although it is possible to limit the
  810    discussion of URI syntax and semantics to that of the absolute
  811    result, most usage of URI is within general URI references, and it is
  812    impossible to obtain the URI from such a reference without also
  813    parsing the fragment and resolving the relative form.
  814 
  815       URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
  816 
  817    The syntax for relative URI is a shortened form of that for absolute
  818    URI, where some prefix of the URI is missing and certain path
  819    components ("." and "..") have a special meaning when, and only when,
  820    interpreting a relative path.  The relative URI syntax is defined in
  821    Section 5.
  822 
  823 4.1. Fragment Identifier
  824 
  825    When a URI reference is used to perform a retrieval action on the
  826    identified resource, the optional fragment identifier, separated from
  827    the URI by a crosshatch ("#") character, consists of additional
  828    reference information to be interpreted by the user agent after the
  829    retrieval action has been successfully completed.  As such, it is not
  830    part of a URI, but is often used in conjunction with a URI.
  831 
  832       fragment      = *uric
  833 
  834    The semantics of a fragment identifier is a property of the data
  835    resulting from a retrieval action, regardless of the type of URI used
  836    in the reference.  Therefore, the format and interpretation of
  837    fragment identifiers are dependent on the media type [RFC2046] of the
  838    retrieval result.  The character restrictions described in Section 2
  847    for URI also apply to the fragment in a URI-reference.  Individual
  848    media types may define additional restrictions or structure within
  849    the fragment for specifying different types of "partial views" that
  850    can be identified within that media type.
  851 
  852    A fragment identifier is only meaningful when a URI reference is
  853    intended for retrieval and the result of that retrieval is a document
  854    for which the identified fragment is consistently defined.
  855 
  856 4.2. Same-document References
  857 
  858    A URI reference that does not contain a URI is a reference to the
  859    current document.  In other words, an empty URI reference within a
  860    document is interpreted as a reference to the start of that document,
  861    and a reference containing only a fragment identifier is a reference
  862    to the identified fragment of that document.  Traversal of such a
  863    reference should not result in an additional retrieval action.
  864    However, if the URI reference occurs in a context that is always
  865    intended to result in a new request, as in the case of HTML's FORM
  866    element, then an empty URI reference represents the base URI of the
  867    current document and should be replaced by that URI when transformed
  868    into a request.
  869 
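The separation of the fragment identifier (Section 4.1) and the
same-document rule above can be sketched in Python as follows.  Both
function names are hypothetical.  A fragment of None means that no
"#" was present, which is distinct from an empty fragment.

```python
def split_fragment(reference):
    """Split a URI reference into (uri_part, fragment_or_None)."""
    uri_part, hash_sign, fragment = reference.partition("#")
    return uri_part, (fragment if hash_sign else None)


def is_same_document_reference(reference):
    """True if the reference contains no URI, only an optional fragment."""
    uri_part, _ = split_fragment(reference)
    return uri_part == ""
```

Under this sketch, both "" and "#section2" are same-document
references, while "other.html#section2" is not.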
  870 4.3. Parsing a URI Reference
  871 
  872    A URI reference is typically parsed according to the four main
  873    components and fragment identifier in order to determine what
  874    components are present and whether the reference is relative or
  875    absolute.  The individual components are then parsed for their
  876    subparts and, if not opaque, to verify their validity.
  877 
  878    Although the BNF defines what is allowed in each component, it is
  879    ambiguous in terms of differentiating between an authority component
  880    and a path component that begins with two slash characters.  The
  881    greedy algorithm is used for disambiguation: the left-most matching
  882    rule soaks up as much of the URI reference string as it is capable of
  883    matching.  In other words, the authority component wins.
  884 
  885    Readers familiar with regular expressions should see Appendix B for a
  886    concrete parsing example and test oracle.
  887 
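As a concrete illustration of the greedy parse, the Python sketch
below applies the regular expression given in Appendix B of this
document.  The wrapper function parse_reference and the dictionary
keys are hypothetical.  A value of None means the component is
undefined, that is, its separator did not appear in the reference.

```python
import re

# Regular expression from Appendix B; groups 2, 4, 5, 7, and 9 hold the
# scheme, authority, path, query, and fragment respectively.
URI_REFERENCE_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def parse_reference(reference):
    """Break a URI reference into its five potential components."""
    m = URI_REFERENCE_RE.match(reference)  # every group is optional
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }
```

For a reference such as "//a//b", the authority is "a" and the path
is "//b", which shows the authority component winning the ambiguity
described above.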
  888 5. Relative URI References
  889 
  890    It is often the case that a group or "tree" of documents has been
  891    constructed to serve a common purpose; the vast majority of URI in
  892    these documents point to resources within the tree rather than
  903    outside of it.  Similarly, documents located at a particular site are
  904    much more likely to refer to other resources at that site than to
  905    resources at remote sites.
  906 
  907    Relative addressing of URI allows document trees to be partially
  908    independent of their location and access scheme.  For instance, it is
  909    possible for a single set of hypertext documents to be simultaneously
  910    accessible and traversable via each of the "file", "http", and "ftp"
  911    schemes if the documents refer to each other using relative URI.
  912    Furthermore, such document trees can be moved, as a whole, without
  913    changing any of the relative references.  Experience within the WWW
  914    has demonstrated that the ability to perform relative referencing is
  915    necessary for the long-term usability of embedded URI.
  916 
  917    The syntax for relative URI takes advantage of the <hier_part> syntax
  918    of <absoluteURI> (Section 3) in order to express a reference that is
  919    relative to the namespace of another hierarchical URI.
  920 
  921       relativeURI   = ( net_path | abs_path | rel_path ) [ "?" query ]
  922 
  923    A relative reference beginning with two slash characters is termed a
  924    network-path reference, as defined by <net_path> in Section 3.  Such
  925    references are rarely used.
  926 
  927    A relative reference beginning with a single slash character is
  928    termed an absolute-path reference, as defined by <abs_path> in
  929    Section 3.
  930 
  931    A relative reference that does not begin with a scheme name or a
  932    slash character is termed a relative-path reference.
  933 
  934       rel_path      = rel_segment [ abs_path ]
  935 
  936       rel_segment   = 1*( unreserved | escaped |
  937                           ";" | "@" | "&" | "=" | "+" | "$" | "," )
  938 
  939    Within a relative-path reference, the complete path segments "." and
  940    ".." have special meanings: "the current hierarchy level" and "the
  941    level above this hierarchy level", respectively.  Although this is
  942    very similar to their use within Unix-based filesystems to indicate
  943    directory levels, these path components are only considered special
  944    when resolving a relative-path reference to its absolute form
  945    (Section 5.2).
  946 
  947    Authors should be aware that a path segment which contains a colon
  948    character cannot be used as the first segment of a relative URI path
  949    (e.g., "this:that"), because it would be mistaken for a scheme name.
  959    It is therefore necessary to precede such segments with other
  960    segments (e.g., "./this:that") in order for them to be referenced as
  961    a relative path.
  962 
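The colon rule above lends itself to a small Python sketch.  The
function name safe_relative_path is hypothetical.  It prefixes "./"
only when the first segment of a relative path contains a colon, and
otherwise returns the path unchanged.

```python
def safe_relative_path(path):
    """Guard a relative path whose first segment contains a colon."""
    first_segment = path.split("/", 1)[0]
    if ":" in first_segment:
        # Without the prefix, "this:that" would parse as scheme "this".
        return "./" + path
    return path
```

A colon in a later segment, as in "a/b:c", needs no such treatment
because the first segment already rules out a scheme name.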
  963    It is not necessary for all URI within a given scheme to be
  964    restricted to the <hier_part> syntax, since the hierarchical
  965    properties of that syntax are only necessary when relative URI are
  966    used within a particular document.  Documents can only make use of
  967    relative URI when their base URI fits within the <hier_part> syntax.
  968    It is assumed that any document which contains a relative reference
  969    will also have a base URI that obeys the syntax.  In other words,
  970    relative URI cannot be used within a document that has an unsuitable
  971    base URI.
  972 
  973    Some URI schemes do not allow a hierarchical syntax matching the
  974    <hier_part> syntax, and thus cannot use relative references.
  975 
  976 5.1. Establishing a Base URI
  977 
  978    The term "relative URI" implies that there exists some absolute "base
  979    URI" against which the relative reference is applied.  Indeed, the
  980    base URI is necessary to define the semantics of any relative URI
  981    reference; without it, a relative reference is meaningless.  In order
  982    for relative URI to be usable within a document, the base URI of that
  983    document must be known to the parser.
  984 
  985    The base URI of a document can be established in one of four ways,
  986    listed below in order of precedence.  The order of precedence can be
  987    thought of in terms of layers, where the innermost defined base URI
  988    has the highest precedence.  This can be visualized graphically as:
  989 
  990       .----------------------------------------------------------.
  991       |  .----------------------------------------------------.  |
  992       |  |  .----------------------------------------------.  |  |
  993       |  |  |  .----------------------------------------.  |  |  |
  994       |  |  |  |  .----------------------------------.  |  |  |  |
  995       |  |  |  |  |       <relative_reference>       |  |  |  |  |
  996       |  |  |  |  `----------------------------------'  |  |  |  |
  997       |  |  |  | (5.1.1) Base URI embedded in the       |  |  |  |
  998       |  |  |  |         document's content             |  |  |  |
  999       |  |  |  `----------------------------------------'  |  |  |
 1000       |  |  | (5.1.2) Base URI of the encapsulating entity |  |  |
 1001       |  |  |         (message, document, or none).        |  |  |
 1002       |  |  `----------------------------------------------'  |  |
 1003       |  | (5.1.3) URI used to retrieve the entity            |  |
 1004       |  `----------------------------------------------------'  |
 1005       | (5.1.4) Default Base URI is application-dependent        |
 1006       `----------------------------------------------------------'
 1007 
 1015 5.1.1. Base URI within Document Content
 1016 
 1017    Within certain document media types, the base URI of the document can
 1018    be embedded within the content itself such that it can be readily
 1019    obtained by a parser.  This can be useful for descriptive documents,
 1020    such as tables of contents, which may be transmitted to others through
 1021    protocols other than their usual retrieval context (e.g., E-Mail or
 1022    USENET news).
 1023 
 1024    It is beyond the scope of this document to specify how, for each
 1025    media type, the base URI can be embedded.  It is assumed that user
 1026    agents manipulating such media types will be able to obtain the
 1027    appropriate syntax from that media type's specification.  An example
 1028    of how the base URI can be embedded in the Hypertext Markup Language
 1029    (HTML) [RFC1866] is provided in Appendix D.
 1030 
 1031    A mechanism for embedding the base URI within MIME container types
 1032    (e.g., the message and multipart types) is defined by MHTML
 1033    [RFC2110].  Protocols that do not use the MIME message header syntax,
 1034    but which do allow some form of tagged metainformation to be included
 1035    within messages, may define their own syntax for defining the base
 1036    URI as part of a message.
 1037 
 1038 5.1.2. Base URI from the Encapsulating Entity
 1039 
 1040    If no base URI is embedded, the base URI of a document is defined by
 1041    the document's retrieval context.  For a document that is enclosed
 1042    within another entity (such as a message or another document), the
 1043    retrieval context is that entity; thus, the default base URI of the
 1044    document is the base URI of the entity in which the document is
 1045    encapsulated.
 1046 
 1047 5.1.3. Base URI from the Retrieval URI
 1048 
 1049    If no base URI is embedded and the document is not encapsulated
 1050    within some other entity (e.g., the top level of a composite entity),
 1051    then, if a URI was used to retrieve the base document, that URI shall
 1052    be considered the base URI.  Note that if the retrieval was the
 1053    result of a redirected request, the last URI used (i.e., that which
 1054    resulted in the actual retrieval of the document) is the base URI.
 1055 
 1056 5.1.4. Default Base URI
 1057 
 1058    If none of the conditions described in Sections 5.1.1--5.1.3 apply,
 1059    then the base URI is defined by the context of the application.
 1060    Since this definition is necessarily application-dependent, failing
 1071    to define the base URI using one of the other methods may result in
 1072    the same content being interpreted differently by different types of
 1073    application.
 1074 
 1075    It is the responsibility of the distributor(s) of a document
 1076    containing relative URI to ensure that the base URI for that document
 1077    can be established.  It must be emphasized that relative URI cannot
 1078    be used reliably in situations where the document's base URI is not
 1079    well-defined.
 1080 
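The order of precedence described in Sections 5.1.1 to 5.1.4 can be
summarized by a short Python sketch.  The function and parameter
names are hypothetical.  How each candidate is obtained (from the
document content, the enclosing entity, the retrieval context, or the
application) is outside the scope of the sketch.

```python
def choose_base_uri(embedded=None, encapsulating=None,
                    retrieval=None, default=None):
    """Pick the base URI using the precedence of Section 5.1."""
    # Innermost definition first: 5.1.1, then 5.1.2, 5.1.3, and 5.1.4.
    for candidate in (embedded, encapsulating, retrieval, default):
        if candidate is not None:
            return candidate
    raise ValueError("no base URI can be established")
```

Raising an error when no candidate is available reflects the
statement above that relative URI cannot be used reliably without a
well-defined base URI.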
 1081 5.2. Resolving Relative References to Absolute Form
 1082 
 1083    This section describes an example algorithm for resolving URI
 1084    references that might be relative to a given base URI.
 1085 
 1086    The base URI is established according to the rules of Section 5.1 and
 1087    parsed into the four main components as described in Section 3.  Note
 1088    that only the scheme component is required to be present in the base
 1089    URI; the other components may be empty or undefined.  A component is
 1090    undefined if its preceding separator does not appear in the URI
 1091    reference; the path component is never undefined, though it may be
 1092    empty.  The base URI's query component is not used by the resolution
 1093    algorithm and may be discarded.
 1094 
 1095    For each URI reference, the following steps are performed in order:
 1096 
 1097    1) The URI reference is parsed into the potential four components and
 1098       fragment identifier, as described in Section 4.3.
 1099 
 1100    2) If the path component is empty and the scheme, authority, and
 1101       query components are undefined, then it is a reference to the
 1102       current document and we are done.  Otherwise, the reference URI's
 1103       query and fragment components are defined as found (or not found)
 1104       within the URI reference and not inherited from the base URI.
 1105 
 1106    3) If the scheme component is defined, indicating that the reference
 1107       starts with a scheme name, then the reference is interpreted as an
 1108       absolute URI and we are done.  Otherwise, the reference URI's
 1109       scheme is inherited from the base URI's scheme component.
 1110 
 1111       Due to a loophole in prior specifications [RFC1630], some parsers
 1112       allow the scheme name to be present in a relative URI if it is the
 1113       same as the base URI scheme.  Unfortunately, this can conflict
 1114       with the correct parsing of non-hierarchical URI.  For backwards
 1115       compatibility, an implementation may work around such references
 1116       by removing the scheme if it matches that of the base URI and the
 1117       scheme is known to always use the <hier_part> syntax.  The parser
 1127       can then continue with the steps below for the remainder of the
 1128       reference components.  Validating parsers should mark such a
 1129       malformed relative reference as an error.
 1130 
 1131    4) If the authority component is defined, then the reference is a
 1132       network-path and we skip to step 7.  Otherwise, the reference
 1133       URI's authority is inherited from the base URI's authority
 1134       component, which will also be undefined if the URI scheme does not
 1135       use an authority component.
 1136 
 1137    5) If the path component begins with a slash character ("/"), then
 1138       the reference is an absolute-path and we skip to step 7.
 1139 
 1140    6) If this step is reached, then we are resolving a relative-path
 1141       reference.  The relative path needs to be merged with the base
 1142       URI's path.  Although there are many ways to do this, we will
 1143       describe a simple method using a separate string buffer.
 1144 
 1145       a) All but the last segment of the base URI's path component is
 1146          copied to the buffer.  In other words, any characters after the
 1147          last (right-most) slash character, if any, are excluded.
 1148 
 1149       b) The reference's path component is appended to the buffer
 1150          string.
 1151 
 1152       c) All occurrences of "./", where "." is a complete path segment,
 1153          are removed from the buffer string.
 1154 
 1155       d) If the buffer string ends with "." as a complete path segment,
 1156          that "." is removed.
 1157 
 1158       e) All occurrences of "<segment>/../", where <segment> is a
 1159          complete path segment not equal to "..", are removed from the
 1160          buffer string.  Removal of these path segments is performed
 1161          iteratively, removing the leftmost matching pattern on each
 1162          iteration, until no matching pattern remains.
 1163 
 1164       f) If the buffer string ends with "<segment>/..", where <segment>
 1165          is a complete path segment not equal to "..", that
 1166          "<segment>/.." is removed.
 1167 
 1168       g) If the resulting buffer string still begins with one or more
 1169          complete path segments of "..", then the reference is
 1170          considered to be in error.  Implementations may handle this
 1171          error by retaining these components in the resolved path (i.e.,
 1172          treating them as part of the final URI), by removing them from
 1173          the resolved path (i.e., discarding relative levels above the
 1174          root), or by avoiding traversal of the reference.
 1175 
 1183       h) The remaining buffer string is the reference URI's new path
 1184          component.
 1185 
 1186    7) The resulting URI components, including any inherited from the
 1187       base URI, are recombined to give the absolute form of the URI
 1188       reference.  Using pseudocode, this would be
 1189 
 1190          result = ""
 1191 
 1192          if scheme is defined then
 1193              append scheme to result
 1194              append ":" to result
 1195 
 1196          if authority is defined then
 1197              append "//" to result
 1198              append authority to result
 1199 
 1200          append path to result
 1201 
 1202          if query is defined then
 1203              append "?" to result
 1204              append query to result
 1205 
 1206          if fragment is defined then
 1207              append "#" to result
 1208              append fragment to result
 1209 
 1210          return result
 1211 
 1212       Note that we must be careful to preserve the distinction between a
 1213       component that is undefined, meaning that its separator was not
 1214       present in the reference, and a component that is empty, meaning
 1215       that the separator was present and was immediately followed by the
 1216       next component separator or the end of the reference.
 1217 
 1218    The above algorithm is intended to provide an example by which the
 1219    output of implementations can be tested -- implementation of the
 1220    algorithm itself is not required.  For example, some systems may find
 1221    it more efficient to implement step 6 as a pair of segment stacks
 1222    being merged, rather than as a series of string pattern replacements.
 1223 
 1224       Note: Some WWW client applications will fail to separate the
 1225       reference's query component from its path component before merging
 1226       the base and reference paths in step 6 above.  This may result in
 1227       a loss of information if the query component contains the strings
 1228       "/../" or "/./".
 1229 
 1230    Resolution examples are provided in Appendix C.
 1231 
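The steps of the algorithm above can be sketched in Python as
follows.  This is an illustration under stated assumptions, not a
required implementation.  All function names are hypothetical.  The
component split uses the regular expression from Appendix B.  The
sketch assumes the base URI has a path beginning with "/", retains
any leading ".." segments (one of the options permitted in step 6g),
and, for a same-document reference in step 2, returns the base URI
with the reference's fragment as a stand-in for the current document.

```python
import re

_REF_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def _split(ref):
    m = _REF_RE.match(ref)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


def _join(scheme, authority, path, query, fragment):
    # Step 7: None means "undefined"; "" means "present but empty".
    out = ""
    if scheme is not None:
        out += scheme + ":"
    if authority is not None:
        out += "//" + authority
    out += path
    if query is not None:
        out += "?" + query
    if fragment is not None:
        out += "#" + fragment
    return out


def _merge(base_path, ref_path):
    buf = base_path[: base_path.rfind("/") + 1] + ref_path   # steps 6a, 6b
    segs = buf.split("/")
    first = 1 if buf.startswith("/") else 0   # segs[0] is the root marker
    last = len(segs) - 1
    segs = [s for i, s in enumerate(segs) if s != "." or i == last]  # 6c
    if segs[-1] == ".":                                              # 6d
        segs[-1] = ""
    i = first
    while i < len(segs) - 2:                                         # 6e
        if segs[i] != ".." and segs[i + 1] == "..":
            del segs[i : i + 2]
            i = first        # restart so the leftmost match goes first
        else:
            i += 1
    if len(segs) - 2 >= first and segs[-1] == ".." and segs[-2] != "..":
        segs[-2:] = [""]                                             # 6f
    return "/".join(segs)    # 6g: leading ".." segments are retained


def resolve(base, reference):
    b_scheme, b_auth, b_path, b_query, _ = _split(base)
    scheme, auth, path, query, fragment = _split(reference)     # step 1
    if path == "" and scheme is None and auth is None and query is None:
        return _join(b_scheme, b_auth, b_path, b_query, fragment)  # step 2
    if scheme is not None:                                      # step 3
        return reference
    if auth is None:                                            # step 4
        auth = b_auth
        if not path.startswith("/"):                            # step 5
            path = _merge(b_path, path)                         # step 6
    return _join(b_scheme, auth, path, query, fragment)         # step 7
```

With the base URI "http://a/b/c/d;p?q", the sketch resolves "../g" to
"http://a/b/g" and "?y" to "http://a/b/c/?y".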
 1239 6. URI Normalization and Equivalence
 1240 
 1241    In many cases, different URI strings may actually identify the
 1242    same resource. For example, the host names used in URL are
 1243    actually case insensitive, and the URL <http://www.XEROX.com> is
 1244    equivalent to <http://www.xerox.com>. In general, the rules for
 1245    equivalence and definition of a normal form, if any, are scheme
 1246    dependent. When a scheme uses elements of the common syntax, it will
 1247    also use the common syntax equivalence rules, namely that the scheme
 1248    and hostname are case insensitive and a URL with an explicit ":port",
 1249    where the port is the default for the scheme, is equivalent to one
 1250    where the port is elided.
 1251 
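The common syntax equivalence rules can be sketched in Python as
follows.  The function name normalize and the DEFAULT_PORTS table are
hypothetical.  The table lists a few well-known defaults only for
illustration, and a real implementation would take default ports from
each scheme's own specification.  The component split uses the
regular expression from Appendix B.

```python
import re

_URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

# Illustrative subset only; not defined by this document.
DEFAULT_PORTS = {"http": "80", "ftp": "21", "gopher": "70"}


def normalize(uri):
    """Lower-case the scheme and host and drop an explicit default port."""
    m = _URI_RE.match(uri)
    scheme, authority, path = m.group(2), m.group(4), m.group(5)
    tail = (m.group(6) or "") + (m.group(8) or "")  # "?query" and "#fragment"
    out = ""
    if scheme is not None:
        scheme = scheme.lower()
        out += scheme + ":"
    if authority is not None:
        userinfo, at, hostport = authority.rpartition("@")
        host, colon, port = hostport.partition(":")
        if colon and port == DEFAULT_PORTS.get(scheme):
            colon = port = ""                       # elide the default port
        out += "//" + userinfo + at + host.lower() + colon + port
    # The path, query, and fragment are left untouched: their
    # equivalence rules, if any, are scheme dependent.
    return out + path + tail
```

Two URL can then be compared by comparing their normalized forms,
with the caveat that this covers only the common syntax rules named
above.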
 1252 7. Security Considerations
 1253 
 1254    A URI does not in itself pose a security threat.  Users should be aware
 1255    that there is no general guarantee that a URL, which at one time
 1256    located a given resource, will continue to do so.  Nor is there any
 1257    guarantee that a URL will not locate a different resource at some
 1258    later point in time, due to the lack of any constraint on how a given
 1259    authority apportions its namespace.  Such a guarantee can only be
 1260    obtained from the person(s) controlling that namespace and the
 1261    resource in question.  A specific URI scheme may include additional
 1262    semantics, such as name persistence, if those semantics are required
 1263    of all naming authorities for that scheme.
 1264 
 1265    It is sometimes possible to construct a URL such that an attempt to
 1266    perform a seemingly harmless, idempotent operation, such as the
 1267    retrieval of an entity associated with the resource, will in fact
 1268    cause a possibly damaging remote operation to occur.  The unsafe URL
 1269    is typically constructed by specifying a port number other than that
 1270    reserved for the network protocol in question.  The client
 1271    unwittingly contacts a site that is in fact running a different
 1272    protocol.  The content of the URL contains instructions that, when
 1273    interpreted according to this other protocol, cause an unexpected
 1274    operation.  An example has been the use of a gopher URL to cause an
 1275    unintended or impersonating message to be sent via an SMTP server.
 1276 
 1277    Caution should be used when using any URL that specifies a port
 1278    number other than the default for the protocol, especially when it is
 1279    a number within the reserved space.
 1280 
 1281    Care should be taken when a URL contains escaped delimiters for a
 1282    given protocol (for example, CR and LF characters for telnet
 1283    protocols) that these are not unescaped before transmission.  This
 1284    might violate the protocol, but avoids the potential for such
 1295    characters to be used to simulate an extra operation or parameter in
 1296    that protocol, which might cause an unexpected and possibly harmful
 1297    remote operation to be performed.
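
         As a minimal sketch of this precaution, a client can look for
         escaped CR and LF octets while the URL is still in its escaped
         form.  The helper name has_escaped_line_break() is hypothetical,
         and only the CR and LF delimiters named in the text are covered.

```python
import re

# %0D and %0A are the escaped forms of CR and LF; the hex digits of an
# escape may be written in either case.
_ESCAPED_CRLF = re.compile(r"%0[dDaA]")


def has_escaped_line_break(url):
    """True if the still-escaped URL carries an escaped CR or LF."""
    return _ESCAPED_CRLF.search(url) is not None


print(has_escaped_line_break("gopher://example.com:25/1HELO%0D%0AMAIL"))
print(has_escaped_line_break("http://example.com/a%20b"))
```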
 1298 
 1299    It is clearly unwise to use a URL that contains a password which is
 1300    intended to be secret. In particular, the use of a password within
 1301    the 'userinfo' component of a URL is strongly discouraged except
 1302    in those rare cases where the 'password' parameter is intended to be
 1303    public.
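
         Software that logs or displays URLs can guard against this case.
         The following sketch uses Python's standard urllib.parse module;
         the helper names has_password() and redact_password() are
         hypothetical, and the "***" replacement text is an arbitrary
         choice for illustration.

```python
from urllib.parse import urlsplit


def has_password(url):
    """True when the userinfo component has the "user:password" form."""
    return urlsplit(url).password is not None


def redact_password(url):
    """Return the URL with any password in the userinfo replaced by "***"."""
    parts = urlsplit(url)
    if parts.password is None:
        return url
    host = parts.netloc.rpartition("@")[2]  # hostport after the userinfo
    netloc = "%s:***@%s" % (parts.username, host)
    return parts._replace(netloc=netloc).geturl()


print(redact_password("ftp://user:secret@host.example/pub"))
```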
 1304 
 1305 8. Acknowledgements
 1306 
 1307    This document was derived from RFC 1738 [RFC1738] and RFC 1808
 1308    [RFC1808]; the acknowledgements in those specifications still apply.
 1309    In addition, contributions by Gisle Aas, Martin Beet, Martin Duerst,
 1310    Jim Gettys, Martijn Koster, Dave Kristol, Daniel LaLiberte, Foteos
 1311    Macrides, James Marshall, Ryan Moats, Keith Moore, and Lauren Wood
 1312    are gratefully acknowledged.
 1313 
 1314 9. References
 1315 
 1316    [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
 1317              Languages", BCP 18, RFC 2277, January 1998.
 1318 
 1319    [RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A
 1320              Unifying Syntax for the Expression of Names and Addresses
 1321              of Objects on the Network as used in the World-Wide Web",
 1322              RFC 1630, June 1994.
 1323 
 1324    [RFC1738] Berners-Lee, T., Masinter, L., and M. McCahill, Editors,
 1325              "Uniform Resource Locators (URL)", RFC 1738, December 1994.
 1326 
 1327    [RFC1866] Berners-Lee, T., and D. Connolly, "HyperText Markup Language
 1328              Specification -- 2.0", RFC 1866, November 1995.
 1329 
 1330    [RFC1123] Braden, R., Editor, "Requirements for Internet Hosts --
 1331              Application and Support", STD 3, RFC 1123, October 1989.
 1332 
 1333    [RFC822]  Crocker, D., "Standard for the Format of ARPA Internet Text
 1334              Messages", STD 11, RFC 822, August 1982.
 1335 
 1336    [RFC1808] Fielding, R., "Relative Uniform Resource Locators", RFC
 1337              1808, June 1995.
 1338 
 1339    [RFC2046] Freed, N., and N. Borenstein, "Multipurpose Internet Mail
 1340              Extensions (MIME) Part Two: Media Types", RFC 2046,
 1341              November 1996.
 1342 
 1351    [RFC1736] Kunze, J., "Functional Recommendations for Internet
 1352              Resource Locators", RFC 1736, February 1995.
 1353 
 1354    [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997.
 1355 
 1356    [RFC1034] Mockapetris, P., "Domain Names - Concepts and Facilities",
 1357              STD 13, RFC 1034, November 1987.
 1358 
 1359    [RFC2110] Palme, J., and A. Hopmann, "MIME E-mail Encapsulation of
 1360              Aggregate Documents, such as HTML (MHTML)", RFC 2110, March
 1361              1997.
 1362 
 1363    [RFC1737] Sollins, K., and L. Masinter, "Functional Requirements for
 1364              Uniform Resource Names", RFC 1737, December 1994.
 1365 
 1366    [ASCII]   US-ASCII. "Coded Character Set -- 7-bit American Standard
 1367              Code for Information Interchange", ANSI X3.4-1986.
 1368 
 1369    [UTF-8]   Yergeau, F., "UTF-8, a transformation format of ISO 10646",
 1370              RFC 2279, January 1998.
 1371 
 1407 10. Authors' Addresses
 1408 
 1409    Tim Berners-Lee
 1410    World Wide Web Consortium
 1411    MIT Laboratory for Computer Science, NE43-356
 1412    545 Technology Square
 1413    Cambridge, MA 02139
 1414 
 1415    Fax: +1(617)258-8682
 1416    EMail: timbl@w3.org
 1417 
 1418 
 1419    Roy T. Fielding
 1420    Department of Information and Computer Science
 1421    University of California, Irvine
 1422    Irvine, CA  92697-3425
 1423 
 1424    Fax: +1(949)824-1715
 1425    EMail: fielding@ics.uci.edu
 1426 
 1427 
 1428    Larry Masinter
 1429    Xerox PARC
 1430    3333 Coyote Hill Road
 1431    Palo Alto, CA 94304
 1432 
 1433    Fax: +1(415)812-4333
 1434    EMail: masinter@parc.xerox.com
 1435 
 1463 A. Collected BNF for URI
 1464 
 1465       URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
 1466       absoluteURI   = scheme ":" ( hier_part | opaque_part )
 1467       relativeURI   = ( net_path | abs_path | rel_path ) [ "?" query ]
 1468 
 1469       hier_part     = ( net_path | abs_path ) [ "?" query ]
 1470       opaque_part   = uric_no_slash *uric
 1471 
 1472       uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" |
 1473                       "&" | "=" | "+" | "$" | ","
 1474 
 1475       net_path      = "//" authority [ abs_path ]
 1476       abs_path      = "/"  path_segments
 1477       rel_path      = rel_segment [ abs_path ]
 1478 
 1479       rel_segment   = 1*( unreserved | escaped |
 1480                           ";" | "@" | "&" | "=" | "+" | "$" | "," )
 1481 
 1482       scheme        = alpha *( alpha | digit | "+" | "-" | "." )
 1483 
 1484       authority     = server | reg_name
 1485 
 1486       reg_name      = 1*( unreserved | escaped | "$" | "," |
 1487                           ";" | ":" | "@" | "&" | "=" | "+" )
 1488 
 1489       server        = [ [ userinfo "@" ] hostport ]
 1490       userinfo      = *( unreserved | escaped |
 1491                          ";" | ":" | "&" | "=" | "+" | "$" | "," )
 1492 
 1493       hostport      = host [ ":" port ]
 1494       host          = hostname | IPv4address
 1495       hostname      = *( domainlabel "." ) toplabel [ "." ]
 1496       domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
 1497       toplabel      = alpha | alpha *( alphanum | "-" ) alphanum
 1498       IPv4address   = 1*digit "." 1*digit "." 1*digit "." 1*digit
 1499       port          = *digit
 1500 
 1501       path          = [ abs_path | opaque_part ]
 1502       path_segments = segment *( "/" segment )
 1503       segment       = *pchar *( ";" param )
 1504       param         = *pchar
 1505       pchar         = unreserved | escaped |
 1506                       ":" | "@" | "&" | "=" | "+" | "$" | ","
 1507 
 1508       query         = *uric
 1509 
 1510       fragment      = *uric
 1511 
 1519       uric          = reserved | unreserved | escaped
 1520       reserved      = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
 1521                       "$" | ","
 1522       unreserved    = alphanum | mark
 1523       mark          = "-" | "_" | "." | "!" | "~" | "*" | "'" |
 1524                       "(" | ")"
 1525 
 1526       escaped       = "%" hex hex
 1527       hex           = digit | "A" | "B" | "C" | "D" | "E" | "F" |
 1528                               "a" | "b" | "c" | "d" | "e" | "f"
 1529 
 1530       alphanum      = alpha | digit
 1531       alpha         = lowalpha | upalpha
 1532 
 1533       lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
 1534                  "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
 1535                  "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
 1536       upalpha  = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
 1537                  "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
 1538                  "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
 1539       digit    = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
 1540                  "8" | "9"
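
         Several of the rules above translate directly into regular
         expressions.  The sketch below covers only the scheme, host, and
         port rules; the constant and function names are hypothetical,
         and the translation is an illustration rather than a complete
         validator for the grammar.

```python
import re

# Building blocks transcribed from the domainlabel, toplabel, hostname
# and IPv4address rules.
ALPHANUM = r"[A-Za-z0-9]"
DOMAINLABEL = r"%s(?:[A-Za-z0-9-]*%s)?" % (ALPHANUM, ALPHANUM)
TOPLABEL = r"[A-Za-z](?:[A-Za-z0-9-]*%s)?" % ALPHANUM
HOSTNAME = r"(?:%s\.)*%s\.?" % (DOMAINLABEL, TOPLABEL)
IPV4ADDRESS = r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+"

SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")
HOST_RE = re.compile(r"(?:%s|%s)" % (HOSTNAME, IPV4ADDRESS))
PORT_RE = re.compile(r"[0-9]*")             # *digit: an empty port is allowed


def is_scheme(text):
    return SCHEME_RE.fullmatch(text) is not None


def is_host(text):
    return HOST_RE.fullmatch(text) is not None


def is_port(text):
    return PORT_RE.fullmatch(text) is not None


print(is_scheme("http"), is_host("www.ics.uci.edu"), is_port(""))
```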
 1541 
 1575 B. Parsing a URI Reference with a Regular Expression
 1576 
 1577    As described in Section 4.3, the generic URI syntax is not sufficient
 1578    to disambiguate the components of some forms of URI.  Since the
 1579    "greedy algorithm" described in that section is identical to the
 1580    disambiguation method used by POSIX regular expressions, it is
 1581    natural and commonplace to use a regular expression for parsing the
 1582    potential four components and fragment identifier of a URI reference.
 1583 
 1584    The following line is the regular expression for breaking down a URI
 1585    reference into its components.
 1586 
 1587       ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 1588        12            3  4          5       6  7        8 9
 1589 
 1590    The numbers in the second line above are only to assist readability;
 1591    they indicate the reference points for each subexpression (i.e., each
 1592    paired parenthesis).  We refer to the value matched for subexpression
 1593    <n> as $<n>.  For example, matching the above expression to
 1594 
 1595       http://www.ics.uci.edu/pub/ietf/uri/#Related
 1596 
 1597    results in the following subexpression matches:
 1598 
 1599       $1 = http:
 1600       $2 = http
 1601       $3 = //www.ics.uci.edu
 1602       $4 = www.ics.uci.edu
 1603       $5 = /pub/ietf/uri/
 1604       $6 = <undefined>
 1605       $7 = <undefined>
 1606       $8 = #Related
 1607       $9 = Related
 1608 
 1609    where <undefined> indicates that the component is not present, as is
 1610    the case for the query component in the above example.  Therefore, we
 1611    can determine the value of the four components and fragment as
 1612 
 1613       scheme    = $2
 1614       authority = $4
 1615       path      = $5
 1616       query     = $7
 1617       fragment  = $9
 1618 
 1619    and, going in the opposite direction, we can recreate a URI reference
 1620    from its components using the algorithm in step 7 of Section 5.2.
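
         The expression can be applied as-is with Python's re module.
         The sketch below maps the numbered subexpressions to the five
         components and recombines them in the manner of step 7 of
         Section 5.2; the function names are hypothetical, and a
         component that is not present is reported as None.

```python
import re

URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri_reference(ref):
    """Split a URI reference into its components ($2, $4, $5, $7, $9)."""
    m = URI_REFERENCE.match(ref)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


def join_uri_reference(parts):
    """Recombine the components into a URI reference string."""
    out = ""
    if parts["scheme"] is not None:
        out += parts["scheme"] + ":"
    if parts["authority"] is not None:
        out += "//" + parts["authority"]
    out += parts["path"]
    if parts["query"] is not None:
        out += "?" + parts["query"]
    if parts["fragment"] is not None:
        out += "#" + parts["fragment"]
    return out


parts = split_uri_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related")
print(parts)
print(join_uri_reference(parts))
```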
 1621 
 1631 C. Examples of Resolving Relative URI References
 1632 
 1633    Within an object with a well-defined base URI of
 1634 
 1635       http://a/b/c/d;p?q
 1636 
 1637    the relative URI would be resolved as follows:
 1638 
 1639 C.1.  Normal Examples
 1640 
 1641       g:h           =  g:h
 1642       g             =  http://a/b/c/g
 1643       ./g           =  http://a/b/c/g
 1644       g/            =  http://a/b/c/g/
 1645       /g            =  http://a/g
 1646       //g           =  http://g
 1647       ?y            =  http://a/b/c/?y
 1648       g?y           =  http://a/b/c/g?y
 1649       #s            =  (current document)#s
 1650       g#s           =  http://a/b/c/g#s
 1651       g?y#s         =  http://a/b/c/g?y#s
 1652       ;x            =  http://a/b/c/;x
 1653       g;x           =  http://a/b/c/g;x
 1654       g;x?y#s       =  http://a/b/c/g;x?y#s
 1655       .             =  http://a/b/c/
 1656       ./            =  http://a/b/c/
 1657       ..            =  http://a/b/
 1658       ../           =  http://a/b/
 1659       ../g          =  http://a/b/g
 1660       ../..         =  http://a/
 1661       ../../        =  http://a/
 1662       ../../g       =  http://a/g
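
         The examples above can be reproduced with a short program.  The
         following is a sketch of the resolution steps of Section 5.2,
         written for illustration only: the function names and the
         CURRENT_DOCUMENT placeholder are hypothetical, the base URI is
         assumed to be absolute with a path that begins with "/", and
         excess ".." segments are simply retained.  Python's own
         urllib.parse.urljoin is not used here because it tracks a later
         revision of these rules and may give different answers for some
         references.

```python
import re

_URI = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
CURRENT_DOCUMENT = "(current document)"     # placeholder mirroring the table


def _split(ref):
    m = _URI.match(ref)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


def _merge(base_path, ref_path):
    # Keep the base path up to its last "/", then append the reference path.
    segs = (base_path[: base_path.rfind("/") + 1] + ref_path).split("/")
    # Drop complete "." segments; a trailing "." leaves a trailing "/".
    last = "" if segs[-1] == "." else segs[-1]
    segs = [s for s in segs[:-1] if s != "."] + [last]
    # Repeatedly remove the leftmost "<segment>/.." pair.
    i = 1                                   # segs[0] precedes the leading "/"
    while i + 1 < len(segs):
        if segs[i] != ".." and segs[i + 1] == "..":
            # At the very end, keep an empty segment so the final "/" survives.
            segs[i:i + 2] = [""] if i + 2 == len(segs) else []
            i = 1
        else:
            i += 1
    return "/".join(segs)                   # leftover ".." segments are kept


def resolve(base, ref):
    scheme, authority, path, query, fragment = _split(ref)
    if path == "" and scheme is None and authority is None and query is None:
        return CURRENT_DOCUMENT + ("" if fragment is None else "#" + fragment)
    if scheme is not None:                  # already an absolute URI
        return ref
    scheme, base_authority, base_path, _, _ = _split(base)
    if authority is None:
        authority = base_authority
        if not path.startswith("/"):
            path = _merge(base_path, path)
    out = scheme + ":"
    if authority is not None:
        out += "//" + authority
    out += path
    if query is not None:
        out += "?" + query
    if fragment is not None:
        out += "#" + fragment
    return out


BASE = "http://a/b/c/d;p?q"
for ref in ["g", "../g", "?y", "#s", "../../g"]:
    print("%-8s = %s" % (ref, resolve(BASE, ref)))
```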
 1663 
 1664 C.2.  Abnormal Examples
 1665 
 1666    Although the following abnormal examples are unlikely to occur in
 1667    normal practice, all URI parsers should be capable of resolving them
 1668    consistently.  Each example uses the same base as above.
 1669 
 1670    An empty reference refers to the start of the current document.
 1671 
 1672       <>            =  (current document)
 1673 
 1674    Parsers must be careful in handling the case where there are more
 1675    relative path ".." segments than there are hierarchical levels in the
 1676    base URI's path.  Note that the ".." syntax cannot be used to change
 1677    the authority component of a URI.
 1678 
 1687       ../../../g    =  http://a/../g
 1688       ../../../../g =  http://a/../../g
 1689 
 1690    In practice, some implementations strip leading relative symbolic
 1691    elements (".", "..") after applying a relative URI calculation, based
 1692    on the theory that compensating for obvious author errors is better
 1693    than allowing the request to fail.  Thus, the above two references
 1694    will be interpreted as "http://a/g" by some implementations.
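
         The lenient behavior described above can be sketched as a small
         post-processing step on the path component of a resolved URI.
         The function name is hypothetical, and the sketch illustrates
         what such implementations do rather than recommending it.

```python
def strip_leading_dot_segments(path):
    """Drop "." and ".." segments from the front of an absolute path."""
    if not path.startswith("/"):
        return path                         # only absolute paths are handled
    segs = path.split("/")                  # "/../g" -> ["", "..", "g"]
    i = 1
    while i < len(segs) and segs[i] in (".", ".."):
        i += 1
    return "/" + "/".join(segs[i:])


print(strip_leading_dot_segments("/../g"))
print(strip_leading_dot_segments("/../../g"))
```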
 1695 
 1696    Similarly, parsers must avoid treating "." and ".." as special when
 1697    they are not complete components of a relative path.
 1698 
 1699       /./g          =  http://a/./g
 1700       /../g         =  http://a/../g
 1701       g.            =  http://a/b/c/g.
 1702       .g            =  http://a/b/c/.g
 1703       g..           =  http://a/b/c/g..
 1704       ..g           =  http://a/b/c/..g
 1705 
 1706    Less likely are cases where the relative URI uses unnecessary or
 1707    nonsensical forms of the "." and ".." complete path segments.
 1708 
 1709       ./../g        =  http://a/b/g
 1710       ./g/.         =  http://a/b/c/g/
 1711       g/./h         =  http://a/b/c/g/h
 1712       g/../h        =  http://a/b/c/h
 1713       g;x=1/./y     =  http://a/b/c/g;x=1/y
 1714       g;x=1/../y    =  http://a/b/c/y
 1715 
 1716    All client applications remove the query component from the base URI
 1717    before resolving relative URI.  However, some applications fail to
 1718    separate the reference's query and/or fragment components from a
 1719    relative path before merging it with the base path.  This error is
 1720    rarely noticed, since typical usage of a fragment never includes the
 1721    hierarchy ("/") character, and the query component is not normally
 1722    used within relative references.
 1723 
 1724       g?y/./x       =  http://a/b/c/g?y/./x
 1725       g?y/../x      =  http://a/b/c/g?y/../x
 1726       g#s/./x       =  http://a/b/c/g#s/./x
 1727       g#s/../x      =  http://a/b/c/g#s/../x
 1728 
 1743    Some parsers allow the scheme name to be present in a relative URI if
 1744    it is the same as the base URI scheme.  This is considered to be a
 1745    loophole in prior specifications of partial URI [RFC1630]. Its use
 1746    should be avoided.
 1747 
 1748       http:g        =  http:g           ; for validating parsers
 1749                     |  http://a/b/c/g   ; for backwards compatibility
 1750 
 1799 D. Embedding the Base URI in HTML documents
 1800 
 1801    It is useful to consider an example of how the base URI of a document
 1802    can be embedded within the document's content.  In this appendix, we
 1803    describe how documents written in the Hypertext Markup Language
 1804    (HTML) [RFC1866] can include an embedded base URI.  This appendix
 1805    does not form a part of the URI specification and should not be
 1806    considered as anything more than a descriptive example.
 1807 
 1808    HTML defines a special element "BASE" which, when present in the
 1809    "HEAD" portion of a document, signals that the parser should use the
 1810    BASE element's "HREF" attribute as the base URI for resolving any
 1811    relative URI.  The "HREF" attribute must be an absolute URI.  Note
 1812    that, in HTML, element and attribute names are case-insensitive.  For
 1813    example:
 1814 
 1815       <!doctype html public "-//IETF//DTD HTML//EN">
 1816       <HTML><HEAD>
 1817       <TITLE>An example HTML document</TITLE>
 1818       <BASE href="http://www.ics.uci.edu/Test/a/b/c">
 1819       </HEAD><BODY>
 1820       ... <A href="../x">a hypertext anchor</A> ...
 1821       </BODY></HTML>
 1822 
 1823    A parser reading the example document should interpret the given
 1824    relative URI "../x" as representing the absolute URI
 1825 
 1826       <http://www.ics.uci.edu/Test/a/x>
 1827 
 1828    regardless of the context in which the example document was obtained.
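
         As an illustration, the example document can be processed with
         Python's standard html.parser and urllib.parse modules.  The
         BaseFinder class is a hypothetical name, and the sketch takes
         the first BASE element it sees without checking that it appears
         inside HEAD.  The parser reports element and attribute names in
         lower case, which matches the case-insensitivity noted above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseFinder(HTMLParser):
    """Collect the BASE href and all anchor hrefs from a document."""

    def __init__(self):
        super().__init__()
        self.base = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and self.base is None and "href" in attrs:
            self.base = attrs["href"]
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])


DOCUMENT = """<!doctype html public "-//IETF//DTD HTML//EN">
<HTML><HEAD>
<TITLE>An example HTML document</TITLE>
<BASE href="http://www.ics.uci.edu/Test/a/b/c">
</HEAD><BODY>
... <A href="../x">a hypertext anchor</A> ...
</BODY></HTML>
"""

finder = BaseFinder()
finder.feed(DOCUMENT)
resolved = [urljoin(finder.base, link) for link in finder.links]
print(resolved)
```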
 1829 
 1855 E. Recommendations for Delimiting URI in Context
 1856 
 1857    URI are often transmitted through formats that do not provide a clear
 1858    context for their interpretation.  For example, there are many
 1859    occasions when URI are included in plain text; examples include text
 1860    sent in electronic mail, USENET news messages, and, most importantly,
 1861    printed on paper.  In such cases, it is important to be able to
 1862    delimit the URI from the rest of the text, and in particular from
 1863    punctuation marks that might be mistaken for part of the URI.
 1864 
 1865    In practice, URI are delimited in a variety of ways, but usually
 1866    within double-quotes "http://test.com/", angle brackets
 1867    <http://test.com/>, or just using whitespace
 1868 
 1869                              http://test.com/
 1870 
 1871    These wrappers do not form part of the URI.
 1872 
 1873    In the case where a fragment identifier is associated with a URI
 1874    reference, the fragment would be placed within the brackets as well
 1875    (separated from the URI with a "#" character).
 1876 
 1877    In some cases, extra whitespace (spaces, linebreaks, tabs, etc.) may
 1878    need to be added to break long URI across lines. The whitespace
 1879    should be ignored when extracting the URI.
 1880 
 1881    No whitespace should be introduced after a hyphen ("-") character.
 1882    Because some typesetters and printers may (erroneously) introduce a
 1883    hyphen at the end of line when breaking a line, the interpreter of a
 1884    URI containing a line break immediately after a hyphen should ignore
 1885    all unescaped whitespace around the line break, and should be aware
 1886    that the hyphen may or may not actually be part of the URI.
 1887 
 1888    Using <> angle brackets around each URI is especially recommended as
 1889    a delimiting style for URI that contain whitespace.
 1890 
 1891    The prefix "URL:" (with or without a trailing space) was recommended
 1892    as a way to help distinguish a URL from other bracketed
 1893    designators, although this is not common in practice.
 1894 
 1895    For robustness, software that accepts user-typed URI should attempt
 1896    to recognize and strip both delimiters and embedded whitespace.
 1897 
 1898    For example, the text:
 1899 
 1911       Yes, Jim, I found it under "http://www.w3.org/Addressing/",
 1912       but you can probably pick it up from <ftp://ds.internic.
 1913       net/rfc/>.  Note the warning in <http://www.ics.uci.edu/pub/
 1914       ietf/uri/historical.html#WARNING>.
 1915 
 1916    contains the URI references
 1917 
 1918       http://www.w3.org/Addressing/
 1919       ftp://ds.internic.net/rfc/
 1920       http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING
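
         A rough sketch of this extraction follows.  It assumes, for
         illustration only, that every double-quoted or angle-bracketed
         string in the text is a URI reference; it removes embedded
         whitespace and an optional "URL:" prefix, and it does not
         attempt the hyphen handling described above.  The function name
         extract_uris() is hypothetical.

```python
import re

# Either <...> or "..." delimiters, scanned in order of appearance.
_DELIMITED = re.compile(r'<([^<>]*)>|"([^"]*)"')


def extract_uris(text):
    """Return the delimited strings in text with whitespace removed."""
    found = []
    for m in _DELIMITED.finditer(text):
        raw = m.group(1) if m.group(1) is not None else m.group(2)
        uri = re.sub(r"\s+", "", raw)       # ignore line breaks and spaces
        if uri.startswith("URL:"):
            uri = uri[4:]
        if uri:
            found.append(uri)
    return found


TEXT = '''Yes, Jim, I found it under "http://www.w3.org/Addressing/",
but you can probably pick it up from <ftp://ds.internic.
net/rfc/>.  Note the warning in <http://www.ics.uci.edu/pub/
ietf/uri/historical.html#WARNING>.'''

for uri in extract_uris(TEXT):
    print(uri)
```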
 1921 
 1967 F. Abbreviated URLs
 1968 
 1969    The URL syntax was designed for unambiguous reference to network
 1970    resources and extensibility via the URL scheme.  However, as URL
 1971    identification and usage have become commonplace, traditional media
 1972    (television, radio, newspapers, billboards, etc.) have increasingly
 1973    used abbreviated URL references.  That is, a reference consisting of
 1974    only the authority and path portions of the identified resource, such
 1975    as
 1976 
 1977       www.w3.org/Addressing/
 1978 
 1979    or simply the DNS hostname on its own.  Such references are primarily
 1980    intended for human interpretation rather than machine, with the
 1981    assumption that context-based heuristics are sufficient to complete
 1982    the URL (e.g., most hostnames beginning with "www" are likely to have
 1983    a URL prefix of "http://").  Although there is no standard set of
 1984    heuristics for disambiguating abbreviated URL references, many client
 1985    implementations allow them to be entered by the user and
 1986    heuristically resolved.  It should be noted that such heuristics may
 1987    change over time, particularly when new URL schemes are introduced.
 1988 
 1989    Since an abbreviated URL has the same syntax as a relative URL path,
 1990    abbreviated URL references cannot be used in contexts where relative
 1991    URLs are expected.  This limits the use of abbreviated URLs to places
 1992    where there is no defined base URL, such as dialog boxes and off-line
 1993    advertisements.
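
         Since no standard set of heuristics exists, the following is
         only one possible sketch of the "www" heuristic mentioned above.
         The function name expand_abbreviated() is hypothetical, and
         returning None for anything that cannot be guessed is a choice
         made for this example.

```python
import re

# A leading "scheme:" means the text is not an abbreviated reference.
_HAS_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def expand_abbreviated(text):
    """Guess a full URL for an abbreviated reference, or return None."""
    if _HAS_SCHEME.match(text):
        return text                         # already has a scheme
    if text.lower().startswith("www"):
        return "http://" + text             # the "www" heuristic
    return None                             # no guess; let the caller decide


print(expand_abbreviated("www.w3.org/Addressing/"))
```

         Such a function is suitable only where no base URL is defined,
         since the same text would otherwise be a valid relative path.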
 1994 
 2023 G. Summary of Non-editorial Changes
 2024 
 2025 G.1. Additions
 2026 
 2027    Section 4 (URI References) was added to stem the confusion regarding
 2028    "what is a URI" and how to describe fragment identifiers given that
 2029    they are not part of the URI, but are part of the URI syntax and
 2030    parsing concerns.  In addition, it provides a reference definition
 2031    for use by other IETF specifications (HTML, HTTP, etc.) that have
 2032    previously attempted to redefine the URI syntax in order to account
 2033    for the presence of fragment identifiers in URI references.
 2034 
 2035    Section 2.4 was rewritten to clarify a number of misinterpretations
 2036    and to leave room for fully internationalized URI.
 2037 
 2038    Appendix F on abbreviated URLs was added to describe the shortened
 2039    references often seen on television and magazine advertisements and
 2040    explain why they are not used in other contexts.
 2041 
 2042 G.2. Modifications from both RFC 1738 and RFC 1808
 2043 
 2044    Changed to URI syntax instead of just URL.
 2045 
 2046    Confusion regarding the terms "character encoding", the URI
 2047    "character set", and the escaping of characters with %<hex><hex>
 2048    equivalents has (hopefully) been reduced.  Many of the BNF rule names
 2049    regarding the character sets have been changed to more accurately
 2050    describe their purpose and to encompass all "characters" rather than
 2051    just US-ASCII octets.  Unless otherwise noted here, these
 2052    modifications do not affect the URI syntax.
 2053 
 2054    Both RFC 1738 and RFC 1808 refer to the "reserved" set of characters
 2055    as if URI-interpreting software were limited to a single set of
 2056    characters with a reserved purpose (i.e., as meaning something other
 2057    than the data to which the characters correspond), and that this set
 2058    was fixed by the URI scheme.  However, this has not been true in
 2059    practice; any character that is interpreted differently when it is
 2060    escaped is, in effect, reserved.  Furthermore, the interpreting
 2061    engine on an HTTP server is often dependent on the resource, not just
 2062    the URI scheme.  The description of reserved characters has been
 2063    changed accordingly.
 2064 
 2065    The plus "+", dollar "$", and comma "," characters have been added to
 2066    those in the "reserved" set, since they are treated as reserved
 2067    within the query component.
 2068 
 2079    The tilde "~" character was added to those in the "unreserved" set,
 2080    since it is extensively used on the Internet in spite of the
 2081    difficulty of transcribing it with some keyboards.
 2082 
 2083    The syntax for URI scheme has been changed to require that all
 2084    schemes begin with an alpha character.
 2085 
 2086    The "user:password" form in the previous BNF was changed to a
 2087    "userinfo" token, and the possibility that it might be
 2088    "user:password" made scheme specific. In particular, the use of
 2089    passwords in the clear is not even suggested by the syntax.
 2090 
 2091    The question-mark "?" character was removed from the set of allowed
 2092    characters for the userinfo in the authority component, since testing
 2093    showed that many applications treat it as reserved for separating the
 2094    query component from the rest of the URI.
 2095 
 2096    The semicolon ";" character was added to those stated as being
 2097    reserved within the authority component, since several new schemes
 2098    are using it as a separator within userinfo to indicate the type of
 2099    user authentication.
 2100 
 2101    RFC 1738 specified that the path was separated from the authority
 2102    portion of a URI by a slash.  RFC 1808 followed suit, but with a
 2103    fudge of carrying around the separator as a "prefix" in order to
 2104    describe the parsing algorithm.  RFC 1630 never had this problem,
 2105    since it considered the slash to be part of the path.  In writing
 2106    this specification, it was found to be impossible to accurately
 2107    describe and retain the difference between the two URI
 2108       <foo:/bar>   and   <foo:bar>
 2109    without either considering the slash to be part of the path (as
 2110    corresponds to actual practice) or creating a separate component just
 2111    to hold that slash.  We chose the former.
 2112 
 2113 G.3. Modifications from RFC 1738
 2114 
 2115    The definition of specific URL schemes and their scheme-specific
 2116    syntax and semantics has been moved to separate documents.
 2117 
 2118    The URL host was defined as a fully-qualified domain name.  However,
 2119    many URLs are used without fully-qualified domain names (in contexts
 2120    for which the full qualification is not necessary), without any host
 2121    (as in some file URLs), or with a host of "localhost".
 2122 
 2123    The URL port is now *digit instead of 1*digit, since systems are
 2124    expected to handle the case where the ":" separator between host and
 2125    port is supplied without a port.
 2126 
 2135    The recommendations for delimiting URI in context (Appendix E) have
 2136    been adjusted to reflect current practice.
 2137 
 2138 G.4. Modifications from RFC 1808
 2139 
 2140    RFC 1808 (Section 4) defined an empty URL reference (a reference
 2141    containing nothing aside from the fragment identifier) as being a
 2142    reference to the base URL.  Unfortunately, that definition could be
 2143    interpreted, upon selection of such a reference, as a new retrieval
 2144    action on that resource.  Since the normal intent of such references
 2145    is for the user agent to change its view of the current document to
 2146    the beginning of the specified fragment within that document, not to
 2147    make an additional request of the resource, a description of how to
 2148    correctly interpret an empty reference has been added in Section 4.
 2149 
 2150    The description of the mythical Base header field has been replaced
 2151    with a reference to the Content-Location header field defined by
 2152    MHTML [RFC2110].
 2153 
 2154    RFC 1808 described various schemes as either having or not having the
 2155    properties of the generic URI syntax.  However, the only requirement
 2156    is that the particular document containing the relative references
 2157    have a base URI that abides by the generic URI syntax, regardless of
 2158    the URI scheme, so the associated description has been updated to
 2159    reflect that.
 2160 
 2161    The BNF term <net_loc> has been replaced with <authority>, since the
 2162    latter more accurately describes its use and purpose.  Likewise, the
 2163    authority is no longer restricted to the IP server syntax.
 2164 
 2165    Extensive testing of current client applications demonstrated that
 2166    the majority of deployed systems do not use the ";" character to
 2167    indicate trailing parameter information, and that the presence of a
 2168    semicolon in a path segment does not affect the relative parsing of
 2169    that segment.  Therefore, parameters have been removed as a separate
 2170    component and may now appear in any path segment.  Their influence
 2171    has been removed from the algorithm for resolving a relative URI
 2172    reference.  The resolution examples in Appendix C have been modified
 2173    to reflect this change.
 2174 
 2175    Implementations are now allowed to work around malformed relative
 2176    references that are prefixed by the same scheme as the base URI, but
 2177    only for schemes known to use the <hier_part> syntax.
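   One way such a workaround might look is sketched below.  The helper
   name strip_redundant_scheme and the contents of HIERARCHICAL_SCHEMES
   are assumptions made for illustration.  The additional check that
   the remainder does not begin with "/" is a conservative choice of
   this sketch, so that well-formed absolute URI are left untouched.

```python
import re

# Assumption: an illustrative, non-exhaustive set of schemes that are
# known to use the hierarchical syntax.
HIERARCHICAL_SCHEMES = {"http", "ftp"}

# scheme = alpha *( alpha | digit | "+" | "-" | "." ), followed by ":"
_SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):")


def strip_redundant_scheme(reference: str, base_scheme: str) -> str:
    """Return the reference with its scheme removed if it looks like a
    relative reference prefixed by the base URI's scheme; otherwise
    return it unchanged."""
    match = _SCHEME_RE.match(reference)
    if match is None:
        return reference  # no scheme: already a relative reference
    scheme = match.group(1).lower()  # schemes compare case-insensitively
    rest = reference[match.end():]
    if (
        scheme == base_scheme.lower()
        and scheme in HIERARCHICAL_SCHEMES
        and not rest.startswith("/")  # leave "http://..." and "http:/..." alone
    ):
        return rest
    return reference
```

   For a base URI with scheme "http", the reference "http:g" becomes
   "g" and can then be resolved as an ordinary relative reference,
   whereas "http://a/b" and "mailto:user" are returned unchanged.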
 2178 
 2190 
 2191 H.  Full Copyright Statement
 2192 
 2193    Copyright (C) The Internet Society (1998).  All Rights Reserved.
 2194 
 2195    This document and translations of it may be copied and furnished to
 2196    others, and derivative works that comment on or otherwise explain it
 2197    or assist in its implementation may be prepared, copied, published
 2198    and distributed, in whole or in part, without restriction of any
 2199    kind, provided that the above copyright notice and this paragraph are
 2200    included on all such copies and derivative works.  However, this
 2201    document itself may not be modified in any way, such as by removing
 2202    the copyright notice or references to the Internet Society or other
 2203    Internet organizations, except as needed for the purpose of
 2204    developing Internet standards in which case the procedures for
 2205    copyrights defined in the Internet Standards process must be
 2206    followed, or as required to translate it into languages other than
 2207    English.
 2208 
 2209    The limited permissions granted above are perpetual and will not be
 2210    revoked by the Internet Society or its successors or assigns.
 2211 
 2212    This document and the information contained herein is provided on an
 2213    "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
 2214    TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
 2215    BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
 2216    HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
 2217    MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
 2218 