Summary: Web Technology 2015
Web essentials
The internet
The internet is a network of networks connected via the public backbone, communicating using the
TCP/IP communication protocol. Note: rules and protocols that make the internet work are the
evolving results of human effort, not laws of nature.
The internet’s technical origin was ARPANET in the late 1960s. It was one of the earliest attempts to
network heterogeneous, geographically dispersed computers. E-mail was available on ARPANET in
1972 and it quickly became very popular. ARPANET access was limited to a select group of DoD-
funded organizations.
Internet protocols were developed as a part of ARPANET research: Vinton Cerf et al. wrote the first
TCP specification in 1974. ARPANET began using TCP/IP in 1982. These protocols were designed for
use both within local area networks (LANs) and between networks.
IP (Internet Protocol) is, as the name implies, the fundamental protocol defining the internet. In
IPv4, an IP address is a 32-bit number that is associated with at most one device at a time
(although a device may have more than one IP address). It is written as four dot-separated bytes,
e.g. 192.0.34.166. Since a byte can hold values from 0 to 255, each of the four numbers must lie in
that range. IP address ranges are allocated by IANA (Internet Assigned Numbers Authority).
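The relation between the dotted notation and the underlying 32-bit number can be sketched with
Python's standard ipaddress module. This is only an illustration using the example address from
the text; the variable names are chosen here.

```python
import ipaddress

# The example address from the text, in dotted notation.
addr = ipaddress.IPv4Address("192.0.34.166")

# The same address as one 32-bit integer.
as_int = int(addr)

# Split the integer back into its four bytes, most significant byte first.
octets = list(as_int.to_bytes(4, "big"))

print(as_int)   # 3221234342
print(octets)   # [192, 0, 34, 166]
```

A value above 255 in any position (e.g. 192.0.34.256) is rejected by IPv4Address with an error,
because it does not fit in one byte.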
The function of IP is to transfer data packets from a source device to a destination device. IP
software at the source creates packets representing the data: each packet starts with a header
(containing the source and destination IP addresses, the length of the data, etc.), followed by the
data itself. If the destination is on another LAN, the packet is sent to a gateway, a device that
connects to more than one network. The packet may pass through a route of multiple gateways before
it reaches its destination.
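The header-plus-data layout can be illustrated with a toy packet format. Note that this is a
deliberately simplified sketch and not the real IPv4 header layout: the three-field header, the
function names and the destination address 192.0.2.7 are assumptions made for the example.

```python
import socket
import struct

# Toy header: source address (4 bytes), destination address (4 bytes),
# data length (2 bytes), all in network byte order. NOT the real IPv4 layout.
HEADER = struct.Struct("!4s4sH")

def build_packet(source, destination, data):
    """Prefix the data with a header describing where it comes from and goes to."""
    header = HEADER.pack(socket.inet_aton(source),
                         socket.inet_aton(destination),
                         len(data))
    return header + data

def parse_packet(packet):
    """Split a toy packet back into source, destination and data."""
    src, dst, length = HEADER.unpack_from(packet)
    data = packet[HEADER.size:HEADER.size + length]
    return socket.inet_ntoa(src), socket.inet_ntoa(dst), data

packet = build_packet("192.0.34.166", "192.0.2.7", b"hello")
print(parse_packet(packet))
```

A gateway only needs to read the header to decide where to forward the packet; the data itself
travels along unchanged.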
IP has some limitations: there is no guarantee of packet delivery (packets can be dropped), and
the communication is one-way (source to destination). The Transmission Control Protocol (TCP) adds
the concept of a connection on top of IP: it provides a guarantee that packets are delivered, and
it provides two-way (full-duplex) communication. Furthermore, TCP adds the concept of a port: the
TCP header contains a port number, identifying an application program on the destination computer.
Some port numbers have standard meanings (assigned by IANA): for example, port 25 is normally used
for e-mail transmitted using the Simple Mail Transfer Protocol (SMTP). Other port numbers are
available first-come, first-served to any application.
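The connection and port concepts can be tried out on a single machine. The following is a minimal
sketch, assuming the loopback address 127.0.0.1 is usable; the function names and the upper-casing
behaviour of the server are made up for the illustration.

```python
import socket
import threading

def run_echo_server(server_sock):
    """Accept one connection and send back whatever arrives, upper-cased."""
    conn, _ = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())

def tcp_round_trip(message):
    """Send a short message over a TCP connection and return the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_sock:
        # Port 0 asks the operating system for any free port.
        server_sock.bind(("127.0.0.1", 0))
        server_sock.listen(1)
        port = server_sock.getsockname()[1]
        worker = threading.Thread(target=run_echo_server, args=(server_sock,))
        worker.start()
        # The client addresses the server by IP address plus port number.
        with socket.create_connection(("127.0.0.1", port)) as client:
            client.sendall(message)      # client to server
            reply = client.recv(1024)    # server to client, same connection
        worker.join()
    return reply

print(tcp_round_trip(b"hello"))
```

Both directions use the same connection, which is what full-duplex communication means here.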
The User Datagram Protocol (UDP) is in some ways like TCP: it builds on IP and it provides a port
concept. However, it is unlike TCP in that there is no connection concept and no transmission
guarantee. The advantage of UDP over TCP is that it is lightweight and therefore faster, which
makes it well suited to one-off messages.
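For comparison, a one-off UDP message needs no connection set-up at all. This is a minimal sketch
on the loopback address; the function name is chosen here, and the two-second timeout is an
arbitrary safety margin.

```python
import socket

def udp_one_shot(message):
    """Send one datagram to a local receiver and return what it received."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        receiver.settimeout(2.0)          # UDP gives no delivery guarantee
        port = receiver.getsockname()[1]
        # No connect/accept step: the datagram is simply addressed and sent.
        sender.sendto(message, ("127.0.0.1", port))
        data, _ = receiver.recvfrom(1024)
    return data

print(udp_one_shot(b"ping"))
```

On a real network the datagram could be lost without either side being told, which is why the
receiver uses a timeout instead of waiting forever.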
The Domain Name System (DNS) is the ‘phone book’ of the internet. It is basically a map between
host names and IP addresses. DNS often uses UDP for communication. Host names are labels
separated by dots, for example: www.example.org. The final label is the top-level domain. There are
all kinds of top-level domains: generic (.com, .org, etc.), country-code (.us, .nl, etc.) and many
more. The Internet Corporation for Assigned Names and Numbers (ICANN) is responsible for managing
the top-level domains.
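The ‘phone book’ idea and the structure of host names can be sketched with a toy lookup table.
This does not contact any DNS server: the table, the function names and the addresses (taken from
the 192.0.2.0/24 range reserved for documentation) are all invented for the example.

```python
# Toy 'phone book': invented placeholder addresses, not real DNS records.
PHONE_BOOK = {
    "www.example.org": "192.0.2.10",
    "mail.example.org": "192.0.2.25",
}

def top_level_domain(host_name):
    """Return the final dot-separated label of a host name."""
    return host_name.split(".")[-1]

def lookup(host_name):
    """Map a host name to an IP address, or None if it is unknown."""
    # Host names are case-insensitive, so normalize before looking up.
    return PHONE_BOOK.get(host_name.lower())

print(top_level_domain("www.example.org"))  # org
print(lookup("WWW.example.org"))            # 192.0.2.10
```

On a machine with network access, socket.gethostbyname("www.example.org") from the standard
library performs a real lookup through the system's resolver.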
There are many higher-level protocols, and many of these build on TCP. (An analogy to the
telephone: TCP specifies how we initiate and terminate the phone call, but some other protocol
specifies how we carry on the actual conversation.) In many network applications, the two pieces of
software communicating over the network play very different roles: the piece that takes the
initiative is called the client, and the piece that reacts is called the server.
World Wide Web
The World Wide Web was invented at CERN in 1989 by Tim Berners-Lee. Its goal was to simplify the
sharing of research results over the internet. Originally, it was one of several systems for organizing
internet-based information. However, the distinctive feature of the WWW was the support for
hypertext: you could locate documents using Uniform Resource Locators (URL), there was document
representation using HyperText Markup Language (HTML), and there was communication via
HyperText Transfer Protocol (HTTP).
The Web is the collection of machines (web servers) on the internet that provide information,
particularly HTML documents, via HTTP. Machines that access information on the Web are known as
web clients. A web browser is software used by an end user to access the Web. The protocols of the
Web are simple, open (patent-free) and extensible.
There are two types of Uniform Resource Identifiers (URIs). A Uniform Resource Name (URN) identifies
a resource by a unique name (just as books have unique ISBNs); the scheme for this is urn. A
Uniform Resource Locator (URL) specifies the location at which a resource can be found. In addition
to http, some other URL schemes are https, ftp, mailto and file.
For example, take the following HTTP URL:
http://www.example.org:56789/a/b/c.txt?t=win&s=chess#para5. After the scheme (http), the host name
(www.example.org) and the port number (56789) together form the authority. This is followed by a
path (/a/b/c.txt) and a query (t=win&s=chess), which together form the Request-URI. The last part
(para5) is the fragment: it identifies a part of the page, and determines which part of the page is
initially shown.
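The parts of the example URL can be pulled apart with urlsplit from Python's standard urllib.parse
module. This is a small sketch; the variable names are chosen here, and joining path and query into
request_uri follows the description above.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.example.org:56789/a/b/c.txt?t=win&s=chess#para5"
parts = urlsplit(url)

print(parts.scheme)    # http
print(parts.netloc)    # www.example.org:56789  (the authority)
print(parts.hostname)  # www.example.org
print(parts.port)      # 56789
print(parts.path)      # /a/b/c.txt
print(parts.query)     # t=win&s=chess
print(parts.fragment)  # para5

# Path and query together: the part that is sent to the server.
request_uri = parts.path + "?" + parts.query

# The query decoded into name/value pairs.
query_values = parse_qs(parts.query)
print(query_values)    # {'t': ['win'], 's': ['chess']}
```

The fragment is not part of the Request-URI: it is handled by the browser after the document has
been retrieved.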
Content Markup: HTML
Web documents consist of the actual content (e.g. text, video), and markup for interpreting the
content. Markup helps web browsers and other web tools to render the content in the intended way.
Content markup indicates what part a piece of content plays in the document, e.g.: section,
paragraph, figure, or table. This is also sometimes called ‘semantic’ markup. HTML is the content-
markup language used on the Web.
HTML was a crucial part of the Web architecture designed by Tim Berners-Lee. It was non-proprietary
and had a non-binary (text) format. It has gone through several cycles of development: for
example, a web page was once all about text, whereas now it is more and more about multimedia.
HTML markup consists of tags. Any string of the form <...> is a tag: <head> is a start tag, </head>
is an end tag. Tags are treated specially by the browser. A tag marks a document element.
Everything from a start tag to the matching end tag, including the tags themselves, is an element.
The content of an element excludes its start and end tags. Any non-markup text is called character
data.
HTML document elements form a tree. The <html> element contains both the <head> and <body>
elements. Inside each of those, other kinds of elements can be found. Drawing every element
underneath the element that contains it gives a tree with <html> at the root.
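The tree structure can be made visible by indenting each element under its parent. The sketch below
uses HTMLParser from Python's standard library; the class and function names and the sample
document are made up, and the code assumes that every start tag has a matching end tag.

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Collect element names, indented by their depth in the document tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1          # children of this element are one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1          # back to the level of the parent

def outline(html_text):
    printer = TreePrinter()
    printer.feed(html_text)
    printer.close()
    return printer.lines

doc = ("<html><head><title>Hi</title></head>"
       "<body><h1>Hi</h1><p>Text</p></body></html>")
print("\n".join(outline(doc)))
```

For the sample document this prints html at the left margin, head and body one level in, and
title, h1 and p two levels in.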
The <head> element specifies the document head: it contains meta information about the document
(author, title, style used etc.). The <body> element specifies the document body and contains the
actual document content.
In HTML, there are four white space characters: carriage return, line feed, space and horizontal tab.
Normally, character data is normalized: all white space is converted to space characters, leading and
trailing spaces are trimmed, and multiple consecutive space characters are replaced by a single space
character.
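The normalization steps can be sketched in a few lines. This is an illustration of the rule as
stated above, not of what any particular browser does internally; the function name is chosen here.

```python
import re

# The four HTML white space characters: space, tab, carriage return, line feed.
HTML_WHITE_SPACE = re.compile(r"[ \t\r\n]+")

def normalize_character_data(text):
    """Collapse each run of white space to one space and trim both ends."""
    return HTML_WHITE_SPACE.sub(" ", text).strip(" ")

print(normalize_character_data("  Hello,\r\n\tweb   world \n"))  # Hello, web world
```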
Of course, HTML documents can contain errors. Browsers ignore tags with, for example, unknown or
misspelled element names. This implies that an HTML document may have errors, even if it
displays properly.
There are also some problems with HTML: since < marks the beginning of a tag, how do you include a
< in an HTML document? The solution is simple: you use markup known as a reference. There are
two types of references: the entity reference, which specifies a character by an HTML-defined name
(&lt; for <, &amp; for &), and the character reference, which specifies a character by its Unicode
code point (e.g. &#60; for <).
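Both kinds of reference can be tried out with Python's standard html module, which converts between
literal characters and references. The sample strings are made up for the illustration.

```python
import html

# Replace the special characters by entity references.
escaped = html.escape("a < b & c")
print(escaped)     # a &lt; b &amp; c

# Resolve entity references (&lt;, &gt;) and character references
# (&#60; is decimal, &#x3C; is hexadecimal; both are the code point of <).
unescaped = html.unescape("&lt;p&gt; &#60; &#x3C;")
print(unescaped)   # <p> < <
```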
HTML5 introduced new semantic elements. These do not contribute to the content itself, but make it
clearer to tools processing the content how to interpret or present it. Examples are the section
element (a document section, typically with a heading), the article element (an independent part,
e.g. a blog entry), the aside element (a note on the side, not central to the content) and the nav
element (content used for document navigation). Furthermore, the class attribute can be used to
define custom classes of HTML elements, which can, for example, be given class-specific
presentation styles. This means that some paragraphs on a web page can look entirely different from
other paragraphs.
An element that does not start a new HTML block (e.g. a paragraph) is called an inline element; an
element that does is called a block element. So, HTML elements such as <div>, <p> and <h1> are
called block elements, while elements such as <b>, <span> and <code> are called inline elements.
Style Markup: CSS
Style markup indicates how the content should be presented to users. The presentation format can
vary: some people use a large screen (desktop/notebook), others use a mobile phone etc. Cascading
Style Sheets (CSS) is the language used for style markup of Web documents. In the past, HTML was
also used for style markup, but this is now considered bad practice.
CSS is a special-purpose language for specifying the presentation styles of HTML elements.
Browsers have built-in style sheets for default presentation.
A style rule consists of three parts: the selector, which specifies the HTML element(s) to which
the rule applies, the property, which is a style property of the selected element (e.g.
text-align), and the value, which is the value of the style property (e.g. justify).
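The three parts can be made explicit by taking a rule apart. The following is a deliberately naive
sketch that only handles one simple rule of the form selector { property: value; ... }; it is not a
real CSS parser, and the function name and sample rule are chosen here.

```python
def parse_rule(rule_text):
    """Split one simple CSS rule into its selector and property/value pairs."""
    selector, _, rest = rule_text.partition("{")
    body = rest.rsplit("}", 1)[0]
    declarations = {}
    for declaration in body.split(";"):
        if ":" in declaration:
            prop, _, value = declaration.partition(":")
            declarations[prop.strip()] = value.strip()
    return selector.strip(), declarations

selector, declarations = parse_rule("p { text-align: justify; color: navy }")
print(selector)       # p
print(declarations)   # {'text-align': 'justify', 'color': 'navy'}
```

Here p is the selector, text-align and color are properties, and justify and navy are their values.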
Style can be defined in three ways. The first is to define the style in a separate document and
link to that document from the HTML head, commonly by putting a <link> to a CSS file in the head
element. The other two ways are inside the HTML file itself: you can put style rules inside a
<style> element in the document head, or you can define a style attribute on an individual HTML
element. You should only use the last option for a one-off case.
The cascade decides which style rule applies when several rules compete. Competing rules are first
compared by priority level. Ordered from lowest to highest, the levels are: user agent CSS rules
(e.g. the browser defaults), user CSS rules, author CSS rules, ‘important’ author CSS rules, and
‘important’ user CSS rules. Among rules of the same level, the more specific rule takes
precedence. If both have the same specificity, the one that comes last is preferred.
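Choosing the winner among competing rules can be sketched as a comparison on three keys: priority
level, specificity, and order of appearance. This is a simplified model: real CSS specificity is
computed from the selector and is not a single number, and all names below are made up for the
illustration.

```python
from dataclasses import dataclass

# Priority levels from lowest (0) to highest (4), keyed by (origin, important).
ORIGIN_PRIORITY = {
    ("user agent", False): 0,
    ("user", False): 1,
    ("author", False): 2,
    ("author", True): 3,
    ("user", True): 4,
}

@dataclass
class Declaration:
    value: str
    origin: str        # "user agent", "user" or "author"
    important: bool
    specificity: int   # simplification: one number instead of a real specificity
    order: int         # position in the style sheets; later means larger

def winning_declaration(declarations):
    """Return the declaration that wins the cascade."""
    return max(
        declarations,
        key=lambda d: (ORIGIN_PRIORITY[(d.origin, d.important)],
                       d.specificity,
                       d.order),
    )

competing = [
    Declaration("black", "user agent", False, 1, 0),
    Declaration("navy", "author", False, 1, 1),
    Declaration("red", "author", False, 10, 2),
    Declaration("green", "user", True, 1, 3),
]
print(winning_declaration(competing).value)       # green
print(winning_declaration(competing[:3]).value)   # red
```

The ‘important’ user rule wins even though its specificity is low; without it, the more specific
of the two author rules wins.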
CSS also allows the specification of styles for different media (e.g. screen, speech synthesizer,
printer). This can be done in different style sheets, but also within one style sheet (with @media
blocks).
The CSS box model (see the figure) wraps the content of each element in padding, a border and a
margin, from the inside out.
In normal flow processing, each displayed element has a corresponding box. The box of the html
element is called the “initial containing block” and corresponds with the entire document. The
boxes of child elements are contained in the box of their parent. Sibling block elements are
stacked vertically, one below the other, and sibling inline elements are placed one after another
on a line.
HTML and CSS are different languages with different syntaxes. They also have different ways of
including comments: in HTML, a comment looks like this: <!-- this is a comment -->, while a CSS
comment looks like this: /* this is a comment */. There is also a clear separation of roles: HTML
is for content, and CSS is for style. This was less true in the past, when HTML was also used for
style markup. Semantic markup additionally helps web tools to behave smarter.