Lecture 1 (6 Feb)
The internet vs the web
● the internet → provides the underlying infrastructure, the global network of
interconnected computers, communicating over standardized protocols (IP)
● the web → an application, a system of interlinked hypertext documents, for sharing
data and information over the internet
○ the web is document-centric
○ hyperlinks
○ most of it makes sense to humans, assuming they speak the language
○ it makes even less sense to machines
○ machines that “understand” the web?
Searle’s Chinese room:
● e.g. ChatGPT
○ you give a prompt and ChatGPT gives a response
Searle’s Chinese room (natural language):
● multiple names, one thing…
○ e.g. Ireland, IE, Irlanda, Rep. of Ireland
● one name, multiple things…
○ e.g. Dublin (a city and a pub)
● multiple ways to say the same thing…
○ e.g. Dublin’s population is one million / Dublin’s population is 1,000,000
● multiple meanings for the same saying…
○ e.g. Sherlock saw the man using binoculars
● not saying what is meant…
○ e.g. It’s raining cats and dogs
What if we could “structure” everything?
● one symbol, one meaning…
● one (simple) way to say one thing…
Semantic web → data, logic, query, output
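The "data, logic, query, output" pipeline can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Semantic Web systems are actually implemented: the triple format, the predicate names (capitalOf, locatedIn, partOf) and the single inference rule are all assumptions made up for this sketch.

```python
# Data: facts as (subject, predicate, object) triples.
# All identifiers below are made up for illustration.
data = {
    ("Dublin", "capitalOf", "Ireland"),
    ("Dublin", "population", 1_000_000),
    ("Ireland", "partOf", "Europe"),
}

def infer(triples):
    """Logic: one hypothetical rule, a capital of X is located in X."""
    derived = {(s, "locatedIn", o) for (s, p, o) in triples if p == "capitalOf"}
    return triples | derived

def query(triples, subject=None, predicate=None, obj=None):
    """Query: return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        (t for t in triples
         if (subject is None or t[0] == subject)
         and (predicate is None or t[1] == predicate)
         and (obj is None or t[2] == obj)),
        key=str,
    )

# Output: the answer to "where is Dublin located?"
answers = query(infer(data), subject="Dublin", predicate="locatedIn")
print(answers)
```

Because every fact uses one symbol per thing and one simple shape, the machine can answer a question whose answer was never stated directly in the data.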
The semantic web
● “The Semantic Web will bring structure to the meaningful content of Web pages,
creating an environment where software agents roaming from page to page can
readily carry out sophisticated tasks for users.”
○ The semantic web is hidden within the web
○ Wikidata → a Wikipedia for data
○ problem 1 → different language versions of Wikipedia are manually edited by users
○ problem 2 → complex lists of things are manually edited by users
○ solution → Wikidata
○ use cases:
■ info-boxes
■ quality checks
■ doing a report for university
■ query service
● SPARQL query
○ Used in applications like:
■ Siri
■ Google’s Knowledge Panel
■ Using Semantic web knowledge-bases
■ Google’s Rich Snippets
○ Publishers add structured data
○ JSON-LD (Schema.org)
○ X (Twitter) Cards
○ Facebook - Open Graph
○ Semantic web is broadly adopted
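As a rough sketch of what "publishers add structured data" looks like for JSON-LD with Schema.org, the snippet below builds a small description and wraps it in the script element that is commonly embedded in a page. The event, its name, date and place are invented for illustration; only the Schema.org terms (Event, Place, name, startDate, location, address) are real vocabulary.

```python
import json

# A hypothetical event description; all values are made up for illustration.
event = {
    "@context": "https://schema.org",
    "@type": "Event",
    "name": "Example Concert",
    "startDate": "2025-06-01T20:00",
    "location": {
        "@type": "Place",
        "name": "Example Hall",
        "address": "Dublin, Ireland",
    },
}

# Publishers typically embed the JSON-LD in the page inside a script element.
snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(event, indent=2)
    + "\n</script>"
)
print(snippet)
```

The human-readable page stays as it is; the structured description sits next to it for machines to read.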
Chapter 1: Introduction
The Latent web
● there are webpages available that explicitly state information
● however, a lot of information is left implicit on the Web. Acquiring this sort of
information often requires much more work
● Such implicit information is often quite specific → there is not a lot of demand for that precise information
● The lack of automated methods to combine and process information from various
webpages also implies costs for the publisher of content, since it encourages high
levels of redundancy in order to make information available to users on a single
webpage in the language of their preference
● Given that machines are unable to automatically find, process and adapt information
to a particular user’s needs, publishers often replicate the same
information across different webpages for the convenience of users
● Given that the content of the Web is primarily human readable, machines cannot
piece together information from multiple sources
● this in turn puts the burden on users to manually integrate the information they need
from various webpages, and conversely, on publishers to redundantly package the
same information into different individual webpages to meet the most common
demands of (potential) users of the website
● latent web → a way to refer to the sum of the factual information that cannot be
gained from a single webpage accessible to users, but that can only be concluded
from multiple webpages
The current Web
● The web is predicated on agreement
○ first form of agreement on the Web relates to the protocol called Hypertext
Transfer Protocol (HTTP) used to request and send documents
○ second form of agreement relates to how documents can be identified and
located, which is enabled through the Uniform Resource Locator (URL)
specification and other related concepts
○ third form of agreement relates to how the content of webpages should be
specified, which is codified by the Hypertext Markup Language (HTML)
specification
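The first two agreements (HTTP and URL) can be illustrated with Python's standard library. This is a minimal sketch: the URL is a made-up example, and the request text is only printed, not sent over the network.

```python
from urllib.parse import urlsplit

# A hypothetical URL, used only for illustration.
url = "http://example.org/docs/index.html?lang=en#intro"
parts = urlsplit(url)

print(parts.scheme)    # which protocol to use
print(parts.netloc)    # which server to contact
print(parts.path)      # which document on that server
print(parts.query)     # extra parameters for the server
print(parts.fragment)  # position inside the document (handled by the client)

# The text of a minimal HTTP/1.1 GET request for that document.
request = (
    f"GET {parts.path}?{parts.query} HTTP/1.1\r\n"
    f"Host: {parts.netloc}\r\n"
    "\r\n"
)
print(request)
```

Because client and server agree on these formats, any browser can request a document from any web server.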
Hypertext Markup Language (HTML)
● HTML documents use a lightweight and broadly agreed-upon SYNTAX, MODEL and
SEMANTICS to communicate rendering instructions to machines, conveying how the
author of the document intends the page to be displayed in a browser on the client
side
○ SYNTAX → involves use of, for example, angle brackets and slashes to
indicate tags, such as <title>, that are not part of the primary content
○ MODEL → is tree-based, allowing elements to be nested inside other
elements
■ child → directly nested within
■ ancestor → recursively nested within
○ SEMANTICS → is hard-coded into a specification for developers to follow,
which states what each element means (e.g. that <title> gives the title of the document).
■ browser developers can then read the documentation and
hard-code the interpretation of these semantics into their engines
○ content of the Web is decentralized → links are of fundamental importance for
recommending, connecting, locating and traversing webpages in an ad hoc
manner, weaving HTML documents into a Web.
○ HTML documents are machine readable, but in a limited sense → a machine
can automatically interpret and act upon the content of these documents, but
only for displaying the document and supporting its links.
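The tree-based model and the limited machine readability described above can be sketched with Python's built-in html.parser. The page content and the Outline class are made up for this illustration; the sketch assumes well-nested tags and no void elements such as <br>.

```python
from html.parser import HTMLParser

# A made-up page, used only for illustration.
PAGE = """<html>
<head><title>Dublin</title></head>
<body><p>Dublin is the capital of <a href="/ireland">Ireland</a>.</p></body>
</html>"""

class Outline(HTMLParser):
    """Records each element together with its ancestors, plus link targets."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements (the ancestors)
        self.paths = []   # one "html/body/p" style path per element
        self.links = []   # href values of <a> elements

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.append("/".join(self.stack))
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

outline = Outline()
outline.feed(PAGE)
print(outline.paths)
print(outline.links)
```

The machine recovers the nesting (title is a child of head, a has p and body as ancestors) and the link target, but nothing in this structure tells it that Dublin is a capital city.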
Interpreting HTML Content
● The primary content of a typical Web document is still trapped in a format intended
for human consumption → the bulk of information on the Web is still opaque to
machines.
● In order to organize the content of such HTML webpages, we could instruct a
machine to parse out individual words between spaces and index which words appear in
which documents
● principles upon which modern search engines are based → inverted indexes that
map words to the documents, relevance measures based on the density of query
terms in each such document compared to the average density, and importance
measures such as how well-linked a document is.
● problems that machines face:
○ there are many ways to express equivalent information
○ the same referent can have multiple possible references
○ different referents may share the same name
○ many words and phrases that are written the same way have multiple
meanings
○ other words may have subtly different meanings in everyday language
○ information may be split over multiple clauses that use references such as
pronouns that may be difficult to resolve
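A minimal sketch of the word-based indexing described above, which also shows two of the listed problems. The three documents are invented for illustration, and the relevance measure is simplified to the plain density of the term in each document (it does not compare against the average density or take links into account).

```python
import re
from collections import defaultdict

# Three made-up documents, used only for illustration.
docs = {
    "d1": "Dublin is the capital of Ireland",
    "d2": "The Dublin is a pub with live music",
    "d3": "The population of the Irish capital is one million",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

# Inverted index: word -> set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in tokenize(text):
        index[word].add(doc_id)

def density(term, doc_id):
    """Share of the document's words that equal the term."""
    words = tokenize(docs[doc_id])
    return words.count(term) / len(words)

def search(term):
    """Documents containing the term, densest first."""
    term = term.lower()
    hits = index.get(term, set())
    return sorted(hits, key=lambda d: (-density(term, d), d))

print(search("Dublin"))
print(search("capital"))
```

Searching for "Dublin" returns d2, which is about a pub (different referents sharing a name), and misses d3, which is about the city but calls it "the Irish capital" (one referent with multiple references). Matching words is not the same as understanding them.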