Web Data Processing Systems
Created @October 21, 2020 3:02 PM
Class S2
Type S2
Materials
Lecture 1 (27 Oct 2020)
Introduction to course
Goals
This course is system-oriented. We familiarize ourselves with some of the topics that drive research on the Web,
with a focus on two main themes:
1. Knowledge acquisition from the Web
2. Knowledge consumption
Lecture 2 (27 Oct)
Introduction to Knowledge Bases
limits of text
Information retrieval is the field of CS concerned with retrieving a relevant subset of documents from a large collection.
Traditionally, retrieval was keyword-based (search engines).
What are good string similarities?
What are good criteria for ranking?
How can we diversify the results?
Recently, retrieval has shifted from keyword-based to entity-based.
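As a rough illustration of the keyword-based setting, the sketch below ranks documents by the Jaccard similarity between query terms and document terms. The helper names (`terms`, `jaccard`, `rank`) and the sample documents are assumptions made for this example; real search engines use far more refined similarity and ranking criteria.

```python
def terms(text):
    """Lowercase and split on whitespace; a deliberately naive term extractor."""
    return text.lower().split()


def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank(query, documents):
    """Return (score, document) pairs sorted by decreasing similarity."""
    q = terms(query)
    scored = [(jaccard(q, terms(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical mini-collection, for illustration only.
docs = [
    "amsterdam is the capital of the netherlands",
    "the web is a large collection of documents",
    "knowledge bases describe entities and relations",
]
results = rank("capital of the netherlands", docs)
print(results[0][1])
```

The ranking only compares strings, so it cannot tell that the query is about a specific entity; that gap is what motivates entity-based retrieval.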
Text documents contain data (Latin for something that is given). Knowledge, however, is familiarity with,
awareness of, or understanding of someone or something, such as facts, information, descriptions or skills.
In essence, to implement the vision of entity-based search we need to build knowledge repositories:
we want to move away from collections of data/text and build repositories of knowledge instead.
How are knowledge repositories built? Two methods:
Manifest knowledge
Meaning is accessible to humans. These repositories are called knowledge bases or knowledge
graphs, so their content can be described using a graph.
Typically constructed manually or from unstructured sources.
For this we need a language we understand, so the less ambiguous the better. Logic is the language that
humans designed to express knowledge.
Opinion: knowledge is something we can interpret without ambiguity.
knowledge base: crystallization of factual knowledge in the form of associations between
entities and relations.
Can be expressed in first-order logic.
Recently Google re-branded knowledge bases as knowledge graphs (= the same as a knowledge base, but
represented as a graph).
Latent knowledge
Others argue that we do not need to understand the repository, as long as the knowledge can be
processed by machines and used for tasks that require intelligence. The meaning is hidden from us. Such
repositories are called latent models or latent feature models.
Typically learned using machine learning techniques.
Recently, latent models have become very popular due to the rise of deep learning.
Covered in more detail under statistical inference.
Which is better? Opinions differ: for some, everything can be learned; for others, knowledge is only
what can be understood.
knowledge bases available on the web
WordNet (NLP-oriented): the most popular lexical database for English.
Idea: create a knowledge base of word meanings. Words are grouped into sets of synonyms (synsets).
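The synset idea can be sketched with a plain dictionary. The structure below is a toy stand-in, not WordNet's actual data or API: the identifiers and word lists are hypothetical and only show that one word can belong to several synsets, one per meaning.

```python
# Hypothetical miniature lexical database: synset id -> set of synonymous words.
SYNSETS = {
    "car.n.01": {"car", "auto", "automobile"},
    "car.n.02": {"car", "railcar"},
    "bank.n.01": {"bank", "riverbank"},
}


def synsets_of(word):
    """Return the ids of all synsets (meanings) that contain the word."""
    return sorted(sid for sid, words in SYNSETS.items() if word in words)


def synonyms_of(word):
    """Return every word that shares at least one synset with the given word."""
    result = set()
    for words in SYNSETS.values():
        if word in words:
            result |= words
    result.discard(word)
    return result


print(synsets_of("car"))
print(sorted(synonyms_of("car")))
```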
two important languages related to knowledge bases
RDF
Standardized language for exchanging knowledge on the Web.
RDF is a standard for statements that describe properties of resources.
Properties are represented by IRIs, while resources are either IRIs, literals, or special
placeholders called blank nodes (used when the resource is unknown).
Statements can be represented as triples of the form (subject, predicate, object) and serialized
in different formats: RDF/XML, N3, Turtle.
An RDF dataset can be represented as a directed graph.
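A minimal sketch of these ideas in plain Python, without an RDF library: statements are stored as (subject, predicate, object) tuples, printed as simple Turtle-style lines, and indexed as a directed labelled graph. The `http://example.org/` namespace, the property names and the helper functions are assumptions made for illustration only.

```python
from collections import defaultdict

EX = "http://example.org/"  # hypothetical namespace, for illustration only

# Each statement is a (subject, predicate, object) triple.
triples = [
    (EX + "Amsterdam", EX + "capitalOf", EX + "Netherlands"),
    (EX + "Amsterdam", EX + "label", '"Amsterdam"'),    # literal object
    (EX + "Netherlands", EX + "hasMonarch", "_:b0"),     # blank node: unknown resource
]


def term(t):
    """Format a term: literals and blank nodes stay as-is, IRIs get angle brackets."""
    if t.startswith('"') or t.startswith("_:"):
        return t
    return f"<{t}>"


def serialize(statements):
    """Emit one 'subject predicate object .' line per triple."""
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in statements)


def as_graph(statements):
    """Directed labelled graph: subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for s, p, o in statements:
        graph[s].append((p, o))
    return graph


graph = as_graph(triples)
print(serialize(triples))
```

Each triple becomes one edge from the subject node to the object node, labelled with the predicate, which is why an RDF dataset can be read as a directed graph.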
SPARQL
Another language standardized by the W3C.
SPARQL is a query language with an SQL-inspired syntax. Finding the
answers to a SPARQL query corresponds to finding all possible graph homomorphisms from the
query to the graph (= the knowledge base).
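The homomorphism view can be sketched as a small backtracking matcher over triple patterns. This is not a SPARQL engine; it only illustrates, under the assumption that variables are strings starting with `?`, how each answer is a mapping of query variables to graph nodes that sends every query triple onto a triple in the graph. The function names and the toy graph are hypothetical.

```python
def is_var(term):
    """Variables are written with a leading question mark, as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` maps onto `triple`; None on failure."""
    new = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check consistency with an earlier binding.
            if new.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new


def evaluate(patterns, graph, binding=None):
    """Yield every variable binding that maps all patterns into the graph."""
    binding = {} if binding is None else binding
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        extended = match_pattern(first, triple, binding)
        if extended is not None:
            yield from evaluate(rest, graph, extended)


# Toy knowledge base and a two-pattern query: who knows someone who knows someone?
graph = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "knows", "carol"),
]
query = [("?x", "knows", "?y"), ("?y", "knows", "?z")]
answers = list(evaluate(query, graph))
print(answers)
```

Because the mapping is a homomorphism rather than an isomorphism, two different variables are allowed to map to the same node.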
The most important knowledge base right now is Wikidata. It was created as the structured-data counterpart of Wikipedia,
because Wikipedia was mainly text, which is difficult to verify and keep consistent.
Data is validated by the community
Keeps provenance (= origin) of the data
Multilingual by design
Supports plurality
high quality knowledge
DBpedia:
Project to convert Wikipedia pages into RDF,
Contains links to other KBs
Fairly large ontology but not rich in terms of expressiveness
YAGO
Unifies Wikipedia and WordNet
Exploits Wikipedia infoboxes to extract clean facts
Checks the plausibility of facts via type checking
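Plausibility checking via types can be sketched as follows. This is not YAGO's actual implementation: the entity types, the relation signature and the function name are hypothetical, and the sketch only shows the general idea of rejecting a candidate fact when its subject or object has the wrong type for the relation.

```python
# Hypothetical type assignments and relation signatures, for illustration only.
ENTITY_TYPES = {
    "Albert_Einstein": {"person"},
    "Ulm": {"city"},
    "Physics": {"field"},
}
RELATION_SIGNATURES = {
    "bornIn": ("person", "city"),  # (required subject type, required object type)
}


def is_plausible(fact):
    """Accept a (subject, relation, object) fact only if both types fit the relation."""
    subject, relation, obj = fact
    signature = RELATION_SIGNATURES.get(relation)
    if signature is None:
        return False  # unknown relation: cannot be checked, so reject
    domain, range_ = signature
    return (domain in ENTITY_TYPES.get(subject, set())
            and range_ in ENTITY_TYPES.get(obj, set()))


candidates = [
    ("Albert_Einstein", "bornIn", "Ulm"),
    ("Albert_Einstein", "bornIn", "Physics"),  # wrong object type, filtered out
]
clean = [fact for fact in candidates if is_plausible(fact)]
print(clean)
```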
Freebase: discontinued
Lecture 3 (1 Nov)
Knowledge Acquisition
The process of extracting knowledge (to be integrated into knowledge bases) from unstructured
text or other data.
In this course we only look at extracting entities and the relations between them from unstructured data.
Another important form of extraction consists of detecting events and other temporal expressions. We will not
cover it.
NLP Preprocessing
NLP : natural language processing
Typical distinction:
structured data: "databases"
unstructured data: "information retrieval", typically refers to "text"
Semi-structured data: in practice text always has some structure, such as titles and bullets.
Before we can use a text, we must pre-process it. Standard tasks are:
tokenization
Goal
Given a character sequence, split it into subsequences called tokens.
Tokens are often loosely referred to as terms/words.
Type vs Token
Token: an instance of a sequence of characters in some particular document that are grouped
together as a useful semantic unit ⇒ multiset
Type: the class of all tokens containing the same character sequence ⇒ set
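The token/type distinction can be made concrete with a short sketch. The tokenizer below is an assumption (lowercasing plus a simple `\w+` regular expression); real tokenizers handle punctuation, hyphens and language-specific rules much more carefully.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a character sequence into lowercase word tokens (naive regex-based)."""
    return re.findall(r"\w+", text.lower())


text = "To be or not to be"
tokens = tokenize(text)          # every occurrence counts
token_counts = Counter(tokens)   # multiset of tokens
types = set(tokens)              # set of distinct character sequences

print(len(tokens), len(types))
```

Here the six tokens collapse into four types, because "to" and "be" each occur twice.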