A brief, clearly structured summary of all lectures (1 to 12) of the Web Data Processing Systems course at VU Amsterdam, with relevant images where necessary.
Knowledge bases
Information Retrieval was originally based on keywords; modern systems increasingly operate on entities.
Symbolic Knowledge Bases (KBs)
● Meaning accessible to humans
● Constructed manually or from unstructured sources
● Can be expressed using first-order logic (knowledge graphs)
Latent Models
● Meaning is hidden
● Learned using machine learning techniques
● Prominent example: Google’s word2vec
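For illustration, a minimal sketch of training latent word vectors with the gensim library (gensim and the toy corpus are assumptions, not part of the lecture material):

```python
# Minimal word2vec sketch using gensim (an assumption, not course material).
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [["king", "rules", "the", "kingdom"],
             ["queen", "rules", "the", "kingdom"],
             ["the", "dog", "chases", "the", "cat"]]

# The "meaning" is hidden in dense vectors learned from co-occurrence.
model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, epochs=100)
print(model.wv.most_similar("king", topn=2))  # similarity is geometric
```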
RDF (Resource Description Framework)
● Standard for making statements that describe properties of resources
● Statements are represented as triples of the form <s p o> (subject predicate object) and can be
serialized in different formats (RDF/XML, N3, Turtle)
● An RDF dataset can be represented as a directed graph
● SPARQL is used to query RDF databases (inspired by SQL)
○ Finding answers to a query corresponds to finding all possible graph homomorphisms
between the query and the graph
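A minimal sketch of both ideas using the rdflib Python library (the library, the namespace, and the facts are illustrative assumptions):

```python
# Build a tiny RDF graph of <s p o> triples and query it with SPARQL,
# using rdflib (an assumption; any RDF store would do).
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()
g.add((EX.Amsterdam, EX.capitalOf, EX.Netherlands))  # one <s p o> triple
g.add((EX.Berlin, EX.capitalOf, EX.Germany))

# The query pattern is matched against the graph (graph homomorphisms).
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?city ?country WHERE { ?city ex:capitalOf ?country . }
""")
for city, country in results:
    print(city, country)
```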
Knowledge bases on the web
WordNet
● Groups words into sets of synonyms called synsets.
● Words can be monosemous (one meaning) or polysemous (multiple meanings)
● Each synset has a gloss (short description) and is connected to other synsets using relations. Most
important:
○ Hypernyms/Hyponyms (isA)
○ Meronyms/Holonyms (partOf)
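A minimal sketch of these concepts via NLTK's WordNet interface (assumes nltk is installed and the WordNet corpus has been downloaded):

```python
# Explore synsets, glosses, and relations with NLTK's WordNet API
# (assumes: pip install nltk; nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

# "bank" is polysemous: it appears in several synsets.
for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "-", synset.definition())  # the gloss
    print("  hypernyms:", synset.hypernyms())       # isA relation
    print("  meronyms: ", synset.part_meronyms())   # partOf relation
```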
DBpedia
● Project to convert Wikipedia pages to RDF
● Uses structured data on the pages
● Contains links to other KBs (widely popular in the “linked data cloud”)
● Fairly large ontology but not rich in terms of expressiveness
● Alignment between infoboxes and ontologies is done via community-provided mappings
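A minimal sketch of querying DBpedia's public SPARQL endpoint with the SPARQLWrapper library (the library and endpoint URL are assumptions, not from the notes):

```python
# Query the public DBpedia endpoint with SPARQLWrapper (an assumption).
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?abstract WHERE {
      <http://dbpedia.org/resource/Amsterdam> dbo:abstract ?abstract .
      FILTER (lang(?abstract) = "en")
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["abstract"]["value"][:100])
```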
YAGO (Yet Another Great Ontology)
● Goals:
○ Unify Wikipedia and WordNet
○ Extract clean facts
○ Check plausibility of facts via type checking
● High standard in terms of quality
Freebase
● Collaborative knowledge base by its community
● Acquired by Google, but shut down in 2014
Wikidata
● Wikipedia is mainly text → hard to verify and keep consistent
● “Data version” of Wikipedia
○ Validated by community
○ Keeps provenance of the data
○ Multilingual
○ Supports plurality
● High quality knowledge
Natural Language Processing (NLP)
Knowledge acquisition: the process of extracting knowledge (to be integrated into knowledge bases)
from unstructured text or other data
Preprocessing
Tokenization
Split sequence into tokens (terms/words)
● Token: instance of a sequence of characters in some particular document that are grouped
together as a useful semantic unit
● Type: class of all tokens containing the same character sequence
● Example: “A rose is a rose is a rose”
○ Tokens: 8
○ Types: 3 ({a, is, rose})
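The counts above can be reproduced with a naive whitespace tokenizer (a sketch; real tokenizers must handle the problems listed below):

```python
# Naive whitespace tokenization illustrating tokens vs. types.
text = "A rose is a rose is a rose"
tokens = text.lower().split()     # 8 token instances (after case-folding)
types = set(tokens)               # 3 distinct types
print(len(tokens), tokens)
print(len(types), sorted(types))  # ['a', 'is', 'rose']
```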
Queries and documents have to be preprocessed identically, since the preprocessing determines which queries match which documents.
Problems:
● Hyphens (Co-education, drag-and-drop)
● Names (San Francisco, Los Angeles)
● Language (compound nouns in German vs. separate nouns in English)
Lemmatization
Goal: reduce words to base form (Lemma; as defined in dictionary)
● Am, are, be, is → be
● Car, cars, car’s, cars’ → car
Stemming
Goal: reduce words to their “roots”
● Are → ar
● Automate, automates, automatic, automation → automat
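A minimal sketch contrasting lemmatization and stemming with NLTK's WordNetLemmatizer and PorterStemmer (NLTK is an assumption; exact stems vary per algorithm):

```python
# Lemmatization vs. stemming with NLTK (assumes nltk + WordNet corpus).
from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

print(lemmatizer.lemmatize("cars"))         # -> car (dictionary base form)
print(lemmatizer.lemmatize("is", pos="v"))  # -> be  (needs a POS hint)
print(stemmer.stem("running"))              # -> run (crude suffix stripping)
print(stemmer.stem("automation"))           # root-like truncation, not a word
```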
Stop word removal
Based on a stop list, remove all stop words, i.e., all words that are not part of the IR system’s dictionary.
● Saves memory
● Makes query processing faster
Part-of-speech (POS)
Assign a label to each token that indicates its grammatical function in context.
● Function words: used to make sentences grammatically correct
○ Prepositions, conjunctions, pronouns, etc.
● Content words: used to carry the meaning of a sentence
○ Nouns, verbs, adjectives, adverbs
Part-of-speech tags provide a higher level of abstraction, which makes likelihoods easier to estimate.
How do they work?
● Rule-based taggers
● Stochastic taggers: the most widely used; they rely on Hidden Markov Models and choose tags
based on likelihood.
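A minimal tagging sketch with NLTK (note: NLTK's default tagger is a stochastic, perceptron-based tagger rather than an HMM, but it illustrates the task; assumes the tokenizer and tagger models are downloaded):

```python
# POS tagging with NLTK's default (stochastic) tagger.
import nltk

tokens = nltk.word_tokenize("The rose is red")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('rose', 'NN'), ('is', 'VBZ'), ('red', 'JJ')]
```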
Other NLP tasks
Parsing
Construct a tree that represents the syntactic structure of the string according to some grammar.
Constituency parsing
Breaks the phrase into sub-phrases. Nonterminals in the tree are types of phrases, the terminals are the
words in the sentence, and the edges are unlabeled.
Dependency parsing
Connect the words according to their relationships. Each vertex in the tree represents a word, child
nodes are words that are dependent on the parent, and edges are labeled
by the relationship.
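A minimal dependency-parsing sketch with spaCy (assumes spacy and its small English model en_core_web_sm are installed):

```python
# Dependency parse: each word points to its head via a labeled relation.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model (an assumption)
doc = nlp("The dog chased the cat")
for token in doc:
    print(f"{token.text:<7} --{token.dep_}--> {token.head.text}")
```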
Information Extraction
Two types of information extraction: Named Entity Recognition (NER) and Relation Extraction (RE).
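For NER, a minimal sketch with spaCy's pretrained pipeline (same model assumption as above; the printed labels are the model's guesses):

```python
# Named Entity Recognition with spaCy's pretrained model.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Google acquired Freebase and shut it down in 2014.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Google/ORG, 2014/DATE
```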