Summary - Online data collection (880491-M-3)


Summary of the entire Online data collection course: literature + lectures. 7.0 for the exam!


  • 7 November 2024
  • 37 pages
  • 2023/2024
  • Summary
brittvandewouw23
Online data collection
Components
4 pass/fail assignments -> you need to pass all of them
Final digital exam (100%) -> grade needs to be at least 5.5
Book:
https://web.p.ebscohost.com/ehost/detail/detail?vid=0&sid=d0390386-8334-4c3a-bd14-7f5ec2bb851e%40redis&bdata=JnNpdGU9ZWhvc3QtbGl2ZQ%3d%3d#AN=1738375&db=nlebk



EXAM information
Final exam topics: programming knowledge, HTML, XML and JSON, data cleaning and text
mining, HTTP, web scraping and APIs, ethics

Offline -> 40 MC and 14 open questions

Do: read the slides, book chapters, and assignments/worksheets (including
HTML_website_building_guide.IPYNB) -> understand the core syntax of HTML, XML and JSON

Don't: learn all Python code by heart (but at least know what certain functions do),
or produce Python code.

Example question
Which of the following statement(s) is/are true?
Statement 1: all XML documents must have a root element
Statement 2: an API is the messenger that takes a request and tells a system what you want,
and returns the response back to you
a) Only statement 1 is true
b) Only statement 2 is true
c) Both statements are true
d) Both statements are false

Which function can you use to convert all letters of a string to uppercase?
a) upper()
b) uppercase()
c) capitalize()
d) capslock()
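
The options above can be checked directly in Python. This is a minimal sketch; the sample string is made up for illustration.

```python
# Sample string, chosen only for illustration.
text = "online data collection"

# str.upper() converts every letter to uppercase.
upper_text = text.upper()

# str.capitalize() uppercases only the first character and lowercases the rest.
capitalized_text = text.capitalize()

# uppercase() and capslock() are not str methods, so hasattr() returns False.
has_uppercase = hasattr(text, "uppercase")
has_capslock = hasattr(text, "capslock")

print(upper_text)                   # ONLINE DATA COLLECTION
print(capitalized_text)             # Online data collection
print(has_uppercase, has_capslock)  # False False
```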

What is the purpose of HTML?
a) To read data over the web
b) To exchange data over the web
c) To present data on the web
d) To parse data over the web

Consider the following fragment of XML code:




What is/are the attribute name(s) on the fourth line of this code?
Answer: src, width, height
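
To see what attribute names are, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The XML fragment is hypothetical (it is not the fragment from the exam question); only the three attribute names are taken from the answer above.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment, not the one from the exam question.
xml_text = """<gallery>
  <title>Holiday</title>
  <caption>Beach at sunset</caption>
  <image src="beach.jpg" width="640" height="480"/>
</gallery>"""

root = ET.fromstring(xml_text)  # <gallery> is the single root element
image = root.find("image")      # the element on the fourth line

# .attrib is a dict mapping attribute names to their (string) values.
attribute_names = list(image.attrib)

print(root.tag)                 # gallery
print(sorted(attribute_names))  # ['height', 'src', 'width']
```

Note that the attribute values ("beach.jpg", "640", "480") are not part of the attribute names.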


Lecture 1
Online data collection: obtaining data from the web
“Also known as cyberresearch or web-based data collection, this is an evolving type of research
methodology that utilizes the internet as a medium for the collection of data” (Pagani, 2008)

Important difference between
Online data collection -> gathering information from online sources
Online data analytics -> processing online data to get useful insights from it
Online data management -> the process of ingesting, storing, organizing, maintaining, and
publishing the data that was created and collected.

What is online data? (In this course:) any data that is available (and obtainable) online.
Examples: Website data: news, blogs, forums.
Social media data: Twitter, Facebook, Instagram, YouTube.
Open data: governments, companies, NGOs (not in this course)

What can you do with online data?
Research question: there are words you see in crossword puzzles all the time but never
encounter in real life. Which words are the most “crosswordy”?
-> Online data collection: NYT daily crossword puzzles; frequencies of all words in Google
Books.
-> Online data analytics: crosswordiness(w) = crossword answers / (all answers x Google Books
frequency %)
-> Online data management: crossword data (raw and processed) in a repository; publishing
the results in a journal with examples.
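
The analytics step can be sketched as a small calculation. This assumes the formula above divides the number of times a word appears as a crossword answer by the product of the total number of answers and the word's Google Books frequency; that grouping is an interpretation of the notes, and all counts below are made up.

```python
def crosswordiness(answer_count, total_answers, books_frequency_pct):
    """Crossword answers / (all answers x Google Books frequency %).

    Hypothetical helper; the grouping of the terms is an assumption.
    """
    return answer_count / (total_answers * books_frequency_pct)


# Made-up counts for two hypothetical words that are equally common in crosswords.
rare_in_books = crosswordiness(answer_count=50, total_answers=100_000,
                               books_frequency_pct=0.0001)
common_in_books = crosswordiness(answer_count=50, total_answers=100_000,
                                 books_frequency_pct=0.05)

# The word that is rare in books gets the higher score, i.e. is more "crosswordy".
print(rare_in_books > common_in_books)
```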

Why use online data?
Enormous amounts of data, publicly available, easily collectible (?)

Ecologically valid: you can observe people ‘in their natural habitat’
-> but is it ethical?? (discussed in last lecture)
Anonymity: people feel freer to discuss ‘taboo’ topics
More diverse and more representative samples (Gosling et al., 2004)

What shouldn't you do with online data? -> the final lecture goes into detail:
Using personal data, storing data, sharing data
GDPR, research ethics, and ethical issues


How can we obtain online data?
Some core concepts: manually, application programming interface (API), scraping (tools or
scripts).

Online data research cycle
1. General research question
2. Manually inspect the data
3. Refine your research question
4. Determine your strategy
a. Bulk download
b. Queries
c. Tools
d. APIs -> in the course
e. Programming -> in the course
5. Store the data
6. Analyze your data
-> during all steps it is important to keep ethical issues in mind

Basics of python
See programming course material



Chapter 1 - your first web scraper
urllib
from urllib.request import urlopen
html = urlopen('http:// .html')
print(html.read())

This Python script does not (yet) have the logic to go back and request multiple files; it can
only read the single HTML file that you’ve directly requested.

urllib is a standard Python library and contains functions for requesting data across the web,
handling cookies, and even changing metadata such as headers and your user agent.
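
As a minimal sketch of changing request metadata with urllib, the snippet below builds a request with a custom user agent without sending it. The URL and the user-agent string are hypothetical, chosen only for illustration.

```python
from urllib.request import Request

# Hypothetical URL and user-agent string, used only for illustration.
url = "http://example.com/page.html"
request = Request(url, headers={"User-Agent": "my-course-scraper/0.1"})

# Nothing is sent yet; the Request object only describes the request.
method = request.get_method()

# urllib stores header names in capitalized form, i.e. 'User-agent'.
user_agent = request.get_header("User-agent")

print(method)      # GET
print(user_agent)  # my-course-scraper/0.1
```

Passing this object to urllib.request.urlopen(request) instead of a plain URL string would send the request with that header.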

urlopen opens a remote object across a network and reads it. It is a fairly generic function
that can read HTML, image, and other file streams.

BeautifulSoup
The BeautifulSoup library tries to make sense of the nonsensical: it helps format and organize
the messy web by fixing bad HTML and presenting us with easily traversable Python objects
representing XML structures.

- Installing BeautifulSoup
(Not part of the Python standard library; on a Mac, install pip first: $ sudo easy_install pip)
pip install beautifulsoup4
from bs4 import BeautifulSoup

- Running BeautifulSoup
The most commonly used object in the BeautifulSoup library is the BeautifulSoup object.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)
Output: <h1> title </h1>

bs.h1 returns the first instance of the h1 tag found on the page.
BeautifulSoup can also use the file object returned by urlopen directly, without needing to call
.read() first -> bs = BeautifulSoup(html, 'html.parser')
Two arguments are passed: the first is the HTML text the object is based on, and the second
specifies the parser that you want BeautifulSoup to use in order to create that object.
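
The two arguments can be tried out without a network connection by passing an HTML string instead of a downloaded page. This is a minimal sketch; the HTML text is made up, and it assumes beautifulsoup4 is installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML, used instead of a live page.
html_text = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>First title</h1><p>Some text</p><h1>Second title</h1></body></html>"
)

# First argument: the HTML text; second argument: the parser to use.
bs = BeautifulSoup(html_text, "html.parser")

first_h1 = bs.h1             # only the first <h1> tag on the page
all_h1 = bs.find_all("h1")   # every <h1> tag on the page

print(first_h1)              # <h1>First title</h1>
print(first_h1.get_text())   # First title
print(len(all_h1))           # 2
```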

Another parser is lxml (pip install lxml)
lxml has some advantages over html.parser in that it is generally better at parsing ‘messy’ or
malformed HTML code. It is forgiving and fixes problems like unclosed tags, tags that are
improperly nested, and missing head or body tags. One disadvantage is that it has to be
installed separately and depends on third-party C libraries to function, which can cause
portability problems.

Another parser is html5lib. It is an extremely forgiving parser that takes even more initiative
in correcting broken HTML. It also depends on an external dependency, and is slower than
both html.parser and lxml. It is a good option if you're working with messy or handwritten HTML sites.

Connecting reliably and handling exceptions
The page may not be found on the server, or the server may fail while retrieving it (HTTP error),
e.g. 404 Page Not Found or 500 Internal Server Error.
Handle this error:
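
One way to handle this error, as a minimal sketch: wrap the call in try/except and catch urllib's HTTPError. The helper name fetch and its opener parameter are hypothetical, and catching URLError (server cannot be reached at all) is an addition that goes beyond the notes above.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the page body as bytes, or None if the request fails.

    `fetch` and its `opener` parameter are hypothetical helpers for this sketch.
    """
    try:
        response = opener(url)
    except HTTPError as error:
        # The server answered, but with an error status such as 404 or 500.
        print(f"HTTP error {error.code} for {url}")
        return None
    except URLError as error:
        # The server could not be reached at all (assumption: not in the notes).
        print(f"Could not reach {url}: {error.reason}")
        return None
    return response.read()
```

HTTPError is a subclass of URLError, so it has to be caught first. A caller can then check the result, e.g. `if fetch(url) is None:` skip the page, before handing the body to BeautifulSoup.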
