Online data collection
Components
4 pass/fail assignments -> need to pass all
Final digital exam (100%) -> grade must be at least 5.5
Book:
https://web.p.ebscohost.com/ehost/detail/detail?vid=0&sid=d0390386-8334-4c3a-bd14-7f5ec2bb851e%40redis&bdata=JnNpdGU9ZWhvc3QtbGl2ZQ%3d%3d#AN=1738375&db=nlebk
EXAM information
Final exam topics: programming knowledge, HTML, XML and JSON, data cleaning and text
mining, HTTP, web scraping and APIs, ethics
Offline -> 40 MC and 14 open questions
Do: read the slides, book chapters, and assignments/worksheets (including
HTML_website_building_guide.IPYNB) -> understand the core syntax of HTML, XML and JSON
Don't: learn all Python code by heart (but do know what the main functions do) or produce
Python code from scratch.
Example question
Which of the following statement(s) is true?
Statement 1: all XML documents must have a root element
Statement 2: an API is the messenger that takes a request and tells a system what you want,
and returns the response back to you
a) Only statement 1 is true
b) Only statement 2 is true
c) Both statements are true
d) Both statements are false
Which functions can you use to convert all letters of a string to uppercase?
a) upper()
b) uppercase()
c) capitalize()
d) capslock()
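The correct answer is a): Python strings have an upper() method, while uppercase() and capslock() do not exist, and capitalize() only uppercases the first character. A quick check:
word = 'crossword'
print(word.upper())       # CROSSWORD
print(word.capitalize())  # Crossword (only the first letter)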
What is the purpose of HTML?
a) To read data over the web
b) To exchange data over the web
c) To present data on the web
d) To parse data over the web
Consider the following fragment of XML code:
(XML fragment not reproduced in these notes)
What is/are the attribute name(s) on the fourth line of this code?
src, width, height
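Since the fragment is missing, a hypothetical element carrying exactly those attribute names (my own illustration, not the original exam code) could look like:
<image src="photo.jpg" width="300" height="200"/>
Here src, width and height are the attribute names; "photo.jpg", "300" and "200" are their values.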
Lecture 1
Online data collection: obtaining data from the web
“Also known as cyberresearch or web-based data collection, this is an evolving type of research
methodology that utilizes the internet as a medium for the collection of data” (Pagani, 2008)
Important difference between:
Online data collection -> gathering information from online sources
Online data analytics -> processing online data to get useful insights from it
Online data management -> the process of ingesting, storing, organizing, maintaining, and
publishing the data that was created and collected.
What is online data? (in this course) any data that is available (and obtainable) online.
Examples: Website data; news, blogs, forums.
Social media data; twitter, facebook, instagram, youtube.
Open data: governments, companies, NGOs (not in this course)
What can you do with online data?
Research question: there are words you see in crossword puzzles all the time, but never
encounter in real life. Which words are the most “crosswordy”?
-> Online data collection: NYT daily crossword puzzle, Frequencies of all words in google
books.
-> Online data analytics: crosswordiness(w) = (occurrences of w as a crossword answer / all
answers) ÷ (Google Books frequency of w, in %)
-> Online data management: crossword data (raw and processed) on a repository. Publishing
the results in a journal with examples.
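A minimal sketch of the analytics step in Python, with made-up counts and a hypothetical frequency table (the names and numbers are illustrative, not the lecture's actual data):
def crosswordiness(word, answer_counts, total_answers, books_freq_pct):
    # share of all crossword answers, divided by Google Books frequency (%)
    crossword_share = answer_counts[word] / total_answers
    return crossword_share / books_freq_pct[word]

answer_counts = {'oreo': 120, 'the': 300}      # times each word appeared as an answer
books_freq_pct = {'oreo': 0.0001, 'the': 5.0}  # % frequency in Google Books
total_answers = 1_000_000

print(crosswordiness('oreo', answer_counts, total_answers, books_freq_pct))  # large -> very 'crosswordy'
print(crosswordiness('the', answer_counts, total_answers, books_freq_pct))   # tiny -> not 'crosswordy'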
Why use online data?
Enormous amounts of data, publicly available, easily collectible (?)
Ecologically valid: you can observe people ‘in their natural habitat’
-> but is it ethical?? (discussed in last lecture)
Anonymity: people feel freer to discuss ‘taboo’ topics
More diverse and more representative samples (Gosling et al., 2004)
What shouldn't you do with online data? -> the final lecture goes into detail:
Using personal data, storing data, sharing data
GDPR and research ethics and ethics issues
How can we obtain online data?
Some core concepts: manually, application programming interface (API), scraping (tools or
scripts).
Online data research cycle
1. General research question
2. Manually inspect the data
3. Refine your research question
4. Determine your strategy
a. Bulk download
b. Queries
c. Tools
d. APIs -> in the course
e. Programming -> in the course
5. Store the data
6. Analyze your data
→ during all steps it is important to keep ethical issues in mind
Basics of python
See programming course material
Chapter 1 - your first web scraper
urllib
from urllib.request import urlopen
html = urlopen('http://pythonscraping.com/pages/page1.html')  # URL omitted in the notes; the book's example page
print(html.read())
Python script does not have the logic to go back and request multiple files (yet); it can only read
the single HTML file that you’ve directly requested.
urllib is a standard Python library and contains functions for requesting data across the web,
handling cookies, and even changing metadata such as headers and your user agent.
urlopen opens a remote object across a network and reads it; because it is a fairly generic
function, it can read HTML, image, and other file streams.
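The notes mention changing headers and the user agent; a minimal sketch of how that looks with urllib's Request class (the URL and user-agent string are placeholders of my own):
from urllib.request import Request, urlopen

# attach a custom User-Agent header before opening the request
req = Request('http://example.com', headers={'User-Agent': 'my-scraper/0.1'})
html = urlopen(req)
print(html.read()[:200])  # first 200 bytes of the response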
BeautifulSoup
The BeautifulSoup library tries to make sense of the nonsensical; it helps format and organize
the messy web by fixing bad HTML and presenting us with easily traversable Python objects
representing XML structures.
- Installing BeautifulSoup
(Not in the default Python library) (mac: $ sudo easy_install pip)
pip install beautifulsoup4
from bs4 import BeautifulSoup
- Running BeautifulSoup
The most used object in the BeautifulSoup library is the BeautifulSoup object.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://pythonscraping.com/pages/page1.html')  # URL omitted in the notes; the book's example page
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)
Output: <h1>title</h1>
This returns the first instance of the h1 tag found on the page.
BeautifulSoup can also use the file object directly returned by urlopen, without needing to call
.read() first -> bs = BeautifulSoup(html, 'html.parser')
Two arguments are passed: the first is the HTML text the object is based on, and the second
specifies the parser that you want BeautifulSoup to use in order to create that object.
Another parser is lxml (pip install lxml).
lxml has some advantages over html.parser in that it is generally better at parsing ‘messy’ or
malformed HTML code. It is forgiving and fixes problems such as unclosed tags, tags that are
improperly nested, and missing head or body tags. One disadvantage is that it has to be
installed separately and depends on third-party C libraries to function, which can cause problems.
Another parser is html5lib, an extremely forgiving parser that takes even more initiative in
correcting broken HTML. It also has an external dependency and is slower than both
html.parser and lxml. It is a good option if you're working with messy or handwritten HTML sites.
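A small sketch of what ‘forgiving’ means in practice. html.parser ships with Python, so this runs as-is; lxml and html5lib would first need to be installed:
from bs4 import BeautifulSoup

broken = '<ul><li>one<li>two'  # unclosed <li> and <ul> tags
print(BeautifulSoup(broken, 'html.parser').prettify())
# the parser closes the tags for you; lxml and html5lib would additionally
# wrap the fragment in <html> and <body> tags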
Connecting reliably and handling exceptions
The page may not be found on the server (HTTP error), for example 404 Page Not Found or 500
Internal Server Error.
Handle this error:
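The notes break off here; the error-handling code itself is missing. A minimal sketch of how such an HTTPError is typically caught with urllib (the URL is a placeholder):
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen('http://example.com/missing-page.html')
except HTTPError as e:
    print(e)  # e.g. HTTP Error 404: Not Found
    # return None, break, or execute some other 'plan B'
else:
    print(html.read())  # the request succeeded; continue as normal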