Web mining and analytics
Summary
Week 1
Articles
Obtaining Data from the Internet: A Guide to Data Crawling in
Management Research
Claussen & Peukert, 2019
Data crawling, sometimes also called web scraping or spidering, is a method for automatically extracting data from the Internet. Using automated systems (“bots”) to extract data has many practical applications. Popular services such as search engines, price comparison websites, and news aggregators are essentially huge data crawling operations, but bots are also used for malicious purposes, e.g. advertising fraud and cyberattacks.
Data crawling can take place at the product level (e.g. scraping product variety data for different cameras), the firm level (e.g. strategic positioning information from corporate websites), and the individual level (e.g. stakeholder strategies, user innovation communities, salary information websites).
Things to consider with data crawling (boundaries for data crawling):
Check whether crawling is the most efficient way; sometimes suitable databases are already available.
Consider the number of observations that need to be collected. If this number is relatively low, manual collection might be better.
The more structured a website is, the easier the scraping.
Some sites exclude robots from their websites (e.g. via a robots.txt file).
If you need to be logged in before you can crawl, check the terms and conditions to see whether crawling is allowed in that case.
A tip is to look for the sitemap of a website to see how the webpages are named, which makes it easier to scrape the data from the different webpages. Some websites also have directories that list all the subpages. Otherwise, you can usually figure out the URL structure, e.g. start=30, start=60. Lastly, you can resort to APIs.
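As an illustration, the pagination idea could look as follows in Python; the domain and the start parameter are hypothetical assumptions, not taken from the article.

# Minimal sketch of building paginated URLs, assuming a hypothetical
# site that lists results in steps of 30 via a "start" parameter.
base_url = "https://example.com/results?start={}"

# Generate the URLs for the first 10 result pages (start=0, 30, ..., 270).
page_urls = [base_url.format(offset) for offset in range(0, 300, 30)]

for url in page_urls:
    print(url)  # each URL can then be requested by the crawler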
Parsing: extracting the desired pieces of information from a webpage, i.e. selecting the exact elements to be extracted.
Obtaining the content:
First, you request the HTML with the requests.get function. Then you create a soup out of that HTML.
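A minimal sketch of this step in Python, assuming the requests and beautifulsoup4 packages are installed; the URL is a placeholder.

import requests
from bs4 import BeautifulSoup

# Request the raw HTML of a (placeholder) webpage.
response = requests.get("https://example.com/products")
response.raise_for_status()  # stop early if the request failed

# Create a "soup": a parsed, tree-like representation of the HTML.
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title)  # quick check that the page was parsed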
There are two main methods of parsing a website: the direct way through regular expressions, or alternatively through higher-level parsing frameworks such as BeautifulSoup for Python.
Regular expressions are available in most programming languages and in many text editors
and are a powerful tool for defining search patterns.
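For illustration, a small sketch with Python's built-in re module; the HTML snippet and the price pattern are made-up examples.

import re

# Made-up HTML snippet from which we want to extract prices.
html = '<span class="price">EUR 249.99</span> <span class="price">EUR 319.00</span>'

# Search pattern: the literal "EUR", a space, then digits, a dot, and two decimals.
pattern = r"EUR \d+\.\d{2}"

prices = re.findall(pattern, html)
print(prices)  # ['EUR 249.99', 'EUR 319.00']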
Alternatively, one can also parse websites with higher-level frameworks. In these frameworks, the hierarchy created by the opening and closing tags throughout each website is used to create a tree-like structure. One can then identify each element within a website by specifying its location within the tree.
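Continuing the soup sketch from above, elements can be located by their position in the tag tree; the tag and class names below are hypothetical.

# Locate elements via the tag tree instead of raw text patterns.
for product in soup.find_all("div", class_="product"):
    name = product.find("h2").get_text(strip=True)
    price = product.find("span", class_="price").get_text(strip=True)
    print(name, price)

# CSS selectors are an alternative way to specify a location in the tree.
first_price = soup.select_one("div.product span.price")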
Text data can include a huge variety of characters beyond the basic Latin (ASCII) set, such as language-specific alphabets, currency symbols, or emoticons. The Unicode standard ensures correct representation and handling of these characters across devices and software systems. Characters are encoded according to standardized rules, much like in pre-computer systems such as Morse code. For example, the Unicode code point of the ampersand “&” is “U+0026”.
More than 90% of the world wide web is encoded in UTF-8. When saving the information to a local file or database system, it is important to select the corresponding encoding scheme, or convert the encoding scheme when needed. Otherwise, you can end up with messed-up data in which symbols in the text are misinterpreted.
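A small sketch of how encoding works in Python and why the declared encoding matters; the file name is a placeholder.

text = "Café – 100 € & more"

# Encode the string to bytes using UTF-8 before writing it out.
utf8_bytes = text.encode("utf-8")

# Decoding the same bytes with the wrong scheme misinterprets the symbols.
print(utf8_bytes.decode("latin-1"))  # garbled non-ASCII characters
print(utf8_bytes.decode("utf-8"))    # correct: "Café – 100 € & more"

# The ampersand's code point matches its Unicode notation U+0026.
print(hex(ord("&")))  # 0x26

# When saving to a file, state the encoding explicitly.
with open("scraped_text.txt", "w", encoding="utf-8") as f:
    f.write(text)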
So far, we have only considered the case of static websites, i.e. where the entire content is loaded immediately, or contents do not vary systematically for different users. Nowadays, however, most websites personalize the experience.
In many cases, you need to enhance the web crawler to deal with this. An easy method is to “remote control” a fully-featured web browser like Chrome or Firefox using the Selenium package in Python or R.
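A minimal sketch of remote-controlling a browser with Selenium in Python; it assumes the selenium package and a matching browser driver are installed, and the URL and selector are placeholders.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Start a remote-controlled Chrome instance (requires a chromedriver setup).
driver = webdriver.Chrome()

# Load a (placeholder) page; JavaScript runs as in a normal browser,
# so dynamically loaded or personalized content becomes available.
driver.get("https://example.com/listings")

# Extract the rendered page source or individual elements.
html = driver.page_source
items = driver.find_elements(By.CSS_SELECTOR, "div.listing")

driver.quit()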
The crawling process can be sped up with parallelization, where you run multiple crawlers at the same time.
Instead of running one crawler through 100 pages of URLs, it is better to run two crawlers through 50 pages each.
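A sketch of this idea with Python's standard library, splitting a list of placeholder URLs across two workers.

from concurrent.futures import ThreadPoolExecutor
import requests

# Placeholder list of 100 result-page URLs.
urls = [f"https://example.com/results?start={i}" for i in range(0, 3000, 30)]

def fetch(url):
    # One crawler task: download a single page.
    return requests.get(url, timeout=10).text

# Run two crawlers at the same time instead of one after another.
with ThreadPoolExecutor(max_workers=2) as pool:
    pages = list(pool.map(fetch, urls))

print(len(pages), "pages downloaded")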
Sometimes, you also want to gather information from multiple sources.
This is a problem because if the structure differs across sources, your dataset will become messy. You can address this issue with machine learning approaches such as fuzzy matching, which predict whether one value matches another and put them in the same column.
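As a simple illustration of the matching idea, here is a sketch with Python's standard difflib module; the firm names are made up, and this string-similarity approach stands in for the machine learning methods the article alludes to.

import difflib

# Firm names as they appear in two differently structured sources.
source_a = ["Canon Inc.", "Nikon Corporation", "Sony Group"]
source_b = ["Canon Incorporated", "Nikon Corp", "Sony Group Corp."]

# For each name in source A, find the closest match in source B.
for name in source_a:
    match = difflib.get_close_matches(name, source_b, n=1, cutoff=0.5)
    if match:
        score = difflib.SequenceMatcher(None, name, match[0]).ratio()
        print(name, "->", match[0], round(score, 2))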
Crawling panel data can be done by revisiting a website in different time periods, or by downloading snapshots of the data in bulk.
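A minimal sketch of building panel data by revisiting the same placeholder page at different points in time and storing each snapshot with a timestamp.

import csv
import datetime
import requests

def take_snapshot(url, outfile="snapshots.csv"):
    # Download the page and append a timestamped snapshot to a CSV file.
    html = requests.get(url, timeout=10).text
    with open(outfile, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([datetime.datetime.now().isoformat(), url, len(html)])

# Run this once per period (e.g. daily via a scheduler) to build a panel.
take_snapshot("https://example.com/product/123")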
Scraping metadata can be interesting as well. For example, you can identify price changes in a web shop or look at how advertising services work.
Advantages of data crawling include:
Higher exclusivity for researchers than using public datasets that are used by many others
Can be very fast if you have the skills
As a researcher, you are less dependent on publication biases from websites.
You have room to create enormous datasets with crawling
Disadvantages of crawling include:
You need to learn how to crawl
Crawling can be prone to errors, which results in low data quality
Robot exclusion protocols may prevent crawling
Terms and conditions may legally implicate researchers.
Design of review systems – A strategic instrument to shape online reviewing behavior and economic outcomes
Dominik Gutt, Jürgen Neumann, Steffen Zimmermann, Dennis Kundisch, Jianqing Chen, 2019
Online reviews provide firms with strategic knowledge that is pivotal for price setting, demand
forecasting, product quality assessment, and customer relationship management.
The authors similarly argue that review systems, when populated with online reviews, represent specialized assets in the form of the (reviewing) consumers’ experiences and knowledge. According to the resource-based theory of the firm, review systems thus fulfill the necessary condition of representing a valuable, rare, inimitable, and non-substitutable resource for the firm, with which it can obtain a competitive advantage.
From the perspective of the resource-based view of the firm, review systems populated with online
reviews constitute a specialized asset. Still, there are at least two fundamental differences:
First, specialized assets are usually created and used within the firm. Review systems,
however, leverage a power shift from within the firm to outside the firm. E-commerce
platforms running these systems source, accumulate, and aggregate consumption experience
from people outside the firm as their main providers of strategically important knowledge.
Second, online reviews may be collected, processed, aggregated, and presented in quite
different ways and many design features have been identified in the literature that influence
the drivers and economic outcomes of online reviews. It is the unique combination of design
features of a specific review system that may increase its potential to become a specialized
asset for the firm hosting it.
Previous literature reviews of online reviews have started synthesizing the current state of
knowledge and presented research findings regarding two aspects:
(1) The impact of online reviews on economic outcomes – which we refer to as direct outcome
effect in the following – such as prices and sales, and
(2) The factors that drive online review generation – which we refer to as the direct driver effect in the following – such as reviewing motivation or reviewer self-selection. In this paper, the authors present a literature review of this research.
Drivers refer to any effects that influence individual online reviews or any online review metric (i.e., the direct driver effect) and, in turn, can be review-related or reviewer-related. For example, social influence bias is a review-related driver, which suggests that reviewers change their own reviewing behavior when exposed to existing reviews.