Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to load
the provided tests. Each time you start your server, you will need to execute this cell again to load the tests.
Homework 3 is due Thursday, 9/13 at 11:59pm. Start early so that you can come to office hours if you're stuck.
Check the website for the office hours schedule. You will receive an early submission bonus point if you turn in
your final submission by Wednesday, 9/12 at 11:59pm. Late work will not be accepted as per the policies
(http://data8.org/fa18/policies.html) of this course.
Throughout this homework and all future ones, please be sure to not re-assign variables throughout the
notebook! For example, if you use max_temperature in your answer to one question, do not reassign it later
on. Moreover, please be sure to only put your written answers in the provided cells.
In [6]: # Don't change this cell; just run it.
import numpy as np
from datascience import *
# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
from client.api.notebook import Notebook
ok = Notebook('hw03.ok')
_ = ok.auth(inline=True)
=====================================================================
Assignment: Homework 3: Table Manipulation and Visualization
OK, version v1.12.5
=====================================================================
Question 1. Suppose you're choosing a university to attend, and you'd like to quantify how dissimilar any two
universities are. You rate each university you're considering on several numerical traits. You decide on a very
detailed list of 1000 traits, and you measure all of them! Some examples:
The cost to attend (per year).
The average Yelp review of nearby Thai restaurants.
The USA Today ranking of the Medical school.
The USA Today ranking of the Engineering school.
You decide that the dissimilarity between two universities is the total of the differences in their traits. That is, the
dissimilarity is:
the sum of
the absolute values of
the 1000 differences in their trait values.
In the next cell, we've loaded arrays containing the 1000 trait values for Stanford and Berkeley. Compute the
dissimilarity (according to the above technique) between Stanford and Berkeley. Call your answer
dissimilarity. Use a single line of code to compute the answer.
Note: The data we're using aren't real -- we made them up for this exercise, except for the cost-of-attendance
numbers, which were found online.
In [7]: stanford = Table.read_table("stanford.csv").column("Trait value")
berkeley = Table.read_table("berkeley.csv").column("Trait value")
When we subtract the trait values, each difference can be either positive or negative. Our goal is to measure
dissimilarity, so a difference of -4 (Berkeley is higher for that trait) and a difference of +4 (Stanford is higher
for that trait) should count the same: each contributes 4. Taking absolute values ensures this and keeps positive
and negative differences from canceling each other out in the sum.
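As a minimal sketch of the technique on made-up numbers (not the homework data), the cell below compares the plain sum of differences with the sum of absolute differences. The arrays school_a and school_b and the names signed_total and toy_dissimilarity are hypothetical, chosen so that no notebook variable is reassigned.

```python
import numpy as np

# Hypothetical trait values for two made-up schools (assumption: toy data,
# not the contents of stanford.csv or berkeley.csv).
school_a = np.array([10, 3, 5])
school_b = np.array([6, 7, 5])

# Summing the raw differences lets +4 and -4 cancel each other out.
signed_total = np.sum(school_a - school_b)

# Summing the absolute differences counts both gaps of size 4.
toy_dissimilarity = np.sum(np.abs(school_a - school_b))

print(signed_total)       # 0
print(toy_dissimilarity)  # 8
```

The two toy schools differ on two traits, yet the signed total reports no difference at all, which is why the method takes absolute values before summing.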
Weighing the traits
After computing dissimilarities between several schools, you notice a problem with your method: the scale of the
traits matters a lot.
Since schools cost tens of thousands of dollars to attend, the cost-to-attend trait is always a much bigger number
than most other traits. That makes it affect the dissimilarity a lot more than other traits. Two schools that differ in
cost-to-attend by $900, but are otherwise identical, get a dissimilarity of 900. But two schools that differ in
graduation rate by 0.9 (a huge difference!), but are otherwise identical, get a dissimilarity of only 0.9.
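The scale problem can be reproduced with a small sketch. The two-trait schools below (cost to attend in dollars, then graduation rate) are invented for illustration, and every name in the cell is hypothetical.

```python
import numpy as np

# Hypothetical schools described by [cost to attend, graduation rate].
base = np.array([50000.0, 0.05])
pricier = np.array([50900.0, 0.05])      # differs only in cost, by $900
better_grad = np.array([50000.0, 0.95])  # differs only in graduation rate, by 0.9

# Unweighted dissimilarity: sum of absolute differences.
cost_gap = np.sum(np.abs(base - pricier))
grad_gap = np.sum(np.abs(base - better_grad))

print(cost_gap)            # 900.0
print(round(grad_gap, 2))  # 0.9
```

Under the unweighted method, the modest cost gap counts a thousand times more than the very large gap in graduation rate.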
One way to fix this problem is to assign different "weights" to different traits. For example, we could fix the
problem above by multiplying the difference in the cost-to-attend traits by .001, so that a difference of $900 in
the attendance cost results in a dissimilarity of $900 × .001, or 0.9.
Here's a revised method that does that for every trait:
1. For each trait, subtract the two schools' trait values.
2. Then take the absolute value of that difference.
3. Now multiply that absolute value by a trait-specific number, like .001 or 2.
4. Now, sum the 1000 resulting numbers.
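The four steps above can be sketched with array arithmetic on toy data. The arrays trait_a, trait_b, and toy_weights are assumptions made up for this illustration; they are not the homework's stanford, berkeley, or weights arrays.

```python
import numpy as np

# Hypothetical two-trait schools: [cost to attend, graduation rate].
trait_a = np.array([50000.0, 0.05])
trait_b = np.array([50900.0, 0.95])

# Hypothetical weights: scale cost down by .001, leave graduation rate as is.
toy_weights = np.array([0.001, 1.0])

differences = trait_a - trait_b     # step 1: subtract trait values
absolute = np.abs(differences)      # step 2: take absolute values
weighted = absolute * toy_weights   # step 3: multiply by per-trait weights
toy_revised = np.sum(weighted)      # step 4: sum the results

print(round(toy_revised, 2))  # 1.8
```

With these weights the $900 cost gap and the 0.9 graduation-rate gap each contribute about 0.9. Because NumPy arithmetic is element-wise, the intermediate names are optional and the same steps can be nested into one expression.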
Question 3. Suppose you've already decided on a weight for each trait. These are loaded into an array called
weights in the cell below. weights.item(0) is the weight for the first trait, weights.item(1) is the weight
for the second trait, and so on. Use the revised method to compute a revised dissimilarity between Berkeley and
Stanford.
Hint: Using array arithmetic, your answer should be almost as short as in question 1.
In [9]: weights = Table.read_table("weights.csv").column("Weight")