Summary: Cheat sheet for the Data Science midterm (material from weeks 1-3)

A good cheat sheet that covers all the material for the Data Science midterm, with examples of pandas code and worked exercises. Organized per lecture, with the most important concepts in bold so they are easy to find.

Lecture 2: Data science pipeline
Frame problems → in the real world, we need to define and frame the problems first.
Collect data → in the real world, you may need to collect data using sensors, crowdsourcing, or mobile apps. There are also other sources for getting public datasets, such as Hugging Face, Zenodo, Google Dataset Search, etc.
Preprocess data → common operations, illustrated with short pandas sketches below:
Filtering → reduces a set of data based on specific criteria. e.g. the left table can be reduced to the right table using a population threshold. df[df["population"] > 500000].
Aggregation → reduces a set of data to a descriptive statistic. e.g. the left table is reduced to a single number by computing the mean value. df["population"].mean().
Grouping → divides a table into groups by column values, which can be chained with aggregation to produce descriptive statistics for each group. e.g. df.groupby("province").sum().
Sorting → rearranges data based on the values in a column, which can be useful for inspection. e.g. the right table is sorted by population. df.sort_values(by=["population"]).
Concatenation → combines multiple datasets that have the same variables. e.g. the two left tables can be concatenated into the right table. pandas.concat([df_A, df_B]).
Merging and joining → combine multiple data tables that have an overlapping set of instances. e.g. use "city" as the key to merge A and B. A.merge(B, how="inner/left/right/outer", on="city").
Quantization → transforms a continuous set of values (e.g. integers) into a discrete set (e.g. categories). e.g. age is quantized to age ranges. bin = [0, 20, 50, 200]; L = ["1-20", "21-50", "51+"]; pandas.cut(D["age"], bin, labels=L).
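
A minimal pandas sketch putting these operations together, using a small made-up city table (the column names and values are invented for illustration, not taken from the lecture):

import pandas as pd

# Hypothetical example table (not from the lecture slides).
df = pd.DataFrame({
    "city": ["Amsterdam", "Rotterdam", "Leiden"],
    "province": ["Noord-Holland", "Zuid-Holland", "Zuid-Holland"],
    "population": [880000, 650000, 125000],
})

big = df[df["population"] > 500000]            # filtering by a population threshold
mean_pop = df["population"].mean()             # aggregation to a single number
per_province = df.groupby("province").sum(numeric_only=True)  # grouping + aggregation
ordered = df.sort_values(by=["population"])    # sorting for inspection

# Concatenation: stack two tables that share the same variables.
df_all = pd.concat([df, df])

# Merging: join another table on the overlapping "city" key.
area = pd.DataFrame({"city": ["Amsterdam", "Leiden"], "area_km2": [219, 23]})
merged = df.merge(area, how="left", on="city")

# Quantization: cut a continuous variable into labeled bins.
ages = pd.DataFrame({"age": [5, 30, 70]})
ages["age_range"] = pd.cut(ages["age"], bins=[0, 20, 50, 200], labels=["1-20", "21-50", "51+"])
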
Scaling → transforms variables to have another distribution, which puts variables on the same scale and makes the data work better with many models. e.g. Z-score scaling → represents how many standard deviations a value is from the mean: (df - df.mean()) / df.std(). e.g. min-max scaling → makes the values range between 0 and 1: (df - df.min()) / (df.max() - df.min()).
Resampling → resamples time series data to a different frequency using different aggregation methods. e.g. resample to hourly frequency using the mean: df.resample("60min", label="right").mean().
Rolling → transforms time series data using different aggregation methods over a sliding window. e.g. df["new_column"] = df["column1"].rolling(window=3).sum().
Transformation → can be applied to rows or columns in a dataframe. e.g. df["wind_sine"] = np.sin(np.deg2rad(D["wind_deg"])).
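
A short sketch of scaling and the time-series operations above, assuming a hypothetical DataFrame with a datetime index (values invented for illustration):

import numpy as np
import pandas as pd

# Hypothetical sensor table with a datetime index (invented for illustration).
idx = pd.date_range("2025-01-01", periods=6, freq="15min")
df = pd.DataFrame({"temperature": [3.0, 3.2, 3.1, 2.8, 2.9, 3.4],
                   "wind_deg": [10, 40, 90, 180, 270, 355]}, index=idx)

z_scaled = (df - df.mean()) / df.std()                      # Z-score scaling: std. deviations from the mean
minmax = (df - df.min()) / (df.max() - df.min())            # min-max scaling: values between 0 and 1

hourly = df.resample("60min", label="right").mean()         # resample to hourly frequency using the mean
df["temp_rolling_sum"] = df["temperature"].rolling(window=3).sum()  # rolling window aggregation

df["wind_sine"] = np.sin(np.deg2rad(df["wind_deg"]))        # column-wise transformation
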
Extract data → pull data out of text or match text patterns with regular expressions → a language to specify search patterns. e.g. df["year"] = df["venue"].str.extract(r"([0-9]{4})").
Drop → removes data we don't need, such as duplicate records or data irrelevant to our research question; can drop rows or columns. e.g. df.drop(columns=["year"]).
Replace missing values → with a constant, mean, median, or the most frequent value along the same column. e.g. constant imputation → -1; mean imputation.
Model missing values → y = F(X), where y is the variable/column that has the missing values, X are the other variables, and F is a regression function.
Different missing data may require different data cleaning methods. MCAR → Missing Completely At Random: the missing data is a completely random subset of the entire dataset. MAR → Missing At Random: the missing data is only related to variables other than the one with missing data. MNAR → Missing Not At Random: the missing data is related to the variable that has the missing data.
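
A sketch of regex extraction, dropping, and simple imputation on invented data (using pandas fillna; the lecture may use other tooling):

import pandas as pd

# Hypothetical table with a text column and missing values (invented for illustration).
df = pd.DataFrame({"venue": ["KDD 2019", "CHI 2021", "WWW 2020"],
                   "citations": [12.0, None, 30.0]})

df["year"] = df["venue"].str.extract(r"([0-9]{4})")     # regex: pull a 4-digit year out of text
df = df.drop(columns=["venue"])                         # drop a column we no longer need

df["citations_const"] = df["citations"].fillna(-1)                     # constant imputation
df["citations_mean"] = df["citations"].fillna(df["citations"].mean())  # mean imputation
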
Explore data → information visualization is a good way for both experts and lay people to explore data and gain insights. The Python seaborn library can quickly plot and explore structured data; the Python plotly library can build interactive visualizations; Voyant Tools can be used to explore text data.
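
A tiny sketch of both libraries on a hypothetical table (the function choices here are illustrative, not prescribed by the lecture):

import pandas as pd
import seaborn as sns
import plotly.express as px

# Hypothetical structured data, just to show the two libraries.
df = pd.DataFrame({"province": ["A", "A", "B", "B"],
                   "population": [880000, 125000, 650000, 180000]})

sns.histplot(data=df, x="population")            # quick static plot with seaborn
fig = px.bar(df, x="province", y="population")   # interactive chart with plotly
fig.show()
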
Model data → techniques for modeling structured, text, and image data are covered in different modules. Image classification → e.g. optical character recognition (recognizing digits from hand-written images); e.g. fine-grained categorization (categorizing the types of birds). Text classification → e.g. sentiment analysis (identifying emotions from movie reviews); e.g. categorizing the research aspect.
Deploy models → deployed models can enable further quantitative or qualitative research with insights.
Data Science Fundamentals (Modeling)
Classification
To classify spam messages we need examples → a dataset with observations (messages) and labels (spam or ham). We can extract features (information) using human knowledge, which can help distinguish spam and ham messages. Using features x (containing x1 and x2), we can represent each message as one data point in a p-dimensional space (p = 2 in this case). We can think of the model as a function f that separates the observations into groups (labels y) according to their features x = {x1, x2}: f(x) > 0 → spam, f(x) < 0 → ham. To find a good function f, we start from an initial f and train it until satisfied. We need something that tells us in which direction and with what magnitude to update. First, we need an error metric, e.g. the sum of distances between the misclassified points and the line f: error = -y * f(x) for each misclassified point x = {x1, x2}. We can use gradient descent to minimize this error and train the model f iteratively.
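
A minimal sketch of this training idea, assuming a linear model f(x) = w·x + b with labels y in {-1, +1}; the data points, learning rate, and update loop below are invented for illustration and are not the lecture's exact algorithm:

import numpy as np

# Made-up 2D feature points: first two are "spam" (+1), last two are "ham" (-1).
X = np.array([[2.0, 1.0], [1.5, 2.0], [0.2, 0.3], [0.5, 0.1]])
y = np.array([1, 1, -1, -1])

w = np.zeros(2)   # start from an initial f (all-zero weights)
b = 0.0
lr = 0.1          # learning rate: how large each update step is

for _ in range(100):
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:   # misclassified: error = -y * f(x) > 0
            # gradient step on that point's error nudges f toward classifying it correctly
            w += lr * yi * xi
            b += lr * yi

print(np.sign(X @ w + b))   # predicted labels after training
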
Depending on the needs, we can train different models (using different loss functions) with various shapes of decision boundaries. To evaluate our classification model, we compute evaluation metrics to measure and quantify model performance. Accuracy = # of correctly classified points / # of all points; this only works on a balanced dataset. Unbalanced dataset → compute accuracy for each class separately. If we care more about the positive class: Precision = TP / (TP + FP) → how many selected items are relevant. Recall = TP / (TP + FN) → how many relevant items are selected. F-score = 2 * precision * recall / (precision + recall).
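
These metrics can be computed with scikit-learn; the labels below are a made-up example:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true and predicted labels (1 = spam/positive, 0 = ham/negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

print(accuracy_score(y_true, y_pred))    # (# correctly classified) / (# all points)
print(precision_score(y_true, y_pred))   # TP / (TP + FP): how many selected items are relevant
print(recall_score(y_true, y_pred))      # TP / (TP + FN): how many relevant items are selected
print(f1_score(y_true, y_pred))          # 2 * precision * recall / (precision + recall)
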
To choose models, we need a test set, which contains data that the models have not seen during the training phase. To tune hyper-parameters or select features for a model, we use cross-validation to divide the dataset into folds and use each fold in turn for validation. Don't use the test set to tune hyper-parameters or select features; that leads to information leakage. Training set → for training models. Validation set → for tuning hyper-parameters and/or selecting features. One way to select features → recursively eliminate the less important ones using metrics like permutation importance → permute a feature several times and measure the decrease in model performance. If two highly correlated features exist, the model can still access the information through the non-permuted feature, so it may appear that both features are unimportant; a better way is to cluster the correlated features first. For time-series data, it is better to split the cross-validation folds based on the order of the time intervals, which means we only use data from the past to predict the future, and not the other way around.
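
A sketch of cross-validation, permutation importance, and a time-ordered split using scikit-learn on synthetic data (the model and dataset are placeholders, not the course's):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split, TimeSeriesSplit
from sklearn.inspection import permutation_importance

# Synthetic data, only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out a test set that is never used for tuning; use CV folds on the rest.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000)
print(cross_val_score(model, X_train, y_train, cv=5))   # one score per validation fold

# Permutation importance: shuffle each feature several times and
# measure how much the held-out performance drops.
model.fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)

# For time series, split folds by time order: train on the past, validate on the future.
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X_train):
    print(train_idx[-1] < val_idx[0])   # training indices always precede validation indices
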
Regression
Regression fits a function that maps features x to a continuous variable y. Linear regression → fits a linear function f that maps x1 (e.g. the first feature vector) to y, which best describes their linear relationship. We can create a feature matrix X that includes a column for the intercept term β0, which gives us a compact form of the equation. We can then generalize linear regression to multiple predictors and keep the compact mathematical representation; we use the vector and matrix forms to simplify equations, and we can map the vector and matrix forms to the data directly. We can look at the feature matrix X from two different directions: the columns represent the features, the rows represent the data points. Finally, we need an error metric between the estimated response ŷ and the true response y to know whether the model fits the data well. Usually, we assume that the error e is IID (independent and identically distributed) and follows a normal distribution with zero mean and some variance σ². To find the optimal coefficients, we minimize the error using gradient descent or by taking the derivative of its matrix form.
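
A minimal numpy sketch of this setup on invented data: the feature matrix gets a column of ones for the intercept β0, and the coefficients come from setting the derivative of the squared error (in matrix form) to zero:

import numpy as np

# Simple linear regression y ≈ β0 + β1 * x1 on invented data with IID normal noise.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x1 + rng.normal(0, 1.0, size=50)

# Feature matrix X: a column of ones carries the intercept β0;
# rows are data points, columns are features.
X = np.column_stack([np.ones_like(x1), x1])

# Closed form from the matrix-form derivative: beta_hat = (X^T X)^-1 X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat        # estimated response ŷ = f(X)
print(beta_hat)             # ≈ [2.0, 0.5]
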
We can model a non-linear relationship using a polynomial function with degree k. Using a model that is too complex/too simple can lead to overfitting/underfitting; with overfitting, the model fits the training set well but generalizes poorly to unseen data. To evaluate regression models, one common metric is the coefficient of determination (R-squared). For simple/multiple linear regression, R-squared equals the square of the Pearson correlation coefficient r between the true y and the estimate ŷ = f(X). R-squared increases as we add more predictors and is therefore not a good metric for model selection; the adjusted R-squared also considers the number of samples (n) and predictors (p). A bad R-squared does not always mean there is no pattern in the data; a good R-squared does not always mean that the function fits the data well; and R-squared can be greatly affected by outliers.
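
A small sketch of these metrics on an invented simple-linear-regression example (the adjusted R-squared formula below is the common n/p form, assumed here):

import numpy as np
from sklearn.metrics import r2_score

# Invented simple-linear-regression example.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.1, 4.0, 5.2, 6.1, 6.9, 8.2])

b1, b0 = np.polyfit(x, y, deg=1)     # least-squares line y ≈ b0 + b1*x
y_hat = b0 + b1 * x                  # estimate ŷ = f(X)

r2 = r2_score(y, y_hat)              # coefficient of determination
r = np.corrcoef(y, y_hat)[0, 1]      # for simple linear regression, r² equals R²

n, p = len(y), 1                     # n samples, p predictors
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(r2, r ** 2, adjusted_r2)
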
Lecture 3
Structured data generally means data that has standardized formats and well-defined structures. Mathematically speaking, we want to estimate a function f that maps features X to labels y such that the prediction f(X) is as close to y as possible.
Decision trees → have a non-linear decision boundary that iteratively partitions the feature space. For simplicity, assume all features are binary. If we could only ask one question, which question would we ask? We want to use the most useful feature, the one that gives us the most information to help us guess. How can we quantify which feature that is?
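
A minimal scikit-learn sketch of a decision tree on invented binary features (feature names and labels are hypothetical):

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy binary features: does the message contain "free"? does it contain a link?
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
y = [1, 1, 0, 0, 1, 0]          # 1 = spam, 0 = ham

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# The learned tree asks the most informative question(s) first and
# partitions the feature space accordingly.
print(export_text(tree, feature_names=["contains_free", "contains_link"]))
print(tree.predict([[1, 0]]))   # classify a new message
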
