Exam


Flashcard Set Details

Cards 129
Language English
Category Finance
Level University
Created / Updated 24.11.2024 / 08.02.2025
Web link
https://card2brain.ch/box/20241124_data_analytics
What is Sensitivity (true positive rate)?

True Positive / (True Positive + False Negative)

What is the Specificity (true negative rate)?

True Negative / (True Negative + False Positive)

What is the ROC curve?

Receiver Operating Characteristic. The curve plots sensitivity (y-axis) against 1 - specificity (x-axis) as the cutoff value decreases from 1 to 0. The closer the curve is to the upper left corner, the better the performance.

How do you measure the quality of an ROC curve?

By the area under the curve (AUC); a higher area is better:
- 0.5 indicates performance no better than random assignment
- 1.0 indicates perfect separation of classes
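
To make the cutoff sweep concrete, here is a minimal sketch that computes ROC points and the area under the curve by hand. The labels and probabilities are made up for illustration, and the helper names `roc_points` and `auc` are hypothetical, not part of the course material.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays as the cutoff decreases from above the top score."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    # Start with a cutoff that classifies nothing as 1, then lower it step by step
    cutoffs = np.r_[np.inf, np.sort(np.unique(scores))[::-1]]
    positives = (y_true == 1).sum()
    negatives = (y_true == 0).sum()
    tpr, fpr = [], []
    for c in cutoffs:
        predicted = scores >= c
        tpr.append((predicted & (y_true == 1)).sum() / positives)  # sensitivity
        fpr.append((predicted & (y_true == 0)).sum() / negatives)  # 1 - specificity
    return np.array(fpr), np.array(tpr)

def auc(fpr, tpr):
    """Area under the ROC curve using the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# Made-up example: actual classes and predicted probabilities of class 1
y = [1, 1, 0, 1, 0, 0]
p = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

fpr, tpr = roc_points(y, p)
roc_auc = auc(fpr, tpr)
print(f"AUC: {roc_auc:.3f}")
```

In this toy data one class-0 case is ranked above one class-1 case, so the AUC falls between 0.5 and 1.0.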

What are the procedure steps of a lift chart?

1. Order the observations by predicted probability, from highest to lowest.
2. Split them into groups, usually deciles.
3. Calculate the proportion of 1s in each decile.
4. Divide it by the average proportion of 1s in the whole data set.
5. This ratio is the lift value of the decile.

How do you interpret a lift chart?

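The lift of a decile shows how many times more 1s it contains than a random selection of the same size. Example: a lift of almost 8 in the first decile means that the 10% of observations the model ranks as most likely to be 1s contain almost eight times as many 1s as a random selection of 10% of cases.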
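
The lift calculation can be sketched in a few lines. This is an illustration under assumptions: the data is invented (20 observations, four actual 1s), and `decile_lift` is a hypothetical helper name.

```python
import numpy as np

def decile_lift(y_true, scores, n_bins=10):
    """Lift per bin: proportion of 1s in the bin divided by the overall proportion."""
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(scores)[::-1]            # highest predicted probability first
    bins = np.array_split(y_true[order], n_bins)
    overall = y_true.mean()                      # average proportion of 1s
    return [float(b.mean() / overall) for b in bins]

# Invented data: predicted probabilities and actual classes
scores = np.linspace(0.95, 0.0, 20)
y = np.zeros(20)
y[[0, 1, 2, 5]] = 1                              # the 1s sit mostly at the top

lift = decile_lift(y, scores)
print(lift[:3])
```

The first decile holds only 1s while the overall share of 1s is 20%, so its lift is 5: the model finds five times as many 1s there as a random 10% sample would.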

What is the Mean Absolute Error?

The average of the absolute prediction errors. It gives an idea of the magnitude of errors.

What is the Mean Error?

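The average of the signed prediction errors. It gives an idea of systematic over- or underprediction: a value far from zero means the model is off in one direction on average.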

How do you measure the accuracy of a prediction?

Compare the root mean squared error and mean absolute error of training and validation data. A validation error that is much larger than the training error points to overfitting.
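
The error measures from the cards above can be computed as follows. This is a small sketch with made-up numbers; it assumes the convention error = actual - predicted, and `error_measures` is a hypothetical helper name.

```python
import numpy as np

def error_measures(actual, predicted):
    """Mean error, mean absolute error and root mean squared error."""
    errors = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return {
        "ME": float(errors.mean()),                    # sign shows over-/underprediction
        "MAE": float(np.abs(errors).mean()),           # typical size of an error
        "RMSE": float(np.sqrt((errors ** 2).mean())),  # penalizes large errors more
    }

# Made-up values: the predictions miss by 1 in alternating directions
m = error_measures([10, 12, 14, 16], [11, 11, 15, 15])
print(m)
```

Here the mean error is zero because over- and underpredictions cancel, while MAE and RMSE still show that every prediction is off by one unit.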

What is a level-level regression and how do you interpret it?

y = βx + ε. If x increases by one unit, then y changes by β units.

What is a log-level regression and how do you interpret it?

log y = βx + ε. If x increases by one unit, then y changes by approximately β * 100 percent.

What is a level-log regression and how do you interpret it?

y = β log x + ε. If x increases by 1 percent, then y changes by approximately β / 100 units.

What is a log-log regression and how do you interpret it?

log y = β log x + ε. If x increases by 1 percent, then y changes by approximately β percent.
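
The log-log interpretation can be checked numerically. The sketch below uses invented, noise-free data following y = 3 * x^2, so the fitted slope of log y on log x should be 2 and a 1 percent increase in x should raise y by roughly 2 percent.

```python
import numpy as np

# Invented data without noise: y = 3 * x^2, i.e. log y = log 3 + 2 * log x
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = 3.0 * x ** 2.0

# Fit a straight line to the logged data; the slope is the elasticity β
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)

# Exact percentage change in y when x increases by 1 percent
pct_change_y = ((3.0 * (1.01 * x[0]) ** 2.0) / y[0] - 1) * 100

print(f"slope: {slope:.3f}, change in y: {pct_change_y:.2f} percent")
```

The exact change is slightly above β percent, which is why the interpretation is an approximation that works best for small changes.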

What is the difference between a simple and multiple regression model regarding the RMSE?

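The RMSE of the training data is lower in the multiple regression than in the simple one, because additional predictors cannot increase the training error. This does not guarantee a lower RMSE on the validation data.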

When do you need logistic regression?

When the target variable y is categorical (e.g., color). Here we only deal with binary outcomes: yes (1) or no (0).

What is the Type 1 Error?

False Positive

What is the Type 2 Error?

False Negative

What are Lasso, Ridge and Elastic Net regression good for?

They can be used for prediction or classification when we have large data sets.

What is the trade-off the ridge has to deal with?

Ridge has a larger bias than the OLS, but a lower variance. This reflects the Bias-Variance Trade-Off.

What is the OLS?

Ordinary least squares: we minimize the sum of squared residuals to find the βs.

What is the RIDGE?

We minimize the sum of squared residuals + a shrinkage penalty with a squared term (power of two) to find the βs.

What is the LASSO?

We minimize the sum of squared residuals + a shrinkage penalty with an absolute value term (power of one) to find the βs.

What is the ELASTIC NET?

A combination of RIDGE and LASSO.
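
Written out, the four objectives described on the cards above look as follows. The notation is an assumption: RSS(β) is the sum of squared residuals, and λ, λ₁, λ₂ ≥ 0 are the tuning parameters; the course may parameterize the elastic net differently.

```latex
\begin{aligned}
\mathrm{RSS}(\beta) &= \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \\
\text{OLS:} \quad & \min_{\beta} \; \mathrm{RSS}(\beta) \\
\text{Ridge:} \quad & \min_{\beta} \; \mathrm{RSS}(\beta) + \lambda \sum_{j=1}^{p} \beta_j^2 \\
\text{Lasso:} \quad & \min_{\beta} \; \mathrm{RSS}(\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \\
\text{Elastic net:} \quad & \min_{\beta} \; \mathrm{RSS}(\beta) + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2
\end{aligned}
```

With λ = 0 every penalized version reduces to OLS; larger values shrink the coefficients more strongly.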

Why are some coefficients in LASSO equal to zero and in RIDGE not?

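Because the RIDGE constraint region is a circle, the elliptical contours of the residual sum of squares usually touch it at a point where no coefficient is exactly zero. The LASSO constraint region is a diamond with corners on the axes, so the ellipses often touch it at a corner, where a coefficient is exactly zero.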

When should you use Ridge and when Lasso?

Ridge: If most variables in the dataset are useful. Lasso: If most variables in the dataset are useless. If uncertain: Use Elastic Net Regression.
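
The different behavior of the two penalties can be seen on simulated data. This is a sketch under assumptions: the data is randomly generated so that only the first of five predictors matters, and the `alpha` values (scikit-learn's name for the tuning parameter) are picked by hand for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Simulated data: only the first predictor influences y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Count coefficients that are exactly zero
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
```

Ridge shrinks the useless predictors toward zero but keeps them in the model, whereas Lasso removes them entirely, which matches the rule of thumb on the card above.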

What do you do to find the optimal tuning parameter?

Try different tuning parameters, evaluate their performance, and compare. Performance is commonly checked using 10-fold cross-validation.

What is the 10-fold cross validation?

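The data is split into 10 blocks. Each block serves once as test data while the model is fitted on the other nine. For each tuning parameter, the performance is averaged over the 10 test blocks, and these means are compared.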
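
The mechanics of 10-fold cross-validation can be sketched by hand. The example below makes several assumptions for brevity: simulated data, a ridge model without intercept solved in closed form, RMSE as the performance measure, and an arbitrary list of candidate tuning parameters. `ridge_fit` and `cv_rmse` are hypothetical helper names.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution without intercept: (X'X + lam * I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_rmse(X, y, lam, k=10, seed=0):
    """Mean RMSE over k test blocks for one tuning parameter."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    rmses = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False              # hold this block out
        beta = ridge_fit(X[train_mask], y[train_mask], lam)
        resid = y[test_idx] - X[test_idx] @ beta
        rmses.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(rmses))

# Simulated data with five predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=1.0, size=100)

# Try several tuning parameters and keep the one with the lowest mean RMSE
candidates = [0.01, 1.0, 10.0, 1000.0]
scores = {lam: cv_rmse(X, y, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)
print(scores, best_lam)
```

Using the same seed for every candidate means all tuning parameters are compared on identical blocks, so differences in the mean RMSE come from the penalty and not from the split.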

What is the Python code for a confusion matrix with a cutoff of 0.5?

from dmba import classificationSummary  # helper from the dmba package

# Classify as 'owner' if the predicted probability exceeds the cutoff of 0.5
predicted = ['owner' if p > 0.5 else 'nonowner' for p in owner_df.Probability]
classificationSummary(owner_df.Class, predicted, class_names=['nonowner', 'owner'])

# Counts read off the confusion matrix: 10 TN, 2 FP, 1 FN, 11 TP
errorrate50 = (1 + 2) / (10 + 2 + 1 + 11)
accuracy50 = 1 - errorrate50
sens50 = 11 / (11 + 1)
spec50 = 10 / (10 + 2)
print(f"Error rate: {errorrate50:4.3f}")
print(f"Accuracy: {accuracy50:4.3f}")
print(f"Sensitivity: {sens50:4.3f}")
print(f"Specificity: {spec50:4.3f}")
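
Instead of typing the counts by hand, they can be derived from the data for any cutoff. The following is a minimal sketch with made-up labels and probabilities; `confusion_counts` is a hypothetical helper, not a function of the dmba package.

```python
def confusion_counts(actual, probabilities, cutoff=0.5, positive="owner"):
    """Count TP, FP, TN, FN when probabilities above the cutoff count as positive."""
    tp = fp = tn = fn = 0
    for label, p in zip(actual, probabilities):
        predicted_positive = p > cutoff
        actual_positive = label == positive
        if predicted_positive and actual_positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif actual_positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

# Made-up labels and predicted probabilities, for illustration only
actual = ["owner", "owner", "owner", "nonowner", "nonowner", "nonowner"]
probabilities = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]

tp, fp, tn, fn = confusion_counts(actual, probabilities, cutoff=0.5)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fp + tn + fn)
error_rate = 1 - accuracy
print(f"Sensitivity: {sensitivity:4.3f}, Specificity: {specificity:4.3f}")
```

Calling the function with other cutoffs gives the rows needed for a cutoff comparison table.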

What is the Python code for a table?

summary = pd.DataFrame({"Cutoff": [a, b, c], "Error rate": [a, b, c], "Accuracy": [a, b, c]})
summary

What is the Python code to plot the evolution of a rate (lineplot)?

sns.lineplot(data=summary, x="Cutoff", y="Error rate")

Python code: Relationship between cut-off value and error rate and accuracy in a common plot?

ax = summary.plot(x="Cutoff", y="Accuracy", legend=False)
ax2 = ax.twinx()  # second y-axis sharing the same x-axis
summary.plot(x="Cutoff", y="Error rate", ax=ax2, legend=True, color="r")

What is the Python code for reading data with the encoding ISO-8859-1?

toyota_df = pd.read_csv('ToyotaCorolla.csv', encoding="ISO-8859-1")
toyota_df.head()

What is the Python code to show two variables of a table?

toyota_df[["Fuel_Type", "Price"]]

What is the Python code for a frequency table of a variable (Fuel_Type)?

toyota_df.Fuel_Type.value_counts()  # sorted by frequency
banking_df.Mortgage.value_counts().sort_index()  # sorted by category value

What is the Python code to calculate the arithmetic mean of a variable price for each category of another variable?

toyota_df.groupby('Fuel_Type').Price.mean()

What is the Python code to visualize the relationship between the selling price and the type of fuel in a boxplot?

sns.boxplot(x="Fuel_Type", y="Price", data=toyota_df, whis=100)

What is the Python code to visualize the relationship between the selling price and the type of fuel in a swarmplot?

with pd.option_context('mode.use_inf_as_na', True):
    sns.set(rc={'figure.figsize': (13, 5), "figure.dpi": 300})
    sns.set_theme(style="whitegrid")
    sns.swarmplot(x="Fuel_Type", y="Price", data=toyota_df, size=4)

What is the Python code to visualize the relationship between the selling price and the type of fuel in a stripplot?

with pd.option_context('mode.use_inf_as_na', True):
    sns.set(rc={'figure.figsize': (10, 8), "figure.dpi": 300})
    sns.set_theme(style="whitegrid")
    sns.stripplot(x="Fuel_Type", y="Price", data=toyota_df)

What is the Python code for an OLS regression to estimate the influence of one variable on another?

import statsmodels.api as sm

modg_X = toyota_df[['Fuel_Type']]
modg_X = pd.get_dummies(modg_X, drop_first=True)  # Fuel_Type as dummy variables
modg_X = sm.add_constant(modg_X)
modg_X = modg_X.astype(float)  # Make sure that all columns have numerical values

# Model estimation and results
modg = sm.OLS(toyota_df['Price'], modg_X)
res = modg.fit()
print(res.summary())

What is the Python code for regression statistics?

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = toyota_df[['Fuel_Type', 'HP']]
y = toyota_df[['Price']]

# Transform Fuel_Type into dummies
X = pd.get_dummies(X, drop_first=True)

# Split the data
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4)

# Model fitting
toyota_ml = LinearRegression()
toyota_ml.fit(train_X, train_y)