Data Analytics
Exam
Card set details
Cards | 129 |
---|---|
Language | English |
Category | Finance |
Level | University |
Created / Updated | 24.11.2024 / 08.02.2025 |
Web link | https://card2brain.ch/box/20241124_data_analytics |
True Positive / (True Positive + False Negative)
True Negative / (True Negative + False Positive)
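The two ratios above are the usual definitions of sensitivity and specificity. A minimal sketch of both, assuming hypothetical confusion-matrix counts that are not taken from the cards:

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of actual positives that the model classifies as positive."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of actual negatives that the model classifies as negative."""
    return tn / (tn + fp)


# Hypothetical counts, chosen only for illustration
tp, fn, tn, fp = 40, 10, 45, 5
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"Sensitivity: {sens:.3f}")
print(f"Specificity: {spec:.3f}")
```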
Receiver Operating Characteristic. It illustrates sensitivity and specificity when the cutoff value decreases from 1 to 0. Better performance is shown by an ROC curve closer to the upper left corner.
A higher area under the curve (AUC) is better: 0.5 indicates performance no better than random assignment, and 1.0 indicates perfect separation of the classes.
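The ROC curve and its area can be sketched by lowering the cutoff step by step and recording the false positive rate (1 - specificity) and the sensitivity at each step. The function names `roc_points` and `auc` and the small data set are assumptions for illustration, not part of the cards.

```python
import numpy as np


def roc_points(y_true, prob):
    """Return (false positive rate, sensitivity) arrays as the cutoff falls."""
    y_true = np.asarray(y_true)
    prob = np.asarray(prob)
    # Start above the largest probability so the first point is (0, 0)
    cutoffs = np.concatenate(([np.inf], np.sort(np.unique(prob))[::-1]))
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    fpr, tpr = [], []
    for c in cutoffs:
        pred = prob >= c  # classify as 1 at or above the cutoff
        tpr.append((pred & (y_true == 1)).sum() / n_pos)
        fpr.append((pred & (y_true == 0)).sum() / n_neg)
    return np.array(fpr), np.array(tpr)


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))


# Hypothetical labels and predicted probabilities
y = [1, 1, 0, 1, 0, 0]
p = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
fpr, tpr = roc_points(y, p)
print(f"AUC: {auc(fpr, tpr):.3f}")
```

A curve that hugs the upper left corner encloses more area, which is why the two criteria agree.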
1. Usually in deciles. 2. First: Observations are ordered along predicted probabilities. 3. Calculate the proportion of 1s in each decile. 4. Divide by the average proportion of 1s in the data set. 5. This ratio gives the lift value.
Taking the 10% of observations most likely classified as 1s by the model yields almost eight times as many 1s as a random selection of 10% of cases.
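The decile steps described above can be sketched as follows. The helper name `decile_lift` and the 20 observations are hypothetical. The data are built so that the top decile has a lift of 5, not the value of almost eight mentioned in the card.

```python
import numpy as np


def decile_lift(y_true, prob, n_bins=10):
    """Lift per bin: proportion of 1s in the bin divided by the overall proportion."""
    y_true = np.asarray(y_true, dtype=float)
    # Order observations by predicted probability, highest first
    order = np.argsort(-np.asarray(prob, dtype=float), kind="stable")
    bins = np.array_split(y_true[order], n_bins)
    overall = y_true.mean()
    return np.array([b.mean() / overall for b in bins])


# Hypothetical data: 20 observations, already sorted by falling probability
prob = np.linspace(0.95, 0.0, 20)
y = [1, 1, 1, 0, 1] + [0] * 15
lift = decile_lift(y, prob)
print(np.round(lift, 2))
```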
Gives an idea of the magnitude of errors.
Gives an idea of systematic over- or underprediction.
Compare the root mean squared error and mean absolute error of training and validation data.
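Since the card fronts are missing, it is an assumption that the answers above refer to the mean absolute error or RMSE (magnitude of errors) and the mean error (systematic over- or underprediction). Under that assumption, a minimal sketch with hypothetical numbers:

```python
import numpy as np


def error_measures(actual, predicted):
    """Mean error, mean absolute error and root mean squared error."""
    residuals = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return {
        "ME": float(residuals.mean()),               # sign shows over- or underprediction
        "MAE": float(np.abs(residuals).mean()),      # typical size of an error
        "RMSE": float(np.sqrt((residuals ** 2).mean())),  # penalizes large errors more
    }


# Hypothetical actual and predicted values
measures = error_measures([10, 12, 14, 16], [11, 11, 15, 13])
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Running the same function on training and validation data and comparing the results gives the check described in the last card: a much larger validation error hints at overfitting.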
y = βx + ε. If x increases by one unit, then y changes by β units.
log y = βx + ε. If x increases by one unit, then y changes by approximately β * 100 percent.
y = β log x + ε. If x increases by 1 percent, then y changes by approximately β / 100 units.
log y = β log x + ε. If x increases by 1 percent, then y changes by approximately β percent.
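The percentage readings of the logarithmic models are approximations that work well for small coefficients. As a sketch for the model with log y on the left and x in levels, the exact percentage change for a one-unit increase in x is (exp(slope) - 1) * 100. The function name and the slope values below are hypothetical.

```python
import math


def exact_percent_change(beta: float, dx: float = 1.0) -> float:
    """Exact percentage change in y for a change of dx in x when log y = beta * x + error."""
    return (math.exp(beta * dx) - 1.0) * 100.0


for beta in (0.01, 0.05, 0.30):  # hypothetical slope coefficients
    approx = beta * 100.0
    exact = exact_percent_change(beta)
    print(f"beta={beta:.2f}: approximation {approx:.1f}%, exact {exact:.2f}%")
```

The gap between the two numbers grows with the size of the coefficient, so the rule of thumb is best kept for small slopes.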
The RMSE of the training data is lower in the multiple regression compared to the simple one.
When the target variable y is categorical (e.g., color). We only deal with binary outcomes (yes (1), no (0)).
False Positive
False Negative
They can be used for prediction or classification when we have large data sets.
Ridge has a larger bias than OLS, but a lower variance. This reflects the bias-variance trade-off.
We minimize the sum of squared residuals to find the βs.
We minimize the sum of squared residuals + a shrinkage penalty with a squared term (power of two) to find the βs.
We minimize the sum of squared residuals + a shrinkage penalty with an absolute value term (power of one) to find the βs.
A combination of RIDGE and LASSO.
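The four objectives above can be written as one function. This is a sketch under stated assumptions: the names `lam` (penalty strength) and `l1_ratio` (share of the absolute-value penalty) are chosen here for illustration, there is no intercept, and libraries often scale the terms differently.

```python
import numpy as np


def penalized_loss(X, y, beta, lam=0.0, l1_ratio=0.0):
    """Sum of squared residuals plus a shrinkage penalty.

    lam = 0                -> OLS
    lam > 0, l1_ratio = 0  -> Ridge (squared coefficients)
    lam > 0, l1_ratio = 1  -> Lasso (absolute coefficients)
    0 < l1_ratio < 1       -> Elastic Net (mix of both penalties)
    """
    residuals = y - X @ beta
    rss = np.sum(residuals ** 2)
    ridge_penalty = np.sum(beta ** 2)
    lasso_penalty = np.sum(np.abs(beta))
    return float(rss + lam * ((1 - l1_ratio) * ridge_penalty + l1_ratio * lasso_penalty))


# Hypothetical data and candidate coefficients
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
beta = np.array([1.0, -2.0])

print("OLS:        ", penalized_loss(X, y, beta))
print("Ridge:      ", penalized_loss(X, y, beta, lam=1.0, l1_ratio=0.0))
print("Lasso:      ", penalized_loss(X, y, beta, lam=1.0, l1_ratio=1.0))
print("Elastic Net:", penalized_loss(X, y, beta, lam=1.0, l1_ratio=0.5))
```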
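Because the Ridge constraint region is a circle, the elliptical contours of the sum of squared residuals generally touch it at a point where no coefficient is exactly zero. Ridge therefore shrinks coefficients towards zero without setting them to zero.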
Ridge: If most variables in the dataset are useful. Lasso: If most variables in the dataset are useless. If uncertain: Use Elastic Net Regression.
Try different tuning parameters, evaluate their performance, and compare. Performance is commonly checked using 10-fold cross-validation.
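What is 10-fold cross-validation?
The data are split into 10 blocks. Each block serves once as test data while the model is fitted on the remaining nine. The mean performance across the 10 test blocks is then compared between tuning parameters.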
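A minimal sketch of this procedure for a Ridge penalty, written with NumPy only. The closed-form Ridge fit without intercept, the simulated data and the candidate values for `lam` are assumptions for illustration, not the course's own code.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form Ridge coefficients (no intercept): (X'X + lam * I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def cv_rmse(X, y, lam, k=10, seed=0):
    """Mean RMSE over k test blocks for one tuning parameter."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Fit on the other k - 1 blocks, evaluate on the held-out block
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        beta = ridge_fit(X[train_idx], y[train_idx], lam)
        resid = y[test_idx] - X[test_idx] @ beta
        scores.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(scores))


# Simulated (hypothetical) data: 100 observations, 5 predictors
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Try several tuning parameters and keep the one with the lowest mean RMSE
results = {lam: cv_rmse(X, y, lam) for lam in [0.01, 0.1, 1.0, 10.0, 100.0]}
best_lam = min(results, key=results.get)
print(results)
print("Best tuning parameter:", best_lam)
```

The same seed is used for every tuning parameter, so all candidates are compared on identical blocks.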
What is the Python code for a confusion matrix with a cutoff of 0.5?
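from dmba import classificationSummary

# Classify as 'owner' when the predicted probability exceeds the cutoff of 0.5
predicted = ['owner' if p > 0.5 else 'nonowner' for p in owner_df.Probability]
classificationSummary(owner_df.Class, predicted, class_names=['nonowner', 'owner'])
# Rates computed from the counts in the confusion matrix
errorrate50 = (1 + 2) / (10 + 2 + 1 + 11)
accuracy50 = 1 - errorrate50
sens50 = 11 / (11 + 1)
spec50 = 10 / (10 + 2)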
print(f"Error rate: {errorrate50:4.3f}")
print(f"Accuracy: {accuracy50:4.3f}")
print(f"Sensitivity: {sens50:4.3f}")
print(f"Specificity: {spec50:4.3f}")
What is the Python code for a table?
summary = pd.DataFrame({"Cutoff": [a, b, c], "Error rate": [a, b, c], "Accuracy": [a, b, c]})
summary
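The placeholders a, b, c stand for values that are computed beforehand. One possible way to fill such a table is to loop over the cutoffs. The classes and probabilities below are hypothetical and are not the owner data used in the cards.

```python
import pandas as pd

# Hypothetical actual classes (1 = owner) and predicted probabilities
actual = [1, 1, 1, 0, 1, 0, 0, 0]
prob = [0.95, 0.85, 0.65, 0.30, 0.55, 0.40, 0.20, 0.05]

rows = []
for cutoff in [0.25, 0.50, 0.75]:
    # Classify as 1 when the probability exceeds the cutoff
    predicted = [1 if p > cutoff else 0 for p in prob]
    errors = sum(a != b for a, b in zip(actual, predicted))
    error_rate = errors / len(actual)
    rows.append({"Cutoff": cutoff, "Error rate": error_rate, "Accuracy": 1 - error_rate})

summary = pd.DataFrame(rows)
print(summary)
```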
What is the Python code to plot the evolution of a rate (lineplot)?
sns.lineplot(data=summary, x="Cutoff", y="Error rate")
What is the Python code to plot the relationship between the cutoff value and both the error rate and the accuracy in a common plot?
ax = summary.plot(x="Cutoff", y="Accuracy", legend=False)
# Second y-axis that shares the x-axis with the first plot
ax2 = ax.twinx()
summary.plot(x="Cutoff", y="Error rate", ax=ax2, legend=True, color="r")
What is the Python code for reading data with the encoding ISO-8859-1?
toyota_df = pd.read_csv('ToyotaCorolla.csv', encoding="ISO-8859-1")
toyota_df.head()
What is the Python code to show two variables of a table?
toyota_df[["Fuel_Type", "Price"]]
What is the Python code for a frequency table of a variable (Fuel_Type)?
toyota_df.Fuel_Type.value_counts()
# Same idea for another data set, sorted by category value instead of by frequency
banking_df.Mortgage.value_counts().sort_index()
What is the Python code to calculate the arithmetic mean of a variable (Price) for each category of another variable (Fuel_Type)?
toyota_df.groupby('Fuel_Type').Price.mean()
What is the Python code to visualize the relationship between the selling price and the type of fuel in a boxplot?
sns.boxplot(x="Fuel_Type", y="Price", data=toyota_df, whis=100)
What is the Python code to visualize the relationship between the selling price and the type of fuel in a swarmplot?
# Note: the 'mode.use_inf_as_na' option is deprecated in recent pandas versions
with pd.option_context('mode.use_inf_as_na', True):
    sns.set(rc={'figure.figsize': (13, 5), "figure.dpi": 300})
    sns.set_theme(style="whitegrid")
    sns.swarmplot(x="Fuel_Type", y="Price", data=toyota_df, size=4)
What is the Python code to visualize the relationship between the selling price and the type of fuel in a stripplot?
# Note: the 'mode.use_inf_as_na' option is deprecated in recent pandas versions
with pd.option_context('mode.use_inf_as_na', True):
    sns.set(rc={'figure.figsize': (10, 8), "figure.dpi": 300})
    sns.set_theme(style="whitegrid")
    sns.stripplot(x="Fuel_Type", y="Price", data=toyota_df)
What is the Python code for an OLS regression to assess the influence of one variable on another?
import statsmodels.api as sm

# Explanatory variable: Fuel_Type, converted to dummy variables
modg_X = toyota_df[['Fuel_Type']]
modg_X = pd.get_dummies(modg_X, drop_first=True)
modg_X = sm.add_constant(modg_X)
modg_X = modg_X.astype(float)  # Make sure that all columns have numerical values
# Model estimation and results
modg = sm.OLS(toyota_df['Price'], modg_X)
res = modg.fit()
print(res.summary())
What is the Python code for regression statistics?
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Select predictors and target
X = toyota_df[['Fuel_Type', 'HP']]
y = toyota_df[['Price']]
# Transform Fuel_Type into dummies
X = pd.get_dummies(X, drop_first=True)
# Split the data
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4)
# Model fitting
toyota_ml = LinearRegression()
toyota_ml.fit(train_X, train_y)