Data Analytics
Exam
File details
Flashcards | 129 |
---|---|
Language | English |
Category | Finance |
Level | University |
Created / Updated | 24.11.2024 / 08.02.2025 |
Web link | https://card2brain.ch/box/20241124_data_analytics |
What is the Python code to calculate the arithmetic mean of a variable price for each category of another variable?
toyota_df.groupby('Fuel_Type').Price.mean()
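A minimal sketch of the same group-wise mean on made-up data. The frame `toy_df` and its values are hypothetical stand-ins for `toyota_df`, which is not shown in these cards.

```python
import pandas as pd

# Hypothetical toy data standing in for toyota_df
toy_df = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "CNG", "Diesel"],
    "Price": [10000, 14000, 12000, 9000, 16000],
})

# Mean price per fuel type; the result is a Series indexed by Fuel_Type
mean_price = toy_df.groupby("Fuel_Type")["Price"].mean()
print(mean_price)
```

The bracket form `["Price"]` is equivalent to the attribute form `.Price` here and also works for column names that are not valid Python identifiers.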
What is the Python code to visualize the relationship between the selling price and the type of fuel in a boxplot?
sns.boxplot(x="Fuel_Type", y="Price", data=toyota_df, whis=100)
What is the Python code to visualize the relationship between the selling price and the type of fuel in a swarmplot?
# Treat inf as missing while plotting (this pandas option is deprecated from pandas 2.1 on)
with pd.option_context('mode.use_inf_as_na', True):
    sns.set(rc={'figure.figsize': (13, 5), 'figure.dpi': 300})
    sns.set_theme(style="whitegrid")
    sns.swarmplot(x="Fuel_Type", y="Price", data=toyota_df, size=4)
What is the Python code to visualize the relationship between the selling price and the type of fuel in a stripplot?
# Treat inf as missing while plotting (this pandas option is deprecated from pandas 2.1 on)
with pd.option_context('mode.use_inf_as_na', True):
    sns.set(rc={'figure.figsize': (10, 8), 'figure.dpi': 300})
    sns.set_theme(style="whitegrid")
    sns.stripplot(x="Fuel_Type", y="Price", data=toyota_df)
What is the Python code for an OLS Regression to appreciate the influence of a variable based on another variable?
modg_X = toyota_df[['Fuel_Type']]
modg_X = pd.get_dummies(modg_X, drop_first=True)
modg_X = sm.add_constant(modg_X)
modg_X = modg_X.astype(float)  # Make sure that all columns have numerical values
# Model estimation and results
modg = sm.OLS(toyota_df['Price'], modg_X)
res = modg.fit()
print(res.summary())
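To see how the coefficients of such a dummy regression can be read, the sketch below fits the same kind of model with plain NumPy least squares instead of statsmodels. That swap is an assumption made to keep the example dependency-light, and `toy_df` with its values is hypothetical. With only dummies and a constant, the constant equals the mean of the dropped (baseline) category and each dummy coefficient equals the difference between that category's mean and the baseline mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for toyota_df
toy_df = pd.DataFrame({
    "Fuel_Type": ["CNG", "CNG", "Diesel", "Diesel", "Petrol", "Petrol"],
    "Price": [8000.0, 10000.0, 14000.0, 16000.0, 11000.0, 13000.0],
})

# Dummies without the first category (CNG becomes the baseline), plus a constant
X = pd.get_dummies(toy_df[["Fuel_Type"]], drop_first=True).astype(float)
X.insert(0, "const", 1.0)

# Ordinary least squares via NumPy (stands in for sm.OLS(...).fit().params)
coef, *_ = np.linalg.lstsq(X.to_numpy(), toy_df["Price"].to_numpy(), rcond=None)
params = pd.Series(coef, index=X.columns)

# Group means for comparison with the coefficients
group_means = toy_df.groupby("Fuel_Type")["Price"].mean()
print(params)
print(group_means)
```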
What is the Python code for regression statistics?
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Select predictors and outcome
X = toyota_df[['Fuel_Type', 'HP']]
y = toyota_df[['Price']]
# Transform Fuel_Type into dummies
X = pd.get_dummies(X, drop_first=True)
# Split the data (60% training, 40% validation)
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4)
# Model fitting
toyota_ml = LinearRegression()
toyota_ml.fit(train_X, train_y)
What is the Python code to show the regression statistics of training data?
print('Performance Measures (Training data)')
regressionSummary(train_y, toyota_ml.predict(train_X))
What is the Python code to show the regression statistics of validation data?
print('Performance Measures (Validation data)')
regressionSummary(valid_y, toyota_ml.predict(valid_X))
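The cards do not show what the summary prints. As a rough sketch, typical error measures for a regression can be computed by hand as follows. The helper `regression_measures` and the numbers are hypothetical, and the choice of mean error, RMSE and MAE is an assumption about which measures matter here.

```python
import math

def regression_measures(actual, predicted):
    """Return mean error, RMSE and MAE for two equal-length sequences."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    return {
        "ME": sum(residuals) / n,                               # average signed error
        "RMSE": math.sqrt(sum(r ** 2 for r in residuals) / n),  # penalises large errors
        "MAE": sum(abs(r) for r in residuals) / n,              # average absolute error
    }

# Hypothetical actual and predicted prices
actual = [10000, 12000, 15000, 9000]
predicted = [11000, 11500, 14000, 9500]
measures = regression_measures(actual, predicted)
print(measures)
```

Errors that are clearly larger on the validation data than on the training data are a common sign of overfitting.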
What is the Python code to replace the spaces in all variable names with underscores _?
banking_df.columns = [s.strip().replace(" ", "_") for s in banking_df.columns]
banking_df.head()
What is the Python code to convert a variable into a categorical variable?
banking_df["Education"].value_counts().sort_index()
banking_df["Education"] = banking_df["Education"].map({1: "Undergrad", 2: "Graduate", 3: "Advanced/Professional"})
banking_df.head()
What is the Python code to generate a new variable that takes the value 0 when Mortgage has the value 0 and takes the value 1 in all other cases?
banking_df["has_mortgage"] = [0 if x == 0 else 1 for x in banking_df["Mortgage"]]
banking_df.head()
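A vectorised alternative to the list comprehension, sketched on made-up data. The frame `toy_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical mortgage amounts standing in for banking_df["Mortgage"]
toy_df = pd.DataFrame({"Mortgage": [0, 155, 0, 104, 0]})

# The comparison yields True/False; astype(int) turns that into 1/0
toy_df["has_mortgage"] = (toy_df["Mortgage"] != 0).astype(int)
print(toy_df)
```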
What is the Python code to estimate a logit model: log(odds(has_mortgage = 1 | Income)) = β0 + β1 * Income?
X_simple = banking_df["Income"]
Y_simple = banking_df["has_mortgage"]
X_simple = sm.add_constant(X_simple)
logit_simple_mod = sm.Logit(Y_simple, X_simple)
logit_simple_mod_res = logit_simple_mod.fit()
print(logit_simple_mod_res.summary())
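A small sketch of how estimated logit coefficients translate into probabilities. The values of `b0` and `b1` are hypothetical and not taken from any fitted model, and `logit_probability` is an illustrative helper.

```python
import math

def logit_probability(b0, b1, income):
    """Convert the log-odds b0 + b1 * income into a probability."""
    log_odds = b0 + b1 * income
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients, chosen only for illustration
b0, b1 = -1.0, 0.01

p_low = logit_probability(b0, b1, 50)    # log-odds = -0.5
p_high = logit_probability(b0, b1, 150)  # log-odds = +0.5

# exp(b1) is the factor by which the odds change per one-unit increase in income
odds_ratio = math.exp(b1)
print(p_low, p_high, odds_ratio)
```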
What is the Python code to add explanatory variables and estimate it again?
X_full = banking_df[["Income", "Family", "CCAvg", "Education", "Age"]]
X_full = pd.get_dummies(X_full, prefix_sep="_", drop_first=True)
X_full = X_full.astype(float)  # Make sure that all columns have numerical data types
Y_full = banking_df["has_mortgage"]
X_full = sm.add_constant(X_full)
logit_full_mod = sm.Logit(Y_full, X_full)
logit_full_mod_res = logit_full_mod.fit()
print(logit_full_mod_res.summary())
What is the Python code to make a confusion matrix?
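from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# logit_reg is assumed to be a previously fitted classifier
predict_valid = logit_reg.predict(valid_X)
cm2 = confusion_matrix(valid_y, predict_valid)
ConfusionMatrixDisplay(cm2).plot()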
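What a confusion matrix counts can be sketched without any library. The helper `confusion_counts` and the label lists are hypothetical.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count (actual, predicted) pairs for 0/1 labels."""
    counts = Counter(zip(actual, predicted))
    return {
        "TN": counts[(0, 0)], "FP": counts[(0, 1)],
        "FN": counts[(1, 0)], "TP": counts[(1, 1)],
    }

# Hypothetical true and predicted classes
actual = [0, 0, 1, 1, 0, 1, 0, 1]
predicted = [0, 1, 1, 0, 0, 1, 0, 1]

cm = confusion_counts(actual, predicted)
accuracy = (cm["TP"] + cm["TN"]) / len(actual)
print(cm, accuracy)
```

For 0/1 labels, scikit-learn's `confusion_matrix` arranges the same counts as `[[TN, FP], [FN, TP]]`, with rows for actual classes and columns for predicted classes.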
What is the Python code to generate a lift chart?
import kds as kds
kds.metrics.plot_lift(valid_y, predict_valid)
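The quantity behind a lift chart can be sketched by hand. This example assumes records are ranked by a predicted probability (a score) rather than by a hard 0/1 class label. The helper `lift_at` and all values are hypothetical.

```python
def lift_at(actual, scores, fraction):
    """Lift of the top `fraction` of records when ranked by score."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    top_rate = sum(a for _, a in ranked[:n_top]) / n_top  # response rate in the top group
    overall_rate = sum(actual) / len(actual)              # response rate overall
    return top_rate / overall_rate

# Hypothetical outcomes and predicted probabilities
actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1, 0.5, 0.05]

lift_top_40 = lift_at(actual, scores, 0.4)
print(lift_top_40)
```

A lift above 1 means the model finds more positives in the top-ranked group than random selection would. Over the full data set the lift is 1 by construction.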
What are numeric variables?
Continuous -> can take any value within a range, so infinitely many values are possible (e.g. size, time, age) and
Integer -> whole-number counts (e.g. number of cars, number of cities)
What are categorical variables?
Ordinal -> values can be ordered logically (e.g. good - ok - bad) and
nominal -> values cannot be ordered logically (e.g. blue, yellow, red)
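One way these four variable types might be represented in pandas, as a sketch. The column names and values are hypothetical.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "size_m2": [54.5, 80.0, 120.3],       # continuous
    "n_cars": [0, 2, 1],                  # integer
    "rating": ["good", "bad", "ok"],      # ordinal
    "colour": ["blue", "yellow", "red"],  # nominal
})

# Ordinal: categories with an explicit logical order
toy_df["rating"] = pd.Categorical(
    toy_df["rating"], categories=["bad", "ok", "good"], ordered=True
)
# Nominal: categories without any order
toy_df["colour"] = toy_df["colour"].astype("category")

# Comparisons such as max() are only meaningful for the ordered variable
best = toy_df["rating"].max()
print(toy_df.dtypes)
print(best)
```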
What are the core ideas of Data Mining?
Data Analysis, Visualisation, Prediction, Classification, Data Reduction, Association Rules, Recommendation Systems
Name the 9 steps of the Data Mining Process
Define / understand the purpose of the analysis;
Obtaining data (possibly including sampling);
Data analysis, cleaning, preparation;
Reduce the data (dimension) (for supervised data mining, partition it);
Specify the analysis goal (classification, prediction, etc.);
Selection of techniques (e.g. regression, logit);
Iterative implementation and tuning;
Evaluation of the results;
Roll-out and widespread use of the best model
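The partitioning mentioned in the data reduction step can be sketched with pandas alone. The 60/40 split ratio and the frame `toy_df` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data set with ten records
toy_df = pd.DataFrame({"x": range(10), "y": [v * 2 for v in range(10)]})

# Draw 60% of the rows at random for training; the rest is for validation
train_df = toy_df.sample(frac=0.6, random_state=1)
valid_df = toy_df.drop(train_df.index)

print(len(train_df), len(valid_df))
```

The model is fitted on the training partition only; the validation partition is held back to evaluate the results on data the model has not seen.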
Supervised learning's goal is to predict a target or outcome variable; the goal of unsupervised learning is to identify patterns and divide data into meaningful groups.
In supervised learning, the target value is known in the training data, i.e. the data on which the algorithm is trained; in unsupervised learning there is no target variable for prediction or classification.
Supervised learning: classification and prediction; unsupervised learning: association rules, data reduction, data exploration, visualisation.
Prediction: the goal is the prediction of a numerical outcome variable.
Classification: the goal is the prediction of a categorical outcome variable.
What is the code for descriptive analysis of the dataset housing?
housing_df.describe()
What is the code to show the dimension of the dataset?
housing_df.shape
What is the code to show the first 5 lines of the dataset?
housing_df.head()
Number of dummy variables needed = number of categories - 1 -> a dummy for every category adds redundant information, which leads to the failure of algorithms.
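Why one category is dropped when creating dummy variables can be seen in a short sketch. The series `fuel` is hypothetical. With a dummy for every category, each row sums to 1, so the full set duplicates the information already carried by a model's constant term.

```python
import pandas as pd

# Hypothetical categorical variable with three categories
fuel = pd.Series(["Petrol", "Diesel", "CNG", "Petrol"], name="Fuel_Type")

all_dummies = pd.get_dummies(fuel).astype(int)               # one column per category
reduced = pd.get_dummies(fuel, drop_first=True).astype(int)  # number of categories - 1

# Every row of the full set sums to 1: the columns are redundant with a constant
row_sums = all_dummies.sum(axis=1)
print(all_dummies)
print(reduced)
```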
Outlier: an observation that is extreme (far away) compared to the rest of the data.
We need expertise in the data to determine whether an outlier is an error or a true extreme value. Sometimes it is possible to correct the error. If the number of outliers is small and we recognize them as errors, treat them as missing values.
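A sketch of flagging candidate outliers and treating them as missing values. The 1.5 x IQR rule used here is an assumption and only a screening aid; whether a flagged value really is an error still requires knowledge of the data. The series `prices` is hypothetical.

```python
import pandas as pd

# Hypothetical prices with one suspicious value
prices = pd.Series([9000, 9500, 10000, 10500, 11000, 250000], dtype=float)

# Interquartile range and the conventional 1.5 x IQR fences
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
is_outlier = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)

# Treat flagged values as missing (NaN) instead of deleting the rows
cleaned = prices.mask(is_outlier)
print(cleaned)
```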