This is a static view; to run the notebook, log in to the Data Lab environment.
Assignment
- Load the file `Accidents.csv` and divide the data into three tables according to the values of the `Accident_Severity` attribute (for the first table select only the `Fatal` values, for the second `Serious`, and for the third `Slight`). (2p)
In [ ]:
# YOUR CODE HERE
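One possible sketch of splitting a table by attribute value, shown on a tiny made-up DataFrame in place of the real `Accidents.csv` (note that in the raw dataset `Accident_Severity` may be coded numerically rather than as the labels shown here, so check the actual values first):

```python
import pandas as pd

# Tiny stand-in for Accidents.csv; in the notebook you would use
# pd.read_csv('Accidents.csv') instead
accidents = pd.DataFrame({
    'Accident_Index': ['A1', 'A2', 'A3', 'A4'],
    'Accident_Severity': ['Fatal', 'Serious', 'Slight', 'Slight'],
})

# Boolean-mask filtering produces one table per severity value
accidents_fatal = accidents[accidents['Accident_Severity'] == 'Fatal']
accidents_serious = accidents[accidents['Accident_Severity'] == 'Serious']
accidents_slight = accidents[accidents['Accident_Severity'] == 'Slight']
```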
- From the table with the `Slight` value, randomly select 10% of the examples using the `sample` method. The following example demonstrates the usage of this method. (2p)
In [ ]:
# `frac` specifies what fraction of examples should be selected (0.1 = 10%); `random_state`
# initializes the random number generator so that the same selection can be replicated
sample_data = accidents_slight.sample(frac=0.1, random_state=1234)
In [ ]:
# YOUR CODE HERE
- Combine all three tables into a modified `Accidents` table, which will contain 10% of the `Slight` examples and all `Fatal` and `Serious` severity examples. After merging you should have 45,021 examples. (2p)
In [ ]:
# YOUR CODE HERE
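A minimal sketch of stacking tables with `pd.concat`, again on tiny made-up stand-ins for the three severity tables:

```python
import pandas as pd

# Tiny stand-ins for the three severity tables
fatal = pd.DataFrame({'Accident_Severity': ['Fatal'] * 2})
serious = pd.DataFrame({'Accident_Severity': ['Serious'] * 3})
slight_sample = pd.DataFrame({'Accident_Severity': ['Slight'] * 4})

# pd.concat stacks the rows; ignore_index renumbers them 0..n-1
accidents_reduced = pd.concat([fatal, serious, slight_sample],
                              ignore_index=True)
```

On the real data, `len(accidents_reduced)` is where you would check for the expected 45,021 examples.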
- Join the modified `Accidents` table with the `Vehicles` table on the `Accident_Index` key so that the resulting table contains only the vehicles from accidents in the modified `Accidents` table. After merging, you should get a reduced training set with fewer examples to use for further data analysis. Because we have reduced the number of less severe examples, we have increased the weight of the more severe ones. (2p)
In [ ]:
# YOUR CODE HERE
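A sketch of the key-based join with `merge`, using invented rows; `how='inner'` is what restricts the result to vehicles whose accident is present in the reduced table:

```python
import pandas as pd

# Invented rows standing in for the reduced Accidents and the Vehicles tables
accidents_reduced = pd.DataFrame({'Accident_Index': ['A1', 'A2'],
                                  'Accident_Severity': ['Fatal', 'Slight']})
vehicles = pd.DataFrame({'Accident_Index': ['A1', 'A1', 'A3'],
                         'Vehicle_Type': [9, 11, 9]})

# how='inner' keeps only vehicles whose Accident_Index appears in the
# reduced accidents table (here A3 is dropped)
merged = accidents_reduced.merge(vehicles, on='Accident_Index', how='inner')
```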
- Select only the following attributes for further analysis:
  - `Day_of_Week`
  - `1st_Road_Class`
  - `Road_Type`
  - `Light_Conditions`
  - `Weather_Conditions`
  - `Road_Surface_Conditions`
  - `Urban_or_Rural_Area`
  - `Vehicle_Type`
  - `Sex_of_Driver`
  - `Age_of_Driver`
  - `Engine_Capacity_(CC)`
  - `Age_of_Vehicle`
  - `Accident_Severity`

  We perform this selection to remove redundant attributes from the dataset, e.g. attributes describing geolocation, as well as attributes that cannot be used for prediction (they are not known before the accident itself occurs). (2p)
In [ ]:
# YOUR CODE HERE
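Column selection in pandas is just indexing with a list of names; a minimal sketch on a made-up frame with a few of the task's columns:

```python
import pandas as pd

# Made-up frame with an extra column standing in for the merged table
df = pd.DataFrame({'Day_of_Week': [1, 5], 'Longitude': [-0.1, -0.2],
                   'Accident_Severity': [1, 3]})

# Indexing with a list of column names keeps only those attributes
keep = ['Day_of_Week', 'Accident_Severity']
df_selected = df[keep]
```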
- Count the number of missing values for individual attributes. Fill in the missing values appropriately (note: missing values are marked with -1). (4p)
In [ ]:
# YOUR CODE HERE
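One reasonable approach, sketched on an invented column: convert the dataset's `-1` markers to `NaN` so pandas can count them, then fill a numeric attribute with, e.g., its median (the right fill strategy is yours to choose per attribute):

```python
import numpy as np
import pandas as pd

# Made-up column with -1 marking missing values, as in this dataset
df = pd.DataFrame({'Age_of_Driver': [25, -1, 40, -1]})

df = df.replace(-1, np.nan)   # expose the missing values to pandas
print(df.isna().sum())        # per-attribute missing counts

# One reasonable strategy for a numeric attribute: fill with the median
df['Age_of_Driver'] = df['Age_of_Driver'].fillna(df['Age_of_Driver'].median())
```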
- Using contingency tables, show the dependencies between the `Accident_Severity` target attribute and the following attributes:
  - `Day_of_Week`
  - `Sex_of_Driver`
  - `Age_of_Driver` (this attribute needs to be discretized appropriately)

  Use one of the visualizations in the seaborn library to display these relationships graphically. (5p)
In [ ]:
# YOUR CODE HERE
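A sketch on made-up records: `pd.crosstab` builds the contingency table, `pd.cut` discretizes the continuous age (the bin edges here are an assumption, not prescribed by the task), and a seaborn heatmap is one possible visualization:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the plot renders without a display
import pandas as pd
import seaborn as sns

# Made-up records standing in for the merged accident table
df = pd.DataFrame({
    'Sex_of_Driver': ['M', 'F', 'M', 'F', 'M'],
    'Accident_Severity': ['Slight', 'Fatal', 'Slight', 'Slight', 'Serious'],
})

# Contingency table: counts of each severity per driver sex
ct = pd.crosstab(df['Sex_of_Driver'], df['Accident_Severity'])

# Age_of_Driver is continuous, so discretize it before cross-tabulating;
# these bin edges are illustrative only
ages = pd.Series([17, 25, 40, 70], name='Age_of_Driver')
age_groups = pd.cut(ages, bins=[0, 24, 44, 64, 120],
                    labels=['<25', '25-44', '45-64', '65+'])

# One seaborn option: a heatmap of the contingency table
ax = sns.heatmap(ct, annot=True, fmt='d')
```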
- Create a dataset in which you replace all nominal attributes with numeric or binary ones. (3p)
In [ ]:
# YOUR CODE HERE
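One common way to replace nominal attributes with binary ones is one-hot encoding via `pd.get_dummies`, sketched on an invented nominal column:

```python
import pandas as pd

# Invented nominal attribute alongside the target
df = pd.DataFrame({'Road_Type': ['Single', 'Dual', 'Single'],
                   'Accident_Severity': [3, 1, 2]})

# One-hot encode the nominal attribute; the target column stays as-is
df_encoded = pd.get_dummies(df, columns=['Road_Type'])
```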
- Divide the data into training and test sets in a 70/30 ratio. Use `Accident_Severity` as the target attribute. (2p)
In [ ]:
# YOUR CODE HERE
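A minimal sketch of the 70/30 split with scikit-learn's `train_test_split` on a made-up frame; `stratify` (an optional choice here, not required by the task) keeps the class proportions of the target in both parts:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: one feature column plus the target
df = pd.DataFrame({'x': range(10), 'Accident_Severity': [1, 2] * 5})
X = df.drop(columns='Accident_Severity')
y = df['Accident_Severity']

# 70/30 split; stratify keeps class proportions, random_state makes it repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1234)
```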
- Use the `SelectKBest` function with `mutual_info_classif` to calculate the importance of the individual attributes for prediction on the training set. Try to use this information when preprocessing the data for some of the models. (3p)
In [ ]:
# YOUR CODE HERE
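A sketch of `SelectKBest` with the mutual-information score, using the iris dataset as a stand-in for the encoded accident training set (the choice of `k=2` is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)  # stand-in for the encoded training set

# Score every feature against the target and keep the k most informative
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_reduced = selector.fit_transform(X, y)
print(selector.scores_)  # per-feature mutual information estimates
```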
- Train different classification models for predicting the `Accident_Severity` attribute. Train the following models with their default parameters:
  - k-nearest neighbors
  - Decision trees
  - Random forests

  Evaluate the individual models with 10-fold cross-validation using the accuracy metric.
  Note: choose a suitable preprocessing method for each model (the preprocessing from step 8 may need to be adjusted). (6p)
In [ ]:
# YOUR CODE HERE
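A sketch of comparing the three model families with 10-fold cross-validation, again on iris as a stand-in. Pairing k-NN with a scaler via a pipeline is one example of per-model preprocessing, since k-NN is distance-based while trees are scale-invariant:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in for the accident training data

models = {
    'k-NN': make_pipeline(StandardScaler(), KNeighborsClassifier()),
    'Decision tree': DecisionTreeClassifier(random_state=1234),
    'Random forest': RandomForestClassifier(random_state=1234),
}

results = {}
for name, model in models.items():
    # 10-fold cross-validated accuracy for each model
    scores = cross_val_score(model, X, y, cv=10, scoring='accuracy')
    results[name] = scores.mean()
    print(f'{name}: {scores.mean():.3f}')
```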
- Compare the trained models using the ROC curve on the test set as well. Identify the model that gives the best results with the default parameters. Then try to tune that model by finding the best-fitting parameters using `GridSearchCV`. Find and list the best combination of parameters. (4p)
In [ ]:
# YOUR CODE HERE
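A sketch of the tuning step with `GridSearchCV` on a stand-in dataset; both the choice of random forest and the parameter grid below are illustrative assumptions, to be replaced by whichever model and parameters came out best in your comparison:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)  # stand-in for the training set

# Hypothetical grid; in practice pick parameters relevant to your best model
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=1234),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```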
- Train the model with the best parameters on the entire training set. Test the model on the test set. Evaluate the model using the `accuracy`, `precision`, and `recall` metrics, and report its confusion matrix. (3p)
In [ ]:
# YOUR CODE HERE
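A sketch of the final evaluation on a stand-in dataset (random forest assumed as the tuned model for illustration). Since `Accident_Severity` has three classes, `precision_score` and `recall_score` need an explicit averaging mode:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # stand-in for the accident data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1234)

model = RandomForestClassifier(random_state=1234).fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
# multi-class precision/recall require an averaging mode, e.g. 'macro'
prec = precision_score(y_test, y_pred, average='macro')
rec = recall_score(y_test, y_pred, average='macro')
cm = confusion_matrix(y_test, y_pred)
print(acc, prec, rec)
print(cm)
```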