This repository includes modified data and source code used in the paper "On the application of clustering for extracting driving scenarios from vehicle data ", submitted to the journal "Machine Learning with Applications".
If we want to extract test cases from driving data with the purpose of testing vehicles, we want to avoid using similar test cases. In this paper, we focus on this topic. We provide a method for extracting driving episodes from data utilizing clustering algorithms. This method starts with clustering driving data. Afterward, data points representing time-ordered sequences are obtained from the cluster forming a driving episode. Besides outlying the foundations, we present the results of an experimental evaluation where we considered six different clustering algorithms and available driving data from three German cities. To evaluate the cluster quality, we utilize three cluster validity metrics. In addition, we introduce a measure for the quality of extracted episodes relying on the Pearson coefficient. The experimental evaluation shows that the cluster validity metrics do not provide good results. The Pearson coefficient allows ranking the clustering algorithms. The carried out experimental evaluation leads to the following results. We can extract meaningful episodes from driving data using any clustering algorithm considering four to eight clusters. Combining k-means clustering with auto-encoders leads to the best Pearson correlation. SOM is the slowest clustering method, and Canopy is the fastest.
In this work we make use of the public available A2D2 data [1] originally downloaded from https://www.a2d2.audi/a2d2/en/download.html
For the clustering approach we used the Bus Signals data available for three German cities: Gaimersheim, Munich and Ingolstadt.
For the driving scenarios validation we make use of the Camera-FrontCenter images.
The original data is not provided in this repository, in order to download the initial bus signals data and camera images, please refer to above-mentioned link. For the purpose of our study we performed some changes on the original data, here we only provide the modified data used in each step of our approach.
[1] Geyer, Jakob, et al. "A2d2: Audi autonomous driving dataset." arXiv preprint arXiv:2004.06320 (2020).
in order to be able to run all algorithms available in the Python wrapper for the Java machine learning workbench Weka (see: https://www.cs.waikato.ac.nz/~ml/weka/), it is required to install these libraries:
-
python-weka-wrapper3 : https://github.com/fracpete/python-weka-wrapper3
-
python-javabridge 4.0.3 : https://pypi.org/project/python-javabridge/
-
sklearn-weka-plugin which makes Weka algorithms available in scikit-learn. https://github.com/fracpete/sklearn-weka-plugin
The project is composed of two main folders:
I/ Clustering approach:
-
Data_interpolation: This script performs Cubic spline interpolation on the original data in order to synchronize all bus signals values. The intperolated data for each city is saved in the "Interpolated_data" folder.
-
Convert_Interpolated_data: This script converts the interpolated data to the .arff format which is the required format used in WEKA. In this work we only make use of four main sensors which are : 'accelerator_pedal', 'brake_pressure', 'steering_angle_calculated', 'vehicle_speed' but it is possible to use all sensors if required. The data in .arff format for each city is saved in the "arff_data" folder.
-
Data_cleansing: This script changes all brake pressure values which are <= 0.2 to 0 in order to obtain more precise results. Also it filters out the Timestamps values since we do not take into account the time attribute when clustering the data. The clean data (including timestamps) and filtered data (without timestamps) is saved in "Clustering-input_data" folder".
-
Data_clustering: In this script, we perform clustering using five different algorithms provided by weka (Simple k-means, EM, Canopy clustering, SOM clustering and K-means using Auto-encoder pre-processing ). We save the extracted driving scenarios in .json files using the function: create_JSON_files, plus we provide the option of saving driving scenarios in .arff format if needed to be later used as input to weka. For this we use the function:Create_arff_outputfiles. The extracted driving scenarios are saved in both mentioned formats in Results/driving_scenarios folder. Also, in this script we compute the average percentage of Pearson correlation in each cluster using the function AVG_Percentage_Pearson_correlation. The Pearson correlation results for each city are saved in .txt format in the Results folder. We also provide some other graphs of extracted episodes, their probability distribution and Pearson correlation matrices corresponding to each attribute. These graphs were used in our previous paper (see: https://web.archive.org/web/20211013034723id_/http://ksiresearch.org/seke/seke21paper/paper118.pdf)
-
DB_CH_AvgS: computes The Davies-Bouldin index (DB), The Calinski-Harabasz index (CH) and The Silhouette index (S) for each algorithm using different numbers of clusters. These three metrics are available in Scikit-learn library.
-
Plot_silhouette: Plot the silhouette results for a given algorithm using different numbers of clusters.
II/ Camera_images_Validation:
In this folder, we provide scripts for matching camera images available in the original A2D2 data to the extracted driving scenarios (available in Clustering_approach/Results/driving_scenarios).
- img2vid.py: This script creates videos for each city using the original Camera Front_center images downloaded from the link above.
- synchro_vid_clustered_bus: Synchronizes the clustered bus signals with camera images and matches each driving scenario to corresponding sequences of images.