Dissemination – Data catalogue
The Data Catalogue (available at http://wiki.fot-net.eu/index.php/Data_Catalogue) is the latest addition to the FOT-Net Wiki catalogues. As a community resource, all registered users are able to update the contents of the wiki.
The Data Catalogue describes datasets collected in FOTs and NDSs, including the conditions under which they are available for re-use. For each dataset, it introduces the scope of the data (e.g. kilometers driven, number of test users and sensor data collected) and the test setup, and provides links to further information and contact persons.
As the current datasets regarding automation deal with the development of functions, CARTRE decided to complement the FOT Data Catalogue with a new area/page rather than add these development datasets to the mix of user datasets. This chapter essentially contains the information regarding automated driving datasets added to the wiki. The wiki 1) gives wider publicity to these available datasets and 2) enables information collection from the testing community, revealing further usable and available (either publicly or through licensing) datasets.
Having new partners review the Data Catalogue also produced valuable comments for improvement. So far, the description texts have been improved and visual options have been noted for further development.
Available datasets on automated driving
Generally, all naturalistic driving and image classification datasets are usable for automated driving studies, as they can be used as training data. Naturalistic driving data indicates how humans behave in different scenarios, and the data can be used to identify different testing scenarios for automation. Most such naturalistic driving datasets from around the world are already featured in the Data Catalogue. They are multi-purpose, enabling a wide set of research questions not limited to automated driving development.
This chapter details the publicly available datasets that have specifically been recorded with automated vehicles (or the like) or collected for the development of automated vehicle functionality. To date, these datasets are quite different from the FOT datasets from large-scale user tests, for which the FOT-Net project has created an online catalogue. The following datasets can be classified as development data. Data from large-scale user tests has not yet been made widely available, largely due to the competitive development status of current prototype vehicles. The catalogue information was compiled by the CARTRE project in 2018, with some pointers coming from the ongoing ENABLE-S3 (European Initiative to Enable Validation for Highly Automated Safe and Secure Systems) project.
Oxford RobotCar Dataset
Oxford University has collected a dataset consisting of 1000 km of recorded driving in central Oxford over a period of 1.5 years (W. Maddern, G. Pascoe, C. Linegar and P. Newman, 2016). Registration requires an academic e-mail address ending with .edu or .ac.uk. Alternatively, the university can be contacted to negotiate a commercial license. The data is mainly intended for non-commercial academic use. The dataset features almost 20 million images. Information on the dataset is available at http://robotcar-dataset.robots.ox.ac.uk/.
Apollo
Apollo is an automated driving ecosystem and open platform initiated by Baidu. It features source code, data and collaboration options. The platform offers various types of development data, e.g. annotated traffic sign videos, vehicle log data from demonstrations, training data for multi-sensor localization and scenarios for their simulation environment. More information is available at http://apollo.auto.
ApolloScape, a part of Apollo, additionally offers training data for semantic segmentation (pixel-level classification of video frames, usually input for training neural networks). As of March 2018, the dataset contained 74 thousand video frames. More information is available at http://apolloscape.auto/.
Data uploaded by partners is considered private by default (http://apollo.auto/docs/promise.html, accessed on June 2nd 2018), but it can be marked public, or restricted so that specific partners cannot access it. Sample data is available, but wider access to data requires negotiated licenses. Part of Apollo's business model is that partners gain wider access to the resources by contributing data and software.
Berkeley DeepDrive
The Berkeley DeepDrive consortium has released 400 hours of video, together with GPS and inertial measurement unit data. The basic license is limited to personal use. More information is available at http://data-bdd.berkeley.edu/. The consortium is housed at Berkeley and includes members such as GM, Ford, Qualcomm and Nvidia. Apollo (Baidu) also joined the consortium in 2018. In co-operation with Nexar, Berkeley DeepDrive made 100,000 videos available in June 2018. The dataset, BDD100K, consists of 40-second clips collected in multiple cities in the US. The videos are complemented with GPS information and annotations of objects and lane markings. More information describing this particular part of DeepDrive can be found at https://bair.berkeley.edu/blog/2018/05/30/bdd/.
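To give a feel for the size of BDD100K, the clip count and clip length stated above can be converted into hours of video. The following is a minimal sketch that assumes every clip is exactly 40 seconds long; the function name total_hours is only illustrative and does not belong to any dataset tooling.

```python
def total_hours(num_clips: int, clip_seconds: float) -> float:
    """Total recording time in hours for equally long clips."""
    return num_clips * clip_seconds / 3600.0


# Figures taken from the text: 100,000 clips of 40 seconds each.
# Assumption: all clips have exactly the nominal length.
bdd100k_hours = total_hours(num_clips=100_000, clip_seconds=40)
print(f"BDD100K: about {bdd100k_hours:.0f} hours of video")
```

Under this assumption the collection amounts to roughly 1,100 hours of video.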
Comma.ai
Comma.ai is a start-up that has built advanced neural network components that enable self-driving features. They have open sourced parts of their data and software code. They sell dashcam components that go together with the software. Users can submit data and earn community points (https://community.comma.ai/wiki/index.php/FAQ, accessed on June 2nd 2018). More information is available at https://comma.ai/.
KITTI Vision Benchmark Suite
The Karlsruhe Institute of Technology has open sourced six hours of data captured while driving in Karlsruhe. The dataset is famous for its use in vision benchmarks. Annotations and evaluation metrics are provided along with the raw data. The dataset cannot be used for commercial purposes. More information is available at http://www.cvlibs.net/datasets/kitti/.
Cityscapes
The Cityscapes dataset features 5,000 images with high quality annotations and 20,000 images with coarse annotations from 50 different cities. The images are annotated at pixel level and offer training material for neural network studies. When the dataset is used in studies, the users are requested to cite the related dataset papers. More information is available at https://www.cityscapes-dataset.com/.
The Udacity open source self-driving car project
Udacity is building an open source self-driving car. The project offers example data recordings from more than ten hours of driving as well as annotated driving datasets, in which objects in the video have been marked with bounding boxes. In addition to open source tools, Udacity publishes programming challenges to further the development. The project plans to attract students from around the world. More information is available at https://www.udacity.com/self-driving-car.
HD1K Benchmark Suite
This dataset for optical flow (movement measured from video) benchmarking was created by the Heidelberg Collaboratory for Image Processing in close cooperation with Robert Bosch GmbH. It contains over 1000 frames of high-resolution video recorded in diverse weather conditions, together with reference information about movement against which optical flow algorithms can be benchmarked. The data is general-purpose, but it was collected with a measurement van. More information is available at http://hci-benchmark.org/.
Playing for data
This Darmstadt University dataset is an example of recent efforts in the academic community to extract neural network training data from computer games. In games, every pixel belongs to a known object. This reduces the need for manual annotation work, but the data is naturally limited to the details the game can generate. The dataset consists of 24,966 densely labelled frames and is compatible with the Cityscapes dataset. More information is available at http://download.visinf.tu-darmstadt.de/data/from_games/.
Málaga Urban Dataset
This stereo camera and laser dataset was collected on a 37 km route in urban Málaga. The files can be downloaded directly under a BSD open source license, with a request to cite a scientific paper by authors from the universities of Almería and Málaga. More information is available at https://www.mrpt.org/MalagaUrbanDataset.
The Data Catalogue serves as a pointer to accessible data, but it does not host data. Sample data, however, can be included. The main reason for not hosting data is that accessing the ITS/FOT datasets usually requires bilateral licensing negotiations, as they are not fully public and anonymized. The agreements help ensure that any remaining personal and confidential data are properly handled. They can also include, for example, rules for checking that publications do not single out individuals or disclose unnecessary performance data.
Secondly, permanent data hosting requires resources and a business model. Rather, when a project seeks data hosting services, CARTRE points to companies and e-infrastructure services, which may be partially publicly funded. Projects could e.g. store their data for a defined period for a fee and get related hosting services.
The main problem is funding the maintenance of the dataset. Previous experience shows that a dataset cannot be kept available solely on the basis of interested projects paying for access. This model has proven unsustainable, as there is no money to cover the costs during low-demand periods. The project fee model works best when combined with basic funding, which assures projects that the data will remain available over time.
What is interesting about automated vehicle data is that a single development vehicle may collect terabytes of data per day, and this data has to be readily usable, e.g. for algorithm development. Many developers face these needs, as several companies race to develop new systems. These developers either set up their own big data management systems, working e.g. with Apache services (Hadoop, Hive, Spark, NiFi) and learning to use big data toolsets, or turn to companies offering data management services.
As the development of automated driving is a strong activity, new companies are starting to appear that target vehicle manufacturers and offer data management services for fleets of vehicles collecting automated driving data. These datasets contain petabytes of video and laser scanner data. The data must be readily accessible for use e.g. in neural network training. When considering e-infrastructure services for such amounts of data, the new companies can likely offer well-tailored data management.
AutonomouStuff (https://autonomoustuff.com/product/quantum-storage-solution-kits/, accessed on June 2nd 2018) is an example of a company offering this new type of data management service, serving numerous test vehicles in the USA. The next figure outlines their end-to-end support:
This private business development is interesting, as it will also make cheaper and more powerful data management facilities available for research data. It should be kept in mind, however, that development and large-scale user testing pilots usually collect rather different amounts of data. Whereas development may require data in the order of petabytes, user evaluation data is in the range of terabytes, consisting of lower-resolution video feeds and driving style data from vehicles. User test data is used for identifying changes in driving behaviour, either the driver’s or the car’s, e.g. a change in average following distance on highways, or changes in how the driver/user behaves in the vehicle.
When confidential product data is removed from the equation, the data amounts of automated driving come close to past user tests.
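The difference in orders of magnitude between development data and user test data can be made concrete with a small calculation. The sketch below is illustrative only: the fleet sizes, daily volumes and campaign lengths are hypothetical values chosen for the example, not figures from any project described here, and decimal units (1 TB = 1,000 GB, 1 PB = 1,000,000 GB) are assumed.

```python
GB_PER_TB = 1_000        # decimal units assumed
GB_PER_PB = 1_000_000


def campaign_volume_gb(vehicles: int, gb_per_vehicle_day: int, days: int) -> int:
    """Total data volume in gigabytes collected over a test campaign."""
    return vehicles * gb_per_vehicle_day * days


# Hypothetical development fleet: 10 vehicles, 2 TB per vehicle per day, 100 days.
development_gb = campaign_volume_gb(vehicles=10, gb_per_vehicle_day=2_000, days=100)

# Hypothetical user test: 100 vehicles, 1 GB per vehicle per day, one year.
user_test_gb = campaign_volume_gb(vehicles=100, gb_per_vehicle_day=1, days=365)

print(f"Development data: {development_gb / GB_PER_PB:.1f} PB")
print(f"User test data:   {user_test_gb / GB_PER_TB:.1f} TB")
```

With these assumed inputs, the development campaign ends up in the petabyte range while the user test stays in the terabyte range, which is the contrast described above.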
Access to data is not only limited by a lack of funding for providing the data. Another essential limitation is that the data contains personal data and data subject to intellectual property rights. Anonymization and de-identification are ways to work around this limitation and provide data more openly.
Examples of ITS/FOT data storage and access being utilized successfully
SHRP2
The SHRP2 database (http://www.trb.org/StrategicHighwayResearchProgram2SHRP2/Blank2.aspx, accessed on June 2nd 2018) contains NDS data from over 3,500 drivers recruited from six locations in the United States, in total more than 5 million trips. Data include video, sensor, vehicle network, and participant assessment data, as well as summary data related to events and trips. Roadway elements can be obtained from the Roadway Information Database (RID).
The data is stored at Virginia Tech in the US, and the organisation receives funding from the US DOT to keep the dataset available to researchers. Still, the model used is that if an organisation would like specific work done, such as the extraction of additional datasets or the annotation of video, it needs to pay for its own specific request. In this way, the funding goes to maintenance and access facilities for sustainable availability and storage of both the original data and the refined datasets used for papers.
The users of the SHRP2 data come from different parts of the world, the majority being from the United States. Depending on the level of detail requested and the need for personally identifying information (PII), data is accessed either through the InSight website (available at https://insight.shrp2nds.us) or via a data use license (DUL). Video and GPS data can only be accessed within a secure data enclave. Two years after the dataset was opened up for re-use, there were 174 active DULs for SHRP2 data and between 20 and 30 requests per month.
UDRIVE
UDRIVE was the first large-scale European Naturalistic Driving Study, covering 120 cars, 50 trucks and 40 powered two-wheelers. The data was collected in six countries in Europe. The acquired data includes vehicle data, Mobileye, video (seven views: driver face, pedals, cockpit, steering wheel, front middle, left front, right front), GNSS, and questionnaires.
UDRIVE was by definition a data sharing project. Data management was centralised since all the collected pre-processed data was stored and managed by the Central Data Center (CDC). The CDC provided remote access to all analysis sites, and all analysis was performed on one single dataset.
To protect the data throughout the data handling chain, a “Data Protection Concept” was developed. The concept also sets the specific requirements for data protection after the project. The data can be remotely accessed after the project by third parties if funding is provided. To protect personal data, video and GPS data can only be accessed via a secure enclave at one of the project partners having remote access to the CDC.
After the project, eight former UDRIVE partners started the UDRIVE data User Group to jointly pay to keep the data available. The data remain at SAFER, the former CDC, and are accessible to the eight partners using the project solution of remote access. Still, the funding is not sustainable, as the partners depend on projects to keep the data available. Two partners have downloaded the dataset after having implemented secure data protection in line with the data protection concept. Currently, these datasets are only available to researchers within their organisations.
ITS Public Data Hub
In recent years, the Research Data Exchange (RDE) of the US Department of Transportation’s ITS Joint Program Office (JPO) collected and published data from various tests, especially C-ITS pre-deployment tests. The RDE has now been deprecated and its datasets have been transferred to the ITS Public Data Hub (available at www.its.dot.gov/data/), which is a publicly funded organisation. The ITS Public Data Hub provides a single point of entry to over 100 public datasets, enabling third-party research and harmonization of similar data. Much of the data is about connected vehicle tests, the latest addition being the Wyoming DOT Connected Vehicle Pilot in early 2018 (https://www.its.dot.gov/data/about/, accessed on June 2nd 2018).
The JPO has been setting up further practices for also sharing data that cannot become fully public for privacy or confidentiality reasons. They are developing controlled-access research data systems providing varying access rights (https://www.its.dot.gov/data/public-access/, accessed on June 2nd 2018).