The introduction of GDPR in Europe is raising general awareness of privacy regulations. In FOTs and other tests, it seems that partners have become more cautious than before about what could be collected. The FOT/NDS community has stated for many years that anonymization is needed to ensure the privacy of the participants in a study (e.g. when publishing results or to enable data sharing). This is true, but incredibly difficult to achieve as long as the original dataset is still accessible. By legal definition, anonymized data cannot be traced back to a person by any means reasonably likely to be used. Even aggregated data (grouped by participant) can be traced back to the user if the original dataset is still available. In practice, data de-identification and pseudonymization are implemented instead.
Pseudonymization is described in GDPR and is used to remove direct identifiers (such as name or social security number). Instead of a driver’s name, a pseudonymous identification number or a vehicle identifier is used. This has been common practice in the FOT/NDS community for many years.
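The replacement of direct identifiers can be sketched as follows. This is a minimal illustration, not a description of how any particular FOT/NDS project does it: it assumes a keyed hash (HMAC-SHA256) instead of a manually maintained lookup table, and the field names `driver_name` and `registration_number`, the `D-`/`V-` prefixes and the pseudonym length are hypothetical. The secret key is assumed to be stored separately from the dataset, since anyone holding both could re-link the pseudonyms.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes, length: int = 12) -> str:
    """Map a direct identifier to a stable pseudonym with a keyed hash."""
    # Normalise so that "ABC 123" and " abc 123 " yield the same pseudonym.
    normalized = identifier.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
    return digest[:length]


def pseudonymize_record(record: dict, secret_key: bytes) -> dict:
    """Return a copy of the record with direct identifiers replaced.

    The field names used here are hypothetical examples.
    """
    out = dict(record)  # shallow copy; the input record is left untouched
    out["driver_id"] = "D-" + pseudonymize(out.pop("driver_name"), secret_key)
    out["vehicle_id"] = "V-" + pseudonymize(
        out.pop("registration_number"), secret_key
    )
    return out
```

The same driver always receives the same pseudonym under a given key, so trips can still be grouped per driver. The result remains pseudonymized personal data rather than anonymized data, because the key holder can reproduce the mapping.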
De-identification methods can be seen on a scale where the lowest levels reveal more about the participant and the higher levels make it quite difficult to trace the data back. In principle, the FOT/NDS community uses different de-identification strategies based on the dataset and the consent of the participants.
De-identification is necessary if data is to be shown publicly or shared. In recent years, the topic of de-identification has featured increasingly in public events and workshops, and also in tool development. FOT-Net hosted two workshops that mainly covered video and GPS data de-identification and anonymization issues, and new tools are being developed.
Data reduction by video coding is seen as a main de-identification method in FOTs: only the codes and timestamps would be shared, not the original videos. Such annotation codes can extend to classification of the driver’s facial movements or pose behind the wheel. The main issue is that the currently available driver monitoring tools hardly extend beyond detecting gaze and head direction. There is a need for new tools that detect further attributes and annotate driver activity, e.g. whether the driver is talking on a phone or, in the automated vehicle case, reading a book.
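As an illustration of sharing only codes and timestamps, the sketch below separates the internal annotation record, which still points to the video, from the record that leaves the data centre. The class, the field names and the example code value are hypothetical and do not follow any specific annotation scheme.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VideoAnnotation:
    """Internal annotation record (hypothetical layout)."""
    trip_id: str
    start_s: float   # start of the coded event, seconds from trip start
    end_s: float     # end of the coded event
    code: str        # annotation code, e.g. "phone_call"
    video_file: str  # reference to the original video, internal use only


# Only these fields are exported; the video reference never leaves the site.
SHAREABLE_FIELDS = ("trip_id", "start_s", "end_s", "code")


def to_shareable(annotations):
    """Return plain dictionaries containing codes and timestamps only."""
    shareable = []
    for annotation in annotations:
        full = asdict(annotation)
        shareable.append({name: full[name] for name in SHAREABLE_FIELDS})
    return shareable
```

Listing the exported fields explicitly, rather than removing known sensitive ones, means that a field added to the internal record later is not shared by accident.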
Automated vehicles also collect video data (i.e. personal data) from other road users and from the passengers of the car. In a research project this is usually legal, provided that there is a valid need for the data collection, that videos are not collected from restricted areas and that the data will be protected. After the data is no longer needed by the research project, it should be anonymized or archived safely. For production vehicles, the aim is to develop video analysis tools capable of extracting the wanted features from the video, so that the video can be deleted already in the vehicle. This addresses both the personal data issues and the amount of data that storing video would involve.
Besides videos, automated vehicle research also raises other general de-identification topics: GPS and other vehicle internal data, be it from sensors or the navigation functionality. Where similar data in previous FOTs only displayed driver activity or driving style, in AV tests vehicle data shows how the product manoeuvres in traffic. Developers can be worried that such data, when shared, reveals too much about the implementation logic of the automated driving functions.
FOTs generally have a technical evaluation part that deals with detailed performance. Such data is processed as confidential and related reports are reviewed before publication. This way of working also applies to automated vehicle pilots. At this moment, it is still unclear to what degree vehicle data can and will be de-identified in ongoing test projects before sharing any data. It seems that the confidential product-related data would be restricted to named persons and partners carrying out technical analyses and other first-hand investigations.
FOT and pilot evaluation concentrates on changes in driving style or driver behaviour. After the first level of analysis has been completed, i.e. indicators and e.g. speed distributions have been calculated and videos have been annotated, the resulting data could to some extent be considered de-identified and mostly free of confidential product details. For example, such processed data would give the vehicle speed distribution on motorways in percentages. Indicators are derived from raw data, but the raw data would not have to be shared.
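A speed distribution of this kind can be derived from raw speed samples roughly as follows. This is only a sketch: the 10 km/h bin width, the open-ended top bin at 160 km/h and the function name are assumptions made for illustration.

```python
from collections import Counter


def speed_distribution(speeds_kmh, bin_width=10, max_speed=160):
    """Return the share of samples per speed bin, in percent.

    Samples at or above max_speed are collected in one open-ended bin.
    """
    if not speeds_kmh:
        return {}
    counts = Counter()
    for speed in speeds_kmh:
        # Lower edge of the bin the sample falls into; negatives go to 0.
        lower = int(max(speed, 0.0) // bin_width) * bin_width
        counts[min(lower, max_speed)] += 1
    total = len(speeds_kmh)
    distribution = {}
    for lower in sorted(counts):
        if lower == max_speed:
            label = f"{max_speed}+"
        else:
            label = f"{lower}-{lower + bin_width}"
        distribution[label] = round(100.0 * counts[lower] / total, 1)
    return distribution
```

Only the resulting percentages would be shared; the individual samples, their timestamps and their positions stay with the data owner. Bins that contain very few samples may still deserve a manual check before release.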
In conclusion, de-identification and anonymization are essential for being able to collect, store and access personal and IPR data. Large efforts are currently invested in automated annotation tools. These tools help automated vehicle manufacturers to reduce the amount of data and to enable further developments based on real-world data, and they help researchers to share data more openly.
Privacy by design
To adhere to GDPR, it is important for organisations to act according to Art. 25 on data protection by design and by default. This means that any organisation managing personal data must implement technical and organisational measures from the earliest stage of data processing. In a FOT/NDS context this could mean that any data transferred must be encrypted, safely and securely stored, and that any direct identifiers (e.g. driver name or vehicle registration number) would be replaced in a pseudonymization step (e.g. using driver id or vehicle id). In addition, the European Union Agency for Network and Information Security (ENISA) proposes different strategies for “Privacy by design in the era of big data” (D’Acquisto et al. 2015), described in table Privacy by-design strategies.
|Privacy by-design strategy||Description|
|Minimize||The amount of personal data should be restricted to the minimal amount possible (data minimization).|
|Hide||Personal data and their interrelations should be hidden from plain view.|
|Separate||Personal data should be processed in a distributed fashion, in separate compartments whenever possible.|
|Aggregate||Personal data should be processed at the highest level of aggregation and with the least possible detail in which it is useful.|
|Inform||Data subjects should be adequately informed whenever their personal data is processed (transparency).|
|Control||Participants should be provided agency over the processing of their personal data.|
Acting on the two concepts of data protection by design and by default and privacy by design gives a solid ground for the data centre. The data centre must, however, balance the two concepts against the research needs. It is important to document the decisions made in these areas.
Two sets of requirements are suggested below, one for data centres (DC) and one for analysis sites (AS), to support the process of implementing and documenting data protection. This document recommends that eight requirements be considered by a DC, called DC1–DC8, and ten by an AS, called AS1–AS10. Moreover, documents related to both the DC and the AS are listed. Depending on the classification of the data involved, the needed level of protection will vary, regardless of the data size.
It is important to state that these requirements should be seen as a starting point for the FOT/NDS project organisation to further investigate the issue together with their IT department. The requirements and implementation plans need to be adapted to the categories of the dataset as well as to the existing IT infrastructure of the organisation.
Note that additional requirements must be considered for special categories of personal data (GDPR Art. 9).
Having a completely anonymised dataset could mean that the usefulness and value for analysis is so low that keeping the dataset available is meaningless. In any case, legal and ethical restrictions on how long one is allowed to keep a personal dataset might force its deletion anyway. These restrictions apply to several current datasets, increasing the need for methods to extract essential, anonymous data before the data is discarded.
For rich media (such as video or images or GPS traces), feature extraction is the key to preserving privacy. Feature extraction could be used to translate media data into measures, thus removing the identifiable elements. Efficient feature extraction would solve two major issues for FOT/NDS: current datasets could be shared and features could be extracted from data before they are purged.
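For GPS traces, translating the data into measures could look like the sketch below. It reduces a trace to a few trip-level measures and discards the coordinates. The choice of measures, the input layout of `(time in seconds, latitude, longitude)` tuples and the function names are assumptions for illustration; the great-circle distance uses the standard haversine formula with a mean Earth radius.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in metres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def trip_features(trace):
    """Reduce a time-sorted trace of (t_s, lat, lon) to trip-level measures.

    The returned dictionary contains no coordinates.
    """
    if len(trace) < 2:
        raise ValueError("a trace needs at least two points")
    distance_m = sum(
        haversine_m(a[1], a[2], b[1], b[2])
        for a, b in zip(trace, trace[1:])
    )
    duration_s = trace[-1][0] - trace[0][0]
    mean_speed = 3.6 * distance_m / duration_s if duration_s > 0 else 0.0
    return {
        "distance_km": round(distance_m / 1000.0, 2),
        "duration_min": round(duration_s / 60.0, 1),
        "mean_speed_kmh": round(mean_speed, 1),
    }
```

Whether such measures are sufficiently de-identified depends on the dataset: an unusually long or rare trip may still be recognisable from its distance and duration alone.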
The first decision to make is which features should be extracted from the data; if the extraction is being performed prior to data deletion, data owners, providers and researchers must collaborate on this difficult task. The next step is to select an extraction method and evaluate its performance. Some interesting cases have been published regarding the SHRP2 dataset (Smith et al., 2015 and Seshadri et al., 2015). Promising efforts are ongoing to evaluate and improve extraction methods, and interesting results were presented at the two consecutive FOT-Net Anonymization workshops in Gothenburg (2015 and 2016), giving an overview of European and American efforts. The presentations can be found on the FOT-Net website (http://fot-net.eu/library/?filter=workshops). Finally, the project must decide if it has the extensive computational resources required to extract features from a large dataset.
The main benefit of feature extraction is the possibility of enhancing existing datasets with new attributes or measures, previously only available from costly video coding processes.
GPS traces are also considered personal data, albeit indirect, as they can potentially reveal where people live and work and even their children’s schools. Similarly, no detailed travel diaries covering long periods of time can be made public if they contain addresses, even though a person making a single trip in the diary could actually be anyone living or working at those addresses. There are many approaches being explored to ensure personal integrity, e.g. k-anonymity and differential privacy (D’Acquisto et al. 2015). The trade-off here, between anonymization and maintaining the usefulness of the data for research, is difficult.
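The k-anonymity idea can be illustrated on trip origins. The sketch below is a simplified example applied to a single quasi-identifier, not a complete anonymization method: origins are generalised to grid cells, and a cell is released only if at least k distinct participants start trips there. The cell size of 0.01 degrees, the default k = 5 and the function names are assumptions.

```python
import math
from collections import defaultdict


def grid_cell(lat, lon, cell_deg=0.01):
    """Generalise a coordinate to the index of its grid cell."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def k_anonymous_cells(origins, k=5, cell_deg=0.01):
    """Return the cells in which at least k distinct participants start trips.

    origins: iterable of (participant_id, lat, lon) tuples.
    """
    participants_per_cell = defaultdict(set)
    for participant_id, lat, lon in origins:
        participants_per_cell[grid_cell(lat, lon, cell_deg)].add(participant_id)
    return {
        cell for cell, participants in participants_per_cell.items()
        if len(participants) >= k
    }


def release_origins(origins, k=5, cell_deg=0.01):
    """Return generalised origins, suppressing those in rare cells."""
    origins = list(origins)
    allowed = k_anonymous_cells(origins, k, cell_deg)
    released = []
    for _participant_id, lat, lon in origins:
        cell = grid_cell(lat, lon, cell_deg)
        if cell in allowed:
            released.append(cell)
    return released
```

The trade-off mentioned above shows up directly in the parameters: larger cells or a higher k suppress fewer or more trips respectively and give stronger protection, but the released origins become less precise and less useful for analysis.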