Examining Uber’s Use of the Sexual Misconduct and Violence Taxonomy and the Development of Uber’s United States Safety Report

By: Janine Zweig, Chad Sniffen, and Emily Tiry
November 21, 2019

Introduction

In 2017, Uber Technologies’ leadership recognized that its system for categorizing users’ reports of sexual harassment, sexual misconduct, and sexual assault could be improved to help the company more fully understand the nature and scope of these problems as experienced by users of its platform. The classification system had limited categories and relied heavily on subjective determinations by agents. Without an objective basis, categorization could not be consistently applied. Recognizing the need for outside expertise to create an effective categorization system, Uber’s leadership engaged with RALIANCE, the National Sexual Violence Resource Center (NSVRC), and the Urban Institute (Urban) to develop a research-informed categorization system. We published the Sexual Misconduct and Violence Taxonomy in late 2018.[1]

The Sexual Misconduct and Violence Taxonomy includes 21 behaviorally-specific categories that do not overlap (are mutually exclusive) and capture the full array of possible incidents (are collectively exhaustive). Appendix A includes the final taxonomy categories. At the same time the taxonomy was published, Uber committed to publishing a US Safety Report in 2019 that would provide information about user-reported safety incidents, including those categorized within the Sexual Misconduct and Violence Taxonomy. Once the taxonomy was published, Uber began using it to categorize all new incidents of sexual harassment, sexual misconduct, and sexual assault reported by platform users. Uber also retrospectively applied the taxonomy to such incidents reported in 2017 and 2018.
The project described in this brief report aimed to assess Uber’s integration of the taxonomy into its system of receiving and accurately categorizing complaints from platform users, and Uber’s approach to developing the US Safety Report. The project had two components:

- A verification analysis of how Uber agents and auditors categorize user-reported incidents into the taxonomy. This process compared the way Uber’s staff categorized user-reported incidents to the way staff from NSVRC and Urban did for two samples of reports.[2] The goal of this comparison was to determine the overall reliability with which Uber staff categorize reports into the taxonomy and the process of auditing these data.
- A system assessment of Uber’s overall integration of the taxonomy into its incident-reporting process, of how taxonomy data are managed and audited, and of how the sexual assault data are to be presented in Uber’s forthcoming 2019 US Safety Report. The goal of this assessment was to document the taxonomy data cleaning and processing system, and to assess the integrity, objectivity, and rigor with which Uber analyzed the sexual assault data and intended to report it in the US Safety Report.

The remainder of this brief report documents the observations from these two project components. First, we present a summary of Uber’s procedures for processing a report of sexual misconduct or violence. This provides important context for understanding the resulting data included in Uber’s taxonomy. Next, we describe our project methods for verifying the use of the taxonomy and its data, and for conducting our systems assessment.
We then present our findings and observations from both project components, the verification analysis and the systems assessment, for three domains: the training of customer service agents and incident report auditors; the incident report categorization and alignment process; and the approach to and analytic strategies for the sexual assault data intended for the US Safety Report.

What happens when a user reports an incident of sexual misconduct or violence to Uber?

Reports of sexual misconduct or violence come to Uber’s attention in several different ways, including a user (a passenger or driver) reporting the incident through the Uber app or Uber’s website, by phone, or through social media. Incidents also may come to Uber’s attention if a law enforcement agency contacts the company during an investigation.

Reports made through the app or website are first processed through an automated system that determines whether the report is safety-related. If either keywords or the natural language processing (machine learning) system classifies a report as safety-related, it is routed to Uber’s Incident Response Team, which comprises multiple teams of safety support agents that address these reports. Once a report is routed there, a team of agents first confirms that it is safety-related and determines whether it is urgent or non-urgent. If this team determines the report to be urgent, it is sent to a second team that makes the first outreach to the reporting party, places a hold on the account of the accused party, and categorizes the report into the taxonomy for the first time. After this, the report is routed to a team of agents that conducts more extensive outreach to gather additional information from the reporting party, the party the report is against, and any third parties, and that reviews additional, potentially relevant information, such as global positioning system (GPS) data and account histories.
After gathering this additional information, these agents may change the taxonomy category to which the report was first assigned and take appropriate action toward the users, as defined in Uber’s internal standards.

All reported incidents that receive a taxonomy classification are audited. Uber’s Safety and Insurance Analytics data team assigns each incident report to an auditor, who reviews the necessary information collected during the customer support and investigation process, and independently categorizes the incident report into the taxonomy. If the auditor’s categorization matches the final categorization of the customer support agent or investigator, the audit is complete. If the auditor’s categorization differs, the report is automatically returned to the audit queue and assigned to another auditor. If the auditor is unsure of how to categorize a report, they are able to escalate the report to a Subject Matter Expert (SME). SMEs are Uber employees who have extensive experience in the use of the taxonomy and who provide training to other employees. If questions still remain about the appropriate classification of a report, it may continue to be escalated to a team lead, the Safety and Insurance Analytics data team, the legal team, and eventually outside experts such as NSVRC. Incident reports in which the auditor’s categorization differed from the customer support agent’s, or in which the auditor was unsure of the categorization, are reviewed in periodic escalation sessions where the final taxonomy category is applied by Uber’s team leads and relevant managers.

Methods

This project included two components: a verification analysis and a systems assessment.
Methods for component 1: Verification analysis

For the first component, NSVRC/Urban staff used the taxonomy to categorize two samples of user-reported sexual misconduct and violence incident reports from the time period being considered for Uber’s 2019 US Safety Report (2017 and 2018): a non-random sample of 200 reports and a representative, random sample of 383 reports. Our categorizations were assessed for the extent of alignment within our team and compared with Uber’s categorization.

The 200 non-random reports were selected because Uber employees had perceived them as particularly difficult to classify. The 383 random reports spanned the full range of sexual misconduct and sexual assault incidents and were selected using a randomized database query designed by an Uber employee. For the purpose of this analysis, it was determined that 383 incidents would be a sufficient sample size to be representative of the overall set of reported incidents classified using the taxonomy at a 95 percent confidence level.

Each NSVRC/Urban team member individually categorized the reports according to the published taxonomy, guided by an Uber training document developed for the company’s customer service agents and auditors. The training document provides definitions of terms and examples of qualifying and non-qualifying incidents to illustrate each taxonomy category. It is a taxonomy knowledge resource developed by Uber for internal use (but is not the only training resource available to Uber’s internal taxonomy users).

After we categorized the incident reports individually, we compared our results and documented any differences between categorizations. When we did not classify a report in the same way, we discussed the considerations behind our classifications and came to a consensus on the single final category that seemed most correct for each incident.
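As an illustration of the sample-size reasoning above, the following is a minimal sketch of a standard calculation (Cochran's formula with a finite population correction). It is not the calculation documented for this project: the report states only the 95 percent confidence level, so the 5 percent margin of error, the conservative proportion of 0.5, and the population of 100,000 reports used in the usage note are all assumptions chosen for illustration, and `required_sample_size` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def required_sample_size(population=None, confidence=0.95,
                         margin=0.05, proportion=0.5):
    """Sample size needed to estimate a proportion (Cochran's formula).

    population: total number of reports; None means "treat as very large".
    margin and proportion are ASSUMED defaults, not values from the report.
    """
    # Two-sided critical value, about 1.96 for 95 percent confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * proportion * (1 - proportion) / margin ** 2
    if population is None:
        return math.ceil(n0)
    # Finite population correction shrinks the requirement slightly.
    return math.ceil(n0 / (1 + (n0 - 1) / population))


if __name__ == "__main__":
    print(required_sample_size())          # very large population
    print(required_sample_size(100_000))   # hypothetical population size
```

Under these assumptions the requirement is 385 for a very large population and 383 for a hypothetical population of 100,000, which shows how a figure such as 383 can arise; the actual population and margin of error behind the project's sample are not stated in this report.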
We then compared NSVRC/Urban categorizations to the incident categorization determined at the final stage of Uber’s internal audit process. To assess the rigor of the alignment process, we calculated both percent agreement and kappa statistics; kappa accounts for agreement that may have occurred by chance. Kappa values are commonly interpreted as follows: <0 = poor agreement; 0.00-0.20 = slight agreement; 0.21-0.40 = fair agreement; 0.41-0.60 = moderate agreement; 0.61-0.80 = substantial agreement; and 0.81-1.00 = almost perfect agreement.[3] We followed the same process for both samples of reports.

Methods for component 2: System assessment

For the second project component, and to assess Uber’s overall integration of the taxonomy into its incident reporting system and how that integration might contribute to its 2019 US Safety Report, staff from NSVRC and Urban conducted seven interviews with nine Uber employees (six individual interviews and one three-person interview). Each of these individuals has direct influence over the way the taxonomy is implemented, how incident data categorized by the taxonomy are managed, and how aggregate sexual assault data based on the taxonomy are likely to be disclosed by Uber in its US Safety Report. Interviews lasted between 45 and 120 minutes and were conducted via video-conference. We developed an interview protocol covering the following general categories of questions:

- The history and development of Uber’s response to unwanted sexual experiences encountered by users of its platform.
- The steps in Uber’s incident reporting and response process.
- The training of Uber’s customer service agents and auditors.
- The quality of incident report data, including data comprehensiveness and how they are cleaned and analyzed.
- The identification of the specific sexual assault data and analytic approaches for what may be reported in the US Safety Report.
- The overall processes involved in writing the US Safety Report.
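To make the agreement statistics used in the verification analysis concrete, the sketch below computes percent agreement and Cohen's kappa for two raters and maps the result onto the Landis and Koch interpretation bands cited in the methods. It is an illustration under stated assumptions, not the project's analysis code: the function names are hypothetical, and the ten ratings with the placeholder labels `cat_a` and `cat_b` are toy data invented for the example, not incident data.

```python
from collections import Counter


def percent_agreement(ratings_a, ratings_b):
    """Share of items that both raters placed in the same category."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(ratings_a)
    observed = percent_agreement(ratings_a, ratings_b)
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: probability both raters pick the same category
    # if each rated independently at their own marginal rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if expected == 1:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


def landis_koch_label(kappa):
    """Interpretation bands from Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


if __name__ == "__main__":
    # Toy ratings with placeholder category labels (not real data).
    rater_1 = ["cat_a"] * 6 + ["cat_b"] * 4
    rater_2 = ["cat_a"] * 5 + ["cat_b"] + ["cat_b"] * 3 + ["cat_a"]
    kappa = cohens_kappa(rater_1, rater_2)
    print(percent_agreement(rater_1, rater_2), round(kappa, 2),
          landis_koch_label(kappa))
```

In this toy case the raters agree on 8 of 10 items, but because half of that agreement would be expected by chance, kappa is noticeably lower than the raw agreement rate. Cohen's kappa applies to two ratings per item; comparing three raters, as in the NSVRC/Urban internal alignment, calls for Fleiss' kappa instead.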
Findings and observations

Below, we report our findings and observations from both project components for the following topics: the training of customer service agents and incident report auditors; the incident report categorization and alignment process; and the approach to and analytic strategies for the Safety Report.

Training of customer service agents and incident report auditors

We found that Uber engages in a robust training process that focuses on continual categorization alignment between employees who use the taxonomy for both incident response and auditing purposes. Initial training for auditors and for agents who handle urgent cases includes 6 to 8 weeks of onboarding and covers background on sexual misconduct and assault, the reporting pathway process, and agent well-being. Agents also receive training in investigation guidelines (including sensitive investigations and interview skills).

Ongoing training occurs for both groups of staff as well. For customer service agents, this is accomplished by weekly and monthly audits of reports to determine the degree of internal alignment between agents for taxonomy classifications, with a goal of 95% agreement. Also, recordings of agent interactions with users are routinely reviewed and evaluated based on categorization accuracy, empathy with the user, and correct use of Uber’s incident response process.

Auditors go through similar routine alignment testing. Auditors are given biweekly samples of reports for which a categorization key has been created by team leads and management. The goal for auditors is 85% alignment to the categorization key. The overall goal is for alignment rates to be at these levels for each individual taxonomy category; Uber is working toward this goal, and some categories already exceed it.
Incident report categorization and alignment process

We found that, overall, Uber has developed a rigorous process to promote the accuracy of incident report categorizations, leading to reliable data being captured in the taxonomy. While training and continuous alignment testing of customer service agents and incident report auditors (described above) are an important part of this process, criteria for data cleaning are also important. In addition, Uber has high accountability when processing incident report data. All modifications to an incident report (e.g., a change in taxonomy categorization) are logged as part of that data record. Thus, if a report’s categorization changes, its history can be tracked, along with documentation of why the change occurred.

Our verification analysis indicated substantial levels of agreement in categorizing reports into the taxonomy. Among the internal project team, NSVRC/Urban staff had an 80 percent agreement rate (Fleiss’ kappa = 0.76) for the sample of 200 non-random reports of incidents and an 82 percent agreement rate (Fleiss’ kappa = 0.80) for the sample of 383 random reports of incidents.

When comparing our categorizations with Uber’s classification, the agreement rates varied. For the sample of 200 non-random reports of incidents, the NSVRC/Urban classification aligned with Uber’s classification for 160 of 200 (80 percent) reports (Cohen’s kappa = 0.75).[4] After discussing this analysis with Uber staff, we excluded 13 reports for which NSVRC/Urban did not have the same information as Uber’s auditors when categorizing them (that is, NSVRC/Urban staff were provided less information about the report), which had resulted in misalignment. When excluding those 13 cases, NSVRC/Urban aligned with Uber for 160 of 187 reports (86 percent).
For the sample of 383 random reports of incidents, the NSVRC/Urban classification aligned with Uber’s classification for 267 of 383 (70 percent) incidents (Cohen’s kappa = 0.66). After further discussions with Uber staff, we learned that Uber has an “insufficient information” category for both sexual misconduct and sexual assault that we did not know was available during our categorization process. For 13 reports that Uber auditors had classified as “insufficient information,” we had used “unable to categorize,” which is essentially the same classification. Counting those 13 reports as aligned resulted in alignment of 280 of 383 (73 percent) incidents (Cohen’s kappa = 0.70).

In sum, data categorized in Uber’s taxonomy are reliable and reports can be consistently classified across agents. According to common interpretation standards of kappa statistics,[5] our analysis showed agreement at or near the boundary between substantial and almost perfect (Fleiss’ kappa = 0.76 and 0.80) among the NSVRC/Urban team members in classifying sexual misconduct and assault reports made to Uber, and substantial agreement (0.61-0.80) between NSVRC/Urban staff and Uber staff.

Observations around misaligned incident reports

Though our agreement rates both within the NSVRC/Urban team and in comparison to Uber are at defensible levels, we identified patterns in where misalignments occurred. These misalignments generally fell into three categories of issues: training issues, definition issues, and eligibility issues. Notably, when we shared the training and definitional issues we identified when classifying reports into the taxonomy, Uber staff indicated that some of these issues have been addressed in updated training materials and processes since the initial materials were provided to the NSVRC/Urban team.
Training issues

Misalignment due to incorrect use of the taxonomy, which we characterize as training issues, occurred when classifications did not match instructions and/or definitions provided in the training materials. These include:

- Assuming intent or actions not described in the report of the incident.
  - Coding “attempted non-consensual sexual penetration” without clear description of attempted penetration or mention of attempted removal of clothing or other specific phrases included in the definition of this category. For example, “got on top of” and “forced down” are phrases not included in the definition of this category.
  - Attempting to classify a report without enough behaviorally-specific information (e.g., the only experience described is “inappropriate sexual advance” without additional specific behaviors described).
- Not coding actions according to provided definitions.
  - Not coding licking as non-consensual kissing.
  - Classifying a described body part as a sexual body part instead of non-sexual body part (as per the definitions) and vice versa.
- When more than one type of behavior is described in the report, coding reports as a category lower in severity instead of the category with the highest severity as required by the taxonomy instructions, such as:
  - Completed touching/kissing of sexual OR non-sexual body parts should supersede attempted touching/kissing.
  - Attempted non-consensual sexual penetration should supersede completed touching/kissing of non-sexual body part.

Definitional issues

Misalignment due to definitional issues occurred when the training materials lacked the guidance necessary to consistently distinguish between taxonomy categories. We also identified issues relating to the ambiguity of certain words used in reports without additional contextual information that could be used to help with categorization. This includes:

- Lack of guidance regarding how the word “trying” distinguishes between attempted and completed actions.
  Reports that used the word “trying” without any other description of completed touching or kissing were often interpreted as completed actions.
- Lack of guidance about whether categorizing the female chest as a sexual body part is age dependent.
- A number of incidents use potentially sexually-related keywords such as “seduce,” “pervert,” “sexual harassment,” and “inappropriate,” but provide no additional context, and the training materials did not specify how these incidents should be categorized.
- Lack of clarity on the difference between a “personal question” and an “explicit comment.” The definition of “asking personal questions” included questions about sexual preferences but did not specify if the concept of “preferences” includes sexual behaviors and activities.
- Lack of clarity on whether the perception of a sexual or romantic component is required for some categories, including “flirting” and “staring or leering.”
- Lack of information on how to categorize removal of clothing when that removal does not access a sexual body part (e.g., removing clothing that has vomit on it).

Definitional and Training Issues Addressed by Uber

When we shared the training and definitional issues we identified when classifying reports into the taxonomy, Uber staff indicated that some of these issues have been addressed in updated training materials and processes since the initial materials were provided to the NSVRC/Urban team. Thus, misalignments generated by some of these issues have been resolved. Specifically:

- Uber already interprets entrapment and/or use of force, such as holding someone down or lying on top of someone, as attempted non-consensual sexual penetration.
- Uber created an “insufficient information” classification, which is used when a report contains only non-specific words, such as “pervert,” “sexual harassment,” and “inappropriate,” and no additional information is obtained during follow-up by the agent.
- Uber has specified in the definition of attempted/completed non-consensual kissing that a reference to a “kiss” without a specified body part should be assumed to mean oral contact (i.e., sexual body part).
- Uber has expanded the definition of soliciting a sexual act to include a report that includes the words “seduce” or “seduction.”

Eligibility issues

Misalignment due to eligibility issues occurred when reports were classified according to the taxonomy although they likely should not be eligible for inclusion. For example:

- Reports with a lack of Uber involvement (i.e., the reported incident took place outside the context of an Uber ride).
- Reports that have been previously reported (i.e., “double counting” a reported incident).

Approach to and analytic strategies for the Safety Report

During our system assessment, Uber shared that the US Safety Report would include specific taxonomy categories, provided reasons for including such categories, discussed its data cleaning and analysis approach, and provided a snapshot of the internal report review process. In general, we found these processes to be based on careful consideration, with a focus on reporting safety information accurately and using rigorous data.

Uber shared the criteria it employed to include a sexual assault taxonomy category in the US Safety Report for the years 2017 and 2018. Categories were reported if they: (1) represented the most serious incidents occurring on the platform; (2) included reliable data; and (3) could be compared in some capacity to existing national data sets on sexual assault. Using these criteria, Uber decided to report on five sexual assault categories. Though the report does not include any sexual misconduct categories, reporting on five sexual assault categories provides the public with information about the nature of the most severe experiences occurring on the platform.
The five categories are:

- Non-Consensual Sexual Penetration
- Attempted Non-Consensual Sexual Penetration
- Non-Consensual Kissing: Sexual Body Part
- Non-Consensual Touching: Sexual Body Part
- Non-Consensual Kissing: Non-Sexual Body Part

The data for these categories were classified reliably by Uber taxonomy users. Aggregating across these five categories, alignment rates between auditors for classifying reports within them exceed 85%. Alignment also exceeded 85% for four of the individual categories; the exception was attempted non-consensual sexual penetration, which had a 78% alignment rate.

When it came to data cleaning within each of these five taxonomy categories, Uber decided to include nearly all reports of incidents within the categories in the US Safety Report. Two data-cleaning criteria were used to identify sexual assault incident reports that would not be included in analyses for the US Safety Report: (1) reports that are not Uber-related (e.g., incidents reported to Uber that occurred on a different ride-sharing platform), and (2) reports that are directly disaffirmed by the reporting party (e.g., upon follow-up, a user confirms they only reported a sexually-related complaint to be given a free ride or other such disaffirmation).

Data analyses for the purposes of the US Safety Report were primarily descriptive in nature. As such, an assessment of their rigor and complexity is not particularly relevant. That said, the analytic approach to providing this descriptive information has two important features. First, Uber has decided to report incident rates for each of the five taxonomy categories rather than just the aggregate number of incidents across the five categories occurring during 2017-2018. Reporting rates rather than absolute numbers is key to the public understanding the scope of the problem in relation to the platform. The rates are contextualized based on the number of completed trips during 2017-2018.
Second, data for each category will be broken out by reporting party; that is, reported for drivers and riders separately. This is critically important and provides a major contribution to the sexual assault prevention and intervention field. To date, the field and the general public do not understand the extent to which drivers, as well as riders, have such experiences on the Uber platform. With this information, stakeholders from the field and Uber itself can examine any similarities or differences in the experiences of riders compared to drivers and develop tailored and specific prevention and intervention efforts based on those patterns.

Lastly, Uber has a demanding and layered internal review process for the draft report. The report has been reviewed by several staff representing several functional groups within Uber’s structure and senior leadership. The specific staff identified to review the draft report included those with a business need and those with subject matter expertise. The review process is complex but is intended to provide quality assurance so that the subset of taxonomy data being reported is as accurate as possible.

Conclusions

This project set out to assess Uber’s integration of the taxonomy into its system of receiving and accurately categorizing complaints from platform users (the verification analysis) and Uber’s approach to developing the US Safety Report (the systems assessment). Through the data collection and analysis activities we conducted, we learned that Uber has implemented the Sexual Misconduct and Violence Taxonomy with accuracy, employing strong quality assurance processes to ensure ongoing accuracy. Uber engages in a robust initial and ongoing training process that focuses on continual categorization alignment among employees who use the taxonomy for both incident response and auditing purposes.
These efforts aim to produce reliable data across all categories, and our analyses indicate that the goal has been largely achieved; the sexual assault data in the taxonomy categories included in the US Safety Report are statistically reliable. In general, we found the processes to develop the US Safety Report focused on accuracy and used rigorous data.

About the organizations

RALIANCE is a national partnership dedicated to ending sexual violence in one generation. RALIANCE partners with a wide range of organizations to improve their cultures and create environments free from sexual harassment, misconduct and abuse. Every day, RALIANCE helps leaders establish safe workplaces and strong communities by advancing research, influencing policy, and supporting innovative programs.

The National Sexual Violence Resource Center (NSVRC) is the leading nonprofit in providing information and tools to prevent and respond to sexual violence. NSVRC translates research and trends into best practices that help individuals, communities and service providers achieve real and lasting change. The center also works with the media to promote informed reporting. Every April, NSVRC leads Sexual Assault Awareness Month, a campaign to educate and engage the public in addressing this widespread issue. NSVRC is also one of the three founding organizations of RALIANCE, a national, collaborative initiative dedicated to ending sexual violence in one generation.

The Urban Institute, founded in 1968, is a trusted source for unbiased, authoritative insights that inform consequential choices about the well-being of people and places in the United States. It is a nonprofit research organization that believes decisions shaped by facts, rather than ideology, have the power to improve public policy and practice, strengthen communities, and transform people’s lives for the better. Urban Institute experts diagnose current challenges and look ahead to identify opportunities for change.
The Urban Institute’s Justice Policy Center is committed to developing evidence related to criminal justice challenges and has a long history of examining sexual assault, domestic violence, and other victimization experiences for the US Department of Justice, state government, and local jurisdictions. For example, Urban published the first national documentation of payment practices for sexual assault medical forensic exams, together with an assessment of the extent to which survivors are billed for such exams, and the first national documentation of state departments of corrections’ response to the Prison Rape Elimination Act.

About the Authors

Janine Zweig, Ph.D., is Associate Vice President and a Senior Research Fellow in the Justice Policy Center at Urban Institute. She conducts both theoretical research and program and policy evaluations related to sexual violence and victimization, intimate partner and dating violence, substance use, and criminal justice policies. She has examined several provisions of the Violence Against Women Act and the Prison Rape Elimination Act. Before graduate school, Dr. Zweig served as a volunteer/intern for a Women-In-Crisis program. This shelter for battered women serves five rural counties, and she participated in legal and child advocacy, case management, and hotline operation. She holds a Ph.D. in Human Development and Family Studies from the Pennsylvania State University.

Chad Sniffen, M.P.H., has been with the National Sexual Violence Resource Center (NSVRC) since 2015, where he is the Outreach Manager for both NSVRC and RALIANCE. He has worked or volunteered in the movements against sexual and domestic violence as a prevention educator, victim advocate, researcher, women’s self-defense instructor, and board member since 1999.
Prior to NSVRC, he worked at the California Coalition Against Sexual Assault for eight years focusing on technical assistance to rape prevention education programs and building the PreventConnect online community of practice. He has also worked for community rape crisis and domestic violence programs in California and Arizona, as well as campus prevention programs at the University of Arizona and the University of California, Davis. He holds a Master of Public Health degree from the University of Arizona.

Emily Tiry, M.P.P., is a Research Associate in the Urban Institute’s Justice Policy Center, where her current research focuses on governmental and non-governmental responses to sexual assault, development and validation of risk assessment tools, and the development of national statistics on state court criminal case processing. She has extensive experience with a variety of quantitative research methods, including evaluation methods, administrative data collection and analysis, and predictive modeling. Prior to joining the Urban Institute, she conducted research on gun violence and prescription drug monitoring programs. She received a Master of Public Policy degree from Duke University.
Appendix A: The Sexual Misconduct and Violence Taxonomy

Sexual Misconduct

- Staring or Leering
- Comments or Gestures > Asking Personal Questions
- Comments or Gestures > Comments About Appearance
- Comments or Gestures > Flirting
- Comments or Gestures > Explicit Gestures
- Comments or Gestures > Explicit Comments
- Displaying Indecent Material
- Indecent Photography Without Consent
- Soliciting Sexual Contact
- Masturbation / Indecent Exposure
- Verbal Threat of Sexual Assault

Sexual Assault

- Attempted Touching: Non-Sexual Body Part
- Attempted Kissing: Non-Sexual Body Part
- Non-Consensual Touching: Non-Sexual Body Part
- Non-Consensual Kissing: Non-Sexual Body Part
- Attempted Touching: Sexual Body Part
- Attempted Kissing: Sexual Body Part
- Non-Consensual Touching: Sexual Body Part
- Non-Consensual Kissing: Sexual Body Part
- Attempted Non-Consensual Sexual Penetration
- Non-Consensual Sexual Penetration

Notes

[1] Sniffen, C., Durnan, J., & Zweig, J. (2018). Helping industries to classify reports of sexual harassment, sexual misconduct, and sexual assault. Retrieved from the National Sexual Violence Resource Center: https://www.nsvrc.org/helping-industries-classify-reports

[2] This verification comparison was conducted by three NSVRC/Urban staff, two of whom were among the original staff that developed the taxonomy.

[3] For more information, please see Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174. doi:10.2307/2529310

[4] We used Cohen’s kappa when comparing NSVRC/Urban’s categorization with Uber’s because we only compared two ratings per report, as opposed to comparing three ratings per report when comparing NSVRC/Urban’s internal alignment.

[5] For more information, please see Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174. doi:10.2307/2529310

This study was funded by Uber Technologies, Inc.
The opinions, findings, conclusions, and recommendations expressed in this document are those of the authors and do not necessarily reflect those of the Urban Institute, National Sexual Violence Resource Center, or RALIANCE, or their trustees or funders.