Abstract
Research on fairness in machine learning (ML) has largely focused on individual and group fairness. With the adoption of ML-based technologies as assistive technology in complex societal transformations or crisis situations on a global scale, these existing definitions fail to account for algorithmic fairness transnationally. We propose to complement existing perspectives on algorithmic fairness with a notion of transnational algorithmic fairness and take first steps towards an analytical framework. We exemplify the relevance of a transnational fairness assessment in a case study on a disaster response system using images from online social media. In the presented case, ML systems are used as a support tool in categorizing and classifying images from social media after a disaster event as an almost instantly available source of information for coordinating disaster response. We present an empirical analysis assessing the transnational fairness of the application’s outputs based on national socio-demographic development indicators as potentially discriminatory attributes. In doing so, the paper combines interdisciplinary perspectives from data analytics, ML, digital media studies and media sociology in order to address fairness beyond the technical system. The case study reflects an embedded perspective of peoples’ everyday media use and of social media platforms as producers of sociality and of the data being processed - with relevance far beyond the case of algorithmic fairness in disaster scenarios. Especially in light of the concentration of artificial intelligence (AI) development in the Global North and a perceived hegemonic constellation, we argue that transnational fairness offers a perspective on global injustices in relation to AI development and application that has the potential to substantiate discussions by identifying gaps in data and technology. These analyses will ultimately enable researchers and policy makers to derive actionable insights that could alleviate existing problems with the fair use of AI technology and mitigate risks associated with future developments.
1 Algorithmic Fairness Beyond Individuals and Groups
Machine learning (ML) systems are, among other things, hailed for their ability to sort and scrutinize vast sets of information, for efficiency gains in complex distributional tasks, and for automating repetitive tasks. Considering these potentials, ML systems are increasingly used in automated decision-making (ADM) processes in many areas of life - from automated services and allocations in public administration to workforce management or medical diagnoses (AlgorithmWatch, 2020; Jarrahi et al., 2021; Krzywdzinski et al., 2022; Tomašev et al., 2019). They are, in other words, implemented as problem-solving agents for dealing with and managing individuals - with partly far-reaching impacts, as demonstrated in numerous cases of algorithmic discrimination (Ensign et al., 2018; Zuiderveen Borgesius, 2018).
Such examples of algorithmic discrimination have been accompanied by an extensive scientific discussion on algorithmic fairness (Barocas et al., 2019; Chouldechova & Roth, 2018; Binns, 2017). Unfair outcomes leading to discrimination have been established as one of the ethical challenges in the use of algorithms in ADM systems, next to others such as a lack of transparency or accountability (Mittelstadt et al., 2016). Within ML research, engagement with questions of algorithmic fairness has led to discussions both of how to assess and evaluate fairness in ML and of ways to mitigate bias as one of the causes of unfair outcomes - resulting in a broad field of fairness definitions as well as fairness metrics (the mathematical or statistical translation of fairness definitions) (Barocas et al., 2019). These have mostly focused on individual or group-level fairness (Castelnovo et al., 2022; Hertweck & Heitz, 2021).
Similarly, anti-discrimination legislation considers non-discrimination a personal right (European Union, 2000; Council of Europe, 1953; United Nations, 1948). In this sense, the question of algorithmic fairness has been treated as a matter of individual or group concern - in other words, as a question of whether individuals have been treated differently compared with others due to certain characteristics, due to belonging to a particular group of people (e.g. based on gender, ethnicity, education etc.), or due to belonging to otherwise algorithmically curated groups (Wachter, 2022).
In this work, we argue that a perspective on algorithmic fairness as an individual or group-level concern is insufficient. Technologies such as artificial intelligence (AI) and ML are increasingly considered a “strategic technology” (Durant et al., 1998) with supposed problem-solving capabilities far beyond the individual. ML applications are, for instance, referred to and strategically implemented as incremental tools within complex global transformational tasks or crisis situations, thereby implementing forms of algorithmic governance (Katzenbach & Ullbricht, 2019). Fostered by supranational political bodies such as the United Nations (UN) or the European Union (EU), but also by global corporations (European Commission et al., 2022; UN DESA, 2021), ML applications are expected to solve complex societal problems on a global scale (Katzenbach, 2021). Such problems include, for instance, managing online hate speech through automated platform governance (Gorwa et al., 2020), managing migration flows through automated border control and border construction (Amoore, 2021; Dijstelbloem et al., 2011; Pfeifer, 2021), reducing CO2 emissions through automated energy distribution (Klobasa et al., 2019; Nishant et al., 2020), and coordinating disaster management after crisis events through automated damage classifications (Depardey et al., 2019; Linardos et al., 2022).
Considering the increasing implementation of ML applications to automate decision-making processes targeting entire populations across nation-states and broader geographical regions, we propose a transnational perspective on algorithmic fairness as a complementary and necessary addition to individual and group-level fairness. The application scenario we consider is the use of ML systems in Disaster Response Management (DRM). We develop novel concepts for evaluating established fairness definitions in a transnational setting and demonstrate their efficacy in comprehensive empirical evaluations. While ML methods supporting DRM have been applied on a global scale in recent years, categorizing disaster events and their consequences based on image and text classification of social media content, from an ethical perspective it needs to be critically assessed whether such systems perform equally well across groups of countries or whether they systematically disadvantage disaster response support for specific types of countries (for instance based on their socio-economic status). If that were the case, efforts would need to be undertaken to enhance performance to ensure a globally fair distribution of disaster response based on ML methods.
We will provide a concept of transnational algorithmic fairness that grounds its fairness assessment in country-based development indicators as representations of sensitive attributes based on which nation-states might be discriminated against. We will then apply this concept of transnational fairness to a specific disaster response ML application and dataset to test whether certain groups of countries are structurally disadvantaged within the classification output.
Finally, we will reflect on social media platforms, wider telecommunication infrastructures and different patterns of using and appropriating social media platforms in the context of discussions on digital divides, in order to consider how the data processed by such systems are produced. Considering these data-producing infrastructures also raises concerns about origins of algorithmic (un)fairness beyond the (technical) ML system. In doing so, we introduce an interdisciplinary perspective informed by research on data analytics, ML, digital media studies and media sociology to arrive at an encompassing assessment of algorithmic fairness that goes beyond the technical system and includes an embedded perspective of peoples’ everyday media use and of social media platforms as producers of sociality and of the data being processed - with relevance beyond the case of algorithmic fairness in disaster scenarios.
2 Transnational Algorithmic Fairness and Global Justice
The increasing use of ML as a “strategic technology” (Durant et al., 1998) that has the capacity to change societies and whose societal role is being negotiated in political and public discourses (Katzenbach, 2021) requires a perspective on algorithmic fairness beyond individuals and groups within a national setting. On a global level, the Sustainable Development Goals (SDGs) and the discussion of how AI will contribute to their achievement (UN DESA, 2021; Vinuesa et al., 2020) exemplify the problem-solving capabilities that ML systems are being assigned on a transnational level. The examples above have demonstrated that ML applications increasingly affect individuals and populations on a global scale. They exemplify potentially unsettling consequences when ML applications produce very different outcomes for different populations. What has not been sufficiently addressed so far is how to ensure fairness beyond individuals and groups within nation-states when ML applications are applied in critical areas on a global scale. In other words, what is missing is a link between matters of global justice and algorithmic fairness.
The automated moderation tools of large online platforms, for example, differ fundamentally in their performance depending on the language that needs to be moderated. Automated tools to detect hate speech on social media platforms are usually first optimized for English and only subsequently for other majority languages. For minority languages, these tools might not work optimally - as demonstrated by Facebook’s inability to moderate the wave of hatred and calls to violence in Burmese targeted at the Rohingya minority in Myanmar since 2017. A UN fact-finding mission has since found that Facebook, as the platform through which incitements of hatred and calls to violence were able to spread in Myanmar, played a role in the resulting genocide (UNHRC, 2018). The later revelations by whistleblower Frances Haugen confirmed that Facebook did not invest sufficiently in safeguards to prevent extreme forms of hatred from spreading in Myanmar (or, similarly, in Ethiopia), given that its automated moderation tools did not perform sufficiently well in the local language Burmese (Akinwotu, 2021). Such examples demonstrate that differences in performance, as the basis of unfair algorithmic outcomes and forms of algorithmic discrimination, matter from a global justice perspective.
However, not all differential treatment can be considered as leading to unfair outcomes or discrimination. While the discussion on algorithmic fairness focuses on disparity, it is essential “to ask (...) whether the disparities are justified and whether they are harmful” (Barocas et al., 2019, p. 3). While Mittelstadt et al. (2016, p. 8) define discrimination in the context of ADM as the “adverse disproportionate impact resulting from algorithmic decision-making”, Barocas et al. (2019, p. 76) specify that discrimination “is not different treatment in and of itself, but rather treatment that systematically imposes a disadvantage on one social group relative to others”. Different reasons exist why some differential treatment might be considered adverse, as imposing disadvantages and thus being morally objectionable (Mittelstadt et al., 2016), among them relevance, generalizations, prejudice, disrespect, immutability and compounding injustice (Barocas et al., 2019).
For the case of transnational algorithmic fairness, we follow Wachter’s argument that algorithmic groups that do not align with traditionally protected attributes in discrimination law should also be brought into the focus of discussions on algorithmic fairness (Wachter, 2022). Understanding nation-states and groups of nation-states as algorithmic groups makes it possible to acknowledge morally objectionable forms of discrimination from a global justice perspective with potentially far-reaching adverse impacts on local populations. Groups of nation-states do not form algorithmic groups based on extensive online profiling, as is the case for individuals who might be grouped based on characteristics such as being “dog owners” or “sad teens”, as Wachter describes. Instead, they can be considered algorithmic groups since, as entities, they might become objects of algorithmic decision-making with potential adverse impacts affecting their populations, and since they build contexts for data aggregation and data processing for ML applications in transnational settings.
A transnational perspective on algorithmic fairness considers moral obligations in relation to ADM targeting populations of certain geographical regions. Algorithmic discrimination, as the outcome of unfair ADM in a transnational setting, thus conceptualizes adverse impacts on certain populations based on their location and their belonging to certain geographically distinguishable entities. Due to matters of data availability, the case study presented here focuses on national contexts, even though sub-national contexts could also be considered. Further, we specifically conceptualize a transnational perspective (Nye & Keohane, 1971) on algorithmic fairness, which allows us to consider relations between populations across state boundaries rather than focusing primarily on relations between nation-states. A transnational perspective on algorithmic fairness is in this sense well suited to grouping nation-states and populations within the assessment of algorithmic fairness according to similarities regarding certain characteristics, rather than assessing solely based on national belonging. A transnational perspective, in other words, allows for seeing similarities across nation-states and accordingly allocating them into algorithmic groups.
Further, a transnational perspective on algorithmic fairness is grounded in a cosmopolitan understanding of global justice, which stresses on moral grounds questions of a just distribution among every living human being, acknowledging universality as a key element of a global justice perspective (Pogge, 1992). Stressing distributive justice, Matthias Risse summarizes a cosmopolitan perspective as asking: “If shares of material goods are among the rights and protections everyone deserves, we must ask if this depends on where people live.” (Risse, 2011, p. 3) Questions of global distributional injustices have lately been taken up in ML research, especially in relation to colonialism, the dominance of western values and cultural hegemony in AI applications, and regulatory and infrastructural monopolies in the AI industry (Birhane et al., 2022; Mohamed et al., 2020; Png, 2022), as well as in discussions on extractivist and exploitative AI production (Crawford, 2021; Gray & Siddharth, 2019; Bender et al., 2021). A transnational approach to ML fairness tries to connect these observations on distributional injustices with algorithmic fairness assessments of specific ML applications.
3 Quantifying Transnational Fairness
In Sect. 2 we established the need for a transnational fairness assessment from a global justice perspective, with its moral obligation to object to forms of discrimination with potentially adverse impacts on local populations. While our concept of transnational fairness allows for grouping nation-states and populations beyond nationality based on certain characteristics or similarities, the choice of attributes on which to group nation-states remains open. However, only through the selection of certain attributes can transnational fairness assessments be made. Attributes or group characteristics determine on which basis group disadvantages, and which forms of discrimination, can be defined and discussed. In the following, we propose a method to inform a transnational fairness assessment with a data-centric approach that allows application on a case-by-case basis. We demonstrate this method in a disaster response use case and report the fairness infringements we are able to assess with our method.
3.1 Development Indicators as Sensitive Attributes for Nation-States
ML components can be biased and can discriminate against groups of citizens in policing (Angwin et al., 2016) and recruiting (Lahoti et al., 2019), to name just two examples. Commercially available ML-based face recognition software has been reported to achieve much lower accuracy on females with darker skin color (Buolamwini & Gebru, 2018). As the reasons for ML predictions are often not easily interpretable, the reasons for unfair ML decisions can also be difficult to trace back. Most origins of unfair ML predictions, however, are related to the quality and bias of the training data (Barocas & Selbst, 2016).
In application contexts as described above, attributes like income, gender, ethnicity, marital status and disabilities, among others, are commonly declared as sensitive variables or used to define protected or unprivileged groups in order to measure fairness. The literature refers to many different measures for quantifying fairness or the disparity between protected and unprotected groups or individuals. Each of these metrics emphasizes different aspects of fairness, and single fairness notions often lack a consistent definition, are difficult to combine or even contradict each other (Caton & Haas, 2020). As fairness is ultimately an ethical or normative problem, there is no single definition or metric of fairness that accounts for all relevant factors in all situations. Hence, fairness metrics should be selected depending on the given context and requirements, while consulting relevant stakeholders (affected people, domain experts, policy makers and ethicists).
The same principles as described for individual notions of fairness can be applied to a transnational concept of fairness in ML. While sensitive discriminatory attributes and protected variables on an individual level are, for instance, reflected in anti-discrimination legislation, there is no established framework on the basis of which to assess unfairness in a comparison of nation-states. To assess transnational algorithmic fairness, sensitive attributes could be aggregated on a national level. This aggregation of attributes could in turn be used to group countries. Unfortunately, the required sensitive attributes are difficult to obtain globally with sufficiently high quality and volume. Aggregating sparse and low-quality data on sensitive attributes could lead analyses of transnational algorithmic fairness astray.
Few studies have attempted to show fairness infringements in ML on a transnational scale. Fairness is often merely defined based on coarse geographic locations such as the (Global) North and the (Global) South or based on continents. But analyzing fairness based on such coarse regions (or even only on nation-states) often lacks socio-economic characteristics and thus cannot give substance to a discussion of what fair means from a global justice perspective. So far there are only modest and rudimentary approaches that attempt to consolidate socio-economic data as a basis for a fairness discussion in ML on a transnational scale (Shankar et al., 2017; DeVries et al., 2019; Goyal et al., 2022). Such approaches thus cannot adequately contribute to a debate on which fairness notion is appropriate in the context of DRM.
The problem of plain geo-diversity results from the lack of sensitive attributes on a transnational level (analogous to gender, ethnicity or income for individuals). While acknowledging their limitations (see Sect. 3.3), we propose development indicators (such as the Human Development Index) as a basis for investigating sensitive attributes in a transnational context. Development indicators often result from socio-economic data, are common instruments in social sciences and global development and, generally speaking, serve to empirically compare development across countries (Baster, 1972). By addressing development (sometimes called progress in the literature), they imply an underlying definition or theory of what development (or progress) is and how to measure it (McGranahan, 1972).[1] Development indicators have especially been designed and used by the United Nations to monitor economic, social, demographic, environmental and other goals set by the institution and its organs (Vries, 2001). Inequality and disparities between nation-states are often the reason why development indicators are created, and so they are often also used as tools and models in policy making (Vries, 2001).
3.2 Case Study: Disaster Response Management
As in many other fields, ML methods have been applied and researched in Disaster Informatics and Disaster Risk Management (DRM). Applications in DRM include evaluating the exposure and vulnerability of geographical regions to disasters, assisting in building resilience to disasters, forecasting disaster events and assessing post-disaster impacts (Depardey et al., 2019; Linardos et al., 2022) to assist humanitarian aid and rebuilding efforts. Questions of fairness are relatively under-explored in disaster informatics (Yang et al., 2020; Gevaert, 2021) and in ML in disaster-related contexts, although the disparate impact of disaster events on different socio-economic groups is a well-studied topic in DRM (Hallegatte et al., 2018), both on the individual and on the global level.
The application scenario for ML we research in this study concerns post-disaster impact, particularly immediate disaster response, which aims to locate and identify damages caused by disaster events in order to assist humanitarian aid by rapidly allocating resources and evaluating the urgency of emergencies. While traditionally satellite imagery and images from (unmanned) aerial vehicles (AVs and UAVs) are used to assess damages, social media has become an important means both for actors in humanitarian aid and for those affected by disaster events (Imran et al., 2015; Said et al., 2019). Reporting incidents and calling for help via social media (due to collapsed or overloaded emergency call systems) enables information to be distributed faster and with more coverage than via traditional channels.
The volume of information requires scalable methods, often based on ML, to detect these reports on disaster events. The speed of the response to disaster events is one of the most important factors for successful aid. ML algorithms can help to improve and scale the automatic detection of content related to disaster events in online social media. One example is the AI for Disaster Response (AIDR) system (Imran et al., 2014). AIDR can classify online social media posts into a set of user-defined categories. The system has been successfully used on data from Twitter during the 2013 Pakistan Earthquake to distinguish informative from non-informative tweets in real time, leveraging a combination of ML components with human participation (through crowd-sourcing).
In this study, we work with the MEDIC dataset (Alam et al., 2023), a collection of multiple datasets comprising 71,198 images from social media feeds worldwide, mainly from Twitter and partly from Google, Bing, Flickr and Instagram. MEDIC is a recent computer vision dataset that serves as a benchmark for disaster event classification and features four tasks: disaster type, humanitarian, informative and damage severity (see also Fig. 1). These tasks are novel for disaster response datasets and originate from consultation with disaster response experts. They are designed to assist humanitarian aid with information about the disaster for coordinating an immediate and appropriate disaster response.
As in other applications of ML or ADM systems in the context of disaster response, inequalities between participants or groups can likewise be defined by socio-economic aspects. These include economic (fiscal resilience of state and citizens, rebuilding capabilities), technological (robustness, availability and access to digital infrastructure), political (governmental policy for internet and social media, censorship, laws), cultural (habits and practices of using social media and photography for informing about disasters) and health-related aspects (health care, availability of medical supplies, infrastructure and staff, welfare). The disposition of these attributes or factors in every nation-state can determine inequalities between states and outcomes in disaster response. To find attributes corresponding to these factors, we use the following development indicators as the basis for our fairness assessment: the Human Development Index (HDI) (UNDP, 1990), the Democracy Index (DI) (EIU, 2021) and the ICT Development Index (IDI, or Information and Communication Technologies Development Index) (ITU, 2017):[2]
The HDI is a popular index published annually by the United Nations and used throughout many disciplines. It is a composite index of life expectancy, education and per capita income. These indicators overlap with measures of social inequality often used in the social sciences: income, education and health.
The DI (EIU, 2021) is an index of 60 questions answered by experts to capture political pluralism, consensus about government, political participation, democratic political culture and civil liberties.
The IDI is likewise a composite index (of 11 indicators), first published in 2008, to measure and compare developments in information and communication technologies (ICT) across nation-states. One of its main objectives is to measure the (global) Digital Divide, which describes disparities in access to the internet and information technology. The three main categories the IDI measures are access to ICT, use of ICT and ICT skills. The access category describes, with five indicators, infrastructure and access to telephone, mobile-cellular and internet services, and also gives information about the number of subscriptions and the share of households making use of computers and the internet. The use category comprises three indicators on the intensity of internet usage by individuals, while the skills category contains three indicators on the literacy and education of communication technology users (ITU, 2017). The IDI thus appears to be a very suitable index in the context of a disaster response system based on social media.
As Caton and Haas note, sensitive attributes might be represented or encoded in other (less obvious) variables relevant to the fairness context (Caton & Haas, 2020). We therefore assume that the different development indicators are either directly sensitive variables or at least proxies for less visible protected variables. We hypothesize that the sensitive variables described here are indeed relevant latent factors encoded in the dataset. In other words, the availability and usage of ICT infrastructure, as measured by the IDI, is assumed to impact the availability and quality of images in online social networks and thereby also the predictive performance of ML models trained on these data.
For defining sensitive groups and investigating fairness in ML algorithms trained on the MEDIC dataset with regard to the development indicators, the nation-states involved in the disaster events contained in the dataset needed to be identified and assigned to each image sample. The MEDIC dataset itself does not contain any information about the involved nation-states. The metadata only consists of information about the four classification tasks and the source dataset from which the images were obtained. The disaster events covered by the dataset are mentioned in the accompanying publication (Alam et al., 2023), but are not part of the actual metadata. However, we could relate images to disaster events via the naming of the files and the folders containing them. Via the identified disaster events, we were able to locate and assign the nation-states in which a specific disaster event occurred to the majority of the images in the dataset,[3][4] resulting in 10,675 located test images.
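A minimal sketch of this assignment step is shown below. The mapping and helper names are our own illustrative assumptions: only the "nepal_eq" path pattern is taken from the example in note 3, and the hurricane and typhoon conventions follow the assumptions described in note 4.

```python
from pathlib import Path

# Illustrative mapping from event tokens appearing in MEDIC file paths to
# nation-states. "nepal_eq" follows note 3; hurricanes are mapped to the
# United States and typhoons to the Philippines as assumed in note 4.
EVENT_TO_COUNTRY = {
    "nepal_eq": "Nepal",
    "hurricane": "United States",
    "typhoon": "Philippines",
}

def assign_country(image_path: str):
    """Return the nation-state inferred from the file path, or None if the
    disaster event cannot be located."""
    path = str(Path(image_path)).lower()
    for token, country in EVENT_TO_COUNTRY.items():
        if token in path:
            return country
    return None

print(assign_country(
    "data/ASONAM17_Damage_Image_Dataset/nepal_eq/nepal_eq_severe_im_16578.jpg"
))  # -> "Nepal"
```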
For the actual grouping of the nation-states based on the described development indicators (HDI, DI and IDI) we use dimensionality reduction, particularly Principal Component Analysis (PCA), to find the directions of maximal variance in the data.[5] The PCA revealed that 97% of the variance in the data is explained by the first and second principal components (PCs).[6] Based on the first two PCs we divided the nation-states into three groups, as indicated in Fig. 3.
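The following sketch illustrates this grouping step with scikit-learn, assuming a table with one row per nation-state and columns HDI, DI and IDI. The file name is an assumption for illustration, and the final k-means step is one possible way of formalizing the division that we read off the first two PCs in Fig. 3, not the exact procedure used in the paper.

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# One row per nation-state with columns "HDI", "DI", "IDI";
# "indicators.csv" is a hypothetical file name.
indicators = pd.read_csv("indicators.csv", index_col="country")

# Impute missing indicator values with a K-Nearest-Neighbour imputer (note 5),
# standardize, and project onto the first two principal components.
X = KNNImputer(n_neighbors=5).fit_transform(indicators[["HDI", "DI", "IDI"]])
X = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
pcs = pca.fit_transform(X)
print(pca.explained_variance_ratio_.sum())  # Sect. 3.2 reports ~0.97 for two PCs

# The three groups A, B and C are assigned by inspecting the first two PCs
# (Fig. 3); a clustering step such as k-means is one way to formalize this.
indicators["group"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pcs)
```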
As also shown in Fig. 2, group A is characterized by nation-states with very high HDI, IDI and DI. Except for Chile, it includes states of Northern and Western Europe and North America and is characterized by higher economic and living standards and advanced development. Group B is characterized by nation-states with still high HDI and medium DI. This group, however, suffers from lower IDI and is not distinguishable from group C with regard to this index. Group B states are well developed economically and in terms of levels of living, but show some democracy deficits and suffer significantly from the Digital Divide. The countries involved are spread over Eastern Europe, Africa, South and Southeast Asia and Central and South America. Finally, low HDI and DI are the main explanations for group C. These countries belong to the Middle East and South Asia and are developing countries. Low levels of living, political and economic instability, civil wars and autocracies characterize this group.
With our aim to investigate fairness in a specific application scenario, demonstrate the relevance of transnational fairness on the MEDIC dataset and validate the usefulness of development indicators, we focus on comparability with prior work on the same dataset. To that end, we align our selection of fairness metrics with the metrics used in the MEDIC publication. The authors of the MEDIC dataset report accuracy, precision, recall and hamming loss but emphasize F1-scores for their benchmark.[7] While one could consider the F1-score a useful metric combining several aspects, such as precision and recall, in one value, it has not been adopted by the ML community for fairness metrics as it does not offer the level of detail needed to evaluate fairness-relevant aspects. Instead of F1-score-based fairness, our research therefore focuses directly on precision and recall (Barocas et al., 2019). Among the most popular statistical fairness metrics (Mehrabi et al., 2022; Verma & Rubin, 2018; Caton & Haas, 2020), precision is used to measure Predictive Parity (also called the outcome test) and recall is used to measure Equal Opportunity (also called false negative error rate balance) (Caton & Haas, 2020):
Predictive Parity is met when sensitive groups have equal precision, i.e. an equal probability that samples with a positive predicted label carry a positive true label. For two sensitive groups \(\mathcal {A}\) and \(\mathcal {B}\), a target variable \(Y\), a predicted target variable \(\hat{Y}\) and a data point \(X\), Predictive Parity is defined as:

$$P(Y = 1 \mid \hat{Y} = 1, X \in \mathcal{A}) = P(Y = 1 \mid \hat{Y} = 1, X \in \mathcal{B})$$
Equal Opportunity requires sensitive groups to have equal recall, i.e. an equal probability that samples with a positive class label also receive a positive prediction. For two sensitive groups \(\mathcal {A}\) and \(\mathcal {B}\), Equal Opportunity is defined as:

$$P(\hat{Y} = 1 \mid Y = 1, X \in \mathcal{A}) = P(\hat{Y} = 1 \mid Y = 1, X \in \mathcal{B})$$
In the context of disaster response with the MEDIC dataset, Predictive Parity means, for instance, that samples from all sensitive groups have an equal probability that, if they are predicted to be earthquakes, they are actually labeled as earthquakes. Predictive Parity also means that samples predicted to represent non-disaster or rescue volunteering are equally likely to be labeled correctly, regardless of their sensitive group.[8] In other words, Predictive Parity means that the predictions for all sensitive groups have equal chances of triggering a false alarm. Equal Opportunity in this context implies, for example, that samples from all sensitive groups have an equal probability of being predicted as earthquake, not-a-disaster or rescue volunteering if they actually carry that label. In other words, Equal Opportunity is satisfied if all sensitive groups have the same probability of actual classes being missed. We consider missing an actual disaster event and confusing different disaster characteristics (false alarm) as the most severe scenarios in an automated disaster response system.
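A minimal sketch of how these two notions can be quantified per subtask is given below, assuming arrays of true labels, predicted labels and sensitive group assignments for the located test images; all variable and function names are our own.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

def fairness_gaps(y_true, y_pred, groups, positive_class):
    """Per-group precision (Predictive Parity) and recall (Equal Opportunity)
    for one subtask class, plus the maximal gap across sensitive groups."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    y_true_bin = (y_true == positive_class).astype(int)
    y_pred_bin = (y_pred == positive_class).astype(int)
    per_group = {}
    for g in np.unique(groups):
        mask = groups == g
        per_group[g] = {
            "precision": precision_score(y_true_bin[mask], y_pred_bin[mask], zero_division=0),
            "recall": recall_score(y_true_bin[mask], y_pred_bin[mask], zero_division=0),
        }
    gaps = {m: max(v[m] for v in per_group.values()) - min(v[m] for v in per_group.values())
            for m in ("precision", "recall")}
    return per_group, gaps

# Usage (hypothetical variables): y_true and y_pred hold disaster-type labels
# of the located test images, groups holds the sensitive group ("A", "B", "C")
# of the country each image was assigned to.
# per_group, gaps = fairness_gaps(y_true, y_pred, groups, positive_class="earthquake")
```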
3.3 Transnational Fairness Infringements in Disaster Response Management based on Development Indicators
For assessing the fairness of automated disaster classification, we reproduced the results of the MEDIC paper using the ResNet-18 architecture (He et al., 2015) with the setup and hyperparameters the authors provided in their publication.[9] Note that we did not aim at improving the classification performance of the published model. It is likely that more advanced computer vision models could reach higher overall predictive performance. The goal of this study was to investigate a case study using a published model; we were thus interested in the relative differences in predictive performance rather than in optimizing the published model.
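For reference, the sketch below shows the kind of single-task model setup we assume per MEDIC task. The class counts follow the MEDIC task definitions; the pretrained weights, optimizer and learning rate shown here are illustrative stand-ins for the hyperparameters documented in the MEDIC publication and repository (note 9), not a restatement of them.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Class counts per MEDIC task as listed in the dataset publication
# (disaster types, humanitarian, informative, damage severity).
NUM_CLASSES = {"disaster_types": 7, "humanitarian": 4,
               "informative": 2, "damage_severity": 3}

def build_model(task: str) -> nn.Module:
    """ImageNet-pretrained ResNet-18 with a task-specific classification head."""
    model = resnet18(weights="IMAGENET1K_V1")
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES[task])
    return model

model = build_model("damage_severity")
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # illustrative value
```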
In order to conduct our fairness investigations, we grouped the data into three sensitive groups based on PCA on the selected development indicators of all involved nation-states, as described in Sect. 3.2 and shown in Table 1 and Fig. 3. Besides the group results, we also report the images that could not be located as well as the ungrouped results, which contain all grouped and non-locatable images (see Table 3).
For evaluating the grouped classification performances, we use precision and recall as metrics, as described in Sect. 3.2. For the evaluation we used the same test dataset as in the publication, containing 15,688 images. The results overall are quite diverse, and fairness issues are heterogeneously spread across tasks, subtasks and sensitive groups. According to the selected performance metrics precision and recall, we report Predictive Parity and Equal Opportunity infringements, respectively.
Our results demonstrate that the image classifier trained on the MEDIC dataset achieves different precision and recall values across nation-state groups. While the predictive performance on the ungrouped data at subtask level and even on the grouped data at task level (see Tables 2 and 3) shows sound results, with the classification metrics varying only by 2 to 5 percentage points across the sensitive groups for the four tasks, inspecting the predictive performance of the subcategories of each task reveals significant contrasts between the sensitive groups (see Table 3). Figure 4 shows the fairness infringements for each subtask. Referring to Predictive Parity, bias is most prominent for the subtasks earthquake, fire, flood, hurricane and other disaster in the disaster types task, with deviations of up to 58, 46, 43, 40 and 46 percentage points in precision between sensitive groups. Bias is also prominent in the humanitarian task in the subtasks affected injured and dead people and rescue volunteering and donation effort, with up to 42 and 39 percentage points. Further fairness issues are evident in the damage severity task in the subtasks mild and severe, with differences of up to 26 and 15 percentage points, as well as in the informative task, with a deviation of 9 percentage points in the informative subtask. Equal Opportunity infringements show a similar pattern throughout the subtasks and the sensitive groups. While significant fairness infringements are also observable here, they are less pronounced than the bias defined by Predictive Parity. The disaster type subtasks earthquake, fire, flood, hurricane and other disaster deviate by 26, 38, 27, 37 and 8 percentage points in recall. Affected injured and dead people and rescue volunteering and donation effort in the humanitarian task diverge by 37 and 17 percentage points between the groups, while in the damage severity task mild and severe show variations of 5 and 9 percentage points.[10]
Viewing the fairness infringements from a group perspective (see Fig. 4), the results show that group A is especially disadvantaged in the earthquake, other disaster and injured or dead subtasks. Less drastic but still significant differences are observable in the flood, informative, rescue volunteering and donation effort, mild and severe subtasks. Equal Opportunity infringements are less pronounced for the earthquake subtask and hardly evident or even irrelevant for the flood, other disaster, informative, rescue volunteering and donation effort and mild subtasks. For group B, the results show strong disadvantages regarding fire, other disaster and injured or dead detection, and apparent shortcomings are also observable in the hurricane and severe subtasks. Equal Opportunity infringements in this group deviate only slightly from the Predictive Parity infringements, showing a very similar pattern. For group C, strong bias is evident in classifying fire, flood, hurricane, rescue or donation efforts and mild damage severity, while this group does not show additional, less pronounced biases. As in group A, we also observe that Equal Opportunity infringements are less distinct than Predictive Parity infringements.
3.4 Data Set Imbalances Correlate with Fairness Infringements
In order to investigate potential causes for these differences in predictive performance across nation-state groups, we examined whether the number of training examples is correlated with predictive performance. Indeed, we found that the sample sizes of individual subtasks and sensitive groups are often strongly correlated with predictive performance (see Table 3). In some cases, however, a sensitive group with fewer samples performs equally well as or better than a sensitive group with a higher number of images, as in the case of earthquakes for groups B and C. For Predictive Parity infringements, the classification matrices also reveal explanations to some extent: deviations in precision are mostly explainable by the large number of negative samples in each task. There is a considerable class imbalance, with 49% to 71% negative samples as opposed to each single positive subclass. This circumstance contributes to the major proportion of false positives for each subclass. The precision value, which is the basis for Predictive Parity, is highly influenced by this fact. Especially in cases where sensitive groups contain only a small number of samples for a subclass compared to other groups, they easily suffer from lower precision and hence Predictive Parity issues. In consequence, differently balanced subtasks among the sensitive groups appear to be the main cause of Predictive Parity issues. This aspect points towards biases in the sampling process of the MEDIC dataset concerning the different sensitive groups.
In contrast to the precision metric, the true positive rate, or recall, is invariant to false positives caused by class imbalance. Consequently, Equal Opportunity issues, which are based on recall, cannot be explained by class imbalance across different subtasks (as for Predictive Parity). Recall, the basis of Equal Opportunity, only depends on the frequency and composition of false negatives, regardless of the sample sizes of any other subclasses. Differences in recall between two sensitive groups can result from two reasons, which can also be intertwined: first, images of one sensitive group may be of different quality (or have different features) and thus be easier or more difficult to learn. If this is paired with imbalances of samples between different sensitive groups (group imbalance), the difficulty of learning a subclass becomes twofold. Equal Opportunity issues thus point both towards qualitative differences of subtask samples and towards biases in the sampling process of the MEDIC dataset concerning the different sensitive groups.
When interpreting differences in predictive performance, and consequently differences in fairness metrics, across groups, it is important to consider the different causes. Ideally, the predictive performance and fairness measures of any ADM system should be invariant to the frequency of a disaster, just as we expect an ADM system for disease detection to perform equally well across socio-demographic groups regardless of the varying frequency of diseases in those groups. This non-utilitarian notion of fairness is particularly favored in ML applications concerning health and human lives (Hertweck et al., 2021), aligning often with civil or human rights principles (Friedler et al., 2021) and, in our case, also with the global justice perspective outlined in Sect. 2.
In practice, however, there are many factors that lead to sampling biases and subsequently to differences in fairness metrics - and not all of these can be altered or should be considered discrimination. Due to the different geographic locations of the countries in the respective groups, some disaster types are more likely for some groups than for others. For instance, earthquakes are less likely to occur in group A countries (although this group contains Chile, Greece and Italy, where earthquakes are not unlikely) than in other country groups, while hurricanes (or tropical cyclones) are less likely to occur in group C countries.
Other causes of class imbalance between country groups could include less coverage and fewer social media posts for some country groups, which we discuss in Sect. 3.5. These effects could be seen as discrimination and could indeed be addressed with appropriate countermeasures, as they highlight structural biases and systemic issues between socio-demographic groups. Interpreting differences in fairness should differentiate between these causes.
Next to these effects, there are also statistical aspects related to class imbalance that should be considered when interpreting fairness metrics. Most importantly, we emphasize that sampling biases, whether originating from structural discrimination or from geographical factors, can lead to apparent differences in predictive performance and fairness metrics across groups even if a classifier actually has the same predictive performance across groups. One example is that of Predictive Parity, or differences in precision across groups: changing the ratio of true positive examples to true negative examples, in other words changing the class imbalance, will lead to a change in precision even if the performance of a classifier is constant (Williams, 2021). Simply put, the lower the ratio of true instances relative to false instances per category, the lower the precision and F1 will be for that category, if the false and true positive rates, which characterize the actual predictive performance, are constant. Assuming that the ratio of positive instances within each country group is the same in the training and test data, this does not affect the interpretation of Predictive Parity within a country group for a given classification category. But we emphasize that interpretations that evaluate relative differences of Predictive Parity across country groups are difficult if the number of positive instances in a classification category varies.
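To make this concrete, precision can be written as a function of the prevalence \(\pi\) of a class within a group, given the true positive rate (TPR) and false positive rate (FPR); the numbers below are assumed for illustration, not values measured in our experiments:

$$\text{precision} = \frac{\mathrm{TPR}\cdot\pi}{\mathrm{TPR}\cdot\pi + \mathrm{FPR}\cdot(1-\pi)}$$

With, say, \(\mathrm{TPR} = 0.8\) and \(\mathrm{FPR} = 0.1\) held constant for both groups, a group in which \(\pi = 0.3\) of the samples belong to the class reaches a precision of about 0.77, whereas a group with \(\pi = 0.1\) reaches only about 0.47 - a Predictive Parity gap that arises purely from class imbalance.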
3.5 Discussion: Discrimination of Nation-States Depends on Disaster Classification Task and Subtask
Interestingly, there appears to be no single underprivileged group of countries in terms of disaster classification metrics. In other words, we do not find evidence for systematic unfairness against one group of nation-states. The results suggest that the predictive performance of the classifier is biased towards different nation-state groups depending on the disaster classification task and its subtasks. Low sample sizes are often correlated with low prediction scores (as shown in Table 3 and Sect. 3.4), but this representation bias does not explain all shortcomings. Qualitative differences in disaster depictions between the sensitive groups also seem relevant, as described in Sect. 3.4. With regard to the development indicators used, this means that lower index values do not translate into being disadvantaged by the analyzed classifier. Groupings by development indicators rather reflect the different ways in which disasters are documented and reported on social media by different groups: nation-states with advanced development according to the selected development indicators (group A) are disadvantaged within the system regarding the detection of the disaster types earthquake and other disaster, but also regarding injuries or casualties. Nation-states that are well developed according to the indicators used but show democracy and information technology deficits (group B) are disadvantaged regarding the fire and other disaster disaster types and also regarding injuries and casualties. Developing countries and nation-states with economic and political instabilities, or even civil wars and autocracies (group C), suffer from shortcomings in detecting the fire, flood and hurricane disaster types and in recognizing humanitarian efforts and milder infrastructural damages.
While our approach to assessing transnational fairness in ML focuses on the technical system, it needs to be acknowledged that the basis for unfair results of an ML system also lies beyond the technical system. Our results suggest that it is important to consider how data-producing infrastructures and the availability and quality of datasets influence differences in the predictive performance of ML solutions deployed worldwide. Using the ICT Development Index as a basis for the fairness assessments already reflects possibly discriminatory differences in the ways telecommunication and digital media infrastructures differ across groups of countries. But we need to acknowledge that possible bias in transnationally applied ML systems might be rooted in more subtle differences across countries that cannot be accounted for by development indicators, but which might still matter as potentially discriminatory attributes.
The DRM system that we based our study on relies on social media data (images) posted by users after a disaster event. Such images are then processed by the system to classify essential information about the kinds of disaster events that have happened in order to coordinate and support humanitarian aid. It is plausible to assume that differences in telecommunication infrastructures (Zorn & Shamseldin, 2015; ITU, 2021) and different ways and cultural habits of using social media (Anduiza et al., 2012; Hasebrink et al., 2015; Kleis Nielsen & Schrøder, 2014; Plantin & Punathambekar, 2019) across the globe, as well as the different technical set-ups and affordances (Hutchby, 2001) of social media platforms, will lead to differences in the kinds of data that are produced across the globe after a disaster event.
These differences might matter for the processing of the ML application and consequently for its outputs. It is thus misleading to think of an algorithm as deciding impartially as long as the technical problems of unfairness are fixed. Instead, it is essential to acknowledge and account for the fact that bias might lie outside of the technical system, for instance in the way different platforms lead to different content, in the culturally distinct ways in which people appropriate social media platforms, and in the extent to which it is culturally customary to produce certain types of social media content in some countries compared to others. It is essential to further push an interdisciplinary perspective combining perspectives on the technical questions (ML, data analytics) with approaches that consider the ways in which social media and social media production work (digital media studies, media sociology).
4 Outlook: Transnational Algorithmic Fairness Beyond the Technical System
Automated decision making has become ubiquitous and the use of ML technology is already impacting societies on a transnational scale. These systems carry the risk of amplifying existing biases, further marginalizing less privileged groups and aggravating global injustices. In order to analyze and alleviate these effects, we argue that existing notions of fairness should be complemented by transnational concepts of fairness that are able to capture the global justice implications of ADM. Transnational algorithmic fairness assessments can then form the basis for critical interdisciplinary investigations into algorithmic justice implications.
We proposed a grouping of nation-states based on development indicators and demonstrated that state-of-the-art computer vision methods for DRM exhibit substantial differences in predictive performance across groups of nation-states. Such differences could ultimately impact the availability and speed of disaster response - or, considering the broad range of algorithmic applications on a transnational scale, have many other adverse impacts on certain populations.
Examining the potential causes of these algorithmic biases, we find that data availability often influences predictive performance. This effect can be associated with some of the factors captured in the development indicators for access to information technology infrastructure and highlights the importance of trustworthy and reliable data for measuring existing biases in globally deployed ML solutions.
While we have presented a case study on DRM to demonstrate the relevance of a transnational approach to algorithmic fairness, further reflections on normative or ethical grounds for moral assessments of discrimination on a transnational scale, solutions to the observed fairness issues, and the topics of accountability and transparency for transnationally applied algorithmic systems are out of the scope of this article and left to future research. More conceptual work and empirical analyses of transnational fairness in ML are thus urgently needed - and should align with emerging discussions on matters of global justice in the context of ML applications (Birhane, 2020; Birhane et al., 2022; Mohamed et al., 2020; Png, 2022).
From a global justice perspective, it will be essential to question the availability of data and the underlying causes of global justice implications already in the data-producing infrastructures. Here, investigations into transnational algorithmic fairness can benefit from discussions on data justice, critical data studies or data colonialism (Taylor, 2017; Dencik et al., 2022; Couldry & Mejias, 2019; Iliadis & Russo, 2016). Building on investigations into transnational algorithmic fairness as presented in this case study, further critical investigations into the processes, practices and capitalist market structures of data production can provide profound insights into how data-extracting media technologies contribute to manifestations of injustice through algorithmic (un)fairness. Such research could contribute to reflecting on algorithmic fairness beyond the technical system.
The presented case study relies on development indicators, which come with their own limitations. While economic indicators (such as the gross domestic product) can be defined and measured relatively easily, others, such as social indicators, are more difficult to define and could be considered proxies of underlying social phenomena (McGranahan, 1972). For instance, life expectancy not only reflects medical services but also other variables such as literacy, housing conditions, diet, income, etc. (McGranahan, 1972). Other essential limitations include that the negotiation and constitution of such indicators reflect power asymmetries between nation-states and global injustices in global governance frameworks and international bodies (Moellendorf, 2009; Caney, 2006; Jongen & Scholte, 2022; Reus-Smit & Zarakol, 2023). Our reliance on such indicators, despite their limitations, is grounded in matters of data availability. Future investigations into transnational algorithmic fairness should consider the limitations of available datasets and test new ways of grouping populations for assessments of transnational fairness. But even though acknowledging such limitations is essential, it should not prevent conceptual and empirical work on transnational algorithmic fairness. Instead, more conceptual work is required here, beyond indicators from a global governance framework that comes with its own justice implications.
Further research should especially focus on three aspects of transnational forms of algorithmic (un)fairness: fairness implications in relation to the technical system, in the organizational embedding of such transnationally applied systems, and in the production of data and their underlying infrastructures (including people’s media and data-producing practices) for transnationally applied systems to process. An interdisciplinary perspective on transnational algorithmic (un)fairness is thus indispensable, drawing not only on ML research on fairness but equally on approaches inspired by science and technology studies on the social shaping of technologies (Bijker et al., 1987; Bijker & Law, 1994), as well as on research into media use and appropriation as central ways in which data are produced and for what purpose, and how this might equally manifest global injustices and the relevance of global digital divides (Hargittai, 2003; Fuchs & Horak, 2008) for algorithmic fairness from a global justice perspective.
In summary, more work on transnational fairness needs to evolve in order to develop mitigation strategies and to reflect, in conceptually and empirically sound ways, on global injustices relating to transnational forms of algorithmic (un)fairness. Such research could then undergird critical reflections on ADM at a time when ML applications are increasingly applied as assistive and strategic technologies.
Notes
For a simplified example, life expectancy may serve as an indicator for the medical services within a country.
Accordingly to any other definition of factors on which inequality is based, other development indicators may be used in order to represent these. For future studies we aim to study development indicators and other nation-state level indicators for their statistical relevance, their robustness and their capability to represent inequalities further.
To give an example, we assigned the disaster event “Nepal earthquake 2015” reported in the MEDIC publication to the file “data/ASONAM17_Damage_Image_Dataset/nepal_eq/nepal_eq_severe_im_16578.jpg”. From this event we could derive Nepal as an involved state and assign it to the image.
The MEDIC dataset contains about 25% samples depicting hurricanes. These disaster events are of enormous regional scale and involve several nation-states (from North America to north of South America, and even sometimes to North Europe or West Africa). After probing the hurricane images of the MEDIC dataset and considering Twitter usage statistics (Statista, 2022), we assumed for our analysis that the hurricane events in the MEDIC dataset are images that relate to the United States. For typhoon images we encountered a comparable situation of ambiguity, but not on the same scale as for hurricanes. After looking at damages caused by the typhoons that are recorded in MEDIC, we decided to use the Philippines as a representative for this disaster event type in our example.
Missing values for some countries indicators were imputed with a K-Nearest-Neighbour imputation algorithm.
This approach especially scales when using further index data for expressing inequalities and for grouping the countries. A suitable dimensionality reduction or clustering technique may be selected based on quantity and quality of used data.
In the case of the MEDIC dataset as a multi-class classification dataset, weighted average F1-Score for each task and F1-Score for each class of the four different tasks is used (Alam et al., 2023).
These examples have inherently different impacts when misclassified, which complicates the selection of an appropriate fairness measurement: a false alarm for earthquake, for example, is of a different severity than one for not-a-disaster.
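One possible way to make these differing severities explicit, offered here only as an illustration and not as a metric used in the paper, is to weight the confusion matrix with a hand-specified cost matrix:

```python
# Illustrative severity-weighted error: element-wise product of a confusion
# matrix with a hypothetical cost matrix that penalises missed disasters more
# heavily than false alarms. Labels and costs are assumptions for illustration.
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["earthquake", "hurricane", "not_disaster"]
y_true = ["earthquake", "not_disaster", "hurricane", "earthquake"]
y_pred = ["not_disaster", "earthquake", "hurricane", "earthquake"]

cm = confusion_matrix(y_true, y_pred, labels=labels)

cost = np.array([
    [0.0, 1.0, 5.0],   # true earthquake: missing it entirely costs most
    [1.0, 0.0, 5.0],   # true hurricane:  missing it entirely costs most
    [1.0, 1.0, 0.0],   # true not_disaster: false alarms cost less
])

print((cm * cost).sum())  # severity-weighted error for this set of predictions
```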
https://github.com/firojalam/medic
Due to very small sample sizes for the landslide subtask across all sensitive groups, we do not take this subtask into account when reporting our observations and findings.
References
Akinwotu, E. (2021). Facebook’s role in Myanmar and Ethiopia under new scrutiny. The Guardian, https://www.theguardian.com/technology/2021/oct/07/facebooks-role-in-myanmar-and-ethiopia-under-new-scrutiny.
Alam, F., Alam, T., Hasan, M. A., Hasnat, A., Imran, M., & Ofli, F. (2023). MEDIC: A multi-task learning dataset for disaster image classification. Neural Computing and Applications, 35(3), 2609–2632. https://doi.org/10.1007/s00521-022-07717-0
AlgorithmWatch. (2020). Automating Society Report (Tech. Rep.) https://automatingsociety.algorithmwatch.org/wp-content/uploads/2020/12/Automating-Society-Report-2020.pdf.
Amoore, L. (2021). The deep border. Political Geography, 109, 102547. https://doi.org/10.1016/j.polgeo.2021.102547.
Anduiza, E., Perea, E. A., Jensen, M. J., & Jorba, L. (2012). Digital media and political engagement worldwide: A comparative study. Cambridge University Press.
Angwin, J., Larson, J., Mattu, S. & Kirchner, L. (2016). Machine Bias. ProPublica. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing?token=TiqCeZIj4uLbXl91e3wM2PnmnWbCVOvS
Barocas, S., Hardt, M., & Narayanan, A. (2019). Fairness and machine learning: Limitations and opportunities. https://www.fairmlbook.org
Barocas, S. & Selbst, A. D. (2016). Big data’s disparate impact. SSRN Electronic Journal, 104(3), 671–732. https://doi.org/10.2139/ssrn.2477899
Baster, N. (1972). Development indicators: An introduction. The Journal of Development Studies, 9, 831–920. https://doi.org/10.1080/00220387208421409
Bender, E. M., Gebru, T., McMillan-Major, A. & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610–623). New York: Association for Computing Machinery. https://doi.org/10.1145/3442188.3445922
Bijker, W. E., Hughes, T. P., & Pinch, T. (1987). The social construction of technological systems: New directions in the sociology and history of technology. MIT Press.
Bijker, W. E., & Law, J. (1994). Shaping technology/building society: Studies in sociotechnical change. MIT Press.
Binns, R. (2017). Fairness in machine learning: Lessons from political philosophy. Conference on Fairness, Accountability, and Transparency, New York, Forthcoming, Proceedings of Machine Learning Research, Vol. 81, p. 1–11.
Birhane, A. (2020). Algorithmic colonization of Africa. SCRIPTed, 17(2), 389–409.
Birhane, A., Kalluri, P., Card, D., Agnew, W., Dotan, R. & Bao, M. (2022). The values encoded in machine learning research. FAccT '22: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 173–184, https://doi.org/10.1145/3531146.3533083
Buolamwini, J. & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. In S. A. Friedler & C. Wilson (Eds.), Proceedings of the 1st conference on fairness, accountability and transparency (Vol. 81, pp. 77–91). PMLR. https://proceedings.mlr.press/v81/buolamwini18a.html.
Caney, S. (2006). Cosmopolitan justice and institutional design: An egalitarian liberal conception of global governance. Social Theory and Practice, 32(4), 725–756.
Castelnovo, A., Crupi, R., Greco, G., Regoli, D., Penco, I. G., & Cosentini, A. C. (2022). A clarification of the nuances in the fairness metrics landscape. Scientific Reports, 1, 214–219.
Caton, S., & Haas, C. (2020). Fairness in machine learning: A survey. ACM Computing Surveys, 56(7), 1–38.
Chouldechova, A. & Roth, A. (2018). The frontiers of fairness in machine learning. https://doi.org/10.48550/ARXIV.1810.08810. Accessed 16 Nov 2023.
Couldry, N., & Mejias, U. A. (2019). Data colonialism: Rethinking big data’s relation to the contemporary subject. Television & New Media, 20(4), 336–349. https://doi.org/10.1177/1527476418796632
Council of Europe. (1953). European Convention on Human Rights. https://www.echr.coe.int/Documents/Convention_ENG.pdf.
Crawford, K. (2021). Atlas of AI. Yale University Press.
Dencik, L., Hintz, A., Redden, J., & Treré, E. (2022). Data justice. Sage Publications Ltd.
Depardey, V., Gevaert, C. M., Molinario, G. M., Soden, R., Balog-Way, S., & Breunig, A. (2019). Machine learning for disaster risk management. http://documents.worldbank.org/curated/en/503591547666118137/Machine-Learning-for-Disaster-Risk-Management. Accessed 03 Oct 2022.
DeVries, T., Misra, I., Wang, C. & van der Maaten, L. (2019). Does object recognition work for everyone? http://arxiv.org/abs/1906.02659. Accessed 28 Sep 2022.
Dijstelbloem, H., Meijer, A., & Besters, M. (2011). The migration machine. In H. Dijstelbloem & A. Meijer (Eds.), Migration and the new technological borders of Europe (pp. 1–21). Palgrave Macmillan. https://doi.org/10.1057/9780230299382_1
Durant, J., Bauer, M. W., & Gaskell, G. (1998). Biotechnology in the public sphere: A European sourcebook. Science Museum.
EIU (Economist Intelligence Unit). (2021). Democracy Index 2021 (Tech. Rep.). https://www.eiu.com/n/campaigns/democracy-index-2021/. Accessed 21 Oct 2022.
Ensign, D., Friedler, S. A., Neville, S., Scheidegger, C., & Venkatasubramanian, S. (2018). Runaway feedback loops in predictive policing. In Conference on fairness, accountability and transparency (pp. 160–171). PMLR.
European Commission, Joint Research Centre, Muench, S., Stoermer, E., Jensen, K., Asikainen, T., Scapolo, F. (2022). Towards a green & digital future: Key requirements for successful twin transitions in the European Union (Tech. Rep.). Publications Office of the European Union.
European Union. (2000). Racial equality directive. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32000L0043
Friedler, S. A., Scheidegger, C., & Venkatasubramanian, S. (2021). The (Im)possibility of fairness: Different value systems require different mechanisms for fair decision making. Communications of the ACM, 64(4), 136–143. https://doi.org/10.1145/3433949
Fuchs, C., & Horak, E. (2008). Africa and the digital divide. Telematics and Informatics, 25(2), 99–116.
Gevaert, C. M., Carman, M., Rosman, B., Georgiadou, Y., & Soden, R. (2021). Fairness and accountability of AI in disaster risk management: Opportunities and challenges. Patterns, 2(11), 100363. https://doi.org/10.1016/j.patter.2021.100363
Gorwa, R., Binns, R., & Katzenbach, C. (2020). Algorithmic content moderation: Technical and political challenges in the automation of platform governance. Big Data & Society, 7(1). https://doi.org/10.1177/2053951719897945
Goyal, P., Soriano, A. R., Hazirbas, C., Sagun, L., & Usunier, N. (2022). Fairness indicators for systematic assessments of visual feature extractors. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (pp. 70–88).
Gray, M. L., & Suri, S. (2019). Ghost work: How to stop Silicon Valley from building a new global underclass. Houghton Mifflin Harcourt.
Hallegatte, S., Rentschler, J., & Walsh, B. (2018). Building back better: Achieving resilience through stronger, faster, and more inclusive post-disaster reconstruction. World Bank. https://openknowledge.worldbank.org/handle/10986/29867. Accessed 03 Oct 2023.
Hargittai, E. (2003). The digital divide and what to do about it. In D. Jones (Ed.), New economy handbook (pp. 822–841). Academic Press.
Hasebrink, U., Jensen, K. B., van den Bulck, H., Hölig, S., & Maeseele, P. (2015). Media audiences: Changing patterns of media use across cultures: A challenge for longitudinal research. International Journal of Communication, 9. https://ijoc.org/index.php/ijoc/article/view/3452
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
Hertweck, C. & Heitz, C. (2021). A systematic approach to group fairness in automated decision making. 2021 8th Swiss Conference on Data Science (SDS) (pp. 1–6). Lucerne, IEEE. https://ieeexplore.ieee.org/document/9474606/. Accessed 9 Nov 2023.
Hertweck, C., Heitz, C. & Loi, M. (2021). On the moral justification of statistical parity. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 747–757). Virtual Event Canada ACM. https://dl.acm.org/doi/10.1145/3442188.3445936. Accessed 09 Nov 2023
Hutchby, I. (2001). Technologies, texts and affordances. Sociology, 35(2), 441–456. https://doi.org/10.1177/S0038038501000219
Iliadis, A., & Russo, F. (2016). Critical data studies: An introduction. Big Data & Society, 3(2). https://doi.org/10.1177/2053951716674238
Imran, M., Castillo, C., Diaz, F. & Vieweg, S. (2015). Processing social media messages in mass emergency: A survey. ACM Computing Surveys, 47(4),1–38.
Imran, M., Castillo, C., Lucas, J., Meier, P. & Vieweg, S. (2014). AIDR: Artificial intelligence for disaster response. In Proceedings of the 23rd international conference on world wide web (pp. 159–162). Seoul Korea ACM. https://dl.acm.org/doi/10.1145/2567948.2577034. Accessed 21 Sep 2022.
ITU. (2017). Measuring the information society report 2017 (Tech. Rep.). Geneva International Telecommunication Union. http://handle.itu.int/11.1002/pub/80f52533-en
ITU. (2021). Utilizing telecommunications and ICTs for disaster risk reduction and management (Tech. Rep.). Geneva International Telecommunication Union. https://www.itu.int/hub/publication/d-stg-sg02-05-2-2021/.
Jarrahi, M. H., Newlands, G., Lee, M. K., Wolf, C. T., Kinder, E., & Sutherland, W. (2021). Algorithmic management in a work context. Big Data & Society. https://doi.org/10.1177/20539517211020332
Jongen, H., & Scholte, J. A. (2022). Inequality and legitimacy in global governance: An empirical study. European Journal of International Relations, 28(3), 667–695. https://doi.org/10.1177/13540661221098218
Katzenbach, C. (2021). "AI will fix this" – The technical, discursive, and political turn to AI in governing communication. Big Data & Society, 8(2).
Katzenbach, C., & Ulbricht, L. (2019). Algorithmic governance. Internet Policy Review, 8(4), 1–18. https://doi.org/10.14763/2019.4.1424
Kleis Nielsen, R., & Schrøder, K. C. (2014). The relative importance of social media for accessing, finding, and engaging with news. Digital Journalism, 2(4), 472–489. https://doi.org/10.1080/21670811.2013.872420
Klobasa, M., Plötz, P., Pelka, S. & Vogel, L. (2019). Artificial intelligence for the integrated energy transition (Tech. Rep.). Karlsruhe Fraunhofer ISI. https://publica.fraunhofer.de/handle/publica/300027
Krzywdzinski, M., Pfeiffer, S., Evers, M., & Gerber, C. (2022). Measuring work and workers: Wearables and digital assistance systems in manufacturing and logistics.
Lahoti, P., Gummadi, K. P. & Weikum, G. (2019). iFair: Learning individually fair data representations for algorithmic decision making. 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE.
Linardos, V., Drakaki, M., Tzionas, P., & Karnavas, Y. L. (2022). Machine learning in disaster management: Recent developments in methods and applications. Machine Learning and Knowledge Extraction, 4(2), 446–473. https://doi.org/10.3390/make4020020
McGranahan, D. (1972). Development indicators and development models. The Journal of Development Studies, 8(3), 91–102. https://doi.org/10.1080/00220387208421414
Mehrabi, N., Morstatter, F., Saxena, N. , Lerman, K. & Galstyan, A. (2022). A survey on bias and fairness in machine learning. ACM Computing Surveys, 54(6), 1–35.
Mittelstadt, B. D., Allo, P., Taddeo, M., Wachter, S., & Floridi, L. (2016). The ethics of algorithms: Mapping the debate. Big Data & Society, 3(2), 2053951716679679. https://doi.org/10.1177/2053951716679679.
Moellendorf, D. (2009). Global inequality and injustice. Journal of International Development, 21(8), 1125–1136. https://doi.org/10.1002/jid.1651
Mohamed, S., Png, M. T., & Isaac, W. (2020). Decolonial AI: Decolonial theory as sociotechnical foresight in artificial intelligence. Philosophy & Technology, 33(4), 659–684. https://doi.org/10.1007/s13347-020-00405-8
Nishant, R., Kennedy, M., & Corbett, J. (2020). Artificial intelligence for sustainability: Challenges, opportunities, and a research agenda. International Journal of Information Management, 53, 102104. https://doi.org/10.1016/j.ijinfomgt.2020.102104
Nye, J. S., & Keohane, R. O. (1971). Transnational relations and world politics: An introduction. International Organization, 25(3), 329–349. https://doi.org/10.1017/S0020818300026187
Pfeifer, M. (2021). Intelligent borders? Securitizing smartphones in the European border regime. Culture Machine, 8, 201–222.
Plantin, J. C., & Punathambekar, A. (2019). Digital media infrastructures: Pipes, platforms, and politics. Media, Culture and Society, 41(2), 163–174. https://doi.org/10.1177/0163443718818376
Png, M. T. (2022). At the tensions of south and north: Critical roles of global south stakeholders in AI governance. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (pp. 1434–1445). Association for Computing Machinery.
Pogge, T. W. (1992). Cosmopolitanism and sovereignty. Ethics, 103(1), 48–75. https://doi.org/10.1086/293470
Reus-Smit, C., & Zarakol, A. (2023). Polymorphic justice and the crisis of international order. International Affairs, 99(1), 1–22. https://doi.org/10.1093/ia/iiac232
Risse, M. (2011). Global Justice (Tech. Rep.). John F. Kennedy School of Government, Harvard University. http://nrs.harvard.edu/urn-3:HUL.InstRepos:4669674
Said, N., Ahmad, K., Riegler, M., Pogorelov, K., Hassan, L., Ahmad, N., & Conci, N. (2019). Natural disasters detection in social media and satellite imagery: A survey. Multimedia Tools and Applications, 78, 31267–31302.
Shankar, S., Halpern, Y., Breck, E., Atwood, J., Wilson, J., & Sculley, D. (2017). No classification without representation: assessing geodiversity issues in open data sets for the developing world. http://arxiv.org/abs/1711.08536. Accessed 28 Sep 2022.
Statista. (2022). Regional distribution of desktop traffic to Twitter.com as of May 2022, by country. Statista. https://www.statista.com/statistics/261402/distribution-of-twitter-traffic-by-country/. Accessed 10 March 2022.
Taylor, L. (2017). What is data justice? The case for connecting digital rights and freedoms globally. Big Data & Society, 4(2), 2053951717736335. https://doi.org/10.1177/2053951717736335
Tomašev, N., Glorot, X., Rae, J. W., Zielinski, M., Askham, H., Saraiva, A., & Mohamed, S. (2019). A clinically applicable approach to continuous prediction of future acute kidney injury. Nature, 572(7767), 116–119. https://doi.org/10.1038/s41586-019-1390-1
UN DESA. (2021). Artificial intelligence saving the natural world. https://www.un.org/en/desa/artificial-intelligence-saving-natural-world
UNDP. (1990). Human development report 1990. UNDP (United Nations Development Programme).
UNHRC. (2018). Fact-finding Mission on Myanmar: concrete and overwhelming information points to international crimes. https://www.ohchr.org/en/press-releases/2018/03/fact-finding-mission-myanmar-concrete-and-overwhelming-information-points?LangID=E&NewsID=22794
United Nations. (1948). Universal declaration of human rights. https://www.un.org/en/about-us/universal-declaration-of-human-rights.
Verma, S. & Rubin, J. (2018). Fairness definitions explained. Proceedings of the International Workshop on Software Fairness (pp. 1–7). Gothenburg Sweden ACM. https://dl.acm.org/doi/10.1145/3194770.3194776. Accessed 2 Aug 2022.
Vinuesa, R., Azizpour, H., Leite, I., Balaam, M., Dignum, V., Domisch, S., & Fuso Nerini, F. (2020). The role of artificial intelligence in achieving the Sustainable Development Goals. Nature Communications, 11, 1–10. https://doi.org/10.1038/s41467-019-14108-y
Vries, W. F. (2001). Meaningful measures: Indicators on progress, progress on indicators. International Statistical Review, 69(2), 313–331. https://doi.org/10.1111/j.1751-5823.2001.tb00461.x
Wachter, S. (2022). The theory of artificial immutability: Protecting algorithmic groups under anti-discrimination law. Tul. L. Rev., 97, 149. https://doi.org/10.2139/ssrn.4099100.
Williams, C. K. I. (2021). The effect of class imbalance on precision-recall curves. Neural Computation, 33(4), 853–857. https://doi.org/10.1162/neco_a_01362
Yang, Y., Zhang, C., Fan, C., Mostafavi, A., & Hu, X. (2020). Towards fairness-aware disaster informatics: An interdisciplinary perspective. IEEE Access, 8, 201040–201054. https://doi.org/10.1109/ACCESS.2020.3035714
Zorn, C. R., & Shamseldin, A. Y. (2015). Post-disaster infrastructure restoration: A comparison of events for future planning. International Journal of Disaster Risk Reduction, 13, 158–166. https://doi.org/10.1016/j.ijdrr.2015.04.004
Zuiderveen Borgesius, F. (2018). Discrimination, artificial intelligence, and algorithmic decision-making (Tech. Rep.). Council of Europe, Directorate General of Democracy. https://rm.coe.int/discrimination-artificial-intelligence-and-algorithmic-decision-making/1680925d73
Acknowledgements
The authors thank the anonymous reviewers for their constructive feedback. Cem Kozcuer received funding from the Federal Ministry of Education and Research through the program “The development and recruitment of professorial staff at the Berlin University of Applied Sciences”, grant number 03FHP102. Felix Bießmann received funding from the Einstein Center Digital Future, Berlin. Anne Mollen did not receive special funding in preparation for this publication. She would like to thank all members of the “Media Sociology and Sustainability” team at the Department of Communication at the University of Münster for their helpful comments on the text. This research is also funded by the German Research Foundation (DFG) - Project number: 528483508 - FIP 12.
Funding
Open Access funding enabled and organized by Projekt DEAL.