UNIFAI (Unified Framework for AI Impact Assessment)


Introduction

Software engineering has long been a critical field of study, focusing on optimization problems and techniques for producing high-quality code. However, the advent of Artificial Intelligence (AI) has significantly altered the landscape of software development. The emphasis has shifted from traditional code-centric paradigms toward usability, user experience, and the emergence of knowledge-driven, code-agnostic development environments.

The paradox in modern AI development lies in the critical need to identify and control code quality and characteristics, despite the inherent complexity and opacity of many AI paradigms. AI systems, particularly those employing deep learning, often operate through multiple layers of abstraction, making the underlying processes and decision-making mechanisms non-transparent. This opacity can obscure the quality and reliability of the code, leading to potential issues in performance, ethics, and compliance. Given this complexity, designers have a "moral" obligation to revisit fundamental questions about code evaluation and modeling. This involves not only understanding the technical aspects but also evaluating whether the AI systems align with ethical standards and societal values.

Limitations of Conventional AI Evaluation

The work on AI evaluation is not new; it has been a topic of research for over four decades. For instance, [1] provides a foundational analysis of how to evaluate AI research by extracting insights and setting goals at each stage of development. While such analysis may appear trivial at first glance, it offers critical insight into the foundational drivers of a concept's necessity and reveals the underlying value systems it implicitly adheres to. This type of analysis is often underreported or entirely absent from the documentation and discussion of AI systems. Metrics such as task-specific accuracy, precision, and false negative rate are frequently highlighted as benchmarks for model performance [2][3], yet the methodological foundations and contextual relevance of these metrics are only occasionally communicated in detail. Moreover, in certain instances [4], additional tuning is applied not to improve the model's inherent performance but to enhance the accuracy perceived by end users, for example by adjusting models to better align with human expectations or to mitigate biases. This improves the system's perceived reliability while introducing a further layer of abstraction that may obscure its actual behavior. In summary, the development of AI systems requires a return to the moral analysis of code, not only to ensure technical robustness but also to conform with ethical and societal expectations.
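To make concrete why the methodological context of such metrics matters, the following minimal Python sketch (all counts are invented purely for illustration) computes accuracy, precision, and false negative rate from a single hypothetical confusion matrix for an imbalanced binary task.

    # Hypothetical confusion matrix for an imbalanced binary task
    # (990 negatives, 10 positives); all counts are invented for illustration.
    tp, fn = 2, 8      # positives correctly caught, positives missed
    tn, fp = 988, 2    # negatives correctly rejected, false alarms

    accuracy = (tp + tn) / (tp + tn + fp + fn)   # 0.99
    precision = tp / (tp + fp)                   # 0.50
    false_negative_rate = fn / (fn + tp)         # 0.80

    print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, FNR={false_negative_rate:.2f}")
    # A 99% accuracy headline coexists with an 80% false negative rate:
    # the same system looks excellent or unacceptable depending on which
    # metric, and which deployment context, is reported.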

While these benchmarks serve as valuable indicators of success in controlled environments, they are insufficient for assessing the complexities and uncertainties of real-world deployments, particularly when systems are not tested with data or concepts relevant to their intended use case. Data represents another dimension of evaluation, especially through the use of benchmark datasets. Although such datasets enable the assessment of model accuracy against well-defined and discrete tasks [5][6][7][8], they frequently fall short in demonstrating, or even investigating, the model's correctness and reliability in real-world decision-making contexts.
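As a sketch of the difference between the two forms of evaluation (the function and variable names here are hypothetical and not part of any cited benchmark), the snippet below contrasts a single aggregate benchmark score with a slice-by-slice check over data drawn from the intended deployment context.

    from collections import defaultdict

    def accuracy(model, examples):
        """Fraction of (input, label, context) examples the model labels correctly."""
        return sum(model(x) == y for x, y, _ in examples) / len(examples)

    def sliced_accuracy(model, examples):
        """Accuracy per context slice (e.g., input condition or affected group)."""
        slices = defaultdict(list)
        for example in examples:
            slices[example[2]].append(example)
        return {ctx: accuracy(model, items) for ctx, items in slices.items()}

    # benchmark_set: curated examples with discrete labels, as in [5][6][7][8];
    # deployment_set: messier data resembling the system's intended use case.
    # accuracy(model, benchmark_set)          -> one aggregate number
    # sliced_accuracy(model, deployment_set)  -> where that aggregate hides failures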

Societal Impact, Ethical Frameworks, and Enforcement Gaps

While performance-based evaluation methods remain prevalent, they offer a limited perspective on AI system assessment. This approach's dominance can be attributed to two primary factors: its effectiveness as a marketing strategy that emphasizes performance metrics, and its close alignment with economic considerations that typically drive enterprise decision-making. However, this narrow focus on performance may overlook other crucial aspects of AI system evaluation, such as ethical considerations, societal impact, and long-term sustainability. A growing body of literature has raised concerns regarding the widespread and accelerated deployment of AI algorithms and platforms in diverse aspects of everyday life [9][10][11][12]. Numerous research efforts have proposed strategies to mitigate the associated social and economic impacts [13][14][15][16], while standardization bodies [17] and international organizations [18][19][20][21] have issued frameworks and ethical guidelines aimed at supporting responsible AI deployment. However, these initiatives are largely advisory in nature, lacking enforceability, and often fall short of establishing binding rules or regulatory mechanisms for the development and use of AI systems.

A more comprehensive assessment framework is necessary to ensure that AI systems are not only high-performing but also ethically sound and socially responsible.

Toward a Dynamic and Holistic Evaluation Framework

To address the growing need for comprehensive algorithmic oversight, an Algorithmic Impact Assessment (AIA) tool has been developed to offer a structured and holistic framework for evaluating AI systems. The tool is organized into six main categories, each designed to be completed sequentially—beginning with the initial planning stage and concluding with the withdrawal or retirement of the AI system or platform.
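By way of illustration only (the stage names and questions below are placeholders, not the tool's actual six categories), such a sequential, lifecycle-gated structure can be represented as an ordered list of stages in which each stage becomes available only after the previous one is complete.

    from __future__ import annotations
    from dataclasses import dataclass, field

    @dataclass
    class Stage:
        name: str
        questions: list[str]
        answers: dict[str, str] = field(default_factory=dict)

        def complete(self) -> bool:
            return all(q in self.answers for q in self.questions)

    @dataclass
    class Assessment:
        stages: list[Stage]  # ordered: planning first, withdrawal or retirement last

        def current_stage(self) -> Stage | None:
            """Return the first unfinished stage; stages are completed in sequence."""
            for stage in self.stages:
                if not stage.complete():
                    return stage
            return None  # every stage done: the lifecycle has been assessed end to end

    # Placeholder stages; the actual tool defines six categories spanning the lifecycle.
    assessment = Assessment(stages=[
        Stage("Planning", ["intended purpose", "affected stakeholders"]),
        Stage("Withdrawal", ["decommissioning plan", "data retention"]),
    ])
    print(assessment.current_stage().name)  # "Planning" until its questions are answered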

This tool extends and enriches the approach proposed by CapAI [22], as outlined in the CapAI Internal Review Protocol (IRP). It has been designed to align with the principles and regulatory requirements of the European Union Artificial Intelligence Act (EU AI Act), ensuring that ethical, legal, and technical robustness is embedded throughout the AI lifecycle. While CapAI provides a strong foundation for documenting and supporting goals such as risk mitigation, trustworthiness, and accountability, this new assessment tool introduces key advancements. It places greater emphasis on user-friendliness and is built to be dynamically extensible, enabling contributions and updates from a wide range of stakeholders, including conformity assessment authorities, policymakers, researchers, AI experts, and the broader community.

Contemporary AI paradigms exhibit significant variation in their foundational design principles, objectives, operational workflows, and modeling techniques. Approaches such as active learning, swarm intelligence, neuro-symbolic AI, and other emerging paradigms are not developed according to a uniform framework, nor do they pursue identical outcomes or follow standardized developmental processes. This diversity of design logic underscores the dynamic and heterogeneous nature of AI system development. As a result, any static or rigid assessment framework risks becoming obsolete over time: the emergence of novel paradigms may necessitate new evaluation criteria, or even entirely new assessment categories, in order to adequately capture the nuances and specificities of these innovations.
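A minimal sketch of what such dynamic extensibility could look like in practice (all names are hypothetical): evaluation criteria are registered against categories at runtime rather than hard-coded, so contributors can add criteria, or entire categories, for paradigms the original framework did not anticipate.

    from typing import Callable

    # category name -> list of criterion functions contributed by stakeholders
    CRITERIA: dict[str, list[Callable[[dict], str]]] = {}

    def register(category: str):
        """Attach a criterion to a category, creating the category if it is new."""
        def wrap(criterion: Callable[[dict], str]):
            CRITERIA.setdefault(category, []).append(criterion)
            return criterion
        return wrap

    @register("transparency")
    def decision_traceability(system: dict) -> str:
        return "pass" if system.get("logs_decisions") else "needs review"

    # A later contributor covers an emerging paradigm without touching the core tool:
    @register("swarm intelligence")
    def emergent_behaviour_monitoring(system: dict) -> str:
        return "pass" if system.get("monitors_emergence") else "needs review"

    system_profile = {"logs_decisions": True}
    report = {cat: [check(system_profile) for check in checks] for cat, checks in CRITERIA.items()}
    print(report)  # {'transparency': ['pass'], 'swarm intelligence': ['needs review']}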


References

  1. Cohen, P. R., & Howe, A. E. (1988). How evaluation guides AI research: The message still counts more than the medium. AI Magazine, 9(4), 35-43.
  2. Wei, J., Karina, N., Chung, H. W., Jiao, Y. J., Papay, S., Glaese, A., ... & Fedus, W. (2024). Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368.
  3. Paul, S., & Chen, P. Y. (2022, June). Vision transformers are robust learners. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 2, pp. 2071-2081).
  4. Kocielnik, R., Amershi, S., & Bennett, P. N. (2019, May). Will you accept an imperfect AI? Exploring designs for adjusting end-user expectations of AI systems. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (pp. 1-14).
  5. Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009, June). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248-255). IEEE.
  6. Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., ... & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13 (pp. 740-755). Springer International Publishing.
  7. Harish, B. S., Kumar, K., & Darshan, H. K. (2019). Sentiment analysis on IMDb movie reviews using hybrid feature extraction method.
  8. Asghar, N. (2016). Yelp dataset challenge: Review rating prediction. arXiv preprint arXiv:1605.05362.
  9. Al-kfairy, M., Mustafa, D., Kshetri, N., Insiew, M., & Alfandi, O. (2024, September). Ethical challenges and solutions of generative AI: An interdisciplinary perspective. In Informatics (Vol. 11, No. 3, p. 58). Multidisciplinary Digital Publishing Institute.
  10. Wei, M., & Zhou, Z. (2022). AI ethics issues in real world: Evidence from AI incident database. arXiv preprint arXiv:2206.07635.
  11. Baldassarre, M. T., Caivano, D., Fernandez Nieto, B., Gigante, D., & Ragone, A. (2023, September). The social impact of generative AI: An analysis on ChatGPT. In Proceedings of the 2023 ACM Conference on Information Technology for Social Good (pp. 363-373).
  12. Padhi, I., Dognin, P., Rios, J., Luss, R., Achintalwar, S., Riemer, M., ... & Bouneffouf, D. (2024, August). ComVas: Contextual moral values alignment system. In Proceedings of the International Joint Conference on Artificial Intelligence (pp. 8759-8762).
  13. Mbiazi, D., Bhange, M., Babaei, M., Sheth, I., & Kenfack, P. J. (2023). Survey on AI Ethics: A Socio-technical Perspective. arXiv preprint arXiv:2311.17228.
  14. Díaz-Rodríguez, N., Del Ser, J., Coeckelbergh, M., de Prado, M. L., Herrera-Viedma, E., & Herrera, F. (2023). Connecting the dots in trustworthy Artificial Intelligence: From AI principles, ethics, and key requirements to responsible AI systems and regulation. Information Fusion, 99, 101896.
  15. Shavit, Y., Agarwal, S., Brundage, M., Adler, S., O’Keefe, C., Campbell, R., ... & Robinson, D. G. (2023). Practices for governing agentic AI systems. Research Paper, OpenAI.
  16. Díaz-Rodríguez, N., Del Ser, J., Coeckelbergh, M., de Prado, M. L., Herrera-Viedma, E., & Herrera, F. (2023). Connecting the dots in trustworthy Artificial Intelligence: From AI principles, ethics, and key requirements to responsible AI systems and regulation. Information Fusion, 99, 101896.
  17. Schiff, D., Ayesh, A., Musikanski, L., & Havens, J. C. (2020, October). IEEE 7010: A new standard for assessing the well-being implications of artificial intelligence. In 2020 IEEE international conference on systems, man, and cybernetics (SMC) (pp. 2746-2753). IEEE.
  18. UNESCO. (2021). Recommendation on the Ethics of Artificial Intelligence. United Nations Educational, Scientific and Cultural Organization. https://www.unesco.org/en/artificial-intelligence/recommendation-ethics
  19. International Organization for Standardization & International Electrotechnical Commission. (2022). ISO/IEC 22989:2022 – Artificial intelligence — Artificial intelligence concepts and terminology. ISO. https://www.iso.org/standard/74296.html
  20. International Organization for Standardization & International Electrotechnical Commission. (2023). ISO/IEC 23894:2023 – Artificial intelligence — Guidance on risk management. ISO. https://www.iso.org/standard/77608.html
  21. Council of Europe. (2024). Framework Convention on Artificial Intelligence and Human Rights, Democracy and the Rule of Law. Strasbourg, France.
  22. Whittlestone, J., Nyrup, R., Alexandrova, A., & Cave, S. (2021). The CapAI framework: Assessing the capability of AI systems. Centre for the Governance of AI.