G. 2076  
Page 1  
Global Research journal of Natural Science  
& Technology (GRJNST)  
Volume: 04 - Issue 2 (2026), 2076  
ISSN P: 2790-7643 ISSN E: 2790-7651  
AuthStateBench: A Standards-Aligned Benchmark for Stateful Authorization  
and Authentication Workflows  
Received: 01 April 2026. Accepted: 23 April 2026. Published: 29 April 2026  
Muhammad Shahzad Khadim (Corresponding Author)  
Kohat University of Science and Technology, Kohat, Pakistan  
Syed Mufassir Shah  
Kohat University of Science and Technology, Kohat, Pakistan  
Zubair Khan  
International Islamic University Islamabad, Pakistan  
GRJNST, Volume: 04 - Issue 2 (2026) / ISSN P: 2790-7643  
Article ID: 2076  
Copyright © 2026 GRJNST. This article is published under an Open Access model. It is made available to the public under the terms of the Creative  
Commons Attribution 4.0 International (CC BY 4.0) license, which permits unrestricted use and distribution  
Abstract:  
Authentication and authorization weaknesses in modern web applications rarely  
arise as isolated request-level defects. They often depend on role changes,  
session lifecycle conditions, object ownership boundaries, API authorization  
rules, and business workflow ordering. Existing vulnerability benchmarks and  
scanner evaluations remain valuable, but they often represent weaknesses as  
code-level or input-output defects and therefore underrepresent semantic  
failures such as IDOR/BOLA, function-level authorization bypass, stale-session  
reuse, tenant-boundary violation, privilege transition errors, and workflow  
bypass. This article introduces AuthStateBench, a standards-aligned benchmark  
design for modeling stateful authorization and authentication workflow  
vulnerabilities in web applications and APIs. The study uses a structured  
literature-based and standards-mapping methodology that draws on access-  
control testing research, stateful web testing, web logic flaw analysis, scanner-  
evaluation studies, vulnerability benchmark literature, AI-assisted vulnerability-  
analysis work, and major security guidance including OWASP Top 10,  
OWASP API Security Top 10, OWASP ASVS, OWASP WSTG, NIST  
SSDF, MITRE CWE, CISA Secure by Design, OAuth 2.0 security guidance,
OpenID Connect, and software-assurance benchmark resources.
AuthStateBench contributes a four-dimensional state model built around role  
state, session state, object-ownership state, and workflow state; a scenario  
taxonomy; a benchmark scenario template; standards-mapping logic; formal  
scenario and coverage equations; and comparison criteria for manual, scanner-  
assisted, AI-assisted, and standards-based assessment. The article does not claim  
empirical detection accuracy, tool execution, live-system testing, or dataset  
results. Instead, it provides a reproducible design artifact and validation  
roadmap for future controlled implementation and comparative evaluation.  
Keywords: Web application security; broken access control; authentication  
workflow; authorization testing; benchmark design; stateful security testing  
1. Introduction  
1.1 Background  
Modern web applications rarely fail through a single isolated defect. Their security  
posture emerges from the interaction of identity providers, session stores, API gateways,  
access-control middleware, object-level permission checks, business workflows, audit  
mechanisms, and recovery logic. A request that appears harmless in isolation may  
become dangerous when it is executed with the wrong role, a stale session, an unowned  
object identifier, or a skipped workflow step. This stateful character is especially visible  
in authorization and authentication flaws. A user may be authenticated but still  
unauthorized to perform a function; an expired token may be accepted after logout; an  
object identifier may expose another user's record; or a workflow endpoint may allow an  
operation before the required approval state has been reached.  
The current application-security landscape confirms the importance of this problem.  
OWASP Top 10:2025 identifies Broken Access Control as the leading web application
security risk, and OWASP API Security Top 10:2023 ranks Broken Object Level
Authorization first among API security risks [1]-[4]. OWASP ASVS 5.0.0
and the OWASP Web Security Testing Guide provide deeper verification and testing  
guidance for authentication, session management, access control, API behavior, and  
business logic [5], [6]. NIST SSDF and CISA Secure by Design guidance frame these  
weaknesses as secure-development and product-responsibility concerns rather than  
isolated testing events [8], [19], [20].  
Despite the availability of these standards, there remains a benchmark-design problem.  
Existing vulnerability benchmarks such as OWASP Benchmark, NIST SARD, and
Juliet provide valuable tool-evaluation support, but they are better suited to code-level
and input-driven weakness classes than to multi-step authorization and
authentication failures that depend on role, session, object, and workflow state [7],
[9]-[11]. This gap matters because scanners, AI-assisted testing systems, manual testers, and
secure-development teams need comparable scenario definitions before meaningful  
evaluation can occur.  
1.2 Problem Statement  
The central problem addressed in this article is that stateful authorization and  
authentication workflow vulnerabilities are difficult to benchmark using isolated HTTP  
requests, generic scanner signatures, or code snippets detached from application state. A  
conventional vulnerability test case often asks whether a specific payload triggers a  
specific response. In contrast, a stateful authorization flaw may require two or more  
accounts, at least one protected object, a known workflow state, a token lifecycle  
condition, and an expected policy decision. Without documenting these conditions, two  
studies may appear to evaluate the same vulnerability category while actually testing  
different security properties.  
This creates three practical weaknesses in the research landscape. First, scanner  
evaluations can overrepresent input-driven flaws while underrepresenting semantic  
authorization failures. Second, AI-assisted testing studies may claim progress without  
showing whether role, session, object ownership, and workflow preconditions were  
modeled. Third, secure-development guidance may remain difficult to operationalize  
because standards requirements are not translated into benchmark scenario templates.  
The problem is therefore not a lack of standards or individual testing techniques; the  
problem is the absence of a structured benchmark design that connects standards, state  
dimensions, scenario categories, and evaluation criteria.  
1.3 Research Gap  
Prior work has examined web application security, automated vulnerability detection,  
access-control analysis, business logic flaws, scanner performance, and software-  
vulnerability benchmarks [26]-[40]. However, the literature remains fragmented when  
the target weakness depends on a combination of authenticated identity, authorization  
policy, object ownership, token lifecycle, request ordering, and business-state transition.  
A stronger benchmark design is needed to make such weaknesses reproducible and  
comparable across testing approaches.  
The precise research gap is as follows: existing web security research lacks a standards-  
aligned benchmark design that systematically models stateful authorization and  
authentication workflow vulnerabilities using role, session, object-ownership, and  
workflow-state dimensions. This gap prevents consistent comparison of manual testing,  
scanner-assisted testing, AI-assisted testing, and standards-based review for flaws such as  
IDOR/BOLA, privilege escalation, session misuse, role confusion, and workflow  
bypass.  
1.4 Aim and Objectives  
The aim of this article is to design a standards-aligned benchmark model for classifying,  
structuring, and evaluating stateful authorization and authentication workflow  
vulnerabilities in modern web applications.  
The objectives are: (1) to review literature on broken access control, authentication  
workflow flaws, stateful web security testing, benchmark-based evaluation, scanner  
limitations, and AI-assisted vulnerability analysis; (2) to identify recurring patterns  
involving role misuse, object ownership, session state, workflow bypass, token lifecycle  
errors, and privilege transitions; (3) to map these patterns to OWASP Top 10:2025,  
OWASP API Security Top 10, OWASP ASVS, OWASP WSTG, NIST SSDF, CISA  
Secure by Design guidance, and MITRE CWE; (4) to develop a benchmark scenario  
taxonomy for stateful authorization and authentication workflow weaknesses; (5) to  
define evaluation criteria for manual testing, scanner-assisted testing, AI-assisted testing,  
and standards-based review; and (6) to propose a benchmark documentation template  
and future validation roadmap.  
1.5 Research Questions  
RQ1. What stateful authorization and authentication workflow vulnerabilities are most  
frequently discussed in recent web application security literature?  
RQ2. How can role, session, object ownership, and workflow state be used to classify  
authentication and access-control weaknesses?  
RQ3. How can OWASP Top 10:2025, OWASP API Security Top 10, OWASP  
ASVS, OWASP WSTG, NIST SSDF, CISA Secure by Design guidance, and MITRE  
CWE be mapped to benchmark scenarios for stateful web security testing?  
RQ4. What benchmark scenario categories are needed to represent IDOR/BOLA,  
privilege escalation, session misuse, workflow bypass, role confusion, and token lifecycle  
weaknesses?  
RQ5. What evaluation criteria can compare manual, scanner-assisted, AI-assisted, and  
standards-based testing approaches without reporting unsupported empirical results?  
RQ6. What limitations exist in current web security benchmarks for evaluating access-  
control and authentication workflow flaws?  
RQ7. What future validation pathway is required to make AuthStateBench suitable for  
empirical cybersecurity research?  
1.6 Scope  
The scope of AuthStateBench is benchmark design, scenario structuring, standards  
mapping, evaluation logic, and future validation planning. The article focuses on web  
applications and web APIs where authorization and authentication outcomes depend on  
state. The benchmark design is intentionally abstract: it does not require unauthorized  
testing of real systems, private datasets, exploit demonstrations, scanner output,  
screenshots, or fabricated tool results. It is suitable for later implementation in  
controlled vulnerable applications, teaching laboratories, controlled research  
environments, or expert-review exercises.  
The scope excludes malware analysis, network intrusion detection, blockchain security,  
IoT firmware testing, and generic AI-in-cybersecurity discussions unless they directly  
inform benchmark-design principles. The article also excludes claims of detection  
accuracy because no tool execution or empirical implementation is reported.  
1.7 Contributions  
This article makes five concrete contributions. First, it proposes AuthStateBench, a  
benchmark-design artifact for stateful authorization and authentication workflow  
weaknesses in web applications and APIs. Second, it defines a four-dimensional state  
model that captures role state, session state, object-ownership state, and workflow state.  
Third, it introduces a scenario taxonomy and editable documentation template for  
repeatable benchmark construction. Fourth, it adds a standards-mapping layer that  
connects scenario classes to OWASP Top 10, OWASP API Security Top 10, OWASP ASVS,
OWASP WSTG, NIST SSDF, CISA Secure by Design, MITRE CWE, OAuth 2.0, and
OpenID Connect guidance. Fifth, it defines evidence-based
comparison criteria for manual testing, scanner-assisted testing, AI-assisted testing, and  
standards-based review without inventing unsupported empirical results.  
The contribution differs from a generic OWASP survey because it does not merely  
summarize risk categories. It translates recurring access-control and authentication  
failure patterns into reusable scenario classes. It also differs from a scanner-evaluation  
paper because it does not claim performance measurements. Instead, it prepares a  
benchmark structure that can later support such measurements transparently.  
1.8 Structure of the Paper  
Section 2 reviews key concepts, recent studies, standards, and limitations of existing  
work. Section 3 presents AuthStateBench and its conceptual components. Section 4  
explains the structured literature-based and standards-mapping methodology. Section 5  
presents analytical findings, benchmark outputs, standards alignment, and comparison  
with existing approaches. Section 6 discusses implications, limitations, and future  
research. Section 7 concludes the article.  
Fig. 1 summarizes the article's logic in a roadmap style: evidence is gathered, gaps are  
synthesized, the state model is defined, benchmark artifacts are produced, and validation  
is reserved for controlled future work.  
Fig. 1. Research roadmap and logical framework for AuthStateBench.  
2. Literature Review  
2.1 Key Concepts  
Authentication verifies the identity of an entity, while authorization determines whether  
that entity may access a resource or perform a function. Session management maintains  
continuity between authenticated interactions, often through cookies, bearer tokens,  
refresh tokens, or server-side session identifiers. Object-level authorization checks  
whether a user may access a specific object instance, not merely whether the user belongs  
to a broad role. Workflow authorization checks whether an action is permitted at a  
particular stage of a business process. These concepts are separated analytically but often  
fail together in real applications.  
A benchmark for stateful authorization and authentication weaknesses must therefore  
model more than a vulnerability label. For example, IDOR/BOLA is not simply  
“changing an ID.” It is a failure to enforce ownership or tenant isolation when a user-  
controlled identifier points to a protected object [3], [4], [16]-[18]. Similarly, session  
misuse may include insufficient session expiration, reuse of tokens after logout, fixation,  
privilege transition failures, or incomplete invalidation after account changes [15], [21]-  
[23]. Workflow bypass refers to reaching an endpoint or state transition without  
satisfying preconditions such as payment, approval, verification, or reauthentication.  
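The session-misuse patterns named above (token reuse after logout, incomplete invalidation after account changes) can be illustrated with a minimal sketch. The `SessionStore` class and its method names are hypothetical and are not part of AuthStateBench; the sketch only shows the secure behavior a benchmark scenario would expect, assuming server-side session tracking.

```python
import secrets


class SessionStore:
    """In-memory, server-side session registry (illustrative assumption only)."""

    def __init__(self) -> None:
        self._sessions: dict[str, str] = {}  # token -> user_id

    def login(self, user_id: str) -> str:
        token = secrets.token_urlsafe(16)
        self._sessions[token] = user_id
        return token

    def logout(self, token: str) -> None:
        # Removing the entry is what makes a replayed token fail later.
        self._sessions.pop(token, None)

    def invalidate_user(self, user_id: str) -> None:
        # Drop every session of the user, e.g. after a role or password change.
        stale = [t for t, u in self._sessions.items() if u == user_id]
        for token in stale:
            del self._sessions[token]

    def is_valid(self, token: str) -> bool:
        return token in self._sessions


store = SessionStore()
token = store.login("alice")
print(store.is_valid(token))  # True
store.logout(token)
print(store.is_valid(token))  # False: reuse after logout is rejected
```

A vulnerable implementation would be one in which `is_valid` still returns `True` after `logout` or after an account change, which is the insecure behavior a session-lifecycle scenario is meant to document.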
2.2 Review of Recent Studies  
Research on access-control vulnerability detection has developed through static analysis,  
black-box testing, role-differential analysis, model inference, and workflow-based  
reasoning. Sun et al. proposed static detection of access-control vulnerabilities by  
inferring role-based access assumptions from code [29]. Li and Xue introduced  
BLOCK, a black-box approach for detecting state violation attacks by observing normal  
behavior and identifying invariant violations [30]. Felmetsger et al. highlighted that logic  
vulnerabilities receive less attention than classic input-validation flaws, even though they  
can cause serious security failures [31]. Pellegrino and Balzarotti examined black-box  
detection of logic flaws using behavioral patterns extracted from interactions [32].  
More recent work continues to show that authorization flaws are difficult to evaluate  
without stateful context. Rennhard et al. presented an approach to automatically detect  
HTTP GET request-based access-control vulnerabilities [26]. Zhong et al. surveyed  
prevention and detection of access-control vulnerabilities in web applications and  
emphasized roles, permissions, resources, and business logic [27]. BACScan addressed  
black-box detection of broken access-control vulnerabilities and reinforced the need to  
consider multiple users and permissions [28]. SWaTEval proposed an evaluation  
framework for stateful web application testing, and ProFuzzBench showed the value of  
explicit benchmarks for stateful protocol fuzzing even outside web authorization [35],  
[36].  
Benchmarking literature also motivates AuthStateBench. OWASP Benchmark and NIST  
SARD support evaluation of vulnerability detection tools, while Juliet and newer SARD  
resources provide curated software-assurance test cases [7], [9]-[11]. However,  
benchmark critiques have noted that benchmark structure can shape tool behavior and  
may not always represent real semantic vulnerabilities [38]-[40]. AI-assisted vulnerability  
detection further increases the need for clearly defined evaluation tasks because LLM-  
based systems may appear effective on code-level benchmarks while struggling with  
multi-step security specifications, contextual authorization rules, and workflow  
semantics [41]-[46].  
2.3 Existing Standards, Frameworks, and Models  
OWASP Top 10:2025 and OWASP API Security Top 10:2023 provide risk  
taxonomies that place broken access control, object-level authorization, broken  
authentication, and function-level authorization among major application-security  
concerns [1]-[4]. OWASP ASVS 5.0.0 provides a verification standard for web  
application technical controls and is particularly relevant because it includes  
requirements for authentication, session management, access control, API behavior, error  
handling, logging, and business logic [5]. OWASP WSTG complements ASVS by  
describing testing activities and reporting expectations [6].  
NIST SSDF describes secure software development practices for mitigating software  
vulnerability risk, while CISA Secure by Design guidance frames secure defaults, product  
accountability, and reduction of customer security burden as core software manufacturer  
responsibilities [8], [19], [20]. MITRE CWE provides weakness families that can map  
benchmark scenarios to well-known categories such as improper access control,  
improper authentication, missing authentication for critical function, insufficient session  
expiration, authorization bypass through user-controlled key, missing authorization, and  
incorrect authorization [12]-[18]. OAuth 2.0 security best current practice and OpenID  
Connect specifications help ground authentication and token lifecycle scenarios in real  
identity protocols [21]-[23].  
2.4 Limitations of Existing Work  
Existing work has four main limitations for this article's purpose. First, many  
benchmarks emphasize source-code-level flaws or input-driven vulnerabilities, which are  
important but do not fully capture stateful authorization semantics. Second, scanner-  
comparison studies often evaluate whether tools detect known classes without  
documenting the role, session, object, and workflow state required to reproduce  
authorization failures. Third, access-control studies vary in their threat models and  
assumptions, making comparison difficult across manual, automated, and AI-assisted  
approaches. Fourth, standards provide requirements and guidance but do not always  
translate them into benchmark-ready scenario templates.  
These limitations do not reduce the value of existing standards or benchmarks. Rather,  
they identify a missing layer between standards and tool evaluation: a scenario-design  
model that specifies preconditions, actors, state, expected secure behavior, insecure  
behavior, evidence requirements, and mapping to standards. AuthStateBench is intended  
to provide that layer.  
2.5 Summary of Research Gap  
The literature shows strong interest in web application security, access-control analysis,  
scanner evaluation, software-assurance datasets, and AI-assisted vulnerability detection.  
It also shows a persistent gap: stateful authorization and authentication failures require  
benchmark scenarios that encode role, session, object ownership, and workflow  
conditions. Without such encoding, researchers and practitioners risk comparing tools  
and methods on unclear or incomplete assumptions. AuthStateBench responds to this  
gap by proposing a structured benchmark design rather than claiming empirical results.  
Table 1. Literature search strategy and source categories.

| Source category | Examples | Purpose in the review |
|---|---|---|
| Academic literature | IEEE Xplore, ACM Digital Library, SpringerLink, ScienceDirect, Wiley, Taylor & Francis, Google Scholar discovery | Identify peer-reviewed studies on access-control testing, stateful web testing, logic flaws, benchmarks, scanners, and AI-assisted vulnerability analysis. |
| Security standards and guidance | OWASP Top 10:2025, OWASP API Security Top 10, OWASP ASVS 5.0.0, OWASP WSTG, NIST SSDF, CISA Secure by Design | Ground scenario categories in recognized application-security and secure-development expectations. |
| Weakness taxonomies | MITRE CWE families for access control, authentication, authorization, session expiration, and user-controlled keys | Map benchmark scenarios to common weakness identifiers and improve traceability. |
| Benchmark resources | OWASP Benchmark, NIST SARD, Juliet, SARD documentation, ProFuzzBench, SWaTEval | Compare existing benchmark assumptions with the proposed stateful scenario design. |
| Identity and risk specifications | OAuth 2.0 Security BCP, OpenID Connect, CVSS, EPSS | Support authentication, token lifecycle, session, and risk-evidence interpretation. |
Table 2. Inclusion and exclusion criteria.

| Criterion type | Included | Excluded |
|---|---|---|
| Topic relevance | Web application security, API security, authentication, authorization, session management, workflow abuse, benchmark design, scanner evaluation, AI-assisted testing | Generic cybersecurity with no web application relevance; malware-only, blockchain-only, IoT-only, or network-only studies. |
| Method relevance | Studies proposing, evaluating, surveying, or systematizing testing methods, standards, benchmarks, scanners, taxonomies, or frameworks | Papers that mention tools without explaining vulnerability modeling or test design. |
| Source quality | Peer-reviewed papers, official standards, recognized cybersecurity guidance, and well-established benchmark resources | Unverifiable blogs, marketing pages, unsupported claims, and sources without sufficient technical relevance. |
| Time period | Primarily 2020-2026, with older foundational work retained where it shaped access-control or benchmark research | Older material without continuing relevance or citation value. |
| Integrity boundary | Sources compatible with a literature-based design article without fabricated results | Studies requiring private datasets, unauthorized testing, or non-reproducible company-only evidence. |
Table 3. Literature comparison on stateful web security testing.

| Research stream | Representative sources | Strength | Limitation addressed by AuthStateBench |
|---|---|---|---|
| Static access-control analysis | Sun et al. [29] | Can infer access-control assumptions from source code and detect role-related weaknesses. | Not suitable when source code is unavailable; benchmark scenarios still need stateful documentation. |
| Black-box state or logic testing | BLOCK [30], Pellegrino and Balzarotti [32], Li et al. [34] | Models behavior from interactions and can address logic or state violations. | Different studies use different assumptions; AuthStateBench standardizes role-session-object-workflow dimensions. |
| Parameter tampering and IDOR/BOLA analysis | NoTamper [33], Rennhard et al. [26], BACScan [28] | Targets object identifiers, request parameters, and access-control violations. | Object ownership and victim/attacker role conditions need explicit benchmark representation. |
| General vulnerability benchmarks | OWASP Benchmark [7], SARD and Juliet [9]-[11] | Useful for repeatable tool evaluation and known test cases. | Primarily stronger for code-level or input-driven weaknesses than multi-user workflow semantics. |
| AI-assisted vulnerability detection | Xu et al. [41], Far et al. [42], CVE-Bench [44], Cybench [45] | Highlights growing role of AI in security testing and exploitation reasoning. | Requires clearer task specifications to evaluate semantic authorization and authentication workflow behavior. |
3. Proposed Framework / Benchmark / Model  
3.1 Conceptual Basis  
AuthStateBench is built on the premise that authorization and authentication  
vulnerabilities are not adequately represented by a single request or payload. They must  
be represented as stateful security-policy failures. The benchmark therefore uses four  
state dimensions: role state, session state, object-ownership state, and workflow state.  
Role state describes the identity and privilege level of the actor. Session state describes  
token validity, freshness, login/logout condition, reauthentication status, and privilege-  
transition effects. Object-ownership state describes whether the target resource is owned  
by, shared with, hidden from, or unrelated to the actor. Workflow state describes  
whether the requested action occurs in the expected business sequence.  
The benchmark design treats a vulnerability scenario as a controlled policy test. Each  
scenario begins with a documented precondition, an attacker role, an optional victim or  
target role, a protected object, a session condition, a workflow condition, an action,  
expected secure behavior, observed insecure behavior in a vulnerable implementation,  
and a standards/CWE mapping. This approach allows future implementation without  
relying on private systems or unauthorized exploitation.  
To make the design explicit, each benchmark scenario is represented as a stateful policy-  
test tuple rather than a single vulnerable request:  
$$B_s = \langle R_s, S_s, O_s, W_s, A_s, P_s, E_s \rangle \tag{1}$$
where R_s is role state, S_s is session state, O_s is object-ownership state, W_s is  
workflow state, A_s is the attempted action, P_s is the expected policy decision, and E_s  
is the required evidence record.  
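One way to make the tuple in Eq. (1) concrete is a typed record with one field per element. The sketch below is an assumption about how a future implementation might encode a scenario; the class name `Scenario`, the `Decision` enum, and all example field values are hypothetical and do not come from the article.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"


@dataclass(frozen=True)
class Scenario:
    role_state: str              # R_s
    session_state: str           # S_s
    object_state: str            # O_s
    workflow_state: str          # W_s
    action: str                  # A_s
    expected_policy: Decision    # P_s
    evidence: tuple[str, ...]    # E_s: artifacts an evaluator must record


# Hypothetical example: a token replayed after logout must be refused.
stale_session_read = Scenario(
    role_state="standard_user",
    session_state="token_presented_after_logout",
    object_state="owned_by_actor",
    workflow_state="not_applicable",
    action="GET /invoice/124",
    expected_policy=Decision.DENY,
    evidence=("request", "response status", "session store entry"),
)

print(len(fields(Scenario)))               # 7
print(stale_session_read.expected_policy)  # Decision.DENY
```

Freezing the record keeps a documented scenario immutable once it has been reviewed, which fits the reproducibility goal stated for the benchmark.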
Scenario coverage can later be computed as:  
$$\mathrm{Coverage} = \frac{|C_{\mathrm{tested}} \cap C_{\mathrm{required}}|}{|C_{\mathrm{required}}|} \tag{2}$$
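As a worked illustration of Eq. (2), the sketch below reads C_required as the set of scenario classes a study is expected to cover and C_tested as the set it actually exercised. That reading is an assumption, since the article does not define the two sets further. The class labels follow the eight scenario classes named in Section 3.2; the identifier spelling and the extra `csrf` label are hypothetical.

```python
def coverage(tested: set[str], required: set[str]) -> float:
    """Share of required scenario classes that were actually tested."""
    if not required:
        raise ValueError("required set must not be empty")
    return len(tested & required) / len(required)


required = {
    "object_level_authorization_failure",
    "function_level_authorization_failure",
    "privilege_escalation",
    "session_lifecycle_failure",
    "workflow_bypass",
    "role_confusion_failure",
    "tenant_isolation_failure",
    "reauthentication_failure",
}

# "csrf" is outside the required set, so it does not raise coverage.
tested = {
    "object_level_authorization_failure",
    "session_lifecycle_failure",
    "workflow_bypass",
    "csrf",
}

print(coverage(tested, required))  # 0.375
```

The intersection in the numerator is what prevents out-of-scope tests from inflating the figure.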
A future method-comparison study may score evidence quality using a weighted  
criterion model:  
$$\mathrm{Score}_m = \sum_{k=1}^{n} w_k \, x_{m,k} \tag{3}$$
where x_{m,k} denotes whether method m satisfies criterion k, and w_k allows future
researchers to prioritize scenario recognition, precondition handling, evidence quality,  
false-positive control, and reproducibility.  
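The weighted criterion model in Eq. (3) can be sketched as follows. The five criterion names come from the sentence above; the integer weights and the satisfaction values for the example method are invented purely for illustration and are not results.

```python
def method_score(weights: dict[str, int], satisfied: dict[str, int]) -> int:
    """Sum of w_k * x_{m,k}; a criterion missing from `satisfied` counts as 0."""
    return sum(w * satisfied.get(k, 0) for k, w in weights.items())


# Hypothetical weights chosen by a future evaluator.
weights = {
    "scenario_recognition": 3,
    "precondition_handling": 3,
    "evidence_quality": 2,
    "false_positive_control": 1,
    "reproducibility": 1,
}

# Hypothetical binary outcomes x_{m,k} for one method m.
example_method = {
    "scenario_recognition": 1,
    "precondition_handling": 1,
    "evidence_quality": 0,
    "false_positive_control": 1,
    "reproducibility": 0,
}

print(method_score(weights, example_method))  # 7
print(sum(weights.values()))                  # 10, the maximum attainable score
```

Reporting the maximum alongside the score makes results comparable when different studies choose different weights.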
Fig. 2. Four-dimensional state model used to construct AuthStateBench scenarios.  
3.2 Main Components  
AuthStateBench contains five components. The first component is the scenario  
taxonomy, which groups benchmark cases into recurring classes such as object-level  
authorization failure, function-level authorization failure, privilege escalation, session  
lifecycle failure, workflow bypass, role-confusion failure, tenant-isolation failure, and  
reauthentication failure. The second component is the state matrix, which combines role,  
session, object, and workflow conditions. The third component is the standards-  
mapping layer, which connects scenarios to OWASP, NIST, CISA, MITRE, OAuth,  
and OpenID guidance. The fourth component is the scenario documentation template.  
The fifth component is the evaluation criteria set, which supports future comparison of  
manual testing, scanner-assisted testing, AI-assisted testing, and standards-based review.  
These components are designed to be modular. A researcher can use the taxonomy to  
classify scenarios, the template to document cases, the standards mapping to justify  
relevance, and the evaluation criteria to compare methods. A practitioner can use the  
same structure for training, secure-code review, and controlled laboratory exercises.  
Fig. 3 operationalizes the benchmark construction sequence. The process starts from a  
security policy, converts it into allowed and denied state pairs, records evidence, maps  
the scenario to standards, and then allows a future evaluator to compare testing methods.  
Fig. 3. Benchmark scenario construction and evaluation pipeline.  
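The pipeline step that converts a security policy into allowed and denied state pairs can be sketched by enumerating the state matrix. The state values and the `policy` function below are hypothetical: they model an invented "download an approved invoice" action and are an assumption about how the four dimensions might be combined, not the article's own policy.

```python
from itertools import product

ROLES = ("guest", "user", "admin")
SESSIONS = ("valid", "expired", "logged_out")
OWNERSHIP = ("owned", "not_owned")
WORKFLOW = ("approved", "pending")


def policy(role: str, session: str, ownership: str, workflow: str) -> str:
    """Hypothetical policy for one action, evaluated over all four dimensions."""
    if session != "valid" or role == "guest":
        return "deny"
    if workflow != "approved":
        return "deny"
    if role == "admin" or ownership == "owned":
        return "allow"
    return "deny"


# Every combination of role, session, object, and workflow state gets a decision.
matrix = {
    state: policy(*state)
    for state in product(ROLES, SESSIONS, OWNERSHIP, WORKFLOW)
}
allowed = [state for state, decision in matrix.items() if decision == "allow"]

print(len(matrix))   # 36
print(len(allowed))  # 3
```

Each denied combination is a candidate negative test case: a vulnerable implementation is one that answers "allow" where the matrix says "deny".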
3.3 Standards or Literature Mapping  
The standards-mapping layer is essential because it prevents the benchmark from  
becoming an arbitrary list of invented scenarios. Broken access control maps to OWASP  
A01:2025 and to API-level risks such as BOLA and BFLA [2]-[4]. Authentication and  
session-management scenarios map to ASVS requirements, OAuth 2.0 security guidance,  
OpenID Connect, and CWE categories for improper authentication, missing  
authentication, insufficient session expiration, and token misuse [5], [13]-[15], [21]-  
[23]. Secure-development and vulnerability-management relevance maps to NIST SSDF,  
CISA Secure by Design, CVSS, EPSS, and benchmark literature [8], [19], [20], [24],  
[25], [37]-[40].  
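A standards-mapping layer of this kind could be stored as a simple lookup structure with a completeness check. The sketch below is illustrative: the scenario keys, the dictionary layout, and the specific assignment of identifiers to scenario classes are assumptions, and the article's own mapping may differ. The identifiers themselves are published OWASP and MITRE CWE entries.

```python
STANDARDS_MAP = {
    "object_level_authorization_failure": {
        "owasp": [
            "A01:2025 Broken Access Control",
            "API1:2023 Broken Object Level Authorization",
        ],
        "cwe": ["CWE-639", "CWE-862", "CWE-863"],
    },
    "function_level_authorization_failure": {
        "owasp": [
            "A01:2025 Broken Access Control",
            "API5:2023 Broken Function Level Authorization",
        ],
        "cwe": ["CWE-284", "CWE-862", "CWE-863"],
    },
    "session_lifecycle_failure": {
        "owasp": ["API2:2023 Broken Authentication"],
        "cwe": ["CWE-287", "CWE-613"],
    },
}


def missing_mappings(scenario_classes, required_keys=("owasp", "cwe")):
    """Return scenario classes with no entry or with an empty required key."""
    gaps = []
    for name in scenario_classes:
        entry = STANDARDS_MAP.get(name, {})
        if any(not entry.get(key) for key in required_keys):
            gaps.append(name)
    return gaps


print(missing_mappings(["object_level_authorization_failure", "workflow_bypass"]))
# ['workflow_bypass']
```

A check like `missing_mappings` supports the stated purpose of the layer: a scenario without any standards or CWE anchor is flagged before it enters the benchmark.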
3.4 Evaluation Logic  
AuthStateBench does not present accuracy, precision, recall, F1-score, exploit success,  
scanner results, or AI-agent performance. Instead, it defines how future studies can  
evaluate such outcomes responsibly. A future evaluation can compare whether a method  
identifies the correct scenario class, recognizes the required preconditions, distinguishes  
authentication from authorization, determines object ownership, checks workflow order,  
explains evidence, and maps findings to standards. This design avoids false empirical  
claims while still providing a concrete foundation for empirical work.  
3.5 Justification  
Stateful authorization and authentication weaknesses require scenario-based benchmark  
design because isolated request testing is insufficient. For example, a GET request for  
/invoice/124 may be secure for the owner and insecure for a different user. A POST  
request that approves a transaction may be correct after review and insecure before  
review. A session token may be valid before logout and insecure if accepted afterward.  
These cases cannot be benchmarked by request shape alone; they require documented  
state. AuthStateBench makes this state explicit and therefore improves reproducibility,  
comparability, and standards alignment.  
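The three examples above can be sketched as a decision function whose result depends on documented state rather than on request shape. This is a minimal illustration under stated assumptions: `RequestContext`, its field names, the actor names, and the `/transaction/9/approve` path are hypothetical and do not come from the benchmark itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RequestContext:
    """State that accompanies a request; all field names are illustrative."""
    method: str
    path: str
    actor: str
    object_owner: Optional[str] = None  # owner of the targeted object, if any
    session_logged_out: bool = False    # True if the token predates a logout
    review_completed: bool = False      # workflow state for approval actions


def expected_decision(ctx: RequestContext) -> str:
    """Return the decision a secure implementation should make: 'allow' or 'deny'."""
    if ctx.session_logged_out:
        return "deny"  # a token presented after logout must not be accepted
    if ctx.object_owner is not None and ctx.actor != ctx.object_owner:
        return "deny"  # the object belongs to a different user
    if ctx.method == "POST" and ctx.path.endswith("/approve") and not ctx.review_completed:
        return "deny"  # approval attempted before the review step
    return "allow"


owner_view = RequestContext("GET", "/invoice/124", actor="alice", object_owner="alice")
other_view = RequestContext("GET", "/invoice/124", actor="bob", object_owner="alice")
print(expected_decision(owner_view), expected_decision(other_view))
```

The two invoice requests are identical in method and path, yet the expected decisions differ, which is why the scenario record has to carry the state explicitly.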
Table 4. AuthStateBench scenario taxonomy.
Scenario class | Core failure | Typical state dimensions | Example secure expectation
Object-level authorization failure | Actor accesses an object that belongs to another user, tenant, or role. | Role state + object ownership + session validity | The server checks ownership or tenant boundary for every object access.
Function-level authorization failure | Actor reaches a function outside permitted role or permission scope. | Role state + workflow state | The server enforces function permissions independent of hidden UI controls.
Privilege escalation | Actor gains elevated capability by manipulating role, token, endpoint, or transition. | Role state + session state + workflow state | Privilege transitions require server-side authorization and reauthentication where appropriate.
Session lifecycle failure | Expired, logged-out, fixed, or stale tokens remain usable. | Session state + role state | Tokens are invalidated and refreshed according to security requirements.
Workflow bypass | Actor skips or reorders required process steps. | Workflow state + role state + object ownership | Business operations require all preconditions and state transitions.
Role-confusion failure | Application confuses guest, user, privileged user, admin, or downgraded role. | Role state + session state | Server-side policy resolves role correctly after login, logout, downgrade, or account changes.
Tenant-isolation failure | Actor crosses organization, workspace, or tenant boundary. | Object ownership + role state | Tenant boundary is enforced for every resource and function.
Reauthentication failure | Sensitive action proceeds without fresh authentication or step-up verification. | Session state + workflow state | High-risk operations require fresh identity assurance or equivalent control.
Table 5. Role-session-object-workflow state matrix.
Dimension | Representative states | Security question | Failure indicator
Role state | Guest, registered user, privileged user, admin, downgraded user, service account | Is the actor allowed to perform this function? | Function succeeds for a role outside intended permission scope.
Session state | Valid, expired, reused, fixed, logged out, token changed, privilege changed, reauthenticated | Is the session state acceptable for the requested action? | Action succeeds with stale, invalid, fixed, or insufficiently fresh session context.
Object-ownership state | Owned object, unowned object, shared object, hidden object, tenant-specific object | Does the actor have rights over this specific object instance? | Object data or action succeeds across ownership or tenant boundary.
Workflow state | Normal sequence, skipped step, repeated step, forced endpoint, post-approval state, rollback state | Has the process reached the required business state? | Endpoint allows action before required preconditions or after invalid transition.
Table 6. OWASP Top 10 / ASVS / NIST SSDF / CWE mapping table.
Benchmark class | OWASP mapping | ASVS / WSTG mapping | NIST / CISA mapping | CWE mapping
Object-level authorization failure | OWASP A01:2025; API1:2023 BOLA | Access control and API testing requirements | SSDF verification and vulnerability response; Secure by Design default protection | CWE-284, CWE-862, CWE-863, CWE-639
Function-level authorization failure | OWASP A01:2025; API5:2023 BFLA | Server-side authorization verification; business logic testing | Secure design review and threat modeling | CWE-862, CWE-863
Authentication workflow failure | OWASP authentication-related risks; API2:2023 Broken Authentication | Authentication and identity verification requirements | SSDF secure design and verification | CWE-287, CWE-306
Session lifecycle failure | Broken access control and authentication-adjacent risk | Session management verification; logout and timeout testing | Secure default session behavior | CWE-613, CWE-287
Workflow bypass | Business logic abuse; broken access control | Business logic and workflow testing | Threat modeling and secure requirements | CWE-840, CWE-863, CWE-862
Tenant-isolation failure | Broken access control; API object-level access control | Access control and data isolation verification | Secure architecture and product safety | CWE-284, CWE-862, CWE-863
Reauthentication failure | Broken access control; authentication control weakness | Fresh authentication for sensitive actions | Identity assurance and secure defaults | CWE-287, CWE-306
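The CWE column of Table 6 lends itself to a machine-readable lookup, which also makes it easier to version the mapping as standards evolve. The sketch below transcribes the CWE identifiers as read from Table 6; the dictionary name `CWE_MAPPING` and the helper `classes_for_cwe` are hypothetical and shown only to illustrate traceability from a weakness ID back to benchmark classes.

```python
# CWE identifiers per benchmark class, transcribed from the CWE column of Table 6.
CWE_MAPPING = {
    "Object-level authorization failure": ["CWE-284", "CWE-862", "CWE-863", "CWE-639"],
    "Function-level authorization failure": ["CWE-862", "CWE-863"],
    "Authentication workflow failure": ["CWE-287", "CWE-306"],
    "Session lifecycle failure": ["CWE-613", "CWE-287"],
    "Workflow bypass": ["CWE-840", "CWE-863", "CWE-862"],
    "Tenant-isolation failure": ["CWE-284", "CWE-862", "CWE-863"],
    "Reauthentication failure": ["CWE-287", "CWE-306"],
}


def classes_for_cwe(cwe_id: str) -> list[str]:
    """Return, in alphabetical order, the benchmark classes that list the given CWE."""
    return sorted(name for name, cwes in CWE_MAPPING.items() if cwe_id in cwes)


print(classes_for_cwe("CWE-613"))
print(classes_for_cwe("CWE-306"))
```

A reverse lookup of this kind shows why a CWE ID alone does not identify a scenario: several classes share the same weakness identifiers and differ only in the state they require.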
Table 7. Benchmark scenario template.
Field | Description
Scenario ID | Unique identifier such as ASB-OBJ-001 or ASB-SES-003.
Scenario class | Taxonomy category, such as object-level authorization failure or session lifecycle failure.
Attacker role | The role from which the unauthorized action is attempted.
Victim/target role | The role or account owning the target resource, if applicable.
Object ownership condition | Owned, unowned, shared, hidden, cross-tenant, or system-owned object.
Session condition | Valid, expired, logged out, token refreshed, privilege changed, stale, fixed, or reauthenticated.
Workflow precondition | Normal sequence, skipped stage, repeated stage, forced endpoint, pre-approval, post-approval, or rollback.
Attack action | Abstract action attempted in the controlled benchmark scenario.
Expected secure behavior | Policy decision that should occur in a secure implementation.
Insecure behavior | Failure condition that marks the scenario vulnerable in an intentionally vulnerable implementation.
Standards mapping | OWASP, ASVS, WSTG, NIST, CISA, MITRE CWE, OAuth/OIDC mapping as applicable.
Evidence requirement | What future testers must record: request sequence, session state, role pair, object ID relation, response, and policy rationale.
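A scenario record following the Table 7 template could be represented as a typed structure and serialized for release. The sketch below is one possible encoding, not a format defined by the article: the attribute names are direct renderings of the template fields, the ID pattern is an assumption inferred from the two example identifiers, and all field values in the sample record are invented for illustration.

```python
import json
import re
from dataclasses import asdict, dataclass

# Assumed ID shape, inferred from the examples ASB-OBJ-001 and ASB-SES-003.
SCENARIO_ID_PATTERN = re.compile(r"^ASB-[A-Z]{3}-\d{3}$")


@dataclass
class ScenarioRecord:
    """One benchmark scenario; attributes follow the rows of Table 7."""
    scenario_id: str
    scenario_class: str
    attacker_role: str
    target_role: str
    object_ownership_condition: str
    session_condition: str
    workflow_precondition: str
    attack_action: str
    expected_secure_behavior: str
    insecure_behavior: str
    standards_mapping: list[str]
    evidence_requirement: list[str]

    def __post_init__(self) -> None:
        # Reject identifiers that do not match the assumed pattern.
        if not SCENARIO_ID_PATTERN.match(self.scenario_id):
            raise ValueError(f"malformed scenario ID: {self.scenario_id!r}")


# Sample values are hypothetical and only demonstrate the shape of a record.
record = ScenarioRecord(
    scenario_id="ASB-OBJ-001",
    scenario_class="Object-level authorization failure",
    attacker_role="registered user",
    target_role="registered user (different account)",
    object_ownership_condition="unowned",
    session_condition="valid",
    workflow_precondition="normal sequence",
    attack_action="read another account's invoice by changing the object ID",
    expected_secure_behavior="request is denied",
    insecure_behavior="invoice data is returned",
    standards_mapping=["OWASP A01:2025", "API1:2023", "CWE-639"],
    evidence_requirement=["request sequence", "session state", "role pair",
                          "object ID relation", "response", "policy rationale"],
)
print(json.dumps(asdict(record), indent=2))
```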
4. Methodology  
4.1 Research Design  
The research design combined structured literature review, standards mapping,  
conceptual synthesis, and benchmark design. The study was not conducted as an  
empirical tool-evaluation experiment. No scanner was executed, no vulnerable laboratory  
was deployed, no live target was tested, and no private dataset was analyzed. The method  
instead used literature and standards to derive a benchmark design that can later support  
implementation and evaluation.  
This design is appropriate because the contribution is a scenario-construction model. A  
benchmark-design article must first define what counts as a scenario, what security  
property is being tested, what state must be recorded, and how relevance is mapped to  
standards before performance claims can be evaluated.  
4.2 Search Strategy / Data Source Strategy  
The search strategy used combinations of terms such as “broken access control web  
application testing,” “authorization vulnerability benchmark,” “authentication workflow  
vulnerability,” “stateful web application security testing,” “IDOR BOLA benchmark,”  
“role-based access control web vulnerability,” “workflow bypass web security,” “session  
management vulnerability testing,” “OWASP ASVS access control requirements,” “web  
vulnerability benchmark evaluation,” and “AI-assisted vulnerability detection  
benchmark.” Searches prioritized peer-reviewed databases and official standards sources.  
Google Scholar was used for discovery, while preference was given to publisher pages,  
official project pages, government guidance, RFCs, and standards pages where available.  
The source base included academic literature on access-control analysis, stateful web  
testing, web logic flaws, scanner evaluation, benchmarks, and AI-assisted vulnerability  
detection [26]-[46]. It also included standards and guidance from OWASP, NIST,  
MITRE, CISA, FIRST, OAuth, OpenID, and ISO/IEC [1]-[25], [47]-[49].  
4.3 Inclusion and Exclusion Criteria  
Sources were included when they focused on web application security, API security,  
authentication, authorization, session management, workflow abuse, benchmark design,  
vulnerability detection, scanner evaluation, or secure-development guidance.  
Foundational older sources were retained when they introduced important concepts or  
methods for access-control vulnerability detection or web logic testing. Sources were  
excluded when they focused only on generic cybersecurity, malware, blockchain, IoT, or  
network intrusion without web application relevance, or when they made unsupported  
claims about automation replacing human security testing.  
4.4 Screening or Selection Process  
The screening process followed a transparent review approach rather than a fully  
quantified PRISMA systematic review. Because exact search counts, duplicate counts,  
and exclusion counts were not recorded in a formal review registry, this article does not  
claim a completed PRISMA study. Instead, sources were screened by title, abstract,  
technical relevance, standards relevance, and contribution to scenario modeling. Selected  
sources were then coded according to vulnerability type, testing approach, state  
dimension, benchmark relevance, and standards applicability.  
4.5 Coding and Synthesis Method  
Thematic synthesis grouped the literature into four state dimensions: role state, session  
state, object-ownership state, and workflow state. Role state captured guest, user,  
privileged user, admin, downgraded user, and service-account contexts. Session state  
captured valid, expired, reused, fixed, logged-out, token-changed, and reauthenticated  
contexts. Object-ownership state captured owned, unowned, shared, hidden, tenant-specific, and
system-owned objects. Workflow state captured normal sequence, skipped step, repeated  
step, forced endpoint, pre-approval, post-approval, and rollback conditions.  
Fig. 4 illustrates how literature, standards, and weakness taxonomies were synthesized  
into the benchmark outputs. The figure also makes clear that the article is a design study  
rather than a tool-execution experiment.  
Fig. 4. Literature and standards synthesis process used to derive AuthStateBench.  
Algorithm 1. AuthStateBench scenario construction procedure.  
Input: Candidate weakness pattern, relevant standard clauses, state dimensions, and  
expected security policy.  
1. Identify the protected action and target object.  
2. Define the legitimate role, unauthorized role, session condition, and workflow  
precondition.  
3. Specify the expected secure decision and vulnerable behavior to be represented in a  
controlled implementation.  
4. Map the scenario to OWASP, ASVS/WSTG, NIST/CISA, MITRE CWE, and  
identity-protocol guidance where applicable.  
5. Record evidence requirements: actor role, object relation, session state, request  
sequence, response, and policy rationale.  
Output: A benchmark-ready scenario record that can later be implemented and  
evaluated in a controlled testbed.  
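The five steps of Algorithm 1 can be sketched as a single construction function. This is a minimal illustration rather than the authors' implementation: `build_scenario`, its parameter names, and the dictionary keys are hypothetical, the requirement that at least one mapping be supplied is an assumption of this sketch, and the sample scenario values are invented.

```python
from typing import Any

# Evidence items listed in step 5 of Algorithm 1.
EVIDENCE_FIELDS = [
    "actor role", "object relation", "session state",
    "request sequence", "response", "policy rationale",
]


def build_scenario(
    protected_action: str,
    target_object: str,
    legitimate_role: str,
    unauthorized_role: str,
    session_condition: str,
    workflow_precondition: str,
    secure_decision: str,
    vulnerable_behavior: str,
    standards: list[str],
) -> dict[str, Any]:
    """Assemble one scenario record by following steps 1-5 of Algorithm 1."""
    if not standards:
        # Assumption of this sketch: every record carries at least one mapping.
        raise ValueError("at least one standards or CWE reference is required")
    return {
        # Step 1: protected action and target object.
        "protected_action": protected_action,
        "target_object": target_object,
        # Step 2: roles, session condition, and workflow precondition.
        "legitimate_role": legitimate_role,
        "unauthorized_role": unauthorized_role,
        "session_condition": session_condition,
        "workflow_precondition": workflow_precondition,
        # Step 3: expected secure decision and vulnerable behavior.
        "expected_secure_decision": secure_decision,
        "vulnerable_behavior": vulnerable_behavior,
        # Step 4: standards and weakness mapping.
        "standards_mapping": list(standards),
        # Step 5: evidence that testers must record.
        "evidence_required": list(EVIDENCE_FIELDS),
    }


scenario = build_scenario(
    protected_action="view account settings",
    target_object="account profile",
    legitimate_role="registered user",
    unauthorized_role="logged-out user",
    session_condition="logged out",
    workflow_precondition="normal sequence",
    secure_decision="reject the token and require login",
    vulnerable_behavior="token is accepted after logout",
    standards=["CWE-613"],
)
print(scenario["standards_mapping"], len(scenario["evidence_required"]))
```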
After coding, recurring vulnerability patterns were converted into benchmark scenario  
classes. Each scenario class was checked against relevant standards and weakness  
taxonomies to verify that it corresponded to recognized security concerns rather than  
arbitrary examples.  
4.6 Comparison Criteria  
Existing approaches were compared using standards alignment, explicit state modeling,  
reproducibility, evaluation readiness, evidence requirements, ability to support manual  
testing, ability to support scanner-assisted testing, ability to support AI-assisted testing,  
and suitability for secure-development education. These criteria were selected because a  
benchmark design must be useful across research and practice, not only within a single  
tool category.  
4.7 Validity and Reliability  
Validity was supported through triangulation across peer-reviewed literature, official  
standards, weakness taxonomies, and benchmark resources. Reliability was supported by  
using explicit scenario fields and state dimensions rather than narrative-only  
descriptions. The main validity limitation is that the benchmark has not yet been  
implemented as runnable vulnerable applications or evaluated by independent experts. A  
recommended next step is expert review by application-security researchers, professional  
penetration testers, and secure software engineers.  
4.8 Ethical Considerations  
The study does not involve human subjects, real user data, live-system testing,  
unauthorized access, exploit deployment, screenshots, tool execution, or private  
vulnerability findings. Scenarios are described abstractly and are intended for controlled  
educational or research environments. Any future implementation should use  
intentionally vulnerable applications, local testbeds, explicit permission, and responsible  
disclosure principles where applicable.  
Table 8. Evaluation criteria for manual, scanner-assisted, and AI-assisted testing.
Criterion | Manual testing | Scanner-assisted testing | AI-assisted testing | Standards-based review
Scenario recognition | Tester identifies role, session, object, and workflow conditions. | Tool must discover or be configured with relevant state. | Model must infer or be provided with scenario state. | Reviewer checks whether requirement coverage matches scenario.
Precondition handling | Strong when testers control accounts and workflows. | Often limited without authenticated crawling and multi-user support. | Variable; depends on prompts, context windows, and tool integration. | Strong for policy completeness but not a runtime proof.
Evidence quality | Request sequence, account roles, object relation, and response evidence. | Automated traces plus manual confirmation. | Explanation must cite observed evidence, not only plausible reasoning. | Traceable checklist and requirement mapping.
False-positive control | Human judgment can validate business context. | May flag request differences without understanding policy. | May hallucinate policy unless constrained by evidence. | Can miss runtime behavior if review is document-only.
Reproducibility | Requires documented steps and accounts. | Requires stable crawler state and login/session handling. | Requires fixed prompts, logs, and scenario context. | Requires versioned standards and mapping rationale.
Best use | Deep semantic testing and business-logic reasoning. | Broad coverage and repeatable baseline scanning. | Assisted reasoning, test planning, and evidence summarization. | Requirements traceability and secure-development governance.
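The evidence-quality criterion in Table 8 can be approximated as a completeness check on a reported finding, whichever method produced it. The sketch below assumes a finding is a plain dictionary; the key names mirror the evidence items required by the scenario template, and both `REQUIRED_EVIDENCE` and `missing_evidence` are hypothetical names. The sample finding is invented.

```python
# Evidence items a finding is expected to carry (key names are illustrative).
REQUIRED_EVIDENCE = (
    "request_sequence", "session_state", "role_pair",
    "object_relation", "response", "policy_rationale",
)


def missing_evidence(finding: dict) -> list[str]:
    """Return required evidence fields that are absent or empty in a finding."""
    return [key for key in REQUIRED_EVIDENCE if not finding.get(key)]


finding = {
    "request_sequence": ["GET /invoice/124"],
    "session_state": "valid",
    "role_pair": ("registered user", "registered user"),
    "response": "200 with invoice body",
    "policy_rationale": "",  # a conclusion was given without the policy it relies on
}
print(missing_evidence(finding))
```

A check like this only confirms that evidence was recorded; judging whether the evidence actually supports the conclusion remains a reviewer task.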
5. Results and Analytical Outputs  
5.1 Thematic Findings  
The synthesis produced four thematic findings. First, access-control failures are semantic  
vulnerabilities: whether a request is secure depends on who sends it, what object is  
targeted, what session state exists, and what workflow state applies. Second, automated  
scanners can provide valuable coverage but may struggle with multi-user authorization,  
object ownership, and business process semantics unless state and credentials are  
explicitly modeled. Third, standards provide strong control expectations, but additional  
benchmark documentation is needed to turn those expectations into reproducible test  
cases. Fourth, AI-assisted testing can support security reasoning, but it requires  
structured task definitions and evidence constraints to avoid plausible but unsupported  
conclusions.  
5.2 Gap Mapping  
The gap mapping shows that existing resources support parts of the problem but not the  
complete stateful benchmark requirement. OWASP and NIST standards define what  
secure behavior should look like. MITRE CWE defines weakness families. OWASP  
Benchmark and SARD provide tool-evaluation resources. Access-control research  
proposes detection approaches. AI-security research highlights emerging automated  
reasoning capabilities. AuthStateBench integrates these strands by representing each  
benchmark scenario as a stateful policy test with traceable standards mapping.  
5.3 Framework or Benchmark Outputs  
The primary benchmark outputs are the scenario taxonomy, the state matrix, the  
standards mapping, and the scenario template presented in Tables 4-7. These outputs  
allow a future benchmark implementation to include representative scenario families  
rather than isolated examples. For instance, an object-level authorization scenario can be  
instantiated with different combinations of roles, tenants, session states, and object-  
sharing rules. A workflow bypass scenario can be instantiated with skipped approval,  
repeated confirmation, forced endpoint access, or post-rollback actions. This modular  
design supports expansion while preserving consistent documentation.  
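The modular instantiation described above can be sketched as a Cartesian product over variation axes. The axis values below are a small illustrative subset chosen for this example, and `instantiate_family` and the sequential numbering scheme are assumptions; a real implementation would likely filter out combinations that are not meaningful.

```python
from itertools import product

# Illustrative variation axes for one object-level authorization family.
ROLES = ["registered user", "privileged user"]
TENANTS = ["same tenant", "different tenant"]
SESSION_STATES = ["valid", "expired"]
SHARING_RULES = ["private", "shared with actor"]


def instantiate_family(prefix: str):
    """Yield (scenario_id, variant) pairs for every combination of the axes."""
    combos = product(ROLES, TENANTS, SESSION_STATES, SHARING_RULES)
    for n, (role, tenant, session, sharing) in enumerate(combos, start=1):
        yield f"{prefix}-{n:03d}", {
            "role": role,
            "tenant": tenant,
            "session_state": session,
            "sharing_rule": sharing,
        }


variants = list(instantiate_family("ASB-OBJ"))
print(len(variants))  # 2 * 2 * 2 * 2 = 16 combinations
print(variants[0])
```

Because every variant shares the same field set, documentation stays consistent as the family grows, which is the property the modular design is meant to preserve.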
5.4 Standards Alignment  
AuthStateBench aligns with standards at three levels. At the risk-taxonomy level, it maps  
to OWASP Top 10:2025 and OWASP API Security Top 10:2023 [1]-[4]. At the  
verification level, it maps to OWASP ASVS 5.0.0 and WSTG testing concerns [5], [6].  
At the secure-development and weakness-classification level, it maps to NIST SSDF,  
CISA Secure by Design guidance, and MITRE CWE categories [8], [12]-[20]. Identity-  
protocol references such as OAuth 2.0 Security BCP and OpenID Connect support  
token, authentication, and session lifecycle scenarios [21]-[23].  
5.5 Comparison with Existing Approaches  
Compared with generic OWASP reviews, AuthStateBench adds scenario-level  
reproducibility. Compared with scanner evaluations, it documents state conditions that  
tools must handle or be given. Compared with code-level vulnerability datasets, it  
emphasizes policy semantics and multi-step workflows. Compared with AI-security  
benchmarks, it provides domain-specific scenario templates for authorization and  
authentication workflow reasoning. It therefore does not replace existing benchmarks or  
standards; it complements them by filling a stateful scenario-design gap.  
Table 9. Research gap matrix.
Existing resource | What it provides | Gap for stateful authorization/authentication | AuthStateBench response
OWASP Top 10 / API Top 10 | Risk categories for web and API security. | Risk category does not by itself specify role, session, object, and workflow preconditions. | Converts risk categories into scenario classes.
OWASP ASVS / WSTG | Verification and testing guidance. | Requirements need benchmark-ready scenario fields for comparison studies. | Maps scenarios to requirements and evidence expectations.
NIST SSDF / CISA Secure by Design | Secure-development governance and product-security principles. | Guidance is broad and not a vulnerability test-case benchmark. | Links benchmark scenarios to secure-development and validation activities.
MITRE CWE | Weakness family taxonomy. | CWE IDs alone do not define multi-user workflow conditions. | Uses CWE as traceability, not as the entire scenario definition.
OWASP Benchmark / SARD / Juliet | Repeatable test cases for tool evaluation. | Less focused on multi-step authorization and authentication workflow semantics. | Adds stateful scenario design for auth/access-control cases.
AI vulnerability benchmarks | Evaluate AI or agentic security capabilities. | May not isolate authorization workflow reasoning from exploit execution. | Defines structured tasks for future AI-assisted testing comparison.
Table 10. Future validation and research roadmap.
Stage | Purpose | Recommended activity | Expected output
Stage 1: standards alignment review | Check whether mappings are accurate and complete. | Compare each scenario class with OWASP, ASVS, WSTG, NIST, CISA, MITRE, OAuth, and OpenID sources. | Validated mapping table and revised scenario definitions.
Stage 2: expert review | Assess practical realism and clarity. | Invite 3-5 application-security academics, penetration testers, or secure software engineers. | Expert feedback matrix and updated taxonomy.
Stage 3: controlled implementation | Turn abstract scenarios into local vulnerable applications. | Implement representative scenarios in a testbed with documented accounts and workflows. | Runnable benchmark prototype and ground-truth labels.
Stage 4: method comparison | Evaluate manual, scanner-assisted, AI-assisted, and standards-based approaches. | Run controlled tests without live targets and record evidence quality. | Comparative evaluation results with transparent limitations.
Stage 5: public release | Enable reuse and replication. | Publish documentation, templates, scenario definitions, and implementation notes. | Versioned public benchmark artifact.
6. Discussion  
6.1 Key Insights  
The first key insight is that stateful authorization and authentication testing must be  
framed as policy-state verification rather than payload detection. A payload-centered  
view is useful for injection and many input-driven flaws, but it is insufficient for  
understanding whether an authenticated user should access a resource, whether a tenant  
boundary should apply, or whether a workflow state authorizes an operation. The  
second insight is that standards mapping improves benchmark legitimacy, but standards  
alone do not create benchmark reproducibility. The third insight is that manual,  
scanner-assisted, and AI-assisted approaches should not be treated as interchangeable.  
Each has different strengths and evidence requirements.  
6.2 Cybersecurity Implications  
AuthStateBench has implications for application-security research and cyber defense. For  
researchers, it provides a way to define comparable tasks for access-control and  
authentication workflow testing. For tool builders, it clarifies what state information a  
scanner or AI-assisted system must handle. For defenders, it encourages evidence-based  
assessment of whether authorization and session controls enforce server-side policy  
across realistic business states. For educators, it offers a structured way to teach junior  
penetration testers why authentication, authorization, session management, and  
workflow logic must be tested together.  
6.3 Practical and Policy Implications  
In practice, the benchmark design can support secure development lifecycle activities.  
Requirements engineers can use the state matrix to document authorization rules.  
Developers can use scenario templates to write tests for object ownership and workflow  
preconditions. Penetration testers can use the taxonomy to structure manual testing  
evidence. Security managers can map findings to standards for reporting and remediation  
prioritization. Policy teams can use the structure to connect secure-by-design  
expectations with concrete verification evidence.  
6.4 Limitations  
The article has clear limitations. AuthStateBench is a design artifact, not a completed  
empirical benchmark implementation. No tool results are reported, and no claim is made  
about detection accuracy. The taxonomy may require refinement after expert review and  
controlled implementation. Some standards references may evolve, so mappings should  
be versioned. The current design focuses on web applications and APIs and may not  
directly apply to mobile, IoT, blockchain, or low-level protocol systems without  
adaptation.  
6.5 Future Research Roadmap  
Future work should proceed in five steps. First, an independent standards-alignment  
review should verify scenario mappings. Second, expert review should assess practical  
realism and coverage. Third, selected scenarios should be implemented in controlled  
vulnerable web applications with versioned documentation. Fourth, manual, scanner-  
assisted, AI-assisted, and standards-based methods should be compared using explicit  
evidence criteria. Fifth, the benchmark should be released as a public artifact with  
scenario definitions, implementation notes, ground-truth labels, and clear ethical-use  
guidance.  
Fig. 5 converts the future-work discussion into a validation roadmap. This separation  
between design and empirical validation is important because the current article does not  
claim scanner accuracy or AI-agent performance.  
Fig. 5. Future validation roadmap for turning AuthStateBench into an empirical  
benchmark.  
7. Conclusion  
Stateful authorization and authentication workflow weaknesses remain difficult to  
benchmark because their security depends on role, session, object ownership, and  
workflow conditions. Existing standards and benchmarks provide valuable foundations,  
but they do not fully solve the problem of scenario-level reproducibility for semantic  
access-control and authentication failures. AuthStateBench addresses this gap by  
proposing a standards-aligned benchmark design that includes a four-dimensional state  
model, scenario taxonomy, standards mapping, scenario template, evaluation criteria, and  
validation roadmap.  
The article deliberately avoids unsupported empirical claims. It does not report scanner  
results, AI-agent performance, exploit success, or real-system testing. Its contribution is a  
design framework that future researchers and practitioners can implement and evaluate  
transparently. By making stateful conditions explicit, AuthStateBench can improve  
comparability in web security testing, support secure-development education, and  
provide a foundation for more rigorous evaluation of manual, scanner-assisted, AI-  
assisted, and standards-based approaches to authorization and authentication workflow  
security.  
Declarations  
Funding  
No funding was received for this work.  
Competing Interests  
The authors declare no competing interests.
Ethics Approval  
Not applicable. This article is a literature-based and standards-aligned benchmark-  
design study and does not involve human participants, personal data, live-system testing,  
or animal subjects.  
Consent for Publication  
Not applicable.  
Data Availability  
No empirical dataset was generated or analyzed. The article is based on publicly  
available literature, standards, and guidance sources cited in the reference list. Future  
benchmark scenarios should be released as versioned documentation if implemented.  
Author Contributions  
Muhammad Shahzad Khadim conceptualized the study, designed the methodology,  
developed the benchmark model, and wrote the manuscript. Syed Mufassir Shah  
contributed to literature review, data organization, and manuscript editing. Zubair Khan  
assisted in validation design, formatting, and final proofreading.  
Acknowledgements  
The authors acknowledge the public work of OWASP, NIST, MITRE, CISA, FIRST,
IETF, OpenID Foundation, and the academic security research community whose  
standards and studies informed this conceptual benchmark design.  
Declaration of generative AI and AI-assisted technologies in the manuscript preparation  
process  
During the preparation of this work, the author used ChatGPT by OpenAI to assist  
with language refinement, formatting improvement, clarity enhancement, and manuscript  
organization. After using this tool, the author reviewed and edited the content as needed  
and takes full responsibility for the content of the submitted manuscript.  
References  
[1] OWASP Foundation. OWASP Top Ten Web Application Security Risks 2025.  
OWASP, 2025. Accessed 27 April 2026.  
[2] OWASP Foundation. A01:2025 - Broken Access Control. OWASP Top 10:2025.  
Accessed 27 April 2026.  
[3] OWASP Foundation. OWASP API Security Top 10 2023. OWASP, 2023.  
Accessed 27 April 2026.  
[4] OWASP Foundation. API1:2023 - Broken Object Level Authorization. OWASP  
API Security Top 10 2023. Accessed 27 April 2026.  
[5] OWASP Foundation. OWASP Application Security Verification Standard 5.0.0.  
OWASP, 2025. Accessed 27 April 2026.  
[6] OWASP Foundation. OWASP Web Security Testing Guide, latest release.  
OWASP. Accessed 27 April 2026.  
[7] OWASP Foundation. OWASP Benchmark Project. OWASP. Accessed 27 April  
2026.  
[8] M. Souppaya, K. Scarfone, and D. Dodson. Secure Software Development  
Framework (SSDF) Version 1.1: Recommendations for Mitigating the Risk of Software  
Vulnerabilities. NIST SP 800-218, 2022.  
[9] NIST SAMATE. Software Assurance Reference Dataset (SARD). National  
Institute of Standards and Technology. Accessed 27 April 2026.  
[10] P. E. Black. The Software Assurance Reference Dataset. NIST Internal Report  
8561, 2025.  
[11] F. E. Boland Jr. and P. E. Black. The Juliet 1.1 C/C++ and Java Test Suite.  
National Institute of Standards and Technology, 2012.  
[12] MITRE. CWE-284: Improper Access Control. Common Weakness Enumeration.  
Accessed 27 April 2026.  
[13] MITRE. CWE-287: Improper Authentication. Common Weakness Enumeration.  
Accessed 27 April 2026.  
[14] MITRE. CWE-306: Missing Authentication for Critical Function. Common  
Weakness Enumeration. Accessed 27 April 2026.  
[15] MITRE. CWE-613: Insufficient Session Expiration. Common Weakness  
Enumeration. Accessed 27 April 2026.  
[16] MITRE. CWE-639: Authorization Bypass Through User-Controlled Key.  
Common Weakness Enumeration. Accessed 27 April 2026.  
[17] MITRE. CWE-862: Missing Authorization. Common Weakness Enumeration.  
Accessed 27 April 2026.  
[18] MITRE. CWE-863: Incorrect Authorization. Common Weakness Enumeration.  
Accessed 27 April 2026.  
[19] CISA. Secure by Design. Cybersecurity and Infrastructure Security Agency.  
Accessed 27 April 2026.  
[20] CISA and international partners. Shifting the Balance of Cybersecurity Risk:  
Principles and Approaches for Secure by Design Software, 2023.  
[21] T. Lodderstedt, J. Bradley, A. Labunets, and D. Fett. Best Current Practice for  
OAuth 2.0 Security. RFC 9700, BCP 240, RFC Editor, 2025. doi:10.17487/RFC9700.  
[22] N. Sakimura, J. Bradley, M. Jones, B. de Medeiros, and C. Mortimore. OpenID  
Connect Core 1.0 incorporating errata set 2. OpenID Foundation. Accessed 27 April  
2026.  
[23] OpenID Foundation. OpenID Connect Session Management 1.0. Final  
specification, 2022.  
[24] FIRST. Common Vulnerability Scoring System Version 4.0: Specification  
Document. Forum of Incident Response and Security Teams. Accessed 27 April 2026.  
[25] FIRST. Exploit Prediction Scoring System (EPSS). Forum of Incident Response  
and Security Teams. Accessed 27 April 2026.  
[26] M. Rennhard, et al. Automating the Detection of Access Control Vulnerabilities in  
Web Applications. SN Computer Science, 2022.  
[27] L. Zhong, et al. A Survey of Prevent and Detect Access Control Vulnerabilities in  
Web Applications. arXiv preprint arXiv:2304.10600, 2023.  
[28] F. Liu, et al. BACScan: Automatic Black-Box Detection of Broken Access-Control  
Vulnerabilities. ACM CCS, 2025.  
[29] F. Sun, L. Xu, and Z. Su. Static Detection of Access Control Vulnerabilities in  
Web Applications. USENIX Security Symposium, 2011.  
[30] X. Li and Y. Xue. BLOCK: A Black-box Approach for Detection of State  
Violation Attacks Towards Web Applications. ACSAC, 2011.  
[31] V. Felmetsger, L. Cavedon, C. Kruegel, and G. Vigna. Toward Automated  
Detection of Logic Vulnerabilities in Web Applications. USENIX Security Symposium,  
2010.  
[32] G. Pellegrino and D. Balzarotti. Toward Black-Box Detection of Logic Flaws in  
Web Applications. Network and Distributed System Security Symposium, 2014.  
[33] P. Bisht, T. Hinrichs, N. Skrupsky, R. Bobrowicz, and V. N.  
Venkatakrishnan. NoTamper: Automatic Blackbox Detection of Parameter Tampering  
Opportunities in Web Applications. ACM CCS, 2010.  
[34] X. Li, Y. Xue, and M. Chu. Automated Black-Box Detection of Access Control  
Vulnerabilities in Web Applications. SACMAT, 2014.  
[35] A. Borcherding, et al. SWaTEval: An Evaluation Framework for Stateful Web  
Application Testing. International Conference on Web Information Systems and  
Technologies, 2023.  
[36] R. Natella, et al. ProFuzzBench: A Benchmark for Stateful Protocol Fuzzing. ACM  
ISSTA, 2021.  
[37] E. Fong, V. Okun, and R. Gaucher. Web Application Scanners: Definitions and  
Functions. NIST SAMATE, 2007.  
[38] P. Nunes, J. Fonseca, and M. Vieira. Benchmarking Static Analysis Tools for Web  
Security. IEEE Transactions on Reliability, 2018.  
[39] M. Miltenberger, et al. Benchmarking the Benchmarks. ACM, 2023.  
[40] N. Risse, et al. On Benchmarking in Machine Learning for Vulnerability  
Detection. ISSTA, 2025.  
[41] H. Xu, S. Wang, N. Li, K. Wang, Y. Zhao, K. Chen, T. Yu, Y. Liu, and H. Wang.  
Large Language Models for Cyber Security: A Systematic Literature Review.  
arXiv:2405.04760, 2024.  
[42] S. M. Taghavi Far, et al. Large Language Models for Software Vulnerability  
Detection. International Journal of Information Security, 2025.  
[43] Y. Chen, et al. A Survey of Large Language Models for Cyber Threat Detection.  
Computers & Security, 2024.  
[44] Y. Zhu, A. Kellermann, D. Bowman, P. Li, A. Gupta, A. Danda, R. Fang, C. Jensen,  
E. Ihli, J. Benn, et al. CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-  
World Web Application Vulnerabilities. ICML, 2025.  
[45] R. Fang, et al. Cybench: A Framework for Evaluating Cybersecurity Capabilities  
and Risks of Language Models. arXiv:2408.08926, 2024.  
[46] M. Malkawi and R. Alhajj. AI-Powered Vulnerability Detection and Patch  
Management in Cybersecurity: A Systematic Review of Techniques, Challenges, and  
Emerging Trends. Machine Learning and Knowledge Extraction, vol. 8, no. 1, Article  
19, 2026. doi:10.3390/make8010019.  
[47] ISO/IEC. ISO/IEC 27034-1:2011, Information technology - Security techniques  
- Application security - Part 1: Overview and concepts. International Organization for  
Standardization, 2011.  
[48] ISO/IEC/IEEE. ISO/IEC/IEEE 29119-1:2022, Software and systems  
engineering - Software testing - Part 1: General concepts. International Organization for  
Standardization, 2022.  
[49] NIST. Digital Identity Guidelines: Authentication and Lifecycle Management, SP  
800-63B. National Institute of Standards and Technology, latest available revision  
accessed 27 April 2026.  