# Technical Debt Roadmap

**Created:** March 13, 2026 · **Status:** Planning · **Purpose:** Organize, refactor, and solidify the existing Autograder codebase into a cohesive, decoupled, and evolutionary architecture before adding new features.
## Context
The Autograder is a pipeline-based grading system where submissions flow through ordered steps: Load Template → Build Tree → Pre-Flight → Grade → Focus → Feedback → Export. Each step receives a PipelineExecution object, performs its operation, and passes results forward.
The system has grown organically with features like multi-language support, setup configs, focus-based feedback, and AI execution being added incrementally. This has introduced coupling, incomplete abstractions, and inconsistencies that need to be resolved before the architecture can evolve cleanly.
This roadmap does not propose new features. It focuses exclusively on making the existing code solid, well-separated, and aligned with the pipeline architecture that is already in place.
## Roadmap Items
### 🔴 Priority 1 — Broken or Incomplete Implementations
These items represent code that is actively broken, logically inverted, or fundamentally incomplete. They must be fixed before any other refactoring work because they affect correctness.
#### Item 1: Fix Inverted Reporter Mode Assignment in ReporterService

- **File:** `autograder/services/report/reporter_service.py`
- **Problem:** The constructor assigns the wrong reporter for each mode. When `feedback_mode == "ai"`, it instantiates `DefaultReporter()`. When the mode is anything else (including `"default"`), it instantiates `AiReporter()`. This is a logic inversion — the conditional is backwards.
- **Impact:** Any pipeline configured with `feedback_mode="ai"` silently gets the default reporter (which is a no-op `pass`), and any pipeline configured with `feedback_mode="default"` gets the AI reporter (which returns a hardcoded string). Neither path produces correct feedback.
- **Action:**
    - Swap the conditional: `"ai"` → `AiReporter()`, else → `DefaultReporter()` (see the sketch below).
    - Add a `generate_feedback()` method to `ReporterService` that delegates to the internal reporter, since `FeedbackStep` calls `self._reporter_service.generate_feedback()` but no such method exists on `ReporterService` — only `generate_report()` exists on the reporters themselves.
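A minimal sketch of the fix, assuming the constructor shape described above (the `AiReporter` import path and the internal attribute name are assumptions):

```python
# autograder/services/report/reporter_service.py (sketch)
from autograder.services.report.ai_reporter import AiReporter  # path assumed
from autograder.services.report.default_reporter import DefaultReporter


class ReporterService:
    def __init__(self, feedback_mode: str):
        # Corrected mapping: "ai" gets the AI reporter, anything else the default.
        self._reporter = AiReporter() if feedback_mode == "ai" else DefaultReporter()

    def generate_feedback(self, grading_result, feedback_config):
        # Delegates to the internal reporter; config parsing is covered in Item 4.
        return self._reporter.generate_report(grading_result)
```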
#### Item 2: Implement the DefaultReporter (Currently a No-Op)

- **File:** `autograder/services/report/default_reporter.py`
- **Problem:** `DefaultReporter.generate_report()` is `pass` — it returns `None`. This means the entire default feedback path produces no output. The `FeedbackPreferences` dataclass defines a rich configuration model (`GeneralPreferences`, `DefaultReporterPreferences` with category headers, score visibility, summary toggles) but none of it is consumed anywhere.
- **Impact:** The `FeedbackStep` in the pipeline always produces `None` data for default mode, which means `GradingResult.feedback` is always `None` unless AI mode is used (and even that is broken per Item 1).
- **Action:**
    - Implement `DefaultReporter.generate_report()` to produce a structured text/markdown report using the `ResultTree`, `Focus` data, and `FeedbackPreferences`.
    - Use `ResultTreeFormatter` (which already exists and has methods like `format_test_results`, `format_failed_test_results`, `format_category`, etc.) as the rendering engine.
    - Consume `DefaultReporterPreferences.category_headers` for section titles and the `GeneralPreferences` flags for controlling what appears in the report. A sketch of the report assembly follows below.
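A sketch of how the report assembly could look. The formatter method names are quoted from this item; the preference field names (`general`, `default_reporter`, `show_summary`, `show_failed_tests_only`), the `ResultTree` shape, and the import paths are assumptions:

```python
from autograder.models.feedback_preferences import FeedbackPreferences  # path assumed
from autograder.services.report.result_tree_formatter import ResultTreeFormatter  # path assumed


class DefaultReporter:
    def generate_report(self, results, preferences: FeedbackPreferences) -> str:
        formatter = ResultTreeFormatter()
        headers = preferences.default_reporter.category_headers
        general = preferences.general

        sections = []
        for category in results.categories:  # assumed ResultTree shape
            sections.append(headers.get(category.name, category.name))
            if general.show_failed_tests_only:  # assumed flag name
                sections.append(formatter.format_failed_test_results(category))
            else:
                sections.append(formatter.format_test_results(category))
        if general.show_summary:  # assumed flag name
            sections.append(formatter.format_category(results.root))
        return "\n\n".join(sections)
```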
#### Item 3: Remove Stale test_library Field from CriteriaConfig

- **File:** `autograder/models/config/criteria.py`
- **Problem:** `CriteriaConfig` has a `test_library: Optional[str]` field with a TODO comment saying "Remove this attribute (it already sits in grading config)." The template name is already passed separately to `build_pipeline()` and stored in the grading configuration at the web layer. This field is never read by any service — `CriteriaTreeService.build_tree()` receives the template as a separate argument.
- **Impact:** The field creates confusion about where the template name is authoritative. It also means criteria JSON files may include a `test_library` key that is silently ignored, misleading configuration authors.
- **Action:**
    - Remove the `test_library` field from `CriteriaConfig`.
    - Audit all criteria JSON files (e.g., `docs/criteria_example.json`, example configs) and remove any `test_library` keys.
    - Since the model uses `extra = "forbid"`, any existing JSON with this field will start failing validation — which is the correct behavior, as it forces cleanup.
#### Item 4: Fix FeedbackStep Contract Mismatch

- **File:** `autograder/steps/feedback_step.py`
- **Problem:** The step calls `self._reporter_service.generate_feedback(grading_result=focused_tests, feedback_config=self._feedback_config)`, but `ReporterService` has no `generate_feedback` method. The underlying reporters only have `generate_report(self, results)`. Additionally, the step does `return pipeline_exec.add_step_result(feedback)`, assuming `feedback` is already a `StepResult`, but `generate_report` returns a string (or `None`). The step would crash at runtime.
- **Impact:** The feedback pipeline step cannot execute successfully in its current form.
- **Action:**
    - Add a `generate_feedback(grading_result, feedback_config)` method to `ReporterService` that parses the feedback config into `FeedbackPreferences`, passes the relevant data to the internal reporter, and returns the feedback string.
    - Fix the step to wrap the returned feedback string in a `StepResult(step=StepName.FEEDBACK, data=feedback_string, status=StepStatus.SUCCESS)`, as sketched below.
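A sketch of the repaired step under those two changes (import paths are assumptions; the `get_step_result`/`add_step_result` call sites follow descriptions elsewhere in this document):

```python
# autograder/steps/feedback_step.py (sketch)
from autograder.models.step_result import StepName, StepResult, StepStatus  # paths assumed


class FeedbackStep:
    def __init__(self, reporter_service, feedback_config):
        self._reporter_service = reporter_service
        self._feedback_config = feedback_config

    def execute(self, pipeline_exec):
        focused_tests = pipeline_exec.get_step_result(StepName.FOCUS).data
        feedback_string = self._reporter_service.generate_feedback(
            grading_result=focused_tests,
            feedback_config=self._feedback_config,
        )
        # Wrap the plain string before handing it to the pipeline; previously a
        # raw string (or None) was passed where a StepResult was expected.
        return pipeline_exec.add_step_result(
            StepResult(step=StepName.FEEDBACK, data=feedback_string,
                       status=StepStatus.SUCCESS)
        )
```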
### 🟡 Priority 2 — Architectural Coupling & Separation of Concerns
These items address structural problems where components are tightly coupled, responsibilities are mixed, or abstractions are leaky. Fixing these makes the codebase easier to extend and test.
#### Item 5: Decouple Sandbox Lifecycle from PreFlightStep

- **Files:** `autograder/steps/pre_flight_step.py`, `autograder/services/pre_flight_service.py`, `autograder/autograder.py`
- **Problem:** The `PreFlightStep` currently has three distinct responsibilities: (1) validate required files, (2) execute setup commands, and (3) create and manage the sandbox container. Sandbox creation is an infrastructure concern that is conceptually separate from submission validation. The step stores the sandbox as its `StepResult.data`, which means downstream steps (`GradeStep`) must know to look inside the pre-flight result to find the sandbox. There's also a TODO comment in the step about when to destroy the sandbox on setup command failure.
- **Impact:** If a future step needs a sandbox but doesn't need pre-flight checks (e.g., a hypothetical "run student tests" step), there's no way to get a sandbox without going through pre-flight. The sandbox lifecycle is also split: creation happens in `PreFlightService`, but cleanup happens in `AutograderPipeline._cleanup_sandbox()`, which directly imports `sandbox_manager` and reaches into the pre-flight step result.
- **Action:**
    - Extract sandbox acquisition into a dedicated concern. Two approaches:
        - Option A (Minimal): Keep sandbox creation in `PreFlightService` but make `PipelineExecution` hold a first-class `sandbox` attribute (instead of hiding it in a step result's `data` field). The pipeline's `_cleanup_sandbox` would then read from `pipeline_execution.sandbox` directly (see the sketch after this list).
        - Option B (Full separation): Create a `SandboxStep` that runs after pre-flight and before grading. It acquires the sandbox, prepares the workdir, and stores it on `PipelineExecution`. Pre-flight becomes purely about validation.
    - Regardless of approach, centralize the sandbox cleanup logic so it's not split between `AutograderPipeline._cleanup_sandbox()` and the implicit "don't destroy on failure" behavior in `PreFlightStep`.
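A sketch of Option A, assuming a `destroy()` teardown method on the sandbox (the `sandbox` attribute name is the one proposed above):

```python
# Sketch: the sandbox rides on PipelineExecution instead of inside StepResult.data.
from typing import Optional


class PipelineExecution:
    sandbox: Optional["SandboxContainer"] = None  # set once by PreFlightService
    # ... existing step-result storage unchanged ...


class AutograderPipeline:
    def _cleanup_sandbox(self, pipeline_execution: PipelineExecution) -> None:
        # Cleanup reads the first-class attribute; no digging through the
        # pre-flight step result and no direct sandbox_manager import here.
        if pipeline_execution.sandbox is not None:
            pipeline_execution.sandbox.destroy()  # teardown method assumed
            pipeline_execution.sandbox = None
```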
#### Item 6: Decouple GraderService from Mutable State

- **File:** `autograder/services/grader_service.py`
- **Problem:** `GraderService` is instantiated once per `GradeStep` but accumulates mutable state through setter methods: `set_sandbox()`, `set_submission_language()`, and the internal `__submission_files`. This makes the service stateful and order-dependent — callers must remember to call setters before `grade_from_tree()`. It also means the service cannot be safely reused across concurrent submissions.
- **Impact:** The setter pattern (`set_sandbox`, `set_submission_language`) is a code smell that indicates these values should be parameters, not instance state. It also creates a hidden coupling: `GradeStep.execute()` must know the exact sequence of setter calls before invoking grading.
- **Action:**
    - Refactor `grade_from_tree()` to accept `sandbox`, `submission_language`, and `submission_files` as explicit parameters instead of relying on pre-set instance state (see the sketch below).
    - Remove the setter methods and the instance variables they populate. `GraderService` becomes stateless and can be a singleton or module-level utility.
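The target signature could look like this (parameter types are assumptions inferred from names used elsewhere in this roadmap):

```python
class GraderService:
    def grade_from_tree(
        self,
        criteria_tree,          # CriteriaTree built by BuildTreeStep
        sandbox,                # SandboxContainer acquired during pre-flight
        submission_language,    # e.g. "python", "java"
        submission_files,       # mapping of filename -> content
    ):
        # All inputs arrive as arguments: no set_sandbox()/set_submission_language()
        # calls and no instance state, so one instance is safe for concurrent use.
        ...
```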
#### Item 7: Extract Language Resolution from Test Execution

- **Files:** `autograder/services/grader_service.py`, `autograder/template_library/input_output.py`, `autograder/services/command_resolver.py`
- **Problem:** The `GraderService.process_test()` method injects `__submission_language__` as a hidden parameter into every test's kwargs dict. `ExpectOutputTest` and `DontFailTest` in `input_output.py` then extract this hidden parameter to resolve the `program_command` via `CommandResolver`. This creates an implicit contract: tests must know to look for a magic key in their kwargs, and the grader must know to inject it.
- **Impact:** The `__submission_language__` convention is undocumented, fragile, and invisible to anyone reading the `TestFunction.execute()` signature. It also means command resolution logic is scattered: `CommandResolver` lives in `services/`, but it's invoked inside individual test functions rather than at the pipeline level.
- **Action:**
    - Resolve commands at the `GradeStep` or `BuildTreeStep` level, before tests execute. When building `TestNode` objects, resolve any `program_command` parameter using `CommandResolver` and the submission's language. By the time a test executes, its parameters should contain the final, resolved command string (see the sketch after this list).
    - Remove the `__submission_language__` injection from `GraderService.process_test()`.
    - This makes `TestFunction.execute()` a pure function of its declared parameters — no hidden state.
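A sketch of tree-build-time resolution; `CommandResolver`'s `resolve()` signature and the `TestNode.parameters` dict are assumptions:

```python
def resolve_node_commands(test_nodes, submission_language, resolver):
    """Rewrite each TestNode's program_command before any test executes."""
    for node in test_nodes:
        params = node.parameters  # assumed kwargs dict carried by TestNode
        if "program_command" in params:
            params["program_command"] = resolver.resolve(
                params["program_command"], submission_language
            )
    # Tests now receive a final command string; no __submission_language__ key.
```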
#### Item 8: Unify the Template Registration and Discovery System

- **Files:** `autograder/template_library/__init__.py`, `autograder/services/template_library_service.py`
- **Problem:** There are two parallel systems for template access:
    - a `TEMPLATE_REGISTRY` dict in `__init__.py` with `get_template()` and `get_template_instance()` module-level functions;
    - a `TemplateLibraryService` singleton that wraps the registry, caches instances, and adds metadata methods.

  The `TemplateLoaderStep` uses `TemplateLibraryService`, but the service internally calls `get_template_instance()` from the module. The web layer's `lifespan.py` also initializes `TemplateLibraryService` at startup. This creates a confusing dual path where templates can be accessed through either system.
- **Impact:** Two entry points for the same data means two places to maintain, two places where bugs can hide, and confusion about which to use. The module-level functions also make testing harder (no way to inject mocks without patching globals).
- **Action:**
    - Make `TemplateLibraryService` the single authority for template access. Move the `TEMPLATE_REGISTRY` dict into the service class (see the sketch below).
    - Remove the module-level `get_template()` and `get_template_instance()` functions, or make them thin wrappers that delegate to `TemplateLibraryService.get_instance()`.
    - Ensure all consumers (steps, web layer, GitHub Action) go through the service.
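A sketch of the consolidated service. `get_instance()` is named above; the registry keys, the caching details, and the `api_testing` import path are assumptions:

```python
from autograder.template_library.web_dev import WebDevTemplate
from autograder.template_library.input_output import InputOutputTemplate
from autograder.template_library.api_testing import ApiTestingTemplate  # path assumed


class TemplateLibraryService:
    _instance = None

    def __init__(self):
        # The registry lives inside the service instead of module-level globals.
        self._registry = {
            "web_dev": WebDevTemplate,
            "input_output": InputOutputTemplate,
            "api_testing": ApiTestingTemplate,
        }
        self._instances = {}

    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def get_template(self, name):
        # Lazy instantiation preserves the old caching behavior.
        if name not in self._instances:
            self._instances[name] = self._registry[name]()
        return self._instances[name]
```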
#### Item 9: Formalize the PipelineExecution Data Flow Contract

- **File:** `autograder/models/pipeline_execution.py`
- **Problem:** `PipelineExecution` stores step results as a flat list of `StepResult` objects. Each step must know which previous step's data it needs and call `get_step_result(StepName.X).data` with the correct step name. The `data` field is typed as a generic `T` but in practice is `Any` — it could be a `Template`, a `CriteriaTree`, a `SandboxContainer`, a `GradeStepResult`, a `Focus`, or a string. There's no compile-time or runtime validation that a step receives the data type it expects.
- **Impact:** Steps are coupled to each other's internal data shapes through implicit knowledge. If a step changes what it stores in `data`, downstream steps break silently. The `finish_execution()` method also hardcodes knowledge of which steps exist (`StepName.GRADE`, `StepName.FEEDBACK`, `StepName.FOCUS`) to assemble the final `GradingResult`.
- **Action:**
    - Add typed accessor properties to `PipelineExecution` for commonly accessed data: `template`, `criteria_tree`, `sandbox`, `grade_result`, `focus`, `feedback`. Each property internally calls `get_step_result()` and casts to the expected type (see the sketch below).
    - This doesn't change the storage mechanism but provides a typed interface that makes the data flow explicit and catches mismatches earlier.
    - Refactor `finish_execution()` to use these typed accessors instead of raw `get_step_result()` calls.
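Two example accessors (the `StepName` members and import paths are assumptions; `get_step_result()` is the existing lookup named above):

```python
from typing import Optional

from autograder.models.criteria_tree import CriteriaTree  # path assumed
from autograder.models.step_result import StepName        # path assumed


class PipelineExecution:
    @property
    def criteria_tree(self) -> CriteriaTree:
        data = self.get_step_result(StepName.BUILD_TREE).data
        if not isinstance(data, CriteriaTree):
            # Fail loudly instead of letting a wrong type flow downstream.
            raise TypeError(
                f"BUILD_TREE stored {type(data).__name__}, expected CriteriaTree"
            )
        return data

    @property
    def feedback(self) -> Optional[str]:
        result = self.get_step_result(StepName.FEEDBACK)
        return result.data if result else None
```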
### 🟢 Priority 3 — Code Quality & Consistency
These items address inconsistencies, dead code, and patterns that make the codebase harder to understand and maintain. They don't affect correctness but reduce cognitive load.
#### Item 10: Clean Up the web_dev.py Monolith (1422 Lines)

- **File:** `autograder/template_library/web_dev.py`
- **Problem:** This single file contains 36 test function classes plus the `WebDevTemplate` class, totaling 1422 lines. Each test class follows the same pattern (name, description, parameter_description, required_file, execute), but they're all in one file. There's also an inconsistent test registration key: `"Count Unused Css Classes"` uses spaces and title case while all others use snake_case.
- **Impact:** The file is difficult to navigate, review, and test in isolation. Adding a new web dev test means modifying a 1400-line file. The inconsistent key means test lookup behavior differs for that one test.
- **Action:**
    - Split into sub-modules: `template_library/web_dev/html_tests.py`, `template_library/web_dev/css_tests.py`, `template_library/web_dev/js_tests.py`, `template_library/web_dev/structure_tests.py`.
    - Keep `WebDevTemplate` in `template_library/web_dev/__init__.py` as the aggregator that imports and registers all tests (see the sketch below).
    - Fix the `"Count Unused Css Classes"` key to `"count_unused_css_classes"` for consistency.
#### Item 11: Standardize Language Across the Codebase

- **Files:** Multiple files across `autograder/`, `web/`
- **Problem:** The codebase mixes Portuguese and English inconsistently:
    - Error messages: `"Erro: Arquivo ou diretório obrigatório não encontrado"` ("Error: required file or directory not found") in `PreFlightService`, but `"Error: Setup command failed"` (English) in the same file.
    - Test descriptions in `web_dev.py`: all in Portuguese, e.g., `"Verifica se o código JS usa um número mínimo de métodos comuns de manipulação do DOM."` ("Checks whether the JS code uses a minimum number of common DOM manipulation methods.").
    - `FeedbackPreferences` defaults: Portuguese category headers (`"✅ Requisitos Essenciais"`, "Essential Requirements").
    - AI executor prompts: Portuguese (`"Você é um assistente de avaliação de código"`, "You are a code evaluation assistant").
    - Logging and code comments: English.
- **Impact:** Contributors must context-switch between languages. Error messages shown to students are inconsistent. Internationalization becomes harder when strings are hardcoded in mixed languages.
- **Action:**
    - Decide on a language strategy: either (a) all internal code/logs in English with a separate i18n layer for student-facing strings, or (b) all student-facing strings in Portuguese with English internals.
    - Extract all hardcoded student-facing strings into a centralized location (e.g., a `messages.py` or locale file) so they can be managed and eventually translated (see the sketch below).
    - This is a large task — start with the `PreFlightService` error messages and `FeedbackPreferences` defaults as the first pass.
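A minimal sketch of a centralized string table for strategy (a); the module name, the keys, and the Portuguese wording of the second message are suggestions, not existing code:

```python
# messages.py (sketch of a hypothetical string table)
MESSAGES = {
    "pt": {
        "required_file_missing": "Erro: Arquivo ou diretório obrigatório não encontrado",
        "setup_command_failed": "Erro: Comando de configuração falhou",
    },
    "en": {
        "required_file_missing": "Error: required file or directory not found",
        "setup_command_failed": "Error: setup command failed",
    },
}


def msg(key: str, locale: str = "pt") -> str:
    """Look up a student-facing message, falling back to English."""
    return MESSAGES.get(locale, {}).get(key) or MESSAGES["en"][key]
```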
#### Item 12: Remove Dead Code and Empty Modules

- **Files:** `autograder/services/parsers/__init__.py`, `autograder/models/dataclass/autograder_response.py`
- **Problem:**
    - `autograder/services/parsers/` is an empty package (only an `__init__.py` with 0 lines). No code imports from it. It appears to be a leftover from a planned feature that was never implemented.
    - The `AutograderResponse` dataclass has fields (`status`, `final_score`, `feedback`, `test_report`) that overlap with `GradingResult` but is not used anywhere in the pipeline. The pipeline produces `PipelineExecution` → `GradingResult`, never `AutograderResponse`. It may be a legacy model from before the pipeline architecture was introduced.
- **Impact:** Dead code creates confusion about what's active and what's legacy. New contributors may try to use these thinking they're part of the current architecture.
- **Action:**
    - Delete the `autograder/services/parsers/` directory.
    - Verify `AutograderResponse` has no imports anywhere. If confirmed unused, delete it. If it's used in the GitHub Action path, consolidate it with `GradingResult`.
#### Item 13: Consolidate the required_file / required_file_type Naming

- **Files:** `autograder/models/abstract/test_function.py`, all template test classes
- **Problem:** The abstract `TestFunction` class defines `required_file_type` (returning a string like `"HTML"`, `"CSS"`, `"JavaScript"`), but every concrete test class implements a property called `required_file` (not `required_file_type`). The abstract property `required_file_type` has a default implementation returning `None`, so concrete classes don't override it — they define their own `required_file` property that is not part of the abstract contract.
- **Impact:** The `required_file` property on concrete tests is used by `TemplateLibraryService.get_template_info()` to list test metadata, but it's not enforced by the abstract class. A new test could omit it without any error. The naming mismatch (`required_file` vs `required_file_type`) also creates confusion about what the property represents.
- **Action:**
    - Decide on one name. Since the concrete implementations all use `required_file` and return descriptive strings like `"HTML"`, `"CSS"`, rename the abstract property from `required_file_type` to `required_file` and make it abstract (or keep the default `None` for tests that don't require files; see the sketch below).
    - Update all concrete test classes to explicitly override the abstract property.
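A sketch of the unified property, keeping the default-`None` variant described above (the rest of the `TestFunction` contract is elided):

```python
# autograder/models/abstract/test_function.py (sketch)
from abc import ABC
from typing import Optional


class TestFunction(ABC):
    @property
    def required_file(self) -> Optional[str]:
        # Default None keeps tests without file requirements valid; concrete
        # tests override with a descriptive string like "HTML" or "CSS".
        return None
```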
#### Item 14: Clean Up print() Statements in Production Code

- **Files:** `autograder/utils/executors/ai_executor.py`, `autograder/services/upstash_driver.py`, `autograder/services/template_library_service.py`
- **Problem:** Several files use `print()` for output instead of the logging framework:
    - `AiExecutor`: 8+ `print()` calls for debugging AI responses (`"Sending AI engine batch request..."`, `"Found matching TestResult for AI result:"`, etc.).
    - `UpstashDriver`: `print(f"User '{username}' created.")`, `print(f"Score '{score}' set...")`.
    - `TemplateLibraryService._load_all_templates()`: `print(f"Warning: Failed to load template...")`.
- **Impact:** `print()` output goes to stdout with no level, timestamp, or source information. It can't be filtered, routed, or suppressed in production. It also mixes with structured log output.
- **Action:**
    - Replace all `print()` calls with appropriate `logger.info()`, `logger.debug()`, or `logger.warning()` calls using the module's logger, as sketched below.
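The replacement pattern, using the standard library logger (shown as a standalone example rather than the actual call sites):

```python
import logging

logger = logging.getLogger(__name__)


def example(username: str, template_name: str) -> None:
    # Before: print("Sending AI engine batch request...")
    logger.debug("Sending AI engine batch request...")
    # Before: print(f"User '{username}' created.")
    logger.info("User '%s' created.", username)
    # Before: print(f"Warning: Failed to load template {template_name}")
    logger.warning("Failed to load template %s", template_name)
```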
#### Item 15: Decouple UpstashDriver from Global Environment Loading

- **File:** `autograder/services/upstash_driver.py`
- **Problem:** The file calls `load_dotenv()` at module import time (line 6), outside any function or class. This means importing the module has the side effect of loading `.env` into the process environment. The TODO comment acknowledges this: "place this in application startup."
- **Impact:** Module-level side effects make testing unpredictable and can interfere with other modules' environment expectations. It also means the `.env` file is loaded even if `UpstashDriver` is never instantiated.
- **Action:**
    - Remove the module-level `load_dotenv()` call.
    - Ensure environment loading happens once at application startup (in `web/core/lifespan.py` or the GitHub Action's `main.py`).
    - Consider making `UpstashDriver.__init__()` accept the Redis URL and token as constructor parameters instead of reading from `os.getenv()` directly, enabling dependency injection and testability (see the sketch below).
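A sketch of constructor injection with an environment fallback; the environment variable names are assumptions:

```python
import os
from typing import Optional


class UpstashDriver:
    def __init__(self, redis_url: Optional[str] = None, token: Optional[str] = None):
        # Explicit parameters win; the environment is only a fallback, and
        # load_dotenv() now belongs to application startup, not this module.
        self._redis_url = redis_url or os.getenv("UPSTASH_REDIS_URL", "")
        self._token = token or os.getenv("UPSTASH_REDIS_TOKEN", "")
        if not self._redis_url or not self._token:
            raise ValueError("UpstashDriver requires a Redis URL and token")
```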
#### Item 16: Standardize the Template Abstract Class Contract

- **File:** `autograder/models/abstract/template.py`
- **Problem:** The `Template` ABC defines `get_test(name)` as abstract but `get_tests()` as a concrete method that accesses `self.tests` — an attribute that is not declared in the abstract class. Each concrete template (`WebDevTemplate`, `InputOutputTemplate`, `ApiTestingTemplate`) defines `self.tests` as a dict in `__init__`, but this is a convention, not a contract.
- **Impact:** The abstract class doesn't fully describe what a template must provide. A new template author must read existing implementations to understand the implicit contract (must have a `self.tests` dict, etc.).
- **Action:**
    - Add `tests: Dict[str, TestFunction]` as a declared attribute (or abstract property) on the `Template` ABC (see the sketch below).
    - Remove legacy/unused attributes from template implementations so the active contract only reflects currently used behavior.
    - Audit all template properties to ensure the ABC is the single source of truth for the template contract.
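A sketch of the tightened ABC (the `TestFunction` import path matches Item 13; other details are assumed):

```python
# autograder/models/abstract/template.py (sketch)
from abc import ABC, abstractmethod
from typing import Dict

from autograder.models.abstract.test_function import TestFunction


class Template(ABC):
    # Declared on the ABC so the contract is explicit, not a convention.
    tests: Dict[str, TestFunction]

    @abstractmethod
    def get_test(self, name: str) -> TestFunction: ...

    def get_tests(self) -> Dict[str, TestFunction]:
        # The concrete method can now rely on the declared attribute.
        return self.tests
```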
### 🔵 Priority 4 — Evolutionary Architecture Preparation
These items prepare the codebase for future growth by establishing patterns and removing obstacles. They are not urgent but will pay dividends as the system scales.
#### Item 17: Introduce a Step Registry Pattern for Pipeline Construction

- **File:** `autograder/autograder.py`
- **Problem:** `build_pipeline()` is a 60-line function with conditional logic for each optional step (pre-flight, feedback, export). Adding a new step requires modifying this function, understanding the ordering constraints, and knowing which services to instantiate. The function also hardcodes service instantiation (e.g., `FocusService()`, `ReporterService(feedback_mode)`, `UpstashDriver`).
- **Impact:** Pipeline construction is monolithic. There's no way to compose pipelines from configuration or to add steps without touching the builder function.
- **Action:**
    - Create a `StepRegistry` that maps `StepName` to a factory function. Each factory receives the relevant config slice and returns a configured `Step` instance (see the sketch below).
    - Refactor `build_pipeline()` to iterate over a list of desired step names and use the registry to instantiate them.
    - This makes it possible to define pipeline compositions declaratively (e.g., "this assignment uses steps: LOAD_TEMPLATE, BUILD_TREE, PRE_FLIGHT, GRADE, FOCUS, FEEDBACK") without modifying builder code.
#### Item 18: Formalize the Exporter as a Plugin Interface

- **Files:** `autograder/steps/export_step.py`, `autograder/services/upstash_driver.py`, `github_action/github_action_service.py`
- **Problem:** The `ExporterStep` receives an `exporter_service`, but there's no abstract interface defining what an exporter must implement. `UpstashDriver` has `set_score()`, but the GitHub Action's `export_results()` has a completely different signature. The `build_pipeline()` function passes `UpstashDriver` (the class, not an instance) to `ExporterStep`, which means the step would need to instantiate it — but the step just calls `self._exporter_service.set_score()` directly, which would fail on a class reference.
- **Impact:** There's no way to swap exporters without modifying the step. The GitHub Action has its own export path (`GithubActionService.export_results()`) that bypasses the pipeline's export step entirely.
- **Action:**
    - Define an `Exporter` ABC with an `export(user_id, score, feedback)` method (see the sketch below).
    - Make `UpstashDriver` implement this interface.
    - Create a `GithubClassroomExporter` that wraps the GitHub Action's export logic.
    - Fix `build_pipeline()` to pass an exporter instance (not a class reference).
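A sketch of the interface and the two adapters named above; only `set_score()` and `export_results()` are confirmed by this item, and their exact signatures are assumptions:

```python
from abc import ABC, abstractmethod
from typing import Optional


class Exporter(ABC):
    @abstractmethod
    def export(self, user_id: str, score: float, feedback: Optional[str]) -> None: ...


class UpstashExporter(Exporter):
    def __init__(self, driver):  # driver: an UpstashDriver *instance*
        self._driver = driver

    def export(self, user_id: str, score: float, feedback: Optional[str]) -> None:
        self._driver.set_score(user_id, score)  # existing method; args assumed


class GithubClassroomExporter(Exporter):
    def __init__(self, action_service):  # wraps GithubActionService
        self._action_service = action_service

    def export(self, user_id: str, score: float, feedback: Optional[str]) -> None:
        self._action_service.export_results(score, feedback)  # signature assumed
```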
#### Item 19: Separate the PipelineExecution Summary Logic from the Model

- **File:** `autograder/models/pipeline_execution.py`
- **Problem:** `PipelineExecution` is a data model (it holds step results, submission, status) but also contains 100+ lines of presentation logic in `get_pipeline_execution_summary()` and `_extract_error_details()`. The `_extract_error_details` method parses error message strings using string matching (`"Arquivo ou diretório obrigatório não encontrado" in error_text`, `"Setup command" in error_text`) to reconstruct structured data that was originally structured in `PreFlightService` but was flattened into a string by `PreFlightStep._format_errors()`.
- **Impact:** The model is doing serialization, presentation, and reverse-parsing of its own error strings. This is fragile — if the error message wording changes, the parser breaks. It also means the model knows about the internal format of every step's errors.
- **Action:**
    - Store `PreflightError` objects (or their dicts) directly in the step result instead of formatting them into a string and then parsing the string back. The `PreFlightStep` should store structured error data, and the summary generator should read it directly.
    - Extract `get_pipeline_execution_summary()` into a separate `PipelineExecutionSerializer` or utility function that takes a `PipelineExecution` and produces the API-facing dict (see the sketch below). This keeps the model clean and the serialization logic testable independently.
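A sketch of the standalone serializer, assuming step results carry structured `errors` objects as proposed above (all field names are assumptions):

```python
def serialize_pipeline_execution(execution) -> dict:
    """Build the API-facing summary dict outside the model class."""
    return {
        "status": execution.status.name,
        "steps": [
            {
                "step": result.step.name,
                "status": result.status.name,
                # Structured error objects pass through; no string re-parsing.
                "errors": [error.to_dict() for error in (result.errors or [])],
            }
            for result in execution.step_results
        ],
    }
```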
#### Item 20: Establish a Consistent Error Handling Strategy Across Steps

- **Files:** All files in `autograder/steps/`
- **Problem:** Each step handles errors differently:
    - `TemplateLoaderStep`: catches all exceptions, returns a `StepResult` with `FAIL` status and an error string.
    - `BuildTreeStep`: same pattern, but also passes `original_input=pipeline_exec`.
    - `PreFlightStep`: returns `FAIL` with a formatted error string, but also has inline comments questioning error handling ("Needs error handling?", "Return Sandbox Here anyway?", "How to deal with sandbox destruction").
    - `GradeStep`, `FocusStep`, `FeedbackStep`, `ExporterStep`: catch all exceptions, return `FAIL`.

  The `original_input` field is set inconsistently — some steps set it, others don't. The pipeline's `run()` method also catches exceptions separately and sets `INTERRUPTED` status, creating a dual error-handling path.
- **Impact:** Error handling is duplicated in every step with slight variations. The `original_input` field on `StepResult` is sometimes set and sometimes not, making it unreliable. The dual error path (step-level catch vs pipeline-level catch) means some errors are `FAIL` and others are `INTERRUPTED` with no clear semantic distinction.
- **Action:**
    - Create a base step class or decorator that wraps `execute()` in a standard try/except, constructs the `StepResult` with `FAIL` status, and logs the error. Individual steps only implement the happy path (see the sketch below).
    - Decide on the semantics of `FAIL` vs `INTERRUPTED`: `FAIL` = the step detected a known error condition (e.g., a missing file); `INTERRUPTED` = an unexpected exception. Document this.
    - Remove `original_input` from `StepResult` if it's not consistently used, or make it mandatory.
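A sketch of the base-class approach; `_run()` is a hypothetical hook name, and the `error` field on `StepResult` is an assumption:

```python
import logging

logger = logging.getLogger(__name__)

# StepResult/StepStatus imports elided; see Item 4 for their assumed paths.


class BaseStep:
    step_name: str  # set by each concrete step

    def execute(self, pipeline_exec):
        try:
            return self._run(pipeline_exec)  # concrete steps implement the happy path
        except Exception as exc:
            # One shared catch-all replaces the per-step try/except variations.
            logger.exception("Step %s failed", self.step_name)
            return StepResult(step=self.step_name, data=None,
                              status=StepStatus.FAIL, error=str(exc))

    def _run(self, pipeline_exec):
        raise NotImplementedError
```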
#### Item 21: Decouple the AiExecutor Batch Pattern from Test Execution

- **Files:** `autograder/utils/executors/ai_executor.py`, `autograder/services/grader_service.py`
- **Problem:** The `AiExecutor` implements a batch-send pattern where individual tests call `executor.add_test()` during tree traversal, and then `GraderService.grade_from_tree()` calls `executor.stop()` after all tests are processed to send a single batch request to OpenAI. This means:
    - `GraderService` must know about `AiExecutor` and check for it after grading (`if hasattr(test_func, "executor") and test_func.executor`).
    - Test functions that use AI don't actually execute during `execute()` — they register themselves and return an empty `TestResult` that gets populated later via `mapback()`.
    - The executor stores mutable state (`tests`, `test_result_references`, `submission_files`) and mutates `TestResult` objects in place after the fact.
- **Impact:** This breaks the pipeline's mental model, where each test executes and returns a result. Instead, AI tests return placeholder results that are silently mutated later. The `GraderService` has to know about this special case, creating coupling between the grading service and the AI execution strategy.
- **Action:**
    - Refactor AI execution into a pipeline-aware pattern. Options:
        - Option A: Make AI execution a post-processing step that runs after the grade step. The grade step marks AI tests as "pending," and a new `AiExecutionStep` collects them, sends the batch, and fills in the results (see the sketch after this list).
        - Option B: Make the `AiExecutor` a strategy that the `GradeStep` can invoke after tree traversal, removing the need for `GraderService` to know about it.
    - Either way, eliminate the in-place mutation of `TestResult` objects and the `hasattr` check in `GraderService`.
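A sketch of Option A; `run_batch()` is a hypothetical executor API, and the result shapes are assumptions (the typed `grade_result` accessor is the one proposed in Item 9):

```python
class AiExecutionStep:
    def __init__(self, ai_executor):
        self._executor = ai_executor

    def execute(self, pipeline_exec):
        grade_result = pipeline_exec.grade_result  # typed accessor from Item 9
        pending = [r for r in grade_result.test_results if r.status == "pending"]
        if not pending:
            return pipeline_exec
        # One batch call for every pending AI test; fresh TestResult objects
        # replace the placeholders instead of being mutated in place.
        resolved = self._executor.run_batch(pending)  # hypothetical API
        grade_result.test_results = [
            resolved.get(r.id, r) for r in grade_result.test_results
        ]
        return pipeline_exec
```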
## Progress Tracking
| # | Item | Priority | Status |
|---|---|---|---|
| 1 | Fix inverted reporter mode | 🔴 P1 | ✅ Done |
| 2 | Implement DefaultReporter | 🔴 P1 | ✅ Done |
| 3 | Remove stale test_library field | 🔴 P1 | ⬜ To Do |
| 4 | Fix FeedbackStep contract | 🔴 P1 | ✅ Done |
| 5 | Decouple sandbox from pre-flight | 🟡 P2 | ⬜ To Do |
| 6 | Decouple GraderService state | 🟡 P2 | ⬜ To Do |
| 7 | Extract language resolution | 🟡 P2 | ✅ Done |
| 8 | Unify template registration | 🟡 P2 | ⬜ To Do |
| 9 | Formalize PipelineExecution data flow | 🟡 P2 | ⬜ To Do |
| 10 | Split web_dev.py monolith | 🟢 P3 | ⬜ To Do |
| 11 | Standardize language (i18n) | 🟢 P3 | ⬜ To Do |
| 12 | Remove dead code | 🟢 P3 | ⬜ To Do |
| 13 | Consolidate required_file naming | 🟢 P3 | ⬜ To Do |
| 14 | Replace print() with logging | 🟢 P3 | ⬜ To Do |
| 15 | Decouple UpstashDriver env loading | 🟢 P3 | ⬜ To Do |
| 16 | Standardize Template ABC | 🟢 P3 | ⬜ To Do |
| 17 | Step registry pattern | 🔵 P4 | ✅ Done |
| 18 | Formalize exporter plugin | 🔵 P4 | ✅ Done |
| 19 | Separate summary from model | 🔵 P4 | ✅ Done |
| 20 | Consistent error handling | 🔵 P4 | ⬜ To Do |
| 21 | Decouple AiExecutor batch | 🔵 P4 | ⬜ To Do |
## Recommended Execution Order

The items have dependency relationships. Here is the suggested execution sequence:

**Phase 1 — Fix What's Broken (Items 1–4).** These can be done in parallel. They fix code that would crash or produce wrong results at runtime.

**Phase 2 — Structural Decoupling (Items 5–9).** Start with Item 6 (GraderService state) and Item 7 (language resolution), as they're self-contained. Then tackle Item 5 (sandbox lifecycle), which touches more files. Items 8 and 9 can be done in parallel.

**Phase 3 — Code Quality (Items 10–16).** These are independent of each other and can be tackled in any order. Item 10 (web_dev split) is the largest; Item 14 (print cleanup) is the quickest win.

**Phase 4 — Architecture Evolution (Items 17–21).** Item 19 (summary separation) should come before Item 20 (error handling), since the error handling strategy depends on how errors are stored. Item 17 (step registry) enables Item 18 (exporter plugin). Item 21 (AiExecutor) is independent.