Requirements Document
Introduction
This document specifies the requirements for improving the robustness, efficiency, and usability of the AI Data Analysis Agent. The improvements span five areas: a data privacy fallback mechanism for recovering from LLM-generated code failures when real data is unavailable, conversation history trimming to reduce token consumption and prevent data leakage, integration of the existing analysis template system, frontend progress bar display, and multi-file parallel/chunked analysis support.
Glossary
- Agent: The `DataAnalysisAgent` class in `data_analysis_agent.py` that orchestrates LLM calls and IPython code execution for data analysis.
- Safe_Profile: The schema-only data description generated by `build_safe_profile()` in `utils/data_privacy.py`, containing column names, data types, null rates, and unique value counts, but no real data values.
- Local_Profile: The full data profile generated by `build_local_profile()` containing real data values, statistics, and sample rows; used only in the local execution environment.
- Code_Executor: The `CodeExecutor` class in `utils/code_executor.py` that runs Python code in an IPython sandbox and returns execution results.
- Conversation_History: The list of `{"role": ..., "content": ...}` message dictionaries maintained by the Agent across analysis rounds.
- Feedback_Sanitizer: The `sanitize_execution_feedback()` function in `utils/data_privacy.py` that removes real data values from execution output before sending to the LLM.
- Template_Registry: The `TEMPLATE_REGISTRY` dictionary in `utils/analysis_templates.py` mapping template names to template classes.
- Session_Data: The `SessionData` class in `web/main.py` that tracks session state including `progress_percentage`, `current_round`, `max_rounds`, and `status_message`.
- Polling_Loop: The `setInterval`-based polling mechanism in `web/static/script.js` that fetches `/api/status` every 2 seconds.
- Data_Loader: The module `utils/data_loader.py` providing the `load_and_profile_data`, `load_data_chunked`, and `load_data_with_cache` functions.
- AppConfig: The `AppConfig` dataclass in `config/app_config.py` holding configuration values such as `max_rounds`, `chunk_size`, and `max_file_size_mb`.
Requirements
Requirement 1: Data Privacy Fallback — Error Detection
User Story: As a system operator, I want the Agent to detect when LLM-generated code fails due to missing real data context, so that the system can attempt intelligent recovery instead of wasting an analysis round.
Acceptance Criteria
- WHEN the Code_Executor returns a failed execution result, THE Agent SHALL classify the error as either a data-context error or a non-data error by inspecting the error message for patterns such as `KeyError`, `ValueError` on column values, `NameError` for undefined data variables, or empty DataFrame conditions.
- WHEN a data-context error is detected, THE Agent SHALL increment a per-round retry counter for the current analysis round.
- WHILE the retry counter for a given round is below the configured maximum retry limit, THE Agent SHALL attempt recovery by generating an enriched hint prompt rather than forwarding the raw error to the LLM as a normal failure.
- IF the retry counter reaches the configured maximum retry limit, THEN THE Agent SHALL fall back to normal error handling by forwarding the sanitized error feedback to the LLM and proceeding to the next round.
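The classification step above can be sketched as a regex scan over the error message. This is a minimal illustration, not the actual implementation: the function name `is_data_context_error` and the specific patterns are assumptions chosen to match the error classes named in the first criterion.

```python
import re

# Illustrative patterns suggesting the failure came from missing real-data
# context (unknown columns, unseen values, undefined data variables) rather
# than from a logic or syntax bug.
DATA_CONTEXT_PATTERNS = [
    r"KeyError",                                    # unknown column or dict key
    r"NameError: name '(df|data)\w*'",              # undefined data variable
    r"ValueError: .*(invalid|could not convert)",   # bad column value
    r"empty DataFrame",                             # empty-frame condition
]

def is_data_context_error(error_message: str) -> bool:
    """Classify an execution failure as data-context (True) or non-data (False)."""
    return any(re.search(p, error_message) for p in DATA_CONTEXT_PATTERNS)
```

A non-matching failure (e.g. a `SyntaxError`) falls through to normal error handling, so only data-context failures consume retry budget.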
Requirement 2: Data Privacy Fallback — Enriched Hint Generation
User Story: As a system operator, I want the Agent to provide the LLM with enriched schema hints when data-context errors occur, so that the LLM can generate corrected code without receiving raw data values.
Acceptance Criteria
- WHEN a data-context error is detected and retry is permitted, THE Agent SHALL generate an enriched hint containing the relevant column's data type, unique value count, null rate, and a categorical description (e.g., "low-cardinality category with 5 classes") extracted from the Safe_Profile.
- WHEN the error involves a specific column name referenced in the error message, THE Agent SHALL include that column's schema metadata in the enriched hint.
- THE Agent SHALL append the enriched hint to the conversation history as a user message with a prefix indicating it is a retry context, before requesting a new LLM response.
- THE Agent SHALL NOT include any real data values, sample rows, or statistical values (min, max, mean) from the Local_Profile in the enriched hint sent to the LLM.
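A hint satisfying these criteria might be assembled as below. The `safe_profile` dictionary shape, the `[RETRY CONTEXT]` prefix, and the cardinality threshold are all assumptions for illustration; only the metadata fields (dtype, unique count, null rate) come from the Safe_Profile definition.

```python
def build_enriched_hint(column: str, safe_profile: dict) -> str:
    """Build a schema-only retry hint for the LLM; never touches Local_Profile."""
    meta = safe_profile["columns"].get(column)
    if meta is None:
        cols = ", ".join(safe_profile["columns"])
        return f"[RETRY CONTEXT] Column '{column}' does not exist. Available columns: {cols}"
    # Assumed threshold: <= 20 unique values counts as low-cardinality.
    cardinality = ("low-cardinality category" if meta["unique_count"] <= 20
                   else "high-cardinality field")
    return (
        f"[RETRY CONTEXT] The previous code failed on column '{column}'. "
        f"Schema: dtype={meta['dtype']}, unique values={meta['unique_count']}, "
        f"null rate={meta['null_rate']:.0%}. It is a {cardinality} with "
        f"{meta['unique_count']} classes. Do not assume specific values; "
        f"derive categories from the data at runtime."
    )
```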
Requirement 3: Data Privacy Fallback — Configuration
User Story: As a system operator, I want to configure the maximum number of data-context retries, so that I can balance between recovery attempts and analysis throughput.
Acceptance Criteria
- THE AppConfig SHALL include a `max_data_context_retries` field with a default value of 2.
- WHEN the `APP_MAX_DATA_CONTEXT_RETRIES` environment variable is set, THE AppConfig SHALL use its integer value to override the default.
- THE Agent SHALL read the `max_data_context_retries` value from AppConfig during initialization.
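A minimal sketch of the new config field with its environment override, assuming AppConfig resolves overrides at instantiation time; the `_env_int` helper is illustrative, not an existing function.

```python
import os
from dataclasses import dataclass, field

def _env_int(name: str, default: int) -> int:
    """Read an integer environment override, falling back to the default."""
    raw = os.environ.get(name)
    return int(raw) if raw is not None else default

@dataclass
class AppConfig:
    # Default of 2 retries per Requirement 3; env var takes precedence.
    max_data_context_retries: int = field(
        default_factory=lambda: _env_int("APP_MAX_DATA_CONTEXT_RETRIES", 2)
    )
```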
Requirement 4: Conversation History Trimming — Sliding Window
User Story: As a system operator, I want the conversation history to be trimmed using a sliding window, so that token consumption stays bounded and early execution results containing potential data leakage are removed.
Acceptance Criteria
- THE AppConfig SHALL include a `conversation_window_size` field with a default value of 10, representing the maximum number of recent message pairs to retain in full.
- WHEN the Conversation_History length exceeds twice the `conversation_window_size` (counting individual messages), THE Agent SHALL retain only the most recent `conversation_window_size` pairs of messages in full detail.
- THE Agent SHALL always retain the first user message (containing the original requirement and Safe_Profile) regardless of window trimming.
- WHEN messages are trimmed from the Conversation_History, THE Agent SHALL generate a compressed summary of the trimmed messages and insert it immediately after the first user message.
Requirement 5: Conversation History Trimming — Summary Compression
User Story: As a system operator, I want trimmed conversation rounds to be compressed into a summary, so that the LLM retains awareness of prior analysis steps without consuming excessive tokens.
Acceptance Criteria
- WHEN conversation messages are trimmed, THE Agent SHALL produce a summary string that lists each trimmed round's action type (`generate_code`, `collect_figures`), a one-line description of what was done, and whether execution succeeded or failed.
- THE summary SHALL NOT contain any code blocks, raw execution output, or data values from prior rounds.
- THE summary SHALL be inserted into the Conversation_History as a single user message immediately after the first user message, replacing any previous summary message.
- IF no messages have been trimmed, THEN THE Agent SHALL NOT insert a summary message.
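Requirements 4 and 5 can be sketched together as one trimming pass. This is a simplified assumption of the message shapes: the per-message `action`, `description`, and `success` metadata fields, and the `[SUMMARY OF EARLIER ROUNDS]` prefix, are hypothetical devices that let the summary describe trimmed rounds without quoting code or raw output.

```python
SUMMARY_PREFIX = "[SUMMARY OF EARLIER ROUNDS]"

def trim_history(history: list[dict], window_size: int = 10) -> list[dict]:
    """Keep the first user message plus the last window_size message pairs."""
    first, rest = history[0], history[1:]
    # Drop any previous summary so it is replaced, never duplicated.
    rest = [m for m in rest if not m["content"].startswith(SUMMARY_PREFIX)]
    if len(rest) <= 2 * window_size:
        return [first] + rest  # nothing trimmed, so no summary message
    trimmed, kept = rest[:-2 * window_size], rest[-2 * window_size:]
    # Summarize each trimmed assistant turn by metadata only (no code/output).
    lines = [
        f"{m.get('action', 'step')}: {m.get('description', '')} "
        f"({'succeeded' if m.get('success') else 'failed'})"
        for m in trimmed if m["role"] == "assistant"
    ]
    summary = {"role": "user", "content": SUMMARY_PREFIX + " " + "; ".join(lines)}
    return [first, summary] + kept
```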
Requirement 6: Analysis Template System — Backend Integration
User Story: As a user, I want to select a predefined analysis template when starting an analysis, so that the Agent follows a structured analysis plan tailored to my scenario.
Acceptance Criteria
- WHEN a template name is provided in the analysis request, THE Agent SHALL retrieve the corresponding template from the Template_Registry using the `get_template()` function.
- WHEN a valid template is retrieved, THE Agent SHALL call `get_full_prompt()` on the template and prepend the resulting structured prompt to the user's requirement in the initial conversation message.
- IF an invalid template name is provided, THEN THE Agent SHALL raise a descriptive error listing available template names.
- WHEN no template name is provided, THE Agent SHALL proceed with the default unstructured analysis flow.
Requirement 7: Analysis Template System — API Endpoint
User Story: As a frontend developer, I want API endpoints to list available templates and to accept a template selection when starting analysis, so that the frontend can offer template choices to users.
Acceptance Criteria
- THE FastAPI server SHALL expose a `GET /api/templates` endpoint that returns the list of available templates by calling `list_templates()`, with each entry containing `name`, `display_name`, and `description`.
- THE `POST /api/start` request body SHALL accept an optional `template` field containing the template name string.
- WHEN the `template` field is present in the start request, THE FastAPI server SHALL pass the template name to the Agent's `analyze()` method.
- WHEN the `template` field is absent or empty, THE FastAPI server SHALL start analysis without a template.
Requirement 8: Analysis Template System — Frontend Template Selector
User Story: As a user, I want to see and select analysis templates in the web interface before starting analysis, so that I can choose a structured analysis approach.
Acceptance Criteria
- WHEN the web page loads, THE frontend SHALL fetch the template list from `GET /api/templates` and render selectable template cards above the requirement input area.
- WHEN a user selects a template card, THE frontend SHALL visually highlight the selected template and store the template name.
- WHEN the user clicks "Start Analysis" with a template selected, THE frontend SHALL include the template name in the `POST /api/start` request body.
- THE frontend SHALL provide a "No Template (Free Analysis)" option that is selected by default, allowing users to proceed without a template.
Requirement 9: Frontend Progress Bar Display
User Story: As a user, I want to see a real-time progress bar during analysis, so that I can understand how far the analysis has progressed.
Acceptance Criteria
- THE FastAPI server SHALL update the Session_Data's `current_round`, `max_rounds`, `progress_percentage`, and `status_message` fields during each analysis round in the `run_analysis_task` function.
- THE `GET /api/status` response SHALL include the `current_round`, `max_rounds`, `progress_percentage`, and `status_message` fields.
- WHEN the Polling_Loop receives status data with `is_running` equal to true, THE frontend SHALL render a progress bar element showing the `progress_percentage` value and the `status_message` text.
- WHEN `progress_percentage` changes between polls, THE frontend SHALL animate the progress bar width transition smoothly.
- WHEN `is_running` becomes false, THE frontend SHALL set the progress bar to 100% and display a completion message.
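The backend half of this requirement reduces to a small per-round update. The `SessionData` field names match the glossary; the `update_progress` helper and the 99% running cap are assumptions (capping leaves the jump to 100% for the completion path).

```python
from dataclasses import dataclass

@dataclass
class SessionData:
    current_round: int = 0
    max_rounds: int = 10
    progress_percentage: int = 0
    status_message: str = ""
    is_running: bool = False

def update_progress(session: SessionData, round_no: int, message: str) -> None:
    """Called once per analysis round inside run_analysis_task (assumed hook)."""
    session.current_round = round_no
    # Cap at 99% while running; the completion handler sets 100%.
    session.progress_percentage = min(99, int(round_no / session.max_rounds * 100))
    session.status_message = message
```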
Requirement 10: Multi-File Chunked Loading
User Story: As a user, I want large data files to be loaded in chunks, so that the system can handle files that exceed available memory.
Acceptance Criteria
- WHEN a data file's size exceeds the `max_file_size_mb` threshold in AppConfig, THE Data_Loader SHALL use `load_data_chunked()` to stream the file in chunks of `chunk_size` rows instead of loading the entire file into memory.
- WHEN chunked loading is used, THE Agent SHALL instruct the Code_Executor to make the chunked iterator available in the notebook environment as a variable, so that LLM-generated code can process data in chunks.
- WHEN chunked loading is used for profiling, THE Agent SHALL generate the Safe_Profile by reading only the first chunk plus sampling from subsequent chunks, rather than loading the entire file.
- IF a file cannot be loaded even in chunked mode, THEN THE Data_Loader SHALL return a descriptive error message indicating the failure reason.
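The size-gated decision could be sketched as below, assuming CSV input via pandas. Only `load_data_chunked` is a glossary name; `load_for_analysis` and the threshold check are illustrative.

```python
import os
import pandas as pd

def load_data_chunked(path: str, chunk_size: int):
    """Yield DataFrame chunks of chunk_size rows (CSV shown for illustration)."""
    return pd.read_csv(path, chunksize=chunk_size)

def load_for_analysis(path: str, max_file_size_mb: float, chunk_size: int):
    """Return a full DataFrame for small files, a chunk iterator for large ones."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > max_file_size_mb:
        return load_data_chunked(path, chunk_size)  # iterator, not a DataFrame
    return pd.read_csv(path)
```

Returning the iterator unconsumed lets the Agent bind it into the notebook namespace so LLM-generated code can loop over chunks itself.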
Requirement 11: Multi-File Parallel Profiling
User Story: As a user, I want multiple data files to be profiled concurrently, so that the initial data exploration phase completes faster when multiple files are uploaded.
Acceptance Criteria
- WHEN multiple files are provided for analysis, THE Agent SHALL profile each file concurrently using thread-based parallelism rather than sequentially.
- THE Agent SHALL collect all profiling results and merge them into a single Safe_Profile string and a single Local_Profile string, maintaining the same format as the current sequential output.
- IF any individual file profiling fails, THEN THE Agent SHALL include an error entry for that file in the profile output and continue profiling the remaining files.
- THE AppConfig SHALL include a `max_parallel_profiles` field with a default value of 4, controlling the maximum number of concurrent profiling threads.
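The criteria above can be sketched with a thread pool. `profile_one` and the merged-string format are illustrative stand-ins for the real profiling functions; the per-file error entry keeps one bad file from aborting the rest, as the third criterion requires.

```python
from concurrent.futures import ThreadPoolExecutor

def profile_one(path: str) -> str:
    # Placeholder for build_safe_profile()-style work on a single file.
    return f"## {path}\n(columns, dtypes, null rates...)"

def profile_files(paths: list[str], max_parallel_profiles: int = 4) -> str:
    """Profile files concurrently, merging results in input order."""
    def safe(path: str) -> str:
        try:
            return profile_one(path)
        except Exception as exc:  # record the failure and keep going
            return f"## {path}\nERROR: profiling failed: {exc}"

    with ThreadPoolExecutor(max_workers=max_parallel_profiles) as pool:
        parts = list(pool.map(safe, paths))  # map preserves input order
    return "\n\n".join(parts)
```

Thread-based parallelism fits here because profiling is dominated by file I/O, where the GIL is released.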