Implement Centralized File Processing State Management
Summary
Implemented a centralized file processing tracking system to resolve persistent state issues in the FileSystemObserver, preventing files from being skipped after observer restarts. Also added a new concepts watcher.
Why Care
This refactor addresses a critical reliability issue where the observer would skip files after restarts due to persistent state in the processedFiles set. The new implementation provides a robust solution with configurable options for tracking file processing state, ensuring consistent behavior across process restarts and preventing infinite processing loops.
Implementation
Changes Made
New Files:
- tidyverse/observers/utils/processedFilesTracker.ts: Created a new utility using the singleton pattern to centralize file processing state management.
Modified Files:
- tidyverse/observers/fileSystemObserver.ts: Updated to use the centralized tracker instead of the static processedFiles set.
- tidyverse/observers/userOptionsConfig.ts: Added configuration option for critical files.
- tidyverse/observers/watchers/remindersWatcher.ts: Updated to use the centralized tracker.
- tidyverse/observers/watchers/vocabularyWatcher.ts: Updated to use the centralized tracker.
- tidyverse/observers/watchers/essaysWatcher.ts: Updated to use the centralized tracker.
- tidyverse/observers/index.ts: Updated initialization process.
Technical Details
ProcessedFilesTracker Utility
The core of this refactor is the new ProcessedFilesTracker utility, which implements a singleton pattern to ensure a single source of truth for file processing state:
typescript
// tidyverse/observers/utils/processedFilesTracker.ts
import * as crypto from 'crypto';
import * as fs from 'fs';
import * as path from 'path';

// Shape inferred from how entries are read and written in this file
interface ProcessedFileInfo {
  timestamp: number;
  hash?: string;
}

class ProcessedFilesTracker {
  // Singleton instance
  private static instance: ProcessedFilesTracker;
  // Map to track processed files with timestamps
  private processedFiles = new Map<string, ProcessedFileInfo>();
  // Configurable expiration time (default: 5 minutes)
  private expirationMs = 5 * 60 * 1000;
  // Critical files that should always be processed regardless of tracking
  private criticalFiles: string[] = [];

  // Get the singleton instance
  public static getInstance(): ProcessedFilesTracker {
    if (!ProcessedFilesTracker.instance) {
      ProcessedFilesTracker.instance = new ProcessedFilesTracker();
    }
    return ProcessedFilesTracker.instance;
  }

  // Check if a file should be processed
  public shouldProcess(filePath: string, forceProcess: boolean = false): boolean {
    // Always process if forced
    if (forceProcess) {
      console.log(`[ProcessedFilesTracker] Force processing requested for: ${filePath}`);
      return true;
    }
    // Check if file is in critical files list
    const fileName = path.basename(filePath).toLowerCase();
    if (this.criticalFiles.includes(fileName)) {
      console.log(`[ProcessedFilesTracker] Critical file detected: ${filePath}, will process`);
      return true;
    }
    // Check if file exists in processed set
    const fileInfo = this.processedFiles.get(filePath);
    if (!fileInfo) {
      return true; // File not processed before
    }
    // Check if the entry has expired
    const now = Date.now();
    if (now - fileInfo.timestamp > this.expirationMs) {
      console.log(`[ProcessedFilesTracker] Processing entry for ${filePath} has expired, will process again`);
      return true;
    }
    // If we have a hash, check if the content has changed
    if (fileInfo.hash) {
      try {
        if (fs.existsSync(filePath)) {
          const fileContent = fs.readFileSync(filePath, 'utf8');
          const currentHash = crypto.createHash('md5').update(fileContent).digest('hex');
          if (currentHash !== fileInfo.hash) {
            console.log(`[ProcessedFilesTracker] Content hash changed for ${filePath}, will process`);
            return true;
          }
        }
      } catch (error) {
        console.error(`[ProcessedFilesTracker] Error checking hash for ${filePath}:`, error);
        // If we can't check the hash, process the file to be safe
        return true;
      }
    }
    console.log(`[ProcessedFilesTracker] File ${filePath} was processed recently. Skipping.`);
    return false;
  }
}
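The expiration check in shouldProcess reads Date.now() directly, which makes the boundary behavior awkward to exercise in a test. A minimal sketch, assuming a hypothetical isExpired helper that is not part of this PR, pulls the comparison out so the current time can be passed in:

```typescript
// Hypothetical helper (not in the PR): mirrors the expiry comparison from
// shouldProcess, with the current time supplied by the caller.
interface ProcessedFileInfo {
  timestamp: number; // ms since epoch when the file was marked as processed
  hash?: string;     // optional content hash
}

// Same 5-minute default as the tracker's expirationMs field
const DEFAULT_EXPIRATION_MS = 5 * 60 * 1000;

function isExpired(
  info: ProcessedFileInfo,
  now: number,
  expirationMs: number = DEFAULT_EXPIRATION_MS
): boolean {
  // Strictly greater-than, matching `now - fileInfo.timestamp > this.expirationMs`
  return now - info.timestamp > expirationMs;
}
```

Because the comparison is strict, an entry that is exactly expirationMs old still counts as recent and the file is skipped.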
FileSystemObserver Integration
The FileSystemObserver was updated to use the centralized tracker instead of its static processedFiles set:
typescript
// tidyverse/observers/fileSystemObserver.ts
import {
  initializeProcessedFilesTracker,
  markFileAsProcessed,
  shouldProcessFile,
  resetProcessedFilesTracker,
  shutdownProcessedFilesTracker,
  processedFilesTracker
} from './utils/processedFilesTracker';

export class FileSystemObserver {
  // ...
  constructor(templateRegistry: TemplateRegistry, reportingService: ReportingService, contentRoot: string) {
    // ...
    // Initialize the processed files tracker with critical files from USER_OPTIONS
    initializeProcessedFilesTracker({
      criticalFiles: USER_OPTIONS.criticalFiles || []
    });
    console.log('[Observer] FileSystemObserver initialized with clean processed files state');
    if (USER_OPTIONS.criticalFiles && USER_OPTIONS.criticalFiles.length > 0) {
      console.log(`[Observer] Critical files configured: ${USER_OPTIONS.criticalFiles.join(', ')}`);
    }
  }

  public markFileAsProcessed(filePath: string): void {
    markFileAsProcessed(filePath);
  }

  public hasFileBeenProcessed(filePath: string): boolean {
    return !shouldProcessFile(filePath);
  }

  private async handleShutdown() {
    // ...
    // CRITICAL: Explicitly shut down the processed files tracker before exiting
    // This ensures that when the process is restarted, it starts with a clean slate
    console.log('[Observer] Shutting down processed files tracker');
    shutdownProcessedFilesTracker();
    // ...
  }
}
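The import list above names several module-level helpers whose bodies this description does not show. A minimal sketch of how they might delegate to the singleton follows. TrackerStub is a hypothetical, reduced stand-in for ProcessedFilesTracker, and the wrapper bodies are assumptions that only match the imported names:

```typescript
// Hypothetical stand-in for ProcessedFilesTracker, reduced to what the
// wrappers need (no expiration, hashing, or persistence).
class TrackerStub {
  private static instance: TrackerStub | undefined;
  private processed = new Set<string>();

  // Private constructor so the only way in is getInstance()
  private constructor() {}

  static getInstance(): TrackerStub {
    if (!TrackerStub.instance) {
      TrackerStub.instance = new TrackerStub();
    }
    return TrackerStub.instance;
  }

  shouldProcess(filePath: string): boolean {
    return !this.processed.has(filePath);
  }

  markAsProcessed(filePath: string): void {
    this.processed.add(filePath);
  }

  reset(): void {
    this.processed.clear();
  }
}

// Assumed wrapper bodies: each one forwards to the shared instance, so the
// observer and every watcher see the same state. In the real module these
// would be exported.
function shouldProcessFile(filePath: string): boolean {
  return TrackerStub.getInstance().shouldProcess(filePath);
}

function markFileAsProcessed(filePath: string): void {
  TrackerStub.getInstance().markAsProcessed(filePath);
}

function resetProcessedFilesTracker(): void {
  TrackerStub.getInstance().reset();
}
```

Keeping the wrappers as plain functions means callers never hold a reference to the tracker itself, which is one way to make the single-instance guarantee hard to bypass.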
Configuration for Critical Files
Added configuration for critical files in userOptionsConfig.ts:
typescript
// tidyverse/observers/userOptionsConfig.ts
export interface UserOptions {
  directories: DirectoryConfig[];
  AUTO_ADD_MISSING_FRONTMATTER_FIELDS?: boolean;
  /**
   * Critical files that should always be processed regardless of tracking status.
   * These files will bypass the processed files check and always be processed on each run.
   * Useful for files that need to be consistently monitored or that serve as triggers for other processes.
   * File names should be specified without paths (e.g., "example.md").
   */
  criticalFiles?: string[];
}

export const USER_OPTIONS: UserOptions = {
  // ...
  /**
   * Critical files that should always be processed regardless of tracking status.
   * These files will bypass the processed files check and always be processed on each run.
   */
  criticalFiles: [
    'Why Text Manipulation is Now Mission Critical.md'
  ],
};
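One detail worth checking against this configuration: shouldProcess lowercases the file's base name before comparing it with criticalFiles, so a configured name that contains capitals, like the one above, only matches if the list is lowercased as well. This description does not show where that happens. A minimal sketch, assuming the list is normalized once at initialization (both helper names are hypothetical):

```typescript
import * as path from "path";

// Hypothetical helper: lowercase the configured names once, up front.
function normalizeCriticalFiles(names: string[]): string[] {
  return names.map((name) => name.trim().toLowerCase());
}

// Hypothetical helper: same comparison as shouldProcess, against an
// already-normalized list.
function isCriticalFile(filePath: string, normalized: string[]): boolean {
  const fileName = path.basename(filePath).toLowerCase();
  return normalized.includes(fileName);
}
```

With the raw, mixed-case list, the lowercased base name never equals the configured entry, so the file would be treated like any other tracked file.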
Content Hashing for Change Detection
Implemented content hashing to detect actual file changes:
typescript
// tidyverse/observers/utils/processedFilesTracker.ts
public markAsProcessed(filePath: string, generateHash: boolean = false): void {
  console.log(`[ProcessedFilesTracker] Marking file as processed: ${filePath}`);
  // Special case: If filePath is 'RESET', reset the processed files set
  if (filePath === 'RESET') {
    console.log('[ProcessedFilesTracker] Received RESET signal');
    this.reset();
    return;
  }
  const fileInfo: ProcessedFileInfo = {
    timestamp: Date.now()
  };
  // Optionally generate a content hash to detect actual changes
  if (generateHash) {
    try {
      if (fs.existsSync(filePath)) {
        const fileContent = fs.readFileSync(filePath, 'utf8');
        fileInfo.hash = crypto.createHash('md5').update(fileContent).digest('hex');
        console.log(`[ProcessedFilesTracker] Generated content hash for ${filePath}: ${fileInfo.hash.substring(0, 8)}...`);
      } else {
        console.warn(`[ProcessedFilesTracker] Cannot generate hash for non-existent file: ${filePath}`);
      }
    } catch (error) {
      console.error(`[ProcessedFilesTracker] Error generating hash for ${filePath}:`, error);
    }
  }
  this.processedFiles.set(filePath, fileInfo);
  // Log periodically to avoid excessive output
  if (this.processedFiles.size % 10 === 0) {
    console.log(`[ProcessedFilesTracker] Total processed files: ${this.processedFiles.size}`);
  }
  // Persist state to file if enabled
  if (this.persistStateToFile) {
    this.saveStateToFile();
  }
}
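The hash comparison can be looked at separately from the file I/O. A minimal sketch on in-memory strings, assuming two hypothetical helpers (contentHash and hasContentChanged) that are not part of this PR:

```typescript
import { createHash } from "crypto";

// Hypothetical helper: same MD5 hex digest that markAsProcessed stores.
function contentHash(content: string): string {
  return createHash("md5").update(content).digest("hex");
}

// Hypothetical helper: true only when a previous hash exists and differs.
// With no stored hash there is nothing to compare, which mirrors how
// shouldProcess skips the hash check when fileInfo.hash is unset.
function hasContentChanged(
  previousHash: string | undefined,
  currentContent: string
): boolean {
  if (previousHash === undefined) {
    return false;
  }
  return contentHash(currentContent) !== previousHash;
}
```

MD5 is used here purely as a change detector, as in the tracker, not as a security measure.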
Robust Error Handling
Enhanced error handling in state persistence operations:
typescript
// tidyverse/observers/utils/processedFilesTracker.ts
private loadStateFromFile(): void {
  if (!this.persistStateToFile) {
    console.log('[ProcessedFilesTracker] State persistence is disabled, skipping state load');
    return;
  }
  try {
    console.log(`[ProcessedFilesTracker] Attempting to load state from: ${this.stateFilePath}`);
    if (!fs.existsSync(this.stateFilePath)) {
      console.log('[ProcessedFilesTracker] State file does not exist, starting with empty state');
      return;
    }
    // Check if file is readable
    try {
      fs.accessSync(this.stateFilePath, fs.constants.R_OK);
    } catch (accessError) {
      console.error(`[ProcessedFilesTracker] Cannot read state file: ${this.stateFilePath}`, accessError);
      return;
    }
    const data = fs.readFileSync(this.stateFilePath, 'utf8');
    if (!data || data.trim() === '') {
      console.log('[ProcessedFilesTracker] State file is empty, starting with empty state');
      return;
    }
    // Parse and validate state
    try {
      const state = JSON.parse(data);
      if (!state || typeof state !== 'object' || !state.processedFiles) {
        console.error('[ProcessedFilesTracker] Invalid state file format, starting with empty state');
        return;
      }
      // Convert the loaded state back to a Map
      this.processedFiles = new Map(Object.entries(state.processedFiles));
      // Validate and clean up loaded entries
      let invalidEntries = 0;
      for (const [filePath, info] of this.processedFiles.entries()) {
        if (!info || typeof info !== 'object' || typeof info.timestamp !== 'number') {
          this.processedFiles.delete(filePath);
          invalidEntries++;
        }
      }
      if (invalidEntries > 0) {
        console.warn(`[ProcessedFilesTracker] Removed ${invalidEntries} invalid entries from loaded state`);
      }
      console.log(`[ProcessedFilesTracker] Successfully loaded ${this.processedFiles.size} processed file entries from state file`);
    } catch (parseError) {
      console.error('[ProcessedFilesTracker] Error parsing state file JSON:', parseError);
    }
  } catch (error) {
    console.error('[ProcessedFilesTracker] Error loading state from file:', error);
    // Ensure we start with a clean state in case of errors
    this.processedFiles.clear();
    console.log('[ProcessedFilesTracker] Reset to empty state due to load error');
  }
}
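The loader above expects a JSON object with a processedFiles key that maps file paths to entries. The matching saveStateToFile is not shown in this description, so the following is only a sketch of a round trip under that assumed format, kept free of file I/O so the validation rules are easy to test (serializeState and parseState are hypothetical names):

```typescript
interface ProcessedFileInfo {
  timestamp: number;
  hash?: string;
}

// Hypothetical counterpart to the loader: Map -> JSON text in the
// { processedFiles: { [path]: info } } layout the loader reads.
function serializeState(files: Map<string, ProcessedFileInfo>): string {
  const processedFiles: Record<string, ProcessedFileInfo> = {};
  files.forEach((info, filePath) => {
    processedFiles[filePath] = info;
  });
  return JSON.stringify({ processedFiles }, null, 2);
}

// Hypothetical pure version of the loader's parse-and-validate step.
// Any malformed input yields an empty or partial Map instead of throwing.
function parseState(data: string): Map<string, ProcessedFileInfo> {
  const result = new Map<string, ProcessedFileInfo>();
  let state: unknown;
  try {
    state = JSON.parse(data);
  } catch (_error) {
    return result; // not valid JSON: start with empty state
  }
  if (!state || typeof state !== "object") {
    return result;
  }
  const files = (state as { processedFiles?: unknown }).processedFiles;
  if (!files || typeof files !== "object") {
    return result;
  }
  const record = files as Record<string, unknown>;
  for (const filePath of Object.keys(record)) {
    const info = record[filePath] as Partial<ProcessedFileInfo> | null;
    // Same rule as the loader: an entry needs a numeric timestamp
    if (info && typeof info === "object" && typeof info.timestamp === "number") {
      const entry: ProcessedFileInfo = { timestamp: info.timestamp };
      if (typeof info.hash === "string") {
        entry.hash = info.hash;
      }
      result.set(filePath, entry);
    }
  }
  return result;
}
```

Separating parsing from file access also keeps the nested try/catch in the loader down to the parts that can actually fail on I/O.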
Integration Points
Watchers Integration
All watchers (Essays, Vocabulary, Reminders) were updated to use the centralized tracker:
typescript
// tidyverse/observers/watchers/vocabularyWatcher.ts
constructor(
  reportingService: ReportingService,
  vocabularyDir: string,
  markFileAsProcessed: (filePath: string) => void,
  hasFileBeenProcessed: (filePath: string) => boolean
) {
  // ...
  this.markFileAsProcessed = markFileAsProcessed;
  this.hasFileBeenProcessed = hasFileBeenProcessed;
  // ...
}

private async handleFile(filePath: string, eventType: string) {
  // === CRITICAL: Prevent infinite loop by skipping files already processed in this session ===
  if (this.hasFileBeenProcessed(filePath)) {
    console.log(`[VocabularyWatcher] [SKIP] File already processed in this session, skipping: ${filePath}`);
    return;
  }
  // Add file to processed set to prevent future processing in this session
  this.markFileAsProcessed(filePath);
  // ...
}
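To illustrate why passing the two callbacks matters, here is a minimal sketch with a hypothetical DemoWatcher and an in-memory Set standing in for the centralized tracker. The real watchers also take a ReportingService and a directory, which are left out here:

```typescript
type MarkFn = (filePath: string) => void;
type HasBeenProcessedFn = (filePath: string) => boolean;

// Hypothetical minimal watcher: same skip-then-mark order as handleFile above.
class DemoWatcher {
  public handled: string[] = [];
  private markFileAsProcessed: MarkFn;
  private hasFileBeenProcessed: HasBeenProcessedFn;

  constructor(markFileAsProcessed: MarkFn, hasFileBeenProcessed: HasBeenProcessedFn) {
    this.markFileAsProcessed = markFileAsProcessed;
    this.hasFileBeenProcessed = hasFileBeenProcessed;
  }

  handleFile(filePath: string): void {
    if (this.hasFileBeenProcessed(filePath)) {
      return; // already handled in this session
    }
    this.markFileAsProcessed(filePath);
    this.handled.push(filePath);
  }
}

// Stand-in for the centralized tracker's state, shared by both watchers
const processed = new Set<string>();
const mark: MarkFn = (filePath) => {
  processed.add(filePath);
};
const has: HasBeenProcessedFn = (filePath) => processed.has(filePath);

const watcherA = new DemoWatcher(mark, has);
const watcherB = new DemoWatcher(mark, has);

watcherA.handleFile("/content/vocabulary/term.md");
watcherA.handleFile("/content/vocabulary/term.md"); // skipped: same watcher, same file
watcherB.handleFile("/content/vocabulary/term.md"); // skipped: state is shared
```

Since both watchers receive the same callbacks, a file one watcher has handled is also skipped by the other, which is the behavior the centralized tracker is meant to guarantee.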
Environment Variables
The implementation supports the following environment variables:
- PERSIST_OBSERVER_STATE: If set to "true", the tracker will persist its state to a file.
- OBSERVER_STATE_FILE: Specifies the path to the state file (defaults to .observer-state.json in the same directory as processedFilesTracker.ts).
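How the tracker reads these variables is not shown in this description. A minimal sketch, assuming a hypothetical resolvePersistenceOptions helper that takes the environment as a plain object and treats only the exact string "true" as enabling persistence:

```typescript
interface PersistenceOptions {
  persistStateToFile: boolean;
  stateFilePath: string;
}

// Hypothetical helper: derives the two settings from an environment object.
// The exact-match rule for "true" is an assumption, not confirmed by the PR.
function resolvePersistenceOptions(
  env: Record<string, string | undefined>,
  defaultStateFilePath: string
): PersistenceOptions {
  return {
    persistStateToFile: env["PERSIST_OBSERVER_STATE"] === "true",
    // Fall back to the default when the variable is unset or empty
    stateFilePath: env["OBSERVER_STATE_FILE"] || defaultStateFilePath,
  };
}
```

In the tracker this would be called with process.env and a default path built from the module's own directory. Taking the environment as a parameter keeps the function testable without touching real process state.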
Documentation
This refactor follows several key design principles:
- Singleton Pattern: Ensures a single source of truth for file processing state.
- Centralized State Management: Moves file processing state tracking to a dedicated utility.
- Expiration-Based Tracking: Implements a timestamp-based expiration mechanism for processed files.
- Critical File Handling: Adds logic to force processing of specific files.
- Optional State Persistence: Provides an option to persist processed files state to disk.
- Content Hashing: Detects actual file changes to avoid unnecessary processing.
The code is extensively commented to explain the purpose and behavior of each component, following the project's aggressive commenting guidelines.