Implement Centralized File Processing State Management

Summary

Implemented a centralized file-processing tracker to resolve stale-state issues in the FileSystemObserver, so that files are no longer skipped after the observer restarts. Also added a new concepts watcher.

Why Care

This refactor addresses a critical reliability issue: the observer skipped files after a restart because entries lingered in its static processedFiles set. The new tracker replaces that set with configurable state (timestamp expiration, a critical-files list, optional persistence, and content hashing), which keeps behavior consistent across restarts while still preventing infinite processing loops.

Implementation

Changes Made

New Files:

  • tidyverse/observers/utils/processedFilesTracker.ts: Created a new utility using the singleton pattern to centralize file processing state management.

Modified Files:

  • tidyverse/observers/fileSystemObserver.ts: Updated to use the centralized tracker instead of its static processedFiles set.
  • tidyverse/observers/userOptionsConfig.ts: Added configuration option for critical files.
  • tidyverse/observers/watchers/remindersWatcher.ts: Updated to use the centralized tracker.
  • tidyverse/observers/watchers/vocabularyWatcher.ts: Updated to use the centralized tracker.
  • tidyverse/observers/watchers/essaysWatcher.ts: Updated to use the centralized tracker.
  • tidyverse/observers/index.ts: Updated initialization process.

Technical Details

ProcessedFilesTracker Utility

The core of this refactor is the new ProcessedFilesTracker utility, which implements a singleton pattern to ensure a single source of truth for file processing state:
```typescript
// tidyverse/observers/utils/processedFilesTracker.ts
import * as crypto from 'crypto';
import * as fs from 'fs';
import * as path from 'path';

// Shape of a tracked entry, inferred from how the fields are used below
interface ProcessedFileInfo {
  timestamp: number; // When the file was marked as processed (ms since epoch)
  hash?: string;     // Optional MD5 of the file content at that time
}

class ProcessedFilesTracker {
  // Singleton instance
  private static instance: ProcessedFilesTracker;

  // Map to track processed files with timestamps
  private processedFiles = new Map<string, ProcessedFileInfo>();

  // Configurable expiration time (default: 5 minutes)
  private expirationMs = 5 * 60 * 1000;

  // Critical files that should always be processed regardless of tracking
  private criticalFiles: string[] = [];

  // Get the singleton instance
  public static getInstance(): ProcessedFilesTracker {
    if (!ProcessedFilesTracker.instance) {
      ProcessedFilesTracker.instance = new ProcessedFilesTracker();
    }
    return ProcessedFilesTracker.instance;
  }

  // Check if a file should be processed
  public shouldProcess(filePath: string, forceProcess: boolean = false): boolean {
    // Always process if forced
    if (forceProcess) {
      console.log(`[ProcessedFilesTracker] Force processing requested for: ${filePath}`);
      return true;
    }

    // Check if file is in critical files list.
    // The basename is lower-cased here, so entries in criticalFiles
    // must be stored lower-cased for this match to succeed.
    const fileName = path.basename(filePath).toLowerCase();
    if (this.criticalFiles.includes(fileName)) {
      console.log(`[ProcessedFilesTracker] Critical file detected: ${filePath}, will process`);
      return true;
    }

    // Check if file exists in the processed map
    const fileInfo = this.processedFiles.get(filePath);
    if (!fileInfo) {
      return true; // File not processed before
    }

    // Check if the entry has expired
    const now = Date.now();
    if (now - fileInfo.timestamp > this.expirationMs) {
      console.log(`[ProcessedFilesTracker] Processing entry for ${filePath} has expired, will process again`);
      return true;
    }

    // If we have a hash, check if the content has changed
    if (fileInfo.hash) {
      try {
        if (fs.existsSync(filePath)) {
          const fileContent = fs.readFileSync(filePath, 'utf8');
          const currentHash = crypto.createHash('md5').update(fileContent).digest('hex');

          if (currentHash !== fileInfo.hash) {
            console.log(`[ProcessedFilesTracker] Content hash changed for ${filePath}, will process`);
            return true;
          }
        }
      } catch (error) {
        console.error(`[ProcessedFilesTracker] Error checking hash for ${filePath}:`, error);
        // If we can't check the hash, process the file to be safe
        return true;
      }
    }

    console.log(`[ProcessedFilesTracker] File ${filePath} was processed recently. Skipping.`);
    return false;
  }
}
```

FileSystemObserver Integration

The FileSystemObserver was updated to use the centralized tracker instead of its static processedFiles set:
```typescript
// tidyverse/observers/fileSystemObserver.ts
import {
  initializeProcessedFilesTracker,
  markFileAsProcessed,
  shouldProcessFile,
  resetProcessedFilesTracker,
  shutdownProcessedFilesTracker,
  processedFilesTracker
} from './utils/processedFilesTracker';

export class FileSystemObserver {
  // ...

  constructor(templateRegistry: TemplateRegistry, reportingService: ReportingService, contentRoot: string) {
    // ...

    // Initialize the processed files tracker with critical files from USER_OPTIONS
    initializeProcessedFilesTracker({
      criticalFiles: USER_OPTIONS.criticalFiles || []
    });

    console.log('[Observer] FileSystemObserver initialized with clean processed files state');
    if (USER_OPTIONS.criticalFiles && USER_OPTIONS.criticalFiles.length > 0) {
      console.log(`[Observer] Critical files configured: ${USER_OPTIONS.criticalFiles.join(', ')}`);
    }
  }

  public markFileAsProcessed(filePath: string): void {
    markFileAsProcessed(filePath);
  }

  // Inverse of shouldProcessFile: returns false for critical files and
  // expired entries, so callers will process those files again
  public hasFileBeenProcessed(filePath: string): boolean {
    return !shouldProcessFile(filePath);
  }

  private async handleShutdown() {
    // ...

    // CRITICAL: Explicitly shut down the processed files tracker before exiting
    // This ensures that when the process is restarted, it starts with a clean slate
    console.log('[Observer] Shutting down processed files tracker');
    shutdownProcessedFilesTracker();

    // ...
  }
}
```
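
The helper functions imported above are not shown in this description. The following is a minimal sketch of how such module-level wrappers might delegate to the singleton. `MiniTracker` is a hypothetical in-memory stand-in (no expiration, hashing, or persistence), and the wrapper bodies are assumptions rather than the actual implementation:

```typescript
// Hypothetical, simplified stand-in for ProcessedFilesTracker.
class MiniTracker {
  private static instance: MiniTracker | undefined;
  private processed = new Map<string, number>(); // file path -> time marked (ms)
  private criticalFiles: string[] = [];

  static getInstance(): MiniTracker {
    if (!MiniTracker.instance) {
      MiniTracker.instance = new MiniTracker();
    }
    return MiniTracker.instance;
  }

  setCriticalFiles(names: string[]): void {
    // Stored lower-cased so they match the lower-cased basename below
    this.criticalFiles = names.map((name) => name.toLowerCase());
  }

  shouldProcess(filePath: string): boolean {
    const segments = filePath.split(/[\\/]/);
    const fileName = segments[segments.length - 1].toLowerCase();
    if (this.criticalFiles.includes(fileName)) {
      return true; // Critical files bypass tracking
    }
    return !this.processed.has(filePath);
  }

  markAsProcessed(filePath: string): void {
    this.processed.set(filePath, Date.now());
  }

  reset(): void {
    this.processed.clear();
  }
}

// Module-level wrappers: callers never touch the singleton directly.
function initializeProcessedFilesTracker(options: { criticalFiles?: string[] } = {}): void {
  const tracker = MiniTracker.getInstance();
  tracker.reset(); // Start every observer run from a clean slate
  tracker.setCriticalFiles(options.criticalFiles ?? []);
}

function shouldProcessFile(filePath: string): boolean {
  return MiniTracker.getInstance().shouldProcess(filePath);
}

function markFileAsProcessed(filePath: string): void {
  MiniTracker.getInstance().markAsProcessed(filePath);
}

function shutdownProcessedFilesTracker(): void {
  MiniTracker.getInstance().reset();
}
```

Because every wrapper resolves the same instance, the observer and all watchers see one shared view of which files have been handled, and a shutdown followed by a new initialization leaves no stale entries behind.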

Configuration for Critical Files

Added configuration for critical files in userOptionsConfig.ts:
```typescript
// tidyverse/observers/userOptionsConfig.ts
export interface UserOptions {
  directories: DirectoryConfig[];
  AUTO_ADD_MISSING_FRONTMATTER_FIELDS?: boolean;

  /**
   * Critical files that should always be processed regardless of tracking status.
   * These files will bypass the processed files check and always be processed on each run.
   * Useful for files that need to be consistently monitored or that serve as triggers for other processes.
   * File names should be specified without paths (e.g., "example.md").
   */
  criticalFiles?: string[];
}

export const USER_OPTIONS: UserOptions = {
  // ...

  /**
   * Critical files that should always be processed regardless of tracking status.
   * These files will bypass the processed files check and always be processed on each run.
   */
  criticalFiles: [
    'Why Text Manipulation is Now Mission Critical.md'
  ],
};
```
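
One detail worth checking: shouldProcess compares the lower-cased basename of each path against criticalFiles, while the entry configured here is mixed-case. The tracker's initialization code is not shown, so whether it normalizes the list is not known from this description. The sketch below uses two hypothetical helpers, `normalizeCriticalFiles` and `isCriticalFile`, to illustrate why the list has to be lower-cased before it is stored:

```typescript
import * as path from 'path';

// Hypothetical helper: reduce configured names to lower-cased basenames.
function normalizeCriticalFiles(names: string[]): string[] {
  return names.map((name) => path.basename(name).trim().toLowerCase());
}

// Mirrors the comparison in shouldProcess: lower-cased basename vs. the list.
function isCriticalFile(filePath: string, criticalFiles: string[]): boolean {
  const fileName = path.basename(filePath).toLowerCase();
  return criticalFiles.includes(fileName);
}

const configured = ['Why Text Manipulation is Now Mission Critical.md'];
const filePath = '/content/essays/Why Text Manipulation is Now Mission Critical.md';

// The raw, mixed-case list never matches the lower-cased basename.
const matchesRaw = isCriticalFile(filePath, configured);

// After normalization the same file is recognized as critical.
const matchesNormalized = isCriticalFile(filePath, normalizeCriticalFiles(configured));

console.log({ matchesRaw, matchesNormalized });
```

If the real initializer stores the names verbatim, the configured file would be tracked like any other file instead of bypassing the check.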

Content Hashing for Change Detection

Implemented content hashing to detect actual file changes:
```typescript
// tidyverse/observers/utils/processedFilesTracker.ts
public markAsProcessed(filePath: string, generateHash: boolean = false): void {
  // Special case: If filePath is 'RESET', reset the processed files map.
  // Checked first so the sentinel is never logged as if it were a real file.
  if (filePath === 'RESET') {
    console.log('[ProcessedFilesTracker] Received RESET signal');
    this.reset();
    return;
  }

  console.log(`[ProcessedFilesTracker] Marking file as processed: ${filePath}`);

  const fileInfo: ProcessedFileInfo = {
    timestamp: Date.now()
  };

  // Optionally generate a content hash to detect actual changes
  if (generateHash) {
    try {
      if (fs.existsSync(filePath)) {
        const fileContent = fs.readFileSync(filePath, 'utf8');
        fileInfo.hash = crypto.createHash('md5').update(fileContent).digest('hex');
        console.log(`[ProcessedFilesTracker] Generated content hash for ${filePath}: ${fileInfo.hash.substring(0, 8)}...`);
      } else {
        console.warn(`[ProcessedFilesTracker] Cannot generate hash for non-existent file: ${filePath}`);
      }
    } catch (error) {
      console.error(`[ProcessedFilesTracker] Error generating hash for ${filePath}:`, error);
    }
  }

  this.processedFiles.set(filePath, fileInfo);

  // Log periodically to avoid excessive output
  if (this.processedFiles.size % 10 === 0) {
    console.log(`[ProcessedFilesTracker] Total processed files: ${this.processedFiles.size}`);
  }

  // Persist state to file if enabled
  if (this.persistStateToFile) {
    this.saveStateToFile();
  }
}
```
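
Both markAsProcessed and shouldProcess compute the same MD5 digest inline. As an illustration only, the comparison can be isolated into two small functions. `contentHash` and `hasContentChanged` are hypothetical names, and the sketch works on strings rather than reading from disk:

```typescript
import * as crypto from 'crypto';

// Hypothetical helper: MD5 hex digest of a file's text content.
// MD5 is used here only for change detection, not for security.
function contentHash(content: string): string {
  return crypto.createHash('md5').update(content).digest('hex');
}

// Hypothetical helper: true when the current content no longer
// matches the hash recorded at the time the file was processed.
function hasContentChanged(recordedHash: string, currentContent: string): boolean {
  return contentHash(currentContent) !== recordedHash;
}

const recorded = contentHash('# Draft\nfirst version');

const unchanged = hasContentChanged(recorded, '# Draft\nfirst version');
const changed = hasContentChanged(recorded, '# Draft\nsecond version');

console.log({ unchanged, changed });
```

A file whose timestamp is still within the expiration window is reprocessed only when the second case applies, which is what lets the tracker skip save events that do not alter the content.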

Robust Error Handling

Enhanced error handling in state persistence operations:
```typescript
// tidyverse/observers/utils/processedFilesTracker.ts
private loadStateFromFile(): void {
  if (!this.persistStateToFile) {
    console.log('[ProcessedFilesTracker] State persistence is disabled, skipping state load');
    return;
  }

  try {
    console.log(`[ProcessedFilesTracker] Attempting to load state from: ${this.stateFilePath}`);

    if (!fs.existsSync(this.stateFilePath)) {
      console.log('[ProcessedFilesTracker] State file does not exist, starting with empty state');
      return;
    }

    // Check if file is readable
    try {
      fs.accessSync(this.stateFilePath, fs.constants.R_OK);
    } catch (accessError) {
      console.error(`[ProcessedFilesTracker] Cannot read state file: ${this.stateFilePath}`, accessError);
      return;
    }

    const data = fs.readFileSync(this.stateFilePath, 'utf8');
    if (!data || data.trim() === '') {
      console.log('[ProcessedFilesTracker] State file is empty, starting with empty state');
      return;
    }

    // Parse and validate state
    try {
      const state = JSON.parse(data);

      if (!state || typeof state !== 'object' || !state.processedFiles) {
        console.error('[ProcessedFilesTracker] Invalid state file format, starting with empty state');
        return;
      }

      // Convert the loaded state back to a Map
      this.processedFiles = new Map(Object.entries(state.processedFiles));

      // Validate and clean up loaded entries
      let invalidEntries = 0;
      for (const [filePath, info] of this.processedFiles.entries()) {
        if (!info || typeof info !== 'object' || typeof info.timestamp !== 'number') {
          this.processedFiles.delete(filePath);
          invalidEntries++;
        }
      }

      if (invalidEntries > 0) {
        console.warn(`[ProcessedFilesTracker] Removed ${invalidEntries} invalid entries from loaded state`);
      }

      console.log(`[ProcessedFilesTracker] Successfully loaded ${this.processedFiles.size} processed file entries from state file`);
    } catch (parseError) {
      console.error('[ProcessedFilesTracker] Error parsing state file JSON:', parseError);
    }
  } catch (error) {
    console.error('[ProcessedFilesTracker] Error loading state from file:', error);
    // Ensure we start with a clean state in case of errors
    this.processedFiles.clear();
    console.log('[ProcessedFilesTracker] Reset to empty state due to load error');
  }
}
```
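
The counterpart saveStateToFile is referenced earlier but not shown. Since the loader rebuilds the Map from `Object.entries(state.processedFiles)`, the saved JSON presumably stores the Map as a plain object under a `processedFiles` key. A minimal sketch of that round trip, with hypothetical free functions `serializeState` and `deserializeState` standing in for the private methods and no file I/O:

```typescript
interface ProcessedFileInfo {
  timestamp: number;
  hash?: string;
}

// Hypothetical: turn the Map into the JSON shape the loader expects.
function serializeState(processedFiles: Map<string, ProcessedFileInfo>): string {
  const plain: Record<string, ProcessedFileInfo> = {};
  processedFiles.forEach((info, filePath) => {
    plain[filePath] = info;
  });
  return JSON.stringify({ processedFiles: plain }, null, 2);
}

// Hypothetical: parse, validate, and drop malformed entries,
// falling back to an empty Map on any structural problem.
function deserializeState(json: string): Map<string, ProcessedFileInfo> {
  const result = new Map<string, ProcessedFileInfo>();
  let state: unknown;
  try {
    state = JSON.parse(json);
  } catch {
    return result; // Unparseable JSON: start empty
  }
  if (!state || typeof state !== 'object') {
    return result;
  }
  const entries = (state as { processedFiles?: unknown }).processedFiles;
  if (!entries || typeof entries !== 'object') {
    return result;
  }
  for (const [filePath, info] of Object.entries(entries as Record<string, unknown>)) {
    const candidate = info as Partial<ProcessedFileInfo> | null;
    if (candidate && typeof candidate === 'object' && typeof candidate.timestamp === 'number') {
      const entry: ProcessedFileInfo = { timestamp: candidate.timestamp };
      if (typeof candidate.hash === 'string') {
        entry.hash = candidate.hash;
      }
      result.set(filePath, entry);
    }
  }
  return result;
}
```

Building a fresh Map from validated entries avoids assigning unvalidated data to the tracker first and cleaning it up afterwards, at the cost of one extra pass over the entries.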

Integration Points

Watchers Integration

All watchers (Essays, Vocabulary, Reminders) were updated to use the centralized tracker:
```typescript
// tidyverse/observers/watchers/vocabularyWatcher.ts
constructor(
  reportingService: ReportingService,
  vocabularyDir: string,
  markFileAsProcessed: (filePath: string) => void,
  hasFileBeenProcessed: (filePath: string) => boolean
) {
  // ...
  this.markFileAsProcessed = markFileAsProcessed;
  this.hasFileBeenProcessed = hasFileBeenProcessed;
  // ...
}

private async handleFile(filePath: string, eventType: string) {
  // === CRITICAL: Prevent infinite loop by skipping files already processed in this session ===
  if (this.hasFileBeenProcessed(filePath)) {
    console.log(`[VocabularyWatcher] [SKIP] File already processed in this session, skipping: ${filePath}`);
    return;
  }

  // Add file to processed set to prevent future processing in this session
  this.markFileAsProcessed(filePath);

  // ...
}
```
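
The watcher receives the two tracker callbacks through its constructor rather than importing the tracker itself. A minimal sketch of how the observer might wire them in follows. `ObserverStub` and `WatcherStub` are hypothetical stand-ins that omit the real constructor arguments (reporting service, directory) and use a plain Set in place of the tracker:

```typescript
type MarkFn = (filePath: string) => void;
type CheckFn = (filePath: string) => boolean;

// Hypothetical stand-in for FileSystemObserver's two public methods.
class ObserverStub {
  private processed = new Set<string>();

  markFileAsProcessed(filePath: string): void {
    this.processed.add(filePath);
  }

  hasFileBeenProcessed(filePath: string): boolean {
    return this.processed.has(filePath);
  }
}

// Hypothetical stand-in for a watcher such as VocabularyWatcher.
class WatcherStub {
  public handled: string[] = [];
  private markFileAsProcessed: MarkFn;
  private hasFileBeenProcessed: CheckFn;

  constructor(markFileAsProcessed: MarkFn, hasFileBeenProcessed: CheckFn) {
    this.markFileAsProcessed = markFileAsProcessed;
    this.hasFileBeenProcessed = hasFileBeenProcessed;
  }

  handleFile(filePath: string): void {
    if (this.hasFileBeenProcessed(filePath)) {
      return; // Already handled: skip to avoid a processing loop
    }
    this.markFileAsProcessed(filePath);
    this.handled.push(filePath);
  }
}

const observer = new ObserverStub();

// Arrow functions keep `this` bound to the observer instance.
const vocabularyWatcher = new WatcherStub(
  (filePath) => observer.markFileAsProcessed(filePath),
  (filePath) => observer.hasFileBeenProcessed(filePath),
);
const essaysWatcher = new WatcherStub(
  (filePath) => observer.markFileAsProcessed(filePath),
  (filePath) => observer.hasFileBeenProcessed(filePath),
);

vocabularyWatcher.handleFile('/content/vocabulary/term.md');
vocabularyWatcher.handleFile('/content/vocabulary/term.md'); // Skipped: already marked
essaysWatcher.handleFile('/content/vocabulary/term.md');     // Skipped: state is shared
```

Passing the methods as arrow functions, rather than as bare `observer.markFileAsProcessed` references, avoids losing the `this` binding when the watcher later invokes them.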

Environment Variables

The implementation supports the following environment variables:
  • PERSIST_OBSERVER_STATE: If set to "true", the tracker will persist its state to a file.
  • OBSERVER_STATE_FILE: Specifies the path to the state file (defaults to .observer-state.json in the same directory as processedFilesTracker.ts).
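
A minimal sketch of how these two variables could be resolved into tracker settings. The function `resolvePersistenceConfig` and its parameters are hypothetical. Only the "true" check, the default file name, and its default location come from the description above, and treating the comparison as exact and case-sensitive is an assumption:

```typescript
import * as path from 'path';

interface PersistenceConfig {
  persistStateToFile: boolean;
  stateFilePath: string;
}

// Hypothetical helper: `env` is typically process.env, and `trackerDir`
// is the directory that contains processedFilesTracker.ts.
function resolvePersistenceConfig(
  env: Record<string, string | undefined>,
  trackerDir: string,
): PersistenceConfig {
  return {
    // Assumption: only the exact string "true" enables persistence
    persistStateToFile: env.PERSIST_OBSERVER_STATE === 'true',
    // Fall back to .observer-state.json next to the tracker module
    stateFilePath: env.OBSERVER_STATE_FILE || path.join(trackerDir, '.observer-state.json'),
  };
}

const defaults = resolvePersistenceConfig({}, '/app/observers/utils');
const custom = resolvePersistenceConfig(
  { PERSIST_OBSERVER_STATE: 'true', OBSERVER_STATE_FILE: '/tmp/observer-state.json' },
  '/app/observers/utils',
);

console.log(defaults, custom);
```

Taking the environment as a parameter instead of reading process.env inside the function keeps the resolution logic easy to test. In the tracker itself the call would presumably pass process.env and the module's own directory.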

Documentation

This refactor follows several key design principles:
  1. Singleton Pattern: Ensures a single source of truth for file processing state.
  2. Centralized State Management: Moves file processing state tracking to a dedicated utility.
  3. Expiration-Based Tracking: Implements a timestamp-based expiration mechanism for processed files.
  4. Critical File Handling: Adds logic to force processing of specific files.
  5. Optional State Persistence: Provides an option to persist processed files state to disk.
  6. Content Hashing: Detects actual file changes to avoid unnecessary processing.

The code is extensively commented to explain the purpose and behavior of each component, following the project's aggressive commenting guidelines.