Persistent File Processing State in Observer

Persistent File Processing State in Observer

Context

The FileSystemObserver is a core component of our content management system that watches for changes in Markdown files and processes their frontmatter. It uses a processedFiles set to track which files have already been processed to prevent infinite loops and duplicate processing.
The issue we encountered was that after restarting the observer (e.g., via nodemon during development), certain files were being skipped entirely. Specifically, files that had been processed before the restart were not being processed again, even though they should be.
This was causing problems with content updates not being properly reflected in the system, particularly with files like "Why Text Manipulation is Now Mission Critical.md" being consistently skipped after restarts.

Incorrect Attempts

Attempt 1: Adding a Reset Method Call in Constructor

Our first attempt was to add a call to resetProcessedFiles() in the constructor of the FileSystemObserver class:
typescript
constructor(
  private templateRegistry: TemplateRegistry,
  private reportingService: ReportingService
) {
  // Reset the processed files set when a new instance is created
  FileSystemObserver.resetProcessedFiles();
  
  // Initialize watchers
  this.remindersWatcher = new RemindersWatcher(this.templateRegistry);
  this.vocabularyWatcher = new VocabularyWatcher(this.templateRegistry);
  this.essaysWatcher = new EssaysWatcher(this.templateRegistry);
}
However, this didn't solve the problem because the processedFiles set was still defined as a static class property:
typescript
// Static set to track processed files and prevent infinite loops
private static processedFiles = new Set<string>();

// Method to reset the processed files set
public static resetProcessedFiles() {
  FileSystemObserver.processedFiles.clear();
  console.log('Processed files set has been reset.');
}
The issue with this approach was that even though we were calling resetProcessedFiles() in the constructor, the static variable persisted across Node.js process restarts when using tools like nodemon that don't fully terminate the process.

Attempt 2: Converting to Instance Property

Our second attempt was to convert the static processedFiles set to an instance property:
typescript
// Instance set to track processed files and prevent infinite loops
private processedFiles = new Set<string>();

// Method to reset the processed files set
public resetProcessedFiles() {
  this.processedFiles.clear();
  console.log('Processed files set has been reset.');
}
And we updated all references to use the instance property:
typescript
// In the onChange method
if (this.processedFiles.has(filePath)) {
  console.log(`File ${filePath} has already been processed. Skipping.`);
  return;
}

// Mark file as processed
this.processedFiles.add(filePath);
console.log(`Marked file as processed: ${filePath}`);
console.log(`Total processed files: ${this.processedFiles.size}`);
We also added a call to this.resetProcessedFiles() in the constructor:
typescript
constructor(
  private templateRegistry: TemplateRegistry,
  private reportingService: ReportingService
) {
  // Reset the processed files set when a new instance is created
  this.resetProcessedFiles();
  
  // Initialize watchers
  this.remindersWatcher = new RemindersWatcher(this.templateRegistry);
  this.vocabularyWatcher = new VocabularyWatcher(this.templateRegistry);
  this.essaysWatcher = new EssaysWatcher(this.templateRegistry);
}
We also added shutdown diagnostics to track the number of processed files at shutdown:
typescript
private handleShutdown = () => {
  // ... existing code ...
  
  console.log(`Number of processed files at shutdown: ${this.processedFiles.size}`);
  
  // ... rest of shutdown handling ...
};
However, this approach still didn't completely solve the issue.

Attempt 3: Multi-layered Reset Approach

Our third attempt involved a more comprehensive approach to ensure that the file processing state doesn't persist between restarts:
  1. FileSystemObserver Shutdown Reset:
    • Modified the handleShutdown method to explicitly reset the processedFiles set before exiting:
typescript
private async handleShutdown() {
  // ... existing code ...
  
  finally {
    // ... existing code ...
    
    // CRITICAL: Explicitly reset the processedFiles set before exiting
    // This ensures that when the process is restarted, it starts with a clean slate
    console.log('[Observer] Explicitly resetting processed files tracking before exit');
    this.resetProcessedFiles();
    console.log('[Observer] Processed files tracking has been reset');
    
    console.log('[Observer] Exiting in 250ms...');
    setTimeout(() => {
      console.log('[Observer] Process exit now.');
      process.exit(0);
    }, 250);
  }
}
  1. RemindersWatcher Reset:
    • Added a resetProcessedFiles static method to the RemindersWatcher class:
typescript
/**
 * Reset the processed files set
 * This ensures a clean slate for the next session
 */
public static resetProcessedFiles(): void {
  console.log('[RemindersWatcher] Resetting processed files tracking');
  RemindersWatcher.processedFiles.clear();
  console.log('[RemindersWatcher] Processed files tracking reset complete. Set size: 0');
}
  • Modified the stop method to call this reset method when the watcher is stopped:
typescript
public stop() {
  if (this.watcher) {
    this.watcher.close();
    this.watcher = null;
    console.log(`[RemindersWatcher] Stopped watching directory: ${this.directory}`);
    
    // Reset the processed files set to ensure a clean slate for the next session
    RemindersWatcher.resetProcessedFiles();
  }
}
  1. VocabularyWatcher and EssaysWatcher Reset:
    • Modified their stop methods to signal the main observer with a special 'RESET' value:
typescript
public stop() {
  this.watcher.close();
  console.log('[EssaysWatcher] Stopped watching for file changes');
  
  // Signal to the main observer that we're stopping
  // This helps ensure that the next time the watcher starts, it will process all files
  console.log('[EssaysWatcher] Signaling shutdown to main observer');
  this.markFileAsProcessed('RESET');
}
  1. FileSystemObserver Reset Signal Handling:
    • Enhanced the markFileAsProcessed method to recognize the special 'RESET' signal:
typescript
public markFileAsProcessed(filePath: string): void {
  // Special case: If filePath is 'RESET', reset the processed files set
  if (filePath === 'RESET') {
    console.log('[Observer] Received RESET signal from a watcher');
    this.resetProcessedFiles();
    return;
  }
  
  this.processedFiles.add(filePath);
  if (this.processedFiles.size % 10 === 0) {
    console.log(`[Observer] Processed files tracking: ${this.processedFiles.size} files marked as processed`);
  }
}
However, this approach still didn't solve the problem. After implementing these changes and restarting the observer, we still saw the following message in the logs:
text
[EssaysWatcher] [SKIP] File already processed in this session, skipping: /Users/mpstaton/code/lossless-monorepo/content/essays/Why Text Manipulation is Now Mission Critical.md
This indicates that despite our multi-layered approach to reset the processed files state, the state is still persisting between restarts.

The "Aha!" Moment

The key insight was understanding that the problem wasn't just about static vs. instance properties, but about how Node.js handles process restarts and module caching.
When using tools like nodemon, they don't fully terminate and restart the Node.js process in all cases. Instead, they may use various techniques to "hot reload" modules, which can lead to unexpected state persistence.
The real issue was that we needed a more robust way to determine if a file should be processed, rather than relying solely on an in-memory set that could be affected by the Node.js module caching and process lifecycle.
But more importantly, we realized that our approach to tracking processed files was scattered across multiple files and classes, making it difficult to maintain and debug. We needed a centralized, modular approach that follows our project's principles of Single Source of Truth.

Final Solution

Our final solution involves creating a dedicated utility for tracking processed files, following the singleton pattern to ensure there's only one instance tracking files across the entire application:
  1. Create a Centralized Utility: We created a new file processedFilesTracker.ts that exports a singleton instance and convenience functions for tracking processed files.
typescript
// processedFilesTracker.ts
/**
 * Processed Files Tracker
 * 
 * A centralized utility for tracking which files have been processed by the observer system.
 * This prevents infinite loops and duplicate processing while ensuring proper state management
 * across process restarts.
 * 
 * The tracker uses a singleton pattern to ensure there's only one instance tracking files
 * across the entire application, regardless of how many components access it.
 */

import fs from 'fs';
import path from 'path';
import crypto from 'crypto';

/**
 * Information about a processed file
 */
interface ProcessedFileInfo {
  // When the file was processed
  timestamp: number;
  // Optional content hash for detecting actual changes
  hash?: string;
}

/**
 * Singleton class for tracking processed files across the application
 */
class ProcessedFilesTracker {
  // Singleton instance
  private static instance: ProcessedFilesTracker;

  // Map to track processed files with timestamps
  private processedFiles = new Map<string, ProcessedFileInfo>();
  
  // Configurable expiration time (default: 5 minutes)
  private expirationMs = 5 * 60 * 1000;
  
  // File to persist processed state (optional)
  private stateFilePath: string;
  private persistStateToFile: boolean;
  
  // Critical files that should always be processed regardless of tracking
  private criticalFiles: string[] = [];
  
  // Flag to track if the tracker has been initialized
  private initialized = false;

  /**
   * Private constructor to enforce singleton pattern
   */
  private constructor() {
    // Default state file path in the same directory as this module
    this.stateFilePath = path.join(__dirname, '.observer-state.json');
    this.persistStateToFile = process.env.PERSIST_OBSERVER_STATE === 'true';
  }

  /**
   * Get the singleton instance
   */
  public static getInstance(): ProcessedFilesTracker {
    if (!ProcessedFilesTracker.instance) {
      ProcessedFilesTracker.instance = new ProcessedFilesTracker();
    }
    return ProcessedFilesTracker.instance;
  }

  /**
   * Initialize the tracker
   * @param options Configuration options
   */
  public initialize(options?: {
    expirationMs?: number;
    stateFilePath?: string;
    persistStateToFile?: boolean;
    criticalFiles?: string[];
  }): void {
    // ... implementation details ...
  }

  /**
   * Reset the processed files tracking
   */
  public reset(): void {
    console.log('[ProcessedFilesTracker] Resetting processed files tracking');
    this.processedFiles.clear();
    console.log('[ProcessedFilesTracker] Processed files tracking reset complete. Set size: 0');
  }

  /**
   * Mark a file as processed
   * @param filePath Path to the file to mark as processed
   * @param generateHash Whether to generate a content hash for the file
   */
  public markAsProcessed(filePath: string, generateHash: boolean = false): void {
    // ... implementation details ...
  }

  /**
   * Check if a file should be processed
   * @param filePath Path to the file to check
   * @param forceProcess Force processing regardless of tracking status
   * @returns True if the file should be processed, false otherwise
   */
  public shouldProcess(filePath: string, forceProcess: boolean = false): boolean {
    // Always process if force flag is set
    if (forceProcess) {
      console.log(`[ProcessedFilesTracker] Force processing enabled for ${filePath}`);
      return true;
    }

    // Always process critical files
    const fileName = path.basename(filePath).toLowerCase();
    if (this.criticalFiles.includes(fileName)) {
      console.log(`[ProcessedFilesTracker] Critical file detected: ${filePath}, will process`);
      return true;
    }
    
    // Check if file exists in processed set
    const fileInfo = this.processedFiles.get(filePath);
    if (!fileInfo) {
      return true; // File not processed before
    }
    
    // Check if the entry has expired
    const now = Date.now();
    if (now - fileInfo.timestamp > this.expirationMs) {
      console.log(`[ProcessedFilesTracker] Processing entry for ${filePath} has expired, will process again`);
      return true;
    }
    
    console.log(`[ProcessedFilesTracker] File ${filePath} was processed recently. Skipping.`);
    return false;
  }

  // ... other methods ...

  /**
   * Shutdown the tracker
   */
  public shutdown(): void {
    console.log('[ProcessedFilesTracker] Shutting down');
    
    // Log the number of processed files
    console.log(`[ProcessedFilesTracker] Number of processed files at shutdown: ${this.processedFiles.size}`);
    
    // Persist state to file if enabled
    if (this.persistStateToFile) {
      this.saveStateToFile();
    }
    
    // Reset the processed files set to ensure a clean state for the next run
    this.reset();
    
    console.log('[ProcessedFilesTracker] Shutdown complete');
  }
}

// Export singleton instance
export const processedFilesTracker = ProcessedFilesTracker.getInstance();

// Export convenience functions
export const initializeProcessedFilesTracker = (options?: {
  expirationMs?: number;
  stateFilePath?: string;
  persistStateToFile?: boolean;
  criticalFiles?: string[];
}) => processedFilesTracker.initialize(options);

export const markFileAsProcessed = (filePath: string, generateHash: boolean = false) => 
  processedFilesTracker.markAsProcessed(filePath, generateHash);

export const shouldProcessFile = (filePath: string, forceProcess: boolean = false) => 
  processedFilesTracker.shouldProcess(filePath, forceProcess);

export const resetProcessedFilesTracker = () => 
  processedFilesTracker.reset();

export const shutdownProcessedFilesTracker = () => 
  processedFilesTracker.shutdown();

export const addCriticalFile = (fileName: string) => 
  processedFilesTracker.addCriticalFile(fileName);
  1. Update FileSystemObserver to use the centralized tracker:
typescript
// fileSystemObserver.ts
import { 
  initializeProcessedFilesTracker, 
  markFileAsProcessed, 
  shouldProcessFile, 
  resetProcessedFilesTracker, 
  shutdownProcessedFilesTracker,
  addCriticalFile,
  processedFilesTracker
} from './utils/processedFilesTracker';

export class FileSystemObserver {
  // ... other properties ...

  /**
   * Reset the processed files set
   * This should be called when the observer is started to ensure a clean slate
   */
  public resetProcessedFiles(): void {
    console.log('[Observer] Resetting processed files tracking');
    resetProcessedFilesTracker();
    console.log('[Observer] Processed files tracking reset complete');
  }

  /**
   * Add a file to the processed files set
   * This prevents the file from being processed again in this session
   * @param filePath Path to the file to mark as processed
   */
  public markFileAsProcessed(filePath: string): void {
    markFileAsProcessed(filePath);
  }

  /**
   * Check if a file has been processed in this session
   * @param filePath Path to the file to check
   * @returns True if the file has been processed, false otherwise
   */
  public hasFileBeenProcessed(filePath: string): boolean {
    return !shouldProcessFile(filePath);
  }

  constructor(templateRegistry: TemplateRegistry, reportingService: ReportingService, contentRoot: string) {
    // ... other initialization ...
    
    // Initialize the processed files tracker with critical files
    initializeProcessedFilesTracker({
      criticalFiles: ['Why Text Manipulation is Now Mission Critical.md']
    });
    
    console.log('[Observer] FileSystemObserver initialized with clean processed files state');
  }

  // ... other methods ...

  private async handleShutdown() {
    // ... existing shutdown logic ...
    
    finally {
      // CRITICAL: Explicitly shut down the processed files tracker before exiting
      // This ensures that when the process is restarted, it starts with a clean slate
      console.log('[Observer] Shutting down processed files tracker');
      shutdownProcessedFilesTracker();
      
      console.log('[Observer] Exiting in 250ms...');
      setTimeout(() => {
        console.log('[Observer] Process exit now.');
        process.exit(0);
      }, 250);
    }
  }
}
  1. Update all watchers to use the centralized tracker:
typescript
// essaysWatcher.ts, vocabularyWatcher.ts, remindersWatcher.ts
import { markFileAsProcessed, shouldProcessFile } from '../utils/processedFilesTracker';

// In the handleFile method:
if (!shouldProcessFile(filePath)) {
  console.log(`[Watcher] [SKIP] File already processed in this session, skipping: ${filePath}`);
  return;
}

// Mark file as processed
markFileAsProcessed(filePath);
This solution provides several key benefits:
  1. Single Source of Truth: There's now only one place in the codebase responsible for tracking processed files, making it easier to maintain and debug.
  2. Modular Design: The tracker is implemented as a separate utility that can be used by any component in the system.
  3. Robust File Processing Logic: The tracker includes logic for handling critical files, expiration of processed entries, and optional file-based persistence.
  4. Proper Shutdown Handling: The tracker is explicitly shut down when the observer is stopped, ensuring a clean slate for the next run.
  5. Improved Logging: The tracker provides detailed logging about its operations, making it easier to diagnose issues.

Lessons Learned

  1. Centralize State Management: When multiple components need to share state, it's best to centralize that state in a single module that follows the singleton pattern.
  2. Follow Single Source of Truth: Having the same logic implemented in multiple places leads to bugs and maintenance issues. Always strive for a single source of truth.
  3. Understand Node.js Process Lifecycle: Node.js processes can behave in unexpected ways, especially when using development tools like nodemon. It's important to understand how module caching and process restarts work.
  4. Use Proper Design Patterns: The singleton pattern was the right choice for this use case, as we needed to ensure that there's only one instance of the tracker across the entire application.
  5. Explicit Initialization and Shutdown: Always explicitly initialize and shut down stateful components to ensure proper cleanup and prevent state leakage.
By following these principles, we were able to create a robust solution that properly handles file processing state across process restarts, ensuring that all files are processed correctly regardless of how the observer is restarted.