Persistent File Processing State in Observer
Persistent File Processing State in Observer
Context
The FileSystemObserver is a core component of our content management system that watches for changes in Markdown files and processes their frontmatter. It uses a
processedFiles
set to track which files have already been processed to prevent infinite loops and duplicate processing.The issue we encountered was that after restarting the observer (e.g., via
nodemon
during development), certain files were being skipped entirely. Specifically, files that had been processed before the restart were not being processed again, even though they should be.This was causing problems with content updates not being properly reflected in the system, particularly with files like "Why Text Manipulation is Now Mission Critical.md" being consistently skipped after restarts.
Incorrect Attempts
Attempt 1: Adding a Reset Method Call in Constructor
Our first attempt was to add a call to
resetProcessedFiles()
in the constructor of the FileSystemObserver
class: typescript
constructor(
private templateRegistry: TemplateRegistry,
private reportingService: ReportingService
) {
// Reset the processed files set when a new instance is created
FileSystemObserver.resetProcessedFiles();
// Initialize watchers
this.remindersWatcher = new RemindersWatcher(this.templateRegistry);
this.vocabularyWatcher = new VocabularyWatcher(this.templateRegistry);
this.essaysWatcher = new EssaysWatcher(this.templateRegistry);
}
However, this didn't solve the problem because the
processedFiles
set was still defined as a static class property: typescript
// Static set to track processed files and prevent infinite loops
private static processedFiles = new Set<string>();
// Method to reset the processed files set
public static resetProcessedFiles() {
FileSystemObserver.processedFiles.clear();
console.log('Processed files set has been reset.');
}
The issue with this approach was that even though we were calling
resetProcessedFiles()
in the constructor, the static variable persisted across Node.js process restarts when using tools like nodemon
that don't fully terminate the process.Attempt 2: Converting to Instance Property
Our second attempt was to convert the static
processedFiles
set to an instance property: typescript
// Instance set to track processed files and prevent infinite loops
private processedFiles = new Set<string>();
// Method to reset the processed files set
public resetProcessedFiles() {
this.processedFiles.clear();
console.log('Processed files set has been reset.');
}
And we updated all references to use the instance property:
typescript
// In the onChange method
if (this.processedFiles.has(filePath)) {
console.log(`File ${filePath} has already been processed. Skipping.`);
return;
}
// Mark file as processed
this.processedFiles.add(filePath);
console.log(`Marked file as processed: ${filePath}`);
console.log(`Total processed files: ${this.processedFiles.size}`);
We also added a call to
this.resetProcessedFiles()
in the constructor: typescript
constructor(
private templateRegistry: TemplateRegistry,
private reportingService: ReportingService
) {
// Reset the processed files set when a new instance is created
this.resetProcessedFiles();
// Initialize watchers
this.remindersWatcher = new RemindersWatcher(this.templateRegistry);
this.vocabularyWatcher = new VocabularyWatcher(this.templateRegistry);
this.essaysWatcher = new EssaysWatcher(this.templateRegistry);
}
We also added shutdown diagnostics to track the number of processed files at shutdown:
typescript
private handleShutdown = () => {
// ... existing code ...
console.log(`Number of processed files at shutdown: ${this.processedFiles.size}`);
// ... rest of shutdown handling ...
};
However, this approach still didn't completely solve the issue.
Attempt 3: Multi-layered Reset Approach
Our third attempt involved a more comprehensive approach to ensure that the file processing state doesn't persist between restarts:
- FileSystemObserver Shutdown Reset:
- Modified the
handleShutdown
method to explicitly reset theprocessedFiles
set before exiting:
typescript
private async handleShutdown() {
// ... existing code ...
finally {
// ... existing code ...
// CRITICAL: Explicitly reset the processedFiles set before exiting
// This ensures that when the process is restarted, it starts with a clean slate
console.log('[Observer] Explicitly resetting processed files tracking before exit');
this.resetProcessedFiles();
console.log('[Observer] Processed files tracking has been reset');
console.log('[Observer] Exiting in 250ms...');
setTimeout(() => {
console.log('[Observer] Process exit now.');
process.exit(0);
}, 250);
}
}
- RemindersWatcher Reset:
- Added a
resetProcessedFiles
static method to the RemindersWatcher class:
typescript
/**
* Reset the processed files set
* This ensures a clean slate for the next session
*/
public static resetProcessedFiles(): void {
console.log('[RemindersWatcher] Resetting processed files tracking');
RemindersWatcher.processedFiles.clear();
console.log('[RemindersWatcher] Processed files tracking reset complete. Set size: 0');
}
- Modified the
stop
method to call this reset method when the watcher is stopped:
typescript
public stop() {
if (this.watcher) {
this.watcher.close();
this.watcher = null;
console.log(`[RemindersWatcher] Stopped watching directory: ${this.directory}`);
// Reset the processed files set to ensure a clean slate for the next session
RemindersWatcher.resetProcessedFiles();
}
}
- VocabularyWatcher and EssaysWatcher Reset:
- Modified their
stop
methods to signal the main observer with a special 'RESET' value:
typescript
public stop() {
this.watcher.close();
console.log('[EssaysWatcher] Stopped watching for file changes');
// Signal to the main observer that we're stopping
// This helps ensure that the next time the watcher starts, it will process all files
console.log('[EssaysWatcher] Signaling shutdown to main observer');
this.markFileAsProcessed('RESET');
}
- FileSystemObserver Reset Signal Handling:
- Enhanced the
markFileAsProcessed
method to recognize the special 'RESET' signal:
typescript
public markFileAsProcessed(filePath: string): void {
// Special case: If filePath is 'RESET', reset the processed files set
if (filePath === 'RESET') {
console.log('[Observer] Received RESET signal from a watcher');
this.resetProcessedFiles();
return;
}
this.processedFiles.add(filePath);
if (this.processedFiles.size % 10 === 0) {
console.log(`[Observer] Processed files tracking: ${this.processedFiles.size} files marked as processed`);
}
}
However, this approach still didn't solve the problem. After implementing these changes and restarting the observer, we still saw the following message in the logs:
text
[EssaysWatcher] [SKIP] File already processed in this session, skipping: /Users/mpstaton/code/lossless-monorepo/content/essays/Why Text Manipulation is Now Mission Critical.md
This indicates that despite our multi-layered approach to reset the processed files state, the state is still persisting between restarts.
The "Aha!" Moment
The key insight was understanding that the problem wasn't just about static vs. instance properties, but about how Node.js handles process restarts and module caching.
When using tools like
nodemon
, they don't fully terminate and restart the Node.js process in all cases. Instead, they may use various techniques to "hot reload" modules, which can lead to unexpected state persistence.The real issue was that we needed a more robust way to determine if a file should be processed, rather than relying solely on an in-memory set that could be affected by the Node.js module caching and process lifecycle.
But more importantly, we realized that our approach to tracking processed files was scattered across multiple files and classes, making it difficult to maintain and debug. We needed a centralized, modular approach that follows our project's principles of Single Source of Truth.
Final Solution
Our final solution involves creating a dedicated utility for tracking processed files, following the singleton pattern to ensure there's only one instance tracking files across the entire application:
- Create a Centralized Utility: We created a new file
processedFilesTracker.ts
that exports a singleton instance and convenience functions for tracking processed files.
typescript
// processedFilesTracker.ts
/**
* Processed Files Tracker
*
* A centralized utility for tracking which files have been processed by the observer system.
* This prevents infinite loops and duplicate processing while ensuring proper state management
* across process restarts.
*
* The tracker uses a singleton pattern to ensure there's only one instance tracking files
* across the entire application, regardless of how many components access it.
*/
import fs from 'fs';
import path from 'path';
import crypto from 'crypto';
/**
* Information about a processed file
*/
interface ProcessedFileInfo {
// When the file was processed
timestamp: number;
// Optional content hash for detecting actual changes
hash?: string;
}
/**
* Singleton class for tracking processed files across the application
*/
class ProcessedFilesTracker {
// Singleton instance
private static instance: ProcessedFilesTracker;
// Map to track processed files with timestamps
private processedFiles = new Map<string, ProcessedFileInfo>();
// Configurable expiration time (default: 5 minutes)
private expirationMs = 5 * 60 * 1000;
// File to persist processed state (optional)
private stateFilePath: string;
private persistStateToFile: boolean;
// Critical files that should always be processed regardless of tracking
private criticalFiles: string[] = [];
// Flag to track if the tracker has been initialized
private initialized = false;
/**
* Private constructor to enforce singleton pattern
*/
private constructor() {
// Default state file path in the same directory as this module
this.stateFilePath = path.join(__dirname, '.observer-state.json');
this.persistStateToFile = process.env.PERSIST_OBSERVER_STATE === 'true';
}
/**
* Get the singleton instance
*/
public static getInstance(): ProcessedFilesTracker {
if (!ProcessedFilesTracker.instance) {
ProcessedFilesTracker.instance = new ProcessedFilesTracker();
}
return ProcessedFilesTracker.instance;
}
/**
* Initialize the tracker
* @param options Configuration options
*/
public initialize(options?: {
expirationMs?: number;
stateFilePath?: string;
persistStateToFile?: boolean;
criticalFiles?: string[];
}): void {
// ... implementation details ...
}
/**
* Reset the processed files tracking
*/
public reset(): void {
console.log('[ProcessedFilesTracker] Resetting processed files tracking');
this.processedFiles.clear();
console.log('[ProcessedFilesTracker] Processed files tracking reset complete. Set size: 0');
}
/**
* Mark a file as processed
* @param filePath Path to the file to mark as processed
* @param generateHash Whether to generate a content hash for the file
*/
public markAsProcessed(filePath: string, generateHash: boolean = false): void {
// ... implementation details ...
}
/**
* Check if a file should be processed
* @param filePath Path to the file to check
* @param forceProcess Force processing regardless of tracking status
* @returns True if the file should be processed, false otherwise
*/
public shouldProcess(filePath: string, forceProcess: boolean = false): boolean {
// Always process if force flag is set
if (forceProcess) {
console.log(`[ProcessedFilesTracker] Force processing enabled for ${filePath}`);
return true;
}
// Always process critical files
const fileName = path.basename(filePath).toLowerCase();
if (this.criticalFiles.includes(fileName)) {
console.log(`[ProcessedFilesTracker] Critical file detected: ${filePath}, will process`);
return true;
}
// Check if file exists in processed set
const fileInfo = this.processedFiles.get(filePath);
if (!fileInfo) {
return true; // File not processed before
}
// Check if the entry has expired
const now = Date.now();
if (now - fileInfo.timestamp > this.expirationMs) {
console.log(`[ProcessedFilesTracker] Processing entry for ${filePath} has expired, will process again`);
return true;
}
console.log(`[ProcessedFilesTracker] File ${filePath} was processed recently. Skipping.`);
return false;
}
// ... other methods ...
/**
* Shutdown the tracker
*/
public shutdown(): void {
console.log('[ProcessedFilesTracker] Shutting down');
// Log the number of processed files
console.log(`[ProcessedFilesTracker] Number of processed files at shutdown: ${this.processedFiles.size}`);
// Persist state to file if enabled
if (this.persistStateToFile) {
this.saveStateToFile();
}
// Reset the processed files set to ensure a clean state for the next run
this.reset();
console.log('[ProcessedFilesTracker] Shutdown complete');
}
}
// Export singleton instance
export const processedFilesTracker = ProcessedFilesTracker.getInstance();
// Export convenience functions
export const initializeProcessedFilesTracker = (options?: {
expirationMs?: number;
stateFilePath?: string;
persistStateToFile?: boolean;
criticalFiles?: string[];
}) => processedFilesTracker.initialize(options);
export const markFileAsProcessed = (filePath: string, generateHash: boolean = false) =>
processedFilesTracker.markAsProcessed(filePath, generateHash);
export const shouldProcessFile = (filePath: string, forceProcess: boolean = false) =>
processedFilesTracker.shouldProcess(filePath, forceProcess);
export const resetProcessedFilesTracker = () =>
processedFilesTracker.reset();
export const shutdownProcessedFilesTracker = () =>
processedFilesTracker.shutdown();
export const addCriticalFile = (fileName: string) =>
processedFilesTracker.addCriticalFile(fileName);
- Update FileSystemObserver to use the centralized tracker:
typescript
// fileSystemObserver.ts
import {
initializeProcessedFilesTracker,
markFileAsProcessed,
shouldProcessFile,
resetProcessedFilesTracker,
shutdownProcessedFilesTracker,
addCriticalFile,
processedFilesTracker
} from './utils/processedFilesTracker';
export class FileSystemObserver {
// ... other properties ...
/**
* Reset the processed files set
* This should be called when the observer is started to ensure a clean slate
*/
public resetProcessedFiles(): void {
console.log('[Observer] Resetting processed files tracking');
resetProcessedFilesTracker();
console.log('[Observer] Processed files tracking reset complete');
}
/**
* Add a file to the processed files set
* This prevents the file from being processed again in this session
* @param filePath Path to the file to mark as processed
*/
public markFileAsProcessed(filePath: string): void {
markFileAsProcessed(filePath);
}
/**
* Check if a file has been processed in this session
* @param filePath Path to the file to check
* @returns True if the file has been processed, false otherwise
*/
public hasFileBeenProcessed(filePath: string): boolean {
return !shouldProcessFile(filePath);
}
constructor(templateRegistry: TemplateRegistry, reportingService: ReportingService, contentRoot: string) {
// ... other initialization ...
// Initialize the processed files tracker with critical files
initializeProcessedFilesTracker({
criticalFiles: ['Why Text Manipulation is Now Mission Critical.md']
});
console.log('[Observer] FileSystemObserver initialized with clean processed files state');
}
// ... other methods ...
private async handleShutdown() {
// ... existing shutdown logic ...
finally {
// CRITICAL: Explicitly shut down the processed files tracker before exiting
// This ensures that when the process is restarted, it starts with a clean slate
console.log('[Observer] Shutting down processed files tracker');
shutdownProcessedFilesTracker();
console.log('[Observer] Exiting in 250ms...');
setTimeout(() => {
console.log('[Observer] Process exit now.');
process.exit(0);
}, 250);
}
}
}
- Update all watchers to use the centralized tracker:
typescript
// essaysWatcher.ts, vocabularyWatcher.ts, remindersWatcher.ts
import { markFileAsProcessed, shouldProcessFile } from '../utils/processedFilesTracker';
// In the handleFile method:
if (!shouldProcessFile(filePath)) {
console.log(`[Watcher] [SKIP] File already processed in this session, skipping: ${filePath}`);
return;
}
// Mark file as processed
markFileAsProcessed(filePath);
This solution provides several key benefits:
- Single Source of Truth: There's now only one place in the codebase responsible for tracking processed files, making it easier to maintain and debug.
- Modular Design: The tracker is implemented as a separate utility that can be used by any component in the system.
- Robust File Processing Logic: The tracker includes logic for handling critical files, expiration of processed entries, and optional file-based persistence.
- Proper Shutdown Handling: The tracker is explicitly shut down when the observer is stopped, ensuring a clean slate for the next run.
- Improved Logging: The tracker provides detailed logging about its operations, making it easier to diagnose issues.
Lessons Learned
- Centralize State Management: When multiple components need to share state, it's best to centralize that state in a single module that follows the singleton pattern.
- Follow Single Source of Truth: Having the same logic implemented in multiple places leads to bugs and maintenance issues. Always strive for a single source of truth.
- Understand Node.js Process Lifecycle: Node.js processes can behave in unexpected ways, especially when using development tools like nodemon. It's important to understand how module caching and process restarts work.
- Use Proper Design Patterns: The singleton pattern was the right choice for this use case, as we needed to ensure that there's only one instance of the tracker across the entire application.
- Explicit Initialization and Shutdown: Always explicitly initialize and shut down stateful components to ensure proper cleanup and prevent state leakage.
By following these principles, we were able to create a robust solution that properly handles file processing state across process restarts, ensuring that all files are processed correctly regardless of how the observer is restarted.