Frontmatter consistency through filesystem observer

Objective

Create a robust filesystem observer system that monitors Markdown files, validates frontmatter against predefined templates, and automatically corrects inconsistencies while preserving the most accurate metadata.

Implementation Status

This system has been successfully implemented in the tidyverse/observers directory with the following key features:
  • Template-based frontmatter validation
  • Automatic correction of missing required fields
  • Special handling for date_created using file birthtime
  • Kebab-case to snake_case property conversion
  • Proper YAML formatting for tags and arrays

System Architecture

graph TD A[File System] -->|File Events| B[FileSystemObserver] B -->|Read File| C[Extract Frontmatter] C -->|Validate| D[TemplateRegistry] D -->|Get Template| E[Template Definitions] C -->|Missing Fields?| F[addMissingRequiredFields] F -->|Special Handling| G[date_created] G -->|Compare with| H[File Birthtime] F -->|Update| I[Write Updated File] B -->|Log Activity| J[ReportingService] J -->|Generate| K[Markdown Reports]

Data Flow

  1. File Detection:
    text
    File System (new/modified file)
      → FileSystemObserver (event)
        → Extract Frontmatter
          → Validate Against Template
  2. Field Processing:
    text
    Template Registry (find matching template)
      → Check Required Fields
        → Special Handling for date_created
          → Compare with File Birthtime
            → Keep Earlier Date
  3. Reporting Flow:
    text
    Observer Activity
      → ReportingService
        → Log Property Conversions
          → Generate Markdown Reports

Key Components

1. Template Registry

typescript
// Template definition pattern
interface Template {
  id: string;
  name: string;
  description: string;
  
  // Path matching rules
  pathPatterns: string[];
  
  // Schema definition
  required: {
    [key: string]: {
      type: string;
      description: string;
      defaultValueFn?: (filePath: string) => any;
    }
  };
  
  optional: {
    [key: string]: {
      type: string;
      description: string;
      defaultValueFn?: (filePath: string) => any;
    }
  };
}

2. File Observer

typescript
class FileSystemObserver {
  constructor(
    private templateRegistry: TemplateRegistry,
    private contentRoot: string
  ) {
    this.watcher = chokidar.watch(contentRoot);
  }
  
  async onFileChanged(filePath: string) {
    // Read file and extract frontmatter
    const content = await fs.readFile(filePath, 'utf8');
    const frontmatterResult = this.extractFrontmatter(content);
    
    if (frontmatterResult.frontmatter) {
      // Find matching template
      const template = this.templateRegistry.findTemplate(filePath);
      
      // Add missing required fields
      const { updatedFrontmatter, changed } = addMissingRequiredFields(
        frontmatterResult.frontmatter,
        template,
        filePath
      );
      
      // Write updated file if changes were made
      if (changed) {
        await this.writeUpdatedFile(filePath, updatedFrontmatter, frontmatterResult.content);
      }
    }
  }
}

3. Special Handling for date_created

typescript
// In addMissingRequiredFields function
if (key === 'date_created') {
  try {
    // Get file birthtime
    const fs = require('fs');
    if (fs.existsSync(filePath)) {
      const stats = fs.statSync(filePath);
      const birthtime = stats.birthtime;
      const birthtimeIso = birthtime.toISOString();
      
      // If date_created exists, check if birthtime is earlier
      if (updatedFrontmatter[key]) {
        const existingDate = new Date(updatedFrontmatter[key]);
        
        // If birthtime is earlier than the existing date_created, update it
        if (birthtime < existingDate) {
          console.log(`Updating date_created for ${filePath} from ${updatedFrontmatter[key]} to ${birthtimeIso} (file birthtime is earlier)`);
          updatedFrontmatter[key] = birthtimeIso;
          changed = true;
        } else {
          console.log(`Keeping existing date_created for ${filePath}: ${updatedFrontmatter[key]} (earlier than file birthtime ${birthtimeIso})`);
        }
      } 
      // If date_created doesn't exist, add it
      else {
        console.log(`Adding date_created for ${filePath}: ${birthtimeIso}`);
        updatedFrontmatter[key] = birthtimeIso;
        changed = true;
      }
      
      // Skip the standard field processing for date_created
      continue;
    }
  } catch (error) {
    console.error(`Error handling date_created for ${filePath}:`, error);
    // Continue with standard processing if there was an error
  }
}

4. Template Definition for Tooling

typescript
const toolingTemplate = {
  id: 'tooling',
  name: 'Tooling Document',
  description: 'Template for tooling documentation',
  
  pathPatterns: ['content/tooling/**/*.md'],
  
  required: {
    site_uuid: {
      type: 'string',
      description: 'Unique identifier for the site',
      defaultValueFn: () => uuidv4()
    },
    tags: {
      type: 'array',
      description: 'Categorization tags',
      defaultValueFn: (filePath) => {
        // Extract directory structure as tags
        try {
          // Extract all directory names after 'tooling'
          const pathParts = filePath.split('/');
          const toolingIndex = pathParts.findIndex(part => part === 'tooling');
          
          if (toolingIndex >= 0) {
            // Get all directory names after 'tooling' and before the filename
            const tags = pathParts.slice(toolingIndex + 1, -1).map(tag => tag.replace(/\s+/g, '-'));
            return tags.length > 0 ? tags : ['Uncategorized'];
          }
          return ['Uncategorized'];
        } catch (error) {
          console.error(`Error generating tags for ${filePath}:`, error);
          return ['Uncategorized'];
        }
      }
    },
    date_created: {
      type: 'date',
      description: 'Creation date',
      defaultValueFn: (filePath) => {
        try {
          // Use the Node.js fs module for synchronous operations
          const fs = require('fs');
          
          // Check if file exists
          if (fs.existsSync(filePath)) {
            // Get file stats to access creation time
            const stats = fs.statSync(filePath);
            
            // Use birthtime (actual file creation time) which is reliable on Mac
            const timestamp = stats.birthtime;
            
            // Return full ISO string with timezone
            return timestamp.toISOString();
          } else {
            // Return null instead of current date
            return null;
          }
        } catch (error) {
          // Return null instead of current date
          return null;
        }
      }
    },
    date_modified: {
      type: 'date',
      description: 'Last modified date',
      defaultValueFn: (filePath) => {
        // Similar to date_created but using mtime
        // Implementation details...
      }
    }
  },
  
  optional: {
    // Optional fields definition
    // Implementation details...
  }
};

Best Practices

  1. Reliable File Timestamps:
    • Use birthtime for date_created which is reliable on Mac systems
    • Compare existing values with file timestamps and keep the earlier date
    • Add proper error handling to prevent fallbacks to current date
  2. Frontmatter Consistency:
    • Convert kebab-case properties to snake_case
    • Format tags as proper YAML lists with hyphens
    • Preserve content while updating frontmatter
  3. Reporting and Monitoring:
    • Log all property conversions and validation issues
    • Generate periodic reports in markdown format
    • Create a final report on system shutdown
  4. Code Reuse and Shared Functionality:
    • Extract common functionality into shared utility modules
    • Implement a single source of truth for operations used across multiple templates
    • All templates should use the same shared code for common operations like:
      • UUID generation
      • Date handling
      • File stats retrieval
      • Tag formatting
    • Never duplicate functionality across template files
    • When adding new functionality to one template, ensure it's available to all templates that need it

Constraints and Limitations

  1. File System Compatibility:
    • The birthtime property is reliable on Mac but may not be on all systems
    • Error handling is in place to prevent incorrect timestamps
  2. Performance Considerations:
    • Synchronous file operations are used for simplicity but may impact performance with large numbers of files
    • Consider batch processing for large directories
  3. Template Management:
    • Templates must be manually updated when frontmatter requirements change
    • No automatic detection of new frontmatter patterns
  4. Preventing Infinite Loops:
    • The observer must track files currently being processed to prevent infinite loops
    • Implement async promise-based processing with proper error handling

CRITICAL: Preventing Infinite Loops

The most critical aspect of the filesystem observer implementation is preventing infinite loops. This is ABSOLUTELY ESSENTIAL for proper functioning:
typescript
class FileSystemObserver {
  private processingFiles: Set<string> = new Set(); // Track files currently being processed
  
  async onFileChanged(filePath: string): Promise<void> {
    // Skip if this file is already being processed to prevent infinite loops
    if (this.processingFiles.has(filePath)) {
      console.log(`Skipping ${filePath} as it's already being processed (preventing loop)`);
      return;
    }
    
    try {
      // Mark file as being processed
      this.processingFiles.add(filePath);
      
      // Process the file...
      
    } finally {
      // CRITICAL: Always remove from processing set when done
      this.processingFiles.delete(filePath);
    }
  }
}

Why This Is Critical

  1. Infinite Loop Prevention: Without this mechanism, the observer will enter an infinite loop because:
    • Observer detects file change
    • Observer updates file
    • Update triggers another file change event
    • Process repeats indefinitely
  2. Async Promise-Based Processing: All file processing must use async/await with proper Promise handling to ensure:
    • Operations complete fully before releasing the file lock
    • Error handling doesn't prevent cleanup
    • File state remains consistent
  3. User-Triggered Changes Only: The observer should ONLY process changes that are actually made by the USER, not changes made by the observer itself.
  4. Resource Protection: Infinite loops can quickly:
    • Consume all available CPU
    • Fill up disk space with logs
    • Corrupt files with partial updates
    • Crash the entire application
This is not an optional feature - it is the single most important aspect of the implementation that must be implemented correctly.

Two-Phase Observer Approach

Another critical implementation detail is using a two-phase approach to prevent observer loops while still ensuring all files are properly processed:
typescript
class FileSystemObserver {
  private initialProcessingComplete: boolean = false;
  private initialProcessingTimeout: NodeJS.Timeout | null = null;
  
  constructor(
    templateRegistry: TemplateRegistry,
    reportingService: ReportingService,
    contentRoot: string,
    private options: {
      ignoreInitial?: boolean;
      processExistingFiles?: boolean;
      initialProcessingDelay?: number; // Delay in ms before switching to regular observer mode
    } = {}
  ) {
    // Set default options
    this.options.initialProcessingDelay = this.options.initialProcessingDelay ?? 90000; // Default 90 seconds
    
    // Set up initial processing timeout
    if (this.options.processExistingFiles) {
      console.log(`Initial processing mode active. Will switch to regular observer mode after ${this.options.initialProcessingDelay / 1000} seconds.`);
      this.initialProcessingTimeout = setTimeout(() => {
        console.log('Switching to regular observer mode...');
        this.initialProcessingComplete = true;
        
        // Generate a report after initial processing
        this.reportingService.generateReport();
      }, this.options.initialProcessingDelay);
    }
  }
  
  async onFileChanged(filePath: string): Promise<void> {
    // Standard loop prevention first
    if (this.processingFiles.has(filePath)) {
      return;
    }
    
    // Additional loop prevention for regular observer mode
    if (this.initialProcessingComplete) {
      // Check if this is a file we just updated
      const lastModified = (await fs.stat(filePath)).mtime.getTime();
      const currentTime = Date.now();
      const timeSinceModification = currentTime - lastModified;
      
      // If the file was modified very recently (within 5 seconds) and we're in regular observer mode,
      // it's likely our own update, so skip it
      if (timeSinceModification < 5000) {
        console.log(`Skipping recently modified file ${filePath} to prevent observer loop (modified ${timeSinceModification}ms ago)`);
        return;
      }
    }
    
    // Process the file...
  }
}

Why This Approach Is Essential

  1. Initial Processing Phase:
    • Processes all existing files once at startup
    • Runs for a fixed duration (90 seconds by default)
    • Generates a comprehensive report after completion
  2. Regular Observer Phase:
    • Automatically activates after the initial processing phase
    • Only processes files that were genuinely modified by users
    • Includes a smart detection system to ignore self-triggered changes
  3. Benefits:
    • Ensures all files are processed once during startup
    • Automatically transitions to a stable monitoring mode
    • Self-triggered changes don't cause infinite processing loops
    • Provides clear logging about which phase the observer is in
  4. Configuration Options:
    • initialProcessingDelay: Adjustable based on content directory size (default: 90 seconds)
    • processExistingFiles: Can be disabled if only new changes should be processed
This two-phase approach complements the processingFiles tracking mechanism and provides an additional layer of protection against observer loops.

Next Steps

  1. Enhanced Validation:
    • Add more sophisticated validation rules for specific field types
    • Implement cross-field validation (e.g., date_created should be before date_modified)
  2. User Interface:
    • Create a dashboard for monitoring observer activity
    • Add interactive controls for managing templates
  3. Integration:
    • Connect with build process to ensure frontmatter is valid before deployment
    • Add hooks for custom processing of specific fields