YAML Prescreening and Triaging System

Executive Summary

Our content repository contains hundreds of markdown files with YAML frontmatter that drives critical site functionality. As we scale our content creation through both human authors and AI assistance, maintaining consistency in this frontmatter has become increasingly challenging.
The YAML Prescreening and Triaging System addresses this challenge by:
  1. Proactively identifying formatting inconsistencies before they cause downstream build failures
  2. Automatically correcting common issues without manual intervention
  3. Triaging content files into categories based on their YAML health
  4. Generating detailed reports of issues found and corrections made
  5. Preventing problematic files from entering subsequent build processes
This system serves as a critical first line of defense in our content pipeline, significantly reducing build failures, improving reliability, and streamlining the content management process. For non-technical stakeholders, it can be understood as a "content quality firewall" that:
  • Reduces engineering time spent troubleshooting build failures
  • Improves publication reliability by identifying issues before they affect the live site
  • Provides visibility into content quality through clear reporting
  • Accelerates content production by allowing automatic fixes for common issues
  • Protects downstream systems from corrupted or malformed data
In a repository with over 700 content files, our initial implementation identified formatting issues and made 584 automated correction attempts across the content library, with 623 files passing all validation checks. This represents a significant improvement in content quality and build-system reliability, with minimal manual intervention required.

Technical Specification

1. System Overview

The YAML Prescreening and Triaging System (prescreenFilesWithFilesystemRegex.cjs) is a Node.js utility designed to scan a content directory for markdown files, identify YAML frontmatter inconsistencies, and sort files into categories based on the severity of issues found. The script applies automated corrections where possible and generates comprehensive reports on the results.
This system operates as a prerequisite to the main build process, acting as a gatekeeper that:
  1. Categorizes files as "clean", "minor inconsistencies", or "critical issues"
  2. Attempts to fix common formatting problems automatically
  3. Outputs detailed reports of its findings and actions
  4. Provides a clear signal to downstream processes about which files are safe to process

2. Core Capabilities

2.1 Issue Detection

The system detects various types of YAML formatting issues (a sketch of two of these checks follows the list):
  • Missing YAML delimiters: Detects files with opening but not closing delimiters
  • Block scalar formatting: Identifies improper use of block scalar indicators (>, |)
  • Duplicate keys: Finds repeated property keys in frontmatter
  • Quote formatting: Detects improperly quoted values, especially in error messages
  • URL formatting: Identifies URLs broken across multiple lines
  • Special property handling: Applies dedicated detection to designated properties whose values contain colons
  • Balanced quotes: Verifies that quoted values have proper closing quotes
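
A minimal sketch of how two of these checks might be expressed as regex tests over raw file content (illustrative only; the helper names and exact patterns are assumptions, not the script's actual CORRUPTION_PATTERNS entries):

```javascript
// Hypothetical helpers illustrating two of the detection checks above.
const FRONTMATTER_BLOCK = /^---\r?\n([\s\S]*?)\r?\n---/;

// Missing closing delimiter: the file opens with "---" but the block never closes.
function hasUnclosedFrontmatter(content) {
  return /^---\r?\n/.test(content) && !FRONTMATTER_BLOCK.test(content);
}

// Duplicate keys: the same top-level property appears more than once.
function findDuplicateKeys(content) {
  const match = content.match(FRONTMATTER_BLOCK);
  if (!match) return [];
  const seen = new Set();
  const duplicates = [];
  for (const line of match[1].split(/\r?\n/)) {
    const key = (line.match(/^([A-Za-z0-9_-]+)\s*:/) || [])[1];
    if (!key) continue;
    if (seen.has(key)) duplicates.push(key);
    else seen.add(key);
  }
  return duplicates;
}
```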

2.2 Automated Correction

For each type of detected issue, the system implements a specialized correction function (an example sketch follows the list):
  • Delimiter correction: Attempts to add missing closing delimiters
  • Block scalar fixing: Properly formats block scalar content
  • Duplicate key removal: Eliminates redundant keys while preserving the first occurrence
  • Quote standardization: Fixes quotes in error messages and special properties
  • URL reconstruction: Rebuilds URLs broken across multiple lines
  • Quote balancing: Adds missing closing quotes to property values
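
As an example of the correction style, here is a hedged sketch of duplicate-key removal that keeps the first occurrence (hypothetical helper; the real correction logic lives in the script's pattern table and is more thorough):

```javascript
// Hypothetical sketch: drop repeated top-level keys, keeping the first occurrence.
function removeDuplicateKeys(frontmatterText) {
  const seen = new Set();
  const keptLines = [];
  for (const line of frontmatterText.split(/\r?\n/)) {
    const key = (line.match(/^([A-Za-z0-9_-]+)\s*:/) || [])[1];
    if (key && seen.has(key)) continue; // skip later duplicates
    if (key) seen.add(key);
    keptLines.push(line);
  }
  return keptLines.join('\n');
}
```

This sketch ignores multi-line values; a real implementation would also need to drop a duplicate key's continuation lines.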

2.3 File Categorization

Files are sorted into three categories (a minimal categorization sketch follows the list):
  1. Totally clean files: No issues detected, safe for all operations
  2. Files with minor inconsistencies: Issues that might affect specific operations but don't prevent processing
  3. Files with critical issues: Problems that make the file unsafe for any processing
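
Conceptually, categorization reduces to checking which patterns matched and whether any of them are critical. A minimal sketch (assumed names; the script's actual logic also factors in the results of correction attempts):

```javascript
// Hypothetical sketch: derive a file's category from the patterns it matched.
function categorizeFile(matchedPatterns) {
  if (matchedPatterns.length === 0) return 'totally-clean';
  if (matchedPatterns.some((p) => p.isCritical)) return 'critical-issues';
  return 'minor-inconsistencies';
}
```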

2.4 Report Generation

The system produces three detailed reports:
  1. Screened-In Files: Details on files that passed validation or had minor, fixable issues
  2. Non-Blocking Observations: Analysis of minor inconsistencies and their impact
  3. Screened-Out Files: Details on files with critical issues that prevent processing
These reports provide visibility into content quality and specific actions needed to improve it.

3. Technical Architecture

3.1 Configuration Parameters

The script uses configurable parameters:
```javascript
// Content directory to scan
const CONTENT_DIR = process.env.CONTENT_DIR || 'site/src/content/tooling';

// Output file paths for reports
const TARGET_SCREENED_OUT_FILES_FILE_PATH = '...';
const TARGET_SCREENED_IN_FILES_FILE_PATH = '...';
const TARGET_NON_BLOCKING_OBSERVATIONS_FILE_PATH = '...';

// Special property lists
const URL_PROPERTIES = [
  'url', 'image', 'favicon', 'og_screenshot_url', 'og_image'
];

const SPECIAL_PROPERTIES = [
  'title', 'description', 'jina_error', 'og_error_message', 'zinger'
];
```

3.2 Pattern-Based Issue Detection

Issues are detected using a collection of regex patterns, each with metadata about impact:
```javascript
const CORRUPTION_PATTERNS = [
  { 
    pattern: /pattern/,
    messageToLog: 'Human-readable description',
    preventsOperations: ['operation1', 'operation2'],
    correctionFunction: 'functionName',
    isCritical: boolean
  },
  // Additional patterns...
];
```
Each pattern includes:
  • A regex pattern for detection
  • A message to log when the pattern is found
  • A list of operations that would be affected
  • A reference to the correction function to apply (if available)
  • A flag indicating if the issue is critical
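
For illustration, a hypothetical entry in this shape, encoding the broken-URL check from section 2.1 (the operation names, correction function name, and message text are assumptions, not the script's real values):

```javascript
// Hypothetical CORRUPTION_PATTERNS entry: a URL property value split across lines.
const EXAMPLE_PATTERN = {
  pattern: /^(url|image|favicon|og_screenshot_url|og_image):[ \t]*\S+\r?\n[ \t]+\S/m,
  messageToLog: 'URL value appears to be split across multiple lines',
  preventsOperations: ['openGraphFetch', 'citationRendering'],
  correctionFunction: 'reconstructSplitUrl',
  isCritical: false
};
```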

3.3 File Processing Workflow

The system follows this processing sequence (a condensed sketch appears after the list):
  1. Initialization
    • Configure directories and output paths
    • Initialize tracking arrays and maps
  2. File Discovery
    • Recursively search for markdown files in the content directory
    • Track total file count
  3. File Analysis and Correction
    • For each file:
      • Read file content
      • Check for empty files
      • Check frontmatter structure
      • Apply each pattern to detect issues
      • Attempt corrections when possible
      • Track results of correction attempts
      • Categorize file based on results
  4. Report Generation
    • Compile statistics on processing results
    • Format detailed reports for each category
    • Write reports to designated output files
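
A condensed sketch of this sequence (assumed helper names; the real script also checks for empty files, attempts corrections, and writes the three reports described in section 4):

```javascript
const fs = require('fs');
const path = require('path');

// CORRUPTION_PATTERNS would be the table described in section 3.2.
const CORRUPTION_PATTERNS = [];

// Recursively collect markdown files under the content directory.
function findMarkdownFiles(dir) {
  return fs.readdirSync(dir, { withFileTypes: true }).flatMap((entry) => {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) return findMarkdownFiles(fullPath);
    return entry.name.endsWith('.md') ? [fullPath] : [];
  });
}

// Analyze each file in turn, isolating per-file errors and reporting progress.
function prescreen(contentDir) {
  const results = { clean: [], minor: [], critical: [] };
  const files = findMarkdownFiles(contentDir);
  files.forEach((filePath, index) => {
    if (index > 0 && index % 100 === 0) {
      console.log(`Processed ${index}/${files.length} files...`);
    }
    try {
      const content = fs.readFileSync(filePath, 'utf8');
      const matched = CORRUPTION_PATTERNS.filter((p) => p.pattern.test(content));
      if (matched.length === 0) results.clean.push(filePath);
      else if (matched.some((p) => p.isCritical)) results.critical.push(filePath);
      else results.minor.push(filePath);
    } catch (error) {
      // Errors are isolated to the individual file; the loop continues.
      results.critical.push(filePath);
    }
  });
  return results;
}
```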

3.4 Correction Function Implementation

Correction functions follow a consistent pattern:
```javascript
attemptToFixIssue: (filePath, content) => {
  try {
    // 1. Extract frontmatter if needed
    // 2. Apply the specific correction logic
    // 3. Return an object with results:
    return {
      success: boolean,
      message: 'Description of action taken',
      updatedContent: content // If successful
    };
  } catch (error) {
    return {
      success: false,
      message: `Error: ${error.message}`
    };
  }
}
```
These functions provide a uniform interface while implementing specialized logic for each issue type.
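
For example, a quote-balancing correction might look like the following sketch (hypothetical logic applied to the whole file for brevity; a real implementation would restrict itself to the frontmatter block and be more conservative):

```javascript
// Hypothetical quote-balancing correction, following the uniform interface above.
// filePath is unused here but kept for the shared (filePath, content) signature.
const attemptToFixUnbalancedQuotes = (filePath, content) => {
  try {
    // Close any single-line value that opens a double quote but never closes it.
    const fixed = content.replace(
      /^([A-Za-z0-9_-]+):[ \t]*"([^"\n]*)$/gm,
      (fullLine, key, value) => `${key}: "${value.trimEnd()}"`
    );
    if (fixed === content) {
      return { success: false, message: 'No unbalanced quotes found to fix' };
    }
    return {
      success: true,
      message: 'Added missing closing quotes to quoted values',
      updatedContent: fixed
    };
  } catch (error) {
    return { success: false, message: `Error: ${error.message}` };
  }
};
```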

4. Report Format and Content

4.1 Screened-In Files Report

```markdown
---
updated_on: "{timestamp}"
included_file_count: {count}
totally_clean_files: {count}
files_with_minor_inconsistencies: {count}
total_file_count: {count}
successful_corrections: {count}
---

# Markdown Files Screened-In for Processing
Last updated: {timestamp}

## Summary
- Totally clean files: {count}
- Files with minor inconsistencies: {count}
- Total included files: {count} (out of {total} total)
- Successful auto-corrections: {count}

## Files with Minor Inconsistencies
These files have issues that prevent specific operations, but can still be processed by other operations:

- {filePath}
  - Prevented operations: {operationList}
...

## Auto-Correction Attempts
- {filePath}
  - {issue}: {message}
  - {issue}: {message}
...

## Totally Clean Files
These files have no issues and can be fully processed:

- {filePath}
...
```

4.2 Non-Blocking Observations Report

```markdown
---
updated_on: "{timestamp}"
---

# Non-Blocking Observations
Last updated: {timestamp}

## Files with Minor Inconsistencies
These files have issues that might affect some operations but are not critical:

- {filePath}
  - Prevented operations: {operationList}
...

## Operation Impact Summary
The following operations are affected by minor inconsistencies:

- {operation}: {count} files affected
...
```

4.3 Screened-Out Files Report

```markdown
---
updated_on: "{timestamp}"
included_file_count: {count}
total_file_count: {count}
---

# Markdown Files Screened-Out from Processing
Last updated: {timestamp}

These {count} files (out of {total} total) have been screened out due to critical YAML issues:

- {filePath}
  - Correction attempts:
    - {issue}: {message}
    - {issue}: {message}
...
```

5. Integration with Build Pipeline

The prescreening system is designed to be run as an early step in the build process:
  1. Execution timing: Run before any content processing scripts
  2. Exit status: Returns success/failure status for pipeline integration
  3. Report consumption: Downstream build steps can read reports to determine which files to process
  4. Environmental integration: Can be configured via environment variables
The system exposes several exported functions for integration:
```javascript
module.exports = {
  sortFilesIntoTwoArraysBasedOnFilesystemRegex,
  getScreenedInFiles: () => [...],
  getFileOperationMap: () => {...},
  getSuccessfulCorrections: () => count,
  canPerformOperation: (filePath, operation) => boolean
};
```
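
A downstream build step might consume these exports roughly as follows (hypothetical usage; the require path and the operation name are assumptions):

```javascript
// Hypothetical downstream consumer of the prescreening module.
const prescreen = require('./prescreenFilesWithFilesystemRegex.cjs');

const screenedIn = prescreen.getScreenedInFiles();
console.log(`Prescreening passed ${screenedIn.length} files`);
console.log(`Auto-corrections applied: ${prescreen.getSuccessfulCorrections()}`);

for (const filePath of screenedIn) {
  // Only run a given operation on files whose YAML allows it.
  if (prescreen.canPerformOperation(filePath, 'openGraphFetch')) {
    // ... enqueue the file for that downstream step ...
  }
}
```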

6. Performance and Scalability

The system is designed to handle large content repositories:
  • Progress tracking: Reports progress every 100 files
  • Error isolation: Errors in individual files don't stop the overall process
  • Memory efficient: Processes files sequentially to minimize memory footprint
On a repository with 700+ files, processing typically completes in under 30 seconds.

7. Results and Impact

In our production deployment, the script successfully processed:
  • Files Evaluated: 729 files
  • Totally Clean Files: 623 files (85.5%)
  • Files with Minor Inconsistencies: 69 files (9.5%)
  • Files with Critical Issues: 37 files (5.1%)
  • Correction Attempts: 584 instances
These corrections have:
  • Eliminated many build errors related to YAML parsing
  • Reduced manual content fixes needed
  • Improved reliability of the build process
  • Provided visibility into content quality issues

Constraints for AI Code Assistants

When working with this system, AI Code Assistants MUST follow these strict guidelines:

1. Report Generation Imperatives

  • NEVER add validation that prevents report files from being overwritten
  • ALWAYS ensure reports are generated regardless of other errors or conditions
  • NEVER create dependencies between report file generation and other processing steps
  • ALWAYS use direct file writing operations rather than conditional checks before writing

2. Error Handling Directives

  • DO NOT prevent report generation due to errors in processing
  • ISOLATE errors to individual files rather than stopping the entire process
  • LOG errors but continue processing remaining files
  • DO NOT add additional validation logic without explicit permission

3. File Handling Requirements

  • ALWAYS create output directories recursively if they don't exist
  • ALWAYS use the w flag when writing files so existing reports are overwritten
  • IGNORE errors during directory creation
  • DO ensure file handles are properly closed
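
A minimal report-writing sketch that satisfies these requirements (hypothetical helper; the report paths come from the configuration constants in section 3.1):

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: always write the report, overwriting any previous copy.
function writeReport(reportFilePath, reportContent) {
  try {
    // Create the output directory recursively if it does not exist.
    fs.mkdirSync(path.dirname(reportFilePath), { recursive: true });
  } catch (error) {
    // Directory creation errors are intentionally ignored.
  }
  // The 'w' flag truncates and overwrites any existing report file;
  // writeFileSync opens and closes the file handle itself.
  fs.writeFileSync(reportFilePath, reportContent, { encoding: 'utf8', flag: 'w' });
}
```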

4. Pattern Modification Guidelines

  • DO NOT modify existing corruption patterns without explicit approval
  • CAREFULLY TEST any new patterns to avoid false positives
  • MAINTAIN pattern metadata including operations affected and criticality
  • DOCUMENT the purpose and impact of new patterns

5. Integration Rules

  • PRESERVE all exported functions and their signatures
  • MAINTAIN backward compatibility with any code that uses this module
  • DO NOT change the category definitions (clean, minor issues, critical issues)
  • RESPECT the environment variable configuration options

6. URL Property Requirements

  • NEVER add quotes around URL properties (url, image, favicon, og_screenshot_url, og_image, etc.)
  • ALWAYS ensure that any correction function removes quotes from URL values
  • VERIFY that pattern detection properly identifies quoted URLs as issues to be fixed
  • ENSURE all URL-related correction logic preserves unquoted URL formatting
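
A sketch of the intended behavior for URL properties (hypothetical helper; URL_PROPERTIES is the list from section 3.1):

```javascript
// Hypothetical sketch: strip surrounding quotes from URL property values.
const URL_PROPERTIES = ['url', 'image', 'favicon', 'og_screenshot_url', 'og_image'];

function unquoteUrlValues(frontmatterText) {
  const keys = URL_PROPERTIES.join('|');
  const quotedUrl = new RegExp(`^(${keys}):[ \\t]*['"](.*?)['"][ \\t]*$`, 'gm');
  // Rewrite `url: "https://example.com/page"` as `url: https://example.com/page`.
  return frontmatterText.replace(quotedUrl, '$1: $2');
}
```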

7. Code Modification Restrictions

When modifying the prescreening system, AI Assistants MUST:
  1. Present a clear explanation of proposed changes
  2. Show exact code modifications
  3. Explain potential impacts on processing and reporting
  4. Be especially conservative with any changes to:
    • File categorization logic
    • Critical issue definitions
    • Report writing functions
    • Exported interface functions
    • URL property handling
These constraints are designed to prevent regressions in a critical early-pipeline component that has direct impact on build reliability.

Implementation Tips

For Engineers

  1. Run in isolation: Test the script against a subset of files first
  2. Compare reports: Verify that reports are generated as expected
  3. Validate corrections: Check a sample of corrected files to ensure changes are appropriate
  4. Monitor performance: For large repositories, track processing time
  5. Set up automation: Include the script in CI/CD pipelines as an early step

For AI Assistants

  1. Reference patterns: When diagnosing issues, reference the specific patterns that detected them
  2. Suggest focused corrections: Recommend specific changes to fix issues
  3. Provide context: Explain how an issue might affect downstream processing
  4. Be conservative: When in doubt, preserve original content
  5. Explain trade-offs: When suggesting changes, explain the pros and cons

This specification provides a comprehensive framework for understanding, implementing, and extending the YAML Prescreening and Triaging System. It balances the need for rigorous validation with practical considerations for automated content management at scale.