Create a Content Registry for Markdown Files
Executive Summary
The Content Registry system (trackMarkdownFilesInRegistry.cjs) is a critical component of our content management infrastructure. It maintains a centralized, UUID-based registry of all markdown files, tracking their metadata, relationships, and complete history of changes.
Business Impact
- Enables efficient content discovery and relationships
- Provides robust version tracking and change history
- Supports future database migration with UUID-first design
- Maintains data integrity with non-destructive operations
- Creates foundation for advanced content features
Key Features
- UUID-based document identification
- Comprehensive history tracking with ISO timestamps
- Multiple indexing strategies for efficient lookups
- Relationship tracking between documents
- Detailed error reporting and validation
Technical Specification
Architecture Overview
graph TD
A[Markdown Files] --> B[Extract Frontmatter]
B --> C[Process Document]
C --> D[Generate/Verify UUID]
D --> E[Extract Metadata]
E --> F[Build Relationships]
F --> G[Update History]
G --> H[Update Indices]
H --> I[Merge with Registry]
I --> J[Write Registry]
J --> K[Generate Report]
Core Components
1. Registry Data Model
json
{
"documents": {
"[uuid]": {
"referredToAs": {
"primaryFileName": "string",
"aliases": []
},
"urls": {
"siteUrl": "string",
"youtubeChannelUrl": "string",
// ... other URLs
},
"primaryFiles": {
"canonical": {
"path": "string"
},
"document_variants": []
},
"connectedDocuments": {
"connected_documents": [
{
"type": "string",
"reference": "string"
}
]
},
"history": [
{
"timestamp": "ISO-8601",
"type": "event_category",
"action": "specific_action",
"details": {}
}
],
"metadata": {
"siteVisibility": "string",
"semanticVersion": {
"version": "number",
"created_at": "ISO-8601",
"last_modified": "ISO-8601",
"status": "string"
}
}
}
},
"indices": {
"by_filename": {
"[filename]": {
"uuid": "string",
"context": "string",
"is_canonical": "boolean"
}
},
"by_path": {
"[path]": "uuid"
},
"by_uuid": {
"[uuid]": {
"memory": "number",
"timestamp": "ISO-8601"
}
}
}
}
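As a rough illustration of the data model above, a new document entry could be assembled as follows. This is a minimal sketch, not the registry script's actual code: the helper name `createDocumentEntry`, the default `siteVisibility` of "private", the initial `status` of "draft", and the starting version of 1 are all assumptions made for the example.

```javascript
const { randomUUID } = require("node:crypto");
const path = require("node:path");

// Hypothetical helper: builds an empty entry shaped like the data model above.
// Defaults for siteVisibility, status, and version are assumptions.
function createDocumentEntry(filePath, now = new Date().toISOString()) {
  return {
    uuid: randomUUID(), // key under "documents"
    entry: {
      referredToAs: { primaryFileName: path.basename(filePath, ".md"), aliases: [] },
      urls: {},
      primaryFiles: { canonical: { path: filePath }, document_variants: [] },
      connectedDocuments: { connected_documents: [] },
      history: [],
      metadata: {
        siteVisibility: "private", // assumed default
        semanticVersion: { version: 1, created_at: now, last_modified: now, status: "draft" },
      },
    },
  };
}
```

The returned `uuid` would be used as the key in `documents`, with `entry` as its value.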
2. Core Functions
- Document Processing
- UUID generation/verification
- Frontmatter extraction
- Property mapping and normalization
- History entry creation
- Relationship building
- Registry Management
- Non-destructive updates
- Index maintenance
- Version tracking
- Change detection
- Error Handling
- Validation checks
- Error reporting
- Recovery mechanisms
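One way to read "non-destructive updates" is that a merge never drops data already in the registry and never rewrites history. The sketch below illustrates that reading; `mergeEntry` is a hypothetical name, and which nested objects get merged key-by-key is an assumption.

```javascript
// Hypothetical non-destructive merge: existing values survive unless the
// incoming entry supplies a replacement, and history is only appended to.
function mergeEntry(existing, incoming) {
  if (!existing) return incoming;
  return {
    ...existing,
    ...incoming,
    // Merge nested objects key-by-key so unrelated keys are not lost.
    urls: { ...existing.urls, ...incoming.urls },
    metadata: { ...existing.metadata, ...incoming.metadata },
    // Append-only: earlier events are kept in their original order.
    history: [...(existing.history ?? []), ...(incoming.history ?? [])],
  };
}
```

The function returns a new object and leaves both inputs untouched, which makes change detection a matter of comparing the result with `existing`.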
Implementation Details
1. Property Mapping
- Snake case to camel case conversion
- URL property standardization
- Special handling for parent organizations
- Timestamp normalization
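The snake case to camel case step can be sketched as below. The function names are illustrative, and the conversion is shallow (top-level frontmatter keys only), which is an assumption about how far the normalization reaches.

```javascript
// Convert a single snake_case key to camelCase; keys without underscores pass through.
function snakeToCamel(key) {
  return key.replace(/_([a-z0-9])/g, (_, ch) => ch.toUpperCase());
}

// Shallow normalization of frontmatter keys (nested objects are left as-is).
function normalizeKeys(frontmatter) {
  return Object.fromEntries(
    Object.entries(frontmatter).map(([key, value]) => [snakeToCamel(key), value])
  );
}
```

For example, `normalizeKeys({ site_url: "https://example.com" })` yields an object with a `siteUrl` key.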
2. History Tracking
json
{
"history": [
{
"timestamp": "2025-03-17T06:02:15.000Z",
"type": "content_creation",
"action": "initial_creation",
"details": {
"source": "markdown_file",
"path": "/path/to/file.md"
}
},
{
"timestamp": "2025-03-17T06:02:15.000Z",
"type": "reference_update",
"action": "parent_org_linked",
"details": {
"type": "parentOrganization",
"value": "Organization Name",
"source": "frontmatter"
}
}
]
}
Event Types and Actions
- Content Events
- content_creation: Initial document creation
- content_update: Modifications to content
- Example: Adding URLs, changing text
- Metadata Events
- metadata_update: Changes to document metadata
- Actions: version_increment, status_change
- Example: Updating visibility settings
- Reference Events
- reference_update: Changes to document relationships
- Actions: parent_org_linked, parent_org_changed
- Example: Linking parent organizations
- Path Events
- path_change: File location changes
- Example: Document moves or renames
- AI Interaction Events
- ai_interaction: AI service operations
- Example: OpenGraph fetches
History Best Practices
- Timestamps
- Always use ISO 8601 format
- Include timezone information
- Example: 2025-03-17T06:02:15.000Z
- Event Structure
- Chronological order
- Append-only updates
- Detailed context in details object
- Change Tracking
- Record both old and new values
- Include change source
- Track user operations
- Version Control
- Increment on meaningful changes
- Track change rationale
- Maintain status history
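The practices above (ISO 8601 timestamps, append-only updates, old and new values in the details object) could be combined in a small helper like the following. `appendHistory` is a hypothetical name, and the `from`/`to` keys in the usage note are an assumed convention for recording old and new values.

```javascript
// Hypothetical helper: returns a copy of the entry with one event appended.
// Date#toISOString always emits UTC with a trailing "Z", so timezone
// information is included by construction.
function appendHistory(entry, type, action, details = {}, now = new Date()) {
  const event = { timestamp: now.toISOString(), type, action, details };
  return { ...entry, history: [...(entry.history ?? []), event] };
}
```

A path change might then be recorded as `appendHistory(entry, "path_change", "file_moved", { from: oldPath, to: newPath, source: "filesystem" })`, where the action name `file_moved` is only an example.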
3. File Name Handling
javascript
const path = require('path');

// Primary file name: the base name without the '.md' extension
const primaryFileName = path.basename(filePath, '.md');
// Example: 'site/src/content/tooling/AI-Toolkit/Limitless AI.md' -> 'Limitless AI'

// Context path: the last two directory segments (assumes '/'-separated paths)
const context = path.dirname(filePath).split('/').slice(-2).join('/');
// Example: 'site/src/content/tooling/AI-Toolkit/Limitless AI.md' -> 'tooling/AI-Toolkit'

// Index entry stored under indices.by_filename[primaryFileName]
const indexEntry = {
  uuid: documentUuid,
  context: context,
  is_canonical: true
};
Document Relationships and Indexing
1. Document Relationships
json
{
"connectedDocuments": {
"connected_documents": [
{
"type": "parentOrganization",
"reference": "Organization Name"
},
{
"type": "canonical",
"reference": "Primary Document UUID"
}
]
}
}
Relationship Types
- Parent Organizations
- Links to organizational entities
- Maintains clean hierarchy
- Example: Company -> Product
- Canonical References
- Points to primary document
- Handles content variants
- Example: Original -> Translation
- Content Hierarchies
- Supports nested structures
- Maintains parent-child links
- Example: Course -> Lesson
- Alternative Versions
- Tracks document variants
- Links related content
- Example: Draft -> Published
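Because the registry is rebuilt from the same files repeatedly, relationship entries should probably be added idempotently. The following sketch shows one way to do that; `addRelationship` is a hypothetical helper, and treating a `type` plus `reference` pair as the identity of a relationship is an assumption.

```javascript
// Hypothetical helper: adds a relationship unless an identical one exists.
function addRelationship(entry, type, reference) {
  const current = entry.connectedDocuments?.connected_documents ?? [];
  const exists = current.some((rel) => rel.type === type && rel.reference === reference);
  if (exists) return entry; // nothing to change
  return {
    ...entry,
    connectedDocuments: { connected_documents: [...current, { type, reference }] },
  };
}
```

Calling it twice with the same arguments leaves a single `parentOrganization` link in place.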
2. Index Structure
json
{
"indices": {
"by_filename": {
"Document Name": {
"uuid": "32e4500c-1d6b-40ac-8524-b566904e5dc5",
"context": "tooling/Productivity",
"is_canonical": true
}
},
"by_path": {
"/absolute/path/to/file.md": "32e4500c-1d6b-40ac-8524-b566904e5dc5"
},
"by_uuid": {
"32e4500c-1d6b-40ac-8524-b566904e5dc5": {
"memory": 4.0355987548828125,
"timestamp": "2025-03-17T06:02:15.000Z"
}
}
}
}
Index Benefits
- Multiple Access Patterns
- Fast filename lookups
- Efficient path resolution
- Direct UUID access
- Context Awareness
- Directory-based context
- Disambiguation support
- Hierarchical organization
- Performance Optimization
- O(1) lookups by UUID
- Quick path resolution
- Efficient caching
- Data Integrity
- Minimal duplication
- Easy validation
- Clean separation
Index Management
- Filename Index
- Stores document context
- Tracks canonical status
- Supports disambiguation
- Path Index
- Maps absolute paths
- Quick file location
- Efficient updates
- UUID Index
- Primary lookup table
- Performance metrics
- Timestamp tracking
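The three indices might be kept in sync with a single function, as sketched below. The names `indexDocument` and `findUuid` are illustrative. The `memory` field of `by_uuid` is left untouched because its meaning is not specified here, and the sketch assumes '/'-separated paths.

```javascript
const path = require("node:path");

// Hypothetical helper: records one document in all three indices.
function indexDocument(indices, uuid, filePath, isCanonical = true, now = new Date().toISOString()) {
  const fileName = path.basename(filePath, ".md");
  const context = path.dirname(filePath).split("/").slice(-2).join("/");
  indices.by_filename[fileName] = { uuid, context, is_canonical: isCanonical };
  indices.by_path[filePath] = uuid;
  // Preserve any existing fields (such as "memory") and refresh the timestamp.
  indices.by_uuid[uuid] = { ...(indices.by_uuid[uuid] ?? {}), timestamp: now };
  return indices;
}

// Resolve a UUID from either a full path or a bare file name; null if unknown.
function findUuid(indices, nameOrPath) {
  return indices.by_path[nameOrPath] ?? indices.by_filename[nameOrPath]?.uuid ?? null;
}
```

Note that `by_filename` holds one entry per name in this sketch, so two files with the same base name would overwrite each other. The `context` field is what a fuller version would use to disambiguate them.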
Integration Points
1. Build Process
- Part of the master build orchestration
- Pre-build validation
- Post-build reporting
2. Content Management
- Markdown file processing
- Frontmatter standardization
- Relationship mapping
- Version tracking
Error Handling and Reporting
1. Validation Checks
- UUID presence and format
- Required property validation
- URL format verification
- Relationship integrity
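A few of these checks can be sketched as a function that collects errors instead of throwing, so that a report can list every problem in a file at once. The function name, the shape of the error objects, and the assumption that the document carries `uuid` and `urls` properties are all illustrative.

```javascript
// Matches the canonical 8-4-4-4-12 hex layout with a valid version and variant nibble.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[1-8][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

function isValidUrl(value) {
  try {
    new URL(value); // throws a TypeError on unparsable input
    return true;
  } catch {
    return false;
  }
}

// Hypothetical validator: returns a list of problems, empty when the document passes.
function validateDocument(doc, filePath) {
  const errors = [];
  if (!doc.uuid) {
    errors.push({ filePath, field: "uuid", message: "missing UUID" });
  } else if (!UUID_RE.test(doc.uuid)) {
    errors.push({ filePath, field: "uuid", message: `malformed UUID: ${doc.uuid}` });
  }
  for (const [key, value] of Object.entries(doc.urls ?? {})) {
    if (!isValidUrl(value)) {
      errors.push({ filePath, field: `urls.${key}`, message: `invalid URL: ${value}` });
    }
  }
  return errors;
}
```

Each error carries the file path and field name, which gives the error report the file location information it needs.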
2. Error Reports
- Detailed error messages
- File location information
- Suggested fixes
- Impact assessment
Performance Considerations
1. UUID-First Design Benefits
- O(1) document lookups
- Efficient relationship tracking
- Natural sharding capability
- Clean content/index separation
- Duplicate handling support
2. Resource Management
- Memory-efficient operations
- Controlled file I/O
- Proper cleanup procedures
Documentation Requirements
1. Code Documentation
- Function documentation
- Type definitions
- Usage examples
- Error handling guidelines
2. User Documentation
- Configuration options
- Usage instructions
- Troubleshooting guide
- Best practices
Testing Requirements
1. Test Cases
- UUID generation/verification
- Property mapping
- History tracking
- Index management
- Error handling
2. Validation
- Data integrity checks
- Format validation
- Relationship verification
- Index consistency
Security Considerations
1. Data Protection
- Safe file operations
- Error message sanitization
- Input validation
- Access control
2. Error Prevention
- Type checking
- Path validation
- Format verification
- Relationship integrity
Maintenance and Support
1. Monitoring
- Error tracking
- Performance metrics
- Usage statistics
- Health checks
2. Updates
- Version compatibility
- Data migration
- Schema evolution
- Feature additions
DataStore/Registry Handling for Content-Wide Syntax (Draft Guidance)
Some classes of content observation—such as citations, media links, embeds, and images—require a persistent registry ("dataStore") in the form of a JSON file. This registry tracks all unique instances of specific syntax across the entire content library.
Why Use a Registry?
- De-duplication and normalization: Ensures each unique reference (e.g., a citation, image, or media embed) is tracked once, even if referenced in multiple files.
- Cross-file analytics: Enables reporting and analysis of usage patterns, orphaned references, and content relationships.
- Atomic updates: Guarantees that registry changes are never left in a partial or corrupted state.
- Extensibility: New content types (e.g., images, embeds) can adopt the same registry pattern as citations.
General Principles
- Single Source of Truth: Each registry must be a single, well-known JSON file (e.g., site/src/content/citations/citation-registry.json).
- Schema-Driven: Every registry should have a documented, versioned schema/interface, validated on every update.
- Idempotency: Re-processing the same file/content must not introduce duplicates or inconsistent state.
- Atomicity: Updates must be atomic; never leave a registry in a partially written state.
- Extensibility: New registry types (e.g., for images, media, embeds) should follow the same service pattern as citations.
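The atomicity principle is commonly met by writing to a temporary file and renaming it over the target. The sketch below uses that technique; it is one possible approach rather than a prescribed one, and the function name and temp-file naming scheme are assumptions. The rename is only atomic when both paths are on the same filesystem, which is why the temporary file is created next to the target.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: readers see either the old file or the new one,
// never a half-written registry.
function writeRegistryAtomically(registryPath, registry) {
  const tmpPath = path.join(
    path.dirname(registryPath),
    `.${path.basename(registryPath)}.${process.pid}.tmp`
  );
  try {
    fs.writeFileSync(tmpPath, JSON.stringify(registry, null, 2) + "\n", "utf8");
    fs.renameSync(tmpPath, registryPath); // replaces the target in one step
  } catch (err) {
    fs.rmSync(tmpPath, { force: true }); // clean up the partial temp file
    throw err;
  }
}
```

This addresses partial writes only. It does not by itself coordinate multiple writers.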
Example: Citation Registry
File: site/src/content/citations/citation-registry.json
Interface:
typescript
interface CitationRegistry {
sources: Record<string, { title: string; author: string; year: number; url: string }>;
citations: Record<string, Array<{ hex: string; context: string }>>;
}
Service Pattern:
- Singleton pattern for registry access (e.g., CitationRegistry.getInstance())
- Methods for adding, updating, and saving citations
- Loading and saving to disk with error handling
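A minimal sketch of this service pattern in plain JavaScript might look like the following. It is not the existing citation service: the key choices (sources keyed by citation hex, citations keyed by referencing file path) are assumptions about the interface above, and the method names are borrowed from the implementation checklist.

```javascript
const fs = require("node:fs");

let instance = null; // module-level slot holding the single shared instance

class CitationRegistry {
  constructor(registryPath) {
    this.registryPath = registryPath;
    this.data = { sources: {}, citations: {} };
  }

  // The path is only honoured on the first call; later calls return the same object.
  static getInstance(registryPath = "site/src/content/citations/citation-registry.json") {
    if (!instance) instance = new CitationRegistry(registryPath);
    return instance;
  }

  loadFromDisk() {
    if (fs.existsSync(this.registryPath)) {
      this.data = JSON.parse(fs.readFileSync(this.registryPath, "utf8"));
    }
    return this;
  }

  // Assumption: sources keyed by hex, citations keyed by referencing file path.
  addEntry(hex, source, filePath, context) {
    this.data.sources[hex] = { ...this.data.sources[hex], ...source };
    if (!this.data.citations[filePath]) this.data.citations[filePath] = [];
    const list = this.data.citations[filePath];
    const seen = list.some((c) => c.hex === hex && c.context === context);
    if (!seen) list.push({ hex, context }); // idempotent on re-processing
  }

  getEntry(hex) {
    return this.data.sources[hex];
  }

  saveToDisk() {
    fs.writeFileSync(this.registryPath, JSON.stringify(this.data, null, 2), "utf8");
  }
}
```

`loadFromDisk` lets a JSON parse error propagate to the caller, so the caller decides how to handle a corrupted registry.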
Example: Media/Image Registry (Proposed)
File: site/src/content/media/media-registry.json
Interface:
typescript
interface MediaRegistry {
media: Record<string, {
type: 'image' | 'video' | 'audio' | 'embed';
url: string;
files: string[]; // Markdown files where this media appears
metadata?: Record<string, any>;
dateCreated: string;
dateUpdated: string;
}>;
}
Service Pattern:
- Singleton and atomic update pattern as with citations
- On file observation, extract all media links/embeds, normalize, and update registry
- Always update the files array to include the referencing markdown file
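The update rule for the files array could be sketched like this. The function operates on a plain registry object rather than a service instance to keep the example short. Using the trimmed URL as the registry key is an assumption, since the normalization rules are not defined here.

```javascript
// Hypothetical update step: idempotent for repeated observations of the same file.
function addOrUpdateMedia(registry, media, filePath, now = new Date().toISOString()) {
  const key = media.url.trim(); // assumed normalization: whitespace trimming only
  const existing = registry.media[key];
  if (!existing) {
    registry.media[key] = {
      type: media.type,
      url: key,
      files: [filePath],
      dateCreated: now,
      dateUpdated: now,
    };
    return registry;
  }
  if (!existing.files.includes(filePath)) {
    existing.files.push(filePath); // record the new referencing file once
    existing.dateUpdated = now;
  }
  return registry;
}
```

Re-processing an unchanged file leaves the entry, including `dateUpdated`, exactly as it was.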
Implementation Checklist
- Registry Service
- Each registry (citations, media, etc.) must have a dedicated service (e.g., citationService.ts, mediaService.ts).
- Service must provide: addEntry, updateEntry, getEntry, saveToDisk, loadFromDisk.
- Template Configuration
- Templates that require registry updates must declare the registry path and config in their template definition (see citationConfig in citations.ts).
- Observer Integration
- On file event, observer extracts relevant syntax (citations, media, etc.).
- Calls the appropriate service to update the registry.
- All registry updates are logged in the reporting service.
- Error Handling
- If the registry file is locked/corrupted, log the error, skip the update, and flag for manual intervention.
- Never block the entire observer pipeline due to registry errors—fail gracefully.
- Reporting
- Registry changes (new entries, updates, removals) must be summarized in the period-based report.
- Include before/after snapshots or diffs for transparency.
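The error-handling item in the checklist above (log, skip, flag, never block the pipeline) might translate into a loader that reports failure through its return value instead of throwing. The function name, the result shape, and the `needsManualReview` flag are assumptions. This sketch also treats a missing file the same way as a corrupted one, which a real service may want to distinguish.

```javascript
const fs = require("node:fs");

// Hypothetical loader: never throws, so one bad registry cannot stop the observer.
function loadRegistrySafely(registryPath, logError = console.error) {
  try {
    const registry = JSON.parse(fs.readFileSync(registryPath, "utf8"));
    return { ok: true, registry };
  } catch (err) {
    logError(`Registry ${registryPath} unavailable, skipping update: ${err.message}`);
    return { ok: false, needsManualReview: true };
  }
}
```

A caller would check `ok` and, when it is false, skip the registry update for this event and continue with the remaining observers.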
Example Registry Update Flow (Pseudocode)
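typescript
// On file change event:
const fileMediaLinks = extractMediaLinks(fileContent);
for (const link of fileMediaLinks) {
  // Register (or refresh) the entry and record which file references it
  mediaRegistryService.addOrUpdateMedia(link, filePath);
}
// Persist once per file event, after all links are registered
await mediaRegistryService.saveToDisk();
// Log each update only after the save succeeded (`link` is scoped to a loop)
for (const link of fileMediaLinks) {
  reportingService.logRegistryUpdate('media', link, filePath);
}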
Open Questions
- Should registry updates be batched and flushed at intervals, or written immediately?
- How to handle concurrent updates (e.g., via multiple observer processes)?
- Should registries include a changelog/history for auditability?
This section is intended as a living draft and should be refined as the first registry-backed observer (e.g., citations) is stabilized and new content types are added.