Report Template Service
Report Template Service
1. Executive Summary
The Report Template Service provides standardized reporting templates focused on operational health across the Augment-It platform's distributed architecture. This service generates concise, actionable summaries highlighting critical issues like error spikes, performance bottlenecks, and resource consumption problems that require immediate attention.
The service integrates with the Log Assembler Service to automatically generate reports on:
- Error Summaries: Grouped errors and failure patterns
- Performance Issues: Slow operations and user wait times
- Resource Problems: Memory usage spikes and CPU bottlenecks
- User Impact: Operations that frustrate or block users
2. Service Overview
Core Focus Areas
- Error Summaries
- Grouped error patterns across services
- Critical errors affecting multiple users
- New error types that just appeared
- Services with high error rates
- Performance Bottlenecks
- Operations taking too long (>2-5 seconds)
- Database queries running slow
- API calls timing out
- Module Federation load times
- Resource Issues
- Memory usage spikes
- CPU usage sustained above 80%
- Container restart patterns
- Services hitting resource limits
- User Impact
- Features users can't access
- Operations that make users wait
- Repeated user retry patterns
- Failed user workflows
Key Features
- Simple Templates: Focus on "what's broken" and "what's slow"
- Automated Generation: Reports triggered by thresholds
- Action-Oriented: Each report includes next steps
- Multi-Format Output: Slack, email, dashboard widgets
- Historical Trending: "Getting better" or "getting worse"
3. Report Templates
Template 1: Error Summary Report
yaml
template_id: error-summary
name: "System Error Summary"
trigger:
- error_rate > 5/minute
- new_error_pattern_detected
- critical_service_down
format:
title: "🚨 Error Summary - {{timeRange}}"
sections:
- type: alert_summary
content: |
**Critical Issues:** {{criticalCount}}
**New Errors:** {{newErrorCount}}
**Affected Users:** {{affectedUserCount}}
**Worst Service:** {{worstService}} ({{worstServiceErrorRate}}% errors)
- type: error_list
limit: 5
content: |
**Top Errors:**
{{#each topErrors}}
• **{{service}}**: {{message}} ({{count}} times)
- First seen: {{firstSeen}}
- Affects: {{affectedUsers}} users
{{/each}}
- type: action_items
content: |
**Immediate Actions:**
{{#if criticalErrors}}
• 🔥 **CRITICAL**: Check {{criticalService}} - service may be down
{{/if}}
{{#if newErrors}}
• 🆕 **NEW**: Investigate new error in {{newErrorService}}
{{/if}}
{{#if highErrorRate}}
• ⚠️ **HIGH RATE**: {{highErrorRateService}} needs attention
{{/if}}
example_output: |
🚨 Error Summary - Last 30 minutes
**Critical Issues:** 2
**New Errors:** 1
**Affected Users:** 47
**Worst Service:** prompt-manager (12% errors)
**Top Errors:**
• **user-auth-service**: JWT token expired (23 times)
- First seen: 2 minutes ago
- Affects: 23 users
• **api-connector**: OpenAI API timeout (15 times)
- First seen: 15 minutes ago
- Affects: 15 users
**Immediate Actions:**
• 🆕 **NEW**: Investigate new error in prompt-manager
• ⚠️ **HIGH RATE**: api-connector needs attention Template 2: Performance Issues Report
yaml
template_id: performance-issues
name: "Performance Issues Summary"
trigger:
- avg_response_time > 3000ms
- memory_usage > 85%
- cpu_usage > 80%
- slow_query_detected
format:
title: "🐌 Performance Issues - {{timeRange}}"
sections:
- type: performance_summary
content: |
**Slow Operations:** {{slowOperationCount}}
**Memory Issues:** {{memoryIssueCount}} services
**Slowest Service:** {{slowestService}} ({{slowestTime}}ms avg)
**Users Waiting:** {{usersAffected}} experiencing delays
- type: slow_operations
limit: 5
content: |
**Operations Taking Too Long:**
{{#each slowOperations}}
• **{{service}}**: {{operation}} ({{avgTime}}ms)
- Normal time: {{normalTime}}ms
- {{affectedRequests}} requests affected
{{/each}}
- type: resource_issues
content: |
**Resource Problems:**
{{#each resourceIssues}}
• **{{service}}**: {{resourceType}} at {{usage}}%
- Trend: {{trend}}
- Action needed: {{action}}
{{/each}}
example_output: |
🐌 Performance Issues - Last hour
**Slow Operations:** 3
**Memory Issues:** 2 services
**Slowest Service:** insight-assembler (4.2s avg)
**Users Waiting:** 12 experiencing delays
**Operations Taking Too Long:**
• **insight-assembler**: Generate insight report (4200ms)
- Normal time: 800ms
- 8 requests affected
• **api-connector**: Claude API call (3100ms)
- Normal time: 1200ms
- 15 requests affected
**Resource Problems:**
• **prompt-manager**: Memory at 91%
- Trend: Increasing
- Action needed: Check for memory leaks Template 3: User Impact Report
yaml
template_id: user-impact
name: "User Impact Summary"
trigger:
- user_retry_rate > 20%
- feature_unavailable
- user_wait_time > 5000ms
format:
title: "👥 User Impact Summary - {{timeRange}}"
sections:
- type: impact_summary
content: |
**Users Affected:** {{totalUsersAffected}}
**Features Broken:** {{brokenFeatureCount}}
**User Retries:** {{retryCount}} ({{retryRate}}%)
**Longest Wait:** {{longestWait}}s for {{slowestFeature}}
- type: broken_features
content: |
**Features Users Can't Access:**
{{#each brokenFeatures}}
• **{{feature}}**: {{issue}}
- Users affected: {{userCount}}
- Since: {{duration}} ago
{{/each}}
- type: user_frustration
content: |
**User Frustration Indicators:**
{{#each frustrationPoints}}
• {{description}}
- Pattern: {{pattern}}
- Impact: {{impact}}
{{/each}}
example_output: |
👥 User Impact Summary - Last 2 hours
**Users Affected:** 34
**Features Broken:** 1
**User Retries:** 67 (23%)
**Longest Wait:** 8.3s for AI response generation
**Features Users Can't Access:**
• **Template Library**: Database connection failed
- Users affected: 12
- Since: 45 minutes ago
**User Frustration Indicators:**
• Users clicking "Generate" button multiple times
- Pattern: 15 users, avg 3 clicks
- Impact: AI requests backing up Template 4: Resource Alert Report
yaml
template_id: resource-alert
name: "Resource Alert Summary"
trigger:
- container_restart_count > 3
- memory_usage > 90%
- disk_usage > 85%
- pod_evicted
format:
title: "⚡ Resource Alert - {{timeRange}}"
sections:
- type: resource_summary
content: |
**Services at Risk:** {{atRiskCount}}
**Container Restarts:** {{restartCount}}
**Memory Pressure:** {{memoryPressureServices}} services
**Immediate Action Required:** {{actionRequired}}
- type: resource_details
content: |
**Resource Problems:**
{{#each resourceProblems}}
• **{{service}}** ({{container}})
- {{resourceType}}: {{currentUsage}} (limit: {{limit}})
- Trend: {{trend}} over {{timeframe}}
- Risk: {{riskLevel}}
{{/each}}
- type: actions
content: |
**Required Actions:**
{{#each actions}}
• {{priority}} **{{service}}**: {{action}}
{{/each}}
example_output: |
⚡ Resource Alert - Current
**Services at Risk:** 2
**Container Restarts:** 4
**Memory Pressure:** 3 services
**Immediate Action Required:** YES
**Resource Problems:**
• **insight-assembler** (pod-xyz-123)
- Memory: 1.8GB (limit: 2GB)
- Trend: +200MB over 30min
- Risk: HIGH - approaching limit
• **api-connector** (pod-abc-456)
- CPU: 850m (limit: 1000m)
- Trend: sustained high over 20min
- Risk: MEDIUM - performance impact
**Required Actions:**
• 🔥 **insight-assembler**: Increase memory limit or investigate leak
• ⚠️ **api-connector**: Check for CPU-intensive operations 4. Technical Implementation
Core Service Architecture
typescript
export class ReportTemplateService {
private logAssembler: LogAssemblerClient;
private templates: Map<string, ReportTemplate>;
private triggers: Map<string, TriggerCondition[]>;
constructor() {
this.logAssembler = new LogAssemblerClient();
this.templates = this.loadTemplates();
this.triggers = this.setupTriggers();
// Check for triggered reports every minute
setInterval(() => this.checkTriggers(), 60000);
}
async checkTriggers(): Promise<void> {
const currentMetrics = await this.logAssembler.getCurrentMetrics();
for (const [templateId, triggers] of this.triggers.entries()) {
const triggeredConditions = triggers.filter(trigger =>
this.evaluateTrigger(trigger, currentMetrics)
);
if (triggeredConditions.length > 0) {
await this.generateReport(templateId, currentMetrics, triggeredConditions);
}
}
}
async generateReport(
templateId: string,
metrics: SystemMetrics,
triggers: TriggerCondition[]
): Promise<GeneratedReport> {
const template = this.templates.get(templateId);
if (!template) throw new Error(`Template ${templateId} not found`);
// Gather data based on template requirements
const reportData = await this.gatherReportData(template, metrics);
// Generate report content
const report = await this.renderTemplate(template, reportData);
// Determine severity and recipients
const severity = this.calculateSeverity(triggers, reportData);
const recipients = this.getRecipients(severity, template.channels);
// Send the report
await this.distributeReport(report, recipients, severity);
return report;
}
private async gatherReportData(
template: ReportTemplate,
metrics: SystemMetrics
): Promise<ReportData> {
const timeRange = template.timeRange || '30m';
// Get error data from Log Assembler
const errors = await this.logAssembler.getErrorPatterns(timeRange);
const performance = await this.logAssembler.getPerformanceMetrics(timeRange);
const resources = await this.logAssembler.getResourceMetrics(timeRange);
return {
errors: this.processErrorData(errors),
performance: this.processPerformanceData(performance),
resources: this.processResourceData(resources),
userImpact: await this.calculateUserImpact(errors, performance),
timeRange,
timestamp: new Date().toISOString(),
};
}
private processErrorData(errors: ErrorPattern[]): ProcessedErrorData {
const critical = errors.filter(e => e.severity === 'critical');
const newErrors = errors.filter(e =>
Date.now() - new Date(e.firstOccurrence).getTime() < 3600000 // 1 hour
);
const topErrors = errors
.sort((a, b) => b.count - a.count)
.slice(0, 5);
return {
criticalCount: critical.length,
newErrorCount: newErrors.length,
totalErrors: errors.length,
topErrors: topErrors.map(error => ({
service: error.affectedServices[0] || 'unknown',
message: this.simplifyErrorMessage(error.signature),
count: error.count,
firstSeen: this.formatTime(error.firstOccurrence),
affectedUsers: error.affectedTraces.size,
})),
worstService: this.findWorstService(errors),
};
}
private simplifyErrorMessage(signature: string): string {
// Convert technical error signatures into user-friendly messages
const patterns = {
'jwt.*expired': 'JWT token expired',
'timeout.*api': 'API call timeout',
'memory.*limit': 'Memory limit exceeded',
'connection.*refused': 'Database connection failed',
'module.*federation.*load': 'Module failed to load',
};
for (const [pattern, message] of Object.entries(patterns)) {
if (new RegExp(pattern, 'i').test(signature)) {
return message;
}
}
return signature; // fallback to original
}
} Template Engine
typescript
export class TemplateRenderer {
private handlebars: typeof Handlebars;
constructor() {
this.handlebars = Handlebars;
this.registerHelpers();
}
private registerHelpers(): void {
// Helper for formatting time ranges
this.handlebars.registerHelper('timeAgo', (timestamp: string) => {
const now = Date.now();
const time = new Date(timestamp).getTime();
const diff = Math.floor((now - time) / 1000);
if (diff < 60) return `${diff} seconds ago`;
if (diff < 3600) return `${Math.floor(diff / 60)} minutes ago`;
return `${Math.floor(diff / 3600)} hours ago`;
});
// Helper for severity indicators
this.handlebars.registerHelper('severityIcon', (severity: string) => {
const icons = {
critical: '🔥',
high: '🚨',
medium: '⚠️',
low: '📋',
};
return icons[severity] || '📋';
});
// Helper for trend indicators
this.handlebars.registerHelper('trendIcon', (trend: string) => {
const icons = {
increasing: '📈',
decreasing: '📉',
stable: '➡️',
};
return icons[trend] || '➡️';
});
}
async renderTemplate(template: ReportTemplate, data: ReportData): Promise<string> {
const compiled = this.handlebars.compile(template.format.content);
return compiled(data);
}
} 5. Integration with Log Assembler
Data Flow
sequenceDiagram
participant LA as Log Assembler
participant RT as Report Template Service
participant Alert as Alert System
participant User as Operations Team
Note over LA, User: Automated Report Generation
LA->>RT: Metrics exceed threshold
Note right of LA: Error rate > 5/min
RT->>LA: Request detailed data
Note right of RT: Last 30 min errors
LA->>RT: Return error patterns, performance data
Note right of LA: Grouped by service, impact
RT->>RT: Generate report using template
Note right of RT: Apply "error-summary" template
RT->>Alert: Send formatted report
Note right of RT: Slack, email, dashboard
Alert->>User: Notify with actionable summary
Note right of Alert: "🚨 prompt-manager has 12% error rate"
User->>LA: Investigate specific errors
Note right of User: Click through to detailed logs
6. Report Distribution
Output Channels
typescript
export class ReportDistributor {
private channels: Map<string, ReportChannel>;
constructor() {
this.channels = new Map([
['slack', new SlackChannel()],
['email', new EmailChannel()],
['dashboard', new DashboardChannel()],
['webhook', new WebhookChannel()],
]);
}
async distributeReport(
report: GeneratedReport,
recipients: string[],
severity: 'low' | 'medium' | 'high' | 'critical'
): Promise<void> {
const channels = this.selectChannels(severity);
for (const channelType of channels) {
const channel = this.channels.get(channelType);
if (channel) {
await channel.send(report, recipients, severity);
}
}
}
private selectChannels(severity: string): string[] {
switch (severity) {
case 'critical':
return ['slack', 'email']; // Immediate notification
case 'high':
return ['slack', 'dashboard'];
case 'medium':
return ['dashboard'];
default:
return ['dashboard']; // Low priority
}
}
} Slack Integration
typescript
export class SlackChannel implements ReportChannel {
private webhook: string;
constructor() {
this.webhook = process.env.SLACK_WEBHOOK_URL!;
}
async send(
report: GeneratedReport,
recipients: string[],
severity: string
): Promise<void> {
const color = this.getSeverityColor(severity);
const icon = this.getSeverityIcon(severity);
const message = {
text: `${icon} ${report.title}`,
attachments: [{
color,
text: this.formatForSlack(report.content),
footer: `Generated at ${new Date().toLocaleString()}`,
mrkdwn_in: ['text'],
}],
channel: this.getChannel(severity),
};
await fetch(this.webhook, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(message),
});
}
private formatForSlack(content: string): string {
// Convert markdown-style formatting to Slack format
return content
.replace(/\*\*(.*?)\*\*/g, '*$1*') // Bold
.replace(/• /g, '• ') // Keep bullets
.substring(0, 3000); // Slack message limit
}
private getSeverityColor(severity: string): string {
const colors = {
critical: '#FF0000',
high: '#FF8C00',
medium: '#FFD700',
low: '#32CD32',
};
return colors[severity] || colors.low;
}
} 7. Configuration
yaml
# Report Template Service Configuration
service:
name: report-template-service
port: 8090
templates:
error-summary:
enabled: true
triggers:
- error_rate > 5/minute
- critical_error_detected
- new_error_pattern
channels: [slack, dashboard]
performance-issues:
enabled: true
triggers:
- avg_response_time > 3000ms
- memory_usage > 85%
- cpu_usage > 80%
channels: [dashboard]
user-impact:
enabled: true
triggers:
- user_retry_rate > 20%
- feature_unavailable
channels: [slack, email]
resource-alert:
enabled: true
triggers:
- container_restart_count > 3
- memory_usage > 90%
channels: [slack, email]
channels:
slack:
webhook_url: ${SLACK_WEBHOOK_URL}
channels:
critical: "#alerts-critical"
high: "#alerts-high"
medium: "#alerts-medium"
email:
smtp_host: ${SMTP_HOST}
recipients:
critical: ["ops@company.com", "oncall@company.com"]
high: ["ops@company.com"]
dashboard:
endpoint: "http://grafana:3000/api/alerts"
integrations:
log_assembler:
url: "http://log-assembler:9090"
timeout: 5000ms
thresholds:
error_rate: 5 # errors per minute
response_time: 3000 # milliseconds
memory_usage: 85 # percentage
cpu_usage: 80 # percentage
retry_rate: 20 # percentage This focused Report Template Service gives you exactly what you need - simple, actionable summaries of the problems that actually matter: errors that are breaking things, performance issues slowing users down, and resource problems that could cause outages. Each report tells you what's wrong and what to do about it, without overwhelming detail.