Observability: Seeing Inside Your ERP#
Modern ERP systems generate enormous amounts of operational data—transaction logs, performance metrics, error traces, and business events. Without proper observability, this data is noise. With the right monitoring and observability stack, it becomes actionable intelligence.
The difference between monitoring and observability:
Monitoring answers "is the system healthy?" through predefined metrics and alerts.
Observability answers "why is the system behaving this way?" through exploration of metrics, logs, and traces.
ERP systems need both.
---
The Three Pillars of Observability#
Metrics#
Numerical measurements collected at regular intervals.
System metrics:
- CPU utilisation
- Memory usage
- Disk I/O
- Network throughput

Application metrics:
- Transaction response time
- Error rate
- Request throughput
- Queue depth

Business metrics:
- Orders processed per hour
- Invoice generation time
- Batch job duration
- User session count
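As a rough illustration of how the application metrics above might be derived from raw request samples, the sketch below computes throughput, error rate, and a 95th-percentile response time over one collection window. The `RequestSample` type, the `summarise` function, and the nearest-rank percentile method are assumptions made for this example, not part of any particular ERP or monitoring product.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    """One observed ERP transaction (hypothetical shape for this sketch)."""
    duration_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarise(samples, window_seconds):
    """Reduce the samples from one collection window to three application metrics."""
    if not samples:
        raise ValueError("no samples in this window")
    durations = [s.duration_ms for s in samples]
    errors = sum(1 for s in samples if not s.ok)
    return {
        "throughput_per_s": len(samples) / window_seconds,
        "error_rate": errors / len(samples),
        "p95_ms": percentile(durations, 95),
    }
```

A percentile is used here rather than an average because a small number of very slow transactions can hide behind a healthy-looking mean.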
Logs#
Timestamped records of events.
Log types:
- Application logs (errors, warnings, info)
- Access logs (user actions, API calls)
- Audit logs (security, compliance)
- System logs (OS, database)

Log management challenges:
- Volume (ERP generates massive logs)
- Format (structured vs. unstructured)
- Retention (compliance vs. cost)
- Search (finding relevant logs)
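The format and search challenges above are usually easier to handle when logs are emitted as structured records. The following is a minimal sketch using Python's standard `logging` module to write one JSON object per line. The field names `user_id`, `transaction_id`, and `erp_module` are hypothetical and would need to match whatever your log platform indexes.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    # Hypothetical context fields, supplied by callers through `extra=`.
    CONTEXT_FIELDS = ("user_id", "transaction_id", "erp_module")

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in self.CONTEXT_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger
```

A call such as `logger.info("Invoice posted", extra={"transaction_id": "TX-1"})` then produces a record that can be filtered by `transaction_id` without parsing free text.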
Traces#
End-to-end tracking of requests through distributed systems.
Useful for:
- Identifying latency bottlenecks
- Understanding integration dependencies
- Debugging distributed transactions
- Performance optimisation
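To make the idea of a trace concrete, here is a toy, standard-library-only sketch in which nested spans share one trace ID and record their own duration. It is an illustration of the concept only: a real deployment would more likely use a tracing tool such as those listed later in this article, and the names `span`, `finished_spans`, `slowest_span`, and the step names in `demo` are all invented for this example.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar

_current_span = ContextVar("current_span", default=None)
finished_spans = []  # in-memory store; a real tracer would export these


@contextmanager
def span(name):
    """Time a unit of work and link it to the enclosing span, if any."""
    parent = _current_span.get()
    record = {
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
    }
    token = _current_span.set(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        _current_span.reset(token)
        finished_spans.append(record)


def slowest_span(spans):
    """Return the slowest child span, or the slowest span if none have a parent."""
    children = [s for s in spans if s["parent_id"] is not None]
    return max(children or spans, key=lambda s: s["duration_ms"])


def demo():
    """Simulate one order request with two downstream steps (sleep stands in for work)."""
    with span("create_sales_order"):
        with span("check_inventory"):
            time.sleep(0.01)
        with span("post_to_ledger"):
            time.sleep(0.03)
    return slowest_span(finished_spans)
```

Calling `demo()` returns the span for the slower downstream step, which is the kind of question ("where did the time go?") that traces are meant to answer.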
---
Monitoring Strategy#
What to Monitor#
Infrastructure:
- Server health
- Database performance
- Network connectivity
- Storage capacity

Application:
- Response times
- Error rates
- Throughput
- Resource consumption

Business processes:
- Batch job status
- Integration health
- User activity
- Transaction volumes
Alerting Strategy#
Alert severity levels:
Critical (Page immediately):
- ERP system down
- Database unavailable
- Critical integration failure
- Data corruption detected

Warning (Investigate soon):
- Performance degradation
- Disk space low
- Error rate elevated
- Batch job delayed

Info (Review regularly):
- Capacity trends
- Usage patterns
- Minor errors
- Configuration changes
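One possible way to encode the three severity levels is a small threshold table that maps a metric reading to a severity. The sketch below assumes that each metric has a warning and a critical threshold; the metric names and numeric limits are placeholders chosen for illustration and should be replaced with values derived from your own baselines.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "page immediately"
    WARNING = "investigate soon"
    INFO = "review regularly"


# Hypothetical thresholds; higher readings are worse for every metric listed.
THRESHOLDS = {
    "disk_used_pct": {"warning": 80.0, "critical": 95.0},
    "error_rate_pct": {"warning": 1.0, "critical": 5.0},
    "batch_delay_min": {"warning": 15.0, "critical": 60.0},
}


def classify(metric, value):
    """Map a metric reading to a severity using the table above."""
    limits = THRESHOLDS[metric]  # raises KeyError for an unknown metric
    if value >= limits["critical"]:
        return Severity.CRITICAL
    if value >= limits["warning"]:
        return Severity.WARNING
    return Severity.INFO
```

Keeping the thresholds in data rather than scattered through code makes them easier to review and adjust as the system's normal behaviour changes.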
ANZ-Specific Alerting#
Time zone considerations:
- Alert routing based on ANZ business hours
- Follow-the-sun support for global operations
- On-call rotation aligned with ANZ time zones

Business cycle alerts:
- End of financial year (typically 30 June in Australia, 31 March in New Zealand)
- Month-end processing
- Payroll processing windows
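The business cycle windows above can be expressed as simple date checks, so that alerting can be tightened during sensitive periods. The sketch below assumes the standard year-end dates of 30 June (Australia) and 31 March (New Zealand); the window lengths and the function names are illustrative choices, and payroll windows are left out because they are organisation-specific.

```python
import calendar
from datetime import date, timedelta

# Standard financial year-end (month, day); some organisations use a different date.
FINANCIAL_YEAR_END = {"AU": (6, 30), "NZ": (3, 31)}


def is_month_end_window(day, window_days=3):
    """True during the final `window_days` days of the calendar month."""
    last_day = calendar.monthrange(day.year, day.month)[1]
    return last_day - day.day < window_days


def is_year_end_window(day, country, window_days=14):
    """True during the `window_days` days up to and including financial year-end."""
    month, dom = FINANCIAL_YEAR_END[country]
    year_end = date(day.year, month, dom)
    return timedelta(0) <= year_end - day < timedelta(days=window_days)


def alert_sensitivity(day, country):
    """Describe how sensitive alerting should be on the given date."""
    if is_year_end_window(day, country):
        return "heightened: financial year-end"
    if is_month_end_window(day):
        return "heightened: month-end"
    return "normal"
```

For example, `alert_sensitivity(date(2025, 6, 20), "AU")` falls inside the year-end window, while the same date for `"NZ"` is treated as normal.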
---
Tools and Technologies#
Commercial Options#
APM tools:
- Dynatrace
- New Relic
- AppDynamics
- Datadog

Infrastructure monitoring:
- Datadog
- Splunk
- LogicMonitor
- SolarWinds

Log management:
- Splunk
- Elastic Stack (ELK)
- Datadog Logs
- Sumo Logic
Open Source Options#
Monitoring:
- Prometheus + Grafana
- Nagios
- Zabbix

Logging:
- Elastic Stack (ELK)
- Loki + Grafana
- Graylog

Tracing:
- Jaeger
- Zipkin
---
Dashboards That Matter#
Executive Dashboard#
Audience: C-level, senior management
Content:
- System availability (SLA)
- Business transaction volumes
- Key performance indicators
- Trend visualisation
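The availability figure on an executive dashboard is normally a simple ratio of uptime to total time in the reporting period. As a small worked sketch, the functions below compute that percentage and the downtime a given SLA target allows. The function names are invented for this example, and the 99.9% target mentioned afterwards is purely illustrative.

```python
from datetime import timedelta


def availability_pct(period, downtime):
    """Percentage of the reporting period during which the system was up."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    if not timedelta(0) <= downtime <= period:
        raise ValueError("downtime must be between zero and the period")
    return 100.0 * (1 - downtime / period)


def downtime_budget(period, sla_pct):
    """Maximum downtime in the period that still meets the SLA target."""
    return period * (1 - sla_pct / 100)
```

Under these definitions, a 99.9% target over a 30-day month allows roughly 43 minutes of downtime, which is often a more meaningful number for leadership than the percentage itself.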
Operations Dashboard#
Audience: IT operations team
Content:
- System health status
- Resource utilisation
- Active alerts
- Recent incidents
Application Team Dashboard#
Audience: ERP administrators
Content:
- Application-specific metrics
- Error analysis
- Performance trends
- Integration status
---
Monday Morning Action Plan#
- Inventory Your Current Monitoring: What metrics do you collect? What alerts exist?
- Identify Monitoring Gaps: What critical aspects of your ERP are not monitored?
- Define Alert Thresholds: What thresholds indicate problems requiring action?
- Build Executive Dashboards: Make ERP health visible to leadership.
- Test Your Alerting: Simulate problems to verify alerts work correctly.
---
Conclusion: Observability Is Operational Insurance#
Monitoring and observability are operational insurance. The investment pays off when problems occur—and problems will occur. The organisations that detect issues quickly and understand their root causes recover faster and suffer less business impact.