The promise of endpoint detection and response (EDR) is that telemetry will catch what signatures miss. In practice, though, raw telemetry streams often bury zero-day activity under noise, false positives, and poorly tuned rules. This guide is for detection engineers and SOC leads who already know the basics — we skip the sales pitch and focus on the trade-offs, patterns, and maintenance realities of tuning for novel threats.
Where Telemetry Tuning Meets Zero-Day Reality
Most zero-day exploits don't announce themselves with a known hash or a clean IoC match. They arrive through legitimate processes — a browser, a PDF reader, a scripting host — and the malicious behavior is a deviation from the application's normal activity. Telemetry tuning for zero-days means building detection logic that spots these deviations without drowning in false positives.
In a typical mid-size environment, a single endpoint generates tens of thousands of events per day: process creation, network connections, file writes, registry changes. Tuning for zero-days requires selecting which of these events matter, how to aggregate them into meaningful signals, and where to set thresholds. Teams often start by enabling every possible telemetry source, then quickly discover that storage costs and analyst burnout force them to prune. The key is to prune intelligently — keeping the data that reveals novel attack patterns while discarding noise.
One composite scenario: a team deployed Sysmon with all event IDs enabled. Within a week, they had over 10 million events per host. They tuned by disabling or filtering event types that produced high volume with low signal (e.g., network connection events for trusted processes), but kept process creation and file creation events with command-line logging. This allowed them to detect a zero-day that abused mshta.exe to download a payload. The activity looked unusual compared to the host's baseline, even though mshta.exe itself is a signed Windows binary that hash-based or signer-based checks would not flag.
The lesson is that tuning is not about collecting everything; it's about collecting the right things and then building detection logic that compares current behavior to a learned baseline. This section sets the stage: zero-days exploit gaps in telemetry coverage, but also gaps in how we interpret that telemetry.
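The pruning in the scenario above can be sketched as a simple filter. This is a minimal illustration, not the team's actual configuration: in practice the filtering would more likely live in the Sysmon configuration itself, and the event dictionary shape, the `keep_event` name, and the trusted process list are all assumptions made for the example.

```python
import ntpath

SYSMON_NETWORK_CONNECT = 3  # Sysmon Event ID 3: network connection

# Hypothetical list of processes whose network connections are treated as noise.
TRUSTED_NETWORK_IMAGES = {"msedge.exe", "onedrive.exe"}

def keep_event(event):
    """Return False for events the pruning policy discards, True otherwise."""
    # ntpath parses Windows-style paths regardless of the OS running this code.
    image = ntpath.basename(event.get("image", "")).lower()
    if event["event_id"] == SYSMON_NETWORK_CONNECT and image in TRUSTED_NETWORK_IMAGES:
        return False
    # Process creation, file creation, and everything else is kept.
    return True
```

Note that the filter drops network events only for the listed processes; a network connection from mshta.exe would still be kept, which is what preserves the signal in the scenario.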
Why Traditional Signature Approaches Fail
Signature-based detection relies on known patterns — file hashes, IP addresses, registry keys. Zero-days, by definition, have no known pattern. Telemetry tuning shifts the focus to behavioral anomalies: a process spawning an unusual child, a script writing to a sensitive directory, an outbound connection to a never-seen-before domain. These behaviors can be detected if the telemetry is granular enough and the tuning is calibrated to the environment's normal noise floor.
The Role of Baseline Profiling
Effective tuning requires understanding what 'normal' looks like for each host or user group. A developer workstation with frequent PowerShell usage has a different baseline than a point-of-sale terminal. Tuning must account for these differences, either through per-host baselines or by grouping similar endpoints. Without baseline profiling, detection rules will either miss true positives or generate overwhelming false positives.
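A group-level baseline can be as simple as the set of parent-child process pairs observed for each endpoint role. The sketch below assumes events have already been labeled with a hypothetical `group` field (for example "developer" or "pos"); the field names and function names are illustrative, not taken from any particular EDR product.

```python
from collections import defaultdict

def build_group_baselines(events):
    """Collect the set of (parent, child) process pairs seen per endpoint group."""
    baselines = defaultdict(set)
    for ev in events:
        baselines[ev["group"]].add((ev["parent"].lower(), ev["child"].lower()))
    return baselines

def is_new_for_group(baselines, event):
    """True if this parent-child pair was never seen for the event's group."""
    pair = (event["parent"].lower(), event["child"].lower())
    return pair not in baselines.get(event["group"], set())

# Hypothetical history collected during a baselining period.
history = [
    {"group": "developer", "parent": "code.exe", "child": "powershell.exe"},
    {"group": "pos", "parent": "explorer.exe", "child": "pos_client.exe"},
]
baselines = build_group_baselines(history)
```

With this history, PowerShell launched from an editor is normal for the developer group, while any PowerShell launch on a point-of-sale terminal is new for its group and worth a look.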
Foundations: Telemetry Sources and Their Blind Spots
Not all telemetry is equally useful for zero-day detection. Understanding the strengths and limitations of each source is critical before writing a single detection rule.
Process Telemetry
Process creation events (e.g., Windows Security Event ID 4688, Sysmon Event ID 1) are the backbone of endpoint detection. They provide the parent-child relationship, command line, and often the user context. For zero-days, process telemetry can reveal unusual execution chains, for example a Microsoft Word process spawning PowerShell, or a scheduled task launching a script from a temp directory. However, process telemetry has three limits: it misses fileless attacks that operate entirely in memory, it becomes noisy when every routine task is logged, and command-line logging can capture sensitive data such as credentials passed as arguments.
Network Telemetry
Network telemetry shows which destinations a process communicates with. For zero-days, unexpected outbound connections to unknown domains or IP ranges can be a strong signal. But network telemetry is high-volume: every DNS query and every HTTP request is an event. Tuning often involves filtering out connections to trusted CDNs, update servers, and internal resources. The challenge is that zero-days may use legitimate cloud services for command and control (C2), making it hard to distinguish malicious from benign traffic without additional context such as process lineage or domain reputation.
File and Registry Telemetry
File creation events (Sysmon Event ID 11, which also fires when a file is overwritten) are crucial for detecting payload drops, persistence mechanisms, and ransomware encryption. File creation time changes (Sysmon Event ID 2) can additionally reveal timestomping. Registry events (Sysmon Event IDs 12, 13, and 14) help spot autorun entries, COM hijacking, and other persistence techniques. These sources are relatively low-volume compared to process and network telemetry, but they can generate false positives from legitimate software installers and updates. Tuning requires allowlisting known-good installer paths and digital signers.
Memory Telemetry
Memory scanning (e.g., for code injection, reflective DLL loading) is the most direct way to detect fileless zero-days. However, memory telemetry is expensive in terms of performance and storage. Many EDR solutions sample memory events or only scan on specific triggers. Tuning memory telemetry involves balancing coverage with system impact — often a trade-off that teams must evaluate based on their risk tolerance.
Detection Patterns That Work Against Novel Threats
With telemetry sources understood, we can now focus on detection patterns that are effective against zero-days. These patterns rely on anomaly detection, behavioral thresholds, and correlation across multiple event streams.
Unusual Child Process Chains
One of the most reliable patterns is detecting a process spawning a child that is atypical for that parent. For example, a PDF reader spawning cmd.exe is suspicious because PDF readers rarely need command-line access. This pattern can be implemented as a rule that logs all instances where a specific parent (like a document viewer or browser) spawns a scripting host, shell, or binary from a temp directory. Tuning involves excluding known legitimate scenarios (e.g., a browser spawning a download helper) and adjusting for environment-specific parent-child pairs.
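A minimal sketch of such a rule follows. The parent list, child list, temp-path markers, and the allowed pair are all hypothetical placeholders that would need to be replaced with values observed in the actual environment.

```python
import ntpath

# Hypothetical watch lists; adjust to the applications actually deployed.
WATCHED_PARENTS = {"acrord32.exe", "winword.exe", "chrome.exe"}
SUSPICIOUS_CHILDREN = {"cmd.exe", "powershell.exe", "wscript.exe", "cscript.exe", "mshta.exe"}
TEMP_MARKERS = ("\\appdata\\local\\temp\\", "\\windows\\temp\\")
# Known-legitimate pairs excluded during tuning (hypothetical example).
ALLOWED_PAIRS = {("chrome.exe", "download_helper.exe")}

def is_unusual_child(parent_image, child_image):
    """Flag a watched parent spawning a shell, script host, or temp-dir binary."""
    parent = ntpath.basename(parent_image).lower()
    child = ntpath.basename(child_image).lower()
    if parent not in WATCHED_PARENTS:
        return False
    if (parent, child) in ALLOWED_PAIRS:
        return False
    from_temp = any(marker in child_image.lower() for marker in TEMP_MARKERS)
    return child in SUSPICIOUS_CHILDREN or from_temp
```

Matching on file names alone is easy to evade by renaming a binary, so a production rule would likely also compare hashes, signers, or original file names where the telemetry provides them.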
Outbound Connections to Newly Observed Domains
Most zero-days require C2 communication. A detection pattern that flags outbound connections to domains that have never been seen in the environment (or were first observed within the last 24 hours) can catch C2 traffic early. This requires a baseline of domain lookup history — often maintained by a SIEM or a dedicated DNS analytics tool. Tuning involves setting a threshold for 'new' (e.g., first seen in the last hour vs. last day) and excluding domains that are known to be legitimate but rarely queried (e.g., a seldom-used SaaS service).
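The first-seen logic can be sketched with an in-memory tracker. In a real deployment the history would sit in a SIEM or DNS analytics store; the class name, the 24-hour default, and the allowlisted domain here are assumptions for illustration only.

```python
from datetime import datetime, timedelta

class DomainFirstSeen:
    """Tracks when each domain was first observed in the environment."""

    def __init__(self, new_window=timedelta(hours=24), allowlist=()):
        self.first_seen = {}
        self.new_window = new_window
        self.allowlist = {d.lower() for d in allowlist}

    def observe(self, domain, when):
        """Record the lookup and return True if the domain counts as newly observed."""
        domain = domain.lower().rstrip(".")
        if domain in self.allowlist:
            return False
        # setdefault keeps the earliest recorded timestamp for the domain.
        first = self.first_seen.setdefault(domain, when)
        return when - first < self.new_window

# Hypothetical usage: a rarely queried but legitimate SaaS domain is excluded.
tracker = DomainFirstSeen(allowlist=["rare-saas.example"])
```

Changing `new_window` is how the "last hour versus last day" threshold from the text would be tuned. The sketch assumes events arrive roughly in time order.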
High-Frequency File Encryption Events
Ransomware zero-days often encrypt files in bulk. A detection pattern that monitors file modification events across multiple directories and triggers when a single process modifies more than, say, 50 files in a minute can catch ransomware before it completes. Tuning is critical here: backup software, build tools, and database applications may also cause high-frequency file writes. Exclusions must be carefully scoped to avoid false positives that could lead to rule disabling.
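One way to express the threshold is a sliding window per process that counts distinct files touched. The sketch below uses the 50-files-per-minute figure from the text as a default; the class name, event fields, and the excluded process name in the usage line are hypothetical.

```python
from collections import defaultdict, deque

class FileBurstDetector:
    """Flags a process that modifies more than max_files distinct files per window."""

    def __init__(self, max_files=50, window_seconds=60, excluded_processes=()):
        self.max_files = max_files
        self.window = window_seconds
        self.excluded = {p.lower() for p in excluded_processes}
        self.events = defaultdict(deque)  # (process, pid) -> deque of (timestamp, path)

    def observe(self, process, pid, path, ts):
        """Record one file modification; return True if the burst threshold is exceeded."""
        if process.lower() in self.excluded:
            return False
        q = self.events[(process.lower(), pid)]
        q.append((ts, path))
        # Drop events that have aged out of the sliding window.
        while q and ts - q[0][0] >= self.window:
            q.popleft()
        distinct_paths = {p for _, p in q}
        return len(distinct_paths) > self.max_files

# Hypothetical usage: backup software is excluded by process name.
detector = FileBurstDetector(excluded_processes=["backup.exe"])
```

Excluding by process name alone is a broad exclusion; scoping it further (by path, signer, or host group) is the kind of careful scoping the text calls for.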
Registry Autorun Changes with Unusual Process Parents
Persistence mechanisms often involve registry changes (e.g., adding a Run key). A detection pattern that flags registry modifications where the modifying process is not a known installer or system component can catch zero-day persistence. Tuning requires building an allowlist of trusted signers and installers (e.g., Microsoft, Adobe, enterprise deployment tools) and excluding changes that occur during software updates.
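A sketch of this check is shown below. It assumes each registry event has already been enriched with the signer of the modifying process, which typically means joining registry events with process or image-load telemetry; the field names and the trusted signer list are hypothetical.

```python
# Substring shared by the Run and RunOnce autorun keys.
RUN_KEY_MARKERS = ("\\software\\microsoft\\windows\\currentversion\\run",)

# Hypothetical allowlist of trusted signers.
TRUSTED_SIGNERS = {"microsoft corporation", "adobe inc."}

def is_suspicious_autorun(event):
    """Flag autorun key changes made by a process without a trusted, valid signature."""
    target = event["target_object"].lower()
    if not any(marker in target for marker in RUN_KEY_MARKERS):
        return False
    signer = (event.get("signer") or "").lower()
    trusted = bool(event.get("signature_valid")) and signer in TRUSTED_SIGNERS
    return not trusted
```

Because signed binaries can themselves be abused, a rule like this works best combined with the parent-child and baseline checks described earlier rather than on its own.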
Anti-Patterns and Why Teams Revert
Even experienced teams fall into traps that undermine their detection tuning. Recognizing these anti-patterns is the first step to avoiding them.
Over-Tuning to Zero False Positives
The desire to eliminate all false positives is understandable, but it often leads to rules that are so narrow they miss true positives. A rule that only triggers on a specific combination of process, command line, and parent process may be silent for months — but also silent when a novel variant appears. Teams sometimes 'tune' by adding exclusion after exclusion until the rule fires only on known-bad, effectively turning it into a signature. The anti-pattern is treating false positives as a binary problem rather than a risk trade-off. A better approach is to accept a manageable false positive rate and invest in triage automation.
Ignoring Environmental Baselines
Deploying detection rules from a generic threat intelligence feed without adjusting for the environment is a common mistake. A rule that flags PowerShell downloading an executable may be perfect for a corporate network but generate thousands of alerts in a DevOps environment where automated builds run similar commands. Teams that don't baseline often revert to disabling the rule entirely, losing coverage. The fix is to group endpoints by role and apply rules with environment-specific thresholds.
Relying on a Single Telemetry Source
Zero-days often evade detection if only one telemetry source is monitored. For example, a fileless attack that uses reflective DLL loading might not generate process creation events (if the DLL is injected into a running process) but could be caught by memory scanning or network telemetry. Teams that rely solely on process telemetry may miss the attack entirely. The anti-pattern is building detection logic that only uses one event type, ignoring correlation across sources. Multi-source correlation is more complex but more resilient.
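Multi-source correlation can be sketched as grouping weak signals by host and process, then alerting only when signals from at least two different telemetry sources fall within a short time window. The signal format, the five-minute window, and the function name are assumptions for illustration.

```python
from collections import defaultdict

def correlate(signals, window_seconds=300, min_sources=2):
    """Return (host, pid) keys whose signals span min_sources sources within the window."""
    by_key = defaultdict(list)
    for s in signals:
        by_key[(s["host"], s["pid"])].append(s)

    hits = []
    for key, items in by_key.items():
        items.sort(key=lambda s: s["ts"])
        start = 0
        for end in range(len(items)):
            # Shrink the window from the left until it spans at most window_seconds.
            while items[end]["ts"] - items[start]["ts"] > window_seconds:
                start += 1
            sources = {s["source"] for s in items[start:end + 1]}
            if len(sources) >= min_sources:
                hits.append(key)
                break
    return hits
```

In this design, a memory-injection signal followed two minutes later by an unusual outbound connection from the same process would correlate, while two network signals alone, or a memory and network signal far apart in time, would not.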
Neglecting Maintenance
Detection rules drift as the environment changes. New software, updated operating systems, and changing user behavior can render once-effective rules noisy or obsolete. Teams that don't regularly review and adjust their rules often see false positive rates climb over time, leading to rule disablement during incident response. The anti-pattern is 'set and forget' — deploying rules and never revisiting them. Maintenance must be scheduled, ideally with automated testing against a baseline of known-good activity.
Maintenance, Drift, and Long-Term Costs
Telemetry tuning is not a one-time project. It requires ongoing effort to keep detection quality high as the environment evolves.
Rule Lifecycle Management
Each detection rule should have a lifecycle: creation, testing, deployment, monitoring, and retirement. Rules that are no longer relevant (e.g., because the software they targeted is no longer used) should be retired to reduce noise. New rules should be tested in a staging environment against a sample of production traffic to estimate false positive rates. Many teams use a 'tuning window' — a period (e.g., two weeks) during which a new rule generates alerts but does not trigger automated responses, allowing analysts to tune exclusions.
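The tuning window can be encoded as a small gate that decides whether a rule may trigger automated responses yet. This is a sketch under the assumption that each rule records its deployment date; the function name and mode labels are hypothetical.

```python
from datetime import date, timedelta

def rule_mode(deployed_on, today, tuning_days=14):
    """Return 'alert_only' during the tuning window and 'enforce' afterwards."""
    if today < deployed_on:
        return "not_deployed"
    if today - deployed_on < timedelta(days=tuning_days):
        # Alerts are generated, but automated responses stay disabled.
        return "alert_only"
    return "enforce"
```

A response pipeline would consult this mode before acting, so that exclusions gathered during the window are in place before any automated containment runs.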
Telemetry Volume Growth
As endpoints increase in number and telemetry sources expand, storage and processing costs grow. Teams must decide how long to retain raw telemetry versus aggregated summaries. For zero-day detection, retaining raw process and network telemetry for at least 30 days is common, but file and registry telemetry may be retained for longer to support historical hunting. The cost of storage must be weighed against the value of historical data for post-incident analysis. Some teams tier storage: hot storage for 7 days, warm for 30, cold for 90.
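The tiering scheme can be sketched as a function of record age. The code assumes the 7, 30, and 90 day figures are cumulative ages (a record moves to warm storage after 7 days, to cold after 30, and is deleted after 90); the text does not specify this, so treat it as one possible reading.

```python
def storage_tier(age_days, hot_days=7, warm_days=30, cold_days=90):
    """Map a telemetry record's age in days to a storage tier."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < hot_days:
        return "hot"
    if age_days < warm_days:
        return "warm"
    if age_days < cold_days:
        return "cold"
    return "expired"
```

Different telemetry types could be given different thresholds, for example longer `cold_days` for file and registry events kept for historical hunting.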
Personnel and Skill Maintenance
Tuning requires skilled analysts who understand both the environment and the threat landscape. Turnover can lead to knowledge loss, especially if tuning decisions are not documented. Maintaining a 'tuning playbook' that explains why certain exclusions exist and how baseline profiles were created can reduce dependency on specific individuals. Regular cross-training and rotation also help.
When Not to Use Telemetry Tuning
There are scenarios where investing in telemetry tuning is not the right move — at least not first.
Immature Environments with No Baseline
If an environment has no existing monitoring or the baseline is completely unknown, jumping straight to tuning is premature. The first step should be to establish basic logging and collect a baseline of normal activity over several weeks. Without a baseline, tuning is guessing. In such cases, teams should focus on deploying EDR with default rules and gradually collecting data before customizing.
Overwhelming Alert Volume from Other Sources
If the SOC is already drowning in alerts from un-tuned signature-based tools, adding more telemetry and custom rules may make things worse. The priority should be to reduce noise from existing systems first — by tuning those signatures or implementing alert deduplication — before layering on zero-day detection. Otherwise, tuning for zero-days will be lost in the noise.
Lack of Skilled Analysts
Telemetry tuning requires analysts who understand both the technical details of endpoint telemetry and the operational context. If the team lacks this expertise, outsourcing or investing in training may be more effective than DIY tuning. A poorly tuned detection can be worse than no detection — it can create a false sense of security or generate alert fatigue.
Regulatory or Compliance Constraints
Some regulations require specific logging (e.g., all authentication events, all network connections) that cannot be filtered at the source. Where full collection is mandated, data reduction is off the table, so tuning has to happen at the detection layer: in the alerting rules, thresholds, and exclusions rather than in what is collected.
Open Questions and FAQ
Even with solid tuning practices, several open questions remain. This section addresses common uncertainties.
How do we balance telemetry volume with detection depth?
There's no universal answer. Teams often start with all telemetry enabled, then prune based on observed signal-to-noise ratio. A practical approach is to prioritize telemetry sources that have the highest detection value per event: process creation, network connections to unknown destinations, and file writes to sensitive directories. Lower-value sources like DNS queries for known domains can be sampled or aggregated.
What false positive rate is acceptable for zero-day detection?
It depends on the team's capacity to triage. A rule that generates 10 false positives per day but catches one zero-day per quarter may be acceptable if analysts can quickly dismiss the false positives. A rule that generates 500 false positives per day will likely be disabled. The key is to measure false positive rates and adjust thresholds accordingly. Many teams aim for a false positive rate below 1% for automated response rules, and up to 5% for alert-only rules. Whichever denominator is used for such a rate (alerts raised or events evaluated), it should be stated explicitly and applied consistently, because the same rule can look very different under each.
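Measuring a rule's noise only requires the triage dispositions analysts already record. The sketch below reports both false positives per day and the share of triaged alerts that were false positives; the disposition labels and function name are assumptions.

```python
from collections import Counter

def summarize_rule(dispositions, days_observed):
    """Summarize a rule's triage outcomes over an observation period."""
    counts = Counter(dispositions)
    fp = counts["false_positive"]
    tp = counts["true_positive"]
    triaged = fp + tp
    return {
        "fp_per_day": fp / days_observed,
        "fp_share_of_alerts": fp / triaged if triaged else None,
    }

# The example from the text: about 10 false positives per day over a quarter,
# with a single true positive.
quarter = ["false_positive"] * 900 + ["true_positive"]
summary = summarize_rule(quarter, days_observed=90)
```

For this example the rule is almost entirely false positives by share of alerts, yet it may still be worth keeping, which is why daily volume relative to triage capacity is often the more useful number.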
Should we use machine learning for tuning?
Machine learning can help with baseline profiling and anomaly detection, but it introduces its own challenges: model drift, explainability, and training data requirements. For most teams, rule-based tuning with statistical thresholds (e.g., standard deviation from baseline) is more maintainable. ML can be used as a complementary layer, but not as a replacement for well-tuned rules.
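A statistical threshold of the kind mentioned above can be sketched as a z-score check against a history of daily counts. The multiplier `k=3.0` and the function name are illustrative defaults, not recommended values.

```python
from statistics import mean, stdev

def is_anomalous(history, current, k=3.0):
    """True if current deviates from the baseline mean by more than k standard deviations."""
    if len(history) < 2:
        # Not enough data to estimate spread; do not alert.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any change is a deviation.
        return current != mu
    return abs(current - mu) / sigma > k

# Hypothetical daily counts of PowerShell launches on one host group.
baseline_counts = [10, 12, 11, 9, 10, 11, 12, 10]
```

The check is two-sided, so an unusual drop is flagged as well as a spike; a team interested only in spikes could compare `current - mu` without the absolute value.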