Telemetry is the raw material of detection, but more data does not automatically mean better security. Many teams collect gigabytes of endpoint logs only to be buried in alerts that rarely signal real threats. This guide is for SOC analysts, detection engineers, and security architects who already understand the basics of EDR and want to sharpen their anomaly detection approach. We will walk through practical strategies for selecting telemetry sources, designing detection logic, and maintaining baselines over time—without relying on vendor hype or fake statistics.
Why Telemetry Strategy Matters More Than Volume
Anomaly detection depends on having the right signals, not all signals. A common mistake is to enable every possible event collection source (process creation, network connections, registry changes, file system modifications, DNS queries) and then try to correlate everything. The result is a low signal-to-noise ratio: true positives are buried under thousands of benign deviations.
We have seen teams spend months tuning rules on noisy data streams that would have been better replaced by a smaller, curated set of high-fidelity events. For example, collecting every process creation event with full command-line arguments is useful, but collecting every registry query from legitimate software generates mostly noise. The key is to prioritize telemetry that maps to known attacker behaviors: lateral movement, privilege escalation, credential access, and execution of untrusted binaries.
A practical starting point is to categorize telemetry into three tiers: critical (always collect), valuable (collect with filtering), and optional (collect only if storage and analysis capacity allow). This tiered approach prevents resource waste and keeps detection logic manageable. Teams that skip this categorization often find themselves spending more time managing data pipelines than actually detecting threats.
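One way to make the tiers concrete is a small lookup that a collection pipeline consults before forwarding an event. The sketch below is illustrative only: the event-type names, the tier assignments, and the `should_collect` helper are assumptions for this example, not part of any EDR product's schema.

```python
from enum import Enum


class Tier(Enum):
    CRITICAL = "critical"   # always collect
    VALUABLE = "valuable"   # collect with filtering
    OPTIONAL = "optional"   # collect only if capacity allows


# Hypothetical event-type names and tier assignments; adjust to your own sources.
TELEMETRY_TIERS = {
    "process_creation": Tier.CRITICAL,
    "network_connection": Tier.CRITICAL,
    "authentication": Tier.CRITICAL,
    "registry_autostart_change": Tier.CRITICAL,
    "dns_query": Tier.VALUABLE,
    "registry_query": Tier.OPTIONAL,
}


def should_collect(event_type: str, passes_filter: bool, has_capacity: bool) -> bool:
    """Decide whether to forward an event, based on its tier."""
    # Event types that were never categorized are treated as optional.
    tier = TELEMETRY_TIERS.get(event_type, Tier.OPTIONAL)
    if tier is Tier.CRITICAL:
        return True
    if tier is Tier.VALUABLE:
        return passes_filter
    return has_capacity
```

Defaulting unknown event types to the optional tier keeps a newly enabled source from silently flooding the pipeline before someone has categorized it.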
Critical Telemetry Sources for Anomaly Detection
At a minimum, every endpoint protection program should collect process creation events (including parent-child relationships, command lines, and hashes), network connections (source, destination, port, protocol), file system modifications in sensitive directories (e.g., system32, startup folders), registry changes to auto-start locations, and authentication logs (successes and failures). These events cover the majority of common attack techniques, from phishing payloads to ransomware encryption.
Without these core sources, anomaly detection becomes guesswork. For instance, detecting a lateral movement attempt often requires correlating a remote process creation event with a network connection from a non-admin workstation. If either source is missing, the detection is blind.
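The lateral movement example can be sketched as a join between the two event streams. This is a simplified illustration: the record fields, the 60-second window, and the `admin_hosts` allowlist are all assumptions, and a production pipeline would index events rather than loop over every pair.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class NetworkConnection:
    timestamp: datetime
    source_host: str
    dest_host: str
    dest_port: int


@dataclass
class RemoteProcessCreation:
    timestamp: datetime
    host: str    # host on which the process was created
    image: str


def correlate(conns, procs, admin_hosts, window=timedelta(seconds=60)):
    """Pair each remote process creation with connections that reached the
    same host from a non-admin workstation shortly beforehand."""
    matches = []
    for proc in procs:
        for conn in conns:
            delta = proc.timestamp - conn.timestamp
            if (conn.dest_host == proc.host
                    and conn.source_host not in admin_hosts
                    and timedelta(0) <= delta <= window):
                matches.append((conn, proc))
    return matches
```

If either list is empty because that source was never collected, `correlate` returns nothing, which is the blind spot described above.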
Three Approaches to Anomaly Detection: Behavioral, Statistical, and ML-Based
There is no single best approach; each has strengths and weaknesses depending on your environment, team skills, and tolerance for false positives. We will compare three common methodologies so you can decide which fits your context.
Behavioral Detection
Behavioral detection relies on predefined rules that describe malicious patterns: a process spawning cmd.exe and connecting to an external IP, or a script interpreter launched from an Office application. These rules are transparent, easy to tune, and do not require large training datasets. The downside is they only catch known patterns; a novel technique that does not match any rule will pass undetected. Behavioral detection works best for environments with stable workloads and a mature threat intelligence feed to update rules regularly.
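As a minimal sketch of such a rule, the function below checks whether an Office application is the parent of a script or command interpreter. The field names `parent_image` and `image` and the two process lists are assumptions chosen for illustration; real EDR schemas and rule languages differ.

```python
# Illustrative, deliberately incomplete process lists.
OFFICE_PARENTS = {"winword.exe", "excel.exe", "powerpnt.exe", "outlook.exe"}
INTERPRETERS = {"powershell.exe", "wscript.exe", "cscript.exe", "mshta.exe", "cmd.exe"}


def basename(path: str) -> str:
    """Return the lowercase file name from a Windows path."""
    return path.lower().rsplit("\\", 1)[-1]


def office_spawns_interpreter(event: dict) -> bool:
    """True when an Office process is the parent of an interpreter."""
    parent = basename(event.get("parent_image", ""))
    child = basename(event.get("image", ""))
    return parent in OFFICE_PARENTS and child in INTERPRETERS
```

The rule is transparent in the sense described above: when it fires, the analyst can read the two lists and see exactly why.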
Statistical Anomaly Detection
Statistical methods build a baseline of normal activity and flag deviations. For example, if a workstation normally makes 50 DNS queries per hour and suddenly makes 5,000, that is an anomaly. This approach can catch unknown threats, but it requires careful baseline calibration. Normal behavior changes over time—new software installs, user work patterns, seasonal variations—so baselines must be periodically recalculated. False positives are common during onboarding until the model learns what is normal. Statistical detection is a good complement to behavioral rules, especially for detecting data exfiltration or compromised accounts.
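A minimal sketch of such a baseline check follows. It assumes the baseline is the mean and sample standard deviation of the most recent `window` observations and that a 3-sigma cutoff is acceptable; both are illustrative choices, and the function name is hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history, current, window=24, sigmas=3.0):
    """Flag `current` if it lies more than `sigmas` standard deviations
    from the mean of the last `window` observations."""
    if len(history) < max(window, 2):
        return False  # still learning: not enough data for a baseline
    recent = history[-window:]
    mu = mean(recent)
    sd = stdev(recent)
    if sd == 0:
        return current != mu  # perfectly flat baseline: any change deviates
    return abs(current - mu) > sigmas * sd
```

With hourly DNS counts hovering around 50, a jump to 5,000 is flagged while 55 is not. Returning False until a full window is available trades early coverage for fewer alerts during onboarding.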
Machine Learning-Based Detection
ML models can identify subtle patterns that rules and statistics miss, such as a sequence of events that resembles a known attack chain but with slight variations. However, ML introduces complexity: feature engineering, model training, validation, and ongoing monitoring for concept drift. Many commercial EDR products include pre-trained ML models, but teams that build custom models must invest in data science expertise. ML is most valuable in large, heterogeneous environments where manual rule writing cannot keep up with the diversity of endpoints.
We recommend starting with behavioral detection for the most common attack patterns, layering statistical baselines for volumetric anomalies, and only adding ML after the first two layers are stable. Trying to jump straight to ML without foundational telemetry hygiene often leads to opaque alerts that are hard to investigate.
How to Evaluate Detection Approaches for Your Environment
Choosing between behavioral, statistical, and ML-based detection depends on three factors: your team's capacity to maintain detection logic, the diversity of your endpoint fleet, and your tolerance for false positives. We have seen teams with a small, homogeneous environment (e.g., all Windows 11 with standard software) succeed with behavioral rules alone, while a large university with thousands of unique devices and user behaviors needed statistical baselines to reduce noise.
Another critical criterion is the skill level of the analysts who will triage alerts. Behavioral rule alerts are usually straightforward to investigate because the rule itself explains the pattern. Statistical alerts often require more context—why is this deviation significant? ML alerts can be the hardest to explain, sometimes producing a black-box score with no clear rationale. If your team is junior-heavy, prioritize transparency over theoretical detection coverage.
We also recommend evaluating the cost of false negatives versus false positives. In a high-security environment (e.g., finance, healthcare), missing a true positive is worse than investigating many false alarms. In a resource-constrained team, too many false positives lead to alert fatigue and missed real threats. There is no universal right answer; you must calibrate based on your risk appetite.
Practical Decision Matrix
Create a simple matrix with your top three threat scenarios (e.g., ransomware execution, credential theft, data exfiltration). For each scenario, rate how well each detection approach would perform in your environment. Behavioral rules often excel at execution-based attacks, statistical methods at volumetric exfiltration, and ML at subtle credential access patterns. This exercise forces you to think about gaps and prioritize which approach to implement first.
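The matrix can be as simple as a nested dictionary. In the sketch below the 1 to 5 ratings are invented placeholders that loosely follow the tendencies described above; replace them with your own team's assessment.

```python
# Hypothetical ratings (1 = poor fit, 5 = strong fit) for one environment.
RATINGS = {
    "ransomware execution": {"behavioral": 5, "statistical": 3, "ml": 3},
    "credential theft": {"behavioral": 3, "statistical": 2, "ml": 4},
    "data exfiltration": {"behavioral": 2, "statistical": 5, "ml": 3},
}


def best_approach(ratings):
    """Map each scenario to its highest-rated approach."""
    return {scenario: max(scores, key=scores.get)
            for scenario, scores in ratings.items()}


def coverage_gaps(ratings, threshold=3):
    """List scenarios where no approach is rated above `threshold`."""
    return [scenario for scenario, scores in ratings.items()
            if max(scores.values()) <= threshold]
```

Scenarios returned by `coverage_gaps` are the ones where no approach is expected to perform well, which is usually where to invest first.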
Trade-Offs Between Detection Coverage and Operational Cost
Every detection strategy involves trade-offs. More coverage usually means more alerts, more storage, and more analyst time. We have observed teams that deployed every possible detection rule and quickly overwhelmed their SOC, leading to a mass tuning exercise that disabled half the rules. The smarter path is to start with a small set of high-fidelity rules, measure their performance (true positive rate, false positive rate, mean time to detect), and then expand gradually.
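Measuring a rule's performance can start from triaged alerts alone. The sketch below assumes each alert is a dictionary with a `verdict` field and, for true positives, timestamps for when the malicious activity began and when the alert fired; these field names are hypothetical. Because missed attacks do not appear in alert data, it reports the share of alerts that were false positives rather than a full true positive rate.

```python
from datetime import datetime, timedelta
from statistics import mean


def rule_metrics(alerts):
    """Summarize triaged alerts for a single rule."""
    total = len(alerts)
    true_positives = [a for a in alerts if a["verdict"] == "tp"]
    fp_share = (total - len(true_positives)) / total if total else 0.0
    delays = [(a["alerted_at"] - a["activity_start"]).total_seconds()
              for a in true_positives]
    mttd = timedelta(seconds=mean(delays)) if delays else None
    return {
        "alerts": total,
        "false_positive_share": fp_share,
        "mean_time_to_detect": mttd,
    }


# Example with invented data: two confirmed detections, two false alarms.
start = datetime(2024, 1, 1, 9, 0)
triaged = [
    {"verdict": "tp", "activity_start": start,
     "alerted_at": start + timedelta(minutes=10)},
    {"verdict": "tp", "activity_start": start,
     "alerted_at": start + timedelta(minutes=30)},
    {"verdict": "fp"},
    {"verdict": "fp"},
]
summary = rule_metrics(triaged)
```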
One common trade-off is between detection latency and accuracy. Real-time detection rules (e.g., alert on any PowerShell execution from Office) catch attacks early but generate many false positives. Delayed detection (e.g., aggregating events over an hour and then scoring) reduces noise but may flag fast-moving attacks like ransomware only after the damage is done. We suggest using real-time rules for critical, low-volume patterns (e.g., service installation from a temp directory) and batch detection for noisier signals.
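Batch detection of the kind mentioned here can be sketched as bucketing events by host and hour and scoring the buckets afterwards. The event fields and the fixed count threshold are assumptions for illustration; a real scorer would more likely compare each bucket to a per-host baseline.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(events):
    """Count events per (host, hour) bucket."""
    counts = Counter()
    for event in events:
        hour = event["timestamp"].replace(minute=0, second=0, microsecond=0)
        counts[(event["host"], hour)] += 1
    return counts


def noisy_buckets(counts, threshold):
    """Return the (host, hour) buckets whose count exceeds `threshold`."""
    return sorted(key for key, n in counts.items() if n > threshold)


# Example with invented data: one host bursts within a single hour.
events = [{"host": "ws-01", "timestamp": datetime(2024, 1, 1, 9, m)}
          for m in range(0, 50, 10)]
events.append({"host": "ws-02", "timestamp": datetime(2024, 1, 1, 9, 15)})
flagged = noisy_buckets(hourly_counts(events), threshold=3)
```

The cost of this design is visible in the code: nothing is scored until the hour's events have been collected.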
Another trade-off involves data retention. Storing months of telemetry enables retrospective analysis and baseline building, but it is expensive. Many teams keep 30 days of full telemetry and 12 months of aggregated summaries. If you plan to use statistical or ML-based detection, ensure you have enough historical data to train meaningful baselines. A common failure is to deploy a statistical detector with only one week of baseline data, resulting in constant alerts during the learning period.
When to Avoid ML-Based Detection
If your environment has fewer than 500 endpoints or your team lacks data science support, ML-based detection may do more harm than good. The models will likely overfit to sparse data and produce unreliable alerts. In such cases, stick with behavioral rules and statistical baselines until you have enough scale and expertise.
Implementation Path: From Telemetry to Actionable Alerts
Once you have chosen your detection approach, the next step is to implement it in a structured way. We recommend a phased rollout to avoid overwhelming your team.
Phase 1: Audit existing telemetry. List all event sources currently collected and categorize them by tier (critical, valuable, optional). Identify gaps—for example, are you collecting process command lines? Are network connections logged with full IP and port details? Fill critical gaps first.
Phase 2: Deploy a small set of behavioral rules for the most common attack techniques. Use frameworks like MITRE ATT&CK to map rules to techniques (e.g., T1059 for script execution, T1078 for valid accounts). Run these rules in monitoring mode for two weeks to measure baseline alert volume. Tune thresholds based on observed noise.
Phase 3: Add statistical baselines for key metrics: number of process creations per hour, bytes sent over the network, failed login attempts. Use a rolling window of at least 14 days, since a one-week baseline tends to alert constantly, and flag deviations beyond 3 standard deviations from the window mean. Again, run in monitoring mode first and adjust thresholds.
Phase 4: If you have the resources, train a simple ML model on historical alerts to classify false positives vs. true positives. This can be a binary classifier using features like event frequency, time of day, and parent process reputation. Start with a small feature set and iterate.
Phase 5: Establish a feedback loop. Analysts should tag alerts as true or false positive, and that feedback should be used to retune rules and retrain models. Without feedback, detection quality degrades over time as the environment changes.
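Phase 4 describes a binary classifier trained on historical alerts. The sketch below is one possible starting point, assuming logistic regression fitted by plain gradient descent so that it runs with the standard library alone; a team would more likely use an established ML library. The three features (scaled event frequency, an off-hours flag, parent process reputation) and the eight training rows are invented for illustration.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights, bias, x):
    """Estimated probability that an alert is a true positive."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(rows, labels, epochs=2000, lr=0.1):
    """Fit logistic regression weights by batch gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict(weights, bias, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Invented rows: [event frequency 0-1, off-hours 0/1, parent reputation 0-1].
ROWS = [
    [0.9, 1, 0.1], [0.8, 1, 0.2], [0.7, 0, 0.1], [0.9, 0, 0.0],  # true positives
    [0.1, 0, 0.9], [0.2, 0, 0.8], [0.3, 1, 0.9], [0.1, 1, 1.0],  # false positives
]
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train(ROWS, LABELS)
```

The analyst tags from Phase 5 are what would replace `ROWS` and `LABELS` in practice. With only a handful of labeled alerts a model like this will overfit, so treat its score as a triage hint rather than a verdict.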
Pitfall: Skipping the Monitoring Mode
We have seen teams deploy detection rules directly into blocking mode without first measuring their false positive rate. The result is blocked legitimate software and angry users. Always run new detection logic in alert-only mode for at least two weeks, review all alerts, and tune before enabling any automated response.
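The promotion decision can be written down as an explicit gate. This is a sketch under stated assumptions: the 14-day minimum mirrors the two weeks recommended here, while the 5% ceiling on false positives is borrowed from the FAQ's target for high-fidelity rules and may not suit every rule. The function name and parameters are hypothetical.

```python
from datetime import date


def ready_for_blocking(deployed_on, today, reviewed_alerts, false_positives,
                       min_days=14, max_fp_share=0.05):
    """True only if the rule ran in alert-only mode long enough and its
    reviewed alerts show a low enough share of false positives."""
    if (today - deployed_on).days < min_days:
        return False
    if reviewed_alerts == 0:
        return False  # no evidence either way: stay in alert-only mode
    return false_positives / reviewed_alerts < max_fp_share
```

A rule that never fired during the observation period is kept in alert-only mode, because silence says nothing about how it behaves against legitimate software.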
Risks of Getting Telemetry Strategy Wrong
Choosing the wrong telemetry sources or detection approach can have serious consequences. The most obvious risk is missing real attacks. If you focus on network telemetry but ignore process execution, you might miss a malware dropper that runs locally. Conversely, if you collect everything without prioritization, you may drown in alerts and miss the one signal that matters.
Another risk is alert fatigue. When analysts see too many false positives, they start ignoring alerts, disabling rules, or even leaving the team. We have heard of SOCs where the average analyst spends 80% of their time investigating false positives. That is not only inefficient but also dangerous—real incidents get buried.
There is also a financial risk. Storing terabytes of telemetry in a SIEM or data lake costs money. If you collect low-value events, you are paying for noise. A careful telemetry audit can often reduce storage costs by 30-50% without sacrificing detection coverage.
Finally, there is the risk of over-reliance on a single detection method. If you only use behavioral rules, you will miss novel attacks. If you only use ML, you may not understand why an alert fired. A layered approach mitigates this: use behavioral rules for known patterns, statistical baselines for anomalies, and ML as a second opinion.
Case in Point: A Composite Scenario
Consider a mid-sized company that deployed a commercial EDR with all default rules enabled. Within a week, they received 10,000 alerts, of which only 12 were true positives. The SOC was overwhelmed and disabled most rules. Later, a ransomware attack exploited a technique that the disabled rules would have caught. The root cause was not a bad product but a lack of telemetry strategy—they collected everything and tuned nothing. A tiered approach with gradual enablement would have prevented this.
Mini-FAQ: Common Telemetry and Anomaly Detection Questions
How long should we retain endpoint telemetry for anomaly detection? For behavioral rules, 30 days of full telemetry is usually enough to investigate incidents. For statistical baselines, you need at least 14 days of data to establish a baseline, but longer windows (30-90 days) improve accuracy. If you plan to train ML models, retain at least 90 days of historical data to capture seasonal patterns.
What is the best way to handle baseline drift? Baselines should be recalculated periodically—weekly for fast-changing environments, monthly for stable ones. Use a rolling window that excludes the most recent day to avoid contamination. Also, flag when the baseline itself changes significantly (e.g., a new software rollout), and consider resetting the baseline after major changes.
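A minimal sketch of that recalculation, assuming one value per day and a mean-based baseline. The 14-day window and the 50% relative-change tolerance are illustrative assumptions, as are the function names.

```python
from statistics import mean


def rolling_baseline(daily_values, window=14):
    """Mean of up to `window` days preceding the most recent day.

    The latest day is excluded so that a possibly anomalous day does
    not contaminate the baseline it will be compared against.
    """
    usable = daily_values[:-1][-window:]
    return mean(usable) if usable else None


def baseline_shifted(old_baseline, new_baseline, tolerance=0.5):
    """True when the baseline moved by more than `tolerance` (a fraction)."""
    if old_baseline is None or new_baseline is None or old_baseline == 0:
        return False  # a relative change cannot be computed
    return abs(new_baseline - old_baseline) / abs(old_baseline) > tolerance
```

When `baseline_shifted` returns True after a known change such as a software rollout, that is the cue to consider resetting the baseline instead of letting the window absorb the change gradually.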
How many false positives are acceptable? There is no universal number. Measured as the share of triaged alerts that turn out to be benign, a good target is below 5% for high-fidelity rules and below 10% for statistical anomalies. If that share exceeds 20%, the rule or baseline needs tuning. Track the ratio of true positives to false positives over time and aim to improve it.
Should we use anomaly detection for blocking or only alerting? Start with alerting only. Once you have confidence in the detection logic (low false positive rate, clear investigative steps), you can consider automated blocking for specific high-confidence patterns. Never block based on a statistical anomaly without human review first.
What is the most common mistake in telemetry collection? Collecting too much too soon. Many teams enable every event source and then struggle to manage the data volume. Start with critical sources, add valuable ones with filtering, and only collect optional sources if you have a specific use case. Less is often more when it comes to actionable detection.