Engineering Leader · DC Metropolitan Area

Sahil Shah

I build systems as well as the teams behind them, with a proven track record of releasing products loved by thousands of customers.

Engineering leader specializing in observability, analytics platforms, and generating insights from large datasets. I have built and scaled teams from scratch and stay close enough to the code to set a high bar.

01 About

An engineer who leads.

I'm an engineering leader who never stopped being an engineer.

Over the last twelve years, I've worked at the intersection of large-scale data pipelines and the products built on top of them — starting with a company I founded called Quotail that produced real-time alerts based on options activity. Then I worked on video analytics at JW Player, implementing a generic API for publishers to ask advanced questions about their audiences.

After that I made the jump to Facebook, where I worked on the internal tracing platform team. There I built a processing pipeline and data schema to aggregate low-level hardware metrics across the span of a distributed request, and onboarded Instagram to the platform.

Then I joined Datadog, where I founded and led the APM Trace Retrieval team, growing it from just myself to 6 engineers. I created the team charter and roadmap and led the launch of Trace Queries, Trace Previews, and a Latency Investigator agent. I was also responsible for launching Datadog's tracing platform into GovCloud and maintaining FedRAMP compliance.

My throughline is observability and distributed tracing: making it possible to understand what enormous systems are actually doing, and turning that signal into features customers rely on. I'm equally comfortable architecting a service, debugging a gnarly production incident at 2am, and writing the team charter that tells everyone where we're going and why.

What people tend to say about working with me: I dive fearlessly into hard problems and keep iterating until the solution is solid, I build genuine cross-functional relationships that get things shipped, and I lead a team with trust and autonomy while staying hands-on in code and reviews.

Toolkit · stack
Languages
Python · JavaScript / Node.js · Java · Go
Platform
Kubernetes · Docker · Kafka · Postgres
Domain
Distributed Tracing · Observability / APM · Reliability · Cost Attribution
Web
d3.js · Angular · Express · Flask APIs
Leadership
Team building · Mentorship · Roadmapping · Stakeholder XFN
Based in DC Metropolitan Area
02 Career Journey

A path through scale.

From shipping core infrastructure at Facebook to founding and leading a product team at Datadog — the same throughline of observability, reliability, and craft.

Datadog

Engineering Manager — APM Trace Retrieval

Observability · Analytics · Agentic Engineering

June 2022 — March 2026

Founded and scaled the Trace Retrieval team, spinning it off the tracing infrastructure team into a product-facing org with its own charter, roadmap, and on-call.

  • Grew the team from 1 to 6 engineers, running weekly 1:1s and quarterly OKR planning and interfacing with upper management.
  • Led the GA launch of Trace Queries with adoption that exceeded targets.
  • Led an initiative to build a latency investigator agent that analyzed traces and other telemetry to derive the root cause of slowdowns. The product was demoed at Datadog's annual DASH conference.
  • Managed relationships with high-profile customers including PayPal and Mercadolibre, building custom APIs to meet their business needs.
  • Owned GovCloud deployment and maintenance of the end-to-end tracing infrastructure, from ingestion and storage pipelines to retrieval APIs.
  • Promoted a report to Senior Engineer, grew an SRE into a product engineer, and converted two interns to return offers for the team.

Facebook

Senior Software Engineer - Distributed Tracing

Distributed Tracing · Cost Attribution

August 2018 — December 2021

Core engineer on Facebook's internal cost attribution platform called TRU (Transition Resource Utilization), driving Instagram adoption and the reliability of the platform underneath it.

  • Onboarded Instagram onto the TRU tracing platform, hitting stretch goals including Django-layer trace collection and A/B testing framework integration.
  • Built a tracing Python SDK for Instagram, with full feature parity to the flagship C++ SDK.
  • Collaborated with data scientists to create a model for converting low-level hardware metrics such as CPU cycles into capacity-planning metrics such as power.
  • Designed and implemented Declarative Trace Stream, an abstraction that simplified trace processing and boosted adoption by internal teams.
  • Created a uniform schema and set of Hive tables so all onboarded teams could view aggregated and granular resource utilization across CPU, power, network I/O, and memory.
  • Mentored a junior engineer and helped them grow from an iOS developer into a backend developer.
03 Strengths

How I operate.

Themes drawn from years of peer and manager feedback — the consistent signal of what it's like to build with me.

01

Fearless technical depth

I dive into the most complex part of the problem and keep iterating until the solution is genuinely solid — then I keep reviewing code thoroughly even as a manager.

“Fearlessness in diving into topics and tenacity to keep iterating.”

02

Cross-functional glue

I build the relationships across teams and stakeholders that turn isolated work into shipped, adopted products.

Trusted partner across PayPal, Driveline, LLM Observability, IG Efficiency.

03

People leadership

I lead with trust and autonomy, give engineers room to grow, and stay hands-on enough to set the bar by example.

Drove promotions, grew an SRE into a product engineer, high-trust team.

04

Operational ownership

I take pragmatic, end-to-end ownership of reliability — on-call, FedRAMP/GovCloud, autoscaling, and the operational cadence around it.

FedRAMP-High platform, incident-averting autoscaling, dependable on-call.

05

Bias for impact

I move fast and find the high-leverage opportunities — framing them in clear RFCs that align efforts across the company.

Identified & framed LLM/MCP tracing tools, async queries, trace archival.

Fast, fearless, and close to the craft — without losing the team along the way.

the throughline
04 Selected Work

Things I've built.

A portfolio of deeper case studies is on the way. Here's a preview of the work that defines it.

Product · Datadog · Case study coming soon

Trace Queries (Flows API)

Designed and launched a query language and API for searching and aggregating distributed traces — from beta to GA, with usage that beat projections.

Distributed systems · API design · Trino
Platform · Datadog · Case study coming soon

APM Reliability & GovCloud

FedRAMP-High tracing platform, ITAR on-call program, and autoscaling that prevented incidents during a major customer migration.

Reliability · FedRAMP · Kubernetes
Infrastructure · Facebook · Case study coming soon

Tracing Platform & Python SDK

First-class Python SDK and the Declarative Trace Stream abstraction that powered Instagram's adoption of internal tracing and cost attribution.

SDK · Tracing · Adoption
R&D · Datadog · Case study coming soon

LLM & MCP Trace Tooling

Early initiative building LLM-friendly tools for fetching and summarizing traces, integrated into the Datadog MCP server.

LLM · MCP · Developer tooling

Full case studies, writing, and side projects are coming to this space soon.

Portfolio in progress
05 Contact

Let's build something solid.

Open to conversations about engineering leadership, observability, and hard distributed-systems problems. The fastest way to reach me is email.

Location
DC Metropolitan Area