Staff Site Reliability Engineer

Toast, Inc. | Remote.co | about 1 year ago

location: remote, US

R8116
Remote
Remote, United States
Engineering

Toast is driven by building the restaurant platform that helps restaurants adapt, take control, and get back to what they do best: building the businesses they love.

At Toast, our Site Reliability Engineers (SREs) are responsible for keeping all customer-facing services and other Toast production systems running smoothly. SREs are a blend of pragmatic operators and software craftspeople who apply sound software engineering principles, operational discipline, and mature automation to our environments and our codebase. Our decisions are based on instrumentation and continuous observability, as well as on predictions and capacity planning.

About this roll* (Responsibilities)

Design, build, and drive adoption of a platform that enables service resilience testing/chaos engineering to validate that Toast’s architecture is resilient to failure. Build and own a performance testing framework/environment that enables our R&D teams to understand the constraints of their services and improve performance (25%)
Define, implement and evolve a world-class observability technology stack that allows rapid detection of issues in our system and enables root cause analysis (20%)
- Provide scalable metrics and dashboarding solutions for R&D
- Provide distributed tracing capabilities to visualize and track issues across our complex system
- Provide log aggregation and insights for R&D using best-in-class technology
- Provide a global view of the true customer experience through usage of Real-User Monitoring & external cloud-based solutions
Act as a champion for reliability and work with partner teams in different lines of business to influence product roadmaps to improve resiliency and reliability of all services. Champion our uptime targets and enable other teams to improve the way we measure the reliability of the system (20%)
Provide technical leadership in production triage, incident resolution, and retrospective/root cause analysis to maintain the world-class reliability and uptime of our platform (20%)
- Leverage a strong understanding of Cloud Architecture
- Apply knowledge of Java and the JVM (Java Virtual Machine) to triage and understand issues within services
- Implement strategies to increase system reliability and performance through on-call rotation and process optimization
- Lead incident post-mortems/retrospectives to surface reliability improvements and drive them to completion
Mentor and coach peers and reliability champions on SRE best practices. Contribute to running an SRE Guild (15%)

Do you have the right ingredients*? (Requirements)

Extensive and broad industry experience, with at least 5 years building and running production systems and participating in incident calls
Proficient in object-oriented languages such as Java and Python
Well-versed in software architecture, with a deep understanding of cloud and microservices
Demonstrated experience working with at least one major cloud platform (AWS, GCP, or Azure)
Exposure to complex, mission-critical, and large-scale distributed systems
Ability to set an example for the team with positive and inclusive leadership and discussion of work
General knowledge of most technical expertise areas, with deep knowledge in two areas:
- Observability platforms – APM
- Prometheus, Thanos, and Grafana: service catalog metrics and recording rules for alerts
- Log shipping pipelines and incident debugging visualizations
- Operating system (Linux) configuration, package management, startup and troubleshooting
- Block and object storage configuration and debugging
- Advanced Chef (syntax, recipes, cookbooks) and Ansible (syntax, tasks, playbooks)
- Advanced Terraform syntax and GitLab CI/CD configuration, pipelines, jobs
- Containers: cluster provisioning and new services

Our Spread of Total Rewards

Unlimited Vacation
Sabbatical opportunity after five years
Professional Development Reimbursement Program
Commitment to Employee Wellness through resources such as a quarterly Wellness Stipend
Various peer and company recognition programs
401(k) and matching
Medical, Dental, & Vision Coverage
Mental Health Benefits
Subsidized backup childcare

*Bread puns encouraged but not required

The starting pay rate for this role is below. Please note that there is not a range for this role; the number listed below is the rate.

Pay Rate

$142,000 - $227,000 USD

We are Toasters

Diversity, Equity, and Inclusion is Baked into our Recipe for Success.

At Toast our employees are our secret ingredient. When they are powered to succeed, Toast succeeds.

The restaurant industry is one of the most diverse industries. We embrace and are excited by this diversity, believing that only through authenticity, inclusivity, high standards of respect and trust, and leading with humility will we be able to achieve our goals.

Baking inclusive principles into our company and diversity into our design provides equitable opportunities for all and enhances our ability to be first in class in all aspects of our industry.