Senior Site Reliability Engineer

Knack, via We Work Remotely, posted over 1 year ago

DevOps and Sysadmin, full-time, USA only

Time zones: EST (UTC -5), CST (UTC -6), MST (UTC -7), PST (UTC -8), AKST (UTC -9), HST (UTC -10)

We’re looking for someone to help improve our reliability and performance through deep analysis and remediation of our AWS infrastructure, monitors, alerts, and code.

Key Responsibilities

Refactor our existing monitors and alerts to be actionable and reliable, recommending and implementing diagnostic techniques and monitoring tools.

Deep dive and analysis into RDS (Aurora PostgreSQL) performance, using that data to inform scaling policies and automation

Help discover correlations between customer experience and performance indicators to determine what is noticeable to customers, then suggest and implement improvements based on the findings

Help us develop SLIs, SLOs, and SLAs that are impactful as they relate to our customers' experience

Help triage outages and issues across multiple teams, services, and codebases as they arise, leading root cause analysis and creating stories to prevent and/or detect those issues in the future

Serve as technical lead for deep dives to identify solutions to prevent future incidents

Introduce chaos engineering, promoting experimentation in production to discover and remediate systemic weaknesses and improve performance and reliability

**Skills, Knowledge and Expertise**

Expertise in AWS

Expertise with RDS, preferably Aurora PostgreSQL engine

Expertise with containerization

Experience with open source monitoring and visualization systems and tools, e.g. Prometheus (monitoring + tracing), Grafana/Kibana (dashboards), Graylog (logging)

Experience implementing, maintaining, and troubleshooting continuous integration/continuous delivery (CI/CD) tooling

Experience with implementing improvements in areas such as maintainability, scalability, availability, extensibility and security

Ability to work with many teams across disciplines (cloud, platform, development, QA, and security) to resolve issues as they arise and implement improvements

Experience with distributed tracing, diagnostic tooling, application performance monitoring, and the golden signals

**Our Stack**

Our stack is evolving over the next year and we’d love you to be a part of that!

Currently we’re using:

Back-end: JavaScript/TypeScript, Node.js, ES6, Go

Data: Aurora PostgreSQL, Redis, Elasticsearch

DevOps & Deployment: All things AWS, Terraform (and Terraform Cloud), Jenkins, GitHub, Grafana, Graylog

Testing: Playwright, Mocha, Jest

Front-end: Vue.js, Webpack, SCSS