Context
For one of our Clients, our team of two people was maintaining an Android app with 40,000 daily active users. After releasing a new version with fresh features, we began receiving reports of crashes and lags affecting some users. We needed to quickly identify and resolve the issue to maintain user satisfaction and app stability.
Problem Identification and Consequences
The core issue was an out-of-memory error, as reported by Firebase Crashlytics. Key aspects of the problem included:
- The crash only occurred for a single user, not on any of the team's test devices
- Previous app versions functioned without issues for all users
- The affected user was a "power user" with 4,000 active conversations, placing them in the top 100 most engaged users
- The crash started happening on the publicly available app version, not just the beta
- Several crashes correlated with a high number of blocked users: the affected user had blocked about 1,000 users
The consequences of this issue, if left unresolved, could include:
- Loss of highly engaged users
- Decreased app stability and performance
- Potential negative impact on user retention and app reputation
Solution Implementation
We took a systematic approach to solve the problem:
- Created a script to automatically generate a large number of conversations, mimicking the affected user's scenario.
- Modified the script to also create a significant number of blocked users.
- Successfully reproduced the crash, confirming the interplay between high conversation numbers and blocked users as the cause.
- Implemented a simple code adjustment to fix the issue.
- Released a new version and proactively tested it with heavy user accounts.
- Received confirmation from the affected user that the crash was resolved.
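The reproduction steps above can be sketched roughly as follows. This is a minimal illustration, not the script the team actually used: the function `build_seed_profile`, the `SeedProfile` fields, and the id formats are all hypothetical, and the sketch only builds the synthetic data in memory. Uploading it to a test account depends on the backend, which the case study does not describe.

```python
import json
import random
from dataclasses import dataclass, asdict


@dataclass
class SeedProfile:
    """Synthetic data for one heavy test account (all field names are hypothetical)."""
    user_id: str
    conversations: list
    blocked_user_ids: list


def build_seed_profile(user_id, n_conversations=4000, n_blocked=1000, seed=0):
    """Build a power-user profile sized like the affected account."""
    rng = random.Random(seed)  # fixed seed so every run reproduces the same account
    # Peer pool large enough for one peer per conversation plus extra accounts,
    # so some blocked users can be accounts with no conversation at all.
    peers = [f"peer-{i}" for i in range(n_conversations + n_blocked)]
    conversations = [
        {"id": f"conv-{i}", "peer_id": peers[i], "message_count": rng.randint(1, 50)}
        for i in range(n_conversations)
    ]
    # sample() picks without replacement, so blocked ids are unique
    blocked = rng.sample(peers, n_blocked)
    return SeedProfile(user_id, conversations, blocked)


if __name__ == "__main__":
    profile = build_seed_profile("heavy-test-user")
    payload = json.dumps(asdict(profile))
    print(len(profile.conversations), len(profile.blocked_user_ids), len(payload))
```

Keeping both volumes as parameters makes it possible to vary conversations and blocked users independently, which is how the interplay between the two can be isolated.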
Business and Product Gains
By prioritizing power users and implementing robust testing processes, we solved the immediate problem and also set up safeguards for future development, ensuring a consistently smooth experience for all users. In detail, we:
- Resolved the critical crash affecting power users.
- Implemented new processes to prevent similar issues:
  - Heavy User Approval: Seeking approval from power users before releasing new versions.
  - Enhanced Sanity Testing: Including scenarios with high conversation and blocked user volumes.
  - Integration Testing: Every code change undergoes testing with power user scenarios.
  - Benchmarking: A new suite monitors login times and app responsiveness for power user accounts with each release.
- Maintained positive metrics:
  - The North Star metric (key indicator of app success) remained largely unaffected.
  - Daily Active Users and crash rates improved for the production release.
- Strengthened the focus on user-centric development, particularly for highly engaged users.
- Improved the Client’s ability to quickly identify, reproduce, and resolve complex issues.
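As an illustration of the benchmarking safeguard listed above, a per-release check could time an action several times and compare the median against a budget. This is a sketch under assumptions: `measure`, `check_budget`, `fake_login`, and the 2-second budget are hypothetical names and values, and `fake_login` stands in for a real login against a power-user account.

```python
import statistics
import time


def measure(action, runs=5):
    """Return the median wall-clock duration of `action` in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def check_budget(name, duration_s, budget_s):
    """Return (passed, message) for one benchmark against its time budget."""
    passed = duration_s <= budget_s
    status = "OK" if passed else "REGRESSION"
    return passed, f"{name}: {duration_s:.3f}s (budget {budget_s:.3f}s) {status}"


def fake_login():
    # Placeholder: a real suite would log in with a seeded power-user account.
    time.sleep(0.01)


if __name__ == "__main__":
    ok, message = check_budget("login", measure(fake_login), budget_s=2.0)
    print(message)
```

Using the median rather than a single run reduces the chance that one slow sample flags a release as a regression.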