Cintia Del Rio and Sparsha.
UTC time | Nepal time | Australia time | Comments | ||
---|---|---|---|---|---|
Ubuntu meltdown patch was released (today) https://wiki.ubuntu.com/SecurityTeam/KnowledgeBase/SpectreAndMeltdown | |||||
4:00 | 9:45 | 15:00 | Sparsha was migrating crowd database to salima. | ||
4:25 | 10:10 | 15:25 | 12 new patches showing up in datadog for ako (ldap VM) | ||
5:00 | 10:45 | 16:00 | Sparsha detected the error in ako. | ||
5:09 | 10:54 | 16:09 | No more data sent to datadog from ako | ||
5:28 | 11:13 | 16:28 | Sparsha attempted to restart ako. No SSH access was available afterwards. Quite possibly the restart applied the new kernel. | ||
5:29 | 11:14 | 16:29 | ako reported as down in datadog | ||
5:49 | 11:34 | 16:49 | Skype call and comms sent. Backups were checked and were successfully uploaded to S3. | ||
6:15 | 12:00 | 17:15 | The VM responded to neither a reboot nor a hard reboot from OpenStack. Cintia decided to recreate the VM, keeping the data volume to avoid data loss. The belief at the time was that the Meltdown kernel patch had been applied and caused the trouble. | ||
6:45 | 12:30 | 17:45 | VM recreated, but the data partition was corrupted and couldn't be mounted. After several different attempts, Cintia decided it was beyond repair and tried to reformat the partition (even if it meant the data would be lost). | ||
6:55 | 12:40 | 17:55 | Cintia decided that the volume should be deleted instead, because it was not possible to repartition it. Cintia attempted to convince terraform to recreate the volume in OpenStack. | ||
7:20 | 13:05 | 18:20 | Even after several attempts, the old data volume couldn't be deleted from OpenStack (neither by the openstack CLI nor by terraform). Cintia removed the disk from the terraform state file and forced a new volume. | ||
7:40 | 13:25 | 18:40 | Machine again in datadog | ||
8:18 | 14:03 | 19:18 | Backup files being copied to /data, after ansible finished. | ||
8:27 | 14:12 | 19:27 | Backups restored, but crowd refused to connect to ldap, and telnet from the crowd and ID dashboard machines failed as well. | ||
8:40 | 14:25 | 19:40 | We discovered UFW is configured wrong in ansible for the ldap server. How was that even working before??????? | ||
9:00 | 14:45 | 20:00 | UFW is configured and reloaded. Telnet appears to be working again. | ||
9:08 | 14:53 | 20:08 | Comms being sent, as logins to JIRA and Confluence are working again. | ||
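
The telnet checks used to confirm the firewall fix can also be scripted, so the same test can be repeated from each client machine. The following is only a minimal sketch: the function name `port_is_open` is hypothetical, and the host and port to test are assumptions (389 is the standard LDAP port, but the port actually used by ako is not stated in this report).

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    This is roughly what `telnet host port` tells us: whether something is
    listening and whether a firewall (for example UFW) lets the connection through.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Run from the crowd and ID dashboard machines, a call such as `port_is_open("<ldap host>", 389)` returning `False` would point at the firewall or the service rather than at the restored data.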
...
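
The three time columns in the timeline can be cross-checked with a small script. This is a sketch under assumptions: the offsets below are inferred from the rows of the table (Nepal at UTC+5:45, Australia at UTC+11:00, which is consistent with Australian Eastern Daylight Time), and the helper name `local_times` is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Tuple

# Offsets inferred from the timeline table (assumed constant during the incident).
NEPAL_OFFSET = timedelta(hours=5, minutes=45)
AUSTRALIA_OFFSET = timedelta(hours=11)


def local_times(utc_hhmm: str) -> Tuple[str, str]:
    """Convert a UTC time written as H:MM into (Nepal time, Australia time)."""
    utc = datetime.strptime(utc_hhmm, "%H:%M")

    def fmt(d: datetime) -> str:
        # Match the table's style: no leading zero on the hour.
        return f"{d.hour}:{d.minute:02d}"

    return fmt(utc + NEPAL_OFFSET), fmt(utc + AUSTRALIA_OFFSET)
```

For example, `local_times("6:15")` gives `("12:00", "17:15")`, matching the row in which the VM was recreated.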