Delta Air Lines Datacenter Failure


ron

Senior Member
http://www.nytimes.com/2016/08/09/business/delta-air-lines-delays-computer-failure.html?_r=0

Does anyone know someone in the Georgia area who might share what actually happened?

From other reports, it sounds like they were doing a test transition to generator at 2 AM and a catastrophic failure left them unable to distribute either utility or generator power to the downstream IT equipment (a single point of failure).

I'm curious what actually happened, so we can learn from it, and whether anyone was hurt.
 
I haven't heard much about it, but I heard on the news today that we are likely to see more computer crashes like this in the future. The reason is that these computers are not only used for bookings, but also for seat assignments, baggage claims, mileage programs, etc. One would think that as long as you have enough memory the computers could handle the workload, but when they crash they go down hard and cause chaos.
 
Never mind the IT side of things; they will always "innovate," and hopefully they have disaster recovery for their data.

Reports say this was an infrastructure issue that caused the downstream load failure.
 
My knowledge of IT pretty much stops at what the initials stand for, but wouldn't you segregate functions onto different servers, so that if your mileage server went down you could still book flights, etc.?
 
There should always be redundancy for continuity of operations. They learned from it and will be better going forward.
Nobody said an education is free.
 
Yes, I would also be interested in the details of what actually failed, whether it was on the backup generator side or in the UPS equipment, and, as Ron stated, whether a Single Point of Failure (SPOF) was uncovered as a result of some equipment failure. If that was the case, it points to a faulty system design, as most large data centers serving critical loads have redundant systems with double, sometimes triple, layers of backup on the UPS side ahead of the front ends of the IT servers. I have worked in enough large centers to know that the underlying philosophy was always to eliminate any SPOF in the system design.

Some sites use "data mirroring," where "the data is copied from one location to a storage device in real time. Because the data is copied in real time, the information stored from the original location is always an exact copy of the data from the production device. Data mirroring is useful in the speedy recovery of critical data after a disaster. Data mirroring can be implemented locally or offsite at a completely different location."
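
To illustrate the mirroring idea in that quote, here is a minimal sketch in Python. The class and method names (Store, MirroredStore, write, failover) are invented for illustration only and are not any vendor's replication API; the point is simply that every write lands on both the production device and the copy, so the copy is usable immediately after a disaster.

# Minimal sketch of synchronous data mirroring as described above.
# All names are illustrative, not any particular product's API.

class Store:
    """A trivial key-value store standing in for a storage device."""
    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value):
        self.data[key] = value


class MirroredStore:
    """Writes go to the production device and the mirror in the same call,
    so the mirror is always an exact copy of the primary."""
    def __init__(self, primary, mirror):
        self.primary = primary
        self.mirror = mirror

    def write(self, key, value):
        self.primary.write(key, value)   # production device
        self.mirror.write(key, value)    # local or offsite copy, kept in lockstep

    def failover(self):
        """After a disaster, the mirror can serve the data immediately."""
        return self.mirror


# Usage: record a reservation, lose the primary, recover from the mirror.
store = MirroredStore(Store("primary-site"), Store("offsite-mirror"))
store.write("PNR-ABC123", {"pax": "J. Smith", "flight": "DL123"})
recovered = store.failover()
print(recovered.data["PNR-ABC123"])   # exact copy survives the outage
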
On the other side of the house, I have seen systems that use redundant ATSs (two in parallel): if one ATS failed to transfer the load (the UPS) to the alternate source, the other ATS would switch and feed the UPS while the front ends were carried on the batteries.
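
Here is a rough control-logic sketch of that parallel-ATS arrangement. The function names, the simple boolean health flags, and the source labels are all invented for illustration; real transfer-switch logic involves voltage/frequency sensing, timers, and interlocks that this sketch ignores.

# Rough sketch of the redundant-ATS idea above: two transfer switches can
# feed the UPS, and if the first fails to transfer, the second picks up the
# load while the IT front ends ride through on the UPS batteries.

def try_transfer(ats_healthy, alternate_source_available):
    """Return True if this ATS successfully moves the UPS to the alternate source."""
    return ats_healthy and alternate_source_available

def feed_ups(utility_ok, gen_ok, ats_a_ok, ats_b_ok):
    if utility_ok:
        return "utility"
    # Utility lost: the first ATS attempts to transfer to the generator.
    if try_transfer(ats_a_ok, gen_ok):
        return "generator via ATS-A"
    # ATS-A failed to transfer: the parallel ATS takes over.
    if try_transfer(ats_b_ok, gen_ok):
        return "generator via ATS-B"
    # Both paths failed: the load runs on UPS batteries until they deplete.
    return "UPS batteries only (limited runtime)"

# A 2 AM generator test where ATS-A fails but ATS-B saves the load:
print(feed_ups(utility_ok=False, gen_ok=True, ats_a_ok=False, ats_b_ok=True))
# -> "generator via ATS-B"
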
Maybe someone close to what happened reads this forum and can share more details, although a lot of this, given the high dollars lost, gets hushed up because of insurance issues and future litigation.
 
The problem for most companies is that full hot-standby redundancy means effectively building two data centers, plus enough datacom capacity between them to keep the data replicated, and that costs $$$$$$$$$$. Large securities firms do that sort of thing because being offline for 30 seconds means millions of dollars of loss (and sometimes a congressional visit to explain); airlines don't lose money at anywhere near that rate.
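
The trade-off amounts to simple break-even arithmetic, sketched below. Every dollar figure is a made-up placeholder, not an actual number for Delta, a securities firm, or anyone else; the only point is that the decision hinges on loss rate per minute of downtime versus the carrying cost of the standby site.

# Back-of-envelope version of the trade-off above: when does a second,
# fully hot-standby data center pay for itself? All figures are placeholders.

annual_standby_cost = 50_000_000      # second site + datacom to keep data replicated
loss_per_minute_down = 100_000        # cost while systems are offline
expected_minutes_down_per_year = 60   # outage exposure without the standby site

expected_outage_loss = loss_per_minute_down * expected_minutes_down_per_year
print(f"Expected outage loss: ${expected_outage_loss:,}")
print(f"Standby site cost:    ${annual_standby_cost:,}")
print("Standby pays for itself" if expected_outage_loss > annual_standby_cost
      else "Cheaper to eat the occasional outage")
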

ETA: proper data center design has completely redundant power into each cabinet, from separate PDUs back to separate panels, separate UPSs, separate generators, and separate incoming feeds. The only thing in common is the grid tie that keeps everything in sync.
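
One way to picture that A/B-path rule is as a simple check that the two chains feeding a cabinet share no equipment. The topology and equipment names below are invented for illustration; any piece of gear that shows up on both paths is, by definition, a single point of failure.

# Small sketch of the A/B power-path idea above: each cabinet gets two feeds
# traced back through separate PDUs, panels, UPSs, generators, and utility
# entrances, and the design is only redundant if the two chains share nothing.

path_a = ["PDU-A", "Panel-A", "UPS-A", "ATS-A", "Gen-A", "Utility-Feed-1"]
path_b = ["PDU-B", "Panel-B", "UPS-B", "ATS-B", "Gen-B", "Utility-Feed-2"]

def single_points_of_failure(a, b):
    """Any piece of equipment appearing in both paths is a SPOF."""
    return set(a) & set(b)

spof = single_points_of_failure(path_a, path_b)
print("No single point of failure" if not spof else f"SPOF found: {sorted(spof)}")

# If both paths terminated in the same switchgear, the check would flag it:
print(single_points_of_failure(path_a + ["Main-Switchgear"],
                               path_b + ["Main-Switchgear"]))  # -> {'Main-Switchgear'}
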
 