We got bitten by a massive ZFS bug. That's the long and the short of it. The ZFS corruption made its way into the backups. The good news is that we can unravel the corruption. The bad news is that, because Strongspace and BingoDisk ran on a Thumper (aka Sun Fire X4500, with 48 500GiB drives), we have to use other Thumpers to stage the decoding of the ZFS mess. Moving that much data around has taken time; how quickly we can move bits between Thumpers comes down to the laws of physics. The good news is that we will be able to restore all data. That's the current consensus. The bad news is that, while we could bring the server on-line right now, we can't guarantee it would be stable from a ZFS standpoint. We're taking more time to ensure we don't end up right back where we were. A full technical explanation will be forthcoming once the service is back on-line.
Thanks for your patience.
Dave has posted an update at Joyeur. I will re-post it here so you don't have to click back and forth:
HIGH-LEVEL EXPLANATION
Joyent is working to bring our Strongspace and BingoDisk products back on-line. They were taken offline this past Saturday (January 12) because of instabilities (e.g. read errors, checksum failures) that Joyent has experienced with Sun Fire X4500 hardware (aka "Thumper"), which were only exposed after a series of upgrades to the operating system itself. A ZFS bug prevented a speedy recovery. No other Joyent products have been affected. Strongspace and BingoDisk are the only services running on X4500s. The rest of our X4500 inventory is used for cold backups of Joyent server nodes.
ZFS continues to be a competitive advantage for Joyent and our customers. Typical nodes at Joyent running ZFS can recover from crashes in a matter of seconds. As I said, the interruption of Strongspace and BingoDisk has not affected other Joyent services.
TECHNICAL EXPLANATION
The nature of the Strongspace and BingoDisk products has complicated the recovery of these services. Both services run together on a single Sun Fire X4500. The X4500 is a dual-socket server/storage device with 48 500GiB drives. It has been pointed out elsewhere that we were running an older version of the OpenSolaris operating system on this X4500. That is true. However, since this particular X4500 also housed two services (rather than just backups), we had been waiting to upgrade it in anticipation of some software updates that were/are in the pipeline for OpenSolaris itself. These include an improved "scrub" (which can currently stall or hang), faster ZFS lists (listing datasets can sometimes take more than an hour on machines such as an X4500 with sizeable data), the ability to recursively replicate datasets, and graceful recovery when a device (i.e. a drive) fails.
Unfortunately, OpenSolaris does not currently provide a straightforward upgrade process from build to build. If all the stars aligned, an upgrade would take about six hours. Realistically, we estimated we would have needed to schedule a multiday downtime, given the historical uncertainties around importing zpools from older versions of ZFS into newer versions of ZFS. Further, the sheer amount of data managed by these services means moving data around for recovery purposes takes lots of time. We're up against the laws of physics.
As predicted, the upgrade of the operating system, in response to the interruption, went relatively smoothly. We have been working for 3 days to get the data usable. We have swapped X4500 chassis completely. Work continues.
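To give a rough sense of the "laws of physics" point, here is a back-of-the-envelope sketch in Python. It only uses the raw capacity stated above (48 drives of 500GiB); the link speeds are hypothetical, since the post does not say how the Thumpers are connected, and the amount of data actually stored is not stated either, so treat the result as an illustrative upper bound rather than a description of the actual recovery.

```python
GIB = 2**30  # bytes in one GiB


def transfer_hours(data_bytes: float, link_bits_per_s: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_bytes over a link running at the given efficiency (0-1]."""
    seconds = data_bytes * 8 / (link_bits_per_s * efficiency)
    return seconds / 3600


# Raw capacity of one X4500 as described in the post: 48 drives of 500GiB each.
raw_capacity = 48 * 500 * GIB

# Hypothetical link speeds (assumptions, not taken from the post).
links = [("1 Gbit/s", 1e9), ("10 Gbit/s", 1e10)]

for name, bits_per_s in links:
    hours = transfer_hours(raw_capacity, bits_per_s)
    print(f"{name}: {hours:.1f} hours at full line rate")
```

Under these assumptions, a single full copy of the raw capacity over a 1 Gbit/s link takes roughly 57 hours at perfect line rate, before any protocol overhead, disk limits, or repeated passes between machines.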
CONCLUSION
We continue to work to restore Strongspace and BingoDisk. We are still in the midst of the data recovery process. I will continue to update this post as I learn more. If you have questions, please post a comment and we'll try to answer.