A place where the Joyent community can gather, help each other out, and stay informed.
(I hope I'm not pissing anyone off by posting this story; Rackspace and Joyent aren't really competitors since they provide very different services at very different price points.)
I've got a very expensive dedicated box on Rackspace. Rackspace had some pretty nasty downtime recently -- actually several instances of hours of downtime over a period of weeks -- which really isn't supposed to happen when you're paying many hundreds of dollars per month.
However, I was blown away by how well they handled it. I got an email as soon as the problems started and perhaps a dozen updates as they learned more and tried to mitigate damage. After everything was back online the CEO posted a video apology along with a detailed technical explanation, a timeline of what went wrong, and the detailed steps they're taking to prevent it from happening again. (The slightly more upbeat public version is here: http://www.rackspace.com/information/an … center.php). It actually made me feel better about hosting with them going forward. There's something to learn here.
Offline
Hi gang: it doesn't matter if you host a personal site or a business site with us. The issues dogging some of the Accelerators need to be remedied. Will be remedied one way or another. My hope is we can find a short term resolution while we continue to dig into this beast. The Systems team is focused on it and I will do my best to get another update from Jason or Dave on Thursday.
Offline
kristiewells wrote:
Hi gang: it doesn't matter if you host a personal site or a business site with us. The issues dogging some of the Accelerators need to be remedied. Will be remedied one way or another. My hope is we can find a short term resolution while we continue to dig into this beast. The Systems team is focused on it and I will do my best to get another update from Jason or Dave on Thursday.
This seems like a good time for me to re-pose the question that I posted in the second post in this thread (and that was then apparently ignored by Joyent staff):
someguy wrote:
By "older" do you mean shared accelerators that were some of the first to be deployed? If so, does it make sense to update the shared accelerator configuration or just redeploy those on the "older" setups? Do you have a sense for when these issues will be resolved? (N hours, N days, N weeks, or N months, where N<10)
I'm sure the Joyent staff has been putting in extra hours and effort to try to get to the bottom of this. THANKS!
Offline
FWIW I was having fairly regular issues with this (incredibly long response times via web and SSH), but within the last week or so that's gone away.
/me knocks on wood.
Offline
I am also noticing the performance issues on Bolinas seem to be getting much worse. The monpage shows response times are increasing substantially in the last couple of hours (http://joyent.monpage.com/index.php?n=b … &z=out).
Could we have an update on the status of this issue? Any visibility on the steps you are taking to resolve this issue would be greatly appreciated.
Thanks,
Tom
Offline
We have the remaining storage in place to complete resolving the issues for the remaining 5 cohorts, and we have begun a "snapmirroring" process to migrate the storage LUNs and bring the mirroring window down to 10 minutes. The process then is that all services will be brought down on a shared accelerator, a final sync will be performed, the zpools will be exported and then re-imported from the new storage. We're capping the total downtime to a 1 hour window on each shared host.
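The cutover sequence described above can be written down as an ordered, dry-run checklist. This is only a sketch: the pool name `examplepool` is hypothetical, the service stop and final sync are left as descriptions because the post does not say how they are performed, and the closing "restart services" step is an assumption (the post only implies it). The only concrete commands shown are the standard `zpool export` and `zpool import`. Nothing is executed.

```python
from typing import List, Optional, Tuple

# (description, command or None when the post gives no concrete command)
Step = Tuple[str, Optional[List[str]]]


def cutover_plan(pool: str) -> List[Step]:
    """Return the ordered cutover steps for one shared accelerator (dry run)."""
    return [
        ("bring down all services on the shared accelerator", None),
        ("perform the final sync to the new storage", None),
        ("export the zpool from the old storage", ["zpool", "export", pool]),
        ("re-import the zpool from the new storage", ["zpool", "import", pool]),
        # Assumption: services come back up afterwards; not stated in the post.
        ("restart services", None),
    ]


if __name__ == "__main__":
    # "examplepool" is a placeholder, not a real pool name from the thread.
    for number, (description, command) in enumerate(cutover_plan("examplepool"), 1):
        shown = " ".join(command) if command else "(site-specific, not shown)"
        print(f"{number}. {description}: {shown}")
```

Keeping the plan as data rather than running it makes it easy to review the order of operations before a one-hour window starts.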
The final cutovers will be staged per day and are scheduled:
Wednesday, December 12
Cohort 1
- bridgeway
- johnson
- humboldt
- bonita
- litho
- cooper
Thursday, December 13
Cohort 2
- magnolia
- myrtle
- tamalpais
- tunstead
- crescent
- mariposa
Friday, December 14
Cohort 3
- belle
- jones
- bolinas
- kemp
- madrone
- karl
Saturday, December 15
Cohort 4
- berlin
- caledonia
- turney
- napa
- spinnaker
- prospect
Sunday, December 16
Cohort 5
- girard
- harrison
- excelsior
- miller
- glen
- spencer
The schedule allows for things not going perfectly. If they do go perfectly, we'll notify you that the schedule has moved up and adjust the days.
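For anyone who wants to look up their own host, the schedule above can be put into a small lookup table. This is a sketch only: the function name `cutover_for` is made up for illustration, and the year 2007 is taken from the timestamps posted later in this thread.

```python
from datetime import date

# Cutover schedule as posted: day -> (cohort, hosts).
SCHEDULE = {
    date(2007, 12, 12): ("Cohort 1", ["bridgeway", "johnson", "humboldt",
                                      "bonita", "litho", "cooper"]),
    date(2007, 12, 13): ("Cohort 2", ["magnolia", "myrtle", "tamalpais",
                                      "tunstead", "crescent", "mariposa"]),
    date(2007, 12, 14): ("Cohort 3", ["belle", "jones", "bolinas",
                                      "kemp", "madrone", "karl"]),
    date(2007, 12, 15): ("Cohort 4", ["berlin", "caledonia", "turney",
                                      "napa", "spinnaker", "prospect"]),
    date(2007, 12, 16): ("Cohort 5", ["girard", "harrison", "excelsior",
                                      "miller", "glen", "spencer"]),
}


def cutover_for(host):
    """Return (cohort, day) for a host, or None if it is not on the schedule."""
    name = host.strip().lower()
    for day, (cohort, hosts) in SCHEDULE.items():
        if name in hosts:
            return cohort, day
    return None


if __name__ == "__main__":
    for host in ("bolinas", "bridgeway", "cardero"):
        print(host, "->", cutover_for(host))
```

A host that is not part of the five cohorts, such as cardero, simply returns `None`.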
Thank you again for your patience in this.
Offline
Dang! This is the Mother of All Updates.
Offline
This is much appreciated Jason.
I like the way Joyent's tech issues and their solutions are transparent.
-Dale
Offline
Awaiting resolution by Jason's schedule.
Offline
alexxale wrote:
No comments for a few days.
Is this issue 100% fixed already? I still experience slowdowns.
Look a little bit up the thread at the dates listed by the admins. Hasn't truly started yet.
Offline
alexxale wrote:
No comments for a few days.
Is this issue 100% fixed already? I still experience slowdowns.
Did you actually read Jason's update above?
jason wrote:
The final cutovers will be staged per day and are scheduled:
Wednesday, December 12
...
Given that today's the 10th, it's not unexpected that things are still slow. My expectation would be that the staff update us again if the schedule is delayed, and once all the cutovers are complete.
Tom
Offline
We are still on track to start the move this Wednesday based on Jason's schedule. The systems team is doing minor tweaks each day trying to reduce some of the load, but Wednesday is the big day.
Offline
Just curious, will the uptime reset or something? I want something to check back on to know when it's done because I need to do some pro-active work to basically turn my site back on.
Offline
Yeah, I'd love to know when my downtime is going to be so we can plan accordingly. Failing that, it'd be cool to know exactly when it's done, so I can assure my users that our troubles are mostly over.
Obviously, if neither of those is possible, I'll get over it, it'd just be nice to have some definitive info.
My, we are needy little bastards, aren't we? :)
Offline
I am in the process of sending an email to everyone in Cohort 1, which will be moved today. The plan is to start this around 11:59pm PST; I am just awaiting confirmation from Jason. Expected downtime is up to two hours. Hoping it is a lot less than this, but plan for an hour.
Emails are also being sent to those on the other affected Accelerators.
EDIT: Got the actual start time from Jason/Ben, so we are dialed in.
Offline
Cool!
Offline
kristiewells wrote:
11:59pm PST
That's 7:59am GMT the following day. Adjust your watch accordingly.
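The conversion can be double-checked with a few lines of Python. This sketch treats PST as a fixed UTC-8 offset and assumes the start date is Wednesday, December 12, 2007 (the Cohort 1 day in Jason's schedule); the helper name `pst_to_gmt` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# PST as a fixed offset of UTC-8 (no daylight saving in December).
PST = timezone(timedelta(hours=-8), "PST")


def pst_to_gmt(local_naive):
    """Interpret a naive datetime as PST and return it in GMT/UTC."""
    return local_naive.replace(tzinfo=PST).astimezone(timezone.utc)


# Assumed start: 11:59pm PST on the Cohort 1 day.
start_gmt = pst_to_gmt(datetime(2007, 12, 12, 23, 59))

if __name__ == "__main__":
    print(start_gmt.strftime("%Y-%m-%d %H:%M GMT"))  # 2007-12-13 07:59 GMT
```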
Offline
Lovely jubbly :-) Many thanks.
Offline
Results of the "perpetual ls test" -- bridgeway vs. cardero in the last 4 1/2 hours:
loaded 19058 lines for bridgeway
Results for bridgeway
Total measurements: 3175
Measurements taken from Thursday, 13 December 2007 -- 8:22:12 GMT to Thursday, 13 December 2007 -- 12:48:16 GMT
less than 5ms: -- 85.102% (2702)
5ms - 10ms: -- 10.677% (339)
10ms - 50ms: -- 2.866% (91)
50ms - 100ms: -- 0.409% (13)
100ms - 500ms: -- 0.945% (30)
500ms - 1sec: -- 0.000% (0)
1sec - 5sec: -- 0.000% (0)
5sec - 10sec: -- 0.000% (0)
10sec - 20sec: -- 0.000% (0)
20sec - 30sec: -- 0.000% (0)
30sec - 60sec: -- 0.000% (0)
1 - 2 minutes: -- 0.000% (0)
over 2 minutes: -- 0.000% (0)

loaded 11805 lines for cardero
Results for cardero
Total measurements: 1966
Measurements taken from Thu, 13 Dec 2007 -- 08:42:10 GMT to Thu, 13 Dec 2007 -- 12:44:13 GMT
less than 5ms: -- 48.627% (956)
5ms - 10ms: -- 11.953% (235)
10ms - 50ms: -- 20.956% (412)
50ms - 100ms: -- 5.341% (105)
100ms - 500ms: -- 7.782% (153)
500ms - 1sec: -- 2.035% (40)
1sec - 5sec: -- 1.679% (33)
5sec - 10sec: -- 1.017% (20)
10sec - 20sec: -- 0.560% (11)
20sec - 30sec: -- 0.051% (1)
30sec - 60sec: -- 0.000% (0)
1 - 2 minutes: -- 0.000% (0)
over 2 minutes: -- 0.000% (0)
Bridgeway is showing major improvement. Nice work folks!
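For anyone who wants to run a similar comparison, here is a minimal sketch of a directory-listing latency test that produces the same kind of bucketed summary. It is not the script used for the numbers above, which was not posted. The function names, the use of `os.listdir`, and the choice to put a boundary value (for example exactly 5ms) in the higher bucket are all assumptions.

```python
import os
import time
from collections import Counter

# Upper bound of each bucket in seconds, with the labels used in the results.
BUCKETS = [
    (0.005, "less than 5ms"), (0.010, "5ms - 10ms"), (0.050, "10ms - 50ms"),
    (0.100, "50ms - 100ms"), (0.500, "100ms - 500ms"), (1.0, "500ms - 1sec"),
    (5.0, "1sec - 5sec"), (10.0, "5sec - 10sec"), (20.0, "10sec - 20sec"),
    (30.0, "20sec - 30sec"), (60.0, "30sec - 60sec"), (120.0, "1 - 2 minutes"),
]
OVERFLOW = "over 2 minutes"


def bucket_for(seconds):
    """Return the label of the first bucket whose upper bound exceeds the sample."""
    for upper, label in BUCKETS:
        if seconds < upper:
            return label
    return OVERFLOW


def time_listing(path):
    """Time a single directory listing and return the elapsed seconds."""
    start = time.perf_counter()
    os.listdir(path)
    return time.perf_counter() - start


def summarize(samples):
    """Format per-bucket percentages and counts, one line per bucket."""
    counts = Counter(bucket_for(s) for s in samples)
    total = len(samples)
    lines = [f"Total measurements: {total}"]
    for label in [label for _, label in BUCKETS] + [OVERFLOW]:
        n = counts[label]
        pct = 100.0 * n / total if total else 0.0
        lines.append(f"{label}: -- {pct:.3f}% ({n})")
    return lines


if __name__ == "__main__":
    # Short demo run against the current directory; a real test would loop
    # for hours with a pause between samples.
    samples = [time_listing(".") for _ in range(20)]
    print("\n".join(summarize(samples)))
```

Running the same loop on two hosts over the same time window gives a side-by-side distribution like the bridgeway and cardero figures.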
Offline
It's so cool. Thank you very much.
Offline