A place where the Joyent community can gather, help each other out, and stay informed.
Some accelerators, both shared and non-shared, are experiencing I/O issues on the backend storage nodes.
The typical symptoms are slow shells and longer page load times for web pages and applications. The appearance of these symptoms is seemingly random, and they're not affecting everyone.
Fundamentally the cause is a lack of spindles/drives to handle the wide fluctuations for some of our users on "older" setups. The solutions are that we're adding more spindles/drives in hot spots to eliminate them, and we're migrating some users (typically with workloads that are fine but simply different from most of us) to an infrastructure that'll provide what they need. Also, "older" doesn't mean "old"; I'm trying to say that we've recognized this issue, and that recent and new deployments are not going to be experiencing this.
My apologies to those of you experiencing these issues, and thank you for your patience.
Offline
By "older" do you mean shared accelerators that were some of the first to be deployed? If so, does it make sense to update the shared accelerator configuration or just redeploy those on the "older" setups? Do you have a sense for when these issues will be resolved? (N hours, N days, N weeks, or N months, where N<10)
I'm sure the Joyent staff has been putting in extra hours and effort to try to get to the bottom of this. THANKS!
Offline
I haven't had any problems on my Shared Accel, but I've been reading about problems others have been having for a few weeks.
Thanks, Jason. This is the type of post we like seeing from you. It clearly states the problem, who is affected, and what steps are being taken to resolve the issue. The only thing you need to add is a time frame for when you plan to have the solution in place.
Offline
madams wrote:
The only thing you need to add is a time frame for when you plan to have the solution in place.
From yesterday til it's done.
Offline
Hope this relates to bolinas and johnson. Cuz they're slower than molasses right now. I don't know many users who are willing to watch a progress bar for 30 seconds.
But I do appreciate the update. Thanks
Offline
Jason wrote:
From yesterday til it's done.
that is the only kind of time frame joyent should be announcing. i like it better than when you guys set unrealistic expectations on yourselves.
Offline
Yes, for sure was seeing this on Myrtle and Magnolia. Great to hear you're working on it.
Offline
Hope it's easy to throw the drives in, Bridgeway is even worse today. Reid's homepage just took 1m33.681s to fetch (when it doesn't throw a 50x). I'm using his because I'm still rewriting mine to a plain "we're down" page. :(
Offline
bretthoerner wrote:
Reid's homepage just took 1m33.681s to fetch (when it doesn't throw a 50x).
Yes, I've had the role reversal of three clients contacting me to report that my site is down. Even my 80 year old Mom wants to know what's up with my site. My niece isn't happy either. Being 2.5 years old, that's not uncommon, though.
Me? I'm calm about it only because I don't have the time to be any other way.
Offline
jotto wrote:
Jason wrote:
From yesterday til it's done.
that is the only kind of time frame joyent should be announcing. i like it better than when you guys set unrealistic expectations on yourselves.
+1
Good estimates > No estimates > Bad estimates
Also, could this be added to the Server Status page:
https://help.joyent.com/index.php?pg=fo … s&id=1
Right now the servers affected are not listed.
Offline
One more thing that wasn't mentioned on this thread was the severity of the problem.
I'd like to contribute a few statistics that I collected today.
To measure performance, I ran "time ls > /dev/null" repeatedly on a directory which contains fewer than 30 files. I did this on two machines (cardero and bridgeway).
Some highlights of the results:
1. Most of the time (>55%), 'ls' on cardero (FreeBSD, shared hosting) takes less than 5ms to complete.
2. On bridgeway (the Solaris Premier container), only about 16% of the commands take <5ms.
3. A very significant portion of the time (>25%), the command on bridgeway takes over 100ms to complete.
4. About 1% of the time the command on bridgeway took between 10 seconds and a full minute. In 2 instances the command took over 2 minutes to execute!
Here's some detailed info on what I did. To start, I used the following little piece of bash code:
while true; do echo "----"; date; time ls > /dev/null; sleep 5; done
I collected the output, did some parsing on it using a small php script, and did a histogram analysis of the time elapsed.
I did this on both bridgeway (shared container, Solaris, premier) and on cardero (an old FreeBSD, shared hosting machine).
Hopefully there aren't too many bugs in my analysis script ;)
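For anyone who wants to repeat the measurement, here is a minimal sketch of the parsing and histogram step in shell and awk. It is not the PHP script mentioned above, only one way the same bucketing could be done. It assumes the log contains bash's default `time` output (lines such as "real 0m0.004s" with three decimal places); the function name `histogram` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not the original PHP script): read the loop's saved
# output on stdin and print a histogram of the "real" timings, using the
# same buckets as the results in this post.
# Assumes the default bash format, e.g. "real<TAB>1m33.681s".
histogram() {
    awk '
    BEGIN {
        # Upper bound of each bucket in milliseconds; anything at or
        # above the last bound lands in the final bucket.
        nlim = split("5 10 50 100 1000 5000 10000 20000 30000 60000 120000", limit, " ")
        nlab = split("less than 5ms;5ms - 10ms;10ms - 50ms;50ms - 100ms;100ms - 1sec;1sec - 5sec;5sec - 10sec;10sec - 20sec;20sec - 30sec;30sec - 60sec;1 - 2 minutes;over 2 minutes", label, ";")
    }
    $1 == "real" {
        split($2, t, "m")        # "1m33.681s" -> "1" and "33.681s"
        sub(/s$/, "", t[2])      # drop the trailing "s"
        split(t[2], p, "[.]")    # whole seconds and milliseconds
        ms = (t[1] * 60 + p[1]) * 1000 + p[2]
        b = nlab
        for (i = 1; i <= nlim; i++)
            if (ms < limit[i]) { b = i; break }
        count[b]++
        total++
    }
    END {
        printf "Total measurements: %d\n", total
        for (i = 1; i <= nlab; i++)
            printf "%s: -- %.3f%% (%d)\n", label[i], (total ? 100 * count[i] / total : 0), count[i]
    }'
}
```

Note that bash writes the `time` report to stderr, so the loop's stderr has to be captured along with its stdout. With the combined output saved to a file (say a hypothetical ls-times.log), the histogram would come from `histogram < ls-times.log`.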
The results I got were as follows:
Results for cardero (FreeBSD shared)
Total measurements: 5288
Measurements taken from Thu, 29 Nov 2007 -- 09:16:04 GMT to Thu, 29 Nov 2007 -- 20:15:31 GMT
less than 5ms: -- 55.862% (2954)
5ms - 10ms: -- 11.876% (628)
10ms - 50ms: -- 17.114% (905)
50ms - 100ms: -- 4.274% (226)
100ms - 1sec: -- 7.451% (394)
1sec - 5sec: -- 2.156% (114)
5sec - 10sec: -- 0.756% (40)
10sec - 20sec: -- 0.473% (25)
20sec - 30sec: -- 0.038% (2)
30sec - 60sec: -- 0.000% (0)
1 - 2 minutes: -- 0.000% (0)
over 2 minutes: -- 0.000% (0)
Results for bridgeway (Solaris shared container, Premier)
Total measurements: 7076
Measurements taken from Thursday, 29 November 2007 -- 9:03:54 GMT to Thursday, 29 November 2007 -- 20:16:46 GMT
less than 5ms: -- 15.941% (1128)
5ms - 10ms: -- 7.292% (516)
10ms - 50ms: -- 34.483% (2440)
50ms - 100ms: -- 15.941% (1128)
100ms - 1sec: -- 20.548% (1454)
1sec - 5sec: -- 3.477% (246)
5sec - 10sec: -- 1.074% (76)
10sec - 20sec: -- 0.565% (40)
20sec - 30sec: -- 0.212% (15)
30sec - 60sec: -- 0.254% (18)
1 - 2 minutes: -- 0.184% (13)
over 2 minutes: -- 0.028% (2)
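The ">25% over 100ms" highlight can be checked directly against the bucket counts in the two tables. A small sketch of that arithmetic follows; the function name `share` is made up for illustration, and the numbers passed to it are the counts posted above for every bucket from "100ms - 1sec" upward.

```shell
#!/bin/sh
# Hypothetical helper: percentage of measurements that fall in the given
# buckets. Usage: share TOTAL COUNT...
share() {
    total=$1
    shift
    sum=0
    for c in "$@"; do
        sum=$((sum + c))
    done
    awk -v s="$sum" -v t="$total" 'BEGIN { printf "%.1f%%\n", 100 * s / t }'
}

# Buckets from "100ms - 1sec" up to "over 2 minutes", taken from the tables.
share 7076 1454 246 76 40 15 18 13 2   # bridgeway
share 5288 394 114 40 25 2 0 0 0       # cardero
```

This prints 26.3% for bridgeway and 10.9% for cardero, so roughly one in four commands on bridgeway took longer than 100ms, against about one in nine on cardero.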
I am extremely hopeful that 'til it's done' will be very very quick.
Last edited by ronp001 (2007-11-30 14:45:19)
Offline
The problem we're having is unrelated to whether you're on a Shared or Solo Accelerator. We just had really bad luck with our storage infrastructure, it seems.
Offline
Err, now, is this a recent problem? It's also possible that your site/app is slow regardless of this I/O problem and that you DO indeed need more CPU/RAM. A non-shared Accelerator can't hurt, for sure.
Offline
reid i just stumbled upon your 500 error message. hilarious. though i hope the situation improves soon.
Offline
bretthoerner wrote:
Err, now, is this a recent problem? It's also possible that your site/app is slow regardless of this I/O problem and that you DO indeed need more CPU/RAM. A non-shared Accelerator can't hurt, for sure.
We have no traffic at all on our website and it loads very slowly, and even times out sometimes.
Offline
Then that's the current I/O problem Jason is posting about... we can just hope they fix it soon.
Offline
ichigo wrote:
reid i just stumbled upon your 500 error message. hilarious.
Thanks, I don't have a custom 503 message, but there is this one for an Error 500, Error 403, and, of course, Error 404
Might as well have fun with errors, if you're going to have them. In fact, a new Error 503 message could be a project for later tonight. It might have an illustration of an accordion.
Offline
What is a spindle? I guess it is some kind of hardware but what is it?
Offline
It's a way of saying they need more hard drives. Spindles are what the HD platters spin on (like a car axle) and the more physical spindles they have in their (I assume) RAID-Z the faster we go.
Offline
:) I took it too seriously heh!
Offline
David's update is good also:
http://discuss.joyent.com/viewtopic.php … 57#p155657
As one of the people who is an IO user I appreciate that Joyent wants to fix the issue.
Offline
Jan wrote:
I predict a performance melt down as Reid's error pages get a mention on slashdot.
Ha! ;-)
Offline
Thanks guys. My accelerator seems fine, but Litho, Tamalpais, and Mariposa have been having issues. I have faith.
Offline
fitzage wrote:
I have faith.
Me too. I'm a believer.
Offline
I found a silver lining to this: Litho's recent flakiness has allowed me to reproduce some previously un-reproducible Coda bugs. All I needed was a shitty server to test on. :)
That said, let's hope this doesn't continue much longer. I've got a musical advent calendar launching soon.
Offline
Overall I'm just glad that Joyent has admitted to the problem and is keeping us informed. I started moving to Magnolia from a FreeBSD and wasn't exactly stoked on the performance as I was getting stuff configured and brought online. Then thankfully this came out and told me my experience is not what should be expected. So I'm actually really happy to get this update so I can reset my expectations correctly that they are working on it and know about it.
Offline
Jan wrote:
I predict a performance melt down as Reid's error pages get a mention on slashdot.
Can we have a mirror? Unable to even get an error page at the moment :(
lol
Offline
I just got the 500 error page to load, but the home page won't load at all.
That really, really sucks. -- EDIT: I just got the 503 page waiting for the home page LOL. But that one's not even remotely funny, just sad :-(
jordanbrock wrote:
Can we have a mirror? Unable to even get an error page at the moment :(
lol
Offline
Spintech page loads for me but slowly. Is that on a Shared Accel?
EDIT: I see it is on a Premier box.
Last edited by jdjustice (2007-11-30 07:23:01)
Offline
jdjustice wrote:
EDIT: I just got the 503 page waiting for the home page LOL. But that one's not even remotely funny, just sad
Error 503, New & Improved: Less Sad, More Funny! But no accordion. No time for a proper illustration.
Offline
Now your site visitors have something to look at when they visit your homepage! Wait... that is your homepage. :/
reid wrote:
jdjustice wrote:
EDIT: I just got the 503 page waiting for the home page LOL. But that one's not even remotely funny, just sad
Error 503, New & Improved: Less Sad, More Funny! But no accordion. No time for a proper illustration.
Offline
Is everything all right now? All my sites seem very responsive at this moment.
Offline
Reid's is also pretty decent!
Offline