…and it seems to work. Sorry for the 30-minute downtime this caused.
Category: Downtimes
For reasons still unknown, the server that hosts the Jabber services crashed four times in less than two hours. From one second to the next, the ejabberd processes grabbed every resource they could get, and more: 8 GB of RAM and 8 GB of swap, all gone, plus a heavy CPU load. The machine was so overloaded that “top” refreshed only every 5 minutes, and in the end only a hardware reset got the machine to reboot.
For the tech geeks:
top - 19:56:21 up 31 min, 1 user, load average: 22.86, 13.11, 8.71
Tasks: 240 total,  3 running, 231 sleeping,  0 stopped,  6 zombie
Cpu(s): 1.4%us, 5.8%sy, 0.0%ni, 12.4%id, 80.3%wa, 0.0%hi, 0.1%si, 0.0%st
Mem:  8190900k total, 8138972k used,   51928k free,     796k buffers
Swap: 8393848k total, 7276916k used, 1116932k free,   42404k cached

 PID USER     PR NI  VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
3239 ejabberd 20  0 15.7g 6.0g 460 S   23 76.7 3:06.65 beam.smp
We are looking into this issue. It may be a severe bug in ejabberd, or it may be a DoS attack. We don’t know yet.
The server was offline this morning because of a kernel and MySQL upgrade. It would have gone faster if the server had rebooted cleanly after “shutdown -r now”, which it didn’t. So we had to send someone on site to reset the machine manually.
We also upgraded Spectrum to support JID escaping. If this holds up in our tests (there seem to be some problems with clients that don’t support the unofficial % character, which is used in place of @), I will write more about it here.
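For the tech geeks, here is a rough sketch of the difference between the unofficial % style and the official JID escaping scheme (XEP-0106), which writes @ as \40. The contact name and the function names are made up for illustration, this is not Spectrum's actual code, and the sketch ignores the special rules for leading and trailing spaces:

```python
# Escape sequences defined by XEP-0106 (JID Escaping).
XEP0106_ESCAPES = {
    " ": "\\20", '"': "\\22", "&": "\\26", "'": "\\27", "/": "\\2f",
    ":": "\\3a", "<": "\\3c", ">": "\\3e", "@": "\\40", "\\": "\\5c",
}
# The two hex digits of each sequence, e.g. "20", "40", "5c".
ESCAPE_CODES = {seq[1:] for seq in XEP0106_ESCAPES.values()}


def percent_escape(legacy_name: str) -> str:
    """Unofficial style: simply replace @ with %."""
    return legacy_name.replace("@", "%")


def xep0106_escape(legacy_name: str) -> str:
    """Simplified XEP-0106 escaping of the node part of a JID."""
    out = []
    for i, ch in enumerate(legacy_name):
        if ch == "\\":
            # A backslash is only escaped when it would otherwise
            # look like one of the escape sequences (e.g. "\20").
            if legacy_name[i + 1:i + 3] in ESCAPE_CODES:
                out.append("\\5c")
            else:
                out.append(ch)
        elif ch in XEP0106_ESCAPES:
            out.append(XEP0106_ESCAPES[ch])
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    contact = "user@example.com"  # hypothetical legacy-network contact
    print(percent_escape(contact))
    print(xep0106_escape(contact))
```

Clients that know the official scheme can turn \40 back into @ for display, which is exactly what does not happen with clients that only know the % style.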
Unfortunately, there was a major problem with the database for all accounts of the jabber.hot-chilli.net domain (not accounts from other domains, like jabber.hot-chilli.eu).
In the end we decided to restore a backup from the 4th/5th of May 2010 (the day of the server move) and had to take the Jabber server down for about 2 hours.
Only the contact lists and contact groups are affected. This means that, if you are affected, you have to redo every buddy you added or deleted since then.
We really apologize for the trouble caused, especially because the backup is one week old.
The question remains why our current database backup from this morning contained only 20 rows of data, missing the other 150000 (!) rows. We will take a close look at the backup process.
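One simple safeguard would be to compare the row count of a fresh backup with the live table before trusting it. The following is only a sketch of that idea, not our actual backup process; the function name and the 90% threshold are assumptions, and how the two counts are obtained is left out:

```python
def backup_looks_complete(live_rows: int, backup_rows: int,
                          min_ratio: float = 0.9) -> bool:
    """Return True if the backup holds at least min_ratio of the live rows.

    min_ratio is an arbitrary, hypothetical threshold.
    """
    if live_rows == 0:
        # Nothing in the live table, so nothing can be missing.
        return True
    return backup_rows / live_rows >= min_ratio


if __name__ == "__main__":
    # Numbers from this incident: 20 rows saved, 150000 rows missing.
    live = 150000 + 20
    print(backup_looks_complete(live, 20))
```

A check like this would have flagged this morning's backup right away instead of at restore time.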
To avoid login errors or warning messages, and also to avoid certain problems with server-to-server connections, we installed proper certificates for the secondary domains today.
We had to restart the Jabber services several times, sorry. The ejabberd config file isn’t commented very well, and Google didn’t turn up the correct settings at first either. But everything works fine now.
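PS: For old SSL connections to port 5223 there are no dynamic certificates. This just isn’t possible because of how SSL works: the SSL handshake happens before the client tells the server which domain it wants, so the server cannot pick a certificate per domain.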
Upgrade to ejabberd 2.1.3
At 2pm CEST today I upgraded the server to ejabberd 2.1.3, which is a bugfix release. The Debian package is finally available. Sorry for the short service downtime of less than one minute. ;-)
Our provider experienced network problems starting yesterday (05/06/2010) at 2pm CEST. The outages affected a lot of ISPs. Here in Germany, T-Online and Alice worked, while many others such as KabelBW and Strato did not. These severe problems went away at about 6pm, but we still experienced some problems until this morning. According to our server provider the problems are gone now. The problems were caused by an external attack of more than 50 Gbit.
The Jabber server just recovered from a 6-hour downtime at 9am CEST.
Sorry guys, there was a mistake in the config file, introduced while adding a new Jabber domain to it.
We deeply apologize for the trouble caused.