Closed Bug 923155 Opened 11 years ago Closed 10 years ago

some buildrequests get claimed but never started

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Unassigned)

References

Details

Attachments

(6 files)

The scheduler ran without issues:
18042 2013-10-02 03:02:01-0700 [-] mozilla-central nightly: Creating buildset with sourcestamp ['e3c84e9f24906f37e19dbd3ab7b8a7ef1d952dba', "in 'mozilla-central'"]
18043 2013-10-02 03:02:01-0700 [-] mozilla-central nightly: propfunc returned {'buildid': '20131002030201'}
18044 2013-10-02 03:02:01-0700 [-] mozilla-central nightly: propfunc returned {'builduid': 'f80f9a446ed54f1ba094ff856df088bb'}


But we didn't get 64-bit linux for desktop:
https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013-10-02-03-02-01-mozilla-central/

And we only got x86/armv7 for mobile:
https://ftp.mozilla.org/pub/mozilla.org/mobile/nightly/2013-10-02-00-40-02-mozilla-aurora-android/
https://ftp.mozilla.org/pub/mozilla.org/mobile/nightly/2013-10-02-03-02-01-mozilla-central-android/
11:33 <+philor> and asan debug nightly
11:33 <+philor> not that I've ever been sure why we build such a thing
11:33 <@bhearsum|buildduty> which one?
11:33 <@bhearsum|buildduty> i see a 64-bit linux debug asan
11:33 <+philor> should be a Nd on that row
11:34 <+philor> and a Dxr on linux64 opt
Summary: no nightly builds for armv6 and 64-bit linux on october 2nd → missing many nightlies on mozilla-central on october 2nd
Some info from the db. Columns are buildrequests.complete, buildrequests.results, claimed_by_name, claimed_at, buildrequests.id, and buildername:
|        0 |    NULL | buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master | 1380728643 | 30262800 | Android 2.2 Armv6 mozilla-central nightly            |
|        0 |    NULL | buildbot-master62.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master | 1380728655 | 30262806 | Linux x86-64 mozilla-central debug asan nightly      |
|        0 |    NULL | buildbot-master58.srv.releng.usw2.mozilla.com:/builds/buildbot/build1/master | 1380728640 | 30262803 | Linux x86-64 mozilla-central nightly                 |


So, we got buildrequests that have been claimed but not completed. They don't show up on /running, which makes me think that something went wrong between claiming the build and assigning it to a slave.
I really can't figure out what's going on here. I can't find anything bm58's internal state that makes me think it actually claimed the build. I'd really like these nightlies to run though, so I altered the DB state for the three missing ones so they'll get picked up again:
+----------+-------------------------+------------+----------+---------+------------------------------------------------------------------------------+------------+----------+-------------------------------------------------+
| id       | claimed_by_incarnation  | buildsetid | complete | results | claimed_by_name                                                              | claimed_at | id       | buildername                                     |
+----------+-------------------------+------------+----------+---------+------------------------------------------------------------------------------+------------+----------+-------------------------------------------------+
| 30262800 | pid25244-boot1380675028 |    7497825 |        0 |    NULL | buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master | 1380731643 | 30262800 | Android 2.2 Armv6 mozilla-central nightly       |
| 30262806 | pid18517-boot1380688177 |    7497825 |        0 |    NULL | buildbot-master62.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master | 1380731657 | 30262806 | Linux x86-64 mozilla-central debug asan nightly |
| 30262803 | pid16739-boot1380647634 |    7497825 |        0 |    NULL | buildbot-master58.srv.releng.usw2.mozilla.com:/builds/buildbot/build1/master | 1380731641 | 30262803 | Linux x86-64 mozilla-central nightly            |
+----------+-------------------------+------------+----------+---------+------------------------------------------------------------------------------+------------+----------+-------------------------------------------------+
3 rows in set (0.01 sec)

mysql> update buildrequests set claimed_by_name=NULL, claimed_at=0, claimed_by_incarnation=NULL where id in (30262800, 30262806, 30262803);
Query OK, 3 rows affected (0.01 sec)

All three of them have been picked up and are running now.

I'm calling this fixed, but I wish we could get to the bottom of things like this =(
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Two of the newly claimed builds disappeared too.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
catlee and i talked tried debugging this a ton and couldn't figure it out. we're going on the theory that this is a freak occurrence and i'm doing the following:
1) resetting the claims on the non-started builds to get them going (done)
2) gracefully restarting the build and try masters in case they're in some weird state after taking the buildbot patch from https://bugzilla.mozilla.org/show_bug.cgi?id=893859
3) writing a script that send a report of claimed-but-not-started buildrequests and what to do to fix them
We continue to hit these, though not as often.
Summary: missing many nightlies on mozilla-central on october 2nd → some buildrequests get claimed but never started
The autocleanup script got modified Thu 8:35am PT, partially commenting out the part that actually resets the claimed-but-not-started buildrequests. It might have also started returning more fields on the initial select to detect the buildrequests, but doesn't yet pick off the buildid to pass to the update.

I disabled the cron job to prevent the spam.
(In reply to Nick Thomas [:nthomas] from comment #9)
> The autocleanup script got modified Thu 8:35am PT, partially commenting out
> the part that actually resets the claimed-but-not-started buildrequests. It
> might have also started returning more fields on the initial select to
> detect the buildrequests, but doesn't yet pick off the buildid to pass to
> the update.
> 
> I disabled the cron job to prevent the spam.

This was a botched change by me - I reverted it and re-enabled the cronjob.
Catlee and I did some more digging into this today and we _think_ there's a race condition in normal (that is, non-graceful) shutdown. We correlated two instances of this happening with slaves in shutdown. Specifically:


From buildbot-master51:
71952 2013-10-18 07:42:08-0700 [Broker,38873,10.134.57.229] BuildSlave.detached(tst-linux64-ec2-016)
71953 2013-10-18 07:42:09-0700 [-] starting build <Build Ubuntu VM 12.04 x64 mozilla-aurora debug test mochitest-3> using slave <SlaveBuilder builder='Ubuntu VM 12.04 x64 mozilla-aurora debug test mochitest-3'>
71954 2013-10-18 07:42:09-0700 [-] Unhandled Error

And from tst-linux64-ec2-016:
5926 2013-10-18 07:42:08-0700 [Broker,client] lost remote
5927 2013-10-18 07:42:08-0700 [Broker,client] Lost connection to buildbot-master51.srv.releng.use1.mozilla.com:9201
5928 2013-10-18 07:42:08-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1d91b00>
5929 2013-10-18 07:42:08-0700 [-] Main loop terminated.
5930 2013-10-18 07:42:08-0700 [-] Server Shut Down.


From buildbot-master52:
18063 2013-10-18 09:01:41-0700 [Broker,38810,10.134.56.83] BuildSlave.detached(tst-linux32-ec2-096)
18064 2013-10-18 09:01:43-0700 [Broker,38812,10.134.57.241] BuildSlave.detached(tst-linux32-ec2-029)
18065 2013-10-18 09:01:44-0700 [-] starting build <Build Ubuntu VM 12.04 mozilla-aurora pgo test reftest-no-accel> using slave <SlaveBuilder builder='Ubuntu VM 12.04 mozilla-aurora pgo test reftest-no-accel'>
18066 2013-10-18 09:01:44-0700 [-] Unhandled Error

And from tst-linux32-ec2-029:
435 2013-10-18 09:01:41-0700 [Broker,client] lost remote
436 2013-10-18 09:01:41-0700 [Broker,client] Lost connection to buildbot-master52.srv.releng.use1.mozilla.com:9201
437 2013-10-18 09:01:41-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x8adbb2c>
438 2013-10-18 09:01:41-0700 [-] Main loop terminated.
439 2013-10-18 09:01:41-0700 [-] Server Shut Down.


The times on the second instance don't correlate _quite_ as well as the first, but that could be due to clock differences between the master and the slave. It's certainly clear that in both cases there are slaves shutting down directly before the master tries to start the job. So the master may be trying to start the job with that slave - which looks okay when it calls startBuild() - but then the slave shuts down before startBuild() can finish initiating the job.
Assuming the assessment in comment #11 is correct, I think this patch should fix us up...by making sure we don't throw in prepare() and return False instead, the "if not ready" block from _prepared should run, and reset the claim: https://hg.mozilla.org/build/buildbot/file/b22c537e9f18/master/buildbot/process/builder.py#l837.
Attachment #819077 - Flags: review?(catlee)
Comment on attachment 819077 [details] [diff] [review]
check your self.slave before you wreck yourself

Review of attachment 819077 [details] [diff] [review]:
-----------------------------------------------------------------

what could go wrong?
Attachment #819077 - Flags: review?(catlee) → review+
Planning to roll this out next week, starting early Monday morning.
Comment on attachment 819077 [details] [diff] [review]
check your self.slave before you wreck yourself

Checked in - needs Buildbot master restarts to take effect. I'm going to start those today, but they'll take awhile to complete. Blow-by-blow will be in https://etherpad.mozilla.org/buildbot-master-restarts.
Attachment #819077 - Flags: checked-in+
End of day update. The majority of the graceful restarts are done, the rest should finish before tomorrow morning (EST). As of writing, the status is as follows:
In progress:
bm59
bm62
bm68
bm73
bm79
Todo:
bm60
bm63
bm64
bm65
bm66
bm74
bm80
bm82
bm83
Done:
bm10
bm19
bm20
bm22
bm29
bm42
bm43
bm44
bm45
bm51
bm52
bm53
bm54
bm55
bm56
bm57
bm58
bm61
bm67
bm69
bm70
bm71
bm72
bm75
bm76
bm77
bm78



I've re-enabled all of the masters in slavealloc again - even the ones mid-shutdown, so that they'll pick up jobs when they finish. I'll verify the deployment tomorrow morning.
Upgrades are done.

Unfortunately, we've already had 3 three buildrequests that need fixing this morning =(.
I enabled a second cronjob on bm81 so we can get more info from failed buildrequests again. Now we wait.
We had two requests pop up this morning, both on bm61:
mysql> select id, from_unixtime(claimed_at), from_unixtime(submitted_at), buildername from buildrequests where id in (31215827, 31215826);
+----------+---------------------------+-----------------------------+-------------------------------------------+
| id       | from_unixtime(claimed_at) | from_unixtime(submitted_at) | buildername                               |
+----------+---------------------------+-----------------------------+-------------------------------------------+
| 31215826 | 2013-10-24 02:52:43       | 2013-10-24 02:36:24         | Firefox mozilla-aurora linux l10n nightly |
| 31215827 | 2013-10-24 02:52:44       | 2013-10-24 02:36:25         | Firefox mozilla-aurora linux l10n nightly |
+----------+---------------------------+-----------------------------+-------------------------------------------+

Looking at the twistd.log around submitted_at time I found nothing of note at the precise second, but these interesting messages a 10s later:
  969 2013-10-24 02:36:33-0700 [-] Failed to claim everything (got 0, expected 1)
  970 2013-10-24 02:36:33-0700 [-] Name: buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid4969-boot1382388326
  971 2013-10-24 02:36:33-0700 [-] Old: 1382603781.88
  972 2013-10-24 02:36:33-0700 [-] ((31215789L, 7690205L, u'Firefox mozilla-aurora linux l10n nightly', 0L, 1382607382L, u'buildbot-master64.srv.releng.usw2.mozilla.com:/      builds/buildbot/build1/master', u'pid6561-boot1382421132', 0L, None, 1382607357L, None),)
  973 2013-10-24 02:36:33-0700 [-] Nothing to unclaim
  974 2013-10-24 02:36:33-0700 [-] Failed to claim everything (got 0, expected 1)
  975 2013-10-24 02:36:33-0700 [-] Name: buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid4969-boot1382388326
  976 2013-10-24 02:36:33-0700 [-] Old: 1382603781.88
  977 2013-10-24 02:36:33-0700 [-] ((31215790L, 7690206L, u'Firefox mozilla-aurora linux l10n nightly', 0L, 1382607382L, u'buildbot-master64.srv.releng.usw2.mozilla.com:/      builds/buildbot/build1/master', u'pid6561-boot1382421132', 0L, None, 1382607358L, None),)
  978 2013-10-24 02:36:33-0700 [-] Nothing to unclaim
  979 2013-10-24 02:36:33-0700 [-] Failed to claim everything (got 0, expected 1)
  980 2013-10-24 02:36:33-0700 [-] Name: buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid4969-boot1382388326
  981 2013-10-24 02:36:33-0700 [-] Old: 1382603781.88
  982 2013-10-24 02:36:33-0700 [-] ((31215792L, 7690208L, u'Firefox mozilla-aurora linux l10n nightly', 0L, 1382607382L, u'buildbot-master64.srv.releng.usw2.mozilla.com:/      builds/buildbot/build1/master', u'pid6561-boot1382421132', 0L, None, 1382607359L, None),)
  983 2013-10-24 02:36:33-0700 [-] Nothing to unclaim
  984 2013-10-24 02:36:33-0700 [HTTPPageGetter,client] Submitted hy-AM locale
  985 2013-10-24 02:36:33-0700 [-] Failed to claim everything (got 0, expected 1)
  986 2013-10-24 02:36:33-0700 [-] Name: buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid4969-boot1382388326
  987 2013-10-24 02:36:33-0700 [-] Old: 1382603781.88
  988 2013-10-24 02:36:33-0700 [-] ((31215791L, 7690207L, u'Firefox mozilla-aurora linux l10n nightly', 0L, 1382607377L, u'buildbot-master63.srv.releng.use1.mozilla.com:/      builds/buildbot/build1/master', u'pid22290-boot1382410856', 0L, None, 1382607359L, None),)
  989 2013-10-24 02:36:33-0700 [-] Nothing to unclaim
  990 2013-10-24 02:36:33-0700 [-] Failed to claim everything (got 0, expected 1)
  991 2013-10-24 02:36:33-0700 [-] Name: buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid4969-boot1382388326
  992 2013-10-24 02:36:33-0700 [-] Old: 1382603781.88
  993 2013-10-24 02:36:33-0700 [-] ((31215793L, 7690209L, u'Firefox mozilla-aurora linux l10n nightly', 0L, 1382607377L, u'buildbot-master63.srv.releng.use1.mozilla.com:/      builds/buildbot/build1/master', u'pid22290-boot1382410856', 0L, None, 1382607360L, None),)

These could be a result of claim resetting, or they may be relevant.

What's _really_ interesting is that we don't have any BuildSlave.detached or lost remote messages. I'm not entirely sure what that means.
We're no closer to fixing this, but here's data on the buildrequests that failed to start so far this month.
I'm nearly certain this is related to AWS link issues now. We've had a huge spike of them in the past week, all in usw2, and this coincidences with a huge increase in scl3 <-> usw2 link issues. Catlee, you had some theory on which part of the code might be at fault IIRC.
Flags: needinfo?(catlee)
We had a HUGE rash of these yesterday - at least 50. Some examples:
id  buildsetid  buildername priority    claimed_at  claimed_by_name claimed_by_incarnation  complete    results submitted_at    complete_at id  number  brid    start_time  finish_time
33462693    8105723 Ubuntu VM 12.04 try opt test mochitest-4    0   1386639106  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637618  NULL    NULL    NULL    NULL    NULL    NULL
33462695    8105725 Ubuntu VM 12.04 try opt test mochitest-4    0   1386639106  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637618  NULL    NULL    NULL    NULL    NULL    NULL
33462686    8105722 Ubuntu VM 12.04 try opt test mochitest-4    0   1386639106  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637617  NULL    NULL    NULL    NULL    NULL    NULL
33462685    8105721 Ubuntu VM 12.04 try opt test mochitest-4    0   1386639106  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637616  NULL    NULL    NULL    NULL    NULL    NULL
33462682    8105717 Ubuntu VM 12.04 try opt test mochitest-4    0   1386639106  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637615  NULL    NULL    NULL    NULL    NULL    NULL
33462683    8105718 Ubuntu VM 12.04 try opt test mochitest-4    0   1386639106  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637615  NULL    NULL    NULL    NULL    NULL    NULL
33462680    8105715 Ubuntu VM 12.04 try opt test mochitest-4    0   1386639106  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637612  NULL    NULL    NULL    NULL    NULL    NULL
33462676    8105712 Ubuntu VM 12.04 try opt test mochitest-4    0   1386639106  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637611  NULL    NULL    NULL    NULL    NULL    NULL
33462677    8105713 Ubuntu VM 12.04 try opt test mochitest-4    0   1386639106  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637611  NULL    NULL    NULL    NULL    NULL    NULL
33462675    8105711 Ubuntu VM 12.04 try opt test mochitest-4    0   1386639106  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637610  NULL    NULL    NULL    NULL    NULL    NULL
33462673    8105709 Ubuntu VM 12.04 try opt test mochitest-4    0   1386639106  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637609  NULL    NULL    NULL    NULL    NULL    NULL
33462668    8105708 Ubuntu VM 12.04 try opt test mochitest-4    0   1386639106  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637608  NULL    NULL    NULL    NULL    NULL    NULL
33462661    8105706 Ubuntu VM 12.04 try opt test mochitest-4    0   1386639106  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637607  NULL    NULL    NULL    NULL    NULL    NULL
33462655    8105703 Ubuntu VM 12.04 try opt test mochitest-4    0   1386639106  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637606  NULL    NULL    NULL    NULL    NULL    NULL

id  buildsetid  buildername priority    claimed_at  claimed_by_name claimed_by_incarnation  complete    results submitted_at    complete_at id  number  brid    start_time  finish_time
33463134    8105887 Ubuntu VM 12.04 x64 try opt test mochitest-4    0   1386638567  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637666  NULL    NULL    NULL    NULL    NULL    NULL
33463135    8105886 Ubuntu VM 12.04 x64 try opt test mochitest-4    0   1386638567  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637666  NULL    NULL    NULL    NULL    NULL    NULL
33463131    8105883 Ubuntu VM 12.04 x64 try opt test mochitest-4    0   1386638567  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637664  NULL    NULL    NULL    NULL    NULL    NULL
33463124    8105874 Ubuntu VM 12.04 x64 try opt test mochitest-4    0   1386638567  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637660  NULL    NULL    NULL    NULL    NULL    NULL
33463107    8105871 Ubuntu VM 12.04 x64 try opt test mochitest-4    0   1386638567  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637659  NULL    NULL    NULL    NULL    NULL    NULL
33462985    8105824 Ubuntu VM 12.04 x64 try opt test mochitest-4    0   1386638567  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637645  NULL    NULL    NULL    NULL    NULL    NULL
33462893    8105807 Ubuntu VM 12.04 x64 try opt test mochitest-4    0   1386638567  buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master  pid21710-boot1383692372 0   NULL    1386637640  NULL    NULL    NULL    NULL    NULL    NULL

33466765    8107325 WINNT 6.2 fx-team pgo talos other   0   1386642148  buildbot-master69.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master    pid12119-boot1382990157 0   NULL    1386642031  NULL    NULL    NULL    NULL    NULL    NULL
33465142    8106807 Ubuntu VM 12.04 x64 try opt test mochitest-5    0   1386642026  buildbot-master68.srv.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux/master  pid4987-boot1383067971  0   NULL    1386639092  NULL    NULL    NULL    NULL    NULL    NULL
33465141    8106806 Ubuntu VM 12.04 x64 try opt test mochitest-5    0   1386642026  buildbot-master68.srv.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux/master  pid4987-boot1383067971  0   NULL    1386639091  NULL    NULL    NULL    NULL    NULL    NULL
33465123    8106802 Ubuntu VM 12.04 x64 try opt test mochitest-5    0   1386642026  buildbot-master68.srv.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux/master  pid4987-boot1383067971  0   NULL    1386639090  NULL    NULL    NULL    NULL    NULL    NULL
33465105    8106795 Ubuntu VM 12.04 x64 try opt test mochitest-5    0   1386642026  buildbot-master68.srv.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux/master  pid4987-boot1383067971  0   NULL    1386639088  NULL    NULL    NULL    NULL    NULL    NULL
33465085    8106791 Ubuntu VM 12.04 x64 try opt test mochitest-5    0   1386642026  buildbot-master68.srv.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux/master  pid4987-boot1383067971  0   NULL    1386639087  NULL    NULL    NULL    NULL    NULL    NULL
33465038    8106787 Ubuntu VM 12.04 x64 try opt test mochitest-5    0   1386642026  buildbot-master68.srv.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux/master  pid4987-boot1383067971  0   NULL    1386639085  NULL    NULL    NULL    NULL    NULL    NULL


It's a little curious that many (but not all) came in chunks on the same master+builder.
My recollection is that the code that marks buildrequests as complete can fail and not retry. I remember some instances where the build master knew the build was done, statusdb had been updated, but schedulerdb thought the build was still running.

http://hg.mozilla.org/build/buildbot/file/533b0ce5c5d8/master/buildbot/db/connector.py#l930
Flags: needinfo?(catlee)
The three most recent of these that we saw were on bm62's "Firefox mozilla-aurora linux64 l10n nightly" builder. The buildrequest IDs of 33985548, 33985550, and 33985552 got stuck and needed fixing. I dug through twistd.log and found an odd correlation: all of the buildrequestids before the stuck ones had a "Failed to claim" messages near them (I think the nextAWSSlave message are irrelevant here, but I included them anyways just in case...):
2013-12-19 03:27:04-0800 [-] nextAWSSlave: 0 retries for Firefox mozilla-aurora linux64 l10n nightly
2013-12-19 03:27:04-0800 [-] nextAWSSlave: Choosing ondemand since there aren't any spot available
2013-12-19 03:27:32-0800 [-] nextAWSSlave: 0 retries for Firefox mozilla-aurora linux64 l10n nightly
2013-12-19 03:27:32-0800 [-] nextAWSSlave: Choosing ondemand since there aren't any spot available
2013-12-19 03:27:32-0800 [-] Failed to claim everything (got 0, expected 1)
2013-12-19 03:27:32-0800 [-] Name: buildbot-master62.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid18369-boot1383074470
2013-12-19 03:27:32-0800 [-] Old: 1387448768.68
2013-12-19 03:27:32-0800 [-] ((33985547L, 8206217L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452171L, u'buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid13849-boot1383062515', 0L, None, 1387451608L, None),)
2013-12-19 03:27:32-0800 [-] Nothing to unclaim
2013-12-19 03:28:00-0800 [-] nextAWSSlave: 1 retries for Firefox mozilla-aurora linux64 l10n nightly
2013-12-19 03:28:00-0800 [-] nextAWSSlave: Choosing ondemand because of retries
2013-12-19 03:28:00-0800 [-] Failed to claim everything (got 0, expected 1)
2013-12-19 03:28:00-0800 [-] Name: buildbot-master62.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid18369-boot1383074470
2013-12-19 03:28:00-0800 [-] Old: 1387448768.68
2013-12-19 03:28:01-0800 [-] ((33985549L, 8206219L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452439L, u'buildbot-master58.srv.releng.usw2.mozilla.com:/builds/buildbot/build1/master', u'pid6068-boot1382992879', 0L, None, 1387451609L, None),)
2013-12-19 03:28:01-0800 [-] Nothing to unclaim
2013-12-19 03:28:30-0800 [-] nextAWSSlave: 0 retries for Firefox mozilla-aurora linux64 l10n nightly
2013-12-19 03:28:30-0800 [-] nextAWSSlave: Choosing ondemand since there aren't any spot available
2013-12-19 03:29:02-0800 [-] nextAWSSlave: 1 retries for Firefox mozilla-aurora linux64 l10n nightly
2013-12-19 03:29:02-0800 [-] nextAWSSlave: Choosing ondemand because of retries
2013-12-19 03:29:03-0800 [-] Failed to claim everything (got 0, expected 1)
2013-12-19 03:29:03-0800 [-] Name: buildbot-master62.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid18369-boot1383074470
2013-12-19 03:29:03-0800 [-] Old: 1387448768.68
2013-12-19 03:29:03-0800 [-] ((33985551L, 8206221L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452533L, u'buildbot-master85.srv.releng.scl3.mozilla.com:/builds/buildbot/build1/master', u'pid1481-boot1383690541', 0L, None, 1387451611L, None),)
2013-12-19 03:29:03-0800 [-] Nothing to unclaim
2013-12-19 03:29:34-0800 [-] nextAWSSlave: 0 retries for Firefox mozilla-aurora linux64 l10n nightly
2013-12-19 03:29:34-0800 [-] nextAWSSlave: Choosing ondemand since there aren't any spot available

The fact that we don't have any log messages about the stuck brids implies that the claim process was completely successful - because there's nothing printed in _txn_claim_buildrequests for that case.

Still investigating more...
More interesting stuff from logs. I found the first brid from before the hung ones (33985547) failed to be claimed on multiple different masters as multiple different times:

bm57:
10040 2013-12-19 03:27:06-0800 [-] Failed to claim everything (got 0, expected 1)
10041 2013-12-19 03:27:06-0800 [-] Name: buildbot-master57.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid1978-boot1382976516
10042 2013-12-19 03:27:06-0800 [-] Old: 1387448655.51
10043 2013-12-19 03:27:06-0800 [-] ((33985547L, 8206217L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452171L, u'buildbot-master61.srv.releng.use1.mozilla.com      :/builds/buildbot/build1/master', u'pid13849-boot1383062515', 0L, None, 1387451608L, None),)
10044 2013-12-19 03:27:06-0800 [-] Nothing to unclaim

bm63:
17947 2013-12-19 03:27:14-0800 [-] Failed to claim everything (got 0, expected 1)
17948 2013-12-19 03:27:14-0800 [-] Name: buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid14378-boot1383085958
17949 2013-12-19 03:27:14-0800 [-] Old: 1387448741.56
17950 2013-12-19 03:27:14-0800 [-] ((33985547L, 8206217L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452171L, u'buildbot-master61.srv.releng.use1.mozilla.com      :/builds/buildbot/build1/master', u'pid13849-boot1383062515', 0L, None, 1387451608L, None),)
17951 2013-12-19 03:27:14-0800 [-] Nothing to unclaim

bm62:
2013-12-19 03:27:32-0800 [-] Failed to claim everything (got 0, expected 1)
2013-12-19 03:27:32-0800 [-] Name: buildbot-master62.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid18369-boot1383074470
2013-12-19 03:27:32-0800 [-] Old: 1387448768.68
2013-12-19 03:27:32-0800 [-] ((33985547L, 8206217L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452171L, u'buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid13849-boot1383062515', 0L, None, 1387451608L, None),)
2013-12-19 03:27:32-0800 [-] Nothing to unclaim

And then it was failed to be reclaimed on bm61 (which implies that it was claimed earlier):
bm61:
14308 2013-12-19 03:32:47-0800 [-] Failed to claim everything (got 7, expected 11)
14309 2013-12-19 03:32:47-0800 [-] Name: buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid13849-boot1383062515
14310 2013-12-19 03:32:47-0800 [-] Old: 1387449167.39
14311 2013-12-19 03:32:47-0800 [-] ((33985532L, 8206201L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452733L, u'buildbot-master85.srv.releng.scl3.mozilla.com      :/builds/buildbot/build1/master', u'pid1481-boot1383690541', 0L, None, 1387451596L, None), (33985535L, 8206204L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L,       1387452767L, u'buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid13849-boot1383062515', 0L, None, 1387451599L, None), (33985539L,       8206209L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452730L, u'buildbot-master84.srv.releng.scl3.mozilla.com:/builds/buildbot/build1/master', u'pid14      77-boot1383690531', 0L, None, 1387451601L, None), (33985540L, 8206210L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452767L, u'buildbot-master61.srv.rel      eng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid13849-boot1383062515', 0L, None, 1387451602L, None), (33985541L, 8206211L, u'Firefox mozilla-aurora linux      64 l10n nightly', 0L, 1387452767L, u'buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid13849-boot1383062515', 0L, None, 1387451603      L, None), (33985542L, 8206212L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452767L, u'buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/bu      ild1/master', u'pid13849-boot1383062515', 0L, None, 1387451604L, None), (33985543L, 8206213L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452766L, u'bui      ldbot-master57.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid1978-boot1382976516', 0L, None, 1387451604L, None), (33985544L, 8206214L, u'Firefox       mozilla-aurora linux64 l10n nightly', 0L, 1387452170L, u'buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid14378-boot1383085958',       0L, None, 1387451605L, None), (33985545L, 8206215L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452767L, u'buildbot-master61.srv.releng.use1.mozilla.co      m:/builds/buildbot/build1/master', u'pid13849-boot1383062515', 0L, None, 1387451606L, None), (33985546L, 8206216L, u'Firefox mozilla-aurora linux64 l10n nightly', 0      L, 1387452767L, u'buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid13849-boot1383062515', 0L, None, 1387451607L, None), (33985547      L, 8206217L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452767L, u'buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid      13849-boot1383062515', 0L, None, 1387451608L, None))
14312 2013-12-19 03:32:47-0800 [-] Failed to re-claim buildrequests for builder Firefox mozilla-aurora linux64 l10n nightly


I also found that the first hung brid (33985548) was failed to be claimed by other masters:

bm57:
/builds/buildbot/build1/master/twistd.log.2-10082-2013-12-19 03:27:38-0800 [-] Failed to claim everything (got 0, expected 1)
/builds/buildbot/build1/master/twistd.log.2-10083-2013-12-19 03:27:38-0800 [-] Name: buildbot-master57.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid1978-boot1382976516
/builds/buildbot/build1/master/twistd.log.2-10084-2013-12-19 03:27:38-0800 [-] Old: 1387448655.51
/builds/buildbot/build1/master/twistd.log.2:10085:2013-12-19 03:27:38-0800 [-] ((33985548L, 8206218L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452369L, u'buildbot-master62.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid18369-boot1383074470', 0L, None, 1387451608L, None),)
/builds/buildbot/build1/master/twistd.log.2-10086-2013-12-19 03:27:38-0800 [-] Nothing to unclaim

bm63:
builds/buildbot/build1/master/twistd.log.2-18196-2013-12-19 03:27:50-0800 [-] Failed to claim everything (got 0, expected 1)
/builds/buildbot/build1/master/twistd.log.2-18197-2013-12-19 03:27:50-0800 [-] Name: buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid14378-boot1383085958
/builds/buildbot/build1/master/twistd.log.2-18198-2013-12-19 03:27:50-0800 [-] Old: 1387448741.56
/builds/buildbot/build1/master/twistd.log.2:18199:2013-12-19 03:27:50-0800 [-] ((33985548L, 8206218L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452369L, u'buildbot-master62.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid18369-boot1383074470', 0L, None, 1387451608L, None),)
/builds/buildbot/build1/master/twistd.log.2-18200-2013-12-19 03:27:50-0800 [-] Nothing to unclaim

bm65:
/builds/buildbot/build1/master/twistd.log.2-27024-2013-12-19 03:30:44-0800 [-] Failed to claim everything (got 0, expected 1)
/builds/buildbot/build1/master/twistd.log.2-27025-2013-12-19 03:30:44-0800 [-] Name: buildbot-master65.srv.releng.usw2.mozilla.com:/builds/buildbot/build1/master pid12543-boot1383100401
/builds/buildbot/build1/master/twistd.log.2-27026-2013-12-19 03:30:44-0800 [-] Old: 1387449024.99
/builds/buildbot/build1/master/twistd.log.2:27027:2013-12-19 03:30:44-0800 [-] ((33985548L, 8206218L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452625L, u'buildbot-master58.srv.releng.usw2.mozilla.com:/builds/buildbot/build1/master', u'pid6068-boot1382992879', 0L, None, 1387451608L, None),)
/builds/buildbot/build1/master/twistd.log.2-27028-2013-12-19 03:30:44-0800 [-] Nothing to unclaim

bm62:
2013-12-19 03:31:05-0800 [-] Error choosing next slave for builder 'Firefox mozilla-aurora linux64 l10n nightly', choosing randomly instead
2013-12-19 03:31:05-0800 [-] Unhandled Error
    Traceback (most recent call last):
      File "/builds/buildbot/build1/lib/python2.7/site-packages/twisted/python/context.py", line 37, in callWithContext
        return func(*args,**kw)
      File "/builds/buildbot/build1/lib/python2.7/site-packages/twisted/enterprise/adbapi.py", line 429, in _runInteraction
        result = interaction(trans, *args, **kw)
      File "/builds/buildbot/build1/lib/python2.7/site-packages/buildbot-0.8.2_hg_12d5a63423f9_production_0.8-py2.7.egg/buildbot/process/builder.py", line 517, in _claim_buildreqs
        sb = self._choose_slave(available_slaves)
      File "/builds/buildbot/build1/lib/python2.7/site-packages/buildbot-0.8.2_hg_12d5a63423f9_production_0.8-py2.7.egg/buildbot/process/builder.py", line 547, in _choose_slave
        return self.nextSlave(self, available_slaves)
    --- <exception caught here> ---
      File "/builds/buildbot/build1/lib/python2.7/site-packages/buildbotcustom/misc.py", line 267, in _nextSlave
        if build.slavename == slavename:
      File "/builds/buildbot/build1/lib/python2.7/site-packages/buildbotcustom/misc.py", line 383, in _nextSlave
        retval = d1.copy()
      File "/builds/buildbot/build1/lib/python2.7/site-packages/buildbotcustom/misc.py", line 327, in _classifyAWSSlaves
        log.err()
    exceptions.AttributeError: 'NoneType' object has no attribute 'slavename'
2013-12-19 03:31:05-0800 [-] Failed to claim everything (got 0, expected 1)
2013-12-19 03:31:05-0800 [-] Name: buildbot-master62.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid18369-boot1383074470
2013-12-19 03:31:05-0800 [-] Old: 1387449038.92
2013-12-19 03:31:05-0800 [-] ((33985548L, 8206218L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452625L, u'buildbot-master58.srv.releng.usw2.mozilla.com:/builds/buildbot/build1/master', u'pid6068-boot1382992879', 0L, None, 1387451608L, None),)
2013-12-19 03:31:05-0800 [-] Nothing to unclaim

bm61:
/builds/buildbot/build1/master/twistd.log.2-13042-2013-12-19 03:31:07-0800 [-] Failed to claim everything (got 0, expected 1)
/builds/buildbot/build1/master/twistd.log.2-13043-2013-12-19 03:31:07-0800 [-] Name: buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid13849-boot1383062515
/builds/buildbot/build1/master/twistd.log.2-13044-2013-12-19 03:31:07-0800 [-] Old: 1387449040.63
/builds/buildbot/build1/master/twistd.log.2:13045:2013-12-19 03:31:07-0800 [-] ((33985548L, 8206218L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452625L, u'buildbot-master58.srv.releng.usw2.mozilla.com:/builds/buildbot/build1/master', u'pid6068-boot1382992879', 0L, None, 1387451608L, None),)
/builds/buildbot/build1/master/twistd.log.2-13046-2013-12-19 03:31:07-0800 [-] Nothing to unclaim


I'm not really sure why there's so many failed claim attempts. The traceback right before bm62 fails to claim that brid is also weird - but there isn't a similar one for the other failed brids.

We've been seeing this lots on l10n builders, which are a little special because of how many non-mergeable buildrequests we get in such a short period of time. They're also done by a pool of 12 masters. It makes me think that one or both of # of buildrequests pending or # of masters attempting to make a claim has an influence on this.
I'm hoping bug 952159 will help make some sense of all the failed claims. In the meantime, I'm going to work on the assumption that the claims aren't the issue, and try to figure out how we can claim something but not start a build for it...
With Dustin's help, I stepped through the deferred that Builder.run returns. That method gets called by BotMaster's self.loop, which is a Loop instance. Eventually, the deferred that Builder.run returns ends up here: https://mxr.mozilla.org/build-central/source/buildbot/master/buildbot/util/loop.py#161 -- which adds an errback that calls log.err.

BGiven that, it seems very unlikely that we're losing an exception anywhere in that code. AFAICT, _claim_buildreqs and _start_builds both complete without raising. That mean that there's a logic error somewhere in that code, or it could be a hint that the problem is elsewhere, perhaps some network issue related race/error.
I realized that even though we're not getting a Traceback, we never see the https://mxr.mozilla.org/build-central/source/buildbot/master/buildbot/process/base.py#188 log message...so something in the Builder.run deferred chain is breaking down. I'm hoping this patch will help us determine where exactly that's happening.
Attachment #8350181 - Flags: review?(catlee)
Attachment #8350181 - Flags: review?(catlee) → review+
Comment on attachment 8350181 [details] [diff] [review]
add more debugging messages

I'm doing a graceful restart on bm62 to get this patch running.
Attachment #8350181 - Flags: checked-in+
Attached file 33985548
Here's the db log for brid 33985548. There's lots of attempted claims before and after the fixer upper script runs. The claimed_at times are local to buildbot masters, so I suspect the main reason there's so many claims at various times is clock skew.

The most notable thing here is that we have _two_ completion events. I tracked these down to http://buildbot-master58.srv.releng.usw2.mozilla.com:8001/builders/Firefox%20mozilla-aurora%20linux64%20l10n%20nightly/builds/2200 and http://buildbot-master62.srv.releng.use1.mozilla.com:8001/builders/Firefox%20mozilla-aurora%20linux64%20l10n%20nightly/builds/3380 (which are https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013-12-19-00-40-03-mozilla-aurora-l10n/mozilla-aurora-linux64-l10n-nightly-eu-bm58-build1-build2200.txt.gz and https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013-12-19-00-40-03-mozilla-aurora-l10n/mozilla-aurora-linux64-l10n-nightly-eu-bm62-build1-build3380.txt.gz). This doesn't seem to happen for all claims that get reset, but enough where we're wasting a lot of resources. I looked at 20 buildrequests and 19 of them had two builds. Some had two builds on the _same_ master.

What's fascinating about this is that both builds start _after_ the claim was reset. So I guess what happens is something like:
* master A claims the buildrequest but doesn't start the build (reason TBD)
* buildrequest fixer upper comes along and resets the claim
* master A and B both somehow decide they can claim the newly unclaimed buildrequest, both run builds to completion

The fact that some of these buildrequests have two builds on the same master makes me think that something in https://mxr.mozilla.org/build-central/source/buildbot/master/buildbot/process/builder.py#567 is taking a VERY long time to complete -- otherwise I'm not sure how we could get two builds for the same buildrequest. This may hint towards some bug in slave connect/disconnect, dunno.

This says to me that we need to fix this more urgently - every time we hit one of these we run a job more than once, which wastes a lot of precious machine time. Also, if this happens for a release build it can really screw things up.
HM! Something I just realized is that build requests are claimed in a loop, but no Builds are created until after that entire loop finishes: https://mxr.mozilla.org/build-central/source/buildbot/master/buildbot/process/builder.py#516

So, if we claim request A in the first iteration of the loop, but we still have another 5 (or 10 or 50) requests to claim, it could take some time between the claim happening and the Build existing. This might explain why we see this so much with l10n -- where we have 80+ unmergable buildrequests coming at the exact same time.

nextSlave and nextBuild both run at the start of this loop, so if they take too long to run, we'll end up resetting the claim on the buildrequest the next time fix-nonstarted-buildrequests.sh runs. (Too long is currently 90 seconds, but that's fudged a little because of clock skew between bm81 (where the script runs) and other masters.
I bumped the buildrequest restarter to wait 300s before resetting claims - I'm hoping that gets rid of all of the cases of resetting claims on buildrequests that are simply taking awhile (as opposed to completely dead).
One last thing for today: It could be the case that all the "failures" we're seeing now are just cases of long delays between claiming and starting builds (which is why I bumped the delay time). If that's the case, it's likely that the original patch here fixed the original problem. So far, I haven't been able to find any buildrequests that were reset by the script after that patch landed that didn't result in 2 builds -- which supports that theory. We should have a better idea in the new year.
(In reply to Ben Hearsum [:bhearsum] from comment #32)
> I bumped the buildrequest restarter to wait 300s before resetting claims -
> I'm hoping that gets rid of all of the cases of resetting claims on
> buildrequests that are simply taking awhile (as opposed to completely dead).

This wasn't enough -- we got a bunch of duplicate l10n builds again this morning. There were 14 buildrequests whose claims were reset (11 at 03:40 PT, 1 at 03:45 PT, 3 at 03:50 PT). AFAICT, all of theme were on mozilla-aurora linux64 repacks.

Curiously, I can only find 11 builds with more than one log -- which means that some of the "hung" buildrequests were truly hung, and some weren't. I'm not sure how to figure out which is which, but in e-mail we have:
id  buildsetid  buildername priority    claimed_at  claimed_by_name claimed_by_incarnation  complete    results submitted_at    complete_at id  number  brid    start_time  finish_time
34261127    8276317 Firefox mozilla-aurora linux64 l10n nightly 0   1388662364  buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master    pid14378-boot1383085958 0   NULL    1388662328  NULL    NULL    NULL    NULL    NULL    NULL
34261126    8276316 Firefox mozilla-aurora linux64 l10n nightly 0   1388662364  buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master    pid14378-boot1383085958 0   NULL    1388662327  NULL    NULL    NULL    NULL    NULL    NULL
34261125    8276315 Firefox mozilla-aurora linux64 l10n nightly 0   1388662364  buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master    pid14378-boot1383085958 0   NULL    1388662326  NULL    NULL    NULL    NULL    NULL    NULL
34261124    8276314 Firefox mozilla-aurora linux64 l10n nightly 0   1388662364  buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master    pid14378-boot1383085958 0   NULL    1388662325  NULL    NULL    NULL    NULL    NULL    NULL
34261122    8276312 Firefox mozilla-aurora linux64 l10n nightly 0   1388662364  buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master    pid14378-boot1383085958 0   NULL    1388662324  NULL    NULL    NULL    NULL    NULL    NULL
34261123    8276313 Firefox mozilla-aurora linux64 l10n nightly 0   1388662364  buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master    pid14378-boot1383085958 0   NULL    1388662324  NULL    NULL    NULL    NULL    NULL    NULL
34261121    8276311 Firefox mozilla-aurora linux64 l10n nightly 0   1388662364  buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master    pid14378-boot1383085958 0   NULL    1388662323  NULL    NULL    NULL    NULL    NULL    NULL
34261119    8276309 Firefox mozilla-aurora linux64 l10n nightly 0   1388662364  buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master    pid14378-boot1383085958 0   NULL    1388662321  NULL    NULL    NULL    NULL    NULL    NULL
34261117    8276307 Firefox mozilla-aurora linux64 l10n nightly 0   1388662364  buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master    pid14378-boot1383085958 0   NULL    1388662320  NULL    NULL    NULL    NULL    NULL    NULL
34261118    8276308 Firefox mozilla-aurora linux64 l10n nightly 0   1388662371  buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master    pid13849-boot1383062515 0   NULL    1388662320  NULL    NULL    NULL    NULL    NULL    NULL
34261116    8276306 Firefox mozilla-aurora linux64 l10n nightly 0   1388662364  buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master    pid14378-boot1383085958 0   NULL    1388662319  NULL    NULL    NULL    NULL    NULL    NULL
34261133    8276323 Firefox mozilla-aurora linux64 l10n nightly 0   1388662399  buildbot-master57.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master    pid1978-boot1382976516  0   NULL    1388662332  NULL    NULL    NULL    NULL    NULL    NULL
34261137    8276327 Firefox mozilla-aurora linux64 l10n nightly 0   1388662991  buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master    pid13849-boot1383062515 0   NULL    1388662336  NULL    NULL    NULL    NULL    NULL    NULL
34261138    8276328 Firefox mozilla-aurora linux64 l10n nightly 0   1388662991  buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master    pid13849-boot1383062515 0   NULL    1388662336  NULL    NULL    NULL    NULL    NULL    NULL


So, once again we're clustered onto masters in a single colo.
I made a small tweak to the cronjobs. The one that e-mails us will alert on any claimed buildrequest that has gone 300s without a build, and the one that actually resets the claim will do so after 600s. I'm hoping that with this we'll be able to hit a case of this happening and disable the reset cronjob so that we can debug it live.
The blurt script just threw this out:
id      buildsetid      buildername     priority        claimed_at      claimed_by_name claimed_by_incarnation  complete        results submitted_at    complete_at     id      number  brid    start_time      finish_time
34318105        8288058 b2g_ubuntu32_vm b2g-inbound opt test gaia-integration   0       1388774128      buildbot-master67.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master      pid24295-boot1386962733 0
+NULL   1388773931      NULL    NULL    NULL    NULL    NULL    NULL

I disabled the fixing script but by the time I started looking at the database, a build had started (http://buildbot-master67.srv.releng.use1.mozilla.com:8201/builders/b2g_ubuntu32_vm%20b2g-inbound%20opt%20test%20gaia-integration/builds/28)

In twistd.log, I found this:
40342 2014-01-03 10:32:40-0800 [-] starting build <Build b2g_ubuntu32_vm b2g-inbound opt test gaia-integration>.. pinging the slave <SlaveBuilder builder='b2g_ubuntu32_vm       b2g-inbound opt test gaia-integration' slave='tst-linux32-spot-117'>
40343 2014-01-03 10:32:40-0800 [-] sending ping

And then nothing about that builder until....:
43230 2014-01-03 10:48:10-0800 [Broker,37045,10.134.57.10] ping finished: failure
43231 2014-01-03 10:48:10-0800 [Broker,37045,10.134.57.10] I tried to tell the slave that the build <Build b2g_ubuntu32_vm b2g-inbound opt test gaia-integration> started,       but remote_startBuild failed: slave ping failed
43232 2014-01-03 10:48:10-0800 [Broker,37045,10.134.57.10] re-queueing the BuildRequest
43233 2014-01-03 10:48:10-0800 [Broker,37045,10.134.57.10] BuildSlave.detached(tst-linux32-spot-117)


And then the requeue goes on to succeeding. So in this particular case, it takes about 15 minutes to fail to ping a slave, during which time the requeuing script that I wrote has already reset the claim -- causing us to get two builds in many cases.

I've adjusted the blurt script to wait 20min and the resetting script to wait 22min now -- this should get rid of any false positives from this case. I also filed bug 956378 to reduce the ping failure time.
We haven't seen any more of these since I adjusted the time on the cronjob - so that's a good sign. It's possible that attachment 819077 [details] [diff] [review] really did fix the problem here, and that we've been retriggering jobs due to slow slave pings ever since. If we don't see any instances of this this week I'm going to close this bug.
We still have jobs that take awhile to start due to bug 956738, but I don't think we any that fail to start forever since the buildbot patch here landed. I'm pretty confident that the ones we've seen since that have all been false positives as described in comment #36.

I've disabled the cronjobs on bm81 now, but I'll be attaching them here for posterity.
Status: REOPENED → RESOLVED
Closed: 11 years ago10 years ago
Resolution: --- → FIXED
Attachment #8359204 - Attachment mime type: application/x-shellscript → text/plain
ben, I know this was a long time ago, but any recollection as to why its not still running? or if it is running where it is?

(I was looking for this today while helping to debug some DB issues)
Flags: needinfo?(bhearsum)
(In reply to Justin Wood (:Callek) from comment #41)
> ben, I know this was a long time ago, but any recollection as to why its not
> still running? or if it is running where it is?
> 
> (I was looking for this today while helping to debug some DB issues)

See comment #38
Flags: needinfo?(bhearsum)
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: