Closed
Bug 923155
Opened 11 years ago
Closed 10 years ago
some buildrequests get claimed but never started
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Unassigned)
References
Details
Attachments
(6 files)
471 bytes,
patch
|
catlee
:
review+
bhearsum
:
checked-in+
|
Details | Diff | Splinter Review |
8.76 KB,
text/plain
|
Details | |
1.68 KB,
patch
|
catlee
:
review+
bhearsum
:
checked-in+
|
Details | Diff | Splinter Review |
4.63 KB,
text/plain
|
Details | |
432 bytes,
text/plain
|
Details | |
669 bytes,
text/plain
|
Details |
The scheduler ran without issues: 18042 2013-10-02 03:02:01-0700 [-] mozilla-central nightly: Creating buildset with sourcestamp ['e3c84e9f24906f37e19dbd3ab7b8a7ef1d952dba', "in 'mozilla-central'"] 18043 2013-10-02 03:02:01-0700 [-] mozilla-central nightly: propfunc returned {'buildid': '20131002030201'} 18044 2013-10-02 03:02:01-0700 [-] mozilla-central nightly: propfunc returned {'builduid': 'f80f9a446ed54f1ba094ff856df088bb'} But we didn't get 64-bit linux for desktop: https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013-10-02-03-02-01-mozilla-central/ And we only got x86/armv7 for mobile: https://ftp.mozilla.org/pub/mozilla.org/mobile/nightly/2013-10-02-00-40-02-mozilla-aurora-android/ https://ftp.mozilla.org/pub/mozilla.org/mobile/nightly/2013-10-02-03-02-01-mozilla-central-android/
Reporter | ||
Comment 1•11 years ago
|
||
11:33 <+philor> and asan debug nightly 11:33 <+philor> not that I've ever been sure why we build such a thing 11:33 <@bhearsum|buildduty> which one? 11:33 <@bhearsum|buildduty> i see a 64-bit linux debug asan 11:33 <+philor> should be a Nd on that row
Reporter | ||
Comment 2•11 years ago
|
||
11:34 <+philor> and a Dxr on linux64 opt
Summary: no nightly builds for armv6 and 64-bit linux on october 2nd → missing many nightlies on mozilla-central on october 2nd
Reporter | ||
Comment 3•11 years ago
|
||
Some info from the db. Columns are buildrequests.complete, buildrequests.results, claimed_by_name, claimed_at, buildrequests.id, and buildername: | 0 | NULL | buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master | 1380728643 | 30262800 | Android 2.2 Armv6 mozilla-central nightly | | 0 | NULL | buildbot-master62.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master | 1380728655 | 30262806 | Linux x86-64 mozilla-central debug asan nightly | | 0 | NULL | buildbot-master58.srv.releng.usw2.mozilla.com:/builds/buildbot/build1/master | 1380728640 | 30262803 | Linux x86-64 mozilla-central nightly | So, we got buildrequests that have been claimed but not completed. They don't show up on /running, which makes me think that something went wrong between claiming the build and assigning it to a slave.
Reporter | ||
Comment 4•11 years ago
|
||
I really can't figure out what's going on here. I can't find anything bm58's internal state that makes me think it actually claimed the build. I'd really like these nightlies to run though, so I altered the DB state for the three missing ones so they'll get picked up again: +----------+-------------------------+------------+----------+---------+------------------------------------------------------------------------------+------------+----------+-------------------------------------------------+ | id | claimed_by_incarnation | buildsetid | complete | results | claimed_by_name | claimed_at | id | buildername | +----------+-------------------------+------------+----------+---------+------------------------------------------------------------------------------+------------+----------+-------------------------------------------------+ | 30262800 | pid25244-boot1380675028 | 7497825 | 0 | NULL | buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master | 1380731643 | 30262800 | Android 2.2 Armv6 mozilla-central nightly | | 30262806 | pid18517-boot1380688177 | 7497825 | 0 | NULL | buildbot-master62.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master | 1380731657 | 30262806 | Linux x86-64 mozilla-central debug asan nightly | | 30262803 | pid16739-boot1380647634 | 7497825 | 0 | NULL | buildbot-master58.srv.releng.usw2.mozilla.com:/builds/buildbot/build1/master | 1380731641 | 30262803 | Linux x86-64 mozilla-central nightly | +----------+-------------------------+------------+----------+---------+------------------------------------------------------------------------------+------------+----------+-------------------------------------------------+ 3 rows in set (0.01 sec) mysql> update buildrequests set claimed_by_name=NULL, claimed_at=0, claimed_by_incarnation=NULL where id in (30262800, 30262806, 30262803); Query OK, 3 rows affected (0.01 sec) All three of them have been picked up and are running now. I'm calling this fixed, but I wish we could get to the bottom of things like this =(
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 5•11 years ago
|
||
Two of the newly claimed builds disappeared too.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter | ||
Comment 6•11 years ago
|
||
catlee and i talked tried debugging this a ton and couldn't figure it out. we're going on the theory that this is a freak occurrence and i'm doing the following: 1) resetting the claims on the non-started builds to get them going (done) 2) gracefully restarting the build and try masters in case they're in some weird state after taking the buildbot patch from https://bugzilla.mozilla.org/show_bug.cgi?id=893859 3) writing a script that send a report of claimed-but-not-started buildrequests and what to do to fix them
Reporter | ||
Comment 7•11 years ago
|
||
We continue to hit these, though not as often.
Reporter | ||
Updated•11 years ago
|
Summary: missing many nightlies on mozilla-central on october 2nd → some buildrequests get claimed but never started
Comment 9•11 years ago
|
||
The autocleanup script got modified Thu 8:35am PT, partially commenting out the part that actually resets the claimed-but-not-started buildrequests. It might have also started returning more fields on the initial select to detect the buildrequests, but doesn't yet pick off the buildid to pass to the update. I disabled the cron job to prevent the spam.
Reporter | ||
Comment 10•11 years ago
|
||
(In reply to Nick Thomas [:nthomas] from comment #9) > The autocleanup script got modified Thu 8:35am PT, partially commenting out > the part that actually resets the claimed-but-not-started buildrequests. It > might have also started returning more fields on the initial select to > detect the buildrequests, but doesn't yet pick off the buildid to pass to > the update. > > I disabled the cron job to prevent the spam. This was a botched change by me - I reverted it and re-enabled the cronjob.
Reporter | ||
Comment 11•11 years ago
|
||
Catlee and I did some more digging into this today and we _think_ there's a race condition in normal (that is, non-graceful) shutdown. We correlated two instances of this happening with slaves in shutdown. Specifically: From buildbot-master51: 71952 2013-10-18 07:42:08-0700 [Broker,38873,10.134.57.229] BuildSlave.detached(tst-linux64-ec2-016) 71953 2013-10-18 07:42:09-0700 [-] starting build <Build Ubuntu VM 12.04 x64 mozilla-aurora debug test mochitest-3> using slave <SlaveBuilder builder='Ubuntu VM 12.04 x64 mozilla-aurora debug test mochitest-3'> 71954 2013-10-18 07:42:09-0700 [-] Unhandled Error And from tst-linux64-ec2-016: 5926 2013-10-18 07:42:08-0700 [Broker,client] lost remote 5927 2013-10-18 07:42:08-0700 [Broker,client] Lost connection to buildbot-master51.srv.releng.use1.mozilla.com:9201 5928 2013-10-18 07:42:08-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1d91b00> 5929 2013-10-18 07:42:08-0700 [-] Main loop terminated. 5930 2013-10-18 07:42:08-0700 [-] Server Shut Down. From buildbot-master52: 18063 2013-10-18 09:01:41-0700 [Broker,38810,10.134.56.83] BuildSlave.detached(tst-linux32-ec2-096) 18064 2013-10-18 09:01:43-0700 [Broker,38812,10.134.57.241] BuildSlave.detached(tst-linux32-ec2-029) 18065 2013-10-18 09:01:44-0700 [-] starting build <Build Ubuntu VM 12.04 mozilla-aurora pgo test reftest-no-accel> using slave <SlaveBuilder builder='Ubuntu VM 12.04 mozilla-aurora pgo test reftest-no-accel'> 18066 2013-10-18 09:01:44-0700 [-] Unhandled Error And from tst-linux32-ec2-029: 435 2013-10-18 09:01:41-0700 [Broker,client] lost remote 436 2013-10-18 09:01:41-0700 [Broker,client] Lost connection to buildbot-master52.srv.releng.use1.mozilla.com:9201 437 2013-10-18 09:01:41-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x8adbb2c> 438 2013-10-18 09:01:41-0700 [-] Main loop terminated. 439 2013-10-18 09:01:41-0700 [-] Server Shut Down. The times on the second instance don't correlate _quite_ as well as the first, but that could be due to clock differences between the master and the slave. It's certainly clear that in both cases there are slaves shutting down directly before the master tries to start the job. So the master may be trying to start the job with that slave - which looks okay when it calls startBuild() - but then the slave shuts down before startBuild() can finish initiating the job.
Reporter | ||
Comment 12•11 years ago
|
||
Assuming the assessment in comment #11 is correct, I think this patch should fix us up...by making sure we don't throw in prepare() and return False instead, the "if not ready" block from _prepared should run, and reset the claim: https://hg.mozilla.org/build/buildbot/file/b22c537e9f18/master/buildbot/process/builder.py#l837.
Attachment #819077 -
Flags: review?(catlee)
Comment 13•11 years ago
|
||
Comment on attachment 819077 [details] [diff] [review] check your self.slave before you wreck yourself Review of attachment 819077 [details] [diff] [review]: ----------------------------------------------------------------- what could go wrong?
Attachment #819077 -
Flags: review?(catlee) → review+
Reporter | ||
Comment 14•11 years ago
|
||
Planning to roll this out next week, starting early Monday morning.
Reporter | ||
Comment 15•11 years ago
|
||
Comment on attachment 819077 [details] [diff] [review] check your self.slave before you wreck yourself Checked in - needs Buildbot master restarts to take effect. I'm going to start those today, but they'll take awhile to complete. Blow-by-blow will be in https://etherpad.mozilla.org/buildbot-master-restarts.
Attachment #819077 -
Flags: checked-in+
Reporter | ||
Comment 16•11 years ago
|
||
End of day update. The majority of the graceful restarts are done, the rest should finish before tomorrow morning (EST). As of writing, the status is as follows: In progress: bm59 bm62 bm68 bm73 bm79 Todo: bm60 bm63 bm64 bm65 bm66 bm74 bm80 bm82 bm83 Done: bm10 bm19 bm20 bm22 bm29 bm42 bm43 bm44 bm45 bm51 bm52 bm53 bm54 bm55 bm56 bm57 bm58 bm61 bm67 bm69 bm70 bm71 bm72 bm75 bm76 bm77 bm78 I've re-enabled all of the masters in slavealloc again - even the ones mid-shutdown, so that they'll pick up jobs when they finish. I'll verify the deployment tomorrow morning.
Reporter | ||
Comment 17•11 years ago
|
||
Upgrades are done. Unfortunately, we've already had 3 three buildrequests that need fixing this morning =(.
Reporter | ||
Comment 18•11 years ago
|
||
I enabled a second cronjob on bm81 so we can get more info from failed buildrequests again. Now we wait.
Reporter | ||
Comment 19•11 years ago
|
||
We had two requests pop up this morning, both on bm61: mysql> select id, from_unixtime(claimed_at), from_unixtime(submitted_at), buildername from buildrequests where id in (31215827, 31215826); +----------+---------------------------+-----------------------------+-------------------------------------------+ | id | from_unixtime(claimed_at) | from_unixtime(submitted_at) | buildername | +----------+---------------------------+-----------------------------+-------------------------------------------+ | 31215826 | 2013-10-24 02:52:43 | 2013-10-24 02:36:24 | Firefox mozilla-aurora linux l10n nightly | | 31215827 | 2013-10-24 02:52:44 | 2013-10-24 02:36:25 | Firefox mozilla-aurora linux l10n nightly | +----------+---------------------------+-----------------------------+-------------------------------------------+ Looking at the twistd.log around submitted_at time I found nothing of note at the precise second, but these interesting messages a 10s later: 969 2013-10-24 02:36:33-0700 [-] Failed to claim everything (got 0, expected 1) 970 2013-10-24 02:36:33-0700 [-] Name: buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid4969-boot1382388326 971 2013-10-24 02:36:33-0700 [-] Old: 1382603781.88 972 2013-10-24 02:36:33-0700 [-] ((31215789L, 7690205L, u'Firefox mozilla-aurora linux l10n nightly', 0L, 1382607382L, u'buildbot-master64.srv.releng.usw2.mozilla.com:/ builds/buildbot/build1/master', u'pid6561-boot1382421132', 0L, None, 1382607357L, None),) 973 2013-10-24 02:36:33-0700 [-] Nothing to unclaim 974 2013-10-24 02:36:33-0700 [-] Failed to claim everything (got 0, expected 1) 975 2013-10-24 02:36:33-0700 [-] Name: buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid4969-boot1382388326 976 2013-10-24 02:36:33-0700 [-] Old: 1382603781.88 977 2013-10-24 02:36:33-0700 [-] ((31215790L, 7690206L, u'Firefox mozilla-aurora linux l10n nightly', 0L, 1382607382L, u'buildbot-master64.srv.releng.usw2.mozilla.com:/ builds/buildbot/build1/master', u'pid6561-boot1382421132', 0L, None, 1382607358L, None),) 978 2013-10-24 02:36:33-0700 [-] Nothing to unclaim 979 2013-10-24 02:36:33-0700 [-] Failed to claim everything (got 0, expected 1) 980 2013-10-24 02:36:33-0700 [-] Name: buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid4969-boot1382388326 981 2013-10-24 02:36:33-0700 [-] Old: 1382603781.88 982 2013-10-24 02:36:33-0700 [-] ((31215792L, 7690208L, u'Firefox mozilla-aurora linux l10n nightly', 0L, 1382607382L, u'buildbot-master64.srv.releng.usw2.mozilla.com:/ builds/buildbot/build1/master', u'pid6561-boot1382421132', 0L, None, 1382607359L, None),) 983 2013-10-24 02:36:33-0700 [-] Nothing to unclaim 984 2013-10-24 02:36:33-0700 [HTTPPageGetter,client] Submitted hy-AM locale 985 2013-10-24 02:36:33-0700 [-] Failed to claim everything (got 0, expected 1) 986 2013-10-24 02:36:33-0700 [-] Name: buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid4969-boot1382388326 987 2013-10-24 02:36:33-0700 [-] Old: 1382603781.88 988 2013-10-24 02:36:33-0700 [-] ((31215791L, 7690207L, u'Firefox mozilla-aurora linux l10n nightly', 0L, 1382607377L, u'buildbot-master63.srv.releng.use1.mozilla.com:/ builds/buildbot/build1/master', u'pid22290-boot1382410856', 0L, None, 1382607359L, None),) 989 2013-10-24 02:36:33-0700 [-] Nothing to unclaim 990 2013-10-24 02:36:33-0700 [-] Failed to claim everything (got 0, expected 1) 991 2013-10-24 02:36:33-0700 [-] Name: buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid4969-boot1382388326 992 2013-10-24 02:36:33-0700 [-] Old: 1382603781.88 993 2013-10-24 02:36:33-0700 [-] ((31215793L, 7690209L, u'Firefox mozilla-aurora linux l10n nightly', 0L, 1382607377L, u'buildbot-master63.srv.releng.use1.mozilla.com:/ builds/buildbot/build1/master', u'pid22290-boot1382410856', 0L, None, 1382607360L, None),) These could be a result of claim resetting, or they may be relevant. What's _really_ interesting is that we don't have any BuildSlave.detached or lost remote messages. I'm not entirely sure what that means.
Reporter | ||
Comment 20•11 years ago
|
||
We're no closer to fixing this, but here's data on the buildrequests that failed to start so far this month.
Reporter | ||
Comment 21•11 years ago
|
||
I'm nearly certain this is related to AWS link issues now. We've had a huge spike of them in the past week, all in usw2, and this coincidences with a huge increase in scl3 <-> usw2 link issues. Catlee, you had some theory on which part of the code might be at fault IIRC.
Flags: needinfo?(catlee)
Reporter | ||
Comment 22•11 years ago
|
||
We had a HUGE rash of these yesterday - at least 50. Some examples: id buildsetid buildername priority claimed_at claimed_by_name claimed_by_incarnation complete results submitted_at complete_at id number brid start_time finish_time 33462693 8105723 Ubuntu VM 12.04 try opt test mochitest-4 0 1386639106 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637618 NULL NULL NULL NULL NULL NULL 33462695 8105725 Ubuntu VM 12.04 try opt test mochitest-4 0 1386639106 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637618 NULL NULL NULL NULL NULL NULL 33462686 8105722 Ubuntu VM 12.04 try opt test mochitest-4 0 1386639106 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637617 NULL NULL NULL NULL NULL NULL 33462685 8105721 Ubuntu VM 12.04 try opt test mochitest-4 0 1386639106 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637616 NULL NULL NULL NULL NULL NULL 33462682 8105717 Ubuntu VM 12.04 try opt test mochitest-4 0 1386639106 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637615 NULL NULL NULL NULL NULL NULL 33462683 8105718 Ubuntu VM 12.04 try opt test mochitest-4 0 1386639106 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637615 NULL NULL NULL NULL NULL NULL 33462680 8105715 Ubuntu VM 12.04 try opt test mochitest-4 0 1386639106 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637612 NULL NULL NULL NULL NULL NULL 33462676 8105712 Ubuntu VM 12.04 try opt test mochitest-4 0 1386639106 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637611 NULL NULL NULL NULL NULL NULL 33462677 8105713 Ubuntu VM 12.04 try opt test mochitest-4 0 1386639106 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637611 NULL NULL NULL NULL NULL NULL 33462675 8105711 Ubuntu VM 12.04 try opt test mochitest-4 0 1386639106 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637610 NULL NULL NULL NULL NULL NULL 33462673 8105709 Ubuntu VM 12.04 try opt test mochitest-4 0 1386639106 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637609 NULL NULL NULL NULL NULL NULL 33462668 8105708 Ubuntu VM 12.04 try opt test mochitest-4 0 1386639106 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637608 NULL NULL NULL NULL NULL NULL 33462661 8105706 Ubuntu VM 12.04 try opt test mochitest-4 0 1386639106 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637607 NULL NULL NULL NULL NULL NULL 33462655 8105703 Ubuntu VM 12.04 try opt test mochitest-4 0 1386639106 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637606 NULL NULL NULL NULL NULL NULL id buildsetid buildername priority claimed_at claimed_by_name claimed_by_incarnation complete results submitted_at complete_at id number brid start_time finish_time 33463134 8105887 Ubuntu VM 12.04 x64 try opt test mochitest-4 0 1386638567 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637666 NULL NULL NULL NULL NULL NULL 33463135 8105886 Ubuntu VM 12.04 x64 try opt test mochitest-4 0 1386638567 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637666 NULL NULL NULL NULL NULL NULL 33463131 8105883 Ubuntu VM 12.04 x64 try opt test mochitest-4 0 1386638567 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637664 NULL NULL NULL NULL NULL NULL 33463124 8105874 Ubuntu VM 12.04 x64 try opt test mochitest-4 0 1386638567 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637660 NULL NULL NULL NULL NULL NULL 33463107 8105871 Ubuntu VM 12.04 x64 try opt test mochitest-4 0 1386638567 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637659 NULL NULL NULL NULL NULL NULL 33462985 8105824 Ubuntu VM 12.04 x64 try opt test mochitest-4 0 1386638567 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637645 NULL NULL NULL NULL NULL NULL 33462893 8105807 Ubuntu VM 12.04 x64 try opt test mochitest-4 0 1386638567 buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid21710-boot1383692372 0 NULL 1386637640 NULL NULL NULL NULL NULL NULL 33466765 8107325 WINNT 6.2 fx-team pgo talos other 0 1386642148 buildbot-master69.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master pid12119-boot1382990157 0 NULL 1386642031 NULL NULL NULL NULL NULL NULL 33465142 8106807 Ubuntu VM 12.04 x64 try opt test mochitest-5 0 1386642026 buildbot-master68.srv.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux/master pid4987-boot1383067971 0 NULL 1386639092 NULL NULL NULL NULL NULL NULL 33465141 8106806 Ubuntu VM 12.04 x64 try opt test mochitest-5 0 1386642026 buildbot-master68.srv.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux/master pid4987-boot1383067971 0 NULL 1386639091 NULL NULL NULL NULL NULL NULL 33465123 8106802 Ubuntu VM 12.04 x64 try opt test mochitest-5 0 1386642026 buildbot-master68.srv.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux/master pid4987-boot1383067971 0 NULL 1386639090 NULL NULL NULL NULL NULL NULL 33465105 8106795 Ubuntu VM 12.04 x64 try opt test mochitest-5 0 1386642026 buildbot-master68.srv.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux/master pid4987-boot1383067971 0 NULL 1386639088 NULL NULL NULL NULL NULL NULL 33465085 8106791 Ubuntu VM 12.04 x64 try opt test mochitest-5 0 1386642026 buildbot-master68.srv.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux/master pid4987-boot1383067971 0 NULL 1386639087 NULL NULL NULL NULL NULL NULL 33465038 8106787 Ubuntu VM 12.04 x64 try opt test mochitest-5 0 1386642026 buildbot-master68.srv.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux/master pid4987-boot1383067971 0 NULL 1386639085 NULL NULL NULL NULL NULL NULL It's a little curious that many (but not all) came in chunks on the same master+builder.
Comment 23•11 years ago
|
||
My recollection is that the code that marks buildrequests as complete can fail and not retry. I remember some instances where the build master knew the build was done, statusdb had been updated, but schedulerdb thought the build was still running. http://hg.mozilla.org/build/buildbot/file/533b0ce5c5d8/master/buildbot/db/connector.py#l930
Flags: needinfo?(catlee)
Reporter | ||
Comment 24•11 years ago
|
||
The three most recent of these that we saw were on bm62's "Firefox mozilla-aurora linux64 l10n nightly" builder. The buildrequest IDs of 33985548, 33985550, and 33985552 got stuck and needed fixing. I dug through twistd.log and found an odd correlation: all of the buildrequestids before the stuck ones had a "Failed to claim" messages near them (I think the nextAWSSlave message are irrelevant here, but I included them anyways just in case...): 2013-12-19 03:27:04-0800 [-] nextAWSSlave: 0 retries for Firefox mozilla-aurora linux64 l10n nightly 2013-12-19 03:27:04-0800 [-] nextAWSSlave: Choosing ondemand since there aren't any spot available 2013-12-19 03:27:32-0800 [-] nextAWSSlave: 0 retries for Firefox mozilla-aurora linux64 l10n nightly 2013-12-19 03:27:32-0800 [-] nextAWSSlave: Choosing ondemand since there aren't any spot available 2013-12-19 03:27:32-0800 [-] Failed to claim everything (got 0, expected 1) 2013-12-19 03:27:32-0800 [-] Name: buildbot-master62.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid18369-boot1383074470 2013-12-19 03:27:32-0800 [-] Old: 1387448768.68 2013-12-19 03:27:32-0800 [-] ((33985547L, 8206217L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452171L, u'buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid13849-boot1383062515', 0L, None, 1387451608L, None),) 2013-12-19 03:27:32-0800 [-] Nothing to unclaim 2013-12-19 03:28:00-0800 [-] nextAWSSlave: 1 retries for Firefox mozilla-aurora linux64 l10n nightly 2013-12-19 03:28:00-0800 [-] nextAWSSlave: Choosing ondemand because of retries 2013-12-19 03:28:00-0800 [-] Failed to claim everything (got 0, expected 1) 2013-12-19 03:28:00-0800 [-] Name: buildbot-master62.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid18369-boot1383074470 2013-12-19 03:28:00-0800 [-] Old: 1387448768.68 2013-12-19 03:28:01-0800 [-] ((33985549L, 8206219L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452439L, u'buildbot-master58.srv.releng.usw2.mozilla.com:/builds/buildbot/build1/master', u'pid6068-boot1382992879', 0L, None, 1387451609L, None),) 2013-12-19 03:28:01-0800 [-] Nothing to unclaim 2013-12-19 03:28:30-0800 [-] nextAWSSlave: 0 retries for Firefox mozilla-aurora linux64 l10n nightly 2013-12-19 03:28:30-0800 [-] nextAWSSlave: Choosing ondemand since there aren't any spot available 2013-12-19 03:29:02-0800 [-] nextAWSSlave: 1 retries for Firefox mozilla-aurora linux64 l10n nightly 2013-12-19 03:29:02-0800 [-] nextAWSSlave: Choosing ondemand because of retries 2013-12-19 03:29:03-0800 [-] Failed to claim everything (got 0, expected 1) 2013-12-19 03:29:03-0800 [-] Name: buildbot-master62.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid18369-boot1383074470 2013-12-19 03:29:03-0800 [-] Old: 1387448768.68 2013-12-19 03:29:03-0800 [-] ((33985551L, 8206221L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452533L, u'buildbot-master85.srv.releng.scl3.mozilla.com:/builds/buildbot/build1/master', u'pid1481-boot1383690541', 0L, None, 1387451611L, None),) 2013-12-19 03:29:03-0800 [-] Nothing to unclaim 2013-12-19 03:29:34-0800 [-] nextAWSSlave: 0 retries for Firefox mozilla-aurora linux64 l10n nightly 2013-12-19 03:29:34-0800 [-] nextAWSSlave: Choosing ondemand since there aren't any spot available The fact that we don't have any log messages about the stuck brids implies that the claim process was completely successful - because there's nothing printed in _txn_claim_buildrequests for that case. Still investigating more...
Reporter | ||
Comment 25•11 years ago
|
||
More interesting stuff from logs. I found the first brid from before the hung ones (33985547) failed to be claimed on multiple different masters as multiple different times: bm57: 10040 2013-12-19 03:27:06-0800 [-] Failed to claim everything (got 0, expected 1) 10041 2013-12-19 03:27:06-0800 [-] Name: buildbot-master57.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid1978-boot1382976516 10042 2013-12-19 03:27:06-0800 [-] Old: 1387448655.51 10043 2013-12-19 03:27:06-0800 [-] ((33985547L, 8206217L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452171L, u'buildbot-master61.srv.releng.use1.mozilla.com :/builds/buildbot/build1/master', u'pid13849-boot1383062515', 0L, None, 1387451608L, None),) 10044 2013-12-19 03:27:06-0800 [-] Nothing to unclaim bm63: 17947 2013-12-19 03:27:14-0800 [-] Failed to claim everything (got 0, expected 1) 17948 2013-12-19 03:27:14-0800 [-] Name: buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid14378-boot1383085958 17949 2013-12-19 03:27:14-0800 [-] Old: 1387448741.56 17950 2013-12-19 03:27:14-0800 [-] ((33985547L, 8206217L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452171L, u'buildbot-master61.srv.releng.use1.mozilla.com :/builds/buildbot/build1/master', u'pid13849-boot1383062515', 0L, None, 1387451608L, None),) 17951 2013-12-19 03:27:14-0800 [-] Nothing to unclaim bm62: 2013-12-19 03:27:32-0800 [-] Failed to claim everything (got 0, expected 1) 2013-12-19 03:27:32-0800 [-] Name: buildbot-master62.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid18369-boot1383074470 2013-12-19 03:27:32-0800 [-] Old: 1387448768.68 2013-12-19 03:27:32-0800 [-] ((33985547L, 8206217L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452171L, u'buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid13849-boot1383062515', 0L, None, 1387451608L, None),) 2013-12-19 03:27:32-0800 [-] Nothing to unclaim And then it was failed to be reclaimed on bm61 (which implies that it was claimed earlier): bm61: 14308 2013-12-19 03:32:47-0800 [-] Failed to claim everything (got 7, expected 11) 14309 2013-12-19 03:32:47-0800 [-] Name: buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid13849-boot1383062515 14310 2013-12-19 03:32:47-0800 [-] Old: 1387449167.39 14311 2013-12-19 03:32:47-0800 [-] ((33985532L, 8206201L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452733L, u'buildbot-master85.srv.releng.scl3.mozilla.com :/builds/buildbot/build1/master', u'pid1481-boot1383690541', 0L, None, 1387451596L, None), (33985535L, 8206204L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452767L, u'buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid13849-boot1383062515', 0L, None, 1387451599L, None), (33985539L, 8206209L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452730L, u'buildbot-master84.srv.releng.scl3.mozilla.com:/builds/buildbot/build1/master', u'pid14 77-boot1383690531', 0L, None, 1387451601L, None), (33985540L, 8206210L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452767L, u'buildbot-master61.srv.rel eng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid13849-boot1383062515', 0L, None, 1387451602L, None), (33985541L, 8206211L, u'Firefox mozilla-aurora linux 64 l10n nightly', 0L, 1387452767L, u'buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid13849-boot1383062515', 0L, None, 1387451603 L, None), (33985542L, 8206212L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452767L, u'buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/bu ild1/master', u'pid13849-boot1383062515', 0L, None, 1387451604L, None), (33985543L, 8206213L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452766L, u'bui ldbot-master57.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid1978-boot1382976516', 0L, None, 1387451604L, None), (33985544L, 8206214L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452170L, u'buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid14378-boot1383085958', 0L, None, 1387451605L, None), (33985545L, 8206215L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452767L, u'buildbot-master61.srv.releng.use1.mozilla.co m:/builds/buildbot/build1/master', u'pid13849-boot1383062515', 0L, None, 1387451606L, None), (33985546L, 8206216L, u'Firefox mozilla-aurora linux64 l10n nightly', 0 L, 1387452767L, u'buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid13849-boot1383062515', 0L, None, 1387451607L, None), (33985547 L, 8206217L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452767L, u'buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid 13849-boot1383062515', 0L, None, 1387451608L, None)) 14312 2013-12-19 03:32:47-0800 [-] Failed to re-claim buildrequests for builder Firefox mozilla-aurora linux64 l10n nightly I also found that the first hung brid (33985548) was failed to be claimed by other masters: bm57: /builds/buildbot/build1/master/twistd.log.2-10082-2013-12-19 03:27:38-0800 [-] Failed to claim everything (got 0, expected 1) /builds/buildbot/build1/master/twistd.log.2-10083-2013-12-19 03:27:38-0800 [-] Name: buildbot-master57.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid1978-boot1382976516 /builds/buildbot/build1/master/twistd.log.2-10084-2013-12-19 03:27:38-0800 [-] Old: 1387448655.51 /builds/buildbot/build1/master/twistd.log.2:10085:2013-12-19 03:27:38-0800 [-] ((33985548L, 8206218L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452369L, u'buildbot-master62.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid18369-boot1383074470', 0L, None, 1387451608L, None),) /builds/buildbot/build1/master/twistd.log.2-10086-2013-12-19 03:27:38-0800 [-] Nothing to unclaim bm63: builds/buildbot/build1/master/twistd.log.2-18196-2013-12-19 03:27:50-0800 [-] Failed to claim everything (got 0, expected 1) /builds/buildbot/build1/master/twistd.log.2-18197-2013-12-19 03:27:50-0800 [-] Name: buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid14378-boot1383085958 /builds/buildbot/build1/master/twistd.log.2-18198-2013-12-19 03:27:50-0800 [-] Old: 1387448741.56 /builds/buildbot/build1/master/twistd.log.2:18199:2013-12-19 03:27:50-0800 [-] ((33985548L, 8206218L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452369L, u'buildbot-master62.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master', u'pid18369-boot1383074470', 0L, None, 1387451608L, None),) /builds/buildbot/build1/master/twistd.log.2-18200-2013-12-19 03:27:50-0800 [-] Nothing to unclaim bm65: /builds/buildbot/build1/master/twistd.log.2-27024-2013-12-19 03:30:44-0800 [-] Failed to claim everything (got 0, expected 1) /builds/buildbot/build1/master/twistd.log.2-27025-2013-12-19 03:30:44-0800 [-] Name: buildbot-master65.srv.releng.usw2.mozilla.com:/builds/buildbot/build1/master pid12543-boot1383100401 /builds/buildbot/build1/master/twistd.log.2-27026-2013-12-19 03:30:44-0800 [-] Old: 1387449024.99 /builds/buildbot/build1/master/twistd.log.2:27027:2013-12-19 03:30:44-0800 [-] ((33985548L, 8206218L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452625L, u'buildbot-master58.srv.releng.usw2.mozilla.com:/builds/buildbot/build1/master', u'pid6068-boot1382992879', 0L, None, 1387451608L, None),) /builds/buildbot/build1/master/twistd.log.2-27028-2013-12-19 03:30:44-0800 [-] Nothing to unclaim bm62: 2013-12-19 03:31:05-0800 [-] Error choosing next slave for builder 'Firefox mozilla-aurora linux64 l10n nightly', choosing randomly instead 2013-12-19 03:31:05-0800 [-] Unhandled Error Traceback (most recent call last): File "/builds/buildbot/build1/lib/python2.7/site-packages/twisted/python/context.py", line 37, in callWithContext return func(*args,**kw) File "/builds/buildbot/build1/lib/python2.7/site-packages/twisted/enterprise/adbapi.py", line 429, in _runInteraction result = interaction(trans, *args, **kw) File "/builds/buildbot/build1/lib/python2.7/site-packages/buildbot-0.8.2_hg_12d5a63423f9_production_0.8-py2.7.egg/buildbot/process/builder.py", line 517, in _claim_buildreqs sb = self._choose_slave(available_slaves) File "/builds/buildbot/build1/lib/python2.7/site-packages/buildbot-0.8.2_hg_12d5a63423f9_production_0.8-py2.7.egg/buildbot/process/builder.py", line 547, in _choose_slave return self.nextSlave(self, available_slaves) --- <exception caught here> --- File "/builds/buildbot/build1/lib/python2.7/site-packages/buildbotcustom/misc.py", line 267, in _nextSlave if build.slavename == slavename: File "/builds/buildbot/build1/lib/python2.7/site-packages/buildbotcustom/misc.py", line 383, in _nextSlave retval = d1.copy() File "/builds/buildbot/build1/lib/python2.7/site-packages/buildbotcustom/misc.py", line 327, in _classifyAWSSlaves log.err() exceptions.AttributeError: 'NoneType' object has no attribute 'slavename' 2013-12-19 03:31:05-0800 [-] Failed to claim everything (got 0, expected 1) 2013-12-19 03:31:05-0800 [-] Name: buildbot-master62.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid18369-boot1383074470 2013-12-19 03:31:05-0800 [-] Old: 1387449038.92 2013-12-19 03:31:05-0800 [-] ((33985548L, 8206218L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452625L, u'buildbot-master58.srv.releng.usw2.mozilla.com:/builds/buildbot/build1/master', u'pid6068-boot1382992879', 0L, None, 1387451608L, None),) 2013-12-19 03:31:05-0800 [-] Nothing to unclaim bm61: /builds/buildbot/build1/master/twistd.log.2-13042-2013-12-19 03:31:07-0800 [-] Failed to claim everything (got 0, expected 1) /builds/buildbot/build1/master/twistd.log.2-13043-2013-12-19 03:31:07-0800 [-] Name: buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid13849-boot1383062515 /builds/buildbot/build1/master/twistd.log.2-13044-2013-12-19 03:31:07-0800 [-] Old: 1387449040.63 /builds/buildbot/build1/master/twistd.log.2:13045:2013-12-19 03:31:07-0800 [-] ((33985548L, 8206218L, u'Firefox mozilla-aurora linux64 l10n nightly', 0L, 1387452625L, u'buildbot-master58.srv.releng.usw2.mozilla.com:/builds/buildbot/build1/master', u'pid6068-boot1382992879', 0L, None, 1387451608L, None),) /builds/buildbot/build1/master/twistd.log.2-13046-2013-12-19 03:31:07-0800 [-] Nothing to unclaim I'm not really sure why there's so many failed claim attempts. The traceback right before bm62 fails to claim that brid is also weird - but there isn't a similar one for the other failed brids. We've been seeing this lots on l10n builders, which are a little special because of how many non-mergeable buildrequests we get in such a short period of time. They're also done by a pool of 12 masters. It makes me think that one or both of # of buildrequests pending or # of masters attempting to make a claim has an influence on this.
Reporter | ||
Comment 26•11 years ago
|
||
I'm hoping bug 952159 will help make some sense of all the failed claims. In the meantime, I'm going to work on the assumption that the claims aren't the issue, and try to figure out how we can claim something but not start a build for it...
Reporter | ||
Comment 27•11 years ago
|
||
With Dustin's help, I stepped through the deferred that Builder.run returns. That method gets called by BotMaster's self.loop, which is a Loop instance. Eventually, the deferred that Builder.run returns ends up here: https://mxr.mozilla.org/build-central/source/buildbot/master/buildbot/util/loop.py#161 -- which adds an errback that calls log.err. BGiven that, it seems very unlikely that we're losing an exception anywhere in that code. AFAICT, _claim_buildreqs and _start_builds both complete without raising. That mean that there's a logic error somewhere in that code, or it could be a hint that the problem is elsewhere, perhaps some network issue related race/error.
Reporter | ||
Comment 28•11 years ago
|
||
I realized that even though we're not getting a Traceback, we never see the https://mxr.mozilla.org/build-central/source/buildbot/master/buildbot/process/base.py#188 log message...so something in the Builder.run deferred chain is breaking down. I'm hoping this patch will help us determine where exactly that's happening.
Attachment #8350181 -
Flags: review?(catlee)
Updated•11 years ago
|
Attachment #8350181 -
Flags: review?(catlee) → review+
Reporter | ||
Comment 29•11 years ago
|
||
Comment on attachment 8350181 [details] [diff] [review] add more debugging messages I'm doing a graceful restart on bm62 to get this patch running.
Attachment #8350181 -
Flags: checked-in+
Reporter | ||
Comment 30•11 years ago
|
||
Here's the db log for brid 33985548. There's lots of attempted claims before and after the fixer upper script runs. The claimed_at times are local to buildbot masters, so I suspect the main reason there's so many claims at various times is clock skew. The most notable thing here is that we have _two_ completion events. I tracked these down to http://buildbot-master58.srv.releng.usw2.mozilla.com:8001/builders/Firefox%20mozilla-aurora%20linux64%20l10n%20nightly/builds/2200 and http://buildbot-master62.srv.releng.use1.mozilla.com:8001/builders/Firefox%20mozilla-aurora%20linux64%20l10n%20nightly/builds/3380 (which are https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013-12-19-00-40-03-mozilla-aurora-l10n/mozilla-aurora-linux64-l10n-nightly-eu-bm58-build1-build2200.txt.gz and https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013-12-19-00-40-03-mozilla-aurora-l10n/mozilla-aurora-linux64-l10n-nightly-eu-bm62-build1-build3380.txt.gz). This doesn't seem to happen for all claims that get reset, but enough where we're wasting a lot of resources. I looked at 20 buildrequests and 19 of them had two builds. Some had two builds on the _same_ master. What's fascinating about this is that both builds start _after_ the claim was reset. So I guess what happens is something like: * master A claims the buildrequest but doesn't start the build (reason TBD) * buildrequest fixer upper comes along and resets the claim * master A and B both somehow decide they can claim the newly unclaimed buildrequest, both run builds to completion The fact that some of these buildrequests have two builds on the same master makes me think that something in https://mxr.mozilla.org/build-central/source/buildbot/master/buildbot/process/builder.py#567 is taking a VERY long time to complete -- otherwise I'm not sure how we could get two builds for the same buildrequest. This may hint towards some bug in slave connect/disconnect, dunno. This says to me that we need to fix this more urgently - every time we hit one of these we run a job more than once, which wastes a lot of precious machine time. Also, if this happens for a release build it can really screw things up.
Reporter | ||
Comment 31•11 years ago
|
||
HM! Something I just realized is that build requests are claimed in a loop, but no Builds are created until after that entire loop finishes: https://mxr.mozilla.org/build-central/source/buildbot/master/buildbot/process/builder.py#516 So, if we claim request A in the first iteration of the loop, but we still have another 5 (or 10 or 50) requests to claim, it could take some time between the claim happening and the Build existing. This might explain why we see this so much with l10n -- where we have 80+ unmergable buildrequests coming at the exact same time. nextSlave and nextBuild both run at the start of this loop, so if they take too long to run, we'll end up resetting the claim on the buildrequest the next time fix-nonstarted-buildrequests.sh runs. (Too long is currently 90 seconds, but that's fudged a little because of clock skew between bm81 (where the script runs) and other masters.
Reporter | ||
Comment 32•11 years ago
|
||
I bumped the buildrequest restarter to wait 300s before resetting claims - I'm hoping that gets rid of all of the cases of resetting claims on buildrequests that are simply taking awhile (as opposed to completely dead).
Reporter | ||
Comment 33•11 years ago
|
||
One last thing for today: It could be the case that all the "failures" we're seeing now are just cases of long delays between claiming and starting builds (which is why I bumped the delay time). If that's the case, it's likely that the original patch here fixed the original problem. So far, I haven't been able to find any buildrequests that were reset by the script after that patch landed that didn't result in 2 builds -- which supports that theory. We should have a better idea in the new year.
Reporter | ||
Comment 34•11 years ago
|
||
(In reply to Ben Hearsum [:bhearsum] from comment #32) > I bumped the buildrequest restarter to wait 300s before resetting claims - > I'm hoping that gets rid of all of the cases of resetting claims on > buildrequests that are simply taking awhile (as opposed to completely dead). This wasn't enough -- we got a bunch of duplicate l10n builds again this morning. There were 14 buildrequests whose claims were reset (11 at 03:40 PT, 1 at 03:45 PT, 3 at 03:50 PT). AFAICT, all of theme were on mozilla-aurora linux64 repacks. Curiously, I can only find 11 builds with more than one log -- which means that some of the "hung" buildrequests were truly hung, and some weren't. I'm not sure how to figure out which is which, but in e-mail we have: id buildsetid buildername priority claimed_at claimed_by_name claimed_by_incarnation complete results submitted_at complete_at id number brid start_time finish_time 34261127 8276317 Firefox mozilla-aurora linux64 l10n nightly 0 1388662364 buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid14378-boot1383085958 0 NULL 1388662328 NULL NULL NULL NULL NULL NULL 34261126 8276316 Firefox mozilla-aurora linux64 l10n nightly 0 1388662364 buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid14378-boot1383085958 0 NULL 1388662327 NULL NULL NULL NULL NULL NULL 34261125 8276315 Firefox mozilla-aurora linux64 l10n nightly 0 1388662364 buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid14378-boot1383085958 0 NULL 1388662326 NULL NULL NULL NULL NULL NULL 34261124 8276314 Firefox mozilla-aurora linux64 l10n nightly 0 1388662364 buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid14378-boot1383085958 0 NULL 1388662325 NULL NULL NULL NULL NULL NULL 34261122 8276312 Firefox mozilla-aurora linux64 l10n nightly 0 1388662364 buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid14378-boot1383085958 0 NULL 1388662324 NULL NULL NULL NULL NULL NULL 34261123 8276313 Firefox mozilla-aurora linux64 l10n nightly 0 1388662364 buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid14378-boot1383085958 0 NULL 1388662324 NULL NULL NULL NULL NULL NULL 34261121 8276311 Firefox mozilla-aurora linux64 l10n nightly 0 1388662364 buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid14378-boot1383085958 0 NULL 1388662323 NULL NULL NULL NULL NULL NULL 34261119 8276309 Firefox mozilla-aurora linux64 l10n nightly 0 1388662364 buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid14378-boot1383085958 0 NULL 1388662321 NULL NULL NULL NULL NULL NULL 34261117 8276307 Firefox mozilla-aurora linux64 l10n nightly 0 1388662364 buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid14378-boot1383085958 0 NULL 1388662320 NULL NULL NULL NULL NULL NULL 34261118 8276308 Firefox mozilla-aurora linux64 l10n nightly 0 1388662371 buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid13849-boot1383062515 0 NULL 1388662320 NULL NULL NULL NULL NULL NULL 34261116 8276306 Firefox mozilla-aurora linux64 l10n nightly 0 1388662364 buildbot-master63.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid14378-boot1383085958 0 NULL 1388662319 NULL NULL NULL NULL NULL NULL 34261133 8276323 Firefox mozilla-aurora linux64 l10n nightly 0 1388662399 buildbot-master57.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid1978-boot1382976516 0 NULL 1388662332 NULL NULL NULL NULL NULL NULL 34261137 8276327 Firefox mozilla-aurora linux64 l10n nightly 0 1388662991 buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid13849-boot1383062515 0 NULL 1388662336 NULL NULL NULL NULL NULL NULL 34261138 8276328 Firefox mozilla-aurora linux64 l10n nightly 0 1388662991 buildbot-master61.srv.releng.use1.mozilla.com:/builds/buildbot/build1/master pid13849-boot1383062515 0 NULL 1388662336 NULL NULL NULL NULL NULL NULL So, once again we're clustered onto masters in a single colo.
Reporter | ||
Comment 35•11 years ago
|
||
I made a small tweak to the cronjobs. The one that e-mails us will alert on any claimed buildrequest that has gone 300s without a build, and the one that actually resets the claim will do so after 600s. I'm hoping that with this we'll be able to hit a case of this happening and disable the reset cronjob so that we can debug it live.
Reporter | ||
Comment 36•11 years ago
|
||
The blurt script just threw this out: id buildsetid buildername priority claimed_at claimed_by_name claimed_by_incarnation complete results submitted_at complete_at id number brid start_time finish_time 34318105 8288058 b2g_ubuntu32_vm b2g-inbound opt test gaia-integration 0 1388774128 buildbot-master67.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master pid24295-boot1386962733 0 +NULL 1388773931 NULL NULL NULL NULL NULL NULL I disabled the fixing script but by the time I started looking at the database, a build had started (http://buildbot-master67.srv.releng.use1.mozilla.com:8201/builders/b2g_ubuntu32_vm%20b2g-inbound%20opt%20test%20gaia-integration/builds/28) In twistd.log, I found this: 40342 2014-01-03 10:32:40-0800 [-] starting build <Build b2g_ubuntu32_vm b2g-inbound opt test gaia-integration>.. pinging the slave <SlaveBuilder builder='b2g_ubuntu32_vm b2g-inbound opt test gaia-integration' slave='tst-linux32-spot-117'> 40343 2014-01-03 10:32:40-0800 [-] sending ping And then nothing about that builder until....: 43230 2014-01-03 10:48:10-0800 [Broker,37045,10.134.57.10] ping finished: failure 43231 2014-01-03 10:48:10-0800 [Broker,37045,10.134.57.10] I tried to tell the slave that the build <Build b2g_ubuntu32_vm b2g-inbound opt test gaia-integration> started, but remote_startBuild failed: slave ping failed 43232 2014-01-03 10:48:10-0800 [Broker,37045,10.134.57.10] re-queueing the BuildRequest 43233 2014-01-03 10:48:10-0800 [Broker,37045,10.134.57.10] BuildSlave.detached(tst-linux32-spot-117) And then the requeue goes on to succeeding. So in this particular case, it takes about 15 minutes to fail to ping a slave, during which time the requeuing script that I wrote has already reset the claim -- causing us to get two builds in many cases. I've adjusted the blurt script to wait 20min and the resetting script to wait 22min now -- this should get rid of any false positives from this case. I also filed bug 956378 to reduce the ping failure time.
Reporter | ||
Comment 37•11 years ago
|
||
We haven't seen any more of these since I adjusted the time on the cronjob - so that's a good sign. It's possible that attachment 819077 [details] [diff] [review] really did fix the problem here, and that we've been retriggering jobs due to slow slave pings ever since. If we don't see any instances of this this week I'm going to close this bug.
Reporter | ||
Comment 38•10 years ago
|
||
We still have jobs that take awhile to start due to bug 956738, but I don't think we any that fail to start forever since the buildbot patch here landed. I'm pretty confident that the ones we've seen since that have all been false positives as described in comment #36. I've disabled the cronjobs on bm81 now, but I'll be attaching them here for posterity.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 10 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 39•10 years ago
|
||
Reporter | ||
Comment 40•10 years ago
|
||
Reporter | ||
Updated•10 years ago
|
Attachment #8359204 -
Attachment mime type: application/x-shellscript → text/plain
Comment 41•10 years ago
|
||
ben, I know this was a long time ago, but any recollection as to why its not still running? or if it is running where it is? (I was looking for this today while helping to debug some DB issues)
Flags: needinfo?(bhearsum)
Reporter | ||
Comment 42•10 years ago
|
||
(In reply to Justin Wood (:Callek) from comment #41) > ben, I know this was a long time ago, but any recollection as to why its not > still running? or if it is running where it is? > > (I was looking for this today while helping to debug some DB issues) See comment #38
Flags: needinfo?(bhearsum)
Assignee | ||
Updated•6 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•