-
Notifications
You must be signed in to change notification settings - Fork 1.1k
channels: Avoid deadlocks when locking ao2_container *channels #311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 18
Are you sure you want to change the base?
Conversation
REMINDER: If this PR applies to other branches, please add a comment with the appropriate "cherry-pick-to" headers as per the Create a Pull Request process. If you don't want it cherry-picked, please add a comment with If, after adding "cherry-pick-to" comments, you change your mind, please edit the comment to DELETE the header lines and add The currently active branches are now 18, 20, 21 and master. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the cherry pick lines from the commit message and make a comment with those instead.
main/channel.c
Outdated
@@ -712,12 +718,23 @@ static int does_id_conflict(const char *uniqueid) | |||
{ | |||
struct ast_channel *conflict; | |||
size_t length = 0; | |||
struct ast_channel_callback_data callback_data = {.search = uniqueid, .final_match_result = 0}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we put each element on its own line? Not sure if that's explicitly in the coding guidelines but that's easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
main/channel.c
Outdated
|
||
if (ast_strlen_zero(uniqueid)) { | ||
return 0; | ||
} | ||
|
||
conflict = ast_channel_callback(ast_channel_by_uniqueid_cb, (char *) uniqueid, &length, OBJ_NOLOCK); | ||
try_again: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think goto is necessary here, this could be a do while or for (;;) with break
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
main/channel.c
Outdated
|
||
if (callback_data.final_match_result == CMP_RETRY_NEEDED) { | ||
/* We ran into a possible deadlock, try again */ | ||
ast_debug(3, "ast_channel_iterator_by_exten_new() Avoid deadlock trying to find channel by extension@context: %s@%s\n", exten, context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ast_debug includes the function name, no need to repeat it.
This could all be simplified to:
`ast_debug(3, "Avoided deadlock trying to find channel by %@%s\n", exten, context);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
main/channel.c
Outdated
|
||
if (ast_strlen_zero(l_name)) { | ||
/* We didn't have a name to search for so quit. */ | ||
return NULL; | ||
} | ||
|
||
chan = ast_channel_callback(ast_channel_by_name_cb, l_name, &name_len, | ||
try_name_again: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing here, you can use 2 loops here and goto isn't needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are goto's that bad? It reduces nesting. The only thing that adding a loop does is introduce an extra level of nesting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're not inherently bad, this just doesn't seem like a necessary or appropriate use for them, since a loop works perfectly fine. I don't think there's a problem with adding a level of nesting, it contains the logic in its own scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
main/channel.c
Outdated
|
||
if (callback_data.final_match_result == CMP_RETRY_NEEDED) { | ||
/* We ran into a possible deadlock, try again */ | ||
ast_debug(3, "ast_channel_get_by_name_prefix() Avoid deadlock trying to find channel by uniqueid: %s\n", name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about log format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
8fa1b68
to
e200e31
Compare
|
PRs should be raised against master, not 18 (unlike Gerrit, it makes a difference now). |
master is slightly different. It will not cleanly cherry-pick to branches from master. Is the process still go to to master first? Make separate PRs for each branch? |
Ah, okay, then it's more complicated, then you might need to do multiple sets since GitHub can only cherry pick verbatim, might want to double check with George. |
e200e31
to
52184d3
Compare
master no longer has support for macros, so the checks in ast_channel_by_exten_cb() will not cherry-pick clean with the change as-is. |
main/channel.c
Outdated
/* We found a match or the search fully completed through all channels */ | ||
break; | ||
} | ||
while (1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while should go on previous line
Though really I think for(;;)
is preferred in Asterisk for infinite loops.
main/channel.c
Outdated
/* We found a match or the search fully completed through all channels */ | ||
break; | ||
} | ||
while (1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
main/channel.c
Outdated
/* We found a match or the search fully completed through all channels */ | ||
break; | ||
} | ||
while (1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
Update: Been running this in the wild with mixed results. Some days it's been completely solid with no deadlocks (better than before with 3-5 deadlocks per day or more). Some days it actually almost immediately deadlocked after handling just a few calls. |
Due to the last comment I am transitioning this to a draft as it does not appear to be in a state for review as it does not resolve the underlying issue. If this is untrue and it should actually be reviewed, then it can be published. |
52184d3
to
520b377
Compare
Fixed previous issue where it did not fully fix the problem. Small refactor to avoid breaking assumptions about value of 'arg' with respect to find channel callbacks. It's now passing my tests regarding the original deadlock situation. I would nominate this ready for review now. |
274087e
to
59b0d33
Compare
59b0d33
to
a000ddc
Compare
One situation where deadlocks occur is when: -> res_pjsip_sdp_rtp is checking for rtp timeouts -> Which locks ao2_container channels, and then waits for a lock on a channel that's currently running IMPORT At the same time -> Dialplan asks for IMPORT(${some_channel},...) -> Which grabs a channel lock and then waits on ao2_container channels So in this situation, res_pjsip_sdp_rtp is waiting on a channel lock that will never release, because res_pjsip_sdp_rtp is currently holding ao2_container channels, that the IMPORT needs after having already locked the channel that res_pjsip_sdp_rtp is processing next. UserNote: Module developers should discontinue use of ast_channel_callback() and use ast_channel_callback_safe() instead. (This will avoid causing deadlocks) Resolves: asterisk#307
a000ddc
to
3579cc5
Compare
Pushed a new one.. fixes channel reference leaks (introduced in the previous patch) when OBJ_MULTIPLE is used to return an iterator |
While trying to extract out components from our full platform into a standalone deadlock reproduction test, I've uncovered an additional deadlock related to ast_sip_push_task_wait() also related to IMPORT import_helper() (posted to #307). I realized this is not the same deadlock that this #311 pull is addressing. So I'll work on another test to trigger this specific one as well. This is related, but probably should be split into a new issue since this other newly uncovered deadlock is not related to ao2_container *channels. |
I'm converting this back to a draft again since there are still outstanding changes requested and the cherry-pick lines still aren't correct. Speaking of which, are you saying this PR will cherry-pick cleanly from 18 to master but not the other way around? |
Correct, I don't think it goes the other way, but I can do some tests. |
@intellasoft Is this PR still relevant since the channel storage backends work was merged? |
That's a really good question. This can probably be put on hold for now... I'll need to validate if this still happens with the new backend. |
One situation where deadlocks occur is when:
-> res_pjsip_sdp_rtp is checking for rtp timeouts -> Which locks
ao2_container channels, and then waits for a lock on a channel
that's currently running IMPORT
At the same time
-> Dialplan asks for IMPORT(${some_channel},...) -> Which grabs a
channel lock and then waits on ao2_container channels
So in this situation, res_pjsip_sdp_rtp is waiting on a channel lock
that will never release, because res_pjsip_sdp_rtp is currently
holding ao2_container channels, that the IMPORT needs after having
already locked the channel that res_pjsip_sdp_rtp is processing next.
Resolves: #307
cherry-pick-to: 19
cherry-pick-to: 20
cherry-pick-to: master