Skip to content

Conversation

intellasoft
Copy link
Contributor

One situation where deadlocks occur is when:
-> res_pjsip_sdp_rtp is checking for rtp timeouts -> Which locks
ao2_container channels, and then waits for a lock on a channel
that's currently running IMPORT
At the same time
-> Dialplan asks for IMPORT(${some_channel},...) -> Which grabs a
channel lock and then waits on ao2_container channels

So in this situation, res_pjsip_sdp_rtp is waiting on a channel lock
that will never release, because res_pjsip_sdp_rtp is currently
holding ao2_container channels, that the IMPORT needs after having
already locked the channel that res_pjsip_sdp_rtp is processing next.

Resolves: #307

cherry-pick-to: 19
cherry-pick-to: 20
cherry-pick-to: master

@asterisk-org-access-app
Copy link

REMINDER: If this PR applies to other branches, please add a comment with the appropriate "cherry-pick-to" headers as per the Create a Pull Request process.

If you don't want it cherry-picked, please add a comment with cherry-pick-to: none so we don't keep asking.

If, after adding "cherry-pick-to" comments, you change your mind, please edit the comment to DELETE the header lines and add cherry-pick-to: none.

The currently active branches are now 18, 20, 21 and master.

Copy link
Contributor

@InterLinked1 InterLinked1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the cherry pick lines from the commit message and make a comment with those instead.

main/channel.c Outdated
@@ -712,12 +718,23 @@ static int does_id_conflict(const char *uniqueid)
{
struct ast_channel *conflict;
size_t length = 0;
struct ast_channel_callback_data callback_data = {.search = uniqueid, .final_match_result = 0};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put each element on its own line? Not sure if that's explicitly in the coding guidelines but that's easier to read.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

main/channel.c Outdated

if (ast_strlen_zero(uniqueid)) {
return 0;
}

conflict = ast_channel_callback(ast_channel_by_uniqueid_cb, (char *) uniqueid, &length, OBJ_NOLOCK);
try_again:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think goto is necessary here, this could be a do while or for (;;) with break

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

main/channel.c Outdated

if (callback_data.final_match_result == CMP_RETRY_NEEDED) {
/* We ran into a possible deadlock, try again */
ast_debug(3, "ast_channel_iterator_by_exten_new() Avoid deadlock trying to find channel by extension@context: %s@%s\n", exten, context);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ast_debug includes the function name, no need to repeat it.

This could all be simplified to:

`ast_debug(3, "Avoided deadlock trying to find channel by %@%s\n", exten, context);

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

main/channel.c Outdated

if (ast_strlen_zero(l_name)) {
/* We didn't have a name to search for so quit. */
return NULL;
}

chan = ast_channel_callback(ast_channel_by_name_cb, l_name, &name_len,
try_name_again:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here, you can use 2 loops here and goto isn't needed.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are goto's that bad? It reduces nesting. The only thing that adding a loop does is introduce an extra level of nesting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're not inherently bad, this just doesn't seem like a necessary or appropriate use for them, since a loop works perfectly fine. I don't think there's a problem with adding a level of nesting, it contains the logic in its own scope.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

main/channel.c Outdated

if (callback_data.final_match_result == CMP_RETRY_NEEDED) {
/* We ran into a possible deadlock, try again */
ast_debug(3, "ast_channel_get_by_name_prefix() Avoid deadlock trying to find channel by uniqueid: %s\n", name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about log format

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@kobaz
Copy link

kobaz commented Sep 9, 2023

cherry-pick-to: 19
cherry-pick-to: 20
cherry-pick-to: master

@InterLinked1
Copy link
Contributor

cherry-pick-to: 19
cherry-pick-to: 20
cherry-pick-to: master

PRs should be raised against master, not 18 (unlike Gerrit, it makes a difference now).
And it should be cherry picked to 18, 20, and 21, not 19.

@kobaz
Copy link

kobaz commented Sep 9, 2023

master is slightly different. It will not cleanly cherry-pick to branches from master. Is the process still go to to master first? Make separate PRs for each branch?

@InterLinked1
Copy link
Contributor

InterLinked1 commented Sep 9, 2023

master is slightly different. It will not cleanly cherry-pick to branches from master. Is the process still go to to master first? Make separate PRs for each branch?

Ah, okay, then it's more complicated, then you might need to do multiple sets since GitHub can only cherry pick verbatim, might want to double check with George.

@kobaz
Copy link

kobaz commented Sep 10, 2023

master is slightly different. It will not cleanly cherry-pick to branches from master. Is the process still go to to master first? Make separate PRs for each branch?

Ah, okay, then it's more complicated, then you might need to do multiple sets since GitHub can only cherry pick verbatim, might want to double check with George.

master no longer has support for macros, so the checks in ast_channel_by_exten_cb() will not cherry-pick clean with the change as-is.

main/channel.c Outdated
/* We found a match or the search fully completed through all channels */
break;
}
while (1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while should go on previous line
Though really I think for(;;) is preferred in Asterisk for infinite loops.

main/channel.c Outdated
/* We found a match or the search fully completed through all channels */
break;
}
while (1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

main/channel.c Outdated
/* We found a match or the search fully completed through all channels */
break;
}
while (1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@kobaz
Copy link

kobaz commented Sep 21, 2023

Update: Been running this in the wild with mixed results.

Some days it's been completely solid with no deadlocks (better than before with 3-5 deadlocks per day or more).

Some days it actually almost immediately deadlocked after handling just a few calls.

@jcolp jcolp marked this pull request as draft September 29, 2023 13:43
@jcolp
Copy link
Member

jcolp commented Sep 29, 2023

Due to the last comment I am transitioning this to a draft as it does not appear to be in a state for review as it does not resolve the underlying issue. If this is untrue and it should actually be reviewed, then it can be published.

@kobaz
Copy link

kobaz commented Sep 30, 2023

Fixed previous issue where it did not fully fix the problem. Small refactor to avoid breaking assumptions about value of 'arg' with respect to find channel callbacks.

It's now passing my tests regarding the original deadlock situation.

I would nominate this ready for review now.

One situation where deadlocks occur is when:
   -> res_pjsip_sdp_rtp is checking for rtp timeouts -> Which locks
      ao2_container channels, and then waits for a lock on a channel
      that's currently running IMPORT
 At the same time
   -> Dialplan asks for IMPORT(${some_channel},...) -> Which grabs a
      channel lock and then waits on ao2_container channels

So in this situation, res_pjsip_sdp_rtp is waiting on a channel lock
that will never release, because res_pjsip_sdp_rtp is currently
holding ao2_container channels, that the IMPORT needs after having
already locked the channel that res_pjsip_sdp_rtp is processing next.

UserNote: Module developers should discontinue use of ast_channel_callback()
and use ast_channel_callback_safe() instead.  (This will avoid causing deadlocks)

Resolves: asterisk#307
@kobaz
Copy link

kobaz commented Oct 8, 2023

Pushed a new one.. fixes channel reference leaks (introduced in the previous patch) when OBJ_MULTIPLE is used to return an iterator

@kobaz
Copy link

kobaz commented Oct 8, 2023

While trying to extract out components from our full platform into a standalone deadlock reproduction test, I've uncovered an additional deadlock related to ast_sip_push_task_wait() also related to IMPORT import_helper() (posted to #307). I realized this is not the same deadlock that this #311 pull is addressing. So I'll work on another test to trigger this specific one as well.

This is related, but probably should be split into a new issue since this other newly uncovered deadlock is not related to ao2_container *channels.

@gtjoseph
Copy link
Member

I'm converting this back to a draft again since there are still outstanding changes requested and the cherry-pick lines still aren't correct.

Speaking of which, are you saying this PR will cherry-pick cleanly from 18 to master but not the other way around?

@gtjoseph gtjoseph marked this pull request as draft January 23, 2024 16:42
@kobaz
Copy link

kobaz commented Jan 23, 2024

Correct, I don't think it goes the other way, but I can do some tests.

@gtjoseph
Copy link
Member

@intellasoft Is this PR still relevant since the channel storage backends work was merged?

@intellasoft
Copy link
Contributor Author

That's a really good question. This can probably be put on hold for now... I'll need to validate if this still happens with the new backend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants