liu4480 commented Aug 25, 2016

By design, an arbitrator cannot revoke tickets. If the command line does not specify a site, booth tries to find a "local site": a configured member in the same subnet as the machine running the booth command.

If the arbitrator and all sites are in the same subnet, booth chooses the first one as the "local site". If that is the arbitrator, booth thinks it is running on an arbitrator and refuses to execute revoke operations.

This patch adds a filter that picks the best-matching arbitrator/site.

    *me = node;
    did_match = EXACT_MATCH;
    break;
    if((matched_tmp < matched * 8)||((node->type == SITE)&&(matched_tmp == matched * 8)))
gao-yan commented Aug 25, 2016

Does it even compile? It appears matched_tmp is not declared ;)

And I don't quite follow the logic here. If it's an exact match, why isn't it the only one? Unless the node is running as both a site and an arbitrator?

The Travis build failed. Anyway, I'm not sure I understand the problem here. If a ticket request is received by a booth member that is not the ticket leader, that request is forwarded to the leader. @liu4480, did you actually observe this happening? Can you please post your configuration and the exact circumstances?

gao-yan commented Aug 25, 2016

Hi Dejan, @krig encountered this issue where "booth revoke" fails:

ha2:~ # booth list
ticket: ticketA, leader: 10.12.2.201, expires: 2016-08-23 08:29:25
ticket: ticketB, leader: NONE
ha2:~ # booth revoke ticketA
Aug 23 08:19:51 ha2 booth: [5610]: ERROR: We're just an arbitrator, cannot grant/revoke tickets here.

But ha2 is not the arbitrator, it is a node in the cluster currently holding the ticket.

Configuration file:

# The booth configuration file is "/etc/booth/booth.conf". You need to
# prepare the same booth configuration file on each arbitrator and
# each node in the cluster sites where the booth daemon can be launched.
# Here is an example of the configuration file:

# "transport" means which transport layer booth daemon will use.
# Currently only "UDP" is supported.
transport="UDP"

# The port that booth daemons will use to talk to each other.
port="9929"

# The arbitrator IP. If you want to configure several arbitrators,
# you need to configure each arbitrator with a separate line.
arbitrator="10.12.2.105"

# The site IP. The cluster site uses this IP to talk to other sites.
# Like arbitrator, you need to configure each site with a separate line.
site="10.12.2.201"
site="10.12.2.202"

# The ticket name, which corresponds to a set of resources which can be
# fail-overed among different sites.
ticket="ticketA"
ticket="ticketB"
    expire = 600
    weights = 1,2,3

It sounds like booth may pick the wrong member as the "local site" if the sites and the arbitrator are in the same subnet. There seems to be a problem in the logic around EXACT_MATCH.

Jan Pokorný commented:

I've also observed "same subnet" vs. fuzzy matching issues.

On Thu, Aug 25, 2016 at 06:20:44AM -0700, Jan Pokorný wrote:

> @@ -108,10 +108,13 @@ static int find_address(unsigned char ipaddr[BOOTH_IPADDR_LEN],
>          break;
>
>      if (matched == node->addrlen) {
>          *address_bits_matched = matched * 8;
>          *me = node;
>          did_match = EXACT_MATCH;
>          break;
>          if((matched_tmp < matched * 8)||((node->type == SITE)&&(matched_tmp == matched * 8)))
>
> I've also observed "same subnet" vs. fuzzy matching issues.

Was that before or after commit b2e06e8?

Some cloud testers complained about this issue, but after that commit it worked for them (and for me).

liu4480 commented:

It was before this commit.

gao-yan commented:

Ah, this explains why I can't reproduce it with the upstream code ;) Thanks for pointing that out, Dejan.

On Fri, Aug 26, 2016 at 02:56:33AM -0700, Gao,Yan wrote:

> Ah, this explains why I'm not able to produce it with the upstream code;) Thanks for pointing out, Dejan.

OK, good :) I guess that we can close this issue then?

@dmuhamedagic, it happened with that commit already included: #52

liu4480 commented Aug 26, 2016

@gao-yan, sorry for the confusion.
What I mean is: if a site and the arbitrator are in the same subnet, there are two issues so far:
1) If you execute the command on a third node in the site, booth chooses the arbitrator as the local member.
2) If you execute the command on the booth site node itself, booth still thinks the arbitrator is local.

liu4480 commented Aug 26, 2016

I have VMs running booth in 192.168.122.0/24, with the following configuration:
port = 9929
transport = UDP
authfile = /etc/booth/booth.key
arbitrator = 192.168.122.83
site = 192.168.122.143
site = 192.168.122.62
ticket = dummy
retries = 5
expire = 900
timeout = 10

krig commented Aug 26, 2016

Cluster node ha2:

vagrant@ha2:~> ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:82:dd:06 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe82:dd06/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:0d:96:01 brd ff:ff:ff:ff:ff:ff
    inet 10.12.2.102/24 brd 10.12.2.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe0d:9601/64 scope link 
       valid_lft forever preferred_lft forever

Arbitrator ha5:

vagrant@ha5:~> ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:82:dd:06 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe82:dd06/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:fe:60:fb brd ff:ff:ff:ff:ff:ff
    inet 10.12.2.105/24 brd 10.12.2.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fefe:60fb/64 scope link 
       valid_lft forever preferred_lft forever

liu4480 closed this Aug 29, 2016