[Pgcluster-general] Problems after data nodes lose connectivity

Gábriel Ákos akos.gabriel at i-logic.hu
Thu Jun 21 08:54:38 UTC 2007


On Thu, 21 Jun 2007 01:48:56 -0700
Tom Seago <tom at sipcall.com> wrote:

It is because everyone, always, consistently fails to recognise that
pgcluster relies on the uniqueness of rows in a table.
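
To illustrate why uniqueness matters when the same statement may be
applied twice, here is a minimal sketch. It uses Python's built-in
sqlite3 module as a stand-in, not PGCluster or PostgreSQL, and it makes
no claim about how pgreplicate actually replays statements. The function
name replay_inserts is made up for this example; the table and row are
borrowed from the report quoted below.

```python
import sqlite3


def replay_inserts(with_primary_key: bool) -> int:
    """Apply the same INSERT twice and return the resulting row count."""
    conn = sqlite3.connect(":memory:")
    key = "PRIMARY KEY" if with_primary_key else ""
    conn.execute(f"CREATE TABLE t1 (id INTEGER {key}, label TEXT)")
    for _ in range(2):  # the second pass stands in for a replayed statement
        try:
            conn.execute("INSERT INTO t1 VALUES (24, 'row 24')")
        except sqlite3.IntegrityError:
            pass  # the unique key rejects the duplicate row
    count = conn.execute("SELECT count(*) FROM t1").fetchone()[0]
    conn.close()
    return count


print(replay_inserts(with_primary_key=False))  # 2
print(replay_inserts(with_primary_key=True))   # 1
```

Without a key the table silently ends up with two identical rows; with
a key the second insert is refused and the table stays consistent.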


> In the limited testing that I have been able to do, I have seen the  
> same results.  I was testing with 1 lb, 1 replicator, and 2 db  
> nodes.  When I would shut down one of the db nodes and then restart  
> it (by killing the process and then restarting the process),  
> depending on the timing I was able to get both missing records in
> the db that was shut down as well as duplicate inserts into that  
> database.  Since we have a somewhat urgent need for a multi-node  
> database we are going to be moving on to other technology, but will  
> keep an eye on the state of this project.
> 
> I am curious, how many people are using pgcluster in a production  
> environment?  Perhaps these problems are isolated to the most recent  
> version?  While the software does work as long as all nodes are  
> carefully brought up without load in an idle setting, the current  
> version really doesn't seem to handle error conditions.  This is too
> bad because the whole point of such software is to handle those
> error conditions.
> 
> 										(-:
> Tom ;-)
> 
> 
> 
> On Jun 20, 2007, at 10:27 PM, Pshem Kowalczyk wrote:
> 
> > Hi,
> >
> > I have 3 data nodes, 1 replicator and 1 loadbalancer. Postgresql
> > 8.2.4 and corresponding patch.
> > The whole setup works nicely when all nodes are online, but when I
> > ifdown one of the interfaces on the data nodes - things go really
> > bad.
> >
> > Scenario
> > - I start a simple perl script on the loadbalancer inserting one row
> > per second and printing times of inserts
> > - I shut down the network interface in data3 - the inserts stop
> > - I bring the network interface back up - inserts resume
> > - database on data1 doesn't have all the rows:
> >
> > testdb1=> select count(*) from t1;
> > count
> > -------
> >    13
> > (1 row)
> >
> > but it's not in read-only mode:
> >
> > testdb1=> insert into t1 values (24, 'row 24');
> > INSERT 0 1
> >
> > and the changes replicate to other nodes
> >
> > - database on data2 has all the rows (and is in read-write mode):
> >
> > testdb1=> select count(*) from t1;
> > count
> > -------
> >    23
> > (1 row)
> >
> > - database on data3 doesn't have all the rows (and is in read-only  
> > mode)
> >
> > testdb1=# select count(*) from t1;
> > count
> > -------
> >    13
> > (1 row)
> >
> > But accepts changes from other nodes.
> >
> > Obviously the behaviour of data1 is unacceptable - even if it got
> > de-synced by accident it should get switched into read-only mode.
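
The condition described here (a node that is behind but still writable)
can be checked mechanically. Below is a rough sketch under stated
assumptions: the row counts and read-only flags are the ones reported
above, entered by hand, and NodeState and writable_but_behind are
hypothetical names for this example, not part of pgcluster.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    name: str
    row_count: int
    read_only: bool


def writable_but_behind(nodes):
    """Return names of nodes with fewer rows than the fullest node that still accept writes."""
    most_rows = max(node.row_count for node in nodes)
    return [n.name for n in nodes if n.row_count < most_rows and not n.read_only]


# Values copied from the psql output quoted in this report.
observed = [
    NodeState("data1", 13, read_only=False),
    NodeState("data2", 23, read_only=False),
    NodeState("data3", 13, read_only=True),
]
print(writable_but_behind(observed))  # ['data1']
```

Comparing count(*) is only a coarse signal; two nodes can hold the same
number of rows and still differ in content.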
> >
> > Log from the inserting script:
> > # ./insert.pl
> > 1182401599 row 1
> > 1182401600 row 2
> > 1182401601 row 3
> > 1182401602 row 4
> > 1182401603 row 5
> > 1182401604 row 6
> > 1182401605 row 7
> > 1182401606 row 8
> > 1182401607 row 9
> > 1182401608 row 10
> > 1182401609 row 11
> > 1182401610 row 12
> > 1182401625 row 13 <==== please notice the 15 sec gap
> > 1182401626 row 14
> > 1182401627 row 15
> > 1182401628 row 16
> > 1182401629 row 17
> > 1182401630 row 18
> > 1182401631 row 19
> > 1182401632 row 20
> > 1182401633 row 21
> > 1182401634 row 22
> > 1182401635 row 23
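
For longer runs the pause can be found without reading the log by eye.
A small sketch, assuming the log keeps the "<unix timestamp> row <n>"
layout shown above; find_gaps is a hypothetical helper, not something
shipped with pgcluster or the original insert.pl.

```python
def find_gaps(log_lines, expected_interval=1):
    """Return (previous_row, next_row, seconds) for each pause longer than expected."""
    entries = []
    for line in log_lines:
        parts = line.split()
        # Expect "<unix_timestamp> row <n>"; skip anything that does not match.
        if len(parts) >= 3 and parts[0].isdigit() and parts[1] == "row" and parts[2].isdigit():
            entries.append((int(parts[0]), int(parts[2])))
    gaps = []
    for (t_prev, n_prev), (t_next, n_next) in zip(entries, entries[1:]):
        if t_next - t_prev > expected_interval:
            gaps.append((n_prev, n_next, t_next - t_prev))
    return gaps


# Four lines taken from the log above.
log = """\
1182401609 row 11
1182401610 row 12
1182401625 row 13
1182401626 row 14
""".splitlines()

print(find_gaps(log))  # [(12, 13, 15)]
```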
> >
> >
> > If I shut down data3 gracefully it gets removed properly and
> > everything seems to work OK.
> >
> >
> >
> > My configuration is below:
> > /etc/hosts
> > 127.0.0.1       localhost
> >
> > 10.23.254.115   loadbalancer
> > 10.23.254.116   replicator
> > 10.23.254.117   data1
> > 10.23.254.118   data2
> > 10.23.254.119   data3
> >
> >
> > # The following lines are desirable for IPv6 capable hosts
> > ::1     ip6-localhost ip6-loopback
> > fe00::0 ip6-localnet
> > ff00::0 ip6-mcastprefix
> > ff02::1 ip6-allnodes
> > ff02::2 ip6-allrouters
> > ff02::3 ip6-allhosts
> >
> >
> > pglb.conf
> > <Cluster_Server_Info>
> >    <Host_Name>                 data1                   </Host_Name>
> >    <Port>                      5432                    </Port>
> >    <Max_Connect>               32                      </Max_Connect>
> > </Cluster_Server_Info>
> > <Cluster_Server_Info>
> >    <Host_Name>                 data2                   </Host_Name>
> >    <Port>                      5432                    </Port>
> >    <Max_Connect>               32                      </Max_Connect>
> > </Cluster_Server_Info>
> > <Cluster_Server_Info>
> >    <Host_Name>                 data3                   </Host_Name>
> >    <Port>                      5432                    </Port>
> >    <Max_Connect>               32                      </Max_Connect>
> > </Cluster_Server_Info>
> > <Host_Name>                     loadbalancer            </Host_Name>
> > <Backend_Socket_Dir>            /var/run/postgresql/    </Backend_Socket_Dir>
> > <Receive_Port>                  5432                    </Receive_Port>
> > <Recovery_Port>                 6001                    </Recovery_Port>
> > <Max_Cluster_Num>               128                     </Max_Cluster_Num>
> > <Use_Connection_Pooling>        no                      </Use_Connection_Pooling>
> > <LifeCheck_Timeout>             1s                      </LifeCheck_Timeout>
> > <LifeCheck_Interval>            2s                      </LifeCheck_Interval>
> > <Log_File_Info>
> >        <File_Name>             /tmp/pglb.log   </File_Name>
> >        <File_Size>             1M              </File_Size>
> >        <Rotate>                3               </Rotate>
> > </Log_File_Info>
> >
> > pgreplicate.conf
> > <Cluster_Server_Info>
> >    <Host_Name>                 data1                   </Host_Name>
> >    <Port>                      5432                    </Port>
> >    <Max_Connect>               32                      </Max_Connect>
> >    <Recovery_Port>             7001                    </Recovery_Port>
> > </Cluster_Server_Info>
> > <Cluster_Server_Info>
> >    <Host_Name>                 data2                   </Host_Name>
> >    <Port>                      5432                    </Port>
> >    <Max_Connect>               32                      </Max_Connect>
> >    <Recovery_Port>             7001                    </Recovery_Port>
> > </Cluster_Server_Info>
> > <Cluster_Server_Info>
> >    <Host_Name>                 data3                   </Host_Name>
> >    <Port>                      5432                    </Port>
> >    <Max_Connect>               32                      </Max_Connect>
> >    <Recovery_Port>             7001                    </Recovery_Port>
> > </Cluster_Server_Info>
> > <LoadBalance_Server_Info>
> >        <Host_Name>             loadbalancer            </Host_Name>
> >        <Recovery_Port>         6001                    </Recovery_Port>
> > </LoadBalance_Server_Info>
> > <Host_Name>                     replicator              </Host_Name>
> > <Replication_Port>              8001                    </Replication_Port>
> > <Recovery_Port>                 8101                    </Recovery_Port>
> > <RLOG_Port>                     8301                    </RLOG_Port>
> > <Response_Mode>                 normal                  </Response_Mode>
> > <Use_Replication_Log>           no                      </Use_Replication_Log>
> > <Replication_Timeout>           10s                     </Replication_Timeout>
> > <LifeCheck_Timeout>             2s                      </LifeCheck_Timeout>
> > <LifeCheck_Interval>            3s                      </LifeCheck_Interval>
> > <Log_File_Info>
> >        <File_Name>             /tmp/pgreplicate.log    </File_Name>
> >        <File_Size>             1M                      </File_Size>
> >        <Rotate>                3                       </Rotate>
> > </Log_File_Info>
> >
> >
> > data1:
> > <Replicate_Server_Info>
> >        <Host_Name>             replicator              </Host_Name>
> >        <Port>                  8001                    </Port>
> >        <Recovery_Port>         8101                    </Recovery_Port>
> > </Replicate_Server_Info>
> > <Host_Name>                     data1                   </Host_Name>
> > <Recovery_Port>                 7001                    </Recovery_Port>
> > <Rsync_Path>                    /usr/bin/rsync          </Rsync_Path>
> > <Rsync_Option>                  ssh                     </Rsync_Option>
> > <Rsync_Compress>                yes                     </Rsync_Compress>
> > <Pg_Dump_Path>                  /usr/bin/pg_dump        </Pg_Dump_Path>
> > <When_Stand_Alone>              read_only               </When_Stand_Alone>
> > <Replication_Timeout>           10s                     </Replication_Timeout>
> > <LifeCheck_Timeout>             2s                      </LifeCheck_Timeout>
> > <LifeCheck_Interval>            3s                      </LifeCheck_Interval>
> >
> > data2:
> > <Replicate_Server_Info>
> >        <Host_Name>             replicator              </Host_Name>
> >        <Port>                  8001                    </Port>
> >        <Recovery_Port>         8101                    </Recovery_Port>
> > </Replicate_Server_Info>
> > <Host_Name>                     data2                   </Host_Name>
> > <Recovery_Port>                 7001                    </Recovery_Port>
> > <Rsync_Path>                    /usr/bin/rsync          </Rsync_Path>
> > <Rsync_Option>                  ssh                     </Rsync_Option>
> > <Rsync_Compress>                yes                     </Rsync_Compress>
> > <Pg_Dump_Path>                  /usr/bin/pg_dump        </Pg_Dump_Path>
> > <When_Stand_Alone>              read_only               </When_Stand_Alone>
> > <Replication_Timeout>           10s                     </Replication_Timeout>
> > <LifeCheck_Timeout>             2s                      </LifeCheck_Timeout>
> > <LifeCheck_Interval>            3s                      </LifeCheck_Interval>
> >
> > data3:
> > <Replicate_Server_Info>
> >        <Host_Name>             replicator              </Host_Name>
> >        <Port>                  8001                    </Port>
> >        <Recovery_Port>         8101                    </Recovery_Port>
> > </Replicate_Server_Info>
> > <Host_Name>                     data3                   </Host_Name>
> > <Recovery_Port>                 7001                    </Recovery_Port>
> > <Rsync_Path>                    /usr/bin/rsync          </Rsync_Path>
> > <Rsync_Option>                  ssh                     </Rsync_Option>
> > <Rsync_Compress>                yes                     </Rsync_Compress>
> > <Pg_Dump_Path>                  /usr/bin/pg_dump        </Pg_Dump_Path>
> > <When_Stand_Alone>              read_only               </When_Stand_Alone>
> > <Replication_Timeout>           10s                     </Replication_Timeout>
> > <LifeCheck_Timeout>             2s                      </LifeCheck_Timeout>
> > <LifeCheck_Interval>            3s                      </LifeCheck_Interval>
> >
> >
> > Log files of pglb and pgreplicator are attached.
> >
> > kind regards
> > Pshem
> > <pgrepliacate-failed.log>
> > <pglb-failed.log>
> > _______________________________________________
> > Pgcluster-general mailing list
> > Pgcluster-general at pgfoundry.org
> > http://pgfoundry.org/mailman/listinfo/pgcluster-general
> 


-- 
Üdvözlettel,
Gábriel Ákos
-=E-Mail :akos.gabriel at i-logic.hu|Web:  http://www.i-logic.hu =-
-=Tel/fax:+3612367353            |Mobil:+36209278894          =-

