[Pgcluster-general] Problems after data nodes looses connectivity

Tom Seago tom at sipcall.com
Thu Jun 21 08:48:56 UTC 2007


In the limited testing that I have been able to do, I have seen the
same results.  I was testing with 1 load balancer, 1 replicator, and 2 db
nodes.  When I shut down one of the db nodes and then restarted
it (by killing the process and then starting it again),
depending on the timing I got both missing records and duplicate
inserts in the database that had been shut down.  Since we have a
somewhat urgent need for a multi-node database, we are going to move
on to other technology, but will keep an eye on the state of this
project.
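
For anyone who wants to check for the same two symptoms, here is a minimal sketch of the comparison. It assumes the primary-key values of a table have already been fetched from a healthy node and from the restarted node; the function name and the sample id lists are made up for illustration and are not part of pgcluster.

```python
from collections import Counter

def compare_nodes(reference_ids, restarted_ids):
    """Compare ids from a healthy node with ids from a restarted node.

    Returns (missing, duplicated): ids present on the healthy node but
    absent on the restarted one, and ids stored more than once on the
    restarted one.
    """
    counts = Counter(restarted_ids)
    missing = sorted(set(reference_ids) - set(counts))
    duplicated = sorted(i for i, n in counts.items() if n > 1)
    return missing, duplicated

# Hypothetical id lists, not taken from a real test run.
reference = list(range(1, 11))
restarted = [1, 2, 3, 4, 4, 5, 8, 9, 10]

missing, duplicated = compare_nodes(reference, restarted)
print("missing on restarted node:", missing)
print("duplicated on restarted node:", duplicated)
```

Duplicates only show up this way if the table has no unique constraint on the id column; with one, the duplicate insert would fail on the node instead.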

I am curious: how many people are using pgcluster in a production
environment?  Perhaps these problems are isolated to the most recent
version?  While the software does work as long as all nodes are
carefully brought up without load in an idle setting, the current
version really doesn't seem to handle error conditions.  That is too
bad, because the whole point of such software is to handle those error
conditions.

										(-: Tom ;-)



On Jun 20, 2007, at 10:27 PM, Pshem Kowalczyk wrote:

> Hi,
>
> I have 3 data nodes, 1 replicator and 1 loadbalancer. PostgreSQL 8.2.4
> and the corresponding patch.
> The whole setup works nicely when all nodes are online, but when I
> ifdown one of the interfaces on the data nodes, things go really badly.
>
> Scenario
> - I start a simple Perl script on the loadbalancer inserting one row
> per second and printing times of inserts
> - I shut down the network interface on data3 - the inserts stop
> - I bring the network interface back up - inserts resume
> - database on data1 doesn't have all the rows:
>
> testdb1=> select count(*) from t1;
> count
> -------
>    13
> (1 row)
>
> but it's not in read-only mode:
>
> testdb1=> insert into t1 values (24, 'row 24');
> INSERT 0 1
>
> and the changes replicate to other nodes
>
> - database on data2 has all the rows (and is in read-write mode):
>
> testdb1=> select count(*) from t1;
> count
> -------
>    23
> (1 row)
>
> - database on data3 doesn't have all the rows (and is in read-only mode)
>
> testdb1=# select count(*) from t1;
> count
> -------
>    13
> (1 row)
>
> But accepts changes from other nodes.
>
> Obviously the behaviour of data1 is unacceptable - even if it got
> de-synced by accident, it should have been switched into read-only mode.
>
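
The rule stated above (a node that has fallen behind should not accept direct writes) can be checked mechanically. This is a minimal sketch using the row counts and modes reported in this message; the function name and the dictionary layout are made up for illustration.

```python
def writable_but_behind(nodes):
    """Return names of nodes that hold fewer rows than the fullest node
    yet still accept direct writes.

    `nodes` maps a node name to a (row_count, read_only) pair.
    """
    most_rows = max(count for count, _ in nodes.values())
    return sorted(
        name
        for name, (count, read_only) in nodes.items()
        if count < most_rows and not read_only
    )

# Counts and modes as reported above for table t1.
observed = {
    "data1": (13, False),  # missing rows, still read-write
    "data2": (23, False),  # complete, read-write
    "data3": (13, True),   # missing rows, read-only
}

print(writable_but_behind(observed))
```

With these numbers the only node flagged is data1, which is behind and still read-write.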
> Log from the inserting script:
> # ./insert.pl
> 1182401599 row 1
> 1182401600 row 2
> 1182401601 row 3
> 1182401602 row 4
> 1182401603 row 5
> 1182401604 row 6
> 1182401605 row 7
> 1182401606 row 8
> 1182401607 row 9
> 1182401608 row 10
> 1182401609 row 11
> 1182401610 row 12
> 1182401625 row 13 <==== please notice the 15 sec gap
> 1182401626 row 14
> 1182401627 row 15
> 1182401628 row 16
> 1182401629 row 17
> 1182401630 row 18
> 1182401631 row 19
> 1182401632 row 20
> 1182401633 row 21
> 1182401634 row 22
> 1182401635 row 23
>
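
A gap like this is easy to spot automatically. Here is a minimal sketch that assumes log lines in the "<epoch seconds> row <n>" form printed above; the function name and the tolerance parameter are made up for illustration and are not part of insert.pl.

```python
def find_gaps(log_lines, expected_interval=1, tolerance=1):
    """Return (previous_row, row, seconds) for every pair of consecutive
    inserts that are further apart than expected_interval + tolerance."""
    gaps = []
    prev = None  # (timestamp, row number) of the previous insert
    for line in log_lines:
        parts = line.split()
        # Skip anything that is not "<epoch> row <n> ..."
        if len(parts) < 3 or not parts[0].isdigit() or not parts[2].isdigit():
            continue
        ts, row = int(parts[0]), int(parts[2])
        if prev is not None and ts - prev[0] > expected_interval + tolerance:
            gaps.append((prev[1], row, ts - prev[0]))
        prev = (ts, row)
    return gaps

# A few lines from the log above.
sample = [
    "1182401609 row 11",
    "1182401610 row 12",
    "1182401625 row 13 <==== please notice the 15 sec gap",
    "1182401626 row 14",
]

for before, after, seconds in find_gaps(sample):
    print(f"{seconds}s between row {before} and row {after}")
```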
>
> If I shut down data3 gracefully, it gets removed properly and
> everything seems to work OK.
>
>
>
> My configuration is below:
> /etc/hosts
> 127.0.0.1       localhost
>
> 10.23.254.115   loadbalancer
> 10.23.254.116   replicator
> 10.23.254.117   data1
> 10.23.254.118   data2
> 10.23.254.119   data3
>
>
> # The following lines are desirable for IPv6 capable hosts
> ::1     ip6-localhost ip6-loopback
> fe00::0 ip6-localnet
> ff00::0 ip6-mcastprefix
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
> ff02::3 ip6-allhosts
>
>
> pglb.conf
> <Cluster_Server_Info>
>    <Host_Name>                 data1                   </Host_Name>
>    <Port>                      5432                    </Port>
>    <Max_Connect>               32                      </Max_Connect>
> </Cluster_Server_Info>
> <Cluster_Server_Info>
>    <Host_Name>                 data2                   </Host_Name>
>    <Port>                      5432                    </Port>
>    <Max_Connect>               32                      </Max_Connect>
> </Cluster_Server_Info>
> <Cluster_Server_Info>
>    <Host_Name>                 data3                   </Host_Name>
>    <Port>                      5432                    </Port>
>    <Max_Connect>               32                      </Max_Connect>
> </Cluster_Server_Info>
> <Host_Name>                     loadbalancer                    </Host_Name>
> <Backend_Socket_Dir>            /var/run/postgresql/            </Backend_Socket_Dir>
> <Receive_Port>                  5432                            </Receive_Port>
> <Recovery_Port>                 6001                            </Recovery_Port>
> <Max_Cluster_Num>               128                             </Max_Cluster_Num>
> <Use_Connection_Pooling>        no                              </Use_Connection_Pooling>
> <LifeCheck_Timeout>             1s                              </LifeCheck_Timeout>
> <LifeCheck_Interval>            2s                              </LifeCheck_Interval>
> <Log_File_Info>
>        <File_Name>             /tmp/pglb.log   </File_Name>
>        <File_Size>             1M              </File_Size>
>        <Rotate>                3               </Rotate>
> </Log_File_Info>
>
> pgreplicate.conf
> <Cluster_Server_Info>
>    <Host_Name>                 data1           </Host_Name>
>    <Port>                      5432                    </Port>
>    <Max_Connect>               32                      </Max_Connect>
>    <Recovery_Port>             7001            </Recovery_Port>
> </Cluster_Server_Info>
> <Cluster_Server_Info>
>    <Host_Name>                 data2           </Host_Name>
>    <Port>                      5432                    </Port>
>    <Max_Connect>               32                      </Max_Connect>
>    <Recovery_Port>             7001            </Recovery_Port>
> </Cluster_Server_Info>
> <Cluster_Server_Info>
>    <Host_Name>                 data3           </Host_Name>
>    <Port>                      5432                    </Port>
>    <Max_Connect>               32                      </Max_Connect>
>    <Recovery_Port>             7001            </Recovery_Port>
> </Cluster_Server_Info>
> <LoadBalance_Server_Info>
>        <Host_Name>             loadbalancer                    </Host_Name>
>        <Recovery_Port>         6001                            </Recovery_Port>
> </LoadBalance_Server_Info>
> <Host_Name>                     replicator              </Host_Name>
> <Replication_Port>              8001                    </Replication_Port>
> <Recovery_Port>                 8101                    </Recovery_Port>
> <RLOG_Port>                     8301                    </RLOG_Port>
> <Response_Mode>                 normal                  </Response_Mode>
> <Use_Replication_Log>           no                      </Use_Replication_Log>
> <Replication_Timeout>           10s                     </Replication_Timeout>
> <LifeCheck_Timeout>             2s                      </LifeCheck_Timeout>
> <LifeCheck_Interval>            3s                      </LifeCheck_Interval>
> <Log_File_Info>
>        <File_Name>             /tmp/pgreplicate.log    </File_Name>
>        <File_Size>             1M                      </File_Size>
>        <Rotate>                3                       </Rotate>
> </Log_File_Info>
>
>
> data1:
> <Replicate_Server_Info>
>        <Host_Name>             replicator                      </Host_Name>
>        <Port>                  8001                            </Port>
>        <Recovery_Port>         8101                            </Recovery_Port>
> </Replicate_Server_Info>
> <Host_Name>                     data1                           </Host_Name>
> <Recovery_Port>                 7001                            </Recovery_Port>
> <Rsync_Path>                    /usr/bin/rsync                  </Rsync_Path>
> <Rsync_Option>                  ssh                             </Rsync_Option>
> <Rsync_Compress>                yes                             </Rsync_Compress>
> <Pg_Dump_Path>                  /usr/bin/pg_dump                </Pg_Dump_Path>
> <When_Stand_Alone>              read_only                       </When_Stand_Alone>
> <Replication_Timeout>           10s                             </Replication_Timeout>
> <LifeCheck_Timeout>             2s                              </LifeCheck_Timeout>
> <LifeCheck_Interval>            3s                              </LifeCheck_Interval>
>
> data2:
> <Replicate_Server_Info>
>        <Host_Name>             replicator                      </Host_Name>
>        <Port>                  8001                            </Port>
>        <Recovery_Port>         8101                            </Recovery_Port>
> </Replicate_Server_Info>
> <Host_Name>                     data2                           </Host_Name>
> <Recovery_Port>                 7001                            </Recovery_Port>
> <Rsync_Path>                    /usr/bin/rsync                  </Rsync_Path>
> <Rsync_Option>                  ssh                             </Rsync_Option>
> <Rsync_Compress>                yes                             </Rsync_Compress>
> <Pg_Dump_Path>                  /usr/bin/pg_dump                </Pg_Dump_Path>
> <When_Stand_Alone>              read_only                       </When_Stand_Alone>
> <Replication_Timeout>           10s                             </Replication_Timeout>
> <LifeCheck_Timeout>             2s                              </LifeCheck_Timeout>
> <LifeCheck_Interval>            3s                              </LifeCheck_Interval>
>
> data3:
> <Replicate_Server_Info>
>        <Host_Name>             replicator                      </Host_Name>
>        <Port>                  8001                            </Port>
>        <Recovery_Port>         8101                            </Recovery_Port>
> </Replicate_Server_Info>
> <Host_Name>                     data3                           </Host_Name>
> <Recovery_Port>                 7001                            </Recovery_Port>
> <Rsync_Path>                    /usr/bin/rsync                  </Rsync_Path>
> <Rsync_Option>                  ssh                             </Rsync_Option>
> <Rsync_Compress>                yes                             </Rsync_Compress>
> <Pg_Dump_Path>                  /usr/bin/pg_dump                </Pg_Dump_Path>
> <When_Stand_Alone>              read_only                       </When_Stand_Alone>
> <Replication_Timeout>           10s                             </Replication_Timeout>
> <LifeCheck_Timeout>             2s                              </LifeCheck_Timeout>
> <LifeCheck_Interval>            3s                              </LifeCheck_Interval>
>
>
> Log files of pglb and pgreplicator are attached.
>
> kind regards
> Pshem
> <pgrepliacate-failed.log>
> <pglb-failed.log>


