[Pgcluster-general] Problems after data nodes looses connectivity
Tom Seago
tom at sipcall.com
Thu Jun 21 08:48:56 UTC 2007
In the limited testing that I have been able to do, I have seen the
same results. I was testing with 1 lb, 1 replicator, and 2 db
nodes. When I would shut down one of the db nodes and then restart
it (by killing the process and then restarting the process),
depending on the timing I was able to get both missing records in the
db that was shut down as well as duplicate inserts into that
database. Since we have a somewhat urgent need for a multi-node
database we are going to be moving on to other technology, but will
keep an eye on the state of this project.
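
For anyone who wants to reproduce this check, here is a minimal sketch of how the two symptoms could be spotted once the primary-key values have been pulled from each node. The function name `compare_nodes` and the sample id lists are hypothetical; they are not part of pgcluster. It also assumes the table has no unique constraint, since otherwise duplicate inserts would be rejected rather than stored.

```python
from collections import Counter


def compare_nodes(reference_ids, restarted_ids):
    """Return (missing, duplicated) ids on the restarted node.

    reference_ids: ids read from a node that stayed up (assumed complete).
    restarted_ids: ids read from the node that was killed and restarted.
    """
    counts = Counter(restarted_ids)
    # Ids present on the reference node but absent on the restarted one.
    missing = sorted(set(reference_ids) - set(counts))
    # Ids that were inserted more than once on the restarted node.
    duplicated = sorted(i for i, n in counts.items() if n > 1)
    return missing, duplicated


# Hypothetical id lists, for illustration only.
reference = [1, 2, 3, 4, 5, 6]
restarted = [1, 2, 4, 4, 6]
missing, duplicated = compare_nodes(reference, restarted)
print("missing:", missing)
print("duplicated:", duplicated)
```

In practice the id lists would come from something like `select id from t1` run against each node directly, not through the load balancer.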
I am curious, how many people are using pgcluster in a production
environment? Perhaps these problems are isolated to the most recent
version? While the software does work as long as all nodes are
carefully brought up without load in an idle setting, the current
version really doesn't seem to handle error conditions. This is too
bad because the whole point of such software is to handle those error
conditions.
(-: Tom ;-)
On Jun 20, 2007, at 10:27 PM, Pshem Kowalczyk wrote:
> Hi,
>
> I have 3 data nodes, 1 replicator and 1 loadbalancer. Postgresql 8.2.4
> and corresponding patch.
> The whole setup works nicely when all nodes are online, but when I
> ifdown one of the interfaces on the data nodes - things go really bad.
>
> Scenario
> - I start a simple perl script on the loadbalancer inserting one row
> per second and printing times of inserts
> - I shut down the network interface in data3 - the inserts stop
> - I unshut the network interface - inserts resume
> - database on data1 doesn't have all the rows:
>
> testdb1=> select count(*) from t1;
> count
> -------
> 13
> (1 row)
>
> but it's not in read-only mode:
>
> testdb1=> insert into t1 values (24, 'row 24');
> INSERT 0 1
>
> and the changes replicate to other nodes
>
> - database on data2 has all the rows (and is in read-write mode):
>
> testdb1=> select count(*) from t1;
> count
> -------
> 23
> (1 row)
>
> - database on data3 doesn't have all the rows (and is in read-only
> mode)
>
> testdb1=# select count(*) from t1;
> count
> -------
> 13
> (1 row)
>
> But accepts changes from other nodes.
>
> Obviously the behaviour of data1 is unacceptable - even if it got
> de-synced by accident it should get switched into read-only mode.
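
The rule described above (a node that has fallen behind should not stay writable) can be sketched as a small check over the numbers reported in this message. This is only an illustration: `unsafe_nodes` is a hypothetical helper, and it assumes the node with the highest row count holds the complete data set, which need not be true in general.

```python
def unsafe_nodes(row_counts, read_only):
    """Return nodes that are behind the fullest node yet still accept writes."""
    # Assumption: the highest row count represents the complete data set.
    expected = max(row_counts.values())
    return sorted(
        name
        for name, count in row_counts.items()
        if count < expected and not read_only[name]
    )


# Row counts and modes as reported in the message above.
counts = {"data1": 13, "data2": 23, "data3": 13}
read_only = {"data1": False, "data2": False, "data3": True}
print(unsafe_nodes(counts, read_only))
```

With these values only data1 is flagged: data3 is also behind, but it is in read-only mode.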
>
> Log from the inserting script:
> # ./insert.pl
> 1182401599 row 1
> 1182401600 row 2
> 1182401601 row 3
> 1182401602 row 4
> 1182401603 row 5
> 1182401604 row 6
> 1182401605 row 7
> 1182401606 row 8
> 1182401607 row 9
> 1182401608 row 10
> 1182401609 row 11
> 1182401610 row 12
> 1182401625 row 13 <==== please notice the 15 sec gap
> 1182401626 row 14
> 1182401627 row 15
> 1182401628 row 16
> 1182401629 row 17
> 1182401630 row 18
> 1182401631 row 19
> 1182401632 row 20
> 1182401633 row 21
> 1182401634 row 22
> 1182401635 row 23
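
A stall like the one marked in the log can be picked out mechanically. The sketch below assumes each log line has the form `<unix timestamp> row <n>`, as printed by the script above; `find_gaps` and the `max_gap` parameter are hypothetical names chosen for this example.

```python
def find_gaps(log_lines, max_gap=1):
    """Return (previous_row, row, seconds) for each pause longer than max_gap."""
    gaps = []
    prev_ts = prev_row = None
    for line in log_lines:
        # Drop any mail-quote prefix, then split into fields.
        parts = line.lstrip("> ").split()
        # Skip anything that is not "<timestamp> row <n> ...".
        if len(parts) < 3 or not parts[0].isdigit() or not parts[2].isdigit():
            continue
        ts, row = int(parts[0]), int(parts[2])
        if prev_ts is not None and ts - prev_ts > max_gap:
            gaps.append((prev_row, row, ts - prev_ts))
        prev_ts, prev_row = ts, row
    return gaps


# A few lines taken from the log above.
sample = [
    "1182401609 row 11",
    "1182401610 row 12",
    "1182401625 row 13 <==== please notice the 15 sec gap",
    "1182401626 row 14",
]
print(find_gaps(sample))
```

Since the script inserts one row per second, any reported gap corresponds to a period in which inserts through the load balancer were blocked.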
>
>
> If I shut down data3 gracefully it gets removed properly and
> everything seems to work OK.
>
>
>
> My configuration is below:
> /etc/hosts
> 127.0.0.1 localhost
>
> 10.23.254.115 loadbalancer
> 10.23.254.116 replicator
> 10.23.254.117 data1
> 10.23.254.118 data2
> 10.23.254.119 data3
>
>
> # The following lines are desirable for IPv6 capable hosts
> ::1 ip6-localhost ip6-loopback
> fe00::0 ip6-localnet
> ff00::0 ip6-mcastprefix
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
> ff02::3 ip6-allhosts
>
>
> pglb.conf
> <Cluster_Server_Info>
> <Host_Name> data1 </Host_Name>
> <Port> 5432 </Port>
> <Max_Connect> 32 </Max_Connect>
> </Cluster_Server_Info>
> <Cluster_Server_Info>
> <Host_Name> data2 </Host_Name>
> <Port> 5432 </Port>
> <Max_Connect> 32 </Max_Connect>
> </Cluster_Server_Info>
> <Cluster_Server_Info>
> <Host_Name> data3 </Host_Name>
> <Port> 5432 </Port>
> <Max_Connect> 32 </Max_Connect>
> </Cluster_Server_Info>
> <Host_Name> loadbalancer </Host_Name>
> <Backend_Socket_Dir> /var/run/postgresql/ </Backend_Socket_Dir>
> <Receive_Port> 5432 </Receive_Port>
> <Recovery_Port> 6001 </Recovery_Port>
> <Max_Cluster_Num> 128 </Max_Cluster_Num>
> <Use_Connection_Pooling> no </Use_Connection_Pooling>
> <LifeCheck_Timeout> 1s </LifeCheck_Timeout>
> <LifeCheck_Interval> 2s </LifeCheck_Interval>
> <Log_File_Info>
> <File_Name> /tmp/pglb.log </File_Name>
> <File_Size> 1M </File_Size>
> <Rotate> 3 </Rotate>
> </Log_File_Info>
>
> pgreplicate.conf
> <Cluster_Server_Info>
> <Host_Name> data1 </Host_Name>
> <Port> 5432 </Port>
> <Max_Connect> 32 </Max_Connect>
> <Recovery_Port> 7001 </Recovery_Port>
> </Cluster_Server_Info>
> <Cluster_Server_Info>
> <Host_Name> data2 </Host_Name>
> <Port> 5432 </Port>
> <Max_Connect> 32 </Max_Connect>
> <Recovery_Port> 7001 </Recovery_Port>
> </Cluster_Server_Info>
> <Cluster_Server_Info>
> <Host_Name> data3 </Host_Name>
> <Port> 5432 </Port>
> <Max_Connect> 32 </Max_Connect>
> <Recovery_Port> 7001 </Recovery_Port>
> </Cluster_Server_Info>
> <LoadBalance_Server_Info>
> <Host_Name> loadbalancer </Host_Name>
> <Recovery_Port> 6001 </Recovery_Port>
> </LoadBalance_Server_Info>
> <Host_Name> replicator </Host_Name>
> <Replication_Port> 8001 </Replication_Port>
> <Recovery_Port> 8101 </Recovery_Port>
> <RLOG_Port> 8301 </RLOG_Port>
> <Response_Mode> normal </Response_Mode>
> <Use_Replication_Log> no </Use_Replication_Log>
> <Replication_Timeout> 10s </Replication_Timeout>
> <LifeCheck_Timeout> 2s </LifeCheck_Timeout>
> <LifeCheck_Interval> 3s </LifeCheck_Interval>
> <Log_File_Info>
> <File_Name> /tmp/pgreplicate.log </File_Name>
> <File_Size> 1M </File_Size>
> <Rotate> 3 </Rotate>
> </Log_File_Info>
>
>
> data1:
> <Replicate_Server_Info>
> <Host_Name> replicator </Host_Name>
> <Port> 8001 </Port>
> <Recovery_Port> 8101 </Recovery_Port>
> </Replicate_Server_Info>
> <Host_Name> data1 </Host_Name>
> <Recovery_Port> 7001 </Recovery_Port>
> <Rsync_Path> /usr/bin/rsync </Rsync_Path>
> <Rsync_Option> ssh </Rsync_Option>
> <Rsync_Compress> yes </Rsync_Compress>
> <Pg_Dump_Path> /usr/bin/pg_dump </Pg_Dump_Path>
> <When_Stand_Alone> read_only </When_Stand_Alone>
> <Replication_Timeout> 10s </Replication_Timeout>
> <LifeCheck_Timeout> 2s </LifeCheck_Timeout>
> <LifeCheck_Interval> 3s </LifeCheck_Interval>
>
> data2:
> <Replicate_Server_Info>
> <Host_Name> replicator </Host_Name>
> <Port> 8001 </Port>
> <Recovery_Port> 8101 </Recovery_Port>
> </Replicate_Server_Info>
> <Host_Name> data2 </Host_Name>
> <Recovery_Port> 7001 </Recovery_Port>
> <Rsync_Path> /usr/bin/rsync </Rsync_Path>
> <Rsync_Option> ssh </Rsync_Option>
> <Rsync_Compress> yes </Rsync_Compress>
> <Pg_Dump_Path> /usr/bin/pg_dump </Pg_Dump_Path>
> <When_Stand_Alone> read_only </When_Stand_Alone>
> <Replication_Timeout> 10s </Replication_Timeout>
> <LifeCheck_Timeout> 2s </LifeCheck_Timeout>
> <LifeCheck_Interval> 3s </LifeCheck_Interval>
>
> data3:
> <Replicate_Server_Info>
> <Host_Name> replicator </Host_Name>
> <Port> 8001 </Port>
> <Recovery_Port> 8101 </Recovery_Port>
> </Replicate_Server_Info>
> <Host_Name> data3 </Host_Name>
> <Recovery_Port> 7001 </Recovery_Port>
> <Rsync_Path> /usr/bin/rsync </Rsync_Path>
> <Rsync_Option> ssh </Rsync_Option>
> <Rsync_Compress> yes </Rsync_Compress>
> <Pg_Dump_Path> /usr/bin/pg_dump </Pg_Dump_Path>
> <When_Stand_Alone> read_only </When_Stand_Alone>
> <Replication_Timeout> 10s </Replication_Timeout>
> <LifeCheck_Timeout> 2s </LifeCheck_Timeout>
> <LifeCheck_Interval> 3s </LifeCheck_Interval>
>
>
> Log files of pglb and pgreplicator are attached.
>
> kind regards
> Pshem
> <pgrepliacate-failed.log>
> <pglb-failed.log>