[Pgcluster-general] Problems after data nodes lose connectivity
Gábriel Ákos
akos.gabriel at i-logic.hu
Thu Jun 21 08:54:38 UTC 2007
On Thu, 21 Jun 2007 01:48:56 -0700
Tom Seago <tom at sipcall.com> wrote:
This happens because everyone consistently fails to recognise that
pgcluster relies on uniqueness of rows in a table.
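
One reading of the point above is that each replicated table should carry a primary key or unique constraint, so that a statement replayed after a node comes back cannot create a second copy of a row. A minimal sketch of that effect follows. It uses SQLite as a stand-in for PostgreSQL, and the column names `id` and `label` and the helper `apply_insert` are assumptions, not taken from the thread.

```python
import sqlite3

# SQLite stands in for PostgreSQL here; the column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, label TEXT)")


def apply_insert(conn, row_id, label):
    """Apply one insert; return False if a row with that key already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO t1 (id, label) VALUES (?, ?)", (row_id, label)
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key rejects the duplicate instead of storing it twice.
        return False


first = apply_insert(conn, 24, "row 24")
replayed = apply_insert(conn, 24, "row 24")  # same statement applied again
row_count = conn.execute("SELECT count(*) FROM t1").fetchone()[0]
print(first, replayed, row_count)
```

Without the key, the second insert would succeed and leave a duplicate row behind.
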
> In the limited testing that I have been able to do, I have seen the
> same results. I was testing with 1 lb, 1 replicator, and 2 db
> nodes. When I would shut down one of the db nodes and then restart
> it (by killing the process and then restarting the process),
> depending on the timing I was able to get both missing records in
> the db that was shut down as well as duplicate inserts into that
> database. Since we have a somewhat urgent need for a multi-node
> database we are going to be moving on to other technology, but will
> keep an eye on the state of this project.
>
> I am curious, how many people are using pgcluster in a production
> environment? Perhaps these problems are isolated to the most recent
> version? While the software does work as long as all nodes are
> carefully brought up without load in an idle setting, the current
> version really doesn't seem to handle error conditions. This is too
> bad, because the whole point of such software is to handle those
> error conditions.
>
> (-:
> Tom ;-)
>
>
>
> On Jun 20, 2007, at 10:27 PM, Pshem Kowalczyk wrote:
>
> > Hi,
> >
> > I have 3 data nodes, 1 replicator and 1 loadbalancer. Postgresql
> > 8.2.4 and corresponding patch.
> > The whole setup works nicely when all nodes are online, but when I
> > ifdown one of the interfaces on the data nodes - things go really
> > bad.
> >
> > Scenario
> > - I start a simple perl script on the loadbalancer inserting one row
> > per second and printing times of inserts
> > - I shut down the network interface in data3 - the inserts stop
> > - I bring the network interface back up - inserts resume
> > - database on data1 doesn't have all the rows:
> >
> > testdb1=> select count(*) from t1;
> > count
> > -------
> > 13
> > (1 row)
> >
> > but it's not in read-only mode:
> >
> > testdb1=> insert into t1 values (24, 'row 24');
> > INSERT 0 1
> >
> > and the changes replicate to other nodes
> >
> > - database on data2 has all the rows (and is in read-write mode):
> >
> > testdb1=> select count(*) from t1;
> > count
> > -------
> > 23
> > (1 row)
> >
> > - database on data3 doesn't have all the rows (and is in read-only
> > mode)
> >
> > testdb1=# select count(*) from t1;
> > count
> > -------
> > 13
> > (1 row)
> >
> > But accepts changes from other nodes.
> >
> > Obviously the behaviour of data1 is unacceptable - even if it got
> > de-synced by accident it should get switched into read-only mode.
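
The per-node comparison above can be sketched as a small check that flags any node whose row count is below the highest count seen. The counts are the ones reported in this message; the function name `find_lagging_nodes` is hypothetical, and collecting the counts from each node is left out.

```python
def find_lagging_nodes(counts):
    """Return the nodes whose row count is below the highest count seen."""
    highest = max(counts.values())
    return sorted(node for node, n in counts.items() if n < highest)


# Row counts of t1 as reported for each data node in this message.
counts = {"data1": 13, "data2": 23, "data3": 13}
lagging = find_lagging_nodes(counts)
print(lagging)
```

A row count alone only shows that nodes differ; it does not show which rows are missing.
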
> >
> > Log from the inserting script:
> > # ./insert.pl
> > 1182401599 row 1
> > 1182401600 row 2
> > 1182401601 row 3
> > 1182401602 row 4
> > 1182401603 row 5
> > 1182401604 row 6
> > 1182401605 row 7
> > 1182401606 row 8
> > 1182401607 row 9
> > 1182401608 row 10
> > 1182401609 row 11
> > 1182401610 row 12
> > 1182401625 row 13 <==== please notice the 15 sec gap
> > 1182401626 row 14
> > 1182401627 row 15
> > 1182401628 row 16
> > 1182401629 row 17
> > 1182401630 row 18
> > 1182401631 row 19
> > 1182401632 row 20
> > 1182401633 row 21
> > 1182401634 row 22
> > 1182401635 row 23
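
The stall in the log above can also be found mechanically. The sketch below assumes each line has the form `<unix timestamp> row <n>`, as in the output shown; the function name `find_gaps` and the one-second threshold are assumptions.

```python
def find_gaps(log_lines, max_gap=1):
    """Return (row, gap_seconds) for every insert that came late."""
    gaps = []
    previous = None
    for line in log_lines:
        fields = line.split()
        # Skip anything that is not "<timestamp> row <n> ...".
        if len(fields) < 3 or not (fields[0].isdigit() and fields[2].isdigit()):
            continue
        timestamp, row = int(fields[0]), int(fields[2])
        if previous is not None and timestamp - previous > max_gap:
            gaps.append((row, timestamp - previous))
        previous = timestamp
    return gaps


sample = [
    "1182401609 row 11",
    "1182401610 row 12",
    "1182401625 row 13 <==== please notice the 15 sec gap",
    "1182401626 row 14",
]
gaps = find_gaps(sample)
print(gaps)
```
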
> >
> >
> > If I shut down data3 gracefully it gets removed properly and
> > everything seems to work OK.
> >
> >
> >
> > My configuration is below:
> > /etc/hosts
> > 127.0.0.1 localhost
> >
> > 10.23.254.115 loadbalancer
> > 10.23.254.116 replicator
> > 10.23.254.117 data1
> > 10.23.254.118 data2
> > 10.23.254.119 data3
> >
> >
> > # The following lines are desirable for IPv6 capable hosts
> > ::1 ip6-localhost ip6-loopback
> > fe00::0 ip6-localnet
> > ff00::0 ip6-mcastprefix
> > ff02::1 ip6-allnodes
> > ff02::2 ip6-allrouters
> > ff02::3 ip6-allhosts
> >
> >
> > pglb.conf
> > <Cluster_Server_Info>
> > <Host_Name> data1 </Host_Name>
> > <Port> 5432 </Port>
> > <Max_Connect> 32 </Max_Connect>
> > </Cluster_Server_Info>
> > <Cluster_Server_Info>
> > <Host_Name> data2 </Host_Name>
> > <Port> 5432 </Port>
> > <Max_Connect> 32 </Max_Connect>
> > </Cluster_Server_Info>
> > <Cluster_Server_Info>
> > <Host_Name> data3 </Host_Name>
> > <Port> 5432 </Port>
> > <Max_Connect> 32 </Max_Connect>
> > </Cluster_Server_Info>
> > <Host_Name> loadbalancer </Host_Name>
> > <Backend_Socket_Dir> /var/run/postgresql/ </Backend_Socket_Dir>
> > <Receive_Port> 5432 </Receive_Port>
> > <Recovery_Port> 6001 </Recovery_Port>
> > <Max_Cluster_Num> 128 </Max_Cluster_Num>
> > <Use_Connection_Pooling> no </Use_Connection_Pooling>
> > <LifeCheck_Timeout> 1s </LifeCheck_Timeout>
> > <LifeCheck_Interval> 2s </LifeCheck_Interval>
> > <Log_File_Info>
> > <File_Name> /tmp/pglb.log </File_Name>
> > <File_Size> 1M </File_Size>
> > <Rotate> 3 </Rotate>
> > </Log_File_Info>
> >
> > pgreplicate.conf
> > <Cluster_Server_Info>
> > <Host_Name> data1 </Host_Name>
> > <Port> 5432 </Port>
> > <Max_Connect> 32 </Max_Connect>
> > <Recovery_Port> 7001 </Recovery_Port>
> > </Cluster_Server_Info>
> > <Cluster_Server_Info>
> > <Host_Name> data2 </Host_Name>
> > <Port> 5432 </Port>
> > <Max_Connect> 32 </Max_Connect>
> > <Recovery_Port> 7001 </Recovery_Port>
> > </Cluster_Server_Info>
> > <Cluster_Server_Info>
> > <Host_Name> data3 </Host_Name>
> > <Port> 5432 </Port>
> > <Max_Connect> 32 </Max_Connect>
> > <Recovery_Port> 7001 </Recovery_Port>
> > </Cluster_Server_Info>
> > <LoadBalance_Server_Info>
> > <Host_Name> loadbalancer </Host_Name>
> > <Recovery_Port> 6001 </Recovery_Port>
> > </LoadBalance_Server_Info>
> > <Host_Name> replicator </Host_Name>
> > <Replication_Port> 8001 </Replication_Port>
> > <Recovery_Port> 8101 </Recovery_Port>
> > <RLOG_Port> 8301 </RLOG_Port>
> > <Response_Mode> normal </Response_Mode>
> > <Use_Replication_Log> no </Use_Replication_Log>
> > <Replication_Timeout> 10s </Replication_Timeout>
> > <LifeCheck_Timeout> 2s </LifeCheck_Timeout>
> > <LifeCheck_Interval> 3s </LifeCheck_Interval>
> > <Log_File_Info>
> > <File_Name> /tmp/pgreplicate.log </File_Name>
> > <File_Size> 1M </File_Size>
> > <Rotate> 3 </Rotate>
> > </Log_File_Info>
> >
> >
> > data1:
> > <Replicate_Server_Info>
> > <Host_Name> replicator </Host_Name>
> > <Port> 8001 </Port>
> > <Recovery_Port> 8101 </Recovery_Port>
> > </Replicate_Server_Info>
> > <Host_Name> data1 </Host_Name>
> > <Recovery_Port> 7001 </Recovery_Port>
> > <Rsync_Path> /usr/bin/rsync </Rsync_Path>
> > <Rsync_Option> ssh </Rsync_Option>
> > <Rsync_Compress> yes </Rsync_Compress>
> > <Pg_Dump_Path> /usr/bin/pg_dump </Pg_Dump_Path>
> > <When_Stand_Alone> read_only </When_Stand_Alone>
> > <Replication_Timeout> 10s </Replication_Timeout>
> > <LifeCheck_Timeout> 2s </LifeCheck_Timeout>
> > <LifeCheck_Interval> 3s </LifeCheck_Interval>
> >
> > data2:
> > <Replicate_Server_Info>
> > <Host_Name> replicator </Host_Name>
> > <Port> 8001 </Port>
> > <Recovery_Port> 8101 </Recovery_Port>
> > </Replicate_Server_Info>
> > <Host_Name> data2 </Host_Name>
> > <Recovery_Port> 7001 </Recovery_Port>
> > <Rsync_Path> /usr/bin/rsync </Rsync_Path>
> > <Rsync_Option> ssh </Rsync_Option>
> > <Rsync_Compress> yes </Rsync_Compress>
> > <Pg_Dump_Path> /usr/bin/pg_dump </Pg_Dump_Path>
> > <When_Stand_Alone> read_only </When_Stand_Alone>
> > <Replication_Timeout> 10s </Replication_Timeout>
> > <LifeCheck_Timeout> 2s </LifeCheck_Timeout>
> > <LifeCheck_Interval> 3s </LifeCheck_Interval>
> >
> > data3:
> > <Replicate_Server_Info>
> > <Host_Name> replicator </Host_Name>
> > <Port> 8001 </Port>
> > <Recovery_Port> 8101 </Recovery_Port>
> > </Replicate_Server_Info>
> > <Host_Name> data3 </Host_Name>
> > <Recovery_Port> 7001 </Recovery_Port>
> > <Rsync_Path> /usr/bin/rsync </Rsync_Path>
> > <Rsync_Option> ssh </Rsync_Option>
> > <Rsync_Compress> yes </Rsync_Compress>
> > <Pg_Dump_Path> /usr/bin/pg_dump </Pg_Dump_Path>
> > <When_Stand_Alone> read_only </When_Stand_Alone>
> > <Replication_Timeout> 10s </Replication_Timeout>
> > <LifeCheck_Timeout> 2s </LifeCheck_Timeout>
> > <LifeCheck_Interval> 3s </LifeCheck_Interval>
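
Since the three data-node configurations are meant to differ only in the host name, one way to check them is to read the flat `<Tag> value </Tag>` pairs into a dictionary and compare. The sketch below is not part of pgcluster; the regular expression, the function name `parse_settings`, and the trimmed sample text are assumptions for illustration, and nested sections are not grouped.

```python
import re

# A trimmed copy of the top-level settings from one data node (illustrative).
SAMPLE = """
<Host_Name>            data1 </Host_Name>
<Recovery_Port>        7001 </Recovery_Port>
<When_Stand_Alone>     read_only </When_Stand_Alone>
<Replication_Timeout>  10s </Replication_Timeout>
<LifeCheck_Timeout>    2s </LifeCheck_Timeout>
<LifeCheck_Interval>   3s </LifeCheck_Interval>
"""

# Matches "<Tag> value </Tag>" where the value contains no further tags.
PAIR = re.compile(r"<(\w+)>\s*([^<>]*?)\s*</\s*\1\s*>")


def parse_settings(text):
    """Return a dict of tag name to value for every simple tag pair."""
    return {m.group(1): m.group(2) for m in PAIR.finditer(text)}


settings = parse_settings(SAMPLE)
print(settings["When_Stand_Alone"], settings["LifeCheck_Timeout"])
```

Comparing the dictionaries from two nodes then shows at a glance which settings differ.
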
> >
> >
> > Log files of pglb and pgreplicator are attached.
> >
> > kind regards
> > Pshem
> > <pgrepliacate-failed.log>
> > <pglb-failed.log>
> > _______________________________________________
> > Pgcluster-general mailing list
> > Pgcluster-general at pgfoundry.org
> > http://pgfoundry.org/mailman/listinfo/pgcluster-general
>
--
Best regards,
Gábriel Ákos
-=E-Mail :akos.gabriel at i-logic.hu|Web: http://www.i-logic.hu =-
-=Tel/fax:+3612367353 |Mobil:+36209278894 =-