[Pgcluster-general] Problems after a data node loses connectivity
Tom Seago
tom at sipcall.com
Thu Jun 21 14:50:12 UTC 2007
I could live with a requirement that the data rows be unique. From a
performance standpoint, primary key constraint violations are not
ideal, but they are tolerable. That doesn't help with missing data,
though.
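
To make the trade-off concrete, here is a minimal sketch that uses
Python's built-in sqlite3 module as a stand-in for a single data node.
It is not pgcluster itself, and the table t1 with columns id and label
is an assumption borrowed from the test quoted below. With a primary
key, a replayed insert is rejected rather than duplicated, but nothing
in the constraint notices a row that never arrived:

```python
import sqlite3

def replay_inserts(conn, rows):
    """Apply rows to t1; return how many were rejected as duplicates."""
    rejected = 0
    for row_id, label in rows:
        try:
            conn.execute(
                "INSERT INTO t1 (id, label) VALUES (?, ?)", (row_id, label)
            )
        except sqlite3.IntegrityError:
            # The primary key turns a replayed insert into an error.
            rejected += 1
    conn.commit()
    return rejected

# In-memory database standing in for one data node (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, label TEXT)")

# Rows 1 and 2 arrive, row 2 is replayed after a restart, row 3 is lost.
rejected = replay_inserts(conn, [(1, "row 1"), (2, "row 2"), (2, "row 2")])
count = conn.execute("SELECT count(*) FROM t1").fetchone()[0]
print(rejected, count)
```

The duplicate is caught at the cost of an error per replayed row, while
the lost row 3 leaves no trace at all.
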
I should also point out that this "requirement" is not clear in any
available documentation. It may be present in some doc somewhere,
but it certainly is not obvious. The lack of clear documentation is
another big reason that these issues have turned us away from the
product at this time.
(-: Tom ;-)
On Jun 21, 2007, at 1:54 AM, Gábriel Ákos wrote:
> On Thu, 21 Jun 2007 01:48:56 -0700
> Tom Seago <tom at sipcall.com> wrote:
>
> It is because everyone consistently fails to recognise that
> pgcluster relies on uniqueness of rows in a table.
>
>
>> In the limited testing that I have been able to do, I have seen the
>> same results. I was testing with 1 lb, 1 replicator, and 2 db
>> nodes. When I would shut down one of the db nodes and then restart
>> it (by killing the process and then restarting the process),
>> depending on the timing I was able to get both missing records in
>> the db that was shut down as well as duplicate inserts into that
>> database. Since we have a somewhat urgent need for a multi-node
>> database we are going to be moving on to other technology, but will
>> keep an eye on the state of this project.
>>
>> I am curious, how many people are using pgcluster in a production
>> environment? Perhaps these problems are isolated to the most recent
>> version? While the software does work as long as all nodes are
>> carefully brought up without load in an idle setting, the current
>> version really doesn't seem to handle error conditions. This is too
>> bad because the whole point of such software is to handle those
>> error conditions.
>>
>> (-:
>> Tom ;-)
>>
>>
>>
>> On Jun 20, 2007, at 10:27 PM, Pshem Kowalczyk wrote:
>>
>>> Hi,
>>>
>>> I have 3 data nodes, 1 replicator and 1 loadbalancer, running
>>> PostgreSQL 8.2.4 with the corresponding patch.
>>> The whole setup works nicely when all nodes are online, but when I
>>> ifdown one of the interfaces on the data nodes, things go really
>>> badly.
>>>
>>> Scenario
>>> - I start a simple perl script on the loadbalancer inserting one row
>>> per second and printing times of inserts
>>> - I shut down the network interface in data3 - the inserts stop
>>> - I bring the network interface back up - inserts resume
>>> - database on data1 doesn't have all the rows:
>>>
>>> testdb1=> select count(*) from t1;
>>> count
>>> -------
>>> 13
>>> (1 row)
>>>
>>> but it's not in read-only mode:
>>>
>>> testdb1=> insert into t1 values (24, 'row 24');
>>> INSERT 0 1
>>>
>>> and the changes replicate to other nodes
>>>
>>> - database on data2 has all the rows (and is in read-write mode):
>>>
>>> testdb1=> select count(*) from t1;
>>> count
>>> -------
>>> 23
>>> (1 row)
>>>
>>> - database on data3 doesn't have all the rows (and is in read-only
>>> mode)
>>>
>>> testdb1=# select count(*) from t1;
>>> count
>>> -------
>>> 13
>>> (1 row)
>>>
>>> But accepts changes from other nodes.
>>>
>>> Obviously the behaviour of data1 is unacceptable - even if it got
>>> de-synced by accident it should get switched into read-only mode.
>>>
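
One way to catch this kind of divergence from outside the cluster is to
compare per-node row counts and flag any node that is behind yet still
accepts writes. The sketch below hard-codes the counts and modes
reported in this message; in practice they would come from running
select count(*) on each node. The function name and the dict layout are
hypothetical, not anything pgcluster provides:

```python
def find_unsafe_nodes(nodes):
    """Return names of nodes that are missing rows but still accept writes."""
    # Treat the highest count seen as the expected one (an assumption).
    expected = max(info["rows"] for info in nodes.values())
    return sorted(
        name
        for name, info in nodes.items()
        if info["rows"] < expected and not info["read_only"]
    )

# Counts and modes as reported in this message.
nodes = {
    "data1": {"rows": 13, "read_only": False},
    "data2": {"rows": 23, "read_only": False},
    "data3": {"rows": 13, "read_only": True},
}
unsafe = find_unsafe_nodes(nodes)
print(unsafe)
```
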
>>> Log from the inserting script:
>>> # ./insert.pl
>>> 1182401599 row 1
>>> 1182401600 row 2
>>> 1182401601 row 3
>>> 1182401602 row 4
>>> 1182401603 row 5
>>> 1182401604 row 6
>>> 1182401605 row 7
>>> 1182401606 row 8
>>> 1182401607 row 9
>>> 1182401608 row 10
>>> 1182401609 row 11
>>> 1182401610 row 12
>>> 1182401625 row 13 <==== please notice the 15 sec gap
>>> 1182401626 row 14
>>> 1182401627 row 15
>>> 1182401628 row 16
>>> 1182401629 row 17
>>> 1182401630 row 18
>>> 1182401631 row 19
>>> 1182401632 row 20
>>> 1182401633 row 21
>>> 1182401634 row 22
>>> 1182401635 row 23
>>>
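
The stall can also be picked out of the log mechanically. A minimal
sketch, assuming each line has the form "<epoch seconds> row <n>" as in
the excerpt above and that inserts are expected once per second;
find_gaps is a hypothetical helper, not part of insert.pl:

```python
def find_gaps(log_lines, expected_interval=1):
    """Return (row_label, gap_seconds) for inserts that arrived late."""
    gaps = []
    previous = None
    for line in log_lines:
        parts = line.split()
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # skip anything that is not "<epoch> row <n>"
        timestamp = int(parts[0])
        if previous is not None and timestamp - previous > expected_interval:
            gaps.append((" ".join(parts[1:3]), timestamp - previous))
        previous = timestamp
    return gaps

# A few lines copied from the log above.
log = [
    "1182401609 row 11",
    "1182401610 row 12",
    "1182401625 row 13 <==== please notice the 15 sec gap",
    "1182401626 row 14",
]
gaps = find_gaps(log)
print(gaps)
```
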
>>>
>>> If I shut down data3 gracefully, it gets removed properly and
>>> everything seems to work OK.
>>>
>>>
>>>
>>> My configuration is below:
>>> /etc/hosts
>>> 127.0.0.1 localhost
>>>
>>> 10.23.254.115 loadbalancer
>>> 10.23.254.116 replicator
>>> 10.23.254.117 data1
>>> 10.23.254.118 data2
>>> 10.23.254.119 data3
>>>
>>>
>>> # The following lines are desirable for IPv6 capable hosts
>>> ::1 ip6-localhost ip6-loopback
>>> fe00::0 ip6-localnet
>>> ff00::0 ip6-mcastprefix
>>> ff02::1 ip6-allnodes
>>> ff02::2 ip6-allrouters
>>> ff02::3 ip6-allhosts
>>>
>>>
>>> pglb.conf
>>> <Cluster_Server_Info>
>>>     <Host_Name> data1 </Host_Name>
>>>     <Port> 5432 </Port>
>>>     <Max_Connect> 32 </Max_Connect>
>>> </Cluster_Server_Info>
>>> <Cluster_Server_Info>
>>>     <Host_Name> data2 </Host_Name>
>>>     <Port> 5432 </Port>
>>>     <Max_Connect> 32 </Max_Connect>
>>> </Cluster_Server_Info>
>>> <Cluster_Server_Info>
>>>     <Host_Name> data3 </Host_Name>
>>>     <Port> 5432 </Port>
>>>     <Max_Connect> 32 </Max_Connect>
>>> </Cluster_Server_Info>
>>> <Host_Name> loadbalancer </Host_Name>
>>> <Backend_Socket_Dir> /var/run/postgresql/ </Backend_Socket_Dir>
>>> <Receive_Port> 5432 </Receive_Port>
>>> <Recovery_Port> 6001 </Recovery_Port>
>>> <Max_Cluster_Num> 128 </Max_Cluster_Num>
>>> <Use_Connection_Pooling> no </Use_Connection_Pooling>
>>> <LifeCheck_Timeout> 1s </LifeCheck_Timeout>
>>> <LifeCheck_Interval> 2s </LifeCheck_Interval>
>>> <Log_File_Info>
>>>     <File_Name> /tmp/pglb.log </File_Name>
>>>     <File_Size> 1M </File_Size>
>>>     <Rotate> 3 </Rotate>
>>> </Log_File_Info>
>>>
>>> pgreplicate.conf
>>> <Cluster_Server_Info>
>>>     <Host_Name> data1 </Host_Name>
>>>     <Port> 5432 </Port>
>>>     <Max_Connect> 32 </Max_Connect>
>>>     <Recovery_Port> 7001 </Recovery_Port>
>>> </Cluster_Server_Info>
>>> <Cluster_Server_Info>
>>>     <Host_Name> data2 </Host_Name>
>>>     <Port> 5432 </Port>
>>>     <Max_Connect> 32 </Max_Connect>
>>>     <Recovery_Port> 7001 </Recovery_Port>
>>> </Cluster_Server_Info>
>>> <Cluster_Server_Info>
>>>     <Host_Name> data3 </Host_Name>
>>>     <Port> 5432 </Port>
>>>     <Max_Connect> 32 </Max_Connect>
>>>     <Recovery_Port> 7001 </Recovery_Port>
>>> </Cluster_Server_Info>
>>> <LoadBalance_Server_Info>
>>>     <Host_Name> loadbalancer </Host_Name>
>>>     <Recovery_Port> 6001 </Recovery_Port>
>>> </LoadBalance_Server_Info>
>>> <Host_Name> replicator </Host_Name>
>>> <Replication_Port> 8001 </Replication_Port>
>>> <Recovery_Port> 8101 </Recovery_Port>
>>> <RLOG_Port> 8301 </RLOG_Port>
>>> <Response_Mode> normal </Response_Mode>
>>> <Use_Replication_Log> no </Use_Replication_Log>
>>> <Replication_Timeout> 10s </Replication_Timeout>
>>> <LifeCheck_Timeout> 2s </LifeCheck_Timeout>
>>> <LifeCheck_Interval> 3s </LifeCheck_Interval>
>>> <Log_File_Info>
>>>     <File_Name> /tmp/pgreplicate.log </File_Name>
>>>     <File_Size> 1M </File_Size>
>>>     <Rotate> 3 </Rotate>
>>> </Log_File_Info>
>>>
>>>
>>> data1:
>>> <Replicate_Server_Info>
>>>     <Host_Name> replicator </Host_Name>
>>>     <Port> 8001 </Port>
>>>     <Recovery_Port> 8101 </Recovery_Port>
>>> </Replicate_Server_Info>
>>> <Host_Name> data1 </Host_Name>
>>> <Recovery_Port> 7001 </Recovery_Port>
>>> <Rsync_Path> /usr/bin/rsync </Rsync_Path>
>>> <Rsync_Option> ssh </Rsync_Option>
>>> <Rsync_Compress> yes </Rsync_Compress>
>>> <Pg_Dump_Path> /usr/bin/pg_dump </Pg_Dump_Path>
>>> <When_Stand_Alone> read_only </When_Stand_Alone>
>>> <Replication_Timeout> 10s </Replication_Timeout>
>>> <LifeCheck_Timeout> 2s </LifeCheck_Timeout>
>>> <LifeCheck_Interval> 3s </LifeCheck_Interval>
>>>
>>> data2:
>>> <Replicate_Server_Info>
>>>     <Host_Name> replicator </Host_Name>
>>>     <Port> 8001 </Port>
>>>     <Recovery_Port> 8101 </Recovery_Port>
>>> </Replicate_Server_Info>
>>> <Host_Name> data2 </Host_Name>
>>> <Recovery_Port> 7001 </Recovery_Port>
>>> <Rsync_Path> /usr/bin/rsync </Rsync_Path>
>>> <Rsync_Option> ssh </Rsync_Option>
>>> <Rsync_Compress> yes </Rsync_Compress>
>>> <Pg_Dump_Path> /usr/bin/pg_dump </Pg_Dump_Path>
>>> <When_Stand_Alone> read_only </When_Stand_Alone>
>>> <Replication_Timeout> 10s </Replication_Timeout>
>>> <LifeCheck_Timeout> 2s </LifeCheck_Timeout>
>>> <LifeCheck_Interval> 3s </LifeCheck_Interval>
>>>
>>> data3:
>>> <Replicate_Server_Info>
>>>     <Host_Name> replicator </Host_Name>
>>>     <Port> 8001 </Port>
>>>     <Recovery_Port> 8101 </Recovery_Port>
>>> </Replicate_Server_Info>
>>> <Host_Name> data3 </Host_Name>
>>> <Recovery_Port> 7001 </Recovery_Port>
>>> <Rsync_Path> /usr/bin/rsync </Rsync_Path>
>>> <Rsync_Option> ssh </Rsync_Option>
>>> <Rsync_Compress> yes </Rsync_Compress>
>>> <Pg_Dump_Path> /usr/bin/pg_dump </Pg_Dump_Path>
>>> <When_Stand_Alone> read_only </When_Stand_Alone>
>>> <Replication_Timeout> 10s </Replication_Timeout>
>>> <LifeCheck_Timeout> 2s </LifeCheck_Timeout>
>>> <LifeCheck_Interval> 3s </LifeCheck_Interval>
>>>
>>>
>>> Log files of pglb and pgreplicator are attached.
>>>
>>> kind regards
>>> Pshem
>>> <pgrepliacate-failed.log>
>>> <pglb-failed.log>
>>> _______________________________________________
>>> Pgcluster-general mailing list
>>> Pgcluster-general at pgfoundry.org
>>> http://pgfoundry.org/mailman/listinfo/pgcluster-general
>>
>
>
> --
> Regards,
> Gábriel Ákos
> -=E-Mail :akos.gabriel at i-logic.hu|Web: http://www.i-logic.hu =-
> -=Tel/fax:+3612367353 |Mobil:+36209278894 =-