[Pgcluster-general] Instability replicating over wan

Diogo Biazus diogob at gmail.com
Fri May 18 14:06:51 UTC 2007


Hi there,
I just started using pgcluster and I must say I'm impressed with the
project. Although it lacks some docs (I'll try to write something in
Portuguese later), it was pretty easy to set up and get replicating.

My network layout:
I have 2 servers communicating over a VPN: cluster_poa and
cluster_rio. The machines are in different cities, but the average
ping time is good: 53.504 ms.
I'm not using the load balancer: people in Rio de Janeiro access only
cluster_rio, people in Porto Alegre access only cluster_poa, and
there is a single replicator, running on cluster_poa.
I'm using version 1.7.rc7 on both machines. My config files for both
are below:

>>>>>>>>>>>>>> cluster_poa <<<<<<<<<<<<<<<<<<
* /etc/hosts
192.168.2.8     cluster_rio
192.168.2.8     rep_rio
192.168.200.100 cluster_poa
192.168.200.100 rep_poa

* cluster.conf
<Replicate_Server_Info>
        <Host_Name>             rep_poa </Host_Name>
        <Port>                  8001                           </Port>
        <Recovery_Port>         8101                            </Recovery_Port>
</Replicate_Server_Info>
<Host_Name>                     cluster_poa             </Host_Name>
<Recovery_Port>         7001                            </Recovery_Port>
<Rsync_Path>                    /usr/bin/rsync                  </Rsync_Path>
<Rsync_Option>                  ssh                             </Rsync_Option>
<Rsync_Compress>                yes                             </Rsync_Compress>
<Pg_Dump_Path>                  /usr/local/pgsql/bin/pg_dump    </Pg_Dump_Path>
<When_Stand_Alone>              read_only                       </When_Stand_Alone>
<Replication_Timeout>           1 min                           </Replication_Timeout>
<LifeCheck_Timeout>             20s                             </LifeCheck_Timeout>
<LifeCheck_Interval>            21s                             </LifeCheck_Interval>

* pgreplicate.conf
<Cluster_Server_Info>
    <Host_Name>                 cluster_poa     </Host_Name>
    <Port>                      5432                            </Port>
    <Recovery_Port>             7001                            </Recovery_Port>
</Cluster_Server_Info>
<Cluster_Server_Info>
    <Host_Name>                 cluster_rio     </Host_Name>
    <Port>                      5432                            </Port>
    <Recovery_Port>             7001                            </Recovery_Port>
</Cluster_Server_Info>
<Host_Name>                     rep_poa         </Host_Name>
<Replication_Port>              8001                            </Replication_Port>
<Recovery_Port>         8101                            </Recovery_Port>
<RLOG_Port>                     8301                            </RLOG_Port>
<Response_Mode>         normal                          </Response_Mode>
<Use_Replication_Log>           no                              </Use_Replication_Log>
<Replication_Timeout>           1min                            </Replication_Timeout>
<LifeCheck_Timeout>             10s                             </LifeCheck_Timeout>
<LifeCheck_Interval>            15s                             </LifeCheck_Interval>
<Log_File_Info>
        <File_Name>             /tmp/pgreplicate.log    </File_Name>
        <File_Size>             1M                      </File_Size>
        <Rotate>                3                       </Rotate>
</Log_File_Info>



>>>>>>>>>>>>>> cluster_rio <<<<<<<<<<<<<<<<<<
* /etc/hosts
192.168.2.8     cluster_rio
192.168.2.8     rep_rio
192.168.200.100 cluster_poa
192.168.200.100 rep_poa

* cluster.conf
<Replicate_Server_Info>
        <Host_Name>             rep_poa </Host_Name>
        <Port>                  8001                            </Port>
        <Recovery_Port>         8101                            </Recovery_Port>
</Replicate_Server_Info>
<Host_Name>                     cluster_rio             </Host_Name>
<Recovery_Port>         7001                            </Recovery_Port>
<Rsync_Path>                    /usr/bin/rsync                  </Rsync_Path>
<Rsync_Option>                  ssh                             </Rsync_Option>
<Rsync_Compress>                yes                             </Rsync_Compress>
<Pg_Dump_Path>                  /usr/local/pgsql/bin/pg_dump    </Pg_Dump_Path>
<When_Stand_Alone>              read_only                       </When_Stand_Alone>
<Replication_Timeout>           2 min                           </Replication_Timeout>
<LifeCheck_Timeout>             20s                             </LifeCheck_Timeout>
<LifeCheck_Interval>            21s                             </LifeCheck_Interval>

My problem:
I'm having some stability problems. It works very well, and then
sometimes (I couldn't find any pattern, it seems random to me)
cluster_rio stops seeing the replicator, and the whole cluster restarts
and writes this to the log:

LOG:  server process (PID 11017) was terminated by signal 11
LOG:  terminating any other active server processes
LOG:  all server processes terminated; reinitializing

Once this starts happening, it won't stop until I restart the whole
cluster and the replicator.
If I restart only the cluster_rio postmaster, it does not communicate
with the replicator and gives lots of:
ERROR:  This query is not permitted when all replication servers fell down

But communication between the machines seems perfect every time I
test it. So I wonder whether this is caused by some network
instability, combined with the replicator failing to reconnect to
the remote cluster.
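
One way to test the network hypothesis would be to leave a small probe
running on cluster_rio and compare its timestamps with the crash times in
the PostgreSQL log. The sketch below is only an illustration and is not
part of pgcluster: it opens plain TCP connections to the replicator and
recovery ports listed in the configs above and logs how long each connect
takes. The function names (probe, monitor) and the 10-second interval are
assumptions made for this example, and a successful connect only shows
that the port is reachable, not that replication itself is healthy.

```python
import socket
import time

# Host names and ports are taken from the /etc/hosts and config files above.
TARGETS = [
    ("rep_poa", 8001),      # replicator: replication port
    ("rep_poa", 8101),      # replicator: recovery port
    ("cluster_poa", 7001),  # cluster_poa: recovery port
]


def probe(host, port, timeout=5.0):
    """Attempt one TCP connect; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        # Covers refused connections, timeouts and name-resolution failures.
        return False, time.monotonic() - start, str(exc)


def monitor(targets, interval=10.0, rounds=None):
    """Probe every target each `interval` seconds; print one line per probe.

    With rounds=None the loop runs until interrupted.
    """
    done = 0
    while rounds is None or done < rounds:
        for host, port in targets:
            ok, elapsed, err = probe(host, port)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            status = "ok" if ok else "FAIL " + err
            print(f"{stamp} {host}:{port} {elapsed * 1000:.1f} ms {status}")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Calling monitor(TARGETS) on cluster_rio and redirecting the output to a
file would give a timeline of failed or slow connects. If the signal 11
crashes show up while every probe still succeeds, a network drop becomes
a less likely explanation.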

Any ideas on this?

Thanks in advance,
-- 
Diogo Biazus - diogob at gmail.com
Móvel Consultoria
http://www.movelinfo.com.br
http://www.postgresql.org.br

