[Pgcluster-general] Possible Broken Code in 1.5.0rc16
johng at auctionsolutions.com
Tue Jun 5 12:31:47 UTC 2007
Mr. Mitani,
I started looking into the code to see if I could resolve a problem with a
setup of 2 replication servers and 2 db servers using lifecheck.
I noticed that if rep1 was killed, the inserts stopped working and a
"replication server fell down" error message was displayed on the screen.
My lifecheck interval value is 11s.
Failover to rep2 did succeed, but after 11 seconds the postgres engine tried
to reconnect to rep1, which was still down, and that caused problems in the
engine. I know this because I put debug logic into the replicate.c code to
see what was happening.
If I change the lifecheck interval to, say, 1 minute, then I can
continue to perform inserts after rep1 is killed until the 1 minute timer
expires.
I am looking into a possible problem with ~/src/backend/libpq/replicate.c
line 162:
ReplicateServerData = (ReplicateServerInfo*)shmat(ReplicateServerShmid, 0, 0);
Is the pointer ReplicateServerData used by any child processes or
threads?
If so, couldn't there be contention issues, with multiple children
trying to modify the contents of the memory pointed to by
ReplicateServerData at the same time?
If so, shouldn't we be protecting the memory access with semaphores?
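To make the question concrete, here is a minimal sketch of the kind of protection I have in mind. It is not PGCluster code: SharedInfo, run_demo, sem_lock and sem_unlock are hypothetical names, and it assumes System V semaphores are available alongside the System V shared memory that shmat already implies. Several forked children update one shared structure, and a single semaphore serializes the updates.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the shared server table (not the real layout). */
typedef struct {
    int use_count;               /* field that every child modifies */
} SharedInfo;

/* The caller has to define this union for semctl() on Linux/glibc. */
union semun_arg {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Add delta to semaphore 0; SEM_UNDO releases the lock if a holder dies. */
static int sem_change(int semid, short delta)
{
    struct sembuf op;
    op.sem_num = 0;
    op.sem_op = delta;
    op.sem_flg = SEM_UNDO;
    return semop(semid, &op, 1);
}

static int sem_lock(int semid)   { return sem_change(semid, -1); }
static int sem_unlock(int semid) { return sem_change(semid, 1); }

/*
 * Fork `children` processes; each increments the shared counter
 * `iterations` times while holding the semaphore.
 * Returns the final counter value, or -1 if setup fails.
 */
int run_demo(int children, int iterations)
{
    assert(children >= 0 && iterations >= 0);

    int shmid = shmget(IPC_PRIVATE, sizeof(SharedInfo), IPC_CREAT | 0600);
    if (shmid < 0)
        return -1;

    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid < 0) {
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }

    union semun_arg arg;
    arg.val = 1;                 /* 1 = unlocked, 0 = locked */
    SharedInfo *info = (SharedInfo *)shmat(shmid, 0, 0);
    if (info == (void *)-1 || semctl(semid, 0, SETVAL, arg) < 0) {
        if (info != (void *)-1)
            shmdt(info);
        shmctl(shmid, IPC_RMID, NULL);
        semctl(semid, 0, IPC_RMID);
        return -1;
    }
    info->use_count = 0;

    int started = 0;
    for (int c = 0; c < children; c++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: the attached segment is inherited across fork(). */
            for (int i = 0; i < iterations; i++) {
                if (sem_lock(semid) != 0)
                    _exit(1);
                info->use_count++;       /* critical section */
                if (sem_unlock(semid) != 0)
                    _exit(1);
            }
            _exit(0);
        }
        if (pid > 0)
            started++;
    }

    /* Parent: wait for every child that was actually started. */
    for (int c = 0; c < started; c++)
        wait(NULL);

    int result = info->use_count;
    shmdt(info);
    shmctl(shmid, IPC_RMID, NULL);   /* remove the segment */
    semctl(semid, 0, IPC_RMID);      /* remove the semaphore set */
    return result;
}
```

Without the sem_lock/sem_unlock pair around the increment, concurrent children can lose updates. Whether the real ReplicateServerData has the same exposure depends on which processes write to it, which is what I am trying to find out.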
Life Check is a forked process, and in lifecheck.c the function
PGR_Lifecheck_Main calls PGR_get_replicate_server_info, which
walks the list of servers pointed to by ReplicateServerData.
Could this have anything to do with the engine trying to reconnect to
rep1, which is what causes my issue?
I will continue to investigate the problem, but if you could respond with
your comments it would help.
Thanks,
-John