Troubleshooting guide¶
This guide helps you recover Percona ClusterSync for MongoDB after an unexpected interruption, whether it occurs during initial data clone or real-time replication.
Recover PCSM during initial data clone¶
Percona ClusterSync for MongoDB (PCSM) can be interrupted for various reasons: for example, it is restarted, exits abnormally, or loses the connection to the source or destination cluster for an extended time. In any of these cases, you must restart the initial data clone.
Symptoms¶
When you next start the service, you may see messages like the following:
Sample error messages
2025-06-02 21:25:38.927 INF Found Recovery Data. Recovering... s=recovery
Error: new server: recover Percona ClusterSync for MongoDB: recover: cannot resume: replication is not started or not resuming from failure
2025-06-02 21:25:38.929 FTL error="new server: recover Percona ClusterSync for MongoDB: recover: cannot resume: replication is not started or not resuming from failure"
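If you monitor PCSM logs automatically, you can detect this failure mode by matching the fatal error text above. The following is a minimal Python sketch; the needs_reset helper is hypothetical and not part of PCSM:

```python
# Hypothetical helper: detect the fatal "cannot resume" error in PCSM log output.
# The error string is taken verbatim from the sample messages above.
CANNOT_RESUME = "cannot resume: replication is not started or not resuming from failure"

def needs_reset(log_lines):
    """Return True if any log line carries the fatal recovery error,
    meaning the initial data clone must be restarted."""
    return any(CANNOT_RESUME in line for line in log_lines)
```

You could feed this function the output of your service log collector; if it returns True, follow the recovery steps below.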
Recovery steps¶
To recover PCSM, do the following:
1. Stop the pcsm service:

   $ sudo systemctl stop pcsm

2. Reset the PCSM state with the following command, passing the connection string URI of the target deployment:

   $ pcsm reset --target <target-mongodb-uri>

   The command does the following:

   - Connects to the target MongoDB deployment
   - Deletes the metadata collections
   - Restores the pcsm service from the failed state

3. Restart the pcsm service:

   $ sudo systemctl start pcsm

4. Start data replication from scratch:

   $ pcsm start
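If you need to run this recovery repeatedly, the four steps above can be wrapped in a small script. The following is a minimal Python sketch that replays the documented commands in order; the reset_pcsm helper is hypothetical, and the injectable runner exists only so the sequence can be tested without a live pcsm installation:

```python
import subprocess

# The four recovery commands from the steps above, in order.
# "<target-mongodb-uri>" is the placeholder filled in at run time.
RESET_SEQUENCE = [
    ["sudo", "systemctl", "stop", "pcsm"],
    ["pcsm", "reset", "--target", "<target-mongodb-uri>"],
    ["sudo", "systemctl", "start", "pcsm"],
    ["pcsm", "start"],
]

def reset_pcsm(target_uri, runner=subprocess.run):
    """Run the reset sequence, stopping at the first failing command."""
    for template in RESET_SEQUENCE:
        cmd = [target_uri if arg == "<target-mongodb-uri>" else arg
               for arg in template]
        runner(cmd, check=True)  # check=True aborts the sequence on failure
```

Because check=True raises on a non-zero exit code, a failure in any step leaves the remaining steps unexecuted, matching the manual procedure.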
Recover PCSM during real-time replication¶
PCSM can complete the initial data clone successfully and then be interrupted unexpectedly during real-time replication. The recovery steps differ depending on how PCSM stopped.
Unexpected shutdown¶
If PCSM exits abnormally or is stopped unexpectedly, restart the pcsm service. This is typically sufficient: PCSM automatically resumes replication from the last saved checkpoint.
Example logs
2025-06-02 21:32:04.592 INF Starting Cluster Replication s=pcsm
2025-06-02 21:32:04.592 DBG Change Replication is resuming s=repl
2025-06-02 21:32:04.592 INF Change Replication resumed op_ts=[1748887947,1] s=repl
2025-06-02 21:32:04.594 DBG Checkpoint saved s=checkpointing
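To confirm from the logs which optime replication resumed from, you can extract the op_ts value shown in the example above. A minimal Python sketch follows; the resumed_optime helper is hypothetical and assumes the log line format shown in this guide:

```python
import re

# Matches the resume line from the example logs, e.g.
# "INF Change Replication resumed op_ts=[1748887947,1] s=repl"
RESUMED = re.compile(r"Change Replication resumed op_ts=\[(\d+),(\d+)\]")

def resumed_optime(log_lines):
    """Return the (seconds, increment) optime PCSM resumed from, or None."""
    for line in log_lines:
        m = RESUMED.search(line)
        if m:
            return int(m.group(1)), int(m.group(2))
    return None
```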
Replication fails while PCSM is running¶
Replication may fail while the pcsm process is still running, for example because of a temporary connection issue. After you resolve the cause of the failure (for example, restore the connection), follow these steps to recover PCSM:
1. Check the current replication status:

   $ pcsm status

   Sample output:

   {
     "ok": false,
     "error": "change replication: bulk write: server selection error: context deadline exceeded, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: sandra-xps15:28017, Type: Unknown, Last error: dial tcp 127.0.1.1:28017: connect: connection refused }, ] }",
     "state": "failed",
     "info": "Failed",
     "eventsProcessed": 2301,
     "lastReplicatedOpTime": "1748889570.1",
     "initialSync": {
       "lagTime": 0,
       "estimatedCloneSize": 0,
       "clonedSize": 0,
       "completed": true,
       "cloneCompleted": true
     }
   }

2. Resume the replication from the last successful checkpoint:

   $ pcsm resume --from-failure

3. Confirm that the replication has resumed:

   $ pcsm status

   Sample output after a successful resume:

   {
     "ok": true,
     "state": "running",
     "info": "Replicating Changes",
     "lagTime": 140,
     "eventsProcessed": 2301,
     "lastReplicatedOpTime": "1748889570.1",
     "initialSync": {
       "lagTime": 140,
       "estimatedCloneSize": 0,
       "clonedSize": 0,
       "completed": true,
       "cloneCompleted": true
     }
   }
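Because pcsm status emits JSON, the check-and-resume decision above can be scripted. The following is a minimal Python sketch; the next_action helper is hypothetical, and only the ok and state fields shown in the sample outputs are assumed:

```python
import json

def next_action(status_json):
    """Map a `pcsm status` JSON payload to a suggested operator action."""
    status = json.loads(status_json)
    if status.get("ok") and status.get("state") == "running":
        return "none"         # replication is healthy
    if status.get("state") == "failed":
        return "resume"       # try: pcsm resume --from-failure
    return "investigate"      # unexpected state: inspect logs manually
```

For example, piping the output of pcsm status into this function returns "resume" for the failed sample above and "none" for the healthy one.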
Note
If replication still fails after you run pcsm resume --from-failure, even though you have restored connectivity, target cluster availability, or resolved any other underlying issue, you need to start over. Refer to the Recover PCSM during initial data clone section and reset the PCSM state to begin replication from scratch.
Created: January 13, 2026