Troubleshooting guide

This guide helps you recover Percona ClusterSync for MongoDB after an unexpected interruption, whether it occurs during initial data clone or real-time replication.

Recover PCSM during initial data clone

Percona ClusterSync for MongoDB (PCSM) can be interrupted for various reasons: for example, it is restarted, exits abnormally, or loses the connection to the source or destination cluster for an extended time. In any of these cases, you must restart the initial data clone.

Symptoms

When you start the service after the interruption, you may see messages like these:

Sample error messages
2025-06-02 21:25:38.927 INF Found Recovery Data. Recovering... s=recovery
Error: new server: recover Percona ClusterSync for MongoDB: recover: cannot resume: replication is not started or not resuming from failure
2025-06-02 21:25:38.929 FTL error="new server: recover Percona ClusterSync for MongoDB: recover: cannot resume: replication is not started or not resuming from failure"

Recovery steps

To recover PCSM, do the following:

  1. Stop the pcsm service:

    $ sudo systemctl stop pcsm
    
  2. Reset the PCSM state by running the following command, passing the connection string URI of the target deployment:

    $ pcsm reset --target <target-mongodb-uri>
    

    The command does the following:

    • Connects to the target MongoDB deployment
    • Deletes the metadata collections
    • Restores the pcsm service from the failed state
  3. Start the pcsm service:

    $ sudo systemctl start pcsm
    
  4. Start data replication from scratch:

    $ pcsm start
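
The four steps above can be sketched as a single shell function. This is a minimal sketch, not an official tool: the commands are the ones listed in the steps, while the `run` helper, the `DRY_RUN` switch, and the function name `recover_pcsm_clone` are hypothetical names introduced here for illustration.

```shell
# Sketch: reset PCSM and restart the initial data clone from scratch.
# The wrapper, DRY_RUN, and argument handling are illustrative assumptions.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: recover_pcsm_clone <target-mongodb-uri>
recover_pcsm_clone() {
  local target_uri="${1:-}"
  if [ -z "$target_uri" ]; then
    echo "usage: recover_pcsm_clone <target-mongodb-uri>" >&2
    return 2
  fi

  run sudo systemctl stop pcsm || return 1           # 1. stop the service
  run pcsm reset --target "$target_uri" || return 1  # 2. delete PCSM metadata on the target
  run sudo systemctl start pcsm || return 1          # 3. start the service again
  run pcsm start || return 1                         # 4. start replication from scratch
}
```

Running `DRY_RUN=1 recover_pcsm_clone "<target-mongodb-uri>"` prints the commands without executing them, which makes it possible to review the sequence before touching the target deployment. Each step stops the sequence if the previous command fails.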
    

Recover PCSM during real-time replication

PCSM can complete the initial data clone successfully and then be interrupted unexpectedly during real-time replication. The recovery steps differ depending on how PCSM stopped.

Unexpected shutdown

If PCSM exits abnormally or is stopped unexpectedly, restart the pcsm service. This is typically sufficient as PCSM resumes replication automatically from the last saved checkpoint.

Example logs
2025-06-02 21:32:04.592 INF Starting Cluster Replication s=pcsm
2025-06-02 21:32:04.592 DBG Change Replication is resuming s=repl
2025-06-02 21:32:04.592 INF Change Replication resumed op_ts=[1748887947,1] s=repl
2025-06-02 21:32:04.594 DBG Checkpoint saved s=checkpointing
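
After restarting the service, one way to confirm that replication resumed is to look for the "Change Replication resumed" message shown in the example logs. The sketch below assumes that message text is stable; the function name `replication_resumed` is hypothetical, and reading the logs through `journalctl -u pcsm` assumes the service writes to the systemd journal, which the guide does not state.

```shell
# Sketch: succeed if the log text on stdin contains the resume message
# from the example logs above. The message text is taken from the sample;
# treating it as a reliable marker is an assumption.
replication_resumed() {
  grep -q "Change Replication resumed"
}

# Example usage (assumes pcsm logs to the systemd journal):
#   sudo systemctl restart pcsm
#   journalctl -u pcsm --since "5 minutes ago" | replication_resumed \
#     && echo "replication resumed" \
#     || echo "resume message not found; check pcsm status"
```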

Replication fails while PCSM is running

The pcsm process is active, but replication may fail because of a temporary connection issue or for other reasons. After you resolve the cause of the failure (for example, restore the connection), follow these steps to recover PCSM:

  1. Check the current replication status:

    $ pcsm status
    
    Sample output
     {
       "ok": false,
       "error": "change replication: bulk write: server selection error: context deadline exceeded, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: sandra-xps15:28017, Type: Unknown, Last error: dial tcp 127.0.1.1:28017: connect: connection refused }, ] }",
       "state": "failed",
       "info": "Failed",
       "eventsProcessed": 2301,
       "lastReplicatedOpTime": "1748889570.1",
       "initialSync": {
         "lagTime": 0,
         "estimatedCloneSize": 0,
         "clonedSize": 0,
         "completed": true,
         "cloneCompleted": true
       }
     }
    
  2. Resume the replication from the last successful checkpoint:

    $ pcsm resume --from-failure
    
  3. Confirm that the replication has resumed:

    $ pcsm status
    
    Sample output after successful resume
    {
      "ok": true,
      "state": "running",
      "info": "Replicating Changes",
      "lagTime": 140,
      "eventsProcessed": 2301,
      "lastReplicatedOpTime": "1748889570.1",
      "initialSync": {
        "lagTime": 140,
        "estimatedCloneSize": 0,
        "clonedSize": 0,
        "completed": true,
        "cloneCompleted": true
      }
    }
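
The check-then-resume sequence above can be sketched in shell. This sketch assumes that `pcsm status` prints pretty-printed JSON with `"state"` on its own line, as in the sample outputs; the function names `pcsm_state` and `resume_if_failed` are hypothetical. A JSON-aware tool would be more robust than the `sed` pattern used here.

```shell
# Sketch: extract the "state" value from `pcsm status` output on stdin.
# Assumes one key per line, as in the sample output above.
pcsm_state() {
  sed -n 's/^[[:space:]]*"state"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Resume replication only when the reported state is "failed".
resume_if_failed() {
  local state
  state="$(pcsm status | pcsm_state)"
  case "$state" in
    failed)  pcsm resume --from-failure ;;
    running) echo "replication is running; nothing to do" ;;
    *)       echo "state is '${state:-unknown}'; not resuming" >&2; return 1 ;;
  esac
}
```

After `resume_if_failed` returns, running `pcsm status` again shows whether the state changed to `running`.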
    

Note

If replication still fails after you run pcsm resume --from-failure, even though you have restored connectivity and target cluster availability and resolved any other underlying issue, you need to start over. Refer to the Recover PCSM during initial data clone section and reset the PCSM state to begin replication from scratch.


Last update: January 13, 2026
Created: January 13, 2026