Discussion:
[Qemu-discuss] BLOCK_JOB_ERROR showing up on qmp monitor socket, failing live migration
Abe Massry
2018-09-25 15:58:41 UTC
Permalink
Hello,

I'm seeing this error message come across the qemu qmp monitor and it
is preventing live migrations from completing successfully on a number
of qemu instances. Some of them do complete successfully with the same
parameters.

{"timestamp": {"seconds": 1537544621, "microseconds": 488111},
"event": "BLOCK_JOB_ERROR", "data": {"device": "drive-scsi-disk-1",
"operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1537544621, "microseconds": 488957},
"event": "BLOCK_JOB_ERROR", "data": {"device": "drive-scsi-disk-1",
"operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1537544621, "microseconds": 501077},
"event": "BLOCK_JOB_ERROR", "data": {"device": "drive-scsi-disk-1",
"operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1537544621, "microseconds": 501694},
"event": "BLOCK_JOB_ERROR", "data": {"device": "drive-scsi-disk-1",
"operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1537544621, "microseconds": 606157},
"event": "BLOCK_JOB_COMPLETED", "data": {"device":
"drive-scsi-disk-1", "len": 541065216, "offset": 536870912, "speed":
1073741824, "type": "mirror", "error": "Input/output error"}}

in most (but not all) cases the difference between "len" and "offset"
is 541065216 - 536870912 = 4194304 bytes, i.e. 4 MiB,
which leads me to believe the target is one 4 MiB chunk short
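
While the mirror job is still running, the same len/offset counters can
also be polled on the source monitor with plain QMP (nothing specific
to my setup):

{
    "execute": "query-block-jobs"
}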

the destination qemu instance is started with:

-incoming tcp:$RamMigrationIP:$RamMigrationPort

and the NBD server is started on the destination with:

{
    "execute": "nbd-server-start",
    "arguments": {
        "addr": {
            "type": "inet",
            "data": {
                "host": "$ip",
                "port": "$port"
            }
        }
    }
}
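
(For completeness: the drive also has to be exported on the destination
before the mirror can write to it. I haven't shown that step above, but
it would presumably be something like:

{
    "execute": "nbd-server-add",
    "arguments": {
        "device": "drive-scsi-disk-1",
        "writable": true
    }
}
)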

the command I'm running on the source is:

{
    "execute": "drive-mirror",
    "arguments": {
        "device": "drive-scsi-disk-1",
        "target": "nbd://$ip:$port/drive-scsi-disk-1",
        "speed": 1073741824,
        "sync": "full",
        "mode": "existing",
        "format": "raw"
    }
}

going from qemu 2.11.1 to 2.11.2

I've also started throttling the disk I/O during live migration with:

{
    "execute": "block_set_io_throttle",
    "arguments": {
        "device": "drive-scsi-disk-1",
        "bps_rd": 0,
        "bps_wr": 0,
        "bps": 104857600,
        "iops": 0,
        "iops_rd": 0,
        "iops_wr": 0
    }
}
This allowed disks that previously couldn't finish the live migration
(because guest I/O was outpacing the mirror) to complete.
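
After the switch-over the limits can presumably be lifted again by
setting every field back to 0, e.g.:

{
    "execute": "block_set_io_throttle",
    "arguments": {
        "device": "drive-scsi-disk-1",
        "bps_rd": 0,
        "bps_wr": 0,
        "bps": 0,
        "iops": 0,
        "iops_rd": 0,
        "iops_wr": 0
    }
}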

Has anyone seen this before? Does anyone know what the problem is or
how to fix it?
I would appreciate any help very much.

Thank you,
Abe

--
Abe Massry
Linode - https://www.linode.com/
--
Abe Massry
2018-09-29 01:53:33 UTC
Permalink
To answer my own question for others who may run into this same
problem: I'm using logical volumes (LVs) from LVM to back the disks in
raw format. The source and destination LVs have to be the same size;
due to an accounting error on our side they weren't. The error message
didn't make it entirely clear what the issue was, but I'm glad it has
been resolved. Hopefully others searching for the same text I was
searching for will find this message and check the LV sizes on both
the source and the destination qemu instances.
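
If it helps, a quick way to compare the backing sizes on both hosts
(assuming the LVs live under /dev/$vg/; adjust the path for your
naming) is:

# exact size in bytes of the backing device
blockdev --getsize64 /dev/$vg/drive-scsi-disk-1

# or list all LVs in the volume group, in bytes
lvs --units b -o lv_name,lv_size $vg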

Thanks again,
Abe
