Discussion:
[Qemu-discuss] BLOCK_JOB_ERROR showing up on qmp monitor socket, failing live migration
Abe Massry
2018-09-25 15:58:41 UTC
Permalink
Hello,

I'm seeing this error message come across the qemu qmp monitor and it
is preventing live migrations from completing successfully on a number
of qemu instances. Some of them do complete successfully with the same
parameters.

{"timestamp": {"seconds": 1537544621, "microseconds": 488111},
"event": "BLOCK_JOB_ERROR", "data": {"device": "drive-scsi-disk-1",
"operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1537544621, "microseconds": 488957},
"event": "BLOCK_JOB_ERROR", "data": {"device": "drive-scsi-disk-1",
"operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1537544621, "microseconds": 501077},
"event": "BLOCK_JOB_ERROR", "data": {"device": "drive-scsi-disk-1",
"operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1537544621, "microseconds": 501694},
"event": "BLOCK_JOB_ERROR", "data": {"device": "drive-scsi-disk-1",
"operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1537544621, "microseconds": 606157},
"event": "BLOCK_JOB_COMPLETED", "data": {"device":
"drive-scsi-disk-1", "len": 541065216, "offset": 536870912, "speed":
1073741824, "type": "mirror", "error": "Input/output error"}}

in most (but not all) cases the difference between "len" and "offset"
is 541065216 - 536870912 = 4194304 bytes, i.e. 4 MiB,
which leads me to believe the target is one 4 MiB chunk short
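
While the mirror job is still running, the same len/offset counters can
also be polled on the source monitor with plain QMP (nothing specific
to my setup):

{
    "execute": "query-block-jobs"
}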

the destination qemu instance is started with:

-incoming tcp:$RamMigrationIP:$RamMigrationPort

and the NBD server is started on the destination with:

{
    "execute": "nbd-server-start",
    "arguments": {
        "addr": {
            "type": "inet",
            "data": {
                "host": "$ip",
                "port": "$port"
            }
        }
    }
}
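
(For completeness: the drive also has to be exported on the destination
before the mirror can write to it. I haven't shown that step above, but
it would presumably be something like:

{
    "execute": "nbd-server-add",
    "arguments": {
        "device": "drive-scsi-disk-1",
        "writable": true
    }
}
)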

the command I'm running on the source is:

{
    "execute": "drive-mirror",
    "arguments": {
        "device": "drive-scsi-disk-1",
        "target": "nbd://$ip:$port/drive-scsi-disk-1",
        "speed": 1073741824,
        "sync": "full",
        "mode": "existing",
        "format": "raw"
    }
}

going from qemu 2.11.1 to 2.11.2

I've also started throttling the disk I/O during live migration with:

{
    "execute": "block_set_io_throttle",
    "arguments": {
        "device": "drive-scsi-disk-1",
        "bps_rd": 0,
        "bps_wr": 0,
        "bps": 104857600,
        "iops": 0,
        "iops_rd": 0,
        "iops_wr": 0
    }
}
This allowed disks that previously couldn't finish the live migration
(because guest I/O was outpacing the mirror) to complete.
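
After the switch-over the limits can presumably be lifted again by
setting every field back to 0, e.g.:

{
    "execute": "block_set_io_throttle",
    "arguments": {
        "device": "drive-scsi-disk-1",
        "bps_rd": 0,
        "bps_wr": 0,
        "bps": 0,
        "iops": 0,
        "iops_rd": 0,
        "iops_wr": 0
    }
}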

Has anyone seen this before? Does anyone know what the problem is or
how to fix it?
I would appreciate any help very much.

Thank you,
Abe

--
Abe Massry
Linode - https://www.linode.com/
--
Abe Massry
2018-09-29 01:53:33 UTC
Permalink
To answer my own question for others who may run into this same
problem: I'm using logical volumes (LVs) from LVM to back the disks in
raw format. The source and destination LVs have to be the same size;
due to an accounting error on our side they weren't. The error message
didn't make it entirely clear what the issue was, but I'm glad it has
been resolved. Hopefully others searching for the same text I was
searching for will find this message and check the LV sizes on both
the source and the destination qemu instances.
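
If it helps, a quick way to compare the backing sizes on both hosts
(assuming the LVs live under /dev/$vg/; adjust the path for your
naming) is:

# exact size in bytes of the backing device
blockdev --getsize64 /dev/$vg/drive-scsi-disk-1

# or list all LVs in the volume group, in bytes
lvs --units b -o lv_name,lv_size $vg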

Thanks again,
Abe
