Problems with btrfs and kernel 3.17

Started by hightime, 2014/10/09, 02:26:06


hightime

Since upgrading to the 3.17 kernel, I've had filesystem corruption on two different systems. The first sign is that the filesystem switches to read-only. Has anybody else encountered this?

dibl

#1
I'm seeing some problems, but my hardware has been running 24/7/365 for four years, so I'm not sure how much is due to new kernels and how much is due to aging hardware. With the 3.17 kernels, I have commented out the /etc/fstab mount line for the btrfs filesystem, because it will not mount correctly during boot. After booting and logging in under lightdm, I can mount the btrfs filesystem manually and all is good.

Sidelight -- no matter what kind of hdd, ssd, or usb stick I boot from, I see the error "/dev/sda partition table not recognized", or something to that effect. It can be a bootable siduction USB stick or my installed OS on an SSD; the result is the same. /dev/sda can be different devices, but the error message is identical. Strange ...


EDIT 9 OCT: Well, a new day, and today the error is on /dev/sdb.  The logged error during boot was:


Oct 08 20:32:29 imerabox systemd-gpt-auto-generator[290]: Failed to determine partition table type of /dev/sdb: Input/output error


The drive configuration is:


/dev/sda1                       ext2                       /boot                               ac7da829-aebb-46f0-806c-04a4d81a945a
/dev/sda2                       swap                       <swap>                              0d939b7d-48f1-47dd-aebe-77e7bd8c3503
/dev/sdb1                       ext4                       /                                   bea3a748-3411-4024-acd0-39f3882ddaf9
/dev/sdb2                       ext4         SDA2          /mnt/SDA2                           8cfe2acc-7572-4b45-b25f-ed021bb1d78b
/dev/sdc1                       ext4         revodata      /mnt/REVODATA                       ec21f5b3-7fd4-4f4b-af8d-cf787b147ae8
/dev/sde                        btrfs                      (in use)                            2bbc4079-e05d-43a3-865b-5b3d3f4af0f5
/dev/sdd                        btrfs                      (in use)                            2bbc4079-e05d-43a3-865b-5b3d3f4af0f5



/dev/sdb is one half of a 120GB OCZ Revodrive SSD, the other half being /dev/sdc. They are seen as separate drives by the OS.


fdisk has no problem reading the partition table and reporting the configuration:


Disk /dev/sdb: 55.9 GiB, 60022480896 bytes, 117231408 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x000ab81f


Device     Boot    Start       End  Sectors  Size Id Type
/dev/sdb1           1024  39844863 39843840   19G 83 Linux
/dev/sdb2       39844864 117230591 77385728 36.9G 83 Linux
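For anyone wondering what fdisk actually reads here: a DOS (MBR) partition table lives in the first 512-byte sector, with four 16-byte entries starting at byte 446 and the 0x55AA signature at bytes 510-511. Each entry holds the partition type at offset 4, then the starting LBA (offset 8) and sector count (offset 12) as little-endian 32-bit values. The Python sketch below decodes those fields and recomputes the End and Size columns the way fdisk shows them. It assumes 512-byte sectors, as reported above, and the names parse_mbr and make_test_sector are illustrative only, not part of any tool.

```python
import struct
from typing import NamedTuple

SECTOR = 512  # assumed logical sector size, matching the fdisk output


class MbrEntry(NamedTuple):
    index: int
    type_id: int
    start: int
    sectors: int

    @property
    def end(self) -> int:
        # fdisk's "End" column is inclusive: start + sectors - 1
        return self.start + self.sectors - 1

    @property
    def size_gib(self) -> float:
        return self.sectors * SECTOR / 2**30


def parse_mbr(sector0: bytes) -> list[MbrEntry]:
    """Decode the four primary entries of a classic DOS/MBR partition table."""
    if len(sector0) < SECTOR or sector0[510:512] != b"\x55\xaa":
        raise ValueError("no MBR boot signature (0x55AA) in sector 0")
    entries = []
    for i in range(4):
        off = 446 + 16 * i
        entry = sector0[off:off + 16]
        type_id = entry[4]
        start, sectors = struct.unpack_from("<II", entry, 8)
        if type_id != 0 and sectors != 0:  # type 0x00 marks an unused slot
            entries.append(MbrEntry(i + 1, type_id, start, sectors))
    return entries


def make_test_sector(parts: list[tuple[int, int, int]]) -> bytes:
    """Build a synthetic sector 0 from (type_id, start, sectors) tuples."""
    buf = bytearray(SECTOR)
    for i, (type_id, start, sectors) in enumerate(parts):
        off = 446 + 16 * i
        buf[off + 4] = type_id
        struct.pack_into("<II", buf, off + 8, start, sectors)
    buf[510:512] = b"\x55\xaa"
    return bytes(buf)


if __name__ == "__main__":
    # Values copied from the /dev/sdb listing above.
    sector = make_test_sector([(0x83, 1024, 39843840), (0x83, 39844864, 77385728)])
    for p in parse_mbr(sector):
        print(f"sdb{p.index} type={p.type_id:#x} start={p.start} "
              f"end={p.end} size={p.size_gib:.1f}G")
```

Fed the values from the listing, it reproduces the End column (39844863 and 117230591) and the 19G / 36.9G sizes.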



So, I dunno -- some weirdness going on at the moment.

EDIT 2: Hypothesis: when systemd encounters drives with no partition table during boot, such as my two WD1000s that carry the BTRFS filesystem on the raw devices, it becomes very, very confused about the device IDs on all the drives, but later on the running OS does not have the same problem.
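To make the hypothesis a bit more concrete: a btrfs filesystem created on a whole disk (as on the two WD1000s) leaves no partition table at all. btrfs keeps its primary superblock at 64 KiB, with the magic string _BHRfS_M 0x40 bytes into it. A GPT disk has the "EFI PART" header in LBA 1 behind a protective MBR, and a DOS disk ends sector 0 with 0x55AA. The rough classifier below only looks at those three spots. It is a sketch that assumes 512-byte logical sectors. The function names classify_disk and read_head are hypothetical, and this is not how systemd-gpt-auto-generator itself probes devices.

```python
import sys

SECTOR = 512  # assumption: 512-byte logical sectors
BTRFS_MAGIC_OFFSET = 0x10000 + 0x40  # superblock at 64 KiB, magic at +0x40


def classify_disk(head: bytes) -> str:
    """Guess what sits at the start of a block device from its first ~64 KiB."""
    # GPT must be checked first: GPT disks also carry a protective MBR (0x55AA).
    if head[SECTOR:SECTOR + 8] == b"EFI PART":
        return "gpt"
    # Note: a partitionless FAT volume also ends sector 0 with 0x55AA,
    # so "dos" is only a best guess.
    if head[510:512] == b"\x55\xaa":
        return "dos"
    if head[BTRFS_MAGIC_OFFSET:BTRFS_MAGIC_OFFSET + 8] == b"_BHRfS_M":
        return "btrfs-whole-disk"  # no partition table, filesystem on the raw device
    return "unknown"


def read_head(path: str) -> bytes:
    # Reading a real /dev/sdX needs root; a disk image file works as well.
    with open(path, "rb") as f:
        return f.read(BTRFS_MAGIC_OFFSET + 8)


if __name__ == "__main__":
    for dev in sys.argv[1:]:
        print(dev, classify_disk(read_head(dev)))
```

Run as root with device paths as arguments (for example /dev/sdd and /dev/sde), it should report the raw btrfs members as btrfs-whole-disk and the partitioned drives as dos, which is the situation the hypothesis describes.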

System76 Oryx Pro, Intel Core i7-11800H, ASRock B860 Pro-A, Intel Core Ultra 7 265KF, Nvidia GTX-1060, SSD 990 EVO Plus.

hightime

I have rebuilt two of the affected filesystems after being unable to clean them up. btrfs scrub would exit almost immediately. I ran btrfs check on one, and hundreds of errors scrolled past before I terminated it. Apart from those, the only errors I see appear when I look in my snapshot directory: snapshots made after the kernel change can't be accessed, and I can't delete them.

# ls -l /.snapshots
ls: cannot access /.snapshots/2014-10-07T20:00-0400: Cannot allocate memory
total 0
d????????? ? ? ? ?            ? 2014-10-07T20:00-0400


At this point I'll probably have to rebuild that filesystem as well, which means reinstalling the OS.

Also, I've switched back to the 3.16-3 kernel. I'm not sure I trust 3.17 at this point.

melmarker

mhh - maybe that's why btrfs is still declared experimental - we should move those things to experimental.
Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. (Benjamin Franklin, November 11, 1755)
Never attribute to malice that which can be adequately explained by stupidity. (Hanlon's razor)

dibl


I don't think the btrfs dev team will ever declare it "stable", but it has been "stable on a stable platform" for years now, in my experience as well as others.

I ran btrfsck about a week ago and there were no errors here.  I had replaced the WD1000s about a year ago, to be safe, since the original pair had 3 years of continuous runtime on them and I saw an occasional btrfs error on them.

Now that I have replaced the graphics card and the memory modules in my desktop, my hardware issues may or may not be resolved. I will know once it runs for a week without problems (there was some kind of video glitching that showed up as a blinking pixel pattern on the monitor). I think the systemd/BTRFS/partition table issue is unrelated to whatever was dying in my hardware, and it does not appear to cause any problems once the filesystem is mounted manually.

hightime

After I reinstall, how do I keep the kernel at version 3.16 when I do a dist-upgrade?

dibl

Use "apt-mark hold" on the linux-image package that you want to keep.

hightime

Quote
Use "apt-mark hold"

Thanks for that.

Do you make snapshots very often? I was wondering if the problem could be with snapshots. My second disk is btrfs, but I don't create snapshots on it. I use it for replicas of the snapshots from the first disk using send/receive. I did not have any filesystem corruption on the second disk. Just a thought.

dibl

I confess, I have not used the snapshot capability of BTRFS. My concern was not to capture the OS image -- it is on an ext4 filesystem on an SSD. But I have 960GB of user data that I want (a) readily available and (b) as secure as possible on spinning disks. So when I discovered the "multi-device" capability of BTRFS, where you can set up something approximating RAID 1 + RAID 0, that's what I did. A pair of WD1000 hdds holds a single BTRFS filesystem, with default striping of data and mirroring of metadata. Of course it gets backed up regularly, though not as often as it should. But the data do not change rapidly: old photos, music, videos, documents, etc. I've been running this setup since early 2011, 24/7/365, and have not lost a single byte of data. When the original WD drives passed 3 years, I noticed an occasional error message from btrfsck, so I bought another pair of new drives, backed up and restored, and it's been going well for another half year.