
Comments (36)

wzrdtales avatar wzrdtales commented on June 3, 2024

This happened after the upgrade to 3.56, which we then downgraded to 3.55, since 3.56 seems to have major bugs.


wzrdtales avatar wzrdtales commented on June 3, 2024

We can find multiple instances of, for example, ec.00 on different nodes, with different timestamps. It looks like the balancing logic is not working properly.


wzrdtales avatar wzrdtales commented on June 3, 2024

Our first guess is that this is caused by ec.balance or ec.rebuild; we guess that because it happened just overnight. Some files that are now not accessible anymore were accessible just yesterday.

Also, on the master overview page, ErasureCodingShards was showing a negative number for one node.


wzrdtales avatar wzrdtales commented on June 3, 2024

Have there been significant changes to the EC code since 3.43?


chrislusf avatar chrislusf commented on June 3, 2024

There were some benign-looking error-handling changes.


wzrdtales avatar wzrdtales commented on June 3, 2024

@chrislusf See the email for some in-depth details that I can't post here in this issue.

Here is my last mail, also for the issue tracker:

"
So, it took a lot of trial and error.

I was able to recover the file.

This is what I did to make it work:

mv /mnt/d3/weed/redacted-2_1859.ec02 /root

What I noticed is that all volume 1859 files were named redacted-1_1859, except for this single one, which was named redacted-2. I moved this 0-byte file (redacted-2) to a different location, and immediately the file download and verify worked again.
"


wzrdtales avatar wzrdtales commented on June 3, 2024

Now the question is: how can this happen? Is it normal for the same volumeId to exist in multiple collections, @chrislusf?


wzrdtales avatar wzrdtales commented on June 3, 2024

Scratch that, the file is again not accessible. Hmm.


wzrdtales avatar wzrdtales commented on June 3, 2024

So it's mostly not working and only sporadically worked for a moment; now the error changed to ReadEcShardIntervals: too few shards given.


wzrdtales avatar wzrdtales commented on June 3, 2024

Made it work again by getting rid of all 0-byte files of volume 1859.

They keep being created again and again, though. So this is the issue, at least for this file. I have to check what is going on with the other things, @chrislusf.


wzrdtales avatar wzrdtales commented on June 3, 2024

The error volumeId 1859 not found in fs.verify remains, though.


wzrdtales avatar wzrdtales commented on June 3, 2024

So, a summary:

0-byte .ec** files are being created. There are duplicates of .ec** files. There are sometimes even duplicates of the same volumeId with a different collection.

How this happened, I have no idea yet.


wzrdtales avatar wzrdtales commented on June 3, 2024

Found another instance of this pattern (with 5 different collections):

-rw-r--r-- 1 root root    0 Sep 22 07:06 /mnt/d4/weed/b1_1857.ec03
-rw-r--r-- 1 root root    0 Sep 22 09:10 /mnt/d4/weed/b2_1857.ec03
-rw-r--r-- 1 root root    0 Sep 22 11:13 /mnt/d4/weed/b3_1857.ec03
-rw-r--r-- 1 root root    0 Sep 22 18:37 /mnt/d1/weed/b4_1857.ec03
-rw-r--r-- 1 root root    0 Sep 22 15:05 /mnt/d1/weed/b5_1857.ec03

Not sure yet if this is something expected, though, @chrislusf.


chrislusf avatar chrislusf commented on June 3, 2024

What is the output of volume.list?

There should not be different collections with the same volume id 1857. I do not understand why it happened.


wzrdtales avatar wzrdtales commented on June 3, 2024
> volume.list -volumeId 1857
Topology volumeSizeLimit:1024 MB hdd(volume:3976/107208 active:3958 free:103232 remote:0)
  DataCenter dc1 hdd(volume:3976/107208 active:3958 free:103232 remote:0)
    Rack rack1 hdd(volume:3976/107208 active:3958 free:103232 remote:0)
      DataNode dt64:8080 hdd(volume:536/14487 active:534 free:13951 remote:0)
        Disk hdd(volume:536/14487 active:534 free:13951 remote:0)
          ec volume id:1857 collection:buck1 shards:[2 9]
        Disk hdd total size:0 file_count:0 
      DataNode dt64:8080 total size:0 file_count:0 
      DataNode dt65:8080 hdd(volume:537/14488 active:535 free:13951 remote:0)
        Disk hdd(volume:537/14488 active:535 free:13951 remote:0)
          ec volume id:1857 collection:buck1 shards:[1 8]
        Disk hdd total size:0 file_count:0 
      DataNode dt65:8080 total size:0 file_count:0 
      DataNode dt66:8080 hdd(volume:540/14492 active:539 free:13952 remote:0)
        Disk hdd(volume:540/14492 active:539 free:13952 remote:0)
          ec volume id:1857 collection:buck1 shards:[0 7]
        Disk hdd total size:0 file_count:0 
      DataNode dt66:8080 total size:0 file_count:0 
      DataNode dt67:8080 hdd(volume:537/14491 active:534 free:13954 remote:0)
        Disk hdd(volume:537/14491 active:534 free:13954 remote:0)
          ec volume id:1857 collection:buck1 shards:[10]
        Disk hdd total size:0 file_count:0 
      DataNode dt67:8080 total size:0 file_count:0 
      DataNode dt68:8080 hdd(volume:539/14492 active:537 free:13953 remote:0)
        Disk hdd(volume:539/14492 active:537 free:13953 remote:0)
          ec volume id:1857 collection:buck1 shards:[6 13]
        Disk hdd total size:0 file_count:0 
      DataNode dt68:8080 total size:0 file_count:0 
      DataNode dt69:8080 hdd(volume:537/14482 active:532 free:13945 remote:0)
        Disk hdd(volume:537/14482 active:532 free:13945 remote:0)
          ec volume id:1857 collection:buck1 shards:[6 13]
        Disk hdd total size:0 file_count:0 
      DataNode dt69:8080 total size:0 file_count:0 
      DataNode dt70:8080 hdd(volume:533/14391 active:530 free:13858 remote:0)
        Disk hdd(volume:533/14391 active:530 free:13858 remote:0)
          ec volume id:1857 collection:buck1 shards:[4]
        Disk hdd total size:0 file_count:0 
      DataNode dt70:8080 total size:0 file_count:0 
      DataNode dt71:8080 hdd(volume:217/5885 active:217 free:5668 remote:0)
        Disk hdd(volume:217/5885 active:217 free:5668 remote:0)
          ec volume id:1857 collection:buck1 shards:[3 5 11 12]
        Disk hdd total size:0 file_count:0 
      DataNode dt71:8080 total size:0 file_count:0 
    Rack rack1 total size:0 file_count:0 
  DataCenter dc1 total size:0 file_count:0 
total size:0 file_count:0 

None of these weird ones appear in the volume list.

This cluster is running two filers with leveldb. Is the synchronization between them enforced? I opened a ticket earlier where we had already noticed that this synchronization does not seem to actually work in all cases; we had keys that exist in only one filer.

We were already thinking about switching to cockroachdb as the filer backend to guarantee HA, but have yet to test whether scaling works with multiple filers accessing the same cockroachdb.


chrislusf avatar chrislusf commented on June 3, 2024

I do not see this in the volume.list output; the entries there all have the same bucket, buck1.

-rw-r--r-- 1 root root    0 Sep 22 07:06 /mnt/d4/weed/b1_1857.ec03
-rw-r--r-- 1 root root    0 Sep 22 09:10 /mnt/d4/weed/b2_1857.ec03
-rw-r--r-- 1 root root    0 Sep 22 11:13 /mnt/d4/weed/b3_1857.ec03
-rw-r--r-- 1 root root    0 Sep 22 18:37 /mnt/d1/weed/b4_1857.ec03
-rw-r--r-- 1 root root    0 Sep 22 15:05 /mnt/d1/weed/b5_1857.ec03


wzrdtales avatar wzrdtales commented on June 3, 2024

buck1 is my replacement for the original name and corresponds to b1 in that list.

As you correctly state, it is not in the list. That doesn't change the fact that the other output is generated by find /mnt/*/weed | grep 1857.ec03 | xargs sudo ls -alh and causes major problems.


chrislusf avatar chrislusf commented on June 3, 2024

the volume id should never be reused in other collections. I do not understand how it happened.


wzrdtales avatar wzrdtales commented on June 3, 2024

I am currently studying the code to understand what is happening. The most interesting thing about the random files from other collections is that only the actual collection has its .vif, .ecx, and .ecj files; the random ones do not.

-rw-r--r-- 1 root root    0 Sep 21 19:21 /mnt/d2/weed/b1_1857.ecj
-rw-r--r-- 1 root root 4.4K Sep 21 19:21 /mnt/d2/weed/b1_1857.ecx
-rwxr-xr-x 1 root root   78 Sep 21 19:21 /mnt/d2/weed/b1_1857.vif


wzrdtales avatar wzrdtales commented on June 3, 2024

I don't have a good guess yet, but it's either correlated to the multiple filers (although they should be syncing; as said just before, we already had issues with that not being true and items randomly missing, but I thought that had been fixed, because I didn't see it appear again in 3.43), or something else.

Who decides the volumeIds? The filer or the master?


wzrdtales avatar wzrdtales commented on June 3, 2024
"find /mnt/*/weed | grep _1857.* | xargs sudo ls -alh" 
-rw-r--r-- 1 root root    0 Sep 22 18:37 /mnt/d1/weed/b2_1857.ec03
-rw-r--r-- 1 root root    0 Sep 22 15:05 /mnt/d1/weed/b3_1857.ec03
-rw-r--r-- 1 root root 104M Sep 21 19:21 /mnt/d2/weed/b1_1857.ec06
-rw-r--r-- 1 root root 104M Sep 21 19:21 /mnt/d2/weed/b1_1857.ec13
-rw-r--r-- 1 root root    0 Sep 21 19:21 /mnt/d2/weed/b1_1857.ecj
-rw-r--r-- 1 root root 4.4K Sep 21 19:21 /mnt/d2/weed/b1_1857.ecx
-rwxr-xr-x 1 root root   78 Sep 21 19:21 /mnt/d2/weed/b1_1857.vif
-rw-r--r-- 1 root root    0 Sep 22 07:06 /mnt/d4/weed/b3_1857.ec03
-rw-r--r-- 1 root root    0 Sep 22 09:10 /mnt/d4/weed/b4_1857.ec03
-rw-r--r-- 1 root root    0 Sep 22 11:13 /mnt/d4/weed/b5_1857.ec03

Looking at this again: according to their timestamps, all of these extra shards under foreign collections definitely got created WAY later, so the volumeId had already existed for quite some time (multiple hours).


wzrdtales avatar wzrdtales commented on June 3, 2024

The next question, of course, is: why does seaweedfs get troubled by those foreign-collection files? I have not yet found the point where the shards get read in; it would be really helpful if you could point me there quickly.


wzrdtales avatar wzrdtales commented on June 3, 2024

By the way, this cluster was running fine for a long time on 3.43. The trouble started after the upgrade, first to 3.56 and then a downgrade to 3.55 due to a complete lockup caused by a bug in 3.56 (which you seem to have fixed already but not released yet).


chrislusf avatar chrislusf commented on June 3, 2024

the system is designed to have only unique volume ids.


wzrdtales avatar wzrdtales commented on June 3, 2024

the system is designed to have only unique volume ids.

I guessed that. So I can pinpoint that this is happening in 3.55, according to the timestamps, and nothing special happened.

The targets that come to mind that could cause this are:

ec.encode -fullPercent=95 -quietFor=1h
ec.rebuild -force
ec.balance -force

In 3.24 there was still a bug that caused ec.encode -fullPercent=95 -quietFor=1h to not address collections correctly on its own, so we had explicit ec.encode -collection=b1 -fullPercent=95 -quietFor=1h commands to work around this for a few buckets. In 3.55 this now works as expected. The extra collections on that volumeId actually contain collections that we did not erasure-code before, so it can't be any of the explicit encode commands, only the general one; otherwise, only the balancing and rebuilding are left.
I am currently still trying to work through the logic.

The biggest question for me is why collection_1857.ec03 and a parallel collection2_1857.ec03 are even a problem for the system right now. So far I have only found logic that explicitly builds the name, and nothing that filters only on the ending _volumeId.ecxx.


wzrdtales avatar wzrdtales commented on June 3, 2024

One more piece of information: it looks like these extra files were always created on the node that was being used for the reconstruction action at that moment.


wzrdtales avatar wzrdtales commented on June 3, 2024

OK, the 0-byte files could come from here:

outputFiles[shardId], err = os.OpenFile(shardFileName, os.O_TRUNC|os.O_WRONLY|os.O_CREATE, 0644)

or

openOption := os.O_TRUNC | os.O_CREATE | os.O_WRONLY

More likely the first one, by the looks of it.


wzrdtales avatar wzrdtales commented on June 3, 2024

OK, so 0-byte files can be created if a 0-byte file already exists:

if n == 0 {
    return nil
}

This return causes the whole rebuild procedure to cancel without writing anything to the new files, but the newly created 0-byte files are not cleaned up either.

So the possible scenario for the very first 0-byte file might be an unfortunate exit or crash of the volume server. I am not sure whether a crash is necessary or whether a plain exit would be enough. I will try to reproduce this with all this information; I might be barking up the wrong tree, though.
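
To illustrate the suspected mechanism, here is a minimal standalone sketch (not the actual seaweedfs code; the file names and the rebuildShard helper are made up): the output shard is created with O_TRUNC|O_CREATE before any data is read, so an early return on an empty input leaves a fresh 0-byte file behind.

package main

import (
    "fmt"
    "os"
)

// rebuildShard mimics the suspected pattern: the output file is created
// (and truncated) before we know whether there is anything to copy.
func rebuildShard(inputName, outputName string) error {
    out, err := os.OpenFile(outputName, os.O_TRUNC|os.O_WRONLY|os.O_CREATE, 0644)
    if err != nil {
        return err
    }
    defer out.Close()

    data, err := os.ReadFile(inputName)
    if err != nil {
        return err
    }
    if len(data) == 0 {
        // Early return analogous to `if n == 0 { return nil }`: nothing is
        // written, and the freshly created empty output file is never removed.
        return nil
    }
    _, err = out.Write(data)
    return err
}

func main() {
    os.WriteFile("input.ec00", nil, 0644) // a pre-existing 0-byte shard
    if err := rebuildShard("input.ec00", "output.ec00"); err != nil {
        fmt.Println(err)
    }
    if fi, err := os.Stat("output.ec00"); err == nil {
        fmt.Println("output size:", fi.Size()) // prints 0: another 0-byte file was left behind
    }
}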


wzrdtales avatar wzrdtales commented on June 3, 2024

That is not an explanation for the foreign collections at all yet, though.


wzrdtales avatar wzrdtales commented on June 3, 2024

I wrote a short script

import { globStream } from 'glob'
import fs from 'fs/promises'

const hashmap = {}   // volume id -> first non-empty shard seen (path minus mount prefix and extension)
const originals = {} // volume id -> full path of that first non-empty shard
const delay = {}     // volume id -> files seen before we can tell whether they are duplicates
const pt = '/mnt/d1/weed'.length // all mount points share the same prefix length

const gS = globStream(['/mnt/*/weed/*.ec[0-9][0-9]'])
gS.on('data', async path => {
  // e.g. /mnt/d2/weed/b1_1857.ec06 -> vol "1857", fin "/b1_1857"
  const vol = path.substring(path.lastIndexOf('_') + 1, path.length - 5)
  const fin = path.substring(pt, path.length - 5)
  const { size } = await fs.stat(path)

  if (!size) console.log(`0 byte ${path}`)

  if (!hashmap[vol] && size) {
    // first non-empty shard for this volume id: treat it as the original
    hashmap[vol] = fin
    originals[vol] = path
  } else if (hashmap[vol] && hashmap[vol] !== fin) {
    // same volume id but a different collection prefix: report a duplicate
    if (originals[vol]) {
      console.log(`original ${originals[vol]}`)
      delete originals[vol]
    }
    console.log(`duplicate ${path}`, size)
    delay[vol]?.forEach(x => console.log(`duplicate ${x.path}`, x.size))
    delete delay[vol]
  } else if (!hashmap[vol]) {
    // only empty files seen so far for this volume id: remember them in case
    // a conflicting collection shows up later
    if (!delay[vol]) delay[vol] = []
    delay[vol].push({ path, size })
  }
})

(needs npm i glob)

It identifies these oddities and also the collection duplicates.

With that I also found instances where

-rw-r--r-- 1 root root 103M Sep 22 21:11 /mnt/d3/weed/b1_1904.ec04
-rw-r--r-- 1 root root    0 Sep 22 04:09 /mnt/d4/weed/b2_1904.ec04

the actual collection came later than the foreign collection.

The only explanation I can think of is that req.Collection is actually not a stable reference but could change in the worst case. Nothing else makes much sense yet, but it is happening; I just don't know why yet.


wzrdtales avatar wzrdtales commented on June 3, 2024

I even found one example where there is an instance without a collection name plus only foreign collections, but not the actual one for the same EC shard; in the other cases the EC shard of the actual collection was always there.

-rw-r--r-- 1 root root 103M Sep 21 20:08 /mnt/d1/weed/b1_1944.ec08
-rw-r--r-- 1 root root    0 Sep 22 04:45 /mnt/d4/weed/b2_1944.ec02
-rw-r--r-- 1 root root 0 Sep 22 05:38 /mnt/d4/weed/1944.ec02


wzrdtales avatar wzrdtales commented on June 3, 2024

The only good thing to come out of all this is that I am getting to know the codebase in depth :p

So, something I already suspected: the info simply gets extracted at the underscore, which of course makes sense:

func parseCollectionVolumeId(base string) (collection string, vid needle.VolumeId, err error) {
    i := strings.LastIndex(base, "_")
    if i > 0 {
        collection, base = base[0:i], base[i+1:]
    }
    vol, err := needle.NewVolumeId(base)
    return collection, vol, err
}
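
To make that concrete, a tiny standalone example (plain string parsing with strconv instead of the needle package, so the names here are assumptions for illustration): both b1_1857 and b2_1857 yield volume id 1857 and differ only in the collection, so anything keyed on the volume id alone cannot tell the two files apart.

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// parse mirrors the idea of parseCollectionVolumeId, returning the volume id
// as a plain uint64 so the example has no seaweedfs dependencies.
func parse(base string) (collection string, vid uint64, err error) {
    if i := strings.LastIndex(base, "_"); i > 0 {
        collection, base = base[:i], base[i+1:]
    }
    vid, err = strconv.ParseUint(base, 10, 32)
    return collection, vid, err
}

func main() {
    for _, name := range []string{"b1_1857", "b2_1857"} {
        c, v, _ := parse(name)
        fmt.Printf("%s -> collection=%q volume=%d\n", name, c, v)
    }
    // Output:
    // b1_1857 -> collection="b1" volume=1857
    // b2_1857 -> collection="b2" volume=1857
}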

And regardless of the collection, they get added to the same slice:

collection, volumeId, err := parseCollectionVolumeId(baseName)
if err != nil {
    continue
}
if re.MatchString(ext) {
    if prevVolumeId == 0 || volumeId == prevVolumeId {
        sameVolumeShards = append(sameVolumeShards, fileInfo.Name())
    } else {
        sameVolumeShards = []string{fileInfo.Name()}
    }
    prevVolumeId = volumeId
    continue
}

That is why those foreign collections cause trouble.

I still have no idea, though, how this happened. My guess is that the streaming change you reverted may be related to this. We had 3.56 running for one day; then it completely locked up and marked all volumes non-writable, and we downgraded to 3.55. So maybe those files were generated by a bug in 3.56 and are causing trouble now in 3.55.

So the second piece of logic should probably also check that the collection is not something else and ignore these files (auto-deleting is probably only a good idea if it is a 0-byte file); see the sketch below.

@chrislusf
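
Purely as an illustration of that idea, a self-contained sketch (not the actual seaweedfs code; the parsing is re-implemented with strconv instead of the needle package): the grouping loop also tracks the previous collection, so a foreign-collection shard starts a new group instead of silently joining the existing one.

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// parseCollectionVolume splits a base name like "b1_1857" into ("b1", 1857).
func parseCollectionVolume(base string) (collection string, volumeId uint64, err error) {
    if i := strings.LastIndex(base, "_"); i > 0 {
        collection, base = base[:i], base[i+1:]
    }
    volumeId, err = strconv.ParseUint(base, 10, 32)
    return collection, volumeId, err
}

func main() {
    // A directory listing as the loop would walk it, sorted by name.
    fileNames := []string{"b1_1857.ec00", "b1_1857.ec01", "b2_1857.ec03", "b1_1858.ec00"}

    var sameVolumeShards []string
    var prevVolumeId uint64
    var prevCollection string

    for _, name := range fileNames {
        base := name[:strings.LastIndex(name, ".")] // drop the ".ecXX" extension
        collection, volumeId, err := parseCollectionVolume(base)
        if err != nil {
            continue
        }
        // Group only when BOTH the volume id and the collection match the
        // previous shard; a stray b2_1857.ec03 then starts its own group
        // instead of joining the b1_1857 shards.
        if prevVolumeId == 0 || (volumeId == prevVolumeId && collection == prevCollection) {
            sameVolumeShards = append(sameVolumeShards, name)
        } else {
            sameVolumeShards = []string{name}
        }
        prevVolumeId, prevCollection = volumeId, collection
        fmt.Println(name, "->", sameVolumeShards)
    }
}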


wzrdtales avatar wzrdtales commented on June 3, 2024

I think I might have found the issue @chrislusf

modifiedTsNs, err = writeToFile(copyFileClient, baseFileName+ext, throttler, isAppend, progressFn)

This is called without knowing whether the file already exists, and it creates the file with whatever name reached this function.

if _, err := vs.doCopyFile(client, true, req.Collection, req.VolumeId, math.MaxUint32, math.MaxInt64, dataBaseFileName, erasure_coding.ToExt(int(shardId)), false, false, nil); err != nil {

This is called by

_, copyErr := volumeServerClient.VolumeEcShardsCopy(context.Background(), &volume_server_pb.VolumeEcShardsCopyRequest{

which earlier gets called by

err := pickOneEcNodeAndMoveOneShard(commandEnv, averageShardsPerEcRack, ecNode, collection, vid, shardId, possibleDestinationEcNodes, applyBalancing)

called by

err := moveMountedShardToEcNode(commandEnv, existingLocation, collection, vid, shardId, destEcNode, applyBalancing)

called by

err := pickOneEcNodeAndMoveOneShard(commandEnv, averageShardsPerEcNode, ecNode, collection, vid, shardId, possibleDestinationEcNodes, applyBalancing)

At no step is the collection retrieved from the shard itself, nor are the shards filtered at any step by the collection passed as a parameter; yet the initial collection name is what reaches the final copy command in the end. I might have missed something, but I have checked it multiple times now.
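
For illustration only, a defensive check the receiving side could run before creating the destination file; this is a hypothetical helper (hasForeignCollection, the directory layout, and the call site are my assumptions, not the seaweedfs API): refuse the copy when the same volume id already exists locally under a different collection.

package main

import (
    "fmt"
    "os"
    "strings"
)

// hasForeignCollection reports whether dir already contains erasure-coding
// files for volumeId under a collection other than the requested one.
// File names are assumed to follow the "<collection>_<volumeId>.ec*" pattern.
func hasForeignCollection(dir, collection string, volumeId uint32) (string, bool) {
    entries, err := os.ReadDir(dir)
    if err != nil {
        return "", false
    }
    marker := fmt.Sprintf("_%d.ec", volumeId)
    for _, e := range entries {
        name := e.Name()
        i := strings.Index(name, marker)
        if i <= 0 {
            continue // different volume id, or no collection prefix at all
        }
        if existing := name[:i]; existing != collection {
            return existing, true
        }
    }
    return "", false
}

func main() {
    // Hypothetical usage on the receiving volume server, before the
    // os.OpenFile(..., os.O_CREATE, ...) call in the copy path.
    if existing, conflict := hasForeignCollection("/mnt/d4/weed", "b2", 1857); conflict {
        fmt.Printf("refusing shard copy: volume 1857 already exists locally in collection %q\n", existing)
    }
}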



wzrdtales avatar wzrdtales commented on June 3, 2024

Let me know if I missed something and am wrong, @chrislusf; I am on the road again for the next few hours.


chrislusf avatar chrislusf commented on June 3, 2024

The call graph is correct.


wzrdtales avatar wzrdtales commented on June 3, 2024

OK, then this is the issue.

