Coder Social home page Coder Social logo

"Buggy" initial scan? about efutool HOT 15 CLOSED

zybexxl avatar zybexxl commented on June 25, 2024
"Buggy" initial scan?

from efutool.

Comments (15)

zybexXL avatar zybexXL commented on June 25, 2024

Hi @ExSport,
Apologies for not replying sooner - looks like I missed your post, didn't get any notification for it.
Thank for reporting these issues. Version 1.0.6 fixes the wrong folder sizes on first scan - they now match the updated folder sizes.

Regarding the "1 changed" - this is the root share folder itself. EFUTool uses the folder timestamp to determine if the contents are changed or not. However, a share has no timestamp, so EFUTool always considers the root folder to be changed to force a rescan/refresh of the root contents. That's why it will always say "1 changed". If your root folder is not a share (like "\\server\share\folder") then a valid timestamp will exist and you will not see this behavior.

Your rescan times are also a bit strange: the 12s one seems to indicate some caching going on (on the share server). I would also expect subsequent scans to be faster than the first scan, but they're not (except that 12s one). EFUTool is really great for huge volumes (I test it on shares with millions of files and about 100K folders) - perhaps it's not so fast for your particular setup.

Let me know if you find any other issues - Github notifications are enabled, but add a mention to @zybexXL just in case.

from efutool.

ExSport avatar ExSport commented on June 25, 2024

Hello @zybexXL
Thanks for your fix.
Still some weirdness happens :)

  1. remote server UNC path has about 65 ms RTT.
  2. With such RTT the initial scan takes 2:30-2:50 min.
  3. rescan takes 3:10-3:30 min.
  4. One rescan took 20sec only. After byte to byte comparison of the old and new EFU files, there was no file change (filelist order starts with files from root folder as any other rescan changes this order so root files are moved from start to the different position)
  5. As stated above, initial scan has filelist starting with files from root folder but it is not true for rescans where the order is changed - always! Only once no change was done - at time rescan took 20sec only. Unfortunately after maybe 20 rescans I spotted such fast rescan only once. All the other time rescan was slower then initial scan and filelist order was changed. But not between additional scans!
  6. It seems when there is no change during rescan in fileorder in EFU file, rescan was very fast. Initial filelist starts with files from root but don't know why, next rescan changed this order and makes rescans took longer.
  7. share is static, no changes done there during testing
  8. Spotted that some folders have different size when initial scan and rescans compared. When I did few other new initial scans and rescans, it doesn't happen again. In this example, the initial scan shows incorrect (1774383598) folder size. Btw. all folders have "compression bit" set.
    initial scan:
    "\\smbserver\temp\Server_scans",1774383598,132045312240733675,131763931934786691,10256
    rescans:
    "\\smbserver\temp\Server_scans",1769402862,132045312240733675,131763931934786691,10256
  9. Tested \smbserver\h$\temp with using admin share and tool still shows 1 folder changed (TEMP timestamps are visible). Adding additional trailing slash: \smbserver\h$\temp\ shows 2 folders changed. It generates different filelists as it shows root files in TEMP folder twice when trailing "" is used after "TEMP". It seems it takes TEMP and TEMP\ as different folders.

Thanks for looking into it 👍

from efutool.

zybexXL avatar zybexXL commented on June 25, 2024

Hi @ExSport,
The first scan always writes the root folder contents first, while the re-scan will write the non-modified folders first, and then the modified ones (re-scanned). Since Roots are always considered to be modified, the root contents are written last in re-scans. This is by design and doesn't affect functionality.

You are right, the root folder is ALWAYS marked as new/changed. I did this because of DFS links (mountpoints) which are common in server environments. DFS links have a static timestamp which does not change even when contents of the root folder change, and it's not trivial to detect if a given folder is a DFS link or not (there's no attribute indicating it). To work around it and since DFS links are mostly used as root folders/shares, I've decided to always consider the root folders as "changed".

I've published v1.07 which fixes the trailing slash issue. You would see "2 changed" because the app would count both the root folder you indicated on the command line, plus the root folder already present on the existing EFU file - in the end they would be found to be the same folder (so scan results are correct), but still counted 2 changes. In practice, when doing re-scans, you don't need to specify which folder you are scanning - the app gets that from the existing EFU file already.

Regarding scan times - a compressed volume forces the OS to extract an entire directory even when requesting only a single entry, so when EFUTool scans for changes, all directories have to be de-compressed (by the OS), so I'm guessing that negates any speed advantage of the change-detection algorithm.

The filesize difference you saw... are you sure that isn't real? Maybe some app added/deleted a file in one of the subfolders of that path between your scans. Sometimes just going to a folder with images causes Windows to create/update a hidden thumbs.db file - you should see this change in your EFU comparison too though.

Br,
ZybexXL

from efutool.

ExSport avatar ExSport commented on June 25, 2024

Hi @zybexXL
What I remember from past, in Powershell I used for DFS checks properties/attributes like Link/Junction/ReparsePoint or to get DFS link properties I used something like:

$signature = @'
using System;
using System.Collections.Generic;
using System.Management.Automation;
using System.Runtime.InteropServices;

public class Win32Api
{
    [DllImport("netapi32.dll", SetLastError = true)]
    private static extern int NetApiBufferFree(IntPtr buffer);

    [DllImport("Netapi32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern int NetDfsGetClientInfo
    (
    [MarshalAs(UnmanagedType.LPWStr)] string EntryPath,
    [MarshalAs(UnmanagedType.LPWStr)] string ServerName,
    [MarshalAs(UnmanagedType.LPWStr)] string ShareName,
    int Level,
    ref IntPtr Buffer
    );

    public struct DFS_INFO_3
    {
        [MarshalAs(UnmanagedType.LPWStr)]
        public string EntryPath;
        [MarshalAs(UnmanagedType.LPWStr)]
        public string Comment;
        public UInt32 State;
        public UInt32 NumberOfStorages;
        public IntPtr Storages;
    }
    public struct DFS_STORAGE_INFO
    {
        public Int32 State;
        [MarshalAs(UnmanagedType.LPWStr)]
        public string ServerName;
        [MarshalAs(UnmanagedType.LPWStr)]
        public string ShareName;
    }

    public static List<PSObject> NetDfsGetClientInfo(string DfsPath)
    {
        IntPtr buffer = new IntPtr();
        List<PSObject> returnList = new List<PSObject>();

        try
        {
            int result = NetDfsGetClientInfo(DfsPath, null, null, 3, ref buffer);

            if (result != 0)
            {
                throw (new SystemException("Error getting DFS information"));
            }
            else
            {
                DFS_INFO_3 dfsInfo = (DFS_INFO_3)Marshal.PtrToStructure(buffer, typeof(DFS_INFO_3));

                for (int i = 0; i < dfsInfo.NumberOfStorages; i++)
                {
                    IntPtr storage = new IntPtr(dfsInfo.Storages.ToInt64() + i * Marshal.SizeOf(typeof(DFS_STORAGE_INFO)));

                    DFS_STORAGE_INFO storageInfo = (DFS_STORAGE_INFO)Marshal.PtrToStructure(storage, typeof(DFS_STORAGE_INFO));

                    PSObject psObject = new PSObject();

                    psObject.Properties.Add(new PSNoteProperty("State", storageInfo.State));
                    psObject.Properties.Add(new PSNoteProperty("ServerName", storageInfo.ServerName));
                    psObject.Properties.Add(new PSNoteProperty("ShareName", storageInfo.ShareName));

                    returnList.Add(psObject);
                }
            }
        }
        catch (Exception e)
        {
            throw(e);
        }
        finally
        {
            NetApiBufferFree(buffer);
        }
        return returnList;
    }
}
'@

        if (-not ('Win32Api' -as [Type])) {
            Add-Type -TypeDefinition $signature
        }
$DFS = [Win32Api]::NetDfsGetClientInfo($CheckPath)

About the speed. Ok, I hoped there will be some glitch so 20seconds after the fix will be the default behavior but it seems due to some weird glitch user/I got 20sec rescan times what is not good, like in case intermittent network problem during EFUTool rescan what led to skipping "something" what created so short scans.
About the compression bit, if the full traverse is forced locally, still it is better that send such full traverse list from remote to the client as SMB overhead on such slow networks is killing any dirtree queries as user can see in Wireshark. Definitely SMB is not good for queries via WAN networks.
I will try to clear compression bit if anything change in speeds of rescan.
In IT you can't be sure by anything but as I said, the share is "dead backup folder" where no changes are made for ages, by anything (Antivirus, backups, manually browsed by users, etc...) so yes, it should be totally static and as you can see, only some initial scans shown wrong size but any other rescan was ok. Other tries I did, also the initial scan was ok, so there must be some glitch somewhere:-) Anyway folder timestamp stays intact also in case folder size differs so it doesn't seem there is some temp file created and deleted afterwards so during the scans the size changes. I can understand it in case folder timestamp will also change but here there is no change in time but in size only. Weird.
Btw. my tests were done on Wifi via VPN so maybe some network glitch can do this weird behavior. But I should understand it in case wrong folder size will be smaller (some date were not queried and summed up) but not bigger like in our example. Totally weird.
Will try to do more tests later to dig this annoying behavior :)
Thx

from efutool.

zybexXL avatar zybexXL commented on June 25, 2024

@ExSport, thanks for all the feedback!

You do have an environment which is not ideal for performance... Wifi+VPN+Compression... :) We can't easily know what's going on. The 20sec rescan time is what I would expect on normal circumstances (7x faster than first scan), but there's nothing in the code that would make a 3rd scan different.

Getting DFS info for each folder would slow down things too much. NetDfsGetClientInfo is fast, but only returns info already stored on the local client cache - if a given server folder is not in cache yet, then there's no info.
The proper solution would perhaps be to use NetDfsEnum and NetDfsGetInfo to get a list of DFS links on the target volume, but that's outside of scope for now. I may add it some point in the future, but the current "assume it's changed" behavior solves the immediate problem.

from efutool.

ExSport avatar ExSport commented on June 25, 2024

@zybexXL
Hi again
So I created script to repeat thousand times creation of initial EFU files. Some hundreds/thousands runs there were no difference, other runs the script logged many anomalies in folder sizes. Checked few of them and....
Btw. I used New-NetQosPolicy to shape in/out data to 1Mbit/s to simulate slower network as I tested it between servers with RTT <1ms. I also run it between two locations connected by DirectConnect, so very reliable connection, RTT about 40ms due to the distance, 500Mbit/s and spotted same issues.
So what I spotted:
Comparing different EFU files between each other (good vs. bad), there is a difference in folder sizes. The difference is in 99% exactly 4096 bytes (cluster size/allocation unit size of the volume). Only in one example I spotted the difference 8192 bytes.
So I summed files from different/incorrect folders defined in EFU file and the sum is correct but not the counted size defined in EFU.
It seems in some circumstances EFU tool incorrectly "rounds" folder sizes which it gets from indexed files. File sizes are correct, but not the final size of its parents folder what EFU tool counts.
It was very unpredictable when EFU tool started to generate incorrect EFU files as sometimes everything was ok after many cycles so I restarted job and in some cases it started to generate invalid EFU files, other time everything was OK, for many hours....so very very hardly reproducible.
Hopefully you will get idea what is "wrong" with info where the difference lies and that it is exactly 4096 bytes 👍
Thx

from efutool.

zybexXL avatar zybexXL commented on June 25, 2024

@ExSport
The code is deterministic, it cannot "sometimes" forget about 4 or 8KB. The code just calls into the OS to get a list of folder contents and adds it up. Any change as you describe must come from the OS API call results.

It sounds like there's some TMP file being created on the target volume (taking 1 or 2 clusters), which sometimes happens during scanning. You should see this file on your EFU compares though.

I suggest you run the same stress test with a simple "dir /s \target\share > someLogFile.txt" and see if you spot the same difference in folder sizes/contents. This is basically what EFUTool is doing :)

from efutool.

ExSport avatar ExSport commented on June 25, 2024

@zybexXL
I understand you got file sizes from OS API (NtQueryInformationFile, GetFileSizeEx,...) but I suppose there is no option how to ask OS the size of folder size property so you must sum them up manually or am I wrong? E.g. in EFU file there is a folder with incorrect TOTAL size but its files have correct filesize indexed. If I manually count them, SUM is different with the TOTAL in EFU file for root folder. Another run file sizes are same=correct and also TOTAL of parent folder is OK. When TOTAL is incorrect, there is no change for any of files in subfolders, parent of subfolders itself. It means there was no temporary file added/removed. Strange, right?

As stated above, there is no new TMP file indexed in EFU file, no timestamp change, only incorrect TOTAL (sometimes EFU file have 2 incorrect TOTALS, other time zero or ten. It is unpredictable).

I already did this dir traversing tests during stress tests I described in the morning. There was no difference. I run it on two different servers, at time when EFU file generated incorrect TOTALS on one server but on the other server it was OK, also in such case when EFU files differ between servers, DIR log was exactly same between servers. No size change, file change, timestamp, anything. Totally weird :)
Unfortunately creating EFU export can take only 40 secs or 2mins per server, DIR log export took 3-4 hours.

Powershell or cmd is my daily job so I did a lot of stress tests to see what is happening but unfortunately some tests are so slow so they are useless and when I reduced the scope, it was hard to reproduce it. Tried it locally where speed is ok, but I was not able to reproduce it.
And I don't expect there is some TEMP file generated but deleted immediately so it is not captured by EFU Tool or my scripts and without modifying any TIMESTAMP :)

Btw. how you get TOTALS for folders? Nevermind, we are on GitHub...C#...as I said you count files to get folder size :) So again, if you understood me bad. It is not problem the size of files you get via OS API, problem sometimes arise for TOTALS which are calculated for folders. Unfortunately to reproduce it is a nightmage :))

Thx for support.... Btw. don't lose much time with it, it is nothing critical. I reported it only as I spotted such behavior during few tests (I don't use your tool right now for anything, only wanted to test speed difference between your tool and Everything 👍)
Anyway great tool, thanks for it!

from efutool.

zybexXL avatar zybexXL commented on June 25, 2024

@ExSport
Don't worry, I too like a good discussion and a puzzle :)
There's no "total folder size" property than can be queried. A given folder size is the sum of all files in it, plus the sizes of all subfolders in it (recursively). That is what the code calculates.

If the inputs are always the same, then the outputs MUST always be the same. This is what I mean by "deterministic". There are only 3 ways to get different results on different scans:

  • there is some change to the filesystem (most likely)
  • there's a random element of calculation in the code (there is not, in this case - it's deterministic, 2+2=4 each and every time)
  • you have some hardware/RAM issue that is causing some bit flips! (veeeery unlikely)

Perhaps you can send me a couple of EFU files with those differences?

from efutool.

ExSport avatar ExSport commented on June 25, 2024

@zybexXL
I thought it is clear that I know there is no size property for dir, that it is counted in your code. ;)
It was a reaction on your response that it can't be different as the sizes are what OS API returns, no special code on your side.

  • There should not be any and EFU file with indexed files confirms it there is no new one or even there is no change in file sizes or file/folder timestamps.
  • I agree, but some unknown aspect sometimes change folder size for 4096 bytes. Who knows if it is some caching, timeout or compression bit weirdness which creates such inconsistency. I know, 2+2 is 4 but...
  • No as I can reproduce it on VMware VM, Amanon VM or Azure VM. Very unlikely all have memory corruption :)

I will try to reproduce it again on some general data and upload it.

from efutool.

ExSport avatar ExSport commented on June 25, 2024

@zybexXL
Here you go: https://github.com/ExSport/UniversalMediaServer/blob/2cd3194afaf1ccb0851c4a815b352077bf09f779/stresstest-short-efutool.zip
Further info is provided in hash-short.txt file.

  • Generally I slowed down network so UNC path is not so responsive (servers have <1ms RTT)
  • iterate efutool thousand times to get wrong efu file. For validity I checked hash for such EFU file to see if there is something unusual or not.
  • forgot what else I wanted to say :)

from efutool.

zybexXL avatar zybexXL commented on June 25, 2024

@ExSport
Thank you for the logs. I think I may have found the cause, please try latest v1.0.8 :)

When asking windows for a folder info (FindFirstFileW/FindNextFileW) the returned size for directories SHOULD be zero. I was assuming that is the case, so I was using the returned value as the initial folder size. However, it is possible that Windows API is not always resetting that field to zero, and sometimes returning 4K/8K/whatever. In v8, I'm now ignoring the returned folder size and setting it manually to zero.

Let me know if you can still reproduce it.

from efutool.

ExSport avatar ExSport commented on June 25, 2024

@zybexXL that is exactly what I expected it is happening. That sometimes API will return nonzero value for folders. I even wrote post about such possible behaviour but then deleted it as meanwhile I tested it in powershell and saw the size property is not initialized at all :)
Great you found the culprit. Will test it hopefully soon 👍
Thanks

from efutool.

ExSport avatar ExSport commented on June 25, 2024

@zybexXL
After about 2000 and 6000 iterations from two different servers, it seems generated EFU files are finally consistent 👍
Thanks. You can close it. I will reopen it in case I will spot another issues with another test folders.
👍 🥇 👍

from efutool.

zybexXL avatar zybexXL commented on June 25, 2024

Great! Thanks for finding the issue and sticking with it until we solved it.

from efutool.

Related Issues (3)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.