Coder Social home page Coder Social logo

Comments (19)

gvanrossum avatar gvanrossum commented on June 23, 2024 1

The transfer tool thought that your GH-32112 was a reference to bpo-32112 (rather than referring to a GitHub PR)

How could that happen? That seems terrible. :-(

from gh-migration.

JustAnotherArchivist avatar JustAnotherArchivist commented on June 23, 2024 1

I can get an estimate of this shortly.

from gh-migration.

JustAnotherArchivist avatar JustAnotherArchivist commented on June 23, 2024

Found two more:

python/cpython#91294 (comment) references GH-76305 instead of GH-32124.
python/cpython#91302 (comment) references GH-76242 instead of GH-32061.

from gh-migration.

ezio-melotti avatar ezio-melotti commented on June 23, 2024

It seems like GH-* were interpreted as issue number and converted automatically to the new migrated issue during the transfer. This automatic conversion was supposed to be disabled during the transfer, but apparently it wasn't.

from gh-migration.

vstinner avatar vstinner commented on June 23, 2024

Another example. In https://bugs.python.org/issue47075 I explicitly wrote I proposed GH-32112 for that. to get a link to my PR python/cpython#32112 but the converted issue python/cpython#91231 points to python/cpython#76293 i don't understand how GH-32112 was transformed to #76293 which is an unrelated issue (not a PR)

from gh-migration.

ezio-melotti avatar ezio-melotti commented on June 23, 2024

i don't understand how GH-32112 was transformed to #76293 which is an unrelated issue (not a PR)

During the transfer, https://bugs.python.org/issue32112 was transferred to python/cpython#76293. The transfer tool thought that your GH-32112 was a reference to bpo-32112 (rather than referring to a GitHub PR), and since bpo-32112 is now GH-76293, it converted GH-32112 into GH-76293.

from gh-migration.

JustAnotherArchivist avatar JustAnotherArchivist commented on June 23, 2024

Any update on whether/how/when this could hopefully get fixed?

from gh-migration.

vstinner avatar vstinner commented on June 23, 2024

The transfer tool thought that your GH-32112 was a reference to bpo-32112

Would it be possible to process all data once more time to detect the issue and fix it? This issue is quite annoying.

The current workaround is to go to the original issue, find the message to look for the correct GH issue/PR number.

from gh-migration.

serhiy-storchaka avatar serhiy-storchaka commented on June 23, 2024

I agree that it is very serious issue. For now we have incorrect references and the reader can not even know that they are incorrect.

Can we repeat translation which correctly detects GitHub references locally, compare it with the old incorrect translation, and apply the diff? The diff should be orders smaller that the whole data (but it still can be thousands messages).

from gh-migration.

ezio-melotti avatar ezio-melotti commented on June 23, 2024

Since this happened at transfer time, one of the test repos that I kept around might still have the correct references and it might be used to fix the links. Doing it from bpo is a bit trickier because there is quite a bit of processing that happened at import time (even though it should be possible to extract only the references).

The main issue is that this requires a mass-rewrite which will have to go through thousands of issues and comments and edit them, and this is not trivial to write and test. At this point we will also have to use the public API, which is rate limited. If it is done, the same infrastructure could also be used to replace bpo links with GH links in the first message of the PRs, and to @mention again nosy list members in the first message of issues. I think this will also edit the "last updated" field of all the affected issues/PRs, making it more difficult to find old/stale issues/PRs.

from gh-migration.

serhiy-storchaka avatar serhiy-storchaka commented on June 23, 2024

How many issues are affected? 100, 1000 or 10000? If it is only few 100s, we can survive editing the "last updated" field. But in long term it would be better to find a way to edit messages without changing the "last updated" field. We may need it to fix references to old Subversion and Mercurial revisions.

from gh-migration.

ezio-melotti avatar ezio-melotti commented on June 23, 2024

I don't have a number, and when we discussed different options to edit the messages after the fact and they all changed the "last updated" field (technically GitHub could rewrite the messages directly in the DB, but it's not something they would like to do and we would need their help).

from gh-migration.

JustAnotherArchivist avatar JustAnotherArchivist commented on June 23, 2024

There are about 19270 lines in messages on bpo (as of late March, shortly before the migration) that contain a link with the text GH-*. This number includes the 'New changeset' messages, which were migrated correctly, and I'm not sure how to filter them out (without a complete export of the GitHub issues to compare it directly). I don't see any differences between the two in bpo's HTML. For example, the links mentioned in the original issue are:

  • <a href="https://github.com/python/cpython/pull/31270" class="closed" title="GitHub PR 31270: [merged] Merged">GH-31270</a> – Incorrect in GH issue
  • <a href="https://github.com/python/cpython/pull/31313" title="GitHub PR 31313">GH-31313</a> – Incorrect in GH issue
  • <a href="https://github.com/python/cpython/pull/31270" class="closed" title="GitHub PR 31270: [merged] Merged">GH-31270</a> – Correct in GH issue

In addition, there are about 5993 lines in bpo messages that contain other links to GH PRs. Roughly half of these are PR # links (example), the others are mostly just plain URLs (example). A brief sampling seems to indicate that those were transferred correctly.

from gh-migration.

JustAnotherArchivist avatar JustAnotherArchivist commented on June 23, 2024

One thing I suspected regarding those correctly migrated links is that perhaps messages that first contain a bpo link and then a GH PR link worked. This does not appear to be the case: exhibit A, exhibit B.

I also found one issue with a weird link on GitHub: this comment was rewritten as python/issues-test-cpython#30631. It should clearly point to python/cpython#30631.

from gh-migration.

ezio-melotti avatar ezio-melotti commented on June 23, 2024

A few things that might help you:

  • the migration happened in two stages: the import from bpo to a new GitHub repo, and the transfer from that repo to python/cpython.
  • before the import, I did a lot of processing to fix/convert some links, add MarkDown markup, add the first message with the table, etc. In particular, I converted GH-* and PR-* references (but not plain URLs) using this regex:
      CPYTHON_PR = 'https://github.com/python/cpython/pull/'
      # PR-123, PR 123, PR123, pull request 123, GH 123, GH123
      # use a markdown link to preserve the original text -- the popup works
      (re.compile(r'(?P<text>(\b(?<![-/_])(PR-?|GH|pull\s*request))\s*'
                  r'(?P<pr_no>\d+))\b', re.I),
       r'[\g<text>](%s\g<pr_no>)' % CPYTHON_PR),
    and bpo issues with
      # issue1234, issue 1234, bpo 1234, #1234
      (re.compile(r'(?:(?<!\w)\#|\b(?<![-/_])(?:issue|bpo))'
                  r' ?(1?\d{4,6})\b', re.I),
       r'bpo-\1'),
  • during the transfer, the tool used by GitHub apparently converted some of the links. I know that one limitation of the tool is that it could only convert issues that had already been migrated. For example, if there was an issue with id 1234 that was migrated to e.g. https://github.com/python/cpython/issue/4321, issues that had references like #1234 could have been converted to #4321, but only in messages migrated after issue 1234. This was the reason why we converted all the issue references to bpo-* instead of relying on their link conversion (that was supposed to be disabled, since it wasn't relyable).

This number includes the 'New changeset' messages, which were migrated correctly, and I'm not sure how to filter them out (without a complete export of the GitHub issues to compare it directly).

Maybe they were migrated correctly because the regex used by GitHub to find links/references didn't match when the reference was surrounded by (...), like in the "New changesets" messages. All those message have the same format, so it shouldn't be difficult to filter them out (a msg.startswith('New changeset') is probably enough).

If you want to see the pre-transfer messages, I can give you access to one of the test repos. If you have other questions let me know.

from gh-migration.

JustAnotherArchivist avatar JustAnotherArchivist commented on June 23, 2024

Hmm, your regex would not match GH-1234 with a hyphen at all. This would suggest that such references were all rewritten by GitHub...?

I know that one limitation of the tool is that it could only convert issues that had already been migrated.

This makes sense. However, it should not affect PR references since all those PRs were already on GitHub beforehand.

Maybe they were migrated correctly because the regex used by GitHub to find links/references didn't match when the reference was surrounded by (...), like in the "New changesets" messages. All those message have the same format, so it shouldn't be difficult to filter them out (a msg.startswith('New changeset') is probably enough).

I can't easily filter for New changeset (since it's generally on a different line than the PR reference in the HTML served by bpo that I'm processing for this), but the hint about brackets is a good one! Firstly, I found several more of those python/issues-test-cpython examples: https://bugs.python.org/msg404903, https://bugs.python.org/msg364827, https://bugs.python.org/msg410766. I also found this example of the form (GH-# and GH-#) where the former link is correct but the latter goes to that test repo: https://bugs.python.org/msg331261python/cpython#53775 (comment). Overall, there are about 777 GH-# links that are not preceded by (.

from gh-migration.

ezio-melotti avatar ezio-melotti commented on June 23, 2024

Hmm, your regex would not match GH-1234 with a hyphen at all. This would suggest that such references were all rewritten by GitHub...?

You are right: I didn't rewrite the GH-* references because those are already recognized by GitHub, but other refs like GH 1234 or GH1234 or PR-1234 aren't, so I used the regex to rewrite them. Instead of converting them into GH-1234 I converted them into MarkDown links in order to preserve the original text.

This makes sense. However, it should not affect PR references since all those PRs were already on GitHub beforehand.

Note that there is quite a bit of overlapping between the PRs ids and bpo ids, so a bpo message that linked to the PR GH-1234 could have been mistaken by the tool for a link to an issue with id 1234 (since issues and PRs share the same namespace on GitHub), and after migrating bpo-1234 to GH-4321 all the references to PR GH-1234 could have been rewritten to GH-4321.

I can't easily filter for New changeset (since it's generally on a different line than the PR reference in the HTML served by bpo that I'm processing for this)

Instead of scraping the HTML directly (if this is what you are doing), you could use the XMLRPC interface to fetch all the bpo messages. Even if you are scraping HTML, it shouldn't be difficult to isolate the individual messages. If you need more info about this let me know, however it might be better to look at the test repo to see how the messages look like after the import (I gave you read access to one of them 1).

but the hint about brackets is a good one!

Glad that helped! I noticed similar problems while testing before the migration.

Firstly, I found several more of those python/issues-test-cpython examples: https://bugs.python.org/msg404903

This might be a separate bug in the transfer tool:

  • the original message on bpo contains the text: we get a resolution in GH-25718 as it will likely conflict.
  • the imported message contains the text: we get a resolution in GH-25718 as it will likely conflict. (since my regex didn't touch the GH-*s)
  • the transferred message contains the text: we get a resolution in python/issues-test-cpython#25718 as it will likely conflict. (note that python/issues-test-cpython is the repo where we imported the issues before the transfer to python/cpython)

So my export tool kept references to GH-* unchanged and converted other references (e.g. GH *, PR *, etc.) to MarkDown links. Then, the transfer tool:

  • converted some (all?) GH-* references to point to a different id
    • this happened when the id matched an already migrated issue id and when the GH-* wasn't preceded by other chars (like ();
    • the GH-* was converted into org/repo#ID;
    • sometimes it used the source repo (python/issues-test-cpython#4321) instead of the target repo (python/cpython#4321) -- not sure why
  • left full links alone (e.g. the first two in python/cpython#91294 (comment) - https://bugs.python.org/issue47138#msg416171)
  • maybe left Markdown links alone too

Can you confirm that your observations match these statements?

Footnotes

  1. I actually realized that the one I gave you access to (issues-test-cypthon-2) was used to test the transfer, so it had the same issues as the python/cpython repo. I now revoked your permissions to that repo and added you to a different repo (issues-test-2) that had the imported messages before they were transferred.

from gh-migration.

vstinner avatar vstinner commented on June 23, 2024

when we discussed different options to edit the messages after the fact and they all changed the "last updated" field (technically GitHub could rewrite the messages directly in the DB, but it's not something they would like to do and we would need their help).

I don't see the "last updated" change as an issue. It's a good thing that changes are traced and public, no?

from gh-migration.

ezio-melotti avatar ezio-melotti commented on June 23, 2024

Technically yes, however if we mass-update the issues you won't be able to find old or stale issues anymore since they will all show as recently updated. Before the migration we decided it would be better to preserve the original "last updated" date, but we could revisit that decision. Even if we do, we would still need a script to perform the mass update.

from gh-migration.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.