Comments (19)
The transfer tool thought that your GH-32112 was a reference to bpo-32112 (rather than referring to a GitHub PR)
How could that happen? That seems terrible. :-(
from gh-migration.
I can get an estimate of this shortly.
from gh-migration.
Found two more:
python/cpython#91294 (comment) references GH-76305 instead of GH-32124.
python/cpython#91302 (comment) references GH-76242 instead of GH-32061.
from gh-migration.
It seems like GH-*
were interpreted as issue number and converted automatically to the new migrated issue during the transfer. This automatic conversion was supposed to be disabled during the transfer, but apparently it wasn't.
from gh-migration.
Another example. In https://bugs.python.org/issue47075 I explicitly wrote I proposed GH-32112 for that.
to get a link to my PR python/cpython#32112 but the converted issue python/cpython#91231 points to python/cpython#76293 i don't understand how GH-32112
was transformed to #76293
which is an unrelated issue (not a PR)
from gh-migration.
i don't understand how GH-32112 was transformed to #76293 which is an unrelated issue (not a PR)
During the transfer, https://bugs.python.org/issue32112 was transferred to python/cpython#76293. The transfer tool thought that your GH-32112 was a reference to bpo-32112 (rather than referring to a GitHub PR), and since bpo-32112 is now GH-76293, it converted GH-32112 into GH-76293.
from gh-migration.
Any update on whether/how/when this could hopefully get fixed?
from gh-migration.
The transfer tool thought that your GH-32112 was a reference to bpo-32112
Would it be possible to process all data once more time to detect the issue and fix it? This issue is quite annoying.
The current workaround is to go to the original issue, find the message to look for the correct GH issue/PR number.
from gh-migration.
I agree that it is very serious issue. For now we have incorrect references and the reader can not even know that they are incorrect.
Can we repeat translation which correctly detects GitHub references locally, compare it with the old incorrect translation, and apply the diff? The diff should be orders smaller that the whole data (but it still can be thousands messages).
from gh-migration.
Since this happened at transfer time, one of the test repos that I kept around might still have the correct references and it might be used to fix the links. Doing it from bpo is a bit trickier because there is quite a bit of processing that happened at import time (even though it should be possible to extract only the references).
The main issue is that this requires a mass-rewrite which will have to go through thousands of issues and comments and edit them, and this is not trivial to write and test. At this point we will also have to use the public API, which is rate limited. If it is done, the same infrastructure could also be used to replace bpo links with GH links in the first message of the PRs, and to @mention again nosy list members in the first message of issues. I think this will also edit the "last updated" field of all the affected issues/PRs, making it more difficult to find old/stale issues/PRs.
from gh-migration.
How many issues are affected? 100, 1000 or 10000? If it is only few 100s, we can survive editing the "last updated" field. But in long term it would be better to find a way to edit messages without changing the "last updated" field. We may need it to fix references to old Subversion and Mercurial revisions.
from gh-migration.
I don't have a number, and when we discussed different options to edit the messages after the fact and they all changed the "last updated" field (technically GitHub could rewrite the messages directly in the DB, but it's not something they would like to do and we would need their help).
from gh-migration.
There are about 19270 lines in messages on bpo (as of late March, shortly before the migration) that contain a link with the text GH-*
. This number includes the 'New changeset' messages, which were migrated correctly, and I'm not sure how to filter them out (without a complete export of the GitHub issues to compare it directly). I don't see any differences between the two in bpo's HTML. For example, the links mentioned in the original issue are:
<a href="https://github.com/python/cpython/pull/31270" class="closed" title="GitHub PR 31270: [merged] Merged">GH-31270</a>
– Incorrect in GH issue<a href="https://github.com/python/cpython/pull/31313" title="GitHub PR 31313">GH-31313</a>
– Incorrect in GH issue<a href="https://github.com/python/cpython/pull/31270" class="closed" title="GitHub PR 31270: [merged] Merged">GH-31270</a>
– Correct in GH issue
In addition, there are about 5993 lines in bpo messages that contain other links to GH PRs. Roughly half of these are PR #
links (example), the others are mostly just plain URLs (example). A brief sampling seems to indicate that those were transferred correctly.
from gh-migration.
One thing I suspected regarding those correctly migrated links is that perhaps messages that first contain a bpo link and then a GH PR link worked. This does not appear to be the case: exhibit A, exhibit B.
I also found one issue with a weird link on GitHub: this comment was rewritten as python/issues-test-cpython#30631
. It should clearly point to python/cpython#30631.
from gh-migration.
A few things that might help you:
- the migration happened in two stages: the import from bpo to a new GitHub repo, and the transfer from that repo to
python/cpython
. - before the import, I did a lot of processing to fix/convert some links, add MarkDown markup, add the first message with the table, etc. In particular, I converted
GH-*
andPR-*
references (but not plain URLs) using this regex:and bpo issues withCPYTHON_PR = 'https://github.com/python/cpython/pull/' # PR-123, PR 123, PR123, pull request 123, GH 123, GH123 # use a markdown link to preserve the original text -- the popup works (re.compile(r'(?P<text>(\b(?<![-/_])(PR-?|GH|pull\s*request))\s*' r'(?P<pr_no>\d+))\b', re.I), r'[\g<text>](%s\g<pr_no>)' % CPYTHON_PR),
# issue1234, issue 1234, bpo 1234, #1234 (re.compile(r'(?:(?<!\w)\#|\b(?<![-/_])(?:issue|bpo))' r' ?(1?\d{4,6})\b', re.I), r'bpo-\1'),
- during the transfer, the tool used by GitHub apparently converted some of the links. I know that one limitation of the tool is that it could only convert issues that had already been migrated. For example, if there was an issue with id
1234
that was migrated to e.g.https://github.com/python/cpython/issue/4321
, issues that had references like#1234
could have been converted to#4321
, but only in messages migrated after issue1234
. This was the reason why we converted all the issue references tobpo-*
instead of relying on their link conversion (that was supposed to be disabled, since it wasn't relyable).
This number includes the 'New changeset' messages, which were migrated correctly, and I'm not sure how to filter them out (without a complete export of the GitHub issues to compare it directly).
Maybe they were migrated correctly because the regex used by GitHub to find links/references didn't match when the reference was surrounded by (...)
, like in the "New changesets" messages. All those message have the same format, so it shouldn't be difficult to filter them out (a msg.startswith('New changeset')
is probably enough).
If you want to see the pre-transfer messages, I can give you access to one of the test repos. If you have other questions let me know.
from gh-migration.
Hmm, your regex would not match GH-1234
with a hyphen at all. This would suggest that such references were all rewritten by GitHub...?
I know that one limitation of the tool is that it could only convert issues that had already been migrated.
This makes sense. However, it should not affect PR references since all those PRs were already on GitHub beforehand.
Maybe they were migrated correctly because the regex used by GitHub to find links/references didn't match when the reference was surrounded by
(...)
, like in the "New changesets" messages. All those message have the same format, so it shouldn't be difficult to filter them out (amsg.startswith('New changeset')
is probably enough).
I can't easily filter for New changeset
(since it's generally on a different line than the PR reference in the HTML served by bpo that I'm processing for this), but the hint about brackets is a good one! Firstly, I found several more of those python/issues-test-cpython
examples: https://bugs.python.org/msg404903, https://bugs.python.org/msg364827, https://bugs.python.org/msg410766. I also found this example of the form (GH-# and GH-#)
where the former link is correct but the latter goes to that test repo: https://bugs.python.org/msg331261 → python/cpython#53775 (comment). Overall, there are about 777 GH-#
links that are not preceded by (
.
from gh-migration.
Hmm, your regex would not match GH-1234 with a hyphen at all. This would suggest that such references were all rewritten by GitHub...?
You are right: I didn't rewrite the GH-*
references because those are already recognized by GitHub, but other refs like GH 1234
or GH1234
or PR-1234
aren't, so I used the regex to rewrite them. Instead of converting them into GH-1234
I converted them into MarkDown links in order to preserve the original text.
This makes sense. However, it should not affect PR references since all those PRs were already on GitHub beforehand.
Note that there is quite a bit of overlapping between the PRs ids and bpo ids, so a bpo message that linked to the PR GH-1234
could have been mistaken by the tool for a link to an issue with id 1234
(since issues and PRs share the same namespace on GitHub), and after migrating bpo-1234
to GH-4321
all the references to PR GH-1234
could have been rewritten to GH-4321
.
I can't easily filter for New changeset (since it's generally on a different line than the PR reference in the HTML served by bpo that I'm processing for this)
Instead of scraping the HTML directly (if this is what you are doing), you could use the XMLRPC interface to fetch all the bpo messages. Even if you are scraping HTML, it shouldn't be difficult to isolate the individual messages. If you need more info about this let me know, however it might be better to look at the test repo to see how the messages look like after the import (I gave you read access to one of them 1).
but the hint about brackets is a good one!
Glad that helped! I noticed similar problems while testing before the migration.
Firstly, I found several more of those python/issues-test-cpython examples: https://bugs.python.org/msg404903
This might be a separate bug in the transfer tool:
- the original message on bpo contains the text:
we get a resolution in GH-25718 as it will likely conflict.
- the imported message contains the text:
we get a resolution in GH-25718 as it will likely conflict.
(since my regex didn't touch theGH-*
s) - the transferred message contains the text:
we get a resolution in python/issues-test-cpython#25718 as it will likely conflict.
(note thatpython/issues-test-cpython
is the repo where we imported the issues before the transfer topython/cpython
)
So my export tool kept references to GH-*
unchanged and converted other references (e.g. GH *
, PR *
, etc.) to MarkDown links. Then, the transfer tool:
- converted some (all?)
GH-*
references to point to a different id- this happened when the id matched an already migrated issue id and when the
GH-*
wasn't preceded by other chars (like(
); - the
GH-*
was converted intoorg/repo#ID
; - sometimes it used the source repo (
python/issues-test-cpython#4321
) instead of the target repo (python/cpython#4321
) -- not sure why
- this happened when the id matched an already migrated issue id and when the
- left full links alone (e.g. the first two in python/cpython#91294 (comment) - https://bugs.python.org/issue47138#msg416171)
- maybe left Markdown links alone too
Can you confirm that your observations match these statements?
Footnotes
-
I actually realized that the one I gave you access to (
issues-test-cypthon-2
) was used to test the transfer, so it had the same issues as thepython/cpython
repo. I now revoked your permissions to that repo and added you to a different repo (issues-test-2
) that had the imported messages before they were transferred. ↩
from gh-migration.
when we discussed different options to edit the messages after the fact and they all changed the "last updated" field (technically GitHub could rewrite the messages directly in the DB, but it's not something they would like to do and we would need their help).
I don't see the "last updated" change as an issue. It's a good thing that changes are traced and public, no?
from gh-migration.
Technically yes, however if we mass-update the issues you won't be able to find old or stale issues anymore since they will all show as recently updated. Before the migration we decided it would be better to preserve the original "last updated" date, but we could revisit that decision. Even if we do, we would still need a script to perform the mass update.
from gh-migration.
Related Issues (20)
- Issue (re)numbering HOT 9
- Replace the local_replace.py script HOT 1
- Map bpo users to GitHub users HOT 15
- Notify bpo users once the migration is done HOT 12
- Migration and risk management plans HOT 2
- Make bpo read-only HOT 3
- Add links from bpo to GitHub
- Set up autonosy on labels HOT 2
- Add a page that redirects from bpo to GitHub HOT 3
- Fate of Roundup and the instances HOT 1
- Subversion revisions references were not ported HOT 6
- Archive bpo-related repos HOT 6
- Message conversion and formatting HOT 2
- Write a tool to export data from bpo HOT 1
- Map bpo issue metadata to GitHub fields/labels HOT 7
- Replace the weekly summary report HOT 20
- Replace the sendmail script HOT 13
- Replace the irker detector HOT 3
- Replace the stats page HOT 5
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from gh-migration.