Comments (10)
Apparently not all unicode characters are supported. A certain set of them (=Unicode 4.1) are, which are defined in PEP-3131. See this link for the full list. I'll choose some exotic ones from this list to test with.
from pyflame.
Can someone point to the part of the code where this change would have to be done? I'm not terribly proficient at C++ but can take a look and maybe a first pass
from pyflame.
@nichochar I'm not completely familiar with the code, but it seems that the Unicode support must be implemented somewhere around the pystring code: https://github.com/uber/pyflame/blob/master/src/pystring.cc#L38
from pyflame.
I updated the top-level comment on this issue to provide more background, if anyone wants to take a stab.
from pyflame.
I have got a first (very ugly) version of this to work. I will clean it up over the next couple of days, but then I would certainly like to get some feedback. I have made some assertions right now that may or may not be allowed, and it will probably not work for early versions of Python 3 (<3.3) either.
I also ran into the issue where FlameGraph does not really handle UTF-8 correctly either, so that will have to be fixed before people can actually use it.
EDIT:
I made a pull request to FlameGraph to interpret all input as UTF-8
from pyflame.
I have pushed a commit which I would like some feedback on.
Besides the bunch of TODOs and asserts, The most important part I think is the discrepancy in the signature of Python 3's StringData() function. It currently returns a std::string, because it does all the heavy lifting itself. I think I have to do it this way, because the address of the second PtracePeekBytes depends on the result of some processing on the first PtracePeekBytes. I could move this processing and PtracePeekBytes stuff to the FollowFrame() function, but that would only make everything uglier and harder to understand.
EDIT:
Fixed url to commit. I committed it to my "master" branch before, which apparently also meant it was automatically included in my pull request for the line-number fix. It is now in its own WIP branch.
from pyflame.
@jackvreeken Sorry for not seeing this earlier. Your commit looks pretty good to me! Note that some stuff in frob.cc
has changed very slightly, and #70 is also changing the interfaces for the Python 3.6 changes. But neither of those changes are too invasive.
As for testing, I think it's sufficient to create copies of the test Python scripts in the tests/
directory with unicode names, and then test against those as well. Ideally we could just do this using a symlink, so that we don't need to keep changs in files in sync. If that doesn't work (e.g. because Python resolves the file name after following the link) we could change runtests.sh
to create copies of the files and then clean them up after.
I don't know how to test every file encoding (perhaps you do), but if we just confirmed that non-ASCII worked that would be good enough for me.
from pyflame.
Alright, when #81 gets merged I will rebase on top and fix anything that has changed. For testing I would not just want to test unicode filenames, but also unicode function names (just in case).
What do you mean by every file encoding? Do you mean Latin-1, UCS-1, UCS-2 and UCS-4, or other exotic types like BIG5 and SHIFT-JIS? The latter two are handled by Python itself, so we only have to deal the "standard" set. I think Python just chooses the smallest one that fits to store strings, so I could come up with some filenames/function names that don't fit in UCS-2 but do fit in UCS-4. I would have to check that though.
Our output is always UTF-8 by the way, as that is what FlameGraph expects.
from pyflame.
That change is landed now, and I released it as 1.4.0. If we can fix this issue, I will tag/release 1.5.0.
Good call on testing Unicode function names -- that's easier to test anyway (although I agree we should still have some Unicode file name tests).
For the encoding, what I remember when I first wrote the code for Python 3 string handling, is that Python internally uses one of four possible representations (e.g. ASCII, UCS-{1,2,4}, etc.) based on what bytes are actually in the string. Ideally we would have file/function names that test all possible internal representations (or at least, all of the ones Pyflame supports). I know the ASCII heuristic is just to use the ASCII internal representation if the string is representable as ASCII, for multi-byte encodings I'm not sure how it decides which multi-byte representation it wants to use.
For the very large characters, I know the newest emoji are large (I believe the terminology is that they're in the astral plane?). For instance, https://unicode-table.com/en/2603/ is three bytes in UTF-8, and https://unicode-table.com/en/1F356/ is four bytes in UTF-8.
from pyflame.
This is fixed by #92 , contributed by @jackvreeken . I've tagged and released v1.5.0 which includes this change!
from pyflame.
Related Issues (20)
- Cannot build on ubuntu 14.04 arm HOT 1
- Building pyflame on aarch64 HOT 3
- Pyflame fails on multithreaded applications with setuid HOT 3
- Work with uwsgi run on Docker HOT 2
- Add support for Python 3.7 HOT 6
- Support for Chrome 58+ style profiles
- flame-chart-json: ValueError: invalid literal for int() with base 10: '(idle) 135\n' HOT 1
- Python 3.6.7 support, seems broken HOT 3
- Profile until the profiled process is terminated? HOT 2
- How can I get the details of each thread ?
- x86_64 ubuntu14.04 not work for Failed to seize PID HOT 2
- Failed to PTRACE_PEEKDATA - in miniconda environment HOT 4
- Failed to PTRACE_PEEKDATA (pid 25920, addr 0x55c4c233b5a8) in anaconda environment HOT 1
- Make pyflame create output after itself being terminated HOT 1
- Package for pypi/pip HOT 2
- pyflame seems to be unmaintained HOT 3
- [RFC] Supports multiples PID (-p PID1 -p PID2 ...)
- Segfaults with --threads in OSQP
- Failed to locate libpython within timeout period when execute pyflame -p PID in python 2.7 HOT 1
- make fail on agx-xavier
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from pyflame.