Comments (4)
I'm not sure if there will be much interest in this feature.
Agreed @Aloqeely - CSV reading in pandas is already very complex and I think we would like to simplify it where possible. This feature seems too niche to me to have it supported directly in pandas.
"If I have a large dataset with 100K rows, and I want to select 1000 rows at random, I first need to load the whole dataset into my RAM."
How does one determine which 1000 rows to choose without knowing the total number of rows? Wouldn't that require reading the entire file anyway?
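One general answer to the question above is reservoir sampling: the file is still read once from start to end, but only the selected rows are held in memory at any time. The following is a minimal sketch using only the standard library, not a pandas feature; the helper names `reservoir_sample` and `sample_csv_lines` are hypothetical, and it assumes a single header line and no quoted fields that contain line breaks.

```python
import io
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Pick k items uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


def sample_csv_lines(handle, k: int,
                     rng: Optional[random.Random] = None) -> str:
    """Return CSV text holding the header plus k randomly chosen data lines.

    Assumes one header line and that no record spans several lines.
    """
    header = next(handle).rstrip("\r\n")
    rows = reservoir_sample((line.rstrip("\r\n") for line in handle), k, rng)
    return "\n".join([header] + rows) + "\n"


# Example with an in-memory file; a real file object opened with open() works the same way.
demo = io.StringIO("a,b\n" + "".join(f"{i},{i * 2}\n" for i in range(100)))
sampled_text = sample_csv_lines(demo, 10, random.Random(0))
```

The resulting text can then be parsed with pd.read_csv(io.StringIO(sampled_text)). Note that the sampled rows do not keep their original file order.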
from pandas.
Thanks for the request, but it seems like there's not much interest in this feature, so closing.
I think you can use skiprows for this (I did not test it):
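import random
import pandas as pd

def selectFirstNRandom(size: int):
    selected_count = 0  # number of data rows kept so far

    def skip(index: int) -> bool:
        nonlocal selected_count  # needed to update the counter in the enclosing scope
        if index == 0:
            return False  # always keep the header (assumes it is on the first line)
        if selected_count >= size:
            return True  # enough rows selected, skip everything else
        skip_row = random.random() > 0.5  # keep each row with probability 0.5
        if not skip_row:
            selected_count += 1
        return skip_row

    return skip

df = pd.read_csv('your_file.txt', skiprows=selectFirstNRandom(100))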
@twoertwein's skiprows solution probably works, but I think using chunksize and iterator is cleaner, although it's less random.
it = pd.read_csv("test.csv", chunksize=1000, iterator=True)
chunk1 = it.get_chunk() # returns DataFrame with first 1000 lines
chunk2 = it.get_chunk() # returns DataFrame with the next thousand lines
For more information on the chunksize and iterator parameters, you can see the IO Tools user guide.
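To make the chunked approach less tied to the start of the file, one option is to draw a random fraction from every chunk and concatenate the results. This is only a sketch under assumptions: the function name `sample_in_chunks` and its parameters are hypothetical, it returns a fraction of the rows rather than an exact count, and it expects the file to contain at least one data row.

```python
import io
import pandas as pd


def sample_in_chunks(source, frac: float, chunksize: int = 1000, seed=None):
    """Read a CSV chunk by chunk and keep a random fraction of each chunk."""
    parts = []
    for n, chunk in enumerate(pd.read_csv(source, chunksize=chunksize)):
        # Use a different but reproducible seed per chunk when a seed is given.
        state = None if seed is None else seed + n
        parts.append(chunk.sample(frac=frac, random_state=state))
    # Only one chunk plus the sampled rows are in memory at any time.
    return pd.concat(parts, ignore_index=True)


# Example with an in-memory file; a path such as "test.csv" works the same way.
demo = io.StringIO("a,b\n" + "".join(f"{i},{i * 2}\n" for i in range(100)))
sampled = sample_in_chunks(demo, frac=0.1, chunksize=20, seed=0)
```

Every row has the same chance of being kept, but the whole file is still read once.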