Read a CSV file into a DataFrame.
Path to a file or a file-like object (by “file-like object” we refer to objects that have a `read()` method, such as a file handler like the builtin `open` function, or a `BytesIO` instance). If `fsspec` is installed, it will be used to open remote files.
Indicate if the first row of the dataset is a header or not. If set to False, column names will be autogenerated in the following format: `column_x`, with `x` being an enumeration over every column in the dataset, starting at 1.
Columns to select. Accepts a list of column indices (starting at zero) or a list of column names.
Rename columns right after parsing the CSV file. If the given list is shorter than the width of the DataFrame, the remaining columns keep their original names.
Single byte character to use as separator in the file.
A string used to indicate the start of a comment line. Comment lines are skipped during parsing. Common examples of comment prefixes are `#` and `//`.
Single byte character used for csv quoting, default = `"`. Set to None to turn off special handling and escaping of quotes.
Start reading after `skip_rows` lines.
Provide the schema. This means that polars doesn’t do schema inference. This argument expects the complete schema, whereas `schema_overrides` can be used to partially overwrite a schema.
Overwrite dtypes for specific or all columns during schema inference.
Values to interpret as null values. You can provide a:
`str`: All values equal to this string will be null.
`List[str]`: All values equal to any string in this list will be null.
`Dict[str, str]`: A dictionary that maps column name to a null value string.
By default a missing value is considered to be null; if you would prefer missing utf8 values to be treated as the empty string, set this parameter to True.
Try to keep reading lines if some lines yield errors. Before using this option, try to increase the number of lines used for schema inference with e.g. `infer_schema_length=10000`, or override automatic dtype inference for specific columns with the `schema_overrides` option, or use `infer_schema_length=0` to read all columns as `pl.String` to check which values might cause an issue.
Try to automatically parse dates. Most ISO8601-like formats can be inferred, as well as a handful of others. If this does not succeed, the column remains of data type `pl.String`. If `use_pyarrow=True`, dates will always be parsed.
Number of threads to use in csv parsing. Defaults to the number of physical CPUs of your system.
The maximum number of rows to scan for schema inference. If set to `0`, all columns will be read as `pl.String`. If set to `None`, the full data may be scanned (this is slow).
Number of lines to read into the buffer at once. Modify this to change performance.
Stop reading from CSV file after reading `n_rows`. During multi-threaded parsing, an upper bound of `n_rows` rows cannot be guaranteed.
Lossy means that invalid utf8 values are replaced with `�` characters. When using encodings other than `utf8` or `utf8-lossy`, the input is first decoded in memory with Python. Defaults to `utf8`.
Reduce memory pressure at the expense of performance.
Make sure that all columns are contiguous in memory by aggregating the chunks into a single array.
Try to use pyarrow’s native CSV parser. This will always parse dates, even if `try_parse_dates=False`. This is not always possible. The set of arguments given to this function determines if it is possible to use pyarrow’s native parser. Note that pyarrow and polars may have a different strategy regarding type inference.
Extra options that make sense for `fsspec.open()` or a particular storage connection, e.g. host, port, username, password, etc.
Skip this number of rows when the header is parsed.
Insert a row index column with the given name into the DataFrame as the first column. If set to `None` (default), no row index column is created.
Start the row index at this offset. Cannot be negative. Only used if `row_index_name` is set.
Set the sample size. This is used to sample statistics to estimate the allocation needed.
Single byte end of line character (default: `\n`). When encountering a file with Windows line endings (`\r\n`), one can go with the default `\n`. The extra `\r` will be removed when processed.
When there is no data in the source, `NoDataError` is raised. If this parameter is set to False, an empty DataFrame (with no columns) is returned instead.
Truncate lines that are longer than the schema.
Parse floats using a comma as the decimal separator instead of a period.
Expand path given via globbing rules.
Notes
If the schema is inferred incorrectly (e.g. as `pl.Int64` instead of `pl.Float64`), try to increase the number of lines used to infer the schema with `infer_schema_length` or override the inferred dtype for those columns with `schema_overrides`.
This operation defaults to a `rechunk` operation at the end, meaning that all data will be stored contiguously in memory. Set `rechunk=False` if you are benchmarking the csv-reader. A `rechunk` is an expensive operation.
Examples
>>> pl.read_csv("data.csv", separator="|")
Demonstrate use against a BytesIO object, parsing string dates.
>>> from io import BytesIO
>>> data = BytesIO(
... b"ID,Name,Birthday\n"
... b"1,Alice,1995-07-12\n"
... b"2,Bob,1990-09-20\n"
... b"3,Charlie,2002-03-08\n"
... )
>>> pl.read_csv(data, try_parse_dates=True)
shape: (3, 3)
┌─────┬─────────┬────────────┐
│ ID  ┆ Name    ┆ Birthday   │
│ --- ┆ ---     ┆ ---        │
│ i64 ┆ str     ┆ date       │
╞═════╪═════════╪════════════╡
│ 1   ┆ Alice   ┆ 1995-07-12 │
│ 2   ┆ Bob     ┆ 1990-09-20 │
│ 3   ┆ Charlie ┆ 2002-03-08 │
└─────┴─────────┴────────────┘