I need to left-pad the array column of a PySpark DataFrame with zeros, without using a pandas UDF.
Input DataFrame:
|lags|
|----|
|[0]|
|[0,1,2]|
|[0,1]|
Output DataFrame:
|lags|
|----|
|[0,0,0]|
|[0,1,2]|
|[0,0,1]|
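For anyone who wants to reproduce this, the input can be built as a small DataFrame like so (a minimal sketch; the column name lags comes from the question, and Spark infers the element type as long):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the input table above
df = spark.createDataFrame(
    [([0],), ([0, 1, 2],), ([0, 1],)],
    ['lags'],
)
df.show()
# +---------+
# |     lags|
# +---------+
# |      [0]|
# |[0, 1, 2]|
# |   [0, 1]|
# +---------+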
You can use array_repeat to create a zero-padding array and concat it onto the front of each row. Use @ARCrow's function (further down) to identify the max array size instead of hard-coding it.

from pyspark.sql import functions as F

max_arr_size = 3  # maximum array length (see below for deriving it from the data)
df = (df.withColumn('pad', F.array_repeat(F.lit(0), max_arr_size - F.size('lags')))  # zeros to prepend
        .withColumn('padded', F.concat('pad', 'lags')))
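For completeness, here is a self-contained sketch of this approach that computes the max size with a one-line aggregation instead of hard-coding 3 (the names max_arr_size and padded are just illustrative, and it assumes Spark 3.x, where array_repeat accepts a column for the count, as the snippet above already does):

from pyspark.sql import functions as F

# Longest array in the column
max_arr_size = df.agg(F.max(F.size('lags')).alias('m')).collect()[0]['m']

padded = (df
          .withColumn('pad', F.array_repeat(F.lit(0), F.lit(max_arr_size) - F.size('lags')))  # zeros to prepend
          .withColumn('lags', F.concat('pad', 'lags'))                                        # left-padded result
          .drop('pad'))
padded.show()
# +---------+
# |     lags|
# +---------+
# |[0, 0, 0]|
# |[0, 1, 2]|
# |[0, 0, 1]|
# +---------+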
This is the approach referenced above: first find the max array size from the data, then build the zero prefix with sequence and transform.

from pyspark.sql import functions as f

max_size = (df
    .withColumn('array_size', f.size(f.col('lags')))
    .groupBy()
    .agg(f.max(f.col('array_size')).alias('max_size'))
    .collect()[0].max_size)

df = (df
    .withColumn('lags', f.when(f.col('lags').isNull(), f.array(*[])).otherwise(f.col('lags')))  # to deal with null values
    .withColumn('pre_zeros', f.sequence(f.lit(0), f.lit(max_size) - f.size(f.col('lags'))))
    .withColumn('zeros', f.expr('transform(slice(pre_zeros, 1, size(pre_zeros) - 1), element -> 0)'))
    .withColumn('final_lags', f.concat(f.col('zeros'), f.col('lags'))))
df.show()
And the output is:
+---------+------------+---------+----------+
| lags| pre_zeros| zeros|final_lags|
+---------+------------+---------+----------+
| [0]| [0, 1, 2]| [0, 0]| [0, 0, 0]|
|[0, 1, 2]| [0]| []| [0, 1, 2]|
| [0, 1]| [0, 1]| [0]| [0, 0, 1]|
| []|[0, 1, 2, 3]|[0, 0, 0]| [0, 0, 0]|
+---------+------------+---------+----------+
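Note that sequence(0, n) produces n + 1 elements, which is why slice(pre_zeros, 1, size(pre_zeros) - 1) drops one element before transform maps everything to 0. If you only need the padded column, you can select it out of the result (a small usage note; column names as in the snippet above):

result = df.select(f.col('final_lags').alias('lags'))
result.show()
# +---------+
# |     lags|
# +---------+
# |[0, 0, 0]|
# |[0, 1, 2]|
# |[0, 0, 1]|
# |[0, 0, 0]|
# +---------+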