parallel processing in PDAL

Frankfurt Airport tunnel.JPG
Frankfurt Airport tunnel” by Peter IsotaloOwn work. Licensed under CC BY-SA 3.0 via Commons.

In my ongoing quest to process all the LiDAR data for Pennsylvania and Ohio into one gigantic usable dataset, I finally had to break down and learn how to do parallel processing in BASH. Yes, I still need to jump on the Python band wagon (the wagon is even long in the tooth, if we choose to mix metaphors), but BASH makes me soooo happy.

So, in a previous post, I wanted to process relative height in a point cloud. By relative height, I mean height relative to ground. PDAL has a nice utility for this, and it’s pretty easy to use, if you get PDAL installed successfully.

pdal translate 55001640PAN.las 55001640PAN_height.bpf height --writers.bpf.output_dims="X,Y,Z,Height";

Installing PDAL is not too easy, so I used the dockerized version of PDAL and it worked great. Problem is, the dockerized version complicates my commands on the command line, especially if I want to run it on a bunch of files.

Naturally, the next step is to run it on a whole bunch of LiDAR files. For that I need a little control script which I called, and then I need to run that in a “for” loop.

# Get the pathname from the input value

# Get the short name of the file, sans path and sans extension
name=`basename $1 .las`

docker run -v $pathname:/data pdal/master pdal translate //data/"$name".las //data/"$name"_height.bpf height --writers.bpf.output_dims="X,Y,Z,Intensity,ReturnNumber,NumberOfReturns,ScanDirectionFlag,EdgeOfFlightLine,Classification,ScanAngleRank,UserData,PointSourceId";

Now we need a basic for loop will take care of sending the las files into our, thus looping through all available las files:

for OUTPUT in $(ls *.las); do ~/./ $OUTPUT; done;

This is great, but I calculated it would take 13 days to complete on my 58366 LiDAR files. We’re talking approximately 41,000 square miles of non-water areas for Ohio, and approximately 44,000 square miles of non-water areas for Pennsylvania. I’m on no particular timeline, but I’m not really that patient. Quick duckduckgo search later, and I remember the GNU Parallel project. It’s wicked easy to use for this use case.

ls *.las | parallel -j24 ~/./

How simple! First, we list our las files, then we “pipe” them as a list to parallel, we tell parallel we want it to spawn 24 independent processes using that list as the input for our pdal_height script.

Now we can run it on 24 cores simultaneously. Sadly, I have slow disks 😦 so really I ran it like this:

ls *.las | parallel -j6 ~/./

Time to upgrade my disks! Finally, I want to process my entire LiDAR dataset irrespective of location. For this, we use the find command, name the starting directory location, and tell it we want to search based on name.

find /home/myspecialuser/LiDAR/ -name "*.las" | parallel -j6 ~/./

Estimated completion time: 2 days. I can live with that until I get better disks. Only one problem, I should make sure this doesn’t stop if my network drops for any reason. Let’s wrap this in nohup which will prevent network-based hangups:

nohup sh -c 'find /home/myspecialuser/LiDAR -name "*.las" | parallel -j6 ~/./ {}'

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.