“Frankfurt Airport tunnel” by Peter Isotalo – Own work. Licensed under CC BY-SA 3.0 via Commons.
In my ongoing quest to process all the LiDAR data for Pennsylvania and Ohio into one gigantic usable dataset, I finally had to break down and learn how to do parallel processing in BASH. Yes, I still need to jump on the Python bandwagon (the wagon is even long in the tooth, if we choose to mix metaphors), but BASH makes me soooo happy.
So, in a previous post, I wanted to process relative height in a point cloud. By relative height, I mean height relative to ground. PDAL has a nice utility for this, and it’s pretty easy to use, if you get PDAL installed successfully.
pdal translate 55001640PAN.las 55001640PAN_height.bpf height --writers.bpf.output_dims="X,Y,Z,Height";
Installing PDAL is not too easy, so I used the dockerized version of PDAL and it worked great. Problem is, the dockerized version complicates my commands on the command line, especially if I want to run it on a bunch of files.
Naturally, the next step is to run it on a whole bunch of LiDAR files. For that I need a little control script which I called pdal_height.sh, and then I need to run that in a “for” loop.
#!/bin/bash
# Get the absolute directory of the input file (docker -v needs an absolute path,
# and a bare file name like 55001640PAN.las has no directory part to strip)
pathname="$(cd "$(dirname "$1")" && pwd)"
# Get the short name of the file, sans path and sans extension
name="$(basename "$1" .las)"
# Run the dockerized PDAL with that directory mounted at /data
docker run -v "$pathname":/data pdal/master pdal translate //data/"$name".las //data/"$name"_height.bpf height --writers.bpf.output_dims="X,Y,Z,Intensity,ReturnNumber,NumberOfReturns,ScanDirectionFlag,EdgeOfFlightLine,Classification,ScanAngleRank,UserData,PointSourceId";
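The path handling is the fiddly part of that script, so it is worth checking on its own before docker gets involved. Here is a minimal sketch of the split into "directory to mount" and "short name", using dirname and basename. The helper name split_las_path is made up for this illustration and is not part of pdal_height.sh.

```shell
#!/bin/bash
# split_las_path is a hypothetical helper name used only for this sketch.
split_las_path() {
    dir=$(dirname "$1")          # directory part; "." when there is no slash
    name=$(basename "$1" .las)   # file name without path or .las extension
    printf '%s %s\n' "$dir" "$name"
}

split_las_path /home/myspecialuser/LiDAR/55001640PAN.las
split_las_path 55001640PAN.las
```

The second call shows what happens with a bare file name: the directory comes back as ".", which still has to be turned into an absolute path before it can be handed to docker's -v option.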
Now we need a basic for loop that will take care of sending the las files into our pdal_height.sh, thus looping through all available las files:
for OUTPUT in *.las; do ~/./pdal_height.sh "$OUTPUT"; done;
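Before letting a loop like this chew through thousands of files, a dry run that only prints what it would do can save some grief. The sketch below is one way to do that. The function name list_las_commands and the throwaway demo directory are assumptions made for the example, and it prints a command line instead of calling the real script.

```shell
#!/bin/bash
# list_las_commands is a hypothetical helper: it prints the command that would
# run for each .las file in a directory instead of running it.
list_las_commands() {
    for f in "$1"/*.las; do
        # With no matches the pattern stays literal, so skip names that do not exist
        [ -e "$f" ] || continue
        printf 'pdal_height.sh %s\n' "$f"
    done
}

# Example against a throwaway directory with empty placeholder files
demo=$(mktemp -d)
touch "$demo/a.las" "$demo/b c.las" "$demo/notes.txt"
list_las_commands "$demo"
rm -r "$demo"
```

Looping over the glob directly, rather than over the output of ls, keeps a file name with a space in it as one item.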
This is great, but I calculated it would take 13 days to complete on my 58,366 LiDAR files. We’re talking approximately 41,000 square miles of non-water areas for Ohio, and approximately 44,000 square miles of non-water areas for Pennsylvania. I’m on no particular timeline, but I’m not really that patient. A quick DuckDuckGo search later, and I remembered the GNU Parallel project. It’s wicked easy to use for this use case.
ls *.las | parallel -j24 ~/./pdal_height.sh
How simple! First, we list our las files, then we “pipe” them as a list to parallel. We tell parallel to run up to 24 jobs at a time, one per file in that list, each one calling our pdal_height script with that file as its argument.
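To see exactly which commands parallel would launch without launching anything, GNU Parallel has a --dry-run option. The sketch below wraps it in a hypothetical helper, preview_jobs, and falls back to a plain read loop when GNU Parallel is not installed, so the snippet still runs anywhere. The file names a.las, b.las and c.las are placeholders.

```shell
#!/bin/bash
# preview_jobs is a hypothetical helper: it reads file names on stdin and prints
# one command line per file, without executing any of them.
preview_jobs() {
    script=$1
    if command -v parallel >/dev/null 2>&1 &&
       parallel --version </dev/null 2>/dev/null | grep -q 'GNU parallel'; then
        # -k keeps output in input order; --dry-run prints commands, never runs them
        parallel -k -j6 --dry-run "$script"
    else
        # Fallback when GNU Parallel is missing: print the same kind of line by hand
        while IFS= read -r las; do
            printf '%s %s\n' "$script" "$las"
        done
    fi
}

printf '%s\n' a.las b.las c.las | preview_jobs ./pdal_height.sh
```

Each input line becomes one command with the file name appended to the end, which is the same thing the explicit {} placeholder spells out.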
Now we can run it on 24 cores simultaneously. Sadly, I have slow disks 😦 so really I ran it like this:
ls *.las | parallel -j6 ~/./pdal_height.sh
Time to upgrade my disks! Finally, I want to process my entire LiDAR dataset irrespective of location. For this, we use the find command, name the starting directory location, and tell it we want to search based on name.
find /home/myspecialuser/LiDAR/ -name "*.las" | parallel -j6 ~/./pdal_height.sh
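The back-of-the-envelope math is simple enough to script. This sketch takes the figures from this post (13 days serial, 58,366 files, 6 jobs) and assumes perfectly linear speedup, which slow disks will not quite deliver. The function names are made up for the example.

```shell
#!/bin/bash
# Assumption: perfect linear speedup with the number of parallel jobs.
# estimate_days and seconds_per_file are hypothetical helper names.
estimate_days() {
    # $1 = serial run time in days, $2 = number of parallel jobs
    awk -v d="$1" -v j="$2" 'BEGIN { printf "%.1f\n", d / j }'
}

seconds_per_file() {
    # $1 = serial run time in days, $2 = number of files
    awk -v d="$1" -v f="$2" 'BEGIN { printf "%.1f\n", d * 86400 / f }'
}

estimate_days 13 6            # prints 2.2
seconds_per_file 13 58366     # prints 19.2
```

So each file takes roughly 19 seconds on its own, and six jobs at once bring the whole run down to a bit over two days.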
Estimated completion time: 2 days. I can live with that until I get better disks. Only one problem: I should make sure this doesn’t stop if my network connection drops for any reason. Let’s wrap this in nohup, which makes the command ignore the hangup signal that gets sent when the terminal session goes away:
nohup sh -c 'find /home/myspecialuser/LiDAR -name "*.las" | parallel -j6 ~/./pdal_height.sh {}'
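For a job that runs for days, it can also help to push it into the background and send its output to a log file that can be checked later. Here is a minimal sketch of that pattern with a one-second stand-in job in place of the real find and parallel pipeline. The helper name run_detached and the variable DETACHED_PID are assumptions for this example.

```shell
#!/bin/bash
# run_detached is a hypothetical helper: it starts a command under nohup in the
# background, sends stdout and stderr to a log file, and records the PID.
run_detached() {
    logfile=$1
    shift
    nohup "$@" > "$logfile" 2>&1 &
    DETACHED_PID=$!
}

# Stand-in for the long job; the real one would be the find | parallel line
log=$(mktemp)
run_detached "$log" sh -c 'echo started; sleep 1; echo finished'
echo "running in background as PID $DETACHED_PID"

# Only the demo waits; for the real job you would log out and check the log later
wait "$DETACHED_PID"
cat "$log"
rm -f "$log"
```

Because stdout is redirected, nohup does not create its default nohup.out file, and running tail -f on the log shows progress from any later session.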