Glorious Gawk part II

Here’s a snapshot from a shell script that extracts various important segments from a PostScript (*.ps) file after conversion from PDF. It uses several gawk/awk tricks, including range patterns to bracket sections and checking record lengths (field counts) to discriminate lines from polylines. Your mileage may vary a little, but if you check the preamble of the *.ps file, you should be able to adapt this to your own tasks.

# line/text extract from pdf wjb 12/08, 02/08

[ -z "$1" ] && echo "usage: $0 <filename>" && exit "${NOFILE:-1}"

myfile=$1


#convert to ps

echo "converting pdf -> ps..."

pdf2ps "$myfile"

echo "...done"

#take out line drawing sections w/ line numbers

echo "extracting lines & text..."

# assumption: pdf2ps was given no output name, so it wrote <basename>.ps in the current directory
psfile="$(basename "$myfile" .pdf).ps"

awk '$4=="scale",$1=="Q" {print NR > "gHjLq.txt"}' "$psfile"
awk '$4=="scale",$1=="Q" {print $0 > "gHjLp.txt"}' "$psfile"
awk '$1=="q" {print NR > "test.txt"}' "$psfile"

# _p == polylines,  _l == lines

awk 'BEGIN { RS = "q" } ; {if (NF > 12) print NR,$0 > "gHjLp_p.txt"; else print NR,$0 > "gHjLp_l.txt"}'  gHjLp.txt

awk '$4=="scale" {print $1 > "gHjLp_foo.txt"}' gHjLp_p.txt

awk '$4=="scale",$1=="Q" {print NR,$0 > "gHjLp_mol.txt"}' "$psfile"

#take out text w/ line numbers
# x holds the previous line, so the line just before each $C marker is printed as well

awk '$4 == "," {print NR,$0 > "gHjLp_text.txt"}; $1 == "$C" {print NR,x > "gHjLp_text.txt"}; {x=$0}; $1 == "$C", $1 == "," {print NR,$0 > "gHjLp_text.txt"}' "$psfile"
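
To see the range-pattern trick from the script in isolation, here is a minimal sketch. The sample lines are made up for illustration and are not taken from real pdf2ps output; only the two tests (4th field is scale, 1st field is Q) come from the script above.

```shell
# Hypothetical sample input: only the "scale" ... "Q" stretch should be selected.
sample='q
1 0 0 scale
10 20 m
30 40 l
Q
showpage'

# Range pattern: start at a line whose 4th field is "scale",
# stop at (and include) the next line whose 1st field is "Q".
range_out=$(printf '%s\n' "$sample" | awk '$4 == "scale", $1 == "Q" { print NR, $0 }')

printf '%s\n' "$range_out"
# prints:
# 2 1 0 0 scale
# 3 10 20 m
# 4 30 40 l
# 5 Q
```

Both end lines are included in the range, which is why the closing Q shows up in the output. The comparison is case-sensitive, so the lowercase q on the first line does not end or start anything.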

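The lines/polylines split can be tried out the same way. This is only a sketch under assumptions: the two drawing fragments are invented, and the threshold of 12 fields is simply the one used in the script above, so check it against your own *.ps file. Unlike the script, the sketch skips empty records and prints a label instead of writing to files.

```shell
# Hypothetical drawing fragments; each one starts with the letter q.
sample='q 0 0 m 10 10 l S Q
q 0 0 m 1 1 l 2 2 l 3 3 l 4 4 l 5 5 l S Q'

# RS = "q" makes every q a record boundary, and NF counts the
# whitespace-separated tokens in each record.
# Empty records (NF == 0), such as the one before the first q, are skipped.
split_out=$(printf '%s\n' "$sample" | awk '
BEGIN { RS = "q" }
NF > 12 { print "polyline", NF; next }
NF > 0  { print "line", NF }')

printf '%s\n' "$split_out"
```

With this input the first fragment has 8 fields and is reported as a line, and the second has 20 fields and is reported as a polyline. Keep in mind that RS = "q" splits on every q in the input, including one in the middle of a word, so it only works cleanly when q appears solely as the graphics-state operator.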

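The text extraction relies on remembering the previous line in a variable, so that the line just before a marker can be printed once the marker is seen. A minimal sketch of that idea follows. The sample text is hypothetical, the variable name prev is mine (the script uses x), and the sketch prints NR - 1 so the number matches the remembered line, whereas the script prints NR.

```shell
# Hypothetical sample: we want the line that comes right before the $C marker.
sample='(Hello) Tj
$C
(World) Tj
other'

# The first rule runs while prev still holds the previous line;
# the second rule then updates prev for the next line.
prev_out=$(printf '%s\n' "$sample" | awk '
$1 == "$C" { print NR - 1, prev }
{ prev = $0 }')

printf '%s\n' "$prev_out"
# prints: 1 (Hello) Tj
```

The order of the two rules matters: if the assignment came first, prev would already hold the marker line itself by the time the test ran.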