Unix & Linux Asked by edperez on November 26, 2021
I’m trying to get gawk to split a text file into a different file each time a paragraph contains an occurrence of a code of the form "7-04/PNLP-000001". So, for instance, if the original text file contains the following:
Proposición no de Ley 7-04/PNLP-000009, relativa a la línea Ave Sevilla-Córdoba-Madrid.
La señora PRESIDENTA
Proposición no de Ley 7-04/PNLP-000001, relativa a la restitución y mejora de los derechos y la cobertura social de los trabajadores del medio rural andaluz.
La señora PRESIDENTA
I would like to get a file with this content:
Proposición no de Ley 7-04/PNLP-000009, relativa a la línea Ave Sevilla-Córdoba-Madrid.
La señora PRESIDENTA
and another with this content:
Proposición no de Ley 7-04/PNLP-000001, relativa a la restitución y mejora de los derechos y la cobertura social de los trabajadores del medio rural andaluz.
La señora PRESIDENTA
I’m trying to do this with this code:
gawk '
/^\n.+[0-9]-[0-9]{2}\/.+-[0-9]{6}$/
{if (p) close (p)
p = sprintf("split%05i.txt", ++i) }
{ print > p; }
' input.txt
However, this just creates one file per line, whatever its content. Does anyone know what I’m doing wrong? Thanks in advance!
I would do it like this:
perl -ne 'my $fh="/dev/stdout"; if(/7-04\/PNLP-(\d+)/) { close $fh; open($fh,">/path/to/outputfiles/file$1"); } ; print $fh $_;' < /path/to/inputfile
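For comparison, roughly the same idea (naming each output file after the number in the code) could be sketched in awk with match() and substr(). This is only an illustration: the sample input, the temporary directory and the "file" prefix are made up, and lines before the first code are simply skipped.

```shell
# Made-up sample input in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' \
  'Ley 7-04/PNLP-000009, first item' \
  'La senora PRESIDENTA' \
  'Ley 7-04/PNLP-000001, second item' \
  'La senora PRESIDENTA' > "$workdir/input.txt"

(
  cd "$workdir" &&
  awk 'match($0, /7-04\/PNLP-[0-9]+/) {
    if (file) close(file)
    # "7-04/PNLP-" is 10 characters long; the digits after it name the file.
    file = "file" substr($0, RSTART + 10, RLENGTH - 10)
  }
  file { print > file }' input.txt
)
ls "$workdir"
```

With this input, the scratch directory should end up holding file000009 and file000001 next to input.txt.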
Answered by Garo on November 26, 2021
You're close:
awk '/[0-9]-[0-9]{2}\/[[:upper:]]+-[0-9]{6}/ {
  if (file) close (file)
  file = sprintf("split%05i.txt", ++i)
}
file {print > file}' input.txt
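You want the { if... } code block to be run for the lines that match the [0-9]... pattern, so its opening { should be on the same line as the /.../. Otherwise awk sees two separate rules: the pattern on its own (whose default action is to print the line) and a block with no pattern, which runs for every line. That is why you get one file per line.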
The second code block, {print > file}, is to be run for every record as long as file is set, using file as the condition.
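A small sketch of how awk reads the two layouts. The three-line input and the simplified /PNLP/ pattern are made up for illustration; the point is only how often the block runs.

```shell
# Made-up sample input: only the 1st and 3rd lines contain "PNLP".
input='A 7-04/PNLP-000009
B
C 7-04/PNLP-000001'

# Pattern and block on separate lines: two independent rules.
# Rule 1 (/PNLP/ alone) prints matching lines; rule 2 runs for EVERY line.
split_rules=$(printf '%s\n' "$input" | awk '
/PNLP/
{ n++ }
END { print "block ran " n " times" }')

# Pattern and block on the same line: one rule, run only on matching lines.
same_line=$(printf '%s\n' "$input" | awk '
/PNLP/ { n++ }
END { print "block ran " n " times" }')

printf '%s\n---\n%s\n' "$split_rules" "$same_line"
```

The first variant prints the two matching lines and reports 3 runs of the block; the second reports 2.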
Having \n in your pattern here doesn't make sense, as each record that awk processes in turn is the contents of one line (the default record separator, RS, is \n), so a record is never going to contain a newline character. You also don't want to anchor your regexp here (^ and $), as the code sits in the middle of the line rather than spanning all of it.
I've replaced your .+ with [[:upper:]]+ so as to be more specific. With .+, it would match on blah 5-10/2 blah blah €-1000000 for instance. You may need to adapt depending on what you want to accept in place of PNLP.
Note that it also matches on blah 1234-56/XX-1234567890 blah, as that does contain a string that matches the pattern (the 4-56/XX-123456 part).
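If that kind of over-matching matters, one possible tightening is to require a non-digit on either side of the code. The following is only a sketch under assumptions: the sample lines are made up, the record is padded with spaces so that a code at the start or end of a line still qualifies, and the digit runs are written out longhand instead of {2} and {6} so that it also runs on awks without interval support.

```shell
# Made-up sample lines: only the first holds a well-formed code.
lines='Ley 7-04/PNLP-000009, relativa
blah 1234-56/XX-1234567890 blah
no code here'

# Pattern from the answer, digit runs written out: reports lines 1 and 2.
loose=$(printf '%s\n' "$lines" | awk '
  /[0-9]-[0-9][0-9]\/[[:upper:]]+-[0-9][0-9][0-9][0-9][0-9][0-9]/ {
    out = out sep NR; sep = " "
  }
  END { print out }')

# Tightened variant: the code must have a non-digit on both sides.
# The record is padded with spaces so a code at either end of a line counts.
strict=$(printf '%s\n' "$lines" | awk '
  (" " $0 " ") ~ /[^0-9][0-9]-[0-9][0-9]\/[[:upper:]]+-[0-9][0-9][0-9][0-9][0-9][0-9][^0-9]/ {
    out = out sep NR; sep = " "
  }
  END { print out }')

echo "loose:  $loose"
echo "strict: $strict"
```

Here the loose pattern reports lines 1 and 2, while the tightened one reports line 1 only.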
I've removed the g in gawk as that code is not gawk-specific. However, note that there are still a few awk implementations that don't support the {2} and {6} interval operators above (even though that's a POSIX requirement), so if you know gawk is going to be available, you might as well use it to make sure it works.
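If gawk may not be available, a rough way to find out whether the local awk understands interval expressions, together with a fallback that spells the repetitions out by hand, could look like this. The probe, the sample input and the scratch directory are illustrative assumptions, and %05d is used as an equivalent of the %05i above.

```shell
# Probe (illustrative): does this awk treat {2} as "exactly two repetitions"?
if printf 'aa\n' | awk '/^a{2}$/ { found = 1 } END { exit !found }'; then
  interval_support=yes
else
  interval_support=no
fi
echo "interval expressions supported: $interval_support"

# Fallback: the same split, with the digit runs written out by hand.
workdir=$(mktemp -d)
printf '%s\n' \
  'Ley 7-04/PNLP-000009, first item' \
  'La senora PRESIDENTA' \
  'Ley 7-04/PNLP-000001, second item' \
  'La senora PRESIDENTA' > "$workdir/input.txt"

(
  cd "$workdir" &&
  awk '/[0-9]-[0-9][0-9]\/[[:upper:]]+-[0-9][0-9][0-9][0-9][0-9][0-9]/ {
    if (file) close(file)
    file = sprintf("split%05d.txt", ++i)
  }
  file { print > file }' input.txt
)
ls "$workdir"
```

The longhand pattern is more verbose but does not depend on interval support, so the split should produce split00001.txt and split00002.txt either way.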
Answered by Stéphane Chazelas on November 26, 2021