patterns/zip: Change how the end of central directory record is found (#60)

Previously, the zip pattern located the end-of-central-directory (EOCD)
header signature (50 4B 05 06) by scanning the entire file for it. This is
*very* slow for large files, and risks false positives since those bytes
can appear by chance in compressed data. I had this happen on the first
large (>2GB) zip file I tried.

I'm now checking for the EOCD signature at exactly 22 bytes from the end of
the file (the common case, where there is no zip comment), and if that fails
I search for it in the last 64KB of the file (in case there *is* a comment
at the end of the EOCD; its length is a 16-bit field, so it can't be larger
than 64KB). This is much faster, and fixes loading my zip file, where the
old search was spuriously finding the signature in the wrong place.
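
The two-step lookup described above can be sketched outside ImHex as follows.
This is a Python illustration, not the pattern-language code from this commit;
the function and constant names are made up for the example, and it assumes a
seekable binary file object. Like the pattern, it takes the first match in the
tail window.

```python
import io

EOCD_SIGNATURE = b"PK\x05\x06"   # the bytes 50 4B 05 06
EOCD_MIN_SIZE = 22               # size of the EOCD record with an empty comment
MAX_COMMENT = 0xFFFF             # the comment length is a 16-bit field


def find_eocd(f):
    """Return the offset of the EOCD signature in f, or -1 if none is found."""
    size = f.seek(0, io.SEEK_END)
    if size < EOCD_MIN_SIZE:
        return -1

    # Fast path: with no comment, the record is exactly the last 22 bytes.
    f.seek(size - EOCD_MIN_SIZE)
    if f.read(4) == EOCD_SIGNATURE:
        return size - EOCD_MIN_SIZE

    # Slow path: read only the tail that could hold a record plus its comment.
    start = max(0, size - EOCD_MIN_SIZE - MAX_COMMENT)
    f.seek(start)
    tail = f.read(size - start)
    pos = tail.find(EOCD_SIGNATURE)  # first match in the window
    return start + pos if pos != -1 else -1
```

Only the last 64KB or so is ever read, so the cost no longer depends on the
size of the archive.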

This still has a low risk of false positives (what if the comment contains
the 50 4B 05 06 bytes? what if there is a short comment but the signature
also appears in the last 64KB of compressed data?), but I don't know what
the "right" way to find the EOCD is, or how proper zip-reading tools handle
the ambiguity...
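
One possible way to reduce the ambiguity (not what this commit does, and not
claimed to be how any particular zip tool works) is to scan backwards from the
end and accept a candidate only when its comment-length field says the record
ends exactly at the end of the file. A Python sketch, with hypothetical names,
operating on the file contents as bytes:

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_MIN_SIZE = 22
MAX_COMMENT = 0xFFFF


def find_eocd_validated(data):
    """Return the offset of a self-consistent EOCD record in data, or -1."""
    size = len(data)
    lowest = max(0, size - EOCD_MIN_SIZE - MAX_COMMENT)
    pos = size - EOCD_MIN_SIZE
    # Walk candidate signatures from the end of the file towards the start.
    while pos >= lowest:
        pos = data.rfind(EOCD_SIGNATURE, lowest, pos + 4)
        if pos == -1:
            return -1
        # The comment length is the last fixed field, at offset 20.
        (comment_len,) = struct.unpack_from("<H", data, pos + 20)
        if pos + EOCD_MIN_SIZE + comment_len == size:
            return pos
        pos -= 1
    return -1
```

This rejects a stray signature in compressed data or inside the comment unless
the two bytes at offset 20 after it happen to match the remaining length, so it
is still a heuristic rather than a guarantee.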
Nicolás Alvarez
2022-11-19 14:05:26 -03:00
committed by GitHub
parent fbb6a84323
commit 01a1bd0d9f

@@ -1,6 +1,7 @@
#pragma MIME application/zip
#include <std/mem.pat>
+#include <std/math.pat>
struct EndOfCentralDirectory {
u32 headerSignature [[color("00000000")]];
@@ -14,7 +15,24 @@ struct EndOfCentralDirectory {
char coment[commentLength] [[name("Comment")]];
};
-EndOfCentralDirectory fileInfo @ std::mem::find_sequence(0,0x50,0x4B,0x05,0x06) [[name("End of Central Directory Record")]];
+fn find_eocd() {
+    // If there is no zip comment, which is the common case,
+    // the end-of-central-directory record will be 22 bytes long
+    // at the end of the file; check if size-22 has the signature.
+    if (std::mem::read_unsigned(std::mem::size()-22, 4, std::mem::Endian::Little) == 0x06054B50) {
+        return std::mem::size()-22;
+    } else {
+        // If it's not there, then there's probably a zip comment;
+        // search the last 64KB of the file for the signature.
+        // This is not entirely reliable, since the signature could
+        // randomly appear in compressed data before the actual EOCD,
+        // but it should be good enough...
+        u128 last64k = std::math::max(0, std::mem::size()-65536-22);
+        return std::mem::find_sequence_in_range(0, last64k, std::mem::size(), 0x50,0x4B,0x05,0x06);
+    }
+};
+EndOfCentralDirectory fileInfo @ find_eocd() [[name("End of Central Directory Record")]];
struct CentralDirectoryFileHeader {
u32 headerSignature [[color("00000000")]];