A GZIP file can store an optional filename in its header, and this can be removed. We rewrite the contents of our GZIP files to strip that metadata, which increases the file size with no benefit.
Specification. After researching this problem, I found the GZIP File Format Specification, which outlines the contents of every valid GZIP file's headers. From the specification, each file begins with specific bytes.
GZIP file format specification: gzip.org
Byte order specification

+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+

Text description of bits

FLG (FLaGs)
This flag byte is divided into individual bits as follows:

bit 0   FTEXT
bit 1   FHCRC
bit 2   FEXTRA
bit 3   FNAME
bit 4   FCOMMENT
bit 5   reserved
bit 6   reserved
bit 7   reserved
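The FLG byte (the fourth byte) contains 8 bits that can be set to 1 or 0 depending on what optional header information exists. The fixed part of the header is 10 bytes long, so the optional fields begin at the 11th byte of the GZIP file (offset 10). When several are present, they appear in the order FEXTRA, FNAME, FCOMMENT, FHCRC.
Note: We can see from the above information that the flag byte contains the FNAME bit at the fourth position, bit 3. When it is the only bit set, the flag byte has the value 8.
Therefore: If the FNAME bit is set to 1 and no other flag is set, we can remove the filename data starting at byte offset 10 through its terminating null byte.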
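The bit layout above can be sketched in code before any bytes are removed. This is a minimal illustration, not part of the original method: the names GZipHeaderSketch, GZipFlags, ReadFlags and TryReadName are hypothetical, and the checks for the magic bytes 0x1F 0x8B and the ISO-8859-1 name encoding are taken from the GZIP specification rather than from the code in this article.

```csharp
using System;
using System.Text;

static class GZipHeaderSketch
{
    // Masks for the FLG byte: bit n has the value 1 << n.
    [Flags]
    public enum GZipFlags : byte
    {
        None = 0,
        FTEXT = 1 << 0,
        FHCRC = 1 << 1,
        FEXTRA = 1 << 2,
        FNAME = 1 << 3,     // value 8, the case this article handles
        FCOMMENT = 1 << 4
    }

    // Returns the flags stored in the fourth byte (index 3) of a GZIP header.
    public static GZipFlags ReadFlags(byte[] data)
    {
        // Assumption: ID1 and ID2 are 0x1F and 0x8B, as given in the specification.
        if (data == null || data.Length < 10 || data[0] != 0x1F || data[1] != 0x8B)
        {
            throw new ArgumentException("Not a GZIP header.", nameof(data));
        }
        return (GZipFlags)data[3];
    }

    // Reads the stored name, but only when FNAME is the sole flag set,
    // because only then is the name guaranteed to start at offset 10.
    public static bool TryReadName(byte[] data, out string name)
    {
        name = "";
        if (ReadFlags(data) != GZipFlags.FNAME)
        {
            return false;
        }
        // The name runs from offset 10 up to the first null byte.
        int end = Array.IndexOf(data, (byte)0, 10);
        if (end < 0)
        {
            return false;
        }
        name = Encoding.GetEncoding("ISO-8859-1").GetString(data, 10, end - 10);
        return true;
    }
}
```

Calling GZipHeaderSketch.TryReadName(File.ReadAllBytes(path), out string name) on a file is one way to check whether a name is present before deciding to rewrite it.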
Removes file name from GZIP files: C#

using System;
using System.Collections.Generic;
using System.IO;

/// <summary>
/// Rewrite the GZIP file specified and remove the file name bytes in it.
/// </summary>
static void GZipRemoveFileName(string fn)
{
    // Read in GZIP
    byte[] b = File.ReadAllBytes(fn);

    // See if the file name is set.
    // We don't deal with the file if any other flags are set.
    if (b.Length > 10 && b[3] == 8)
    {
        // Allocate the copy bytes
        List<byte> copy = new List<byte>(b.Length);

        // The flag byte will be set to 0
        b[3] = 0;

        // Add the first ten bytes
        for (int i = 0; i < 10; i++)
        {
            copy.Add(b[i]);
        }

        // Ignore all non-null name characters, without reading past the end
        int a = 10;
        while (a < b.Length && b[a] != 0)
        {
            a++;
        }
        if (a == b.Length)
        {
            // No terminating null was found, so leave the file untouched
            Console.WriteLine("Name not terminated: {0}", fn);
            return;
        }

        // Ignore the null
        a++;

        // Add the rest of the file
        for (int i = a; i < b.Length; i++)
        {
            copy.Add(b[i]);
        }

        // Write the new byte array
        File.WriteAllBytes(fn, copy.ToArray());
    }
    else
    {
        // Note that we could not rewrite the file
        Console.WriteLine("Flag invalid: {0}", fn);
    }
}
We receive the filename of a GZIP file, and then read its bytes with File.ReadAllBytes. We then test the flag byte for 8. A value of 8 means that bit 3 (FNAME) is set to 1 and every other flag bit is 0. In that case the filename starts at byte offset 10, and we can remove everything from there through the terminating null byte.
Info: The loops that follow simply copy the first 10 bytes, with the fourth byte (the flag byte) set to 0, and then skip the filename bytes.
Finally: The rest of the bytes are copied unchanged. We call the Add method on the List to do this.
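The method above only handles files whose flag byte is exactly 8. As a hedged sketch of how the same idea might extend to headers that also carry an FEXTRA field, the version below works on a byte array in memory and clears only bit 3. The class and method names are hypothetical. The layout of FEXTRA (a two-byte little-endian length followed by that many bytes, placed before the name) comes from the GZIP specification, not from this article. The sketch refuses files with FHCRC set, since the header checksum would no longer match, and it only looks at the first member of a file.

```csharp
using System;

static class GZipNameStripper
{
    const byte FHCRC = 0x02;
    const byte FEXTRA = 0x04;
    const byte FNAME = 0x08;

    // Returns a copy of the GZIP data without its FNAME field,
    // or the input itself when no name is stored.
    public static byte[] StripFileName(byte[] data)
    {
        if (data == null || data.Length < 10 || data[0] != 0x1F || data[1] != 0x8B)
        {
            throw new ArgumentException("Not a GZIP header.", nameof(data));
        }

        byte flags = data[3];
        if ((flags & FNAME) == 0)
        {
            return data; // nothing to remove
        }
        if ((flags & FHCRC) != 0)
        {
            // Editing the header would invalidate its CRC16; this sketch does not recompute it.
            throw new NotSupportedException("Header CRC present.");
        }

        // The name follows the fixed 10-byte header and the optional extra field.
        int nameStart = 10;
        if ((flags & FEXTRA) != 0)
        {
            if (data.Length < 12)
            {
                throw new ArgumentException("Truncated extra field.", nameof(data));
            }
            int xlen = data[10] | (data[11] << 8); // little-endian length
            nameStart = 12 + xlen;
        }
        if (nameStart > data.Length)
        {
            throw new ArgumentException("Truncated header.", nameof(data));
        }

        int nameEnd = Array.IndexOf(data, (byte)0, nameStart);
        if (nameEnd < 0)
        {
            throw new ArgumentException("Name not terminated.", nameof(data));
        }

        // Copy everything before the name, then everything after its null byte.
        int removed = nameEnd + 1 - nameStart;
        byte[] result = new byte[data.Length - removed];
        Buffer.BlockCopy(data, 0, result, 0, nameStart);
        Buffer.BlockCopy(data, nameEnd + 1, result, nameStart, data.Length - (nameEnd + 1));

        // Clear only the FNAME bit and keep the other flags.
        result[3] = (byte)(flags & ~FNAME);
        return result;
    }
}
```

A caller could combine this with File.ReadAllBytes and File.WriteAllBytes in the same way as the method above. Working on arrays rather than a List avoids adding bytes one at a time, which is a design choice of this sketch and not a requirement of the algorithm.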
Discussion. This algorithm can be implemented in any language that lets you check individual bytes. This article isn't about the C# language but rather the algorithm you can use to remove the filename, as well as the general structure of GZIP.
Some GZIP programs such as 7-Zip will leave the filename even if you don't want it. I haven't found a way to remove it in these programs. This method can therefore slightly reduce the size of GZIP files created with 7-Zip, which improves their effective compression ratio.
Result. This algorithm modifies the GZIP files in my website correctly and it saves a significant number of bytes in each file. I measured the effect on an archive containing about 300 GZIP files, as well as other files.
Method results

Before (with filenames): 5.56 MB (5,840,304 bytes)
After (no filenames):    5.56 MB (5,835,142 bytes)
Savings per file:        17 bytes
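The per-file figure follows from the two totals. As a quick check, assuming the archive holds exactly 300 GZIP files (the text says "about 300"):

```latex
\begin{align*}
5{,}840{,}304 - 5{,}835{,}142 &= 5{,}162 \text{ bytes saved in total} \\
5{,}162 / 300 &\approx 17.2 \text{ bytes saved per file}
\end{align*}
```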
Summary. We saw an algorithm that uses the GZIP specification information to rewrite GZIP files, saving an average of 17 bytes per GZIP file. The file name in GZIP is optional, and often serves no purpose.
Further: We saw how to read the GZIP specification. We used it to manipulate the bytes in a GZIP file.
And: These two tasks could help programmers in any language, not just .NET Framework ones.