[00:33] *** Arcorann_ has joined #archiveteam-ot
[00:34] *** Arcorann_ has quit IRC (Remote host closed the connection)
[00:34] *** Arcorann_ has joined #archiveteam-ot
[00:51] *** Wingy has joined #archiveteam-ot
[02:15] does anyone have experience with grab-site generating invalid WARC records?
[02:20] I'm getting lots of "Exception: Error -3 while decompressing data: invalid distance too far back" when validating/indexing them
[02:22] The description makes it sound like it's something with the compression rather than the warc
[02:25] Terbium: might be related to https://github.com/ArchiveTeam/wpull/issues/424?
[02:26] Looks like "invalid distance too far back" is a zlib error message
[02:27] even the generated CDX record lines are cut off for the corrupted/invalid records
[02:27] Thanks ivan, i also encountered that issue where warcat found lots of invalid digest/checksums
[02:28] if you have a way to reproduce this please file a grab-site bug so that at least people know
[02:29] Will do, I'm looking into why grab-site is either producing invalid headers or the WARC reader is reading the records improperly
[03:48] *** qw3rty__ has joined #archiveteam-ot
[03:56] *** qw3rty_ has quit IRC (Read error: Operation timed out)
[04:44] *** nyany has quit IRC (Read error: Operation timed out)
[04:45] *** revi has quit IRC (Ping timeout: 260 seconds)
[04:45] *** jrwr has quit IRC (Ping timeout: 260 seconds)
[04:45] *** namespace has quit IRC (Read error: Operation timed out)
[04:45] *** prq has quit IRC (Write error: Broken pipe)
[04:45] *** mtntmnky has quit IRC (Read error: Operation timed out)
[04:47] *** Igloo has quit IRC (Read error: Operation timed out)
[04:47] *** DigiDigi has quit IRC (Read error: Operation timed out)
[04:54] *** Arcorann_ has quit IRC (Read error: Connection reset by peer)
[05:01] *** nyany has joined #archiveteam-ot
[05:01] *** mtntmnky has joined #archiveteam-ot
[05:01] *** prq has joined #archiveteam-ot
[05:01] *** namespace has joined #archiveteam-ot
[05:01] *** Igloo has joined #archiveteam-ot
[05:02] *** DigiDigi has joined #archiveteam-ot
[05:03] *** revi has joined #archiveteam-ot
[05:07] *** jrwr has joined #archiveteam-ot
[05:20] *** Arcorann has joined #archiveteam-ot
[06:02] *** Raccoon has quit IRC (Remote host closed the connection)
[06:44] *** Ryz has quit IRC (Remote host closed the connection)
[06:44] *** kiska1825 has quit IRC (Remote host closed the connection)
[06:45] *** kiska1825 has joined #archiveteam-ot
[06:45] *** Ryz has joined #archiveteam-ot
[06:50] *** WalkFly has joined #archiveteam-ot
[07:10] *** Stiletto has joined #archiveteam-ot
[07:21] *** Stiletto has quit IRC ()
[07:53] *** Arcorann has quit IRC (Read error: Connection reset by peer)
[07:57] *** Arcorann has joined #archiveteam-ot
[08:40] *** nepeat_ has joined #archiveteam-ot
[08:40] *** nepeat has quit IRC (Read error: Connection reset by peer)
[08:52] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[09:39] *** Stiletto has joined #archiveteam-ot
[10:43] *** Arcorann has quit IRC (Read error: Connection reset by peer)
[10:44] *** Arcorann has joined #archiveteam-ot
[10:56] *** Arcorann has quit IRC (Read error: Connection reset by peer)
[10:57] *** Arcorann has joined #archiveteam-ot
[11:48] *** Arcorann_ has joined #archiveteam-ot
[11:52] *** Arcorann has quit IRC (Ping timeout: 265 seconds)
[12:31] Terbium: Yes, I've seen this before. The exact error message was slightly different though. In my case, a partial record was written to the WARC. I have no idea what caused it though.
[12:31] Was never able to reproduce it either.
[12:35] Could I get a copy of your file?
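Editor's sketch, not part of the log: one way to narrow down where a compressed WARC goes bad is to walk the file one gzip member at a time and note the byte offset of the first member that fails. This assumes the .warc.gz was written with one gzip member per record, which is the usual convention but is not confirmed for the files discussed here. The function name scan_gzip_members is made up for illustration, and the sample data is built in memory rather than taken from a real crawl.

```python
import gzip
import zlib


def scan_gzip_members(data):
    """Walk a byte string made of concatenated gzip members.

    Returns (offsets, error). offsets lists the start offset of every
    member that decompressed cleanly. error is None if the whole input
    was consumed, otherwise an (offset, message) tuple describing the
    first member that failed.
    """
    offsets = []
    pos = 0
    while pos < len(data):
        # wbits=31 makes zlib expect a gzip header and trailer.
        decomp = zlib.decompressobj(31)
        try:
            decomp.decompress(data[pos:])
        except zlib.error as exc:
            # For example "incorrect header check" or
            # "invalid distance too far back".
            return offsets, (pos, str(exc))
        if not decomp.eof:
            # Input ended before the member's trailer: a partial write.
            return offsets, (pos, "truncated gzip member")
        offsets.append(pos)
        # Bytes after the end of this member belong to the next one.
        pos = len(data) - len(decomp.unused_data)
    return offsets, None


# Hypothetical stand-ins for two WARC records, each compressed on its own.
record_a = gzip.compress(b"WARC/1.0\r\nfirst record\r\n\r\n")
record_b = gzip.compress(b"WARC/1.0\r\nsecond record\r\n\r\n")

# Simulate a partial write by cutting off the end of the second member.
damaged = record_a + record_b[:-5]
good_offsets, first_error = scan_gzip_members(damaged)
print("clean members start at:", good_offsets)
print("first problem:", first_error)
```

For a real file, the bytes would come from reading the .warc.gz in binary mode. The sketch slices the remaining input for every member, so it is only suitable for small test files; a streaming version that feeds fixed-size chunks to the decompressor would be needed for multi-gigabyte WARCs.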
[14:06] *** Ctrl has quit IRC (Read error: Operation timed out)
[14:18] *** nepeat_ has quit IRC (Quit: ZNC 1.7.5 - https://znc.in)
[14:22] *** nepeat has joined #archiveteam-ot
[14:34] *** Arcorann_ has quit IRC (Read error: Connection reset by peer)
[14:52] *** Ctrl has joined #archiveteam-ot
[15:26] JAA: Sure, i have tons of these invalid WARC files that grab-site seems to be generating
[15:28] I've been trying to reproduce it by trying to archive the specific URLs of the invalid records but haven't been successful
[15:29] Most of the time, it appears that either invalid gzip headers are written or it finds the incorrect gzip magic number
[15:33] Tons? Oof, that sounds bad. I've only seen a single one across TBs of grab-site WARCs.
[15:35] around 60% of the WARCs i get via grab-site have some sort of invalid compression / partial record writes
[15:35] that both warcat and warctools pick up
[15:35] Damn
[15:35] Well, at least there's a chance this could be reproduced then.
[15:36] interestingly enough, if i rerun the same crawl enough times, i'll get a valid final WARC
[15:36] Were all these crawls run on the same machine?
[15:36] was hoping to spend some time to dissect the WARC file and extract the corrupted parts and attempt to manually read them to see what parts of the record actually have issues
[15:36] yeah same machine
[15:48] Let's move this into #archiveteam-dev (on hackint) as this certainly seems like a bug somewhere in grab-site, wpull, or maybe even deeper.
[16:31] *** DogsRNice has joined #archiveteam-ot
[17:09] *** SirSpain has joined #archiveteam-ot
[17:09] Kaz: DO NOT BAN ME NEVER.
[17:09] *** JAA sets mode: +b *!*webchat@181.94.209.*
[17:09] *** SirSpain was kicked by JAA (SirSpain)
[17:09] ow, I don't have op here
[17:09] *** JAA sets mode: +o Kaz
[17:10] *** kiska sets mode: +o Kaz
[17:10] Now you have two ops!
[17:10] :D
[17:11] *** Raccoon has joined #archiveteam-ot
[17:12] Sucks that EFnet doesn't have extbans so we can ban him globally
[17:12] * Raccoon is so evasive
[17:22] *** SirSpain_ has joined #archiveteam-ot
[17:22] JAA: IDIOT
[17:23] *** JAA sets mode: +b *!*webchat@195.198.105.*
[17:23] *** SirSpain_ was kicked by JAA (no u)
[17:59] this guy is sure insistent....
[20:17] and so weirdly vague about what it is that he actually *wants*
[20:18] who the hell joins a chat room just to say: "don't remove me?"
[20:20] *** schbirid has quit IRC (Quit: Leaving)
[20:34] I think that it's frustration
[20:37] Somebody2: He first joined #archivebot with ridiculous archival requests, then everywhere else when he got kickbanned from there.
[20:38] Just to give you an idea, his first interaction with AT (as far as I know) was this: 2017-11-22 17:19:09 UTC < joaquinit> !archive https://mega.nz/
[20:38] Was that in #archivebot?
[20:38] Yes
[20:38] Oh, explains why it's not in the logs
[20:39] I'm at least 99 % sure he thought that would archive all files on MEGA.
[20:39] He tried to archive all of Youtube with Chromebot a few days ago, so I wouldn't doubt it
[20:40] Yup
[20:41] Oh yeah, he also impersonated WikiMedia Foundation people. And he tried to impersonate me on Wikipedia. ¯\_(ツ)_/¯
[20:43] When was that?
[20:43] Impersonating you on Wikipedia?
[20:44] A couple weeks ago.
[20:44] Someone let me know, and the account was banned within a few hours.
[20:44] Huh
[20:47] *** VADemon has joined #archiveteam-ot
[21:25] *** HP_Archiv has joined #archiveteam-ot
[21:28] *** HP_Archiv has quit IRC (Client Quit)
[22:58] *** Arcorann_ has joined #archiveteam-ot
[22:59] *** Arcorann_ has quit IRC (Remote host closed the connection)
[23:00] *** Arcorann_ has joined #archiveteam-ot
[23:02] *** britmob has quit IRC (Read error: Connection reset by peer)
[23:27] *** HP_Archiv has joined #archiveteam-ot
[23:27] *** Arcorann_ has quit IRC (Read error: Connection reset by peer)
[23:27] *** Arcorann_ has joined #archiveteam-ot
[23:30] *** britmob has joined #archiveteam-ot
[23:33] *** Arcorann_ has quit IRC (Ping timeout: 265 seconds)
[23:36] *** Arcorann has joined #archiveteam-ot
[23:41] *** HP_Archiv has quit IRC (Quit: Leaving)