#archiveteam-ot 2020-07-26,Sun


Time Nickname Message
00:33 🔗 Arcorann_ has joined #archiveteam-ot
00:34 🔗 Arcorann_ has quit IRC (Remote host closed the connection)
00:34 🔗 Arcorann_ has joined #archiveteam-ot
00:51 🔗 Wingy has joined #archiveteam-ot
02:15 🔗 Terbium does anyone have experience with grab-site generating invalid WARC records?
02:20 🔗 Terbium I'm getting lots of "Exception: Error -3 while decompressing data: invalid distance too far back" when validating/indexing them
02:22 🔗 OrIdow6 The description makes it sound like it's something with the compression rather than the warc
02:25 🔗 ivan Terbium: might be related to https://github.com/ArchiveTeam/wpull/issues/424?
02:26 🔗 OrIdow6 Looks like "invalid distance too far back" is a zlib error message
02:27 🔗 Terbium even the generated CDX record lines are cut off for the corrupted/invalid records
02:27 🔗 Terbium Thanks ivan, i also encountered that issue where warcat found lots of invalid digest/checksums
02:28 🔗 ivan if you have a way to reproduce this please file a grab-site bug so that at least people know
02:29 🔗 Terbium Will do, I'm looking into why grab-site is either producing invalid headers or the WARC reader is reading the records improperly
03:48 🔗 qw3rty__ has joined #archiveteam-ot
03:56 🔗 qw3rty_ has quit IRC (Read error: Operation timed out)
04:44 🔗 nyany has quit IRC (Read error: Operation timed out)
04:45 🔗 revi has quit IRC (Ping timeout: 260 seconds)
04:45 🔗 jrwr has quit IRC (Ping timeout: 260 seconds)
04:45 🔗 namespace has quit IRC (Read error: Operation timed out)
04:45 🔗 prq has quit IRC (Write error: Broken pipe)
04:45 🔗 mtntmnky has quit IRC (Read error: Operation timed out)
04:47 🔗 Igloo has quit IRC (Read error: Operation timed out)
04:47 🔗 DigiDigi has quit IRC (Read error: Operation timed out)
04:54 🔗 Arcorann_ has quit IRC (Read error: Connection reset by peer)
05:01 🔗 nyany has joined #archiveteam-ot
05:01 🔗 mtntmnky has joined #archiveteam-ot
05:01 🔗 prq has joined #archiveteam-ot
05:01 🔗 namespace has joined #archiveteam-ot
05:01 🔗 Igloo has joined #archiveteam-ot
05:02 🔗 DigiDigi has joined #archiveteam-ot
05:03 🔗 revi has joined #archiveteam-ot
05:07 🔗 jrwr has joined #archiveteam-ot
05:20 🔗 Arcorann has joined #archiveteam-ot
06:02 🔗 Raccoon has quit IRC (Remote host closed the connection)
06:44 🔗 Ryz has quit IRC (Remote host closed the connection)
06:44 🔗 kiska1825 has quit IRC (Remote host closed the connection)
06:45 🔗 kiska1825 has joined #archiveteam-ot
06:45 🔗 Ryz has joined #archiveteam-ot
06:50 🔗 WalkFly has joined #archiveteam-ot
07:10 🔗 Stiletto has joined #archiveteam-ot
07:21 🔗 Stiletto has quit IRC ()
07:53 🔗 Arcorann has quit IRC (Read error: Connection reset by peer)
07:57 🔗 Arcorann has joined #archiveteam-ot
08:40 🔗 nepeat_ has joined #archiveteam-ot
08:40 🔗 nepeat has quit IRC (Read error: Connection reset by peer)
08:52 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
09:39 🔗 Stiletto has joined #archiveteam-ot
10:43 🔗 Arcorann has quit IRC (Read error: Connection reset by peer)
10:44 🔗 Arcorann has joined #archiveteam-ot
10:56 🔗 Arcorann has quit IRC (Read error: Connection reset by peer)
10:57 🔗 Arcorann has joined #archiveteam-ot
11:48 🔗 Arcorann_ has joined #archiveteam-ot
11:52 🔗 Arcorann has quit IRC (Ping timeout: 265 seconds)
12:31 🔗 JAA Terbium: Yes, I've seen this before. The exact error message was slightly different though. In my case, a partial record was written to the WARC. I have no idea what caused it though.
12:31 🔗 JAA Was never able to reproduce it either.
12:35 🔗 JAA Could I get a copy of your file?
14:06 🔗 Ctrl has quit IRC (Read error: Operation timed out)
14:18 🔗 nepeat_ has quit IRC (Quit: ZNC 1.7.5 - https://znc.in)
14:22 🔗 nepeat has joined #archiveteam-ot
14:34 🔗 Arcorann_ has quit IRC (Read error: Connection reset by peer)
14:52 🔗 Ctrl has joined #archiveteam-ot
15:26 🔗 Terbium JAA: Sure, i have tons of these invalid WARC files that grab-site seems to be generating
15:28 🔗 Terbium I've been trying to reproduce it by archiving the specific URLs of the invalid records but haven't been successful
15:29 🔗 Terbium Most of the time, it appears that either invalid gzip headers are written or it finds the incorrect gzip magic number
15:33 🔗 JAA Tons? Oof, that sounds bad. I've only seen a single one across TBs of grab-site WARCs.
15:35 🔗 Terbium around 60% of the WARCs i get via grab-site have some sort of invalid compression / partial record writes
15:35 🔗 Terbium that both warcat and warctools pick up
15:35 🔗 JAA Damn
15:35 🔗 JAA Well, at least there's a chance this could be reproduced then.
15:36 🔗 Terbium interestingly enough, if i rerun the same crawl enough times, i'll get a valid final WARC
15:36 🔗 JAA Were all these crawls run on the same machine?
15:36 🔗 Terbium was hoping to spend some time to dissect the WARC file and extract the corrupted parts and attempt to manually read them to see what parts of the record actually have issues
15:36 🔗 Terbium yeah same machine
15:48 🔗 JAA Let's move this into #archiveteam-dev (on hackint) as this certainly seems like a bug somewhere in grab-site, wpull, or maybe even deeper.
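One way to do the dissection Terbium describes is to walk the gzip members of a .warc.gz and note the byte offsets where decompression fails. The following is only a minimal sketch, not what warcat or warctools do internally. It assumes the WARC was written with one gzip member per record (the usual .warc.gz convention), and the function names `scan_gzip_members` and `_next_magic` are made up for illustration. After an error it resyncs by searching for the next gzip magic bytes, which is a heuristic and can hit false positives inside compressed data.

```python
import zlib

# gzip magic number (1f 8b) followed by the deflate method byte (08)
GZIP_HEADER = b"\x1f\x8b\x08"


def _next_magic(data: bytes, start: int) -> int:
    """Return the offset of the next candidate gzip header, or len(data)."""
    idx = data.find(GZIP_HEADER, start)
    return len(data) if idx == -1 else idx


def scan_gzip_members(data: bytes, chunk_size: int = 65536):
    """Walk concatenated gzip members and report the ones that fail.

    Returns (members_ok, problems), where problems is a list of
    (offset, message) tuples in file order.
    """
    view = memoryview(data)  # slicing a memoryview avoids copying
    members_ok = 0
    problems = []
    offset = 0
    while offset < len(data):
        if data[offset:offset + 3] != GZIP_HEADER:
            problems.append((offset, "bad gzip magic number"))
            offset = _next_magic(data, offset + 1)
            continue
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        pos = offset
        error = None
        try:
            while not decomp.eof and pos < len(data):
                chunk = view[pos:pos + chunk_size]
                decomp.decompress(chunk)
                # unused_data is only non-empty once the member has ended
                pos += len(chunk) - len(decomp.unused_data)
        except zlib.error as exc:
            # e.g. "Error -3 while decompressing data: ..."
            error = str(exc)
        if error is not None:
            problems.append((offset, error))
            offset = _next_magic(data, offset + 1)  # heuristic resync
        elif not decomp.eof:
            problems.append((offset, "truncated member"))
            break
        else:
            members_ok += 1
            offset = pos
    return members_ok, problems
```

Reading the file with `open(path, "rb").read()` and passing the bytes in is enough for small WARCs. For multi-GB files the same loop would need to read from the file in chunks instead of holding everything in memory.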
16:31 🔗 DogsRNice has joined #archiveteam-ot
17:09 🔗 SirSpain has joined #archiveteam-ot
17:09 🔗 SirSpain Kaz: DO NOT BAN ME NEVER.
17:09 🔗 JAA sets mode: +b *!*webchat@181.94.209.*
17:09 🔗 SirSpain was kicked by JAA (SirSpain)
17:09 🔗 Kaz ow, I don't have op here
17:09 🔗 JAA sets mode: +o Kaz
17:10 🔗 kiska sets mode: +o Kaz
17:10 🔗 JAA Now you have two ops!
17:10 🔗 kiska :D
17:11 🔗 Raccoon has joined #archiveteam-ot
17:12 🔗 JAA Sucks that EFnet doesn't have extbans so we can ban him globally
17:12 🔗 * Raccoon is so evasive
17:22 🔗 SirSpain_ has joined #archiveteam-ot
17:22 🔗 SirSpain_ JAA: IDIOT
17:23 🔗 JAA sets mode: +b *!*webchat@195.198.105.*
17:23 🔗 SirSpain_ was kicked by JAA (no u)
17:59 🔗 Terbium this guy is sure insistent....
20:17 🔗 Somebody2 and so weirdly vague about what it is that he actually *wants*
20:18 🔗 Somebody2 who the hell joins a chat room just to say: "don't remove me?"
20:20 🔗 schbirid has quit IRC (Quit: Leaving)
20:34 🔗 OrIdow6 I think that it's frustration
20:37 🔗 JAA Somebody2: He first joined #archivebot with ridiculous archival requests, then everywhere else when he got kickbanned from there.
20:38 🔗 JAA Just to give you an idea, his first interaction with AT (as far as I know) was this: 2017-11-22 17:19:09 UTC < joaquinit> !archive https://mega.nz/
20:38 🔗 OrIdow6 Was that in #archivebot?
20:38 🔗 JAA Yes
20:38 🔗 OrIdow6 Oh, explains why it's not in the logs
20:39 🔗 JAA I'm at least 99 % sure he thought that would archive all files on MEGA.
20:39 🔗 OrIdow6 He tried to archive all of Youtube with Chromebot a few days ago, so I wouldn't doubt it
20:40 🔗 JAA Yup
20:41 🔗 JAA Oh yeah, he also impersonated Wikimedia Foundation people. And he tried to impersonate me on Wikipedia. ¯\_(ツ)_/¯
20:43 🔗 OrIdow6 When was that?
20:43 🔗 OrIdow6 Impersonating you on Wikipedia?
20:44 🔗 JAA A couple weeks ago.
20:44 🔗 JAA Someone let me know, and the account was banned within a few hours.
20:44 🔗 OrIdow6 Huh
20:47 🔗 VADemon has joined #archiveteam-ot
21:25 🔗 HP_Archiv has joined #archiveteam-ot
21:28 🔗 HP_Archiv has quit IRC (Client Quit)
22:58 🔗 Arcorann_ has joined #archiveteam-ot
22:59 🔗 Arcorann_ has quit IRC (Remote host closed the connection)
23:00 🔗 Arcorann_ has joined #archiveteam-ot
23:02 🔗 britmob has quit IRC (Read error: Connection reset by peer)
23:27 🔗 HP_Archiv has joined #archiveteam-ot
23:27 🔗 Arcorann_ has quit IRC (Read error: Connection reset by peer)
23:27 🔗 Arcorann_ has joined #archiveteam-ot
23:30 🔗 britmob has joined #archiveteam-ot
23:33 🔗 Arcorann_ has quit IRC (Ping timeout: 265 seconds)
23:36 🔗 Arcorann has joined #archiveteam-ot
23:41 🔗 HP_Archiv has quit IRC (Quit: Leaving)
