[00:21] dashcloud: you're down by 3 items based on the Google cache copy
[00:22] what was taken down anyways?
[00:23] all I can tell is it's 3 things with archiveteam subject in them
[00:25] the farside calendar thing I uploaded
[00:25] probably the original disk, and the one or two tries at having it emulate under windows 3.1
[00:35] *** RichardG has joined #archiveteam-bs
[00:59] *** pnJay has quit IRC (Leaving)
[00:59] *** pnJay has joined #archiveteam-bs
[01:05] *** bwn has quit IRC (Ping timeout: 244 seconds)
[01:13] *** bwn has joined #archiveteam-bs
[01:14] *** icedice2 has quit IRC (Quit: Leaving)
[01:39] *** RichardG has quit IRC (Read error: Operation timed out)
[01:39] *** RichardG has joined #archiveteam-bs
[01:46] MLKSHK is shutting down and removing the ability to view posts without logging in on April 1, and will then stop serving files on May 1. Blog post is here: http://mlkshk.typepad.com/mlkshk/2017/02/mlkshk-shutting-down.html
[01:47] There was previously a project for this, were any scripts kept around? http://archiveteam.org/index.php?title=MLKSHK
[01:52] Signups are closed, which means unless you have an account there is only 1 week left till all this content disappears.
[02:11] *** zino has quit IRC (Read error: Operation timed out)
[02:15] *** zino has joined #archiveteam-bs
[02:37] *** ndiddy has quit IRC ()
[03:15] *** yuitimoth has quit IRC (Remote host closed the connection)
[03:16] *** yuitimoth has joined #archiveteam-bs
[03:18] *** yuitimoth has quit IRC (Remote host closed the connection)
[03:21] *** yuitimoth has joined #archiveteam-bs
[03:28] *** yuitimoth has quit IRC (Remote host closed the connection)
[03:29] *** yuitimoth has joined #archiveteam-bs
[03:30] *** pizzaiolo has quit IRC (Remote host closed the connection)
[03:30] *** yuitimoth has quit IRC (Remote host closed the connection)
[03:30] *** yuitimoth has joined #archiveteam-bs
[03:31] *** yuitimoth has quit IRC (Remote host closed the connection)
[03:31] *** yuitimoth has joined #archiveteam-bs
[03:31] *** yuitimoth has quit IRC (Remote host closed the connection)
[03:31] *** yuitimoth has joined #archiveteam-bs
[03:32] *** yuitimoth has quit IRC (Remote host closed the connection)
[03:33] *** yuitimoth has joined #archiveteam-bs
[03:33] *** yuitimoth has quit IRC (Remote host closed the connection)
[03:33] *** yuitimoth has joined #archiveteam-bs
[03:34] *** yuitimoth has quit IRC (Remote host closed the connection)
[03:34] *** yuitimoth has joined #archiveteam-bs
[03:34] *** yuitimoth has quit IRC (Remote host closed the connection)
[03:34] *** yuitimoth has joined #archiveteam-bs
[04:00] *** yuitimoth has quit IRC (Remote host closed the connection)
[04:00] *** yuitimoth has joined #archiveteam-bs
[04:01] *** yuitimoth has quit IRC (Remote host closed the connection)
[04:01] *** yuitimoth has joined #archiveteam-bs
[05:02] *** yuitimoth has quit IRC (Remote host closed the connection)
[05:03] *** yuitimoth has joined #archiveteam-bs
[05:44] *** Sk1d has joined #archiveteam-bs
[05:54] Content block length changed from 4327 to 4318
[05:54] is that fine?
[05:54] (output from python3 -m warcat verify)
[06:54] sort of paranoid about these WARCs I'm getting out of wpull, especially since I've stopped and resumed it a few times to change options. it's resulted in having multiple WARC files (--warc-append and --warc-max-size used together make a new WARC every time you restart). but they all seem fine.
[06:54] and it works great if I load them all into pywb
[06:55] yeah, there's probably no issue here.
[07:08] Somebody2: 4.71 million so far, with 12 timeouts
[07:21] *** JAA has joined #archiveteam-bs
[07:39] *** odemg has quit IRC (Remote host closed the connection)
[07:43] *** odemg has joined #archiveteam-bs
[08:00] *** GE has joined #archiveteam-bs
[08:09] *** odemg has quit IRC (Remote host closed the connection)
[08:50] *** jtn2 has quit IRC (Ping timeout: 255 seconds)
[10:06] *** GE has quit IRC (Quit: zzz)
[11:08] *** BlueMaxim has quit IRC (Quit: Leaving)
[11:23] *** GE has joined #archiveteam-bs
[11:28] *** BartoCH has quit IRC (Read error: Connection reset by peer)
[11:29] *** BartoCH has joined #archiveteam-bs
[12:48] *** n00b811 has joined #archiveteam-bs
[12:48] Does anyone have experience opening large (>10GB) .warc files?
[12:54] Patience is a virtue
[12:57] I tried to open it on my Windows 2012 R2 server but webarchiveplayer just closed after ~18 hours
[13:13] https://twitter.com/GossiTheDog/status/845446263244050434
[13:14] *** odemg has joined #archiveteam-bs
[13:24] o_O
[13:26] Just store everything in the cloud, they said. It'll be glorious, they said.
[13:27] it's like Kazaa but without porn
[13:28] * SpaffGarg searches for "passwords"
[13:29] * JAA searches for "CCV"
[13:32] CVV*
[13:36] haha, wow
[13:43] "wpull.engine - WARNING - Discarding 1 unprocessed item." - Is this something to worry about?
[13:44] Happened when I Ctrl-C'd wpull to increase the concurrency.
[13:47] So.. I just found a load of bank statements for someone on that site
[13:50] Identity theft made easy: https://docs.com/en-us/search?q=curriculum%20vitae
[13:52] I found some birth certificates, social security numbers, and passports...
[13:54] passports are easy, people post their new ones on twitter all the time
[14:00] Oh great, found a huge list containing various information about over 1000 people: name, address, date of birth, SSN, bank, credit card number + CVV + expiration date, name + SSN of the spouse, etc.
[14:16] Lol, I just wondered why wpull had stalled. Then I realised that I was in scrollback mode.
[14:19] *** odemg has quit IRC (Remote host closed the connection)
[14:29] *** fie has quit IRC (Ping timeout: 250 seconds)
[14:40] *** fie has joined #archiveteam-bs
[14:45] *** odemg has joined #archiveteam-bs
[14:58] *** kristian_ has joined #archiveteam-bs
[15:52] *** RichardG has quit IRC (Ping timeout: 255 seconds)
[16:00] *** RichardG has joined #archiveteam-bs
[16:01] *** odemg has quit IRC (Remote host closed the connection)
[16:03] HCross2: cool, good to know about the progress on the census.
[16:08] *** odemg has joined #archiveteam-bs
[16:21] wpull does the epoll_wait(4, thing on my machine too when I ctrl+c it
[16:21] forcing me to press ctrl+C again
[16:24] hopefully doing that doesn't break things
[17:22] I get a UnicodeDecodeError when trying to extract a WARC with warcat :|
[17:23] this bug https://github.com/chfoo/warcat/issues/12
[17:55] *** odemg2 has joined #archiveteam-bs
[17:58] *** odemg has quit IRC (Read error: Operation timed out)
[17:58] okay, it's tripping up on an invalid character in an HTTP header
[17:59] curl -I http://images2.wikia.nocookie.net/__cb20120621080252/aonoexorcistsp/es/images/9/9a/Mephisto_gui%C3%B1o.gif
[17:59] Content-Disposition: inline; filename="Mephisto_gui�o.gif"; filename*=UTF-8''Mephisto_gui%C3%B1o.gif
[17:59] that thing in the filename= field
[18:01] if you're confident doing that, you can change .decode() to .decode('utf-8', 'replace')
[18:02] Yeah, I can do that
[18:09] *** pizzaiolo has joined #archiveteam-bs
[18:15] *** jtn2 has joined #archiveteam-bs
[19:01] *** pnJay has quit IRC (Read error: Operation timed out)
[19:22] *** GE has quit IRC (Remote host closed the connection)
[20:00] *** icedice has joined #archiveteam-bs
[20:39] *** Zebranky has quit IRC (Ping timeout: 250 seconds)
[20:43] *** Zebranky has joined #archiveteam-bs
[20:45] *** GE has joined #archiveteam-bs
[21:02] Just spent like 3 hours figuring out why my WARC files had invalid payload hashes. Wpull discards trailing whitespace on its internal representation of the header field values, leading to an incorrect payload offset
[21:02] so it reads from the wrong spot and gets a bad hash (the actual data is fine however)
[21:23] Ugh, not good
[21:23] chfoo: ^
[21:23] I can submit a patch
[21:23] probably will after I make a test case
[21:24] if I can't figure it out I'll at least make a github issue
[21:25] Hmm, "I'll merge when I have time to work on Wpull again." on https://github.com/chfoo/wpull/pull/348 doesn't sound promising to be honest. :-/
[21:32] reading the HTTP RFC. It's fine that wpull ignores leading/trailing whitespace in header fields. But it should probably store the actual length of the header separately, because it needs it
[21:36] I don't think the length is sufficient.
[21:36] why not?
[21:36] From section 4.2 of RFC 2616: "Such leading or trailing LWS MAY be removed without changing the semantics of the field value."
[21:37] And in section 2.2, LWS is defined as '[CRLF] 1*( SP | HT )', where SP is the space character (0x20) and HT is the horizontal tab (0x09).
[21:38] yes. so the client (wpull) is allowed to discard it. but wpull also needs to checksum the payload for the WARC file, and the payload is everything after the message headers (or that's the general idea). So what it's doing (paraphrased) is "payload_offset = len(response.headers.toString())"
[21:39] and if toString() has discarded some bytes then the offset will be wrong
[21:39] Yeah, that's why it would need to keep the original content returned from the server, before any parsing.
[21:39] it does, that's what goes into the WARC file.
[21:39] the problem is the discrepancy between what it's using to calculate the offset, and what's actually been saved
[21:39] Hm, maybe I misunderstood you - which offset are you talking about?
[21:40] The payload offset, which is where the message body starts (and thus, where the headers end)
[21:41] it saves everything exactly as received.
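(Annotation, not part of the channel traffic.) The offset mismatch described in the messages above can be reproduced with a small sketch. This is not wpull's actual code: the sample response and both function names are made up for illustration, and the "re-serialized" path only mimics the paraphrased `len(response.headers.toString())` behaviour under the assumption that whitespace around each field value is stripped during parsing.

```python
import hashlib

# Hypothetical raw response exactly as received; note the trailing space
# after "text/plain", which a parser is allowed to discard.
RAW = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain \r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)


def offset_from_reserialized_headers(raw: bytes) -> int:
    """Offset computed from a parsed-and-rebuilt header block (the buggy idea)."""
    head, _, _ = raw.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    rebuilt = [lines[0]]  # status line is kept unchanged
    for line in lines[1:]:
        name, _, value = line.partition(b":")
        # Stripping leading/trailing SP and HT loses bytes present in the raw data.
        rebuilt.append(name + b": " + value.strip(b" \t"))
    return len(b"\r\n".join(rebuilt) + b"\r\n\r\n")


def offset_from_raw_bytes(raw: bytes) -> int:
    """Offset computed from the bytes that are actually stored."""
    return raw.index(b"\r\n\r\n") + 4


good = offset_from_raw_bytes(RAW)
bad = offset_from_reserialized_headers(RAW)
print(good, bad)
print(hashlib.sha1(RAW[good:]).hexdigest() == hashlib.sha1(RAW[bad:]).hexdigest())
```

The stored bytes are identical in both cases; only the position the hash is read from differs, which matches the observation that the data itself is fine while the payload digest is wrong.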
this is a post-processing issue
[21:46] *** odemg2 has quit IRC (Remote host closed the connection)
[22:17] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[22:17] *** dashcloud has joined #archiveteam-bs
[22:17] *** bwn has quit IRC (Read error: Operation timed out)
[22:26] *** bwn has joined #archiveteam-bs
[23:14] *** matt_lock has joined #archiveteam-bs
[23:15] *** kristian_ has quit IRC (Quit: Leaving)
[23:20] *** BlueMaxim has joined #archiveteam-bs
[23:22] Sorry if this question has been asked before. I couldn't find the chat log archives for the citeseerx IRC What's going on with the citeseerx warrior project? It claims that rate limiting is active, but there haven't been any downloads for almost a year, and there are a ton of items left to download/upload.
[23:22] Sorry. Those were 2 lines in my txt file, copying it must have removed the newline,
[23:24] *** bwn has quit IRC (Ping timeout: 244 seconds)
[23:32] matt_lock: "ArchiveTeam is first saving about 1 terabyte of files, then the Internet Archive decides whether they are able to store all downloadable stuff, that is going to be tens or hundreds of terabytes."
[23:33] From http://archiveteam.org/index.php?title=PDF_2016
[23:34] (I realise that this is only a partial answer though)
[23:35] So we're waiting on them to find out whether we ought to continue?
[23:35] Fair enough.
[23:35] I guess? I have no idea really since I'm pretty new here.
[23:38] I wasn't following that project but is a reasonable conclusion from what the page says
[23:38] but that is*
[23:38] But yeah, logs for the project channels would be great
[23:46] My Mininova grab is grinding to a halt again -- it times out on most /stat pages and gets lots of 500 errors in general. I'm at about 120k now. I suspect that the number of URLs is significantly higher than my previous estimate of 500k, so I'm not sure this will finish in time. :-/
[23:49] I'm also working on a better estimate of the total size of all torrented data on the site based on the ArchiveBot grab from last month so we can figure out whether it's feasible to grab that.
[23:52] It will still only be an estimate though; ArchiveBot only grabbed about 48k of the 72k torrents. (It attempted retrieving a few thousand more and failed there with error 500, but that still means that it didn't even *try* to download about 20k torrents?!)
[23:52] ^ Based on the CDX
[23:55] The WunderBlogs grab is going well, 150k URLs done and currently 350k left (but that number is still growing; no idea how many URLs there are in total). No bans or rate limits so far. If it stays like that until tomorrow morning, I'll try increasing the concurrency a bit more.