#archiveteam-bs 2017-03-25,Sat

Time Nickname Message
00:21 🔗 godane dashcloud: you're down by 3 items based on the Google cache copy
00:22 🔗 godane what was taken down anyways?
00:23 🔗 godane all I can tell is it's 3 things with an archiveteam subject in them
00:25 🔗 dashcloud the farside calendar thing I uploaded
00:25 🔗 dashcloud probably the original disk, and the one or two tries at having it emulate under windows 3.1
00:35 🔗 RichardG has joined #archiveteam-bs
00:59 🔗 pnJay has quit IRC (Leaving)
00:59 🔗 pnJay has joined #archiveteam-bs
01:05 🔗 bwn has quit IRC (Ping timeout: 244 seconds)
01:13 🔗 bwn has joined #archiveteam-bs
01:14 🔗 icedice2 has quit IRC (Quit: Leaving)
01:39 🔗 RichardG has quit IRC (Read error: Operation timed out)
01:39 🔗 RichardG has joined #archiveteam-bs
01:46 🔗 tklk MLKSHK is shutting down: it will remove the ability to view posts without logging in on April 1, and then stop serving files on May 1. Blog post is here: http://mlkshk.typepad.com/mlkshk/2017/02/mlkshk-shutting-down.html
01:47 🔗 tklk There was previously a project for this, were any scripts kept around? http://archiveteam.org/index.php?title=MLKSHK
01:52 🔗 tklk Signups are closed, which means unless you have an account there is only 1 week left till all this content disappears.
02:11 🔗 zino has quit IRC (Read error: Operation timed out)
02:15 🔗 zino has joined #archiveteam-bs
02:37 🔗 ndiddy has quit IRC ()
03:15 🔗 yuitimoth has quit IRC (Remote host closed the connection)
03:16 🔗 yuitimoth has joined #archiveteam-bs
03:18 🔗 yuitimoth has quit IRC (Remote host closed the connection)
03:21 🔗 yuitimoth has joined #archiveteam-bs
03:28 🔗 yuitimoth has quit IRC (Remote host closed the connection)
03:29 🔗 yuitimoth has joined #archiveteam-bs
03:30 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
03:30 🔗 yuitimoth has quit IRC (Remote host closed the connection)
03:30 🔗 yuitimoth has joined #archiveteam-bs
03:31 🔗 yuitimoth has quit IRC (Remote host closed the connection)
03:31 🔗 yuitimoth has joined #archiveteam-bs
03:31 🔗 yuitimoth has quit IRC (Remote host closed the connection)
03:31 🔗 yuitimoth has joined #archiveteam-bs
03:32 🔗 yuitimoth has quit IRC (Remote host closed the connection)
03:33 🔗 yuitimoth has joined #archiveteam-bs
03:33 🔗 yuitimoth has quit IRC (Remote host closed the connection)
03:33 🔗 yuitimoth has joined #archiveteam-bs
03:34 🔗 yuitimoth has quit IRC (Remote host closed the connection)
03:34 🔗 yuitimoth has joined #archiveteam-bs
03:34 🔗 yuitimoth has quit IRC (Remote host closed the connection)
03:34 🔗 yuitimoth has joined #archiveteam-bs
04:00 🔗 yuitimoth has quit IRC (Remote host closed the connection)
04:00 🔗 yuitimoth has joined #archiveteam-bs
04:01 🔗 yuitimoth has quit IRC (Remote host closed the connection)
04:01 🔗 yuitimoth has joined #archiveteam-bs
05:02 🔗 yuitimoth has quit IRC (Remote host closed the connection)
05:03 🔗 yuitimoth has joined #archiveteam-bs
05:44 🔗 Sk1d has joined #archiveteam-bs
05:54 🔗 Frogging Content block length changed from 4327 to 4318
05:54 🔗 Frogging is that fine?
05:54 🔗 Frogging (output from python3 -m warcat verify)
06:54 🔗 Frogging sort of paranoid about these WARCs I'm getting out of wpull, especially since I've stopped and resumed it a few times to change options. it's resulted in having multiple WARC files (--warc-append and --warc-max-size used together make a new WARC every time you restart). but they all seem fine.
06:54 🔗 Frogging and it works great if I load them all into pywb
06:55 🔗 Frogging yeah, there's probably no issue here.
07:08 🔗 HCross2 Somebody2: 4.71 million so far, with 12 timeouts
07:21 🔗 JAA has joined #archiveteam-bs
07:39 🔗 odemg has quit IRC (Remote host closed the connection)
07:43 🔗 odemg has joined #archiveteam-bs
08:00 🔗 GE has joined #archiveteam-bs
08:09 🔗 odemg has quit IRC (Remote host closed the connection)
08:50 🔗 jtn2 has quit IRC (Ping timeout: 255 seconds)
10:06 🔗 GE has quit IRC (Quit: zzz)
11:08 🔗 BlueMaxim has quit IRC (Quit: Leaving)
11:23 🔗 GE has joined #archiveteam-bs
11:28 🔗 BartoCH has quit IRC (Read error: Connection reset by peer)
11:29 🔗 BartoCH has joined #archiveteam-bs
12:48 🔗 n00b811 has joined #archiveteam-bs
12:48 🔗 n00b811 Does anyone have experience opening large (>10GB) .warc files
12:54 🔗 HCross2 Patience is a virtue
12:57 🔗 n00b811 I tried to open it on my windows 2012 R2 server but webarchiveplayer just closed after ~18 hours
13:13 🔗 Sanqui https://twitter.com/GossiTheDog/status/845446263244050434
13:14 🔗 odemg has joined #archiveteam-bs
13:24 🔗 Aoede o_O
13:26 🔗 JAA Just store everything in the cloud, they said. It'll be glorious, they said.
13:27 🔗 SpaffGarg its like kazaa but without porn
13:28 🔗 * SpaffGarg searches for "passwords"
13:29 🔗 * JAA searches for "CVV"
13:36 🔗 joepie91 haha, wow
13:43 🔗 JAA "wpull.engine - WARNING - Discarding 1 unprocessed item." - Is this something to worry about?
13:44 🔗 JAA Happened when I Ctrl-C'd wpull to increase the concurrency.
13:47 🔗 HCross2 So.. I just found a load of bank statements for someone on that site
13:50 🔗 PurpleSym Identity theft made easy: https://docs.com/en-us/search?q=curriculum%20vitae
13:52 🔗 JAA I found some birth certificates, social security numbers, and passports...
13:54 🔗 SpaffGarg passports are easy, people post their new ones on twitter all the time
14:00 🔗 JAA Oh great, found a huge list containing various information about over 1000 people: name, address, date of birth, SSN, bank, credit card number + CVV + expiration date, name + SSN of the spouse, etc.
14:16 🔗 JAA Lol, I just wondered why wpull had stalled. Then I realised that I was in scrollback mode.
14:19 🔗 odemg has quit IRC (Remote host closed the connection)
14:29 🔗 fie has quit IRC (Ping timeout: 250 seconds)
14:40 🔗 fie has joined #archiveteam-bs
14:45 🔗 odemg has joined #archiveteam-bs
14:58 🔗 kristian_ has joined #archiveteam-bs
15:52 🔗 RichardG has quit IRC (Ping timeout: 255 seconds)
16:00 🔗 RichardG has joined #archiveteam-bs
16:01 🔗 odemg has quit IRC (Remote host closed the connection)
16:03 🔗 Somebody2 HCross2: cool, good to know about the progress on the census.
16:08 🔗 odemg has joined #archiveteam-bs
16:21 🔗 Frogging wpull does the epoll_wait(4, thing on my machine too when I ctrl+c it
16:21 🔗 Frogging forcing me to press ctrl+C again
16:24 🔗 Frogging hopefully doing that doesn't break things
17:22 🔗 Frogging I get a UnicodeDecodeError when trying to extract a WARC with warcat :|
17:23 🔗 Frogging this bug https://github.com/chfoo/warcat/issues/12
17:55 🔗 odemg2 has joined #archiveteam-bs
17:58 🔗 odemg has quit IRC (Read error: Operation timed out)
17:58 🔗 Frogging okay, it's tripping up on an invalid character in an HTTP header
17:59 🔗 Frogging curl -I http://images2.wikia.nocookie.net/__cb20120621080252/aonoexorcistsp/es/images/9/9a/Mephisto_gui%C3%B1o.gif
17:59 🔗 Frogging Content-Disposition: inline; filename="Mephisto_gui�o.gif"; filename*=UTF-8''Mephisto_gui%C3%B1o.gif
17:59 🔗 Frogging that thing in the filename= field
18:01 🔗 Sanqui if you're confident doing that, you can change .decode() to .decode('utf-8', 'replace')
18:02 🔗 Frogging Yeah, I can do that
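The decoding failure discussed above can be reproduced in a few lines of Python. The header bytes below are reconstructed from the curl output quoted earlier (0xF1 is a Latin-1 ñ, which is not valid UTF-8); this is a sketch of the issue, not warcat's actual code path:

```python
# Hypothetical reconstruction of the header that trips warcat: the byte
# 0xF1 (Latin-1 n-tilde) is an invalid UTF-8 sequence here.
raw = b'Content-Disposition: inline; filename="Mephisto_gui\xf1o.gif"'

try:
    raw.decode('utf-8')
except UnicodeDecodeError:
    pass  # this is the failure warcat hits

# The workaround Sanqui suggests: substitute U+FFFD for undecodable bytes.
fixed = raw.decode('utf-8', 'replace')
# 'Content-Disposition: inline; filename="Mephisto_gui\ufffdo.gif"'
```

The 'replace' error handler loses the original byte, so it is only appropriate for display or post-processing, not for anything that needs to round-trip the raw record.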
18:09 🔗 pizzaiolo has joined #archiveteam-bs
18:15 🔗 jtn2 has joined #archiveteam-bs
19:01 🔗 pnJay has quit IRC (Read error: Operation timed out)
19:22 🔗 GE has quit IRC (Remote host closed the connection)
20:00 🔗 icedice has joined #archiveteam-bs
20:39 🔗 Zebranky has quit IRC (Ping timeout: 250 seconds)
20:43 🔗 Zebranky has joined #archiveteam-bs
20:45 🔗 GE has joined #archiveteam-bs
21:02 🔗 Frogging Just spent like 3 hours figuring out why my WARC files had invalid payload hashes. Wpull discards trailing whitespace on its internal representation of the header field values, leading to an incorrect payload offset
21:02 🔗 Frogging so it reads from the wrong spot and gets a bad hash (the actual data is fine however)
21:23 🔗 JAA Ugh, not good
21:23 🔗 JAA chfoo: ^
21:23 🔗 Frogging I can submit a patch
21:23 🔗 Frogging probably will after I make a test case
21:24 🔗 Frogging if I can't figure it out I'll at least make a github issue
21:25 🔗 JAA Hmm, "I'll merge when I have time to work on Wpull again." on https://github.com/chfoo/wpull/pull/348 doesn't sound promising to be honest. :-/
21:32 🔗 Frogging reading the HTTP RFC. It's fine that wpull ignores leading/trailing whitespace in header fields. But it should probably store the actual length of the header separately, because it needs it
21:36 🔗 JAA I don't think the length is sufficient.
21:36 🔗 Frogging why not?
21:36 🔗 JAA From section 4.2 of RFC 2616: "Such leading or trailing LWS MAY be removed without changing the semantics of the field value."
21:37 🔗 JAA And in section 2.2, LWS is defined as '[CRLF] 1*( SP | HT )', where SP is the space character (0x20) and HT is the horizontal tab (0x09).
21:38 🔗 Frogging yes. so the client (wpull) is allowed to discard it. but wpull also needs to checksum the payload for the WARC file, and the payload is everything after the message headers (or that's the general idea). So what it's doing (paraphrased) is "payload_offset = len(response.headers.toString())"
21:39 🔗 Frogging and if toString() has discarded some bytes then the offset will be wrong
21:39 🔗 JAA Yeah, that's why it would need to keep the original content returned from the server, before any parsing.
21:39 🔗 Frogging it does, that's what goes into the WARC file.
21:39 🔗 Frogging the problem is the discrepancy between what it's using to calculate the offset, and what's actually been saved
21:39 🔗 JAA Hm, maybe I misunderstood you - which offset are you talking about?
21:40 🔗 Frogging The payload offset, which is where the message body starts (and thus, where the headers end)
21:41 🔗 Frogging it saves everything exactly as received. this is a post-processing issue
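The offset discrepancy Frogging describes can be sketched in a few lines. The header and body bytes here are invented for illustration (wpull's real internals differ); the point is that computing the payload offset from a whitespace-stripped reserialisation of the headers, while the WARC stores the raw bytes, shifts the slice and breaks the digest:

```python
import hashlib

# Raw response as received; note the trailing space in the header value.
raw_headers = b'HTTP/1.1 200 OK\r\nX-Example: value \r\n\r\n'
body = b'hello world'
stored = raw_headers + body  # what actually goes into the WARC record

# An internal representation that stripped trailing whitespace is one
# byte shorter, so a length-based offset lands one byte too early.
stripped = b'HTTP/1.1 200 OK\r\nX-Example: value\r\n\r\n'
bad_offset = len(stripped)
good_offset = len(raw_headers)

true_digest = hashlib.sha1(body).hexdigest()
good_digest = hashlib.sha1(stored[good_offset:]).hexdigest()  # matches
bad_digest = hashlib.sha1(stored[bad_offset:]).hexdigest()    # does not
```

This matches Frogging's observation that the stored data is fine and only the verification (a post-processing step) goes wrong.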
21:46 🔗 odemg2 has quit IRC (Remote host closed the connection)
22:17 🔗 dashcloud has quit IRC (Read error: Connection reset by peer)
22:17 🔗 dashcloud has joined #archiveteam-bs
22:17 🔗 bwn has quit IRC (Read error: Operation timed out)
22:26 🔗 bwn has joined #archiveteam-bs
23:14 🔗 matt_lock has joined #archiveteam-bs
23:15 🔗 kristian_ has quit IRC (Quit: Leaving)
23:20 🔗 BlueMaxim has joined #archiveteam-bs
23:22 🔗 matt_lock Sorry if this question has been asked before. I couldn't find the chat log archives for the citeseerx IRC. What's going on with the citeseerx warrior project? It claims that rate limiting is active, but there haven't been any downloads for almost a year, and there are a ton of items left to download/upload.
23:22 🔗 matt_lock Sorry. Those were 2 lines in my txt file, copying it must have removed the newline,
23:24 🔗 bwn has quit IRC (Ping timeout: 244 seconds)
23:32 🔗 JAA matt_lock: "ArchiveTeam is first saving about 1 terabyte of files, then the Internet Archive decides whether they are able to store all downloadable stuff, that is going to be tens or hundreds of terabytes."
23:33 🔗 JAA From http://archiveteam.org/index.php?title=PDF_2016
23:34 🔗 JAA (I realise that this is only a partial answer though)
23:35 🔗 matt_lock So we're waiting on them to find out whether we ought to continue?
23:35 🔗 matt_lock Fair enough.
23:35 🔗 JAA I guess? I have no idea really since I'm pretty new here.
23:38 🔗 Frogging I wasn't following that project, but that is a reasonable conclusion from what the page says
23:38 🔗 JAA But yeah, logs for the project channels would be great
23:46 🔗 JAA My Mininova grab is grinding to a halt again -- it times out on most /stat pages and gets lots of 500 errors in general. I'm at about 120k now. I suspect that the number of URLs is significantly higher than my previous estimate of 500k, so I'm not sure this will finish in time. :-/
23:49 🔗 JAA I'm also working on a better estimate of the total size of all torrented data on the site based on the ArchiveBot grab from last month so we can figure out whether it's feasible to grab that.
23:52 🔗 JAA It will still only be an estimate though; ArchiveBot only grabbed about 48k of the 72k torrents. (It attempted retrieving a few thousand more and failed there with error 500, but that still means that it didn't even *try* to download about 20k torrents?!)
23:52 🔗 JAA ^ Based on the CDX
23:55 🔗 JAA The WunderBlogs grab is going well, 150k URLs done and currently 350k left (but that number is still growing; no idea how many URLs there are in total). No bans or rate limits so far. If it stays like that until tomorrow morning, I'll try increasing the concurrency a bit more.
