#archiveteam-bs 2017-03-25,Sat


Time	Nickname	Message
00:21 ^🔗	godane	dashcloud: you're down by 3 items based on the Google cache copy
00:22 ^🔗	godane	what was taken down anyway?
00:23 ^🔗	godane	all I can tell is it's 3 things with archiveteam subject in them
00:25 ^🔗	dashcloud	the farside calendar thing I uploaded
00:25 ^🔗	dashcloud	probably the original disk, and the one or two tries at having it emulate under windows 3.1
00:35 ^🔗		RichardG has joined #archiveteam-bs
00:59 ^🔗		pnJay has quit IRC (Leaving)
00:59 ^🔗		pnJay has joined #archiveteam-bs
01:05 ^🔗		bwn has quit IRC (Ping timeout: 244 seconds)
01:13 ^🔗		bwn has joined #archiveteam-bs
01:14 ^🔗		icedice2 has quit IRC (Quit: Leaving)
01:39 ^🔗		RichardG has quit IRC (Read error: Operation timed out)
01:39 ^🔗		RichardG has joined #archiveteam-bs
01:46 ^🔗	tklk	MLKSHK is shutting down and removing the ability to view posts without logging in on April 1, and then stop serving files May 1. Blog post is here: http://mlkshk.typepad.com/mlkshk/2017/02/mlkshk-shutting-down.html
01:47 ^🔗	tklk	There was previously a project for this, were any scripts kept around? http://archiveteam.org/index.php?title=MLKSHK
01:52 ^🔗	tklk	Signups are closed, which means unless you have an account there is only 1 week left till all this content disappears.
02:11 ^🔗		zino has quit IRC (Read error: Operation timed out)
02:15 ^🔗		zino has joined #archiveteam-bs
02:37 ^🔗		ndiddy has quit IRC ()
03:15 ^🔗		yuitimoth has quit IRC (Remote host closed the connection)
03:16 ^🔗		yuitimoth has joined #archiveteam-bs
03:18 ^🔗		yuitimoth has quit IRC (Remote host closed the connection)
03:21 ^🔗		yuitimoth has joined #archiveteam-bs
03:28 ^🔗		yuitimoth has quit IRC (Remote host closed the connection)
03:29 ^🔗		yuitimoth has joined #archiveteam-bs
03:30 ^🔗		pizzaiolo has quit IRC (Remote host closed the connection)
03:30 ^🔗		yuitimoth has quit IRC (Remote host closed the connection)
03:30 ^🔗		yuitimoth has joined #archiveteam-bs
03:31 ^🔗		yuitimoth has quit IRC (Remote host closed the connection)
03:31 ^🔗		yuitimoth has joined #archiveteam-bs
03:31 ^🔗		yuitimoth has quit IRC (Remote host closed the connection)
03:31 ^🔗		yuitimoth has joined #archiveteam-bs
03:32 ^🔗		yuitimoth has quit IRC (Remote host closed the connection)
03:33 ^🔗		yuitimoth has joined #archiveteam-bs
03:33 ^🔗		yuitimoth has quit IRC (Remote host closed the connection)
03:33 ^🔗		yuitimoth has joined #archiveteam-bs
03:34 ^🔗		yuitimoth has quit IRC (Remote host closed the connection)
03:34 ^🔗		yuitimoth has joined #archiveteam-bs
03:34 ^🔗		yuitimoth has quit IRC (Remote host closed the connection)
03:34 ^🔗		yuitimoth has joined #archiveteam-bs
04:00 ^🔗		yuitimoth has quit IRC (Remote host closed the connection)
04:00 ^🔗		yuitimoth has joined #archiveteam-bs
04:01 ^🔗		yuitimoth has quit IRC (Remote host closed the connection)
04:01 ^🔗		yuitimoth has joined #archiveteam-bs
05:02 ^🔗		yuitimoth has quit IRC (Remote host closed the connection)
05:03 ^🔗		yuitimoth has joined #archiveteam-bs
05:44 ^🔗		Sk1d has joined #archiveteam-bs
05:54 ^🔗	Frogging	Content block length changed from 4327 to 4318
05:54 ^🔗	Frogging	is that fine?
05:54 ^🔗	Frogging	(output from python3 -m warcat verify)
06:54 ^🔗	Frogging	sort of paranoid about these WARCs I'm getting out of wpull, especially since I've stopped and resumed it a few times to change options. it's resulted in having multiple WARC files (--warc-append and --warc-max-size used together make a new WARC every time you restart). but they all seem fine.
06:54 ^🔗	Frogging	and it works great if I load them all into pywb
06:55 ^🔗	Frogging	yeah, there's probably no issue here.
07:08 ^🔗	HCross2	Somebody2: 4.71million so far, with 12 timeouts
07:21 ^🔗		JAA has joined #archiveteam-bs
07:39 ^🔗		odemg has quit IRC (Remote host closed the connection)
07:43 ^🔗		odemg has joined #archiveteam-bs
08:00 ^🔗		GE has joined #archiveteam-bs
08:09 ^🔗		odemg has quit IRC (Remote host closed the connection)
08:50 ^🔗		jtn2 has quit IRC (Ping timeout: 255 seconds)
10:06 ^🔗		GE has quit IRC (Quit: zzz)
11:08 ^🔗		BlueMaxim has quit IRC (Quit: Leaving)
11:23 ^🔗		GE has joined #archiveteam-bs
11:28 ^🔗		BartoCH has quit IRC (Read error: Connection reset by peer)
11:29 ^🔗		BartoCH has joined #archiveteam-bs
12:48 ^🔗		n00b811 has joined #archiveteam-bs
12:48 ^🔗	n00b811	Does anyone have experience opening large (>10GB) .warc files
12:54 ^🔗	HCross2	Patience is a virtue
12:57 ^🔗	n00b811	I tried to open it on my windows 2012 R2 server but webarchiveplayer just closed after ~18 hours
13:13 ^🔗	Sanqui	https://twitter.com/GossiTheDog/status/845446263244050434
13:14 ^🔗		odemg has joined #archiveteam-bs
13:24 ^🔗	Aoede	o_O
13:26 ^🔗	JAA	Just store everything in the cloud, they said. It'll be glorious, they said.
13:27 ^🔗	SpaffGarg	its like kazaa but without porn
13:28 ^🔗	*	SpaffGarg searches for "passwords"
13:29 ^🔗	*	JAA searches for "CCV"
13:32 ^🔗	JAA	CVV*
13:36 ^🔗	joepie91	haha, wow
13:43 ^🔗	JAA	"wpull.engine - WARNING - Discarding 1 unprocessed item." - Is this something to worry about?
13:44 ^🔗	JAA	Happened when I Ctrl-C'd wpull to increase the concurrency.
13:47 ^🔗	HCross2	So.. I just found a load of bank statements for someone on that site
13:50 ^🔗	PurpleSym	Identity theft made easy: https://docs.com/en-us/search?q=curriculum%20vitae
13:52 ^🔗	JAA	I found some birth certificates, social security numbers, and passports...
13:54 ^🔗	SpaffGarg	passports are easy, people post their new ones on twitter all the time
14:00 ^🔗	JAA	Oh great, found a huge list containing various information about over 1000 people: name, address, date of birth, SSN, bank, credit card number + CVV + expiration date, name + SSN of the spouse, etc.
14:16 ^🔗	JAA	Lol, I just wondered why wpull had stalled. Then I realised that I was in scrollback mode.
14:19 ^🔗		odemg has quit IRC (Remote host closed the connection)
14:29 ^🔗		fie has quit IRC (Ping timeout: 250 seconds)
14:40 ^🔗		fie has joined #archiveteam-bs
14:45 ^🔗		odemg has joined #archiveteam-bs
14:58 ^🔗		kristian_ has joined #archiveteam-bs
15:52 ^🔗		RichardG has quit IRC (Ping timeout: 255 seconds)
16:00 ^🔗		RichardG has joined #archiveteam-bs
16:01 ^🔗		odemg has quit IRC (Remote host closed the connection)
16:03 ^🔗	Somebody2	HCross2: cool, good to know about the progress on the census.
16:08 ^🔗		odemg has joined #archiveteam-bs
16:21 ^🔗	Frogging	wpull does the epoll_wait(4, thing on my machine too when I ctrl+c it
16:21 ^🔗	Frogging	forcing me to press ctrl+C again
16:24 ^🔗	Frogging	hopefully doing that doesn't break things
17:22 ^🔗	Frogging	I get a UnicodeDecodeError when trying to extract a WARC with warcat :|
17:23 ^🔗	Frogging	this bug https://github.com/chfoo/warcat/issues/12
17:55 ^🔗		odemg2 has joined #archiveteam-bs
17:58 ^🔗		odemg has quit IRC (Read error: Operation timed out)
17:58 ^🔗	Frogging	okay, it's tripping up on an invalid character in an HTTP header
17:59 ^🔗	Frogging	curl -I http://images2.wikia.nocookie.net/__cb20120621080252/aonoexorcistsp/es/images/9/9a/Mephisto_gui%C3%B1o.gif
17:59 ^🔗	Frogging	Content-Disposition: inline; filename="Mephisto_gui�o.gif"; filename*=UTF-8''Mephisto_gui%C3%B1o.gif
17:59 ^🔗	Frogging	that thing in the filename= field
18:01 ^🔗	Sanqui	if you're confident doing that, you can change .decode() to .decode('utf-8', 'replace')
18:02 ^🔗	Frogging	Yeah, I can do that
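
A minimal sketch of what Sanqui's suggestion does, assuming the offending byte is "ñ" sent as the single Latin-1 byte 0xF1 (the log only shows a replacement character, so the exact byte is a guess). The `decode_header` helper is hypothetical and is not warcat's actual code; it only illustrates the difference between a strict decode and `'replace'`.

```python
# Hypothetical header bytes: "ñ" encoded as Latin-1 (0xF1), which is invalid UTF-8.
RAW_HEADER = b'Content-Disposition: inline; filename="Mephisto_gui\xf1o.gif"'


def decode_header(data: bytes) -> str:
    """Decode header bytes, substituting U+FFFD for undecodable bytes."""
    return data.decode('utf-8', 'replace')


def strict_decode_fails(data: bytes) -> bool:
    """Return True if a plain .decode() would raise UnicodeDecodeError."""
    try:
        data.decode()  # defaults to UTF-8 with errors='strict'
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    print(strict_decode_fails(RAW_HEADER))
    print(decode_header(RAW_HEADER))
```

The trade-off is that `'replace'` is lossy: the original byte cannot be recovered from the decoded string, so it is only suitable for display or extraction, not for rewriting the record.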
18:09 ^🔗		pizzaiolo has joined #archiveteam-bs
18:15 ^🔗		jtn2 has joined #archiveteam-bs
19:01 ^🔗		pnJay has quit IRC (Read error: Operation timed out)
19:22 ^🔗		GE has quit IRC (Remote host closed the connection)
20:00 ^🔗		icedice has joined #archiveteam-bs
20:39 ^🔗		Zebranky has quit IRC (Ping timeout: 250 seconds)
20:43 ^🔗		Zebranky has joined #archiveteam-bs
20:45 ^🔗		GE has joined #archiveteam-bs
21:02 ^🔗	Frogging	Just spent like 3 hours figuring out why my WARC files had invalid payload hashes. Wpull discards trailing whitespace on its internal representation of the header field values, leading to an incorrect payload offset
21:02 ^🔗	Frogging	so it reads from the wrong spot and gets a bad hash (the actual data is fine however)
21:23 ^🔗	JAA	Ugh, not good
21:23 ^🔗	JAA	chfoo: ^
21:23 ^🔗	Frogging	I can submit a patch
21:23 ^🔗	Frogging	probably will after I make a test case
21:24 ^🔗	Frogging	if I can't figure it out I'll at least make a github issue
21:25 ^🔗	JAA	Hmm, "I'll merge when I have time to work on Wpull again." on https://github.com/chfoo/wpull/pull/348 doesn't sound promising to be honest. :-/
21:32 ^🔗	Frogging	reading the HTTP RFC. It's fine that wpull ignores leading/trailing whitespace in header fields. But it should probably store the actual length of the header separately, because it needs it
21:36 ^🔗	JAA	I don't think the length is sufficient.
21:36 ^🔗	Frogging	why not?
21:36 ^🔗	JAA	From section 4.2 of RFC 2616: "Such leading or trailing LWS MAY be removed without changing the semantics of the field value."
21:37 ^🔗	JAA	And in section 2.2, LWS is defined as '[CRLF] 1*( SP | HT )', where SP is the space character (0x20) and HT is the horizontal tab (0x09).
21:38 ^🔗	Frogging	yes. so the client (wpull) is allowed to discard it. but wpull also needs to checksum the payload for the WARC file, and the payload is everything after the message headers (or that's the general idea). So what it's doing (paraphrased) is "payload_offset = len(response.headers.toString())"
21:39 ^🔗	Frogging	and if toString() has discarded some bytes then the offset will be wrong
21:39 ^🔗	JAA	Yeah, that's why it would need to keep the original content returned from the server, before any parsing.
21:39 ^🔗	Frogging	it does, that's what goes into the WARC file.
21:39 ^🔗	Frogging	the problem is the discrepancy between what it's using to calculate the offset, and what's actually been saved
21:39 ^🔗	JAA	Hm, maybe I misunderstood you - which offset are you talking about?
21:40 ^🔗	Frogging	The payload offset, which is where the message body starts (and thus, where the headers end)
21:41 ^🔗	Frogging	it saves everything exactly as received. this is a post-processing issue
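
The offset discrepancy Frogging describes can be sketched roughly as follows. This is not wpull's code: the response bytes and both helper functions are made up for illustration, and the re-serialization step only approximates what a parser that strips whitespace from field values would produce.

```python
import hashlib

# Hypothetical raw response; note the trailing space before the CRLF on the header line.
RAW = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain \r\n"
    b"\r\n"
    b"hello"
)


def offset_from_raw(raw: bytes) -> int:
    """Payload offset taken from the bytes exactly as received."""
    return raw.index(b"\r\n\r\n") + 4


def offset_from_reserialized(raw: bytes) -> int:
    """Payload offset computed from headers rebuilt after stripping whitespace."""
    head = raw[: raw.index(b"\r\n\r\n")]
    status, *fields = head.split(b"\r\n")
    rebuilt = [status]
    for field in fields:
        name, _, value = field.partition(b":")
        # Stripping is allowed by the RFC, but it changes the byte length.
        rebuilt.append(name + b": " + value.strip())
    return len(b"\r\n".join(rebuilt) + b"\r\n\r\n")


def payload_sha1(raw: bytes, offset: int) -> str:
    """Hex SHA-1 of everything from the given offset onward."""
    return hashlib.sha1(raw[offset:]).hexdigest()


if __name__ == "__main__":
    good, bad = offset_from_raw(RAW), offset_from_reserialized(RAW)
    print(good, bad)
    print(payload_sha1(RAW, good) == payload_sha1(RAW, bad))
```

In this sketch the stored bytes are untouched in both cases; only the digest computed from the shorter, rebuilt header length is wrong, which matches the "actual data is fine" observation.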
21:46 ^🔗		odemg2 has quit IRC (Remote host closed the connection)
22:17 ^🔗		dashcloud has quit IRC (Read error: Connection reset by peer)
22:17 ^🔗		dashcloud has joined #archiveteam-bs
22:17 ^🔗		bwn has quit IRC (Read error: Operation timed out)
22:26 ^🔗		bwn has joined #archiveteam-bs
23:14 ^🔗		matt_lock has joined #archiveteam-bs
23:15 ^🔗		kristian_ has quit IRC (Quit: Leaving)
23:20 ^🔗		BlueMaxim has joined #archiveteam-bs
23:22 ^🔗	matt_lock	Sorry if this question has been asked before. I couldn't find the chat log archives for the citeseerx IRC What's going on with the citeseerx warrior project? It claims that rate limiting is active, but there haven't been any downloads for almost a year, and there are a ton of items left to download/upload.
23:22 ^🔗	matt_lock	Sorry. Those were 2 lines in my txt file, copying it must have removed the newline,
23:24 ^🔗		bwn has quit IRC (Ping timeout: 244 seconds)
23:32 ^🔗	JAA	matt_lock: "ArchiveTeam is first saving about 1 terabyte of files, then the Internet Archive decides whether they are able to store all downloadable stuff, that is going to be tens or hundreds of terabytes."
23:33 ^🔗	JAA	From http://archiveteam.org/index.php?title=PDF_2016
23:34 ^🔗	JAA	(I realise that this is only a partial answer though)
23:35 ^🔗	matt_lock	So we're waiting on them to find out whether we ought to continue?
23:35 ^🔗	matt_lock	Fair enough.
23:35 ^🔗	JAA	I guess? I have no idea really since I'm pretty new here.
23:38 ^🔗	Frogging	I wasn't following that project but is a reasonable conclusion from what the page says
23:38 ^🔗	Frogging	but that is*
23:38 ^🔗	JAA	But yeah, logs for the project channels would be great
23:46 ^🔗	JAA	My Mininova grab is grinding to a halt again -- it times out on most /stat pages and gets lots of 500 errors in general. I'm at about 120k now. I suspect that the number of URLs is significantly higher than my previous estimate of 500k, so I'm not sure this will finish in time. :-/
23:49 ^🔗	JAA	I'm also working on a better estimate of the total size of all torrented data on the site based on the ArchiveBot grab from last month so we can figure out whether it's feasible to grab that.
23:52 ^🔗	JAA	It will still only be an estimate though; ArchiveBot only grabbed about 48k of the 72k torrents. (It attempted retrieving a few thousand more and failed there with error 500, but that still means that it didn't even try to download about 20k torrents?!)
23:52 ^🔗	JAA	^ Based on the CDX
23:55 ^🔗	JAA	The WunderBlogs grab is going well, 150k URLs done and currently 350k left (but that number is still growing; no idea how many URLs there are in total). No bans or rate limits so far. If it stays like that until tomorrow morning, I'll try increasing the concurrency a bit more.
