#archiveteam-bs 2017-08-15,Tue

Time	Nickname	Message
00:02 ^🔗		SN4T14 has quit IRC (Quit: ZNC 1.6.3 - http://znc.in)
00:03 ^🔗		SN4T14 has joined #archiveteam-bs
00:11 ^🔗		nightpool has quit IRC (Read error: Operation timed out)
00:13 ^🔗		nightpool has joined #archiveteam-bs
01:01 ^🔗		TheLovina has quit IRC (Read error: Operation timed out)
01:03 ^🔗		TheLovina has joined #archiveteam-bs
01:16 ^🔗	hook54321	!a http://82.221.129.208/ --useragent firefox
01:28 ^🔗		username1 has joined #archiveteam-bs
01:31 ^🔗		schbirid2 has quit IRC (Read error: Operation timed out)
01:51 ^🔗		pizzaiolo has quit IRC (Quit: pizzaiolo)
02:41 ^🔗		Odd0002 has quit IRC (Remote host closed the connection)
03:00 ^🔗	hook54321	So I just saw this: https://github.com/chfoo/wpull/issues/356
03:00 ^🔗	hook54321	Would it be possible to incentivize sites to not disallow ia_archiver in their robots.txt file by respecting delay specified in robots.txt?
03:01 ^🔗	SketchCow	We don't negotiate with terrorists
03:01 ^🔗	hook54321	lol.
03:02 ^🔗	Frogging	:p
03:06 ^🔗	hook54321	but like if we were going to do that as the issue suggests, i don't see why we would want to cooperate with people that disallow the wayback machine.
03:07 ^🔗	hook54321	i think that it's stupid that some sites try to tell people to use a crawl delay of 10 seconds though
03:27 ^🔗	hook54321	Brendan Eich appears to be supporting this: https://github.com/EdOverflow/security-txt
03:32 ^🔗		qw3rty119 has joined #archiveteam-bs
03:38 ^🔗		qw3rty118 has quit IRC (Read error: Operation timed out)
03:51 ^🔗		Stilett0 has joined #archiveteam-bs
04:09 ^🔗	hook54321	Wiki is acting kinda funny
04:24 ^🔗	hook54321	JAA: Daily Stormer is moving to the Tor network
04:26 ^🔗		Sk1d has quit IRC (Ping timeout: 250 seconds)
04:33 ^🔗		Sk1d has joined #archiveteam-bs
04:37 ^🔗	hook54321	Apparently Google froze their domain, so they can't move it now.
04:46 ^🔗		robink has quit IRC (Read error: Connection reset by peer)
04:51 ^🔗		robink has joined #archiveteam-bs
04:55 ^🔗		dashcloud has quit IRC (Read error: Operation timed out)
05:01 ^🔗		dashcloud has joined #archiveteam-bs
05:05 ^🔗		kimmer22 has joined #archiveteam-bs
05:14 ^🔗		kimmer2 has quit IRC (Ping timeout: 633 seconds)
05:20 ^🔗		Stilett0 is now known as Stiletto
05:25 ^🔗		kimmer2 has joined #archiveteam-bs
05:33 ^🔗		kimmer22 has quit IRC (Ping timeout: 633 seconds)
05:53 ^🔗	zino	hook54321: Something that might be more fruitful is checking what the support for HTTP error 429 is in wpull. I've seen logs where we get a lot of 429s followed by a 200 followed by a lot of 429s again. RFC6585. Either:
05:53 ^🔗	zino	1) wpull does not handle the Retry-After header
05:53 ^🔗	zino	2) The site is still not prepared to answer requests after timeout
05:53 ^🔗	zino	3) The site does not send a Retry-After header
05:53 ^🔗	zino	If it's 2 or 3, then there's not much we can do; if it's 1, we would probably save all sides trouble by implementing it, and minimize the chance of getting IP-banned. Then add a pipeline override if there is reason to ignore requests from the server to back off.
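Case 1 above comes down to turning a Retry-After value into a wait time. The following is a minimal sketch of that step only, not wpull's actual code: the function name `retry_after_seconds` and the `default` and `cap` values are hypothetical. It assumes the header carries either a number of seconds or an HTTP date, the two forms HTTP allows for Retry-After.

```python
import email.utils
from datetime import datetime, timezone


def retry_after_seconds(header_value, now=None, default=30.0, cap=3600.0):
    """Return how many seconds to wait for a given Retry-After header value.

    Hypothetical helper: `default` is used when the header is missing or
    unparseable (case 3 above), and `cap` bounds unreasonably long waits.
    """
    if header_value is None:
        return default
    value = header_value.strip()
    # Form 1: delay in whole seconds, e.g. "120".
    if value.isdigit():
        return min(float(value), cap)
    # Form 2: an HTTP date, e.g. "Tue, 15 Aug 2017 05:54:00 GMT".
    try:
        when = email.utils.parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when is None:
        return default
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means we may retry immediately.
    return min(max((when - now).total_seconds(), 0.0), cap)
```

A crawler would sleep for the returned number of seconds after a 429 before retrying. The pipeline override zino mentions could be a flag that skips this wait.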
05:53 ^🔗		HCross has quit IRC (Read error: Connection reset by peer)
05:54 ^🔗		HCross has joined #archiveteam-bs
05:55 ^🔗		robogoat has quit IRC (Read error: Operation timed out)
05:56 ^🔗		robogoat has joined #archiveteam-bs
06:19 ^🔗		kimmer22 has joined #archiveteam-bs
06:19 ^🔗		godane has quit IRC (Quit: Leaving.)
06:26 ^🔗		kimmer2 has quit IRC (Ping timeout: 633 seconds)
06:49 ^🔗		j08nY has joined #archiveteam-bs
06:59 ^🔗		dashcloud has quit IRC (Read error: Operation timed out)
07:14 ^🔗		BlueMaxim has joined #archiveteam-bs
07:15 ^🔗		kimmer2 has joined #archiveteam-bs
07:15 ^🔗		TheLovina has quit IRC (Ping timeout: 370 seconds)
07:15 ^🔗		TheLovina has joined #archiveteam-bs
07:20 ^🔗		kimmer22 has quit IRC (Ping timeout: 633 seconds)
07:20 ^🔗		dashcloud has joined #archiveteam-bs
07:28 ^🔗		Boppen has quit IRC (Ping timeout: 194 seconds)
07:41 ^🔗		BlueMaxim has quit IRC (Read error: Operation timed out)
07:41 ^🔗		BlueMaxim has joined #archiveteam-bs
07:48 ^🔗		Honno has joined #archiveteam-bs
07:49 ^🔗		HCross has quit IRC (Remote host closed the connection)
07:49 ^🔗		HCross has joined #archiveteam-bs
08:32 ^🔗		j08nY has quit IRC (Read error: Operation timed out)
08:34 ^🔗		kimmer22 has joined #archiveteam-bs
08:38 ^🔗		kimmer2 has quit IRC (Ping timeout: 633 seconds)
08:40 ^🔗		kimmer2 has joined #archiveteam-bs
08:40 ^🔗		Boppen has joined #archiveteam-bs
08:45 ^🔗		kimmer22 has quit IRC (Ping timeout: 633 seconds)
08:45 ^🔗		kimmer22 has joined #archiveteam-bs
08:50 ^🔗		kimmer2 has quit IRC (Ping timeout: 632 seconds)
08:51 ^🔗	hook54321	JAA: Onion address for Daily Stormer: http://dstormer6em3i4km.onion/
08:51 ^🔗		BlueMaxim has quit IRC (Quit: Leaving)
09:25 ^🔗		kimmer2 has joined #archiveteam-bs
09:30 ^🔗		kimmer22 has quit IRC (Ping timeout: 633 seconds)
09:32 ^🔗		kimmer1 has joined #archiveteam-bs
09:36 ^🔗		godane has joined #archiveteam-bs
09:37 ^🔗	godane	looks like IA is down again
09:48 ^🔗	hook54321	yup
09:49 ^🔗	hook54321	nothing on their twitter yet.
10:19 ^🔗		Honno has quit IRC (Read error: Operation timed out)
10:30 ^🔗		Mateon1 has quit IRC (Ping timeout: 268 seconds)
10:30 ^🔗		Mateon1 has joined #archiveteam-bs
10:48 ^🔗		j08nY has joined #archiveteam-bs
10:56 ^🔗		ivan has quit IRC (Leaving)
11:18 ^🔗		marvinw has joined #archiveteam-bs
11:21 ^🔗	JAA	Very interesting court decision: https://www.reuters.com/article/us-microsoft-linkedin-ruling-idUSKCN1AU2BV
11:44 ^🔗		atluxity1 has joined #archiveteam-bs
11:46 ^🔗		atluxity has quit IRC (Ping timeout: 506 seconds)
11:50 ^🔗	JAA	We should start archiving whois information.
11:50 ^🔗	JAA	And DNS records
12:43 ^🔗	joepie91	holy shit
12:43 ^🔗	joepie91	that is actually a Very Big Deal
13:17 ^🔗		s2e has joined #archiveteam-bs
13:27 ^🔗	s2e	Is there guidance on how to best submit dozens of websites to the internet archive in a way that is respectful of their infrastructure? I work in the internet freedom sector focusing on educational content, and many of the resources that get created disappear in months or a few years. I currently use a simple script to spider and submit new ones to the archive. I'd like to do this in a more automated fashion.
13:27 ^🔗	s2e	But, I want to make sure I am doing it as respectfully as possible.
13:29 ^🔗	Sanqui	to IA's infrastructure?
13:29 ^🔗		j08nY has quit IRC (Read error: Operation timed out)
13:29 ^🔗	Sanqui	I mean, respectful of IA's infrastructure?
13:29 ^🔗	Sanqui	you probably want archivebot
13:29 ^🔗	s2e	Yeah, if possible. I've seen other efforts try to archive separately, but they are largely unavailable to others
13:30 ^🔗	Sanqui	join #archivebot, check out how it works, submit a website with !a, watch the dashboard, it'll get absorbed into wayback
13:30 ^🔗	s2e	awesome
13:32 ^🔗	Frogging	joepie91: eli5?
13:33 ^🔗	s2e	Since archivebot is a volunteer service, is the method it uses the best way to do this without draining others' resources? Is it something I could run on my own to do the archiving and supply the WARC files in the same way?
13:33 ^🔗	Sanqui	Frogging: my understanding is - it is legal to scrape public personal information on websites for commercial purposes
13:34 ^🔗	Sanqui	s2e: you could provide a pipeline, but I'm not sure if we're accepting right now; or you can run something like grab-site yourself, but you'd have to find some avenue to get the warcs into wayback.
13:35 ^🔗	Sanqui	Frogging: not only is it legal, you cannot put measures in place against it
13:35 ^🔗	Frogging	I see.
13:35 ^🔗	Sanqui	IANAL
13:35 ^🔗	s2e	Sanqui: Thanks! I'll start with archivebot and bother IA about WARC inclusion.
13:35 ^🔗	Frogging	the applications they mentioned on the page don't instill confidence
13:36 ^🔗	Frogging	using "publicly available data and artificial intelligence to help companies identify potential customers"
13:36 ^🔗	Frogging	building "algorithms capable of predicting employee behaviors, such as when they might quit"
13:37 ^🔗	omglolbah	"If LinkedIn is going to allow profiles to be indexed by search engines to benefit their platform then why shouldn't the rest of the internet benefit from that as well?"
13:40 ^🔗		Mateon1 has quit IRC (Remote host closed the connection)
13:40 ^🔗		kimmer22 has joined #archiveteam-bs
13:41 ^🔗		Mateon1 has joined #archiveteam-bs
13:43 ^🔗		s2e has left WeeChat 1.6
13:47 ^🔗		kimmer2 has quit IRC (Ping timeout: 633 seconds)
14:15 ^🔗		j08nY has joined #archiveteam-bs
15:01 ^🔗		pizzaiolo has joined #archiveteam-bs
16:04 ^🔗		wabu has quit IRC (Read error: Operation timed out)
16:09 ^🔗		kimmer2 has joined #archiveteam-bs
16:13 ^🔗		username1 is now known as schbirid
16:14 ^🔗		wabu has joined #archiveteam-bs
16:17 ^🔗		kimmer22 has quit IRC (Ping timeout: 633 seconds)
17:07 ^🔗		pizzaiolo has quit IRC (pizzaiolo)
17:08 ^🔗	xmc	JAA, joepie91: i was talking with FalconK the other day, and he mentioned the idea of running a recursive resolver that archives all results, and having archivebot and the warrior use it as their default resolvers
17:08 ^🔗	xmc	i really like this idea
17:09 ^🔗	xmc	i'm not sure what the proper archival format for DNS would be
17:09 ^🔗	xmc	I suppose you could cram it into a warc
17:12 ^🔗	schbirid	i thought warc is http
17:12 ^🔗	schbirid	*think
17:12 ^🔗	PurpleSym	It is not limited to HTTP, there’s a generic “resource” record.
17:13 ^🔗	schbirid	oh nice
17:16 ^🔗	godane	this looks like a torrent of the IA 911 videos: http://torrentproject.se/2d64409b6f179bc999159284156b3534711447a1/
17:16 ^🔗	PurpleSym	Also, DNS perfectly fits into the request/response scheme WARC is using for HTTP.
17:19 ^🔗	JAA	That's a nice idea, apart from the fact that it introduces a single point of failure. If the resolver is down, everything crashes and burns.
17:22 ^🔗	xmc	yes, also that
17:33 ^🔗	joepie91	xmc: schbirid: heritrix stores DNS records in WARCs
17:33 ^🔗	joepie91	or well, DNS requests and responses
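To make the "cram it into a warc" idea concrete, here is a rough sketch of wrapping one DNS answer in the generic "resource" record PurpleSym mentions. It is an illustration under assumptions, not what Heritrix or any existing tool writes: the function name `dns_resource_record` is hypothetical, and the `dns:` target URI and `text/dns` content type are assumed conventions for labelling the payload.

```python
import uuid
from datetime import datetime, timezone


def dns_resource_record(hostname, answer_text, when=None):
    """Build one WARC/1.0 'resource' record holding a textual DNS answer.

    Hypothetical helper. `when` must be a timezone-aware datetime;
    it defaults to the current time and is written in UTC.
    """
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    block = answer_text.encode("utf-8")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: dns:{hostname}",       # assumed URI convention
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: text/dns",                 # assumed content type
        f"Content-Length: {len(block)}",
    ]
    # Header lines end with CRLF, a blank line separates them from the
    # block, and two CRLFs terminate the record.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + block + b"\r\n\r\n"
```

An archiving resolver along the lines xmc describes could append one such record per lookup to an open WARC file. JAA's single-point-of-failure concern is about the resolver itself and is not addressed by the record format.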
17:33 ^🔗	xmc	hmmmmm
17:36 ^🔗		kristian_ has joined #archiveteam-bs
18:14 ^🔗		kristian_ has quit IRC (Quit: Leaving)
18:23 ^🔗	godane	so my birthday is tomorrow
18:36 ^🔗	Aoede	happy birthday godane (I would forget to say this tomorrow :p)
18:53 ^🔗		fie_ has quit IRC (Ping timeout: 246 seconds)
19:11 ^🔗		fie has joined #archiveteam-bs
19:26 ^🔗	hook54321	godane: happy birthday
19:44 ^🔗		kimmer2 has quit IRC (Ping timeout: 633 seconds)
20:16 ^🔗		kimmer1 has quit IRC (Quit: Going offline, see ya! (www.adiirc.com))
20:56 ^🔗	hook54321	Anyone know if there's something like this for Firefox? https://github.com/kissarat/never-lose
21:08 ^🔗		bwn has quit IRC (Ping timeout: 268 seconds)
21:13 ^🔗		bwn has joined #archiveteam-bs
21:56 ^🔗		Honno has joined #archiveteam-bs
22:03 ^🔗	arkiver	it's 00:03 here now, happy birthday godane :D
22:16 ^🔗		DFJustin has quit IRC (Read error: Connection reset by peer)
22:17 ^🔗		DFJustin has joined #archiveteam-bs
22:18 ^🔗		dashcloud has quit IRC (Read error: Operation timed out)
22:18 ^🔗		dashcloud has joined #archiveteam-bs
22:23 ^🔗		pikhq has quit IRC (Read error: Operation timed out)
22:23 ^🔗	Frogging	that repo's list of porn sites seems to have a disproportionate amount of gay porn
22:24 ^🔗	Frogging	and random tumblrs. interesting. I wonder where they got it from
22:38 ^🔗		Igloo has quit IRC (Read error: Operation timed out)
22:38 ^🔗		j08nY has quit IRC (Read error: Operation timed out)
22:42 ^🔗		pikhq has joined #archiveteam-bs
22:43 ^🔗		godane has quit IRC (Ping timeout: 250 seconds)
22:43 ^🔗		Jonimus has quit IRC (Ping timeout: 268 seconds)
22:45 ^🔗		j08nY has joined #archiveteam-bs
22:47 ^🔗		godane has joined #archiveteam-bs
22:47 ^🔗		Igloo has joined #archiveteam-bs
22:56 ^🔗	*	hook54321 shrugs
23:08 ^🔗		qw3rty111 has joined #archiveteam-bs
23:10 ^🔗		Jonimus has joined #archiveteam-bs
23:10 ^🔗		swebb sets mode: +o Jonimus
23:11 ^🔗		qw3rty112 has joined #archiveteam-bs
23:11 ^🔗		qw3rty119 has quit IRC (Ping timeout: 600 seconds)
23:18 ^🔗		qw3rty111 has quit IRC (Read error: Operation timed out)
23:30 ^🔗		j08nY has quit IRC (Quit: Leaving)
