#archiveteam-bs 2017-08-15,Tue


Who What When
***SN4T14 has quit IRC (Quit: ZNC 1.6.3 - http://znc.in)
SN4T14 has joined #archiveteam-bs
[00:02]
nightpool has quit IRC (Read error: Operation timed out)
nightpool has joined #archiveteam-bs
[00:11]
.......... (idle for 48mn)
TheLovina has quit IRC (Read error: Operation timed out)
TheLovina has joined #archiveteam-bs
[01:01]
hook54321!a http://82.221.129.208/ --useragent firefox [01:16]
***username1 has joined #archiveteam-bs
schbirid2 has quit IRC (Read error: Operation timed out)
[01:28]
..... (idle for 20mn)
pizzaiolo has quit IRC (Quit: pizzaiolo) [01:51]
........... (idle for 50mn)
Odd0002 has quit IRC (Remote host closed the connection) [02:41]
.... (idle for 19mn)
hook54321So I just saw this: https://github.com/chfoo/wpull/issues/356
Would it be possible to incentivize sites to not disallow ia_archiver in their robots.txt file by respecting the delay specified in robots.txt?
[03:00]
SketchCowWe don't negotiate with terrorists [03:01]
hook54321lol. [03:01]
Frogging:p [03:02]
hook54321but like if we were going to do that as the issue suggests, i don't see why we would want to cooperate with people that disallow the wayback machine.
i think that it's stupid that some sites try to tell people to use a crawl delay of 10 seconds though
[03:06]
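For reference, a minimal sketch of what "respecting the delay specified in robots.txt" could look like, using Python's standard `urllib.robotparser`. The robots.txt content and the `wpull` user-agent string below are made up for illustration, and this is not how wpull or ArchiveBot actually behave.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for this example: it blocks
# ia_archiver entirely and asks everyone else for a 10 second delay.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def crawl_policy(robots_txt, user_agent, url):
    """Return (allowed, crawl_delay) for one user agent and URL.

    crawl_delay is the Crawl-delay value in seconds, or None if the
    matching robots.txt entry does not specify one.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


if __name__ == "__main__":
    print(crawl_policy(ROBOTS_TXT, "wpull", "http://example.com/page"))
    print(crawl_policy(ROBOTS_TXT, "ia_archiver", "http://example.com/page"))
```

A crawler that wanted to make the trade hook54321 describes could honor the returned delay only when `ia_archiver` is allowed for the same URL.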
..... (idle for 20mn)
Brendan Eich appears to be supporting this: https://github.com/EdOverflow/security-txt [03:27]
***qw3rty119 has joined #archiveteam-bs [03:32]
qw3rty118 has quit IRC (Read error: Operation timed out) [03:38]
Stilett0 has joined #archiveteam-bs [03:51]
.... (idle for 18mn)
hook54321Wiki is acting kinda funny [04:09]
.... (idle for 15mn)
JAA: Daily Stormer is moving to the Tor network [04:24]
***Sk1d has quit IRC (Ping timeout: 250 seconds) [04:26]
Sk1d has joined #archiveteam-bs [04:33]
hook54321Apparently Google froze their domain, so they can't move it now. [04:37]
***robink has quit IRC (Read error: Connection reset by peer) [04:46]
robink has joined #archiveteam-bs
dashcloud has quit IRC (Read error: Operation timed out)
[04:51]
dashcloud has joined #archiveteam-bs
kimmer22 has joined #archiveteam-bs
[05:01]
kimmer2 has quit IRC (Ping timeout: 633 seconds) [05:14]
Stilett0 is now known as Stiletto [05:20]
kimmer2 has joined #archiveteam-bs [05:25]
kimmer22 has quit IRC (Ping timeout: 633 seconds) [05:33]
..... (idle for 20mn)
zinohook54321: Something that might be more fruitful is checking what the support for HTTP error 429 is in wpull. I've seen logs where we get a lot of 429s followed by a 200 followed by a lot of 429s again. RFC6585. Either:
1) wpull does not handle the Retry-After header
2) The site is still not prepared to answer requests after timeout
3) The site does not send a Retry-After header
If it's 2 or 3, then there's not much we can do; if it's 1, we would probably save all sides trouble by implementing it, and minimize the chances of getting IP-banned. Then add a pipeline override in case there is reason to ignore requests from the server to back off.
[05:53]
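A rough sketch of the Retry-After handling zino describes, for illustration only. It is not wpull code; the function names and the `base`/`cap` defaults are assumptions. The header may carry either a number of seconds or an HTTP date, and case 3 (no header at all) falls back to capped exponential backoff.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Parse a Retry-After value into a non-negative delay in seconds.

    Returns None when the header is missing or cannot be parsed.
    """
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        # Delay-seconds form, e.g. "120".
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def backoff_delay(status: int, retry_after: Optional[str], attempt: int,
                  base: float = 1.0, cap: float = 300.0) -> Optional[float]:
    """Return how long to wait before retrying, or None if no backoff applies."""
    if status != 429:
        return None
    hinted = retry_after_seconds(retry_after)
    if hinted is not None:
        # The server told us how long to wait (case 1 in the list above).
        return min(hinted, cap)
    # No usable header (case 3): fall back to capped exponential backoff.
    return min(base * 2 ** attempt, cap)
```

The "pipeline override" zino mentions would amount to skipping `backoff_delay` when an operator decides the server's hint should be ignored.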
***HCross has quit IRC (Read error: Connection reset by peer)
HCross has joined #archiveteam-bs
robogoat has quit IRC (Read error: Operation timed out)
robogoat has joined #archiveteam-bs
[05:53]
..... (idle for 23mn)
kimmer22 has joined #archiveteam-bs
godane has quit IRC (Quit: Leaving.)
[06:19]
kimmer2 has quit IRC (Ping timeout: 633 seconds) [06:26]
..... (idle for 23mn)
j08nY has joined #archiveteam-bs [06:49]
dashcloud has quit IRC (Read error: Operation timed out) [06:59]
.... (idle for 15mn)
BlueMaxim has joined #archiveteam-bs
kimmer2 has joined #archiveteam-bs
TheLovina has quit IRC (Ping timeout: 370 seconds)
TheLovina has joined #archiveteam-bs
[07:14]
kimmer22 has quit IRC (Ping timeout: 633 seconds)
dashcloud has joined #archiveteam-bs
[07:20]
Boppen has quit IRC (Ping timeout: 194 seconds) [07:28]
BlueMaxim has quit IRC (Read error: Operation timed out)
BlueMaxim has joined #archiveteam-bs
[07:41]
Honno has joined #archiveteam-bs
HCross has quit IRC (Remote host closed the connection)
HCross has joined #archiveteam-bs
[07:48]
......... (idle for 43mn)
j08nY has quit IRC (Read error: Operation timed out)
kimmer22 has joined #archiveteam-bs
kimmer2 has quit IRC (Ping timeout: 633 seconds)
kimmer2 has joined #archiveteam-bs
Boppen has joined #archiveteam-bs
[08:32]
kimmer22 has quit IRC (Ping timeout: 633 seconds)
kimmer22 has joined #archiveteam-bs
[08:45]
kimmer2 has quit IRC (Ping timeout: 632 seconds) [08:50]
hook54321JAA: Onion address for Daily Stormer: http://dstormer6em3i4km.onion/ [08:51]
***BlueMaxim has quit IRC (Quit: Leaving) [08:51]
....... (idle for 34mn)
kimmer2 has joined #archiveteam-bs [09:25]
kimmer22 has quit IRC (Ping timeout: 633 seconds)
kimmer1 has joined #archiveteam-bs
godane has joined #archiveteam-bs
[09:30]
godanelooks like IA is down again [09:37]
hook54321yup
nothing on their twitter yet.
[09:48]
....... (idle for 30mn)
***Honno has quit IRC (Read error: Operation timed out) [10:19]
Mateon1 has quit IRC (Ping timeout: 268 seconds)
Mateon1 has joined #archiveteam-bs
[10:30]
.... (idle for 18mn)
j08nY has joined #archiveteam-bs [10:48]
ivan has quit IRC (Leaving) [10:56]
..... (idle for 22mn)
marvinw has joined #archiveteam-bs [11:18]
JAAVery interesting court decision: https://www.reuters.com/article/us-microsoft-linkedin-ruling-idUSKCN1AU2BV [11:21]
..... (idle for 23mn)
***atluxity1 has joined #archiveteam-bs
atluxity has quit IRC (Ping timeout: 506 seconds)
[11:44]
JAAWe should start archiving whois information.
And DNS records
[11:50]
........... (idle for 53mn)
joepie91holy shit
that is actually a Very Big Deal
[12:43]
....... (idle for 34mn)
***s2e has joined #archiveteam-bs [13:17]
s2eIs there guidance on how to best submit dozens of websites to the Internet Archive in a way that is respectful of their infrastructure? I work in the internet freedom sector focusing on educational content and many of the resources that get created disappear in months or a few years. I currently use a simple script to spider and submit new ones to the archive. I'd like to do this in a more automated fashion.
But, I want to make sure I am doing it as respectfully as possible.
[13:27]
Sanquito IA's infrastructure? [13:29]
***j08nY has quit IRC (Read error: Operation timed out) [13:29]
SanquiI mean, respectful of IA's infrastructure?
you probably want archivebot
[13:29]
s2eYeah, if possible. I've seen other efforts try to archive separately, but they are largely unavailable to others [13:29]
Sanquijoin #archivebot, check out how it works, submit a website with !a, watch the dashboard, it'll get absorbed into wayback [13:30]
s2eawesome [13:30]
Froggingjoepie91: eli5? [13:32]
s2eSince archivebot is a volunteer service, is the method it uses the best method for doing this without a drain on others' resources? Is it something I could run on my own to do the archiving and supply the WARC files in the same way? [13:33]
SanquiFrogging: my understanding is - it is legal to scrape public personal information on websites for commercial purposes
s2e: you could provide a pipeline, but I'm not sure if we're accepting right now; or you can run something like grab-site yourself, but you'd have to find some avenue to get the warcs into wayback.
Frogging: not only is it legal, you cannot put measures in place against it
[13:33]
FroggingI see. [13:35]
SanquiIANAL [13:35]
s2eSanqui: Thanks! I'll start with archivebot and bother IA about WARC inclusion. [13:35]
Froggingthe applications they mentioned on the page don't instill confidence
using "publicly available data and artificial intelligence to help companies identify potential customers"
building "algorithms capable of predicting employee behaviors, such as when they might quit"
[13:35]
omglolbah"If LinkedIn is going to allow profiles to be indexed by search engines to benefit their platform then why shouldn't the rest of the internet benefit from that as well?" [13:37]
***Mateon1 has quit IRC (Remote host closed the connection)
kimmer22 has joined #archiveteam-bs
Mateon1 has joined #archiveteam-bs
s2e has left WeeChat 1.6
kimmer2 has quit IRC (Ping timeout: 633 seconds)
[13:40]
...... (idle for 28mn)
j08nY has joined #archiveteam-bs [14:15]
.......... (idle for 46mn)
pizzaiolo has joined #archiveteam-bs [15:01]
............. (idle for 1h3mn)
wabu has quit IRC (Read error: Operation timed out) [16:04]
kimmer2 has joined #archiveteam-bs
username1 is now known as schbirid
wabu has joined #archiveteam-bs
kimmer22 has quit IRC (Ping timeout: 633 seconds)
[16:09]
........... (idle for 50mn)
pizzaiolo has quit IRC (pizzaiolo) [17:07]
xmcJAA, joepie91: i was talking with FalconK the other day, and he mentioned the idea of running a recursive resolver that archives all results, and having archivebot and the warrior use it as their default resolvers
i really like this idea
i'm not sure what the proper archival format for DNS would be
I suppose you could cram it into a warc
[17:08]
schbiridi thought warc is http
*think
[17:12]
PurpleSymIt is not limited to HTTP, there’s a generic “resource” record. [17:12]
schbiridoh nice [17:13]
godanethis looks like a torrent of the IA 911 videos: http://torrentproject.se/2d64409b6f179bc999159284156b3534711447a1/ [17:16]
PurpleSymAlso, DNS perfectly fits into the request/response scheme WARC is using for HTTP. [17:16]
JAAThat's a nice idea, apart from the fact that it introduces a single point of failure. If the resolver is down, *everything* crashes and burns. [17:19]
xmcyes, also that [17:22]
joepie91xmc: schbirid: heritrix stores DNS records in WARCs
or well, DNS requests and responses
[17:33]
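To illustrate the point about DNS fitting into WARC, here is a hand-rolled sketch of a single WARC/1.0 `response` record holding a DNS answer. It assumes the `dns:` target URI and `text/dns` content type that, as far as I understand, Heritrix uses; the function name and the zone-file style body are illustrative choices, not a verified copy of Heritrix output.

```python
import uuid
from datetime import datetime, timezone


def dns_warc_record(hostname, zone_lines, when):
    """Serialize one DNS answer as a WARC/1.0 'response' record (bytes).

    hostname   -- the name that was looked up, e.g. "example.com"
    zone_lines -- answer lines in zone-file style, e.g.
                  ["example.com. 300 IN A 192.0.2.1"]
    when       -- timezone-aware datetime of the lookup
    """
    body = ("\r\n".join(zone_lines) + "\r\n").encode("utf-8")
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: dns:{hostname}",
        f"WARC-Date: {stamp}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: text/dns",
        f"Content-Length: {len(body)}",
    ]
    head = ("\r\n".join(headers) + "\r\n\r\n").encode("utf-8")
    # A WARC record ends with two CRLFs after the content block.
    return head + body + b"\r\n\r\n"


if __name__ == "__main__":
    record = dns_warc_record(
        "example.com",
        ["example.com. 300 IN A 192.0.2.1"],  # documentation-range IP
        datetime(2017, 8, 15, 17, 8, tzinfo=timezone.utc),
    )
    print(record.decode("utf-8"))
```

An archiving resolver like the one xmc mentions would append one such record per lookup to a WARC file; a real implementation would more likely use an existing WARC library than build the bytes by hand.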
xmchmmmmm [17:33]
***kristian_ has joined #archiveteam-bs [17:36]
........ (idle for 38mn)
kristian_ has quit IRC (Quit: Leaving) [18:14]
godaneso my birthday is tomorrow [18:23]
Aoedehappy birthday godane (I would forget to say this tomorrow :p) [18:36]
.... (idle for 17mn)
***fie_ has quit IRC (Ping timeout: 246 seconds) [18:53]
.... (idle for 18mn)
fie has joined #archiveteam-bs [19:11]
.... (idle for 15mn)
hook54321godane: happy birthday [19:26]
.... (idle for 18mn)
***kimmer2 has quit IRC (Ping timeout: 633 seconds) [19:44]
....... (idle for 32mn)
kimmer1 has quit IRC (Quit: Going offline, see ya! (www.adiirc.com)) [20:16]
......... (idle for 40mn)
hook54321Anyone know if there's something like this for Firefox? https://github.com/kissarat/never-lose [20:56]
***bwn has quit IRC (Ping timeout: 268 seconds) [21:08]
bwn has joined #archiveteam-bs [21:13]
......... (idle for 43mn)
Honno has joined #archiveteam-bs [21:56]
arkiverit's 00:03 here now, happy birthday godane :D [22:03]
***DFJustin has quit IRC (Read error: Connection reset by peer)
DFJustin has joined #archiveteam-bs
dashcloud has quit IRC (Read error: Operation timed out)
dashcloud has joined #archiveteam-bs
[22:16]
pikhq has quit IRC (Read error: Operation timed out) [22:23]
Froggingthat repo's list of porn sites seems to have a disproportionate amount of gay porn
and random tumblrs. interesting. I wonder where they got it from
[22:23]
***Igloo has quit IRC (Read error: Operation timed out)
j08nY has quit IRC (Read error: Operation timed out)
pikhq has joined #archiveteam-bs
godane has quit IRC (Ping timeout: 250 seconds)
Jonimus has quit IRC (Ping timeout: 268 seconds)
j08nY has joined #archiveteam-bs
godane has joined #archiveteam-bs
Igloo has joined #archiveteam-bs
[22:38]
hook54321hook54321 shrugs [22:56]
***qw3rty111 has joined #archiveteam-bs
Jonimus has joined #archiveteam-bs
swebb sets mode: +o Jonimus
qw3rty112 has joined #archiveteam-bs
qw3rty119 has quit IRC (Ping timeout: 600 seconds)
[23:08]
qw3rty111 has quit IRC (Read error: Operation timed out) [23:18]
j08nY has quit IRC (Quit: Leaving) [23:30]
