[00:19] *** BlueMaxim has joined #archiveteam [00:26] *** mismatch has quit IRC (Ping timeout: 501 seconds) [00:57] *** metalcamp has quit IRC (Ping timeout: 501 seconds) [01:03] *** MMovie1 has quit IRC (Read error: Operation timed out) [01:15] *** nightpool has quit IRC (Read error: Operation timed out) [01:29] *** MMovie has joined #archiveteam [01:35] *** Start_ has joined #archiveteam [01:35] *** Start has quit IRC (Read error: Connection reset by peer) [01:37] *** ItsYoda has quit IRC (Ping timeout: 260 seconds) [01:41] *** ItsYoda has joined #archiveteam [01:43] *** william34 has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client) [01:48] *** philpem has quit IRC (Ping timeout: 260 seconds) [01:50] *** trs80 has joined #archiveteam [02:07] *** nightpool has joined #archiveteam [02:21] *** ndiddy has quit IRC (Leaving) [02:22] *** ndiddy has joined #archiveteam [03:18] *** Atom-- has quit IRC (Ping timeout: 190 seconds) [03:26] *** pguth_ has quit IRC (Remote host closed the connection) [03:26] *** pguth_ has joined #archiveteam [04:05] *** ndiddy has quit IRC (Read error: Connection reset by peer) [04:40] *** Sk1d has quit IRC (Ping timeout: 250 seconds) [04:43] *** pguth_ has quit IRC (Remote host closed the connection) [04:43] *** pguth_ has joined #archiveteam [04:44] *** DoomTay has quit IRC (Quit: Page closed) [04:44] *** RichardG has joined #archiveteam [04:47] *** Sk1d has joined #archiveteam [04:54] *** tomwsmf has quit IRC (Read error: Operation timed out) [04:59] *** Emcy has quit IRC (Read error: Operation timed out) [05:08] *** DoomTay has joined #archiveteam [05:12] *** RichardG has quit IRC (Ping timeout: 633 seconds) [05:33] *** RichardG has joined #archiveteam [05:54] *** VADemon has joined #archiveteam [06:07] *** RichardG has quit IRC (Ping timeout: 633 seconds) [06:26] *** pguth_ has quit IRC (Remote host closed the connection) [06:26] *** pguth_ has joined #archiveteam [06:29] *** JesseW has quit IRC (Ping timeout: 370 seconds) 
[06:33] *** DoomTay has quit IRC (Quit: Page closed) [06:46] *** pguth_ has quit IRC (Remote host closed the connection) [06:46] *** pguth_ has joined #archiveteam [07:24] *** nightpool has quit IRC (Read error: Operation timed out) [07:52] *** TC02 has quit IRC (Ping timeout: 1208 seconds) [08:04] *** Honno has joined #archiveteam [08:04] *** TC02 has joined #archiveteam [08:10] *** TC02 has quit IRC (Ping timeout: 246 seconds) [08:12] *** VADemon has quit IRC (Quit: left4dead) [08:15] *** TC02 has joined #archiveteam [08:17] *** VADemon has joined #archiveteam [08:22] *** RichardG has joined #archiveteam [08:25] *** Honno has quit IRC (Ping timeout: 1208 seconds) [08:26] *** RichardG has quit IRC (Read error: Connection reset by peer) [08:26] *** RichardG has joined #archiveteam [08:31] *** RichardG has quit IRC (Ping timeout: 244 seconds) [08:45] *** RichardG has joined #archiveteam [08:50] *** RichardG_ has joined #archiveteam [08:54] *** RichardG has quit IRC (Ping timeout: 370 seconds) [09:05] *** RichardG has joined #archiveteam [09:08] *** RichardG_ has quit IRC (Ping timeout: 370 seconds) [09:13] *** RichardG has quit IRC (Ping timeout: 370 seconds) [09:17] *** RichardG has joined #archiveteam [09:18] hm, you're hitting my site, randomwaffle, archiving the whole thing, tried to contact someone to rsync the files instead, but no response so far... [09:21] Scuttle: Would you prefer rsync instead? [09:23] it would be much faster, my logs are really filling up... :) [09:23] I can provide you with a rsync target and handle syncing to IA, but I don’t know how to turn off the archivebot job. [09:24] I remember you saying it was ~300G? [09:24] PurpleSym, !abort is a wonderful thing [09:24] PurpleSym: something like that [09:24] but PurpleSym - you'll need to WARC it all [09:25] Can I !abort someone else’s job? [09:25] *** RichardG_ has joined #archiveteam [09:26] yipdw is probably the person to ask. 
Try it after you get the files though [09:27] *** RichardG has quit IRC (Ping timeout: 370 seconds) [09:27] maybe even closer to 400GB depending on your clustersize... [09:28] That’s fine, 1.9T avail. [09:28] cool, cool [09:28] you want the source for the randomwaffle site too? [09:29] We’ll take everything you have. [09:29] k [09:30] *** RichardG_ has quit IRC (Ping timeout: 250 seconds) [09:47] *** RichardG has joined #archiveteam [09:51] *** RichardG has quit IRC (Ping timeout: 250 seconds) [09:54] *** philpem has joined #archiveteam [10:01] *** RichardG has joined #archiveteam [10:05] *** RichardG has quit IRC (Ping timeout: 258 seconds) [10:06] Scuttle: Is the (random example) URL http://img.waffleimages.com/d2500ed1caea000c2542b72cf1820b0af94aeb5d/nuclear_fireball_black_metal_dinosaur_holocaust.jpg located in /images/d2/d2500ed1caea000c2542b72cf1820b0af94aeb5d.jpg ? [10:21] *** Honno has joined #archiveteam [10:26] *** Emcy has joined #archiveteam [10:28] *** Sanqui is now known as sanquiAFK [10:37] hm [10:38] ah, yes [10:38] that's right [10:38] there is a database that comes with it [10:38] been years since I looked at this [10:38] Ah, nice. So we can map the hashes back to filenames (last part of the URL)? [10:38] yes [10:41] oh...no [10:41] actually [10:41] I only have the file hashes [10:42] the site only cares about the hash-part of the URL [10:42] or used to at least [10:42] If we don't grab the site as a WARC file it will not go into the wayback machine [10:42] http://img.waffleimages.com/d2500ed1caea000c2542b72cf1820b0af94aeb5d/mr_poopy_butthole.jpg and http://img.waffleimages.com/d2500ed1caea000c2542b72cf1820b0af94aeb5d/nuclear_fireball_black_metal_dinosaur_holocaust.jpg will give the same image [10:42] but will probably end up as some tarred file somewhere, which people will have to find first and then have to find their image in [10:42] Bummer. We could’ve reconstructed the original URLs and injected a WARC into Wayback. [10:42] ^Scuttle [10:43] hm? 
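The URL-to-path layout confirmed above (the hash segment of the URL, stored under a directory named after its first two characters) can be sketched in Python. This is a minimal illustration, not the site's actual code: the function name `image_path_from_url` and the `/images` default root are made up for the example, and taking the file extension from the URL is an assumption, since the log only says files were stored "as .jpg or png or whatever". The reverse direction (hash back to original filename) is not possible, as noted above.

```python
from urllib.parse import urlparse
import posixpath


def image_path_from_url(url, root="/images"):
    """Map an img.waffleimages.com URL to the on-disk path discussed in the log.

    Hypothetical helper: the name and the `root` default are illustrative.
    """
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 2:
        raise ValueError("expected a URL of the form /<hash>/<filename>")
    file_hash, name = parts
    # Assumption: the stored file keeps the extension seen in the URL.
    ext = posixpath.splitext(name)[1]
    # The filename part is ignored; only the hash decides the path.
    return f"{root}/{file_hash[:2]}/{file_hash}{ext}"
```

Two URLs that share a hash but differ in the trailing filename map to the same path, which matches the behaviour described in the conversation.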
[10:43] PurpleSym: We do not have the original headers [10:43] it is possible to create a WARC, but it's kind of falsifying information in the wayback machine [10:43] We can just reconstruct them. [10:44] So it’s a policy decision. [10:44] the original waffleimages never cared about the original filenames [10:44] Not a technical question. [10:44] only stored the files as .jpg [10:44] or png or whatever [10:44] It's not a technical question, creating WARCs if we want to do that is not a problem [10:45] err [10:45] *** Honno has quit IRC (Ping timeout: 1208 seconds) [10:45] Is the job already aborted? [10:45] Scuttle: The wayback machine doesn’t know that, unfortunately. [10:45] hm [10:45] arkiver: Yes. [10:46] ok [10:50] did I mess stuff up now? :) [10:55] Nah, we’re good unless arkiver disagrees ;) [10:57] * arkiver is afk for a bit [11:01] *** r3c0d3x has quit IRC (Ping timeout: 260 seconds) [11:03] *** pguth_ has quit IRC (Remote host closed the connection) [11:03] surely if we have the source, then we can re-create the entire site and warc it locally on someone's machine? [11:04] *** pguth_ has joined #archiveteam [11:04] maybe requiring a little bit of hosts file magic of course. 
[11:05] But I don't know, maybe I'm missing something [11:07] *** r3c0d3x has joined #archiveteam [11:11] *** pguth_ has quit IRC (Remote host closed the connection) [11:11] *** pguth_ has joined #archiveteam [11:12] *** BartoCH has quit IRC (Ping timeout: 260 seconds) [11:12] *** BartoCH has joined #archiveteam [11:28] *** GLaDOS has quit IRC (Ping timeout: 260 seconds) [11:31] *** RichardG has joined #archiveteam [11:41] *** BartoCH has quit IRC (Ping timeout: 260 seconds) [11:59] *** BartoCH has joined #archiveteam [12:20] *** pguth_ has quit IRC (Remote host closed the connection) [12:20] *** pguth_ has joined #archiveteam [12:24] *** Coderjoe has quit IRC (Read error: Operation timed out) [12:27] *** Coderjoe has joined #archiveteam [12:34] *** Silvan has joined #archiveteam [12:34] *** SilSte has quit IRC (Read error: Connection reset by peer) [12:34] *** BlueMaxim has quit IRC (Quit: Leaving) [12:44] *** metalcamp has joined #archiveteam [12:54] *** Emcy_ has joined #archiveteam [13:01] *** Emcy has quit IRC (Read error: Operation timed out) [13:02] *** Emcy has joined #archiveteam [13:02] *** Emcy_ has quit IRC (Read error: Operation timed out) [13:02] *** BartoCH has quit IRC (Ping timeout: 260 seconds) [13:02] *** BartoCH has joined #archiveteam [13:28] *** Emcy_ has joined #archiveteam [13:29] *** BartoCH has quit IRC (Ping timeout: 260 seconds) [13:35] *** Emcy has quit IRC (Read error: Operation timed out) [13:36] *** BartoCH has joined #archiveteam [13:38] *** WinterFox has quit IRC (Read error: Operation timed out) [14:03] *** Emcy_ has quit IRC (Leaving) [14:07] *** BartoCH has quit IRC (Ping timeout: 260 seconds) [14:18] *** BartoCH has joined #archiveteam [14:55] *** nightpool has joined #archiveteam [14:56] Bioware Forums closing down. [14:56] ....scrolling up, I see it's being addressed. 
Thanks [14:59] *** nightpool has quit IRC (Ping timeout: 260 seconds) [15:19] *** pguth_ has quit IRC (Remote host closed the connection) [15:19] *** pguth_ has joined #archiveteam [15:39] *** DoomTay has joined #archiveteam [15:54] *** Coderjoe has quit IRC (Read error: Operation timed out) [15:56] *** Coderjoe has joined #archiveteam [15:59] *** Honno has joined #archiveteam [15:59] *** JesseW has joined #archiveteam [16:02] *** BartoCH has quit IRC (Ping timeout: 260 seconds) [16:12] *** DigDug_ is now known as DigDug [16:17] *** nightpool has joined #archiveteam [16:19] *** BartoCH has joined #archiveteam [16:25] *** DoomTay has quit IRC (Quit: Page closed) [16:29] *** DoomTay has joined #archiveteam [16:29] *** Honno has quit IRC (Ping timeout: 1208 seconds) [16:36] *** DoomTay has quit IRC (Quit: Page closed) [16:59] *** ndiddy has joined #archiveteam [17:04] *** pguth_ has quit IRC (Remote host closed the connection) [17:04] *** pguth_ has joined #archiveteam [17:06] *** pguth_ has quit IRC (Remote host closed the connection) [17:06] *** pguth_ has joined #archiveteam [17:07] *** Start_ is now known as Start [17:41] *** useretail has quit IRC (Remote host closed the connection) [17:53] *** hive-mind has quit IRC (Ping timeout: 260 seconds) [17:59] *** Nemo_bis has quit IRC (Ping timeout: 244 seconds) [18:01] *** BartoCH has quit IRC (Ping timeout: 260 seconds) [18:01] *** hive-mind has joined #archiveteam [18:08] *** BartoCH has joined #archiveteam [18:10] *** Nemo_bis has joined #archiveteam [18:20] *** Nemo_bis has quit IRC (Ping timeout: 244 seconds) [18:21] do we have http://ftp.sunet.se/mirror/archive/ftp.sunet.se/ archived? [18:25] *** Nemo_bis has joined #archiveteam [18:32] *** pguth_ has quit IRC (Remote host closed the connection) [18:32] *** pguth_ has joined #archiveteam [18:39] *** nightpool has quit IRC (Read error: Operation timed out) [18:41] *** cole has joined #archiveteam [18:41] What would be the best way to make my own copy of KAT? 
[18:42] KAT? [18:42] Kickass Torrents [18:42] KickAssTorrents [18:43] I don't think we can particularly help you with that. [18:43] It's a site that was taken down and there are a few mirrors up, but I would like to have my own in case those get taken down [18:44] You might try using wget or grab-site or similar tools [18:47] I just used wget -m, let's see how this goes lol [18:48] Surely we have done ftp.sunet.se several times over at this point. [18:49] looks like the mirror of their mirror was done april 30 [18:49] not sure if by archivebot or not [18:49] apr 30 2016 [18:49] https://web.archive.org/web/20160430093210/http://ftp.sunet.se/mirror/archive/ftp.sunet.se/ [18:50] so i'm not sure how complete our mirror of their mirror of their older site is [18:50] lemme check #archivebot logs for apr 30... [18:51] SketchCow: also the 2012 mirror of the hp ftp softpaq stuff is uploaded to fos, thank god i used rsync with checksums and resume, connection failed about 5 times but md5 on both ends matches [18:51] the #effteepee project is a lot better for this [18:51] also we have a viewer of archivebot jobs, it's faster than IRC logs [18:52] it's in /0/cdroms/ or something [18:52] might not be the correct directory [18:52] *** cole has quit IRC (Remote host closed the connection) [18:52] yipdw: how can i view past archivebot jobs? [18:53] http://archive.fart.website/archivebot/viewer/ [18:53] but https://archive.org/details/archiveteam_ftp?&sort=-downloads&and[]=ftp.sunet.se is faster [18:53] IA search is pretty cool [18:53] *** useretail has joined #archiveteam [18:58] does it not store the parameters used for the archivebot invocation somewhere? 
[19:01] the wpull command line is in each WARC as the first record [19:02] *** nightpool has joined #archiveteam [19:17] *** schbirid has joined #archiveteam [19:17] just in case this was not posted before http://blog.bioware.com/2016/07/29/concerning-our-forums/ [19:19] * JesseW just did Wayback Save page on that, and was un-pleased to find no-one had done so before me... [19:20] If one of the people here who likes making bots wanted to make a bot that automatically did Wayback Save on any URL mentioned here, I for one think that would be a great idea. [19:24] Bad idea [19:24] You're literally making an android nerd that jumps the gun before we can discuss it [19:24] Robo-DoomTay [19:25] Even for single pages, with the robots.txt rules that Wayback Save has? [19:25] How does the robot do that. [19:25] We don't even know why [19:25] No. [19:25] Archivebot does it fine. [19:25] OK. [19:27] What about a robot that, when it saw a link (like http://example.net ), would reply with a URL of the form: https://web.archive.org/web/*/http://example.net -- thereby making it easier for people here to check if the Wayback Machine already has a copy, but doesn't do anything automatically? [19:27] too much clutter [19:28] fair point [19:28] ^ [19:28] we already have purplebot doing its thing [19:30] JesseW: You are trying to fix the wrong thing [19:30] *** Coderjoe has quit IRC (Read error: Operation timed out) [19:30] Archive.org has an availability API we're going to wire archivebot into [19:32] In what way? [19:32] nice. [19:32] sorry, going AFK [19:33] yeah, in what way? I'm curious [19:37] OH NOW YOU'RE ALL INTERESTED [19:37] For some time, the Archive Wayback team's been working on a "do we have it" API [19:37] Specifically because of Archive Team but other things too. [19:37] To prevent things like having 21,000 grabs of the same page for 3 years [19:37] We think a mass of disk space can be saved in the future. [19:37] Experimentally, it's already working internally. 
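The lookup-URL form proposed at [19:27] (prefixing a link with `https://web.archive.org/web/*/`) can be sketched as below. This is only an illustration of that URL form; the channel decided against such a bot, and the function name `wayback_listing_urls`, the URL regex, and the trailing-punctuation handling are assumptions made for the example. It does not call the availability API mentioned afterwards.

```python
import re

# Deliberately loose pattern (an assumption): anything from http(s):// up to whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def wayback_listing_urls(message):
    """Return a Wayback capture-listing URL for every link found in a chat message."""
    links = []
    for match in URL_PATTERN.findall(message):
        # Drop punctuation that usually belongs to the sentence, not the link.
        url = match.rstrip(".,;:!?)")
        links.append("https://web.archive.org/web/*/" + url)
    return links
```

Opening one of the returned URLs in a browser shows the Wayback Machine's list of captures for that address; nothing is saved or fetched by the sketch itself.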
[19:38] LOL, yeah I can see why that would be good [19:38] I wonder how many copies of Google fonts we've grabbed [19:38] sounds good [19:38] Archive Team has an excellent policy: "Huh, this thing is fucking up. Let's grab ALL of it" [19:38] And fact is, sometimes it turns out we just did a 34gb grab of material that hasn't changed in 10 years and IA had 20 copies already [19:39] When it should be a pointer saying "unchanged" [19:39] * Frogging nods [19:39] Now, there's A THOUSAND AND ONE WAYS THIS CAN GO WRONG [19:39] (I don't want to hear them right now) [19:39] *** JesseW has quit IRC (Ping timeout: 370 seconds) [19:39] But the goal is that, say, Archivebot dumps in and the parts that are actually new go in. [19:39] any plan for retroactive deduping? [19:40] They will likely do that, yes. [19:40] The estimate is it will get back over a petabyte [19:40] It might be we dump in as usual and then it gets worked over internally. [19:41] *** BartoCH has quit IRC (Ping timeout: 260 seconds) [19:42] *** tomwsmf has joined #archiveteam [19:48] *** Coderjoe has joined #archiveteam [19:49] *** nightpool has quit IRC (Read error: Operation timed out) [19:53] Anyway, the "we might be doubling or tripling data" issue is being addressed. [19:53] And as for "it was mentioned in #archiveteam but nobody brought it to #archivebot", I feel that the problem is so not a problem that the problem is it happens multiple times. [19:56] *** BartoCH has joined #archiveteam [20:11] *** anjacks0n has joined #archiveteam [20:15] *** anjacks0n has quit IRC (Ping timeout: 190 seconds) [20:29] that's actually a really good idea. [20:32] *** nightpool has joined #archiveteam [20:48] *** metalcamp has quit IRC (Ping timeout: 244 seconds) [20:55] *** robink has quit IRC (Ping timeout: 260 seconds) [20:57] Considering it was cooked up by the best minds in web archiving, I'd hope so [21:11] SketchCow: what about files where the contents are identical, but the metadata (filename, date) differ? 
will it properly dedup those as well? [21:11] this is important due to systems not storing file time/date in utc [21:20] *** Asparagir has joined #archiveteam [21:28] *** nertzy has joined #archiveteam [21:28] *** robink has joined #archiveteam [21:39] *** VADemon has quit IRC (Quit: left4dead) [21:40] *** JesseW has joined #archiveteam [21:40] *** schbirid has quit IRC (Quit: Leaving) [21:43] (read the log) [21:43] Delighted to hear more about the dedup'ing effort. [21:44] I'm still confused by SketchCow's statement: "I feel that the problem is so not a problem that the problem is it happens multiple times." [21:45] I *think* maybe that means: "I feel the real problem is that people repeatedly bring up the other 'problem' as if it was a real problem." But I'm not sure. [22:00] *** pguth_ has quit IRC (Remote host closed the connection) [22:00] *** pguth_ has joined #archiveteam [22:26] *** Asparag-1 has joined #archiveteam [22:27] *** Asparag-1 has left [22:27] It means I don't think the problem is URLs are mentioned in #archiveteam that don't end up in #archivebot, but that URLs mentioned in #archiveteam are multiply submitted in #archivebot [22:28] I agree there are API issues deeply worth considering before we turn away acquired data because "old" data is there [22:29] Ah, ok. [22:30] I presumed that http://web.archive.org/save *does* do deduplication already -- do you happen to know if that is true? [22:31] (Since the WARCs from that are not published, I can't check myself) [22:36] *** Coderjoe has quit IRC (Read error: Operation timed out) [22:47] Somewhat. 
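The question at [21:11], whether files with identical contents but different filename or date would still be recognised as duplicates, comes down to keying on the content alone. A minimal sketch of that idea follows; it is not a description of how the Internet Archive's deduplication works (the log does not say), and the function name `group_by_content` and the `(name, mtime, data)` tuple layout are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def group_by_content(files):
    """Group (name, mtime, data) entries by the SHA-256 digest of their bytes.

    Filename and timestamp are carried along but never hashed, so two entries
    with the same contents land in the same group whatever their metadata.
    """
    groups = defaultdict(list)
    for name, mtime, data in files:
        digest = hashlib.sha256(data).hexdigest()
        groups[digest].append((name, mtime))
    return dict(groups)
```

Any group holding more than one entry is a set of duplicates by content; timezone differences in stored timestamps do not matter here because the timestamp is not part of the key.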
[22:54] *** Coderjoe has joined #archiveteam [23:08] *** BlueMaxim has joined #archiveteam [23:08] *** REiN^ has quit IRC (Ping timeout: 260 seconds) [23:15] If a few hours from now you see the ArchiveBot sitting in a corner, mumbling and sweaty and rocking back and forth, it's because I just fed it about 60-80 alt-right/neo-Nazi/affiliated tweetstreams and sites to digest, and the poor thing is probably feeling quite ill. [23:17] Thanks, Asparagir. They will likely prove a valuable resource to historians trying to figure out how TF Donald Trump was nominated for president [23:17] It's going to take at least 24 hours to work through the whole list. Additions welcome, of course. [23:17] Yep, that was my thinking too. [23:19] This ghoulish moment in history needs to be properly documented, so future librarians can laboriously catalog all the shittalking and dank pepe memes and parentheses around names and MAGA and fashy haircuts and podcasts and waifu profile pics and so on. [23:20] hah :| [23:25] Still one of the most spot-on videos from this election cycle: https://www.youtube.com/watch?v=TxKzQGHvpv4 (except it turns out they *did* matter, which is scary) [23:27] Rick Wilson is amazing to follow on Twitter too --> https://twitter.com/TheRickWilson [23:28] His tweets are like that video, but every day. #NeverTrump [23:28] I believe later in that same video or perhaps a similar timeframe, he called Trump a "douche canoe" on air. [23:29] -bs