[00:00] *** WinterFox has joined #archiveteam
[00:19] *** swebb has quit IRC (Ping timeout: 370 seconds)
[00:23] *** philpem has quit IRC (Ping timeout: 260 seconds)
[00:24] *** BlueMaxim has joined #archiveteam
[00:31] *** swebb has joined #archiveteam
[00:31] *** swebb has quit IRC (Client Quit)
[00:34] *** ariscop has joined #archiveteam
[00:47] *** BlueMaxim has quit IRC (Quit: Leaving)
[00:49] *** atlogbot has joined #archiveteam
[00:50] *** BlueMaxim has joined #archiveteam
[00:50] *** swebb has joined #archiveteam
[00:57] *** swebb has quit IRC (Ping timeout: 246 seconds)
[00:57] *** atlogbot has quit IRC (Read error: Operation timed out)
[01:09] *** Froggypwn has quit IRC (Ping timeout: 1208 seconds)
[01:15] *** W1nterFox has joined #archiveteam
[01:21] *** swebb has joined #archiveteam
[01:22] *** WinterFox has quit IRC (Read error: Operation timed out)
[01:24] *** Froggypwn has joined #archiveteam
[01:33] *** Ymgve has quit IRC (hub.dk irc.homelien.no)
[01:33] *** altlabel has quit IRC (hub.dk irc.homelien.no)
[01:33] *** Darkstar has quit IRC (hub.dk irc.homelien.no)
[01:33] *** PotcFdk has quit IRC (hub.dk irc.homelien.no)
[01:33] *** PurpleSym has quit IRC (hub.dk irc.homelien.no)
[01:33] *** sHATNER has quit IRC (hub.dk irc.homelien.no)
[01:33] *** i0npulse has quit IRC (hub.dk irc.homelien.no)
[01:33] *** Meeh has quit IRC (hub.dk irc.homelien.no)
[01:37] *** Meeh_ has joined #archiveteam
[01:40] *** ndiddy has joined #archiveteam
[01:49] *** swebb has quit IRC (Ping timeout: 246 seconds)
[01:55] *** PotcFdk has joined #archiveteam
[01:57] *** Darkstar has joined #archiveteam
[01:59] *** i0npulse has joined #archiveteam
[02:05] huh
[02:05] k5 shut down?
[02:07] What's k5?
[02:07] their current robots.txt is blocking IA, although I assume it's been around long enough there are captures
[02:08] kuro5hin
[02:09] oh, wow. I mean, I hadn't thought about it in years, which is probably the point, but still -- wow.
[02:10] kuro5hin was the slashdot alternative for a while, wasn't it?
[02:10] before reddit, then hackernews took that title
[02:11] yep
[02:11] I'm sorry we didn't run an archivebot job on it. :-(
[02:12] Maybe we should have a project to scan the top xxx,000 domains' robots.txt files to see if they block the IA and then put them in ArchiveBot (or similar) to ensure they're captured at least once
[02:13] MrRadar: That makes sense to me, although we should probably discuss it in a non-logged channel. :-)
[02:13] Good point
[02:13] better hide from the cops
[02:13] doesn't the internet census file have the robots.txt, or am I mistaken?
[02:14] Yes, the wayback machine will always capture robots.txt files, even if the robots.txt file blocks IA
[02:14] I presume if someone contacts IA directly, they can get that hidden too, but I doubt many people have.
[02:15] I suggest #robotstxtsucks as a good channel name
[02:35] *** ralphdnak has joined #archiveteam
[02:54] *** bwn has quit IRC (Read error: Operation timed out)
[03:11] *** swebb has joined #archiveteam
[03:20] *** atlogbot has joined #archiveteam
[03:46] *** bwn has joined #archiveteam
[04:32] *** altlabel has joined #archiveteam
[04:33] *** _Crocatow has quit IRC (Quit: Later Ya'll)
[04:40] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:47] *** Sk1d has joined #archiveteam
[04:56] *** balrog has quit IRC (Read error: Operation timed out)
[04:57] *** balrog has joined #archiveteam
[05:18] *** ralphdnak has quit IRC (Read error: Operation timed out)
[05:30] *** SketchCow has quit IRC (Read error: Connection reset by peer)
[05:39] *** schbirid has joined #archiveteam
[05:42] *** SketchCow has joined #archiveteam
[05:50] *** kcaj has quit IRC (Quit: ZNC - 1.6.0 - http://znc.in)
[05:56] *** kcaj has joined #archiveteam
[05:56] *** RichardG_ has joined #archiveteam
[06:00] *** RichardG has quit IRC (Read error: Operation timed out)
[06:11] *** Meroje has quit IRC (Quit: bye!)
[06:11] *** Meroje has joined #archiveteam
[06:33] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[06:39] *** ariscop has quit IRC (Read error: Operation timed out)
[06:50] *** wyatt8740 has quit IRC (Ping timeout: 246 seconds)
[06:50] *** PurpleSym has joined #archiveteam
[06:52] *** redlob has quit IRC (Remote host closed the connection)
[06:53] *** wyatt8740 has joined #archiveteam
[06:57] *** redlob has joined #archiveteam
[07:25] *** Honno has joined #archiveteam
[07:25] *** Honno_ has joined #archiveteam
[07:29] *** ralphdnak has joined #archiveteam
[07:30] *** Honno_ has quit IRC (Quit: Leaving)
[07:31] *** Honno_ has joined #archiveteam
[07:32] *** Honno has quit IRC (Read error: Operation timed out)
[07:42] *** brayden_ has joined #archiveteam
[07:44] *** brayden has quit IRC (Read error: Operation timed out)
[08:14] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[08:25] *** wyatt8740 has joined #archiveteam
[08:31] *** philpem has joined #archiveteam
[08:31] *** ariscop has joined #archiveteam
[08:49] *** atomotic has joined #archiveteam
[08:53] *** bwn has quit IRC (Read error: Operation timed out)
[09:08] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[09:10] *** bwn has joined #archiveteam
[09:10] *** wyatt8740 has joined #archiveteam
[09:33] *** fmope_ has quit IRC (Remote host closed the connection)
[09:33] *** fmope_ has joined #archiveteam
[09:38] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[09:45] *** BartoCH has joined #archiveteam
[09:52] *** BartoCH has quit IRC (Read error: Connection timed out)
[09:52] *** BartoCH has joined #archiveteam
[09:55] *** pfallenop has quit IRC (Ping timeout: 260 seconds)
[09:56] *** pfallenop has joined #archiveteam
[10:10] *** pfallenop has quit IRC (Ping timeout: 260 seconds)
[10:11] *** pfallenop has joined #archiveteam
[10:18] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[10:20] *** pfallenop has quit IRC (Remote host closed the connection)
[10:30] *** RichardG has joined #archiveteam
[10:32] *** pfallenop has joined #archiveteam
[10:32] *** RichardG_ has quit IRC (Ping timeout: 272 seconds)
[10:42] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[10:51] *** wyatt8740 has joined #archiveteam
[11:00] *** metalcamp has joined #archiveteam
[11:16] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[11:34] *** RichardG has quit IRC (Ping timeout: 272 seconds)
[11:53] *** W1nterFox has quit IRC (Remote host closed the connection)
[12:00] *** vitzli has joined #archiveteam
[12:02] *** RichardG has joined #archiveteam
[12:06] *** atomotic has joined #archiveteam
[12:15] *** vitzli has quit IRC (Quit: Leaving)
[12:43] *** pfallenop has quit IRC (Ping timeout: 244 seconds)
[12:46] *** xXx_ndidd has joined #archiveteam
[12:47] *** ndizzle has joined #archiveteam
[12:48] *** xXx_ndidd has quit IRC (Read error: Operation timed out)
[12:50] *** pfallenop has joined #archiveteam
[12:58] *** ndiddy has quit IRC (Read error: Operation timed out)
[13:02] *** RichardG has quit IRC (Ping timeout: 499 seconds)
[13:07] *** RichardG has joined #archiveteam
[13:09] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[13:11] *** oli has quit IRC (Read error: Operation timed out)
[13:14] *** oli has joined #archiveteam
[13:17] *** BartoCH has joined #archiveteam
[13:20] *** BlueMaxim has quit IRC (Quit: Leaving)
[14:13] *** MMovie2 has joined #archiveteam
[14:14] *** MMovie has quit IRC (Read error: Operation timed out)
[14:31] *** Start has quit IRC (Quit: Disconnected.)
[14:35] *** ralphdnak has quit IRC (Read error: Connection reset by peer)
[14:39] *** ralphdnak has joined #archiveteam
[14:39] *** ralphdnak has quit IRC (Read error: Connection reset by peer)
[14:40] *** ralphdnak has joined #archiveteam
[14:41] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[15:12] *** dashcloud has quit IRC (Read error: Operation timed out)
[15:17] *** xXx_ndidd has joined #archiveteam
[15:19] *** dashcloud has joined #archiveteam
[15:29] *** Start has joined #archiveteam
[15:29] *** ndizzle has quit IRC (Read error: Operation timed out)
[15:34] *** atomotic has joined #archiveteam
[15:53] *** bsmith093 has quit IRC (Ping timeout: 370 seconds)
[15:55] *** bsmith093 has joined #archiveteam
[15:55] *** JesseW has joined #archiveteam
[16:04] *** atomotic has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
[16:05] *** SimpBrain has joined #archiveteam
[16:06] *** Start has quit IRC (Quit: Disconnected.)
[16:16] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[16:29] any concurrency limit on yuku-grab? :D
[16:32] only that and google code running now?
[16:32] and urlteam ofc
[16:32] *** Start has joined #archiveteam
[16:33] There is a chance GameFront might be coming back though, waiting on arkiver to look at it more
[16:35] With the wrong size download thing, I tested multithreading each download, and it seemed to finish
[16:35] but wget-lua can't do it
[17:11] *** godane has quit IRC (Quit: Leaving.)
[17:17] *** VADemon has joined #archiveteam
[17:26] *** bwn has quit IRC (Read error: Operation timed out)
[17:31] *** Neurosplo has joined #archiveteam
[17:32] *** ralphdnak has quit IRC (Read error: Operation timed out)
[17:35] *** FalconK has quit IRC (Ping timeout: 260 seconds)
[17:37] *** Start has quit IRC (Quit: Disconnected.)
[17:38] *** FalconK has joined #archiveteam
[17:38] *** FalconK has quit IRC (Read error: Connection reset by peer)
[17:38] *** FalconK has joined #archiveteam
[17:42] Good news about Kuro5hin. The person running the site showed up on Hacker News and said that it disappeared because his hosting provider shut down the datacenter his server was in: https://news.ycombinator.com/item?id=11612648
[17:42] He is planning to bring the site back in some form, though it may be a static archive
[17:42] oh damn
[17:42] "good" news
[17:43] Also someone further down in the thread was running a scraper on them and has a partial archive of the Kuro5hin user diaries (but not the stories): https://news.ycombinator.com/item?id=11609514
[17:46] *** atomotic has joined #archiveteam
[17:48] *** bwn has joined #archiveteam
[17:52] *** xarph has joined #archiveteam
[18:18] *** Start has joined #archiveteam
[18:46] *** godane has joined #archiveteam
[18:51] *** metal_cam has joined #archiveteam
[18:52] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[19:15] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[19:25] *** Ymgve has joined #archiveteam
[19:35] *** scyther_ has joined #archiveteam
[19:35] *** scyther_ has quit IRC (Connection closed)
[19:44] *** Start has quit IRC (Quit: Disconnected.)
[19:46] *** jessehick has joined #archiveteam
[20:42] *** ariscop has quit IRC (Read error: Operation timed out)
[20:59] *** Neurosplo has quit IRC ()
[21:05] *** schbirid has quit IRC (Quit: Leaving)
[21:35] *** xarph has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[21:46] *** Emcy has quit IRC (Read error: Connection reset by peer)
[21:46] *** ariscop has joined #archiveteam
[22:13] *** SimpBrain has quit IRC (Remote host closed the connection)
[22:20] arkiver: Sent you a mail with the naming details for the next two uploads. Have a look at it when you can.
[22:21] *** Honno_ has quit IRC (Read error: Operation timed out)
[22:25] zino: I'll have a look!
[22:26] #videobot now supports twitter.com and vine.co videos.
[22:26] The videos will be downloaded as WARCs and as 'normal' video items, which will be uploaded to IA.
[22:27] *** Emcy has joined #archiveteam
[22:34] *** Ravenloft has joined #archiveteam
[22:36] *** metal_cam has quit IRC (Ping timeout: 244 seconds)
[23:00] *** redlob has quit IRC (Read error: Operation timed out)
[23:02] *** redlob has joined #archiveteam
[23:17] *** vOYtEC has quit IRC (Read error: Connection reset by peer)
[23:17] *** vOYtEC has joined #archiveteam
[23:20] *** justsome has joined #archiveteam
[23:22] hello
[23:22] hi
[23:23] who do i talk to about the project?
[23:25] justsome: what project?
[23:27] the archiveteam project, i have a question about it
[23:27] archiveteam does a bunch of things, but this is the right place to ask questions about it, yes.
[23:28] ok well then... is there a list of sites that the team has recently archived somewhere? i couldn't find it on the wiki...
[23:29] justsome: http://archiveteam.org , scroll down to "Recently finished"?
[23:31] ah sorry i mean specifically FTP sites actually... i went to the internet archive page linked from the wiki, and all i see are daily/hourly archive files, there is no single list i can find
[23:31] i want to know what FTP sites have been grabbed, say, within the past few months
[23:32] ah, the FTP project. probably worth asking #effteepee also.
[23:33] it's been silent for two days now i think
[23:33] heh
[23:33] that's why i came here...
[23:34] your wiki tells people to go on IRC to talk about the project... and there's almost nobody talking about the project on IRC...
[23:34] It doesn't look like there's an up to date index that I can find, either.
[23:35] You can extract one yourself from the cdx files, e.g. https://archive.org/download/archiveteam_ftp_20160416204923/archiveteam_ftp_20160416204923.cdx.gz
[23:35] but that's quite a bit of a hassle.
[23:35] so how does each batch work? there are so many uploads per day on that site
[23:35] is it just spread out so you don't upload too much at once? how does it work?
[23:35] arkiver may be able to answer more usefully about specific details. I'm not that familiar with the FTP grab.
[23:36] uploads on which site?
[23:36] https://github.com/ArchiveTeam/ftp-items might give some clues
[23:36] the site you linked to
[23:37] archive.org? Yes, there are a lot of uploads to it. :-)
[23:37] no i mean why are there so many "archiveteam" files on archive.org *per day*
[23:37] specifically the FTP files
[23:38] or is each one one site or something?
[23:38] *ahhh*. Well, we pack the data we grab into big collections, called megawarcs. And each one gets a separate item.
[23:38] For the current grab, I don't think they are separated by site, no.
[23:39] the reason there were lots uploaded on Saturday is that the pipeline between our staging server and the Internet Archive was jammed a bit, and unjammed on that day (I think).
[23:40] actually, cancel that — I think that was a different project.
[23:41] But the reason for the multiple items is just to avoid having painfully large things in each item. It makes it easier on the Wayback Machine (which parses them for display)
[23:41] and makes it harder to search for something... way harder.
[23:42] it's not particularly arranged for ease of searching, in that form.
[23:42] *** slpeeds has joined #archiveteam
[23:42] I thought IA had a Web tool that would search through CDXs for things
[23:43] There may be one — I don't remember what it's called.
[23:43] Before we started the FTP grab as WARCs I emailed about it with SketchCow
[23:43] I gave him a sample of how a FTP WARC would look and asked if the wayback machine would get support for FTP sites
[23:44] ah neat
[23:44] SketchCow let me know I could start the FTP grab.
[23:44] Currently FTP WARCs are not recognized by the cdx-creator
[23:44] cool
[23:44] So basically you'd have to download these WARCs and go through them
[23:44] so they can't be searched than
[23:44] then*
[23:44] No.
[23:45] Not at the moment
[23:45] Someone certainly could create an index though.
[23:45] Or fix cdx-creator (which I presume is planned eventually)
[23:45] But if the wayback machine has support for them, this is a better way to save them than in tar files I think
[23:46] I'll ask around a bit more about FTP support for the wayback machine
[23:46] it looks like the grab from Feb 2015 separated the items by site
[23:47] is anyone doing these sites in particular: http://archiveteam.org/index.php?title=FTP/List
[23:47] *** fmope_ has quit IRC (Ping timeout: 864 seconds)
[23:49] ErkDog is currently scanning a lot of university FTPs, that should give us some work soon
[23:49] We grabbed the canadian FTPs as far as they are online
[23:49] not sure about the others
[23:49] where are the canadian FTP archives?
[23:49] in the WARCs
[23:50] ... and that's why not having an index/search method isn't helpful.
[23:50] see the above story
[23:50] justsome: it'll come (or maybe you could write it)?
[23:51] do you only grab these sites once?
[23:53] i'm assuming that every day/hour's grab is different in that case...
[23:53] (that is, of different sites)
[23:53] or is there a way to optimize my manual search, so to speak?
[23:53] Sites can be rescanned, all new files or files that changed in size will then be re-added to the lists to be grabbed
[23:53] no i mean how should i browse through the WARCs
[23:53] What are you looking for exactly?
[23:53] the canadian sites
[23:54] is there a date range, or something...? some search tip?
[23:54] yeah, but what in those FTPs?
[23:54] we have this with file lists https://github.com/ArchiveTeam/ftp-queue/tree/master/archive
[23:54] I'm not sure if that list is complete at the moment
[23:55] that repo should probably get added to the wiki page
[23:56] hmmm... the site i'm focused on doesn't seem to be in that list, yet it's in the list i linked to (the one on the wiki)
[23:57] which site is it
[23:57] probably worth making a list of *all* the sites in the wiki list that aren't in the ftp-queue list.
[23:58] presumably they are ones that went down before we could get to them?
[23:59] the site i'm referring to may be taken down soon... which is why i'm worried about it in the first place: support.crtc.gc.ca
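
Editor's sketch: the suggestion at 23:57, listing every site on the wiki list that is missing from the ftp-queue list, amounts to a set difference over hostnames. The Python below is a minimal illustration only. It assumes both lists have already been reduced to one hostname per entry, which the log does not confirm, and the function name `missing_hosts` and the `ftp.example.*` entries are made up for the example. Only `support.crtc.gc.ca` comes from the conversation.

```python
def missing_hosts(wiki_hosts, queued_hosts):
    """Return hosts on the wiki list that are absent from the queue list, sorted."""

    def normalize(host):
        # Compare hosts case-insensitively and ignore an optional
        # "ftp://" prefix and trailing slash.
        host = host.strip().lower()
        if host.startswith("ftp://"):
            host = host[len("ftp://"):]
        return host.rstrip("/")

    queued = {normalize(h) for h in queued_hosts if h.strip()}
    wiki = {normalize(h) for h in wiki_hosts if h.strip()}
    return sorted(wiki - queued)


# Hypothetical inputs; the real lists would come from the wiki page
# and the ftp-queue repository, in whatever format those actually use.
wiki_list = ["ftp://support.crtc.gc.ca/", "ftp.example.org", "FTP.Example.net"]
queue_list = ["ftp.example.org", "ftp.example.net"]

print(missing_hosts(wiki_list, queue_list))  # ['support.crtc.gc.ca']
```

Normalizing before comparing avoids false "missing" reports when the two lists spell the same host differently.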