[00:16] *** zout has joined #archiveteam
[00:21] *** Aranje has joined #archiveteam
[00:44] is anybody actively tackling archive.is? there's a wiki page for it, but that seems to be the extent.
[00:47] unfortunately they seem to have removed the "all domains" listing, and the original search tool has been replaced with Google Custom Search.
[00:52] the URLs are 31 bits long which is too much for an exhaustive search.
[00:56] I'm concerned about archive.is
[00:57] It has a lot of data, and it looks like it's run by only one person
[00:57] I'm going to enumerate the top 1M domains from alexa's list, pending any better ideas.
[00:57] Please do
[00:57] We won't start a project for it though, since it's not in danger
[00:58] joepie91: bzc6p: I'll be fixing nujij tomorrow, the way it's currently being done is too slow
[01:16] *** BlueMaxim has joined #archiveteam
[01:26] running, though their host is awful slow to return results.
[01:51] *** rchrch has quit IRC (Ping timeout: 244 seconds)
[01:59] *** rchrch has joined #archiveteam
[02:07] *** dashcloud has joined #archiveteam
[02:15] *** WinterFox has joined #archiveteam
[02:32] *** JesseW has joined #archiveteam
[02:35] *** dashcloud has quit IRC (Read error: Operation timed out)
[02:35] *** alembic has joined #archiveteam
[03:14] *** SirCmpwn has quit IRC (Ping timeout: 260 seconds)
[03:21] *** filippo__ has quit IRC (Ping timeout: 244 seconds)
[03:25] *** filippo__ has joined #archiveteam
[03:34] *** ndiddy has quit IRC (Read error: Connection reset by peer)
[03:47] *** Ymgve has quit IRC ()
[04:11] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:19] *** Sk1d has joined #archiveteam
[04:33] gmane knows to contact me
[04:33] But he also I think did the "stop or I'll shoot" and got some money to go on
[04:34] *** Ymgve has joined #archiveteam
[04:34] well, for the news-server part, yes -- but the web interface is still down
[04:38] there's an awful lot of item URLs on archive.is, and I'm surely not getting all of them.
[04:40] "item URLs"?
[04:40] individual archives pages.
[04:49] zout: still not understanding you
[04:50] archives of what?
[04:50] are you trying to mirror archive.is?
[04:52] JesseW: I'm enumerating as many of the archives on archive.is as I can to gauge feasibility.
[04:52] just discovering the URLs, not downloading the content itself.
[05:23] *** RichardG has quit IRC (Ping timeout: 370 seconds)
[05:42] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[06:27] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[07:04] *** nicolas17 has quit IRC (Quit: U+1F634)
[07:05] *** Honno has joined #archiveteam
[07:25] *** tomwsmf has quit IRC (Read error: Operation timed out)
[09:20] FYI http://googleappsdeveloper.blogspot.com/2015/08/deprecating-web-hosting-support-in.html
[09:57] *** RichardG has joined #archiveteam
[10:16] *** GLaDOS has quit IRC (Oh crap, I died.)
[10:34] *** AlexLehm has joined #archiveteam
[10:51] *** BlueMaxim has quit IRC (Quit: Leaving)
[10:52] *** tomaspark has quit IRC (Ping timeout: 255 seconds)
[11:17] *** SirCmpwn has joined #archiveteam
[11:46] *** VADemon has joined #archiveteam
[12:08] *** Jeroen__u has joined #archiveteam
[12:43] *** morbus_ has joined #archiveteam
[12:44] *** Morbus has quit IRC (Read error: Operation timed out)
[12:47] Hey, just started a Warrior using VirtualBox and selected a project, but I don't think that it is actually doing anything. It is stuck on "The warrior is beginning work on a project."
[13:03] Sorry, wrong channel, going to #Warrior.
[13:20] *** GLaDOS has joined #archiveteam
[13:45] *** WinterFox has quit IRC (Ping timeout: 501 seconds)
[13:47] *** ravetcofx has quit IRC (Ping timeout: 501 seconds)
[14:05] *** dashcloud has joined #archiveteam
[14:14] *** VADemon has quit IRC (Read error: Connection reset by peer)
[14:38] *** dashcloud has quit IRC (Read error: Operation timed out)
[14:41] *** dashcloud has joined #archiveteam
[14:46] *** dashcloud has quit IRC (Remote host closed the connection)
[15:46] *** AlexLehm has quit IRC (Remote host closed the connection)
[16:22] Got an alert Bioware forums locked and deleted soon
[16:37] *** JesseW has joined #archiveteam
[16:41] *** metalcamp has joined #archiveteam
[16:54] zout: oops, sorry I missed where you explained what you were doing (now read through the scrollback). Good idea, thank you for doing it.
[16:57] SketchCow: How soon? The official date was October or such
[16:59] *** VADemon has joined #archiveteam
[17:08] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[17:11] *** kristian_ has joined #archiveteam
[17:15] *** ndiddy has joined #archiveteam
[17:36] advance notice: Imgur may be at some amount of risk: https://www.reddit.com/r/undelete/comments/4zx28b/imgur_removed_the_infamous_comcast_swastika_from/d6zpksd?context=1
[17:36] probably somewhat longer-term, but bad news for its longevity regardless, if that goes through
[17:36] *** metalcamp has joined #archiveteam
[17:41] I don't think banning it from /r/pics in itself would really matter. But Imgur are starting to "crack" around the edges, and they're huge and important, so they're definitely something to watch closely
[17:42] Frogging: it would, they get a ton of traffic from there
[17:43] yeah? hm, it's mostly hotlinks though surely
[17:44] Frogging: don't think so
[17:44] and even if it were, everybody recognizes an imgur link when they see it
[17:45] if /r/pics were to be full of otherhost.com, that's where users would flock to over time
[17:45] yeah, true
[17:47] so especially given the importance of imgur, I think we should treat this as an early warning sign, especially since the reasons why imgur might be banned there are also likely to drive users away in other ways
[17:47] er
[17:47] given the importance and size of imgur *
[17:47] it's -not- going to be an easy one to archive
[18:03] I think this is weird fringe response
[18:04] I think the thing to do with imgur is archive the most popular images
[18:07] as a start, yes. I wonder if a full grab would even be feasible if shit were to hit the fan, hypothetically
[18:08] No, of course not.
[18:08] It's got to be in the petabyte range now
[18:08] yeah :s
[18:09] *** tomwsmf has joined #archiveteam
[18:10] What I WOULD say is that if people wanted to whip all these reddit nerds into some storing frenzy there could be a distributed saving effort
[18:11] Their sitemap would be a good start: https://imgur.com/gallery/sitemap.xml
[18:12] I am amazed and pleased that this exists. cool
[18:14] Unfortunately reddit’s domain listing is disabled: https://www.reddit.com/domain/i.imgur.com/
[18:15] maybe there's an API way to do it
[18:16] protip: add .json after basically any Reddit URL
[18:16] :)
[18:16] (it's still disabled for that one though)
[18:17] maybe reddit itself would be willing to assist?
[18:17] didn't they provide some database dumps in the past?
[18:22] Comments only: https://archive.org/details/2015_reddit_comments_corpus
[18:33] https://twitter.com/ServerBear/status/765034545703813121
[18:33] serverbear is dead
[18:33] :0
[18:33] nuked historical hardware and performance stats of hosting providers
[18:33] fuckssake
[18:34] * joepie91 is only slightly bitter about this
[18:51] *** SirCmpwn has quit IRC (Read error: Operation timed out)
[18:56] *** bRick5772 has joined #archiveteam
[18:56] *** kristian_ has quit IRC (Leaving)
[19:11] *** JesseW has quit IRC (Read error: Operation timed out)
[19:44] *** kristian_ has joined #archiveteam
[19:46] *** notjack has joined #archiveteam
[19:46] Hey everyone, great to be here again ;)
[20:05] Hi, Jack
[20:06] Hey! ;)
[20:24] *** tomaspark has joined #archiveteam
[20:26] *** tomaspar1 has joined #archiveteam
[20:32] *** tomaspark has quit IRC (Quit: ChatZilla 0.9.92 [Firefox 48.0/20160728203720])
[20:36] *** SirCmpwn has joined #archiveteam
[21:00] *** VADemon has quit IRC (Quit: left4dead)
[21:07] *** dashcloud has joined #archiveteam
[21:15] Update on the tumblr and flickr projects. I've written some URL agnostic WARC deduplication scripts. Some example WARCs will be uploaded here and sent to Internet Archive
[21:15] To see if they are made correctly (they already play back good), or if anything is missing
[21:16] If they are good they will be implemented in the flickr and tumblr (and possibly yahooanswers) projects.
[21:16] Flickr is the first one to start.
[21:17] CC flickr images will be done first. Over these CC flickr images we're going to do two samples of 100000 images to know what size flickr will be in total
[21:18] One sample will be with all versions of the images and a second sample will be with only the original size and the size shown on the webpage of an image
[21:18] From there we will decide on what we're going to grab exactly from flickr.
[21:18] After CC images we're going to have a look at non-CC images and possibly do those too.
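A rough sketch of how the two 100000-image samples described above could be extrapolated to a total size. This is an illustration only, not the project's actual tooling: the function name `estimate_total_bytes`, the synthetic sample data, and the population count of 300 million are all assumptions made for the example, and the samples are assumed to be drawn uniformly at random.

```python
import math
import random
import statistics


def estimate_total_bytes(sample_sizes, population_count, z=1.96):
    """Extrapolate a total size from a simple random sample of per-image sizes.

    Returns (estimate, low, high) in bytes, where low/high bound an
    approximate 95% confidence interval (normal approximation).
    """
    n = len(sample_sizes)
    if n < 2:
        raise ValueError("need at least two sampled images")
    mean = statistics.fmean(sample_sizes)
    # Standard error of the mean. The finite-population correction is ignored,
    # which is reasonable when the sample is a small fraction of the whole.
    sem = statistics.stdev(sample_sizes) / math.sqrt(n)
    return (
        population_count * mean,
        population_count * (mean - z * sem),
        population_count * (mean + z * sem),
    )


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical stand-in data: 100000 sampled images, sizes in bytes.
    sample = [int(rng.lognormvariate(13, 1)) for _ in range(100_000)]
    # Hypothetical population count; the real number of CC images is not given.
    total, low, high = estimate_total_bytes(sample, population_count=300_000_000)
    print(f"estimated total: {total / 1e12:.1f} TB "
          f"({low / 1e12:.1f} to {high / 1e12:.1f} TB)")
```

Running the same function once on the "all versions" sample and once on the "original plus displayed size" sample would give two totals to compare before deciding what to grab.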
[21:19] That was the little update on where we are at the moment with these projects.
[21:20] if you have any suggestions or questions regarding the above, please post them
[21:27] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[21:48] *** dashcloud has quit IRC (Ping timeout: 260 seconds)
[22:15] *** RichardG has quit IRC (Keyboard not found, press F1 to continue)
[22:15] *** RichardG has joined #archiveteam
[22:42] *** logchfoo0 starts logging #archiveteam at Sun Aug 28 22:42:31 2016
[22:42] *** logchfoo0 has joined #archiveteam
[22:46] *** JonimusP has joined #archiveteam
[22:46] *** swebb sets mode: +o JonimusP
[22:47] *** jk[[SVP]] is now known as jk[SVP]
[22:47] *** LordNigh2 is now known as Lord_Nigh
[22:47] *** Kaz| is now known as Kaz
[22:53] The update on tumblr & flickr sounds good
[22:56] *** VonGuard_ has quit IRC (Ping timeout: 260 seconds)
[22:57] *** AlexLehm has joined #archiveteam
[23:02] *** kevin has quit IRC (Ping timeout: 260 seconds)
[23:09] *** VonGuard_ has joined #archiveteam
[23:14] *** Honno has quit IRC (Read error: Operation timed out)
[23:15] *** sHATNER has joined #archiveteam
[23:15] *** espes__ has joined #archiveteam
[23:15] *** xhdr has joined #archiveteam
[23:15] *** PepsiMax has joined #archiveteam
[23:15] *** tephra has joined #archiveteam
[23:26] *** kevin has joined #archiveteam
[23:28] *** ErkDog has quit IRC (Read error: Operation timed out)
[23:28] *** ErkDog has joined #archiveteam
[23:28] *** ErkDog has quit IRC (Remote host closed the connection)
[23:29] *** ErkDog has joined #archiveteam
[23:32] *** dashcloud has quit IRC (Remote host closed the connection)
[23:41] *** cadbury_ has joined #archiveteam
[23:43] *** ErkDog has quit IRC (Read error: Operation timed out)
[23:44] *** dserodio has quit IRC (Read error: Operation timed out)
[23:45] *** Zialus has quit IRC (Read error: Operation timed out)
[23:46] *** ErkDog has joined #archiveteam
[23:46] *** dserodio has joined #archiveteam
[23:50] *** Zialus has joined #archiveteam
[23:59] *** dserodio has quit IRC (Read error: Operation timed out)