[00:33] *** godane has joined #archiveteam
[00:43] SketchCow: I'd like to start writing a project for tumblr
[00:44] arkiver: xmc was going to start one I think
[00:44] we were just talking about it in -bs
[00:44] i haven't gotten around to it, so if you do it first then you win
[00:44] i have some ideas about how to do it that might be valuable, but they're in scrollback of #-bs already
[00:45] it would require two projects and a tiny bit of serverside code but yipdw is willing
[01:00] *** WinterFox has joined #archiveteam
[01:07] *** DoomTay has joined #archiveteam
[01:08] *** Coderjoe has quit IRC (Read error: Operation timed out)
[01:08] *** rossdylan has quit IRC (Read error: Operation timed out)
[01:28] *** Coderjoe has joined #archiveteam
[01:50] *** godane has quit IRC (Read error: Operation timed out)
[02:25] *** philpem has quit IRC (Ping timeout: 260 seconds)
[02:36] *** Aranje has quit IRC (Ping timeout: 260 seconds)
[03:49] *** Coderjoe has quit IRC (Read error: Operation timed out)
[04:12] *** Aranje has joined #archiveteam
[04:35] *** Coderjoe has joined #archiveteam
[04:46] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:52] *** Sk1d has joined #archiveteam
[05:04] *** TC02 has quit IRC (Ping timeout: 246 seconds)
[05:27] *** TC02 has joined #archiveteam
[05:31] *** DoomTay has quit IRC (Quit: Page closed)
[05:31] *** n00b484 has joined #archiveteam
[05:32] it seems to be working now but can't connect using mIRC
[05:35] *** n00b484 has quit IRC (Client Quit)
[05:40] *** TC01 has quit IRC (Ping timeout: 260 seconds)
[05:53] *** yipdw has quit IRC (Read error: Operation timed out)
[05:54] *** TC01 has joined #archiveteam
[05:54] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[05:56] *** JesseW has joined #archiveteam
[06:08] *** yipdw has joined #archiveteam
[06:33] *** tomwsmf has quit IRC (Ping timeout: 258 seconds)
[07:07] *** metal_cam has joined #archiveteam
[07:13] *** Emcy has quit IRC (Read error: Operation timed out)
[07:14] *** TC02 has quit IRC (Ping timeout: 246 seconds)
[07:21] *** TC02 has joined #archiveteam
[07:30] *** TC02 has quit IRC (Ping timeout: 246 seconds)
[07:31] *** TC02 has joined #archiveteam
[07:31] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[08:32] *** W1nterFox has joined #archiveteam
[08:35] *** WinterFox has quit IRC (Ping timeout: 492 seconds)
[09:12] *** philpem has joined #archiveteam
[09:16] *** schbirid has joined #archiveteam
[09:48] *** pfallenop has quit IRC (Ping timeout: 260 seconds)
[09:54] xmc: I'll have a look at the logs of #-bs. I'm not sure how spread out it is over the logs, so if you are around, maybe you could give me a small overview of the idea?
[09:54] I'm basically thinking a warrior project to create the WARCs for the wayback machine.
[09:54] We'll be extracting new tumblr sites as we archive them
[10:00] I see the main idea is to separate archiving images and blogs
[10:02] I've started a little on the warrior project now
[10:04] We should be able to do some test runs soon
[10:08] xmc: what would be the reason for separating the grab of images from the other files?
[10:13] *** Scuttle has joined #archiveteam
[10:13] hum...I have an archivebot pulling data from one of my sites, how do I find out what's going on? :)
[10:14] hi
[10:14] *** GLaDOS has quit IRC (Read error: Connection reset by peer)
[10:14] What is your site?
[10:14] *** GLaDOS has joined #archiveteam
[10:15] The dashboard of ArchiveBot can be found here http://archivebot.com/
[10:17] randomwaffle.gbs.fm
[10:17] Looks like it's on the dashboard
[10:17] right
[10:18] if someone wants, I can rsync the whole site somewhere
[10:18] not sure what the grab is doing now though
[10:18] well, downloading everything it seems :)
[10:19] *** pfallenop has joined #archiveteam
[10:19] Yeah it's going to put them onto the internet archive
[10:19] Notes say you're the last surviving waffleimages mirror?
[10:19] may very well be
[10:19] How big is the repo?
[10:19] around 330 gigs
[10:20] We can delay the crawl / make it less resource intensive if it's causing you problems
[10:21] ah, that's no problem, I just noticed my access-logs were a lot bigger than they used to be :D
[10:21] aha :)
[10:21] Seems someone wants to preserve it forever
[10:21] So it got added to the crawlers to download & upload to the internet archive / viewable in the Wayback Machine
[10:21] *** terg has joined #archiveteam
[10:21] aight
[10:22] it's mostly forum-linked pics though I think...
[10:22] 182 GB done, so about halfway
[10:22] and that would be broken anyways since I don't have access to the waffleimages-domain
[10:22] No other notes :-/
[10:24] post the KAT raid, apart from proxies and such, is there any database lying about of KAT torrent info?
[10:24] There is a torrent (ironically) kicking around somewhere
[10:25] very unfortunate, I wonder if it'd be a good idea to do regular (incremental if possible) archivals of large torrent indexes
[10:25] whereabouts should I look?
[10:25] I think that is a good idea
[10:26] I'm planning on getting something going to go by all torrent sites
[10:26] we already have a good archive of rutracker
[10:26] Good idea, scale is an issue
[10:26] But let's move this to #archiveteam-bs
[10:26] gotcha
[11:04] *** Atom-- has quit IRC (Read error: Operation timed out)
[11:04] *** winterfox has joined #archiveteam
[11:05] *** Emcy has joined #archiveteam
[11:06] *** W1nterFox has quit IRC (Ping timeout: 492 seconds)
[11:07] *** Emcy has quit IRC (Client Quit)
[11:30] *** Emcy has joined #archiveteam
[11:57] *** Emcy_ has joined #archiveteam
[12:05] *** Sanqui has left .
[12:05] *** Sanqui has joined #archiveteam
[12:06] *** Coderjoe has quit IRC (Read error: Operation timed out)
[12:06] *** terg has quit IRC (My Mac has gone to sleep. ZZZzzz…)
[12:10] *** Emcy has quit IRC (Read error: Operation timed out)
[12:16] *** Coderjoe has joined #archiveteam
[12:32] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[12:35] *** BartoCH has joined #archiveteam
[12:49] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[12:49] *** BlueMaxim has joined #archiveteam
[13:06] *** atomotic has joined #archiveteam
[13:13] *** kristian_ has joined #archiveteam
[13:20] *** Coderjoe has quit IRC (Read error: Operation timed out)
[13:23] *** Coderjoe has joined #archiveteam
[13:29] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[13:40] *** vOYtEC has joined #archiveteam
[13:49] *** GLaDOS has quit IRC (Quit: Oh crap, I died.)
[13:49] *** GLaDOS has joined #archiveteam
[13:56] *** REiN^ has quit IRC (Ping timeout: 244 seconds)
[13:58] *** redlob has quit IRC (ZNC - http://znc.in)
[14:03] *** redlob has joined #archiveteam
[14:14] *** BlueMaxim has quit IRC (Quit: Leaving)
[14:15] *** REiN^ has joined #archiveteam
[15:05] *** DoomTay has joined #archiveteam
[15:30] arkiver: split the images off because otherwise you'll get all the images copied into every blog's warc. that will multiply your grab size by like fifty
[15:51] IA is currently working on something to deduplicate WARCs.
[15:51] Duplicate records will be replaced by revisit records
[15:51] I'll ask around, but size might not matter too much
[15:52] Bandwidth is more of a problem
[15:52] Jumping in on this - I'm assuming that means IA will only keep 1 copy of everything, even if the same file is uploaded in every WARC?
[15:53] If the same file is uploaded in 50 WARCs, 49 WARCs will have revisit records and 1 WARC will hold the actual file
[15:53] (If I understood IA's idea correctly)
[15:53] revisit records being just a pointer to the actual file?
[15:53] Yeah
[15:54] Does this mean no more cases of multiple timestamps for things that haven't changed at all?
[15:54] awesome, had always wondered if there was an easy way to do that, though it sounds like a ton of processing work. Makes sense for it to be done on IA's end really
[15:54] As far as I know, just a redirect to another record, without making the URL and timestamp in the Wayback Machine look like it is redirected
[15:54] DoomTay: no, see above ^
[15:55] The idea isn't totally clear yet though, still being discussed, so things might change
[16:00] Okay, so after looking at what a revisit record is, it looks like it IS some form of redundancy removal
[16:00] Yay
[16:01] Though I doubt this would save more than, say, a few gigs worth of filespace
[16:08] huh
[16:09] Replacing a whole file and just throwing a pointer in saves tons
[16:09] Even with just AT's stuff, there's an absolute TON of duplication, due to the nature of what we do
[16:09] When you scale that up to IA, that's terabytes, at least
[16:11] Hell, maybe petabytes
[16:12] Speaking of files, anyone know how copies with matching digests can still have different lengths? Is that actually the length of the WARC?
[16:12] Like with http://web.archive.org/cdx/search/cdx?url=http://www.doomworld.com/batman/main.JPG&output=json
[16:15] wget has a dedup flag for warc btw
[16:16] that only goes so far though schbirid, I guess that works for ArchiveBot, but not warrior projects
[17:02] *** kristian_ has quit IRC (Leaving)
[17:29] *** Scuttle has left Leaving
[17:50] *** JesseW has joined #archiveteam
[18:12] *** tomwsmf has joined #archiveteam
[18:38] *** metal_cam is now known as metalcamp
[19:05] *** Start has quit IRC (Read error: Connection reset by peer)
[19:05] *** Start has joined #archiveteam
[19:16] no idea what it means but http://ddl-warez.to/ has a notice "only 68 days left"
[19:17] ....and it has freaking CloudFlare
[19:19] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[19:21] of course
[20:09] *** maseck has quit IRC (Quit: No Ping reply in 180 seconds.)
[20:09] *** maseck has joined #archiveteam
[20:50] ...and there we go http://www.bbc.com/news/business-36879831
[20:50] .title
[20:51] *** kristian_ has joined #archiveteam
[20:51] .title http://www.bbc.co.uk/news/business-36879831
[20:51] sod it, "Verizon 'agrees $5bn Yahoo deal'"
[21:07] Oh God. Yahoo and AOL having a baby
[21:08] *** Kazzy is now known as Kaz
[21:10] This is gonna be fun...
[21:26] *** Actium has joined #archiveteam
[21:52] *** godane has joined #archiveteam
[21:54] Supercookies for everyone!
[21:54] *** Emcy_ has quit IRC (Read error: Operation timed out)
[21:55] *** Emcy_ has joined #archiveteam
[22:00] *** pguth_ has quit IRC (Remote host closed the connection)
[22:00] *** pguth_ has joined #archiveteam
[22:02] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[22:02] *** Emcy_ has quit IRC (Read error: Operation timed out)
[22:04] *** Emcy_ has joined #archiveteam
[22:10] *** kristian_ has quit IRC (Leaving)
[22:15] *** winterfox has quit IRC (Ping timeout: 492 seconds)
[22:33] *** ndiddy has joined #archiveteam
[22:37] *** Coderjoe has quit IRC (Ping timeout: 260 seconds)
[22:38] *** Coderjoe has joined #archiveteam
[22:51] *** dashcloud has joined #archiveteam
[23:01] *** pguth_ has quit IRC (Remote host closed the connection)
[23:01] *** pguth_ has joined #archiveteam
[23:11] *** kristian_ has joined #archiveteam
[23:19] *** Coderjoe has quit IRC (Ping timeout: 260 seconds)
[23:22] *** Swaxx has joined #archiveteam
[23:23] hi anyone here?
[23:23] * Actium says hi and goes back into hiding
[23:24] how can i post a link in a forumpost?
[23:25] I don't think this is the place for that
[23:25] ow okay,
[23:25] do you know the irc address to extratorrents?
[23:28] *** Swaxx has quit IRC (Quit: Page closed)
[23:34] *** Coderjoe has joined #archiveteam
[23:37] *** BlueMaxim has joined #archiveteam
[23:55] *** closure has joined #archiveteam