[00:09] grabbing it
[00:11] currently archiving 2100 channels, it's taking a while
[00:13] *** aaaaaaaaa has joined #archiveteam
[00:16] yeah, I need to sort the disk space on my virtualisation server then I can ace a few channels
[00:21] do we have a formal log of who has what? if not, we should start one. just a list of names and videos with notes like "entire playlist" and "entire channel up to October 2015" will do at first. I'm saying this to make it easy to fold our efforts into a big push on the off chance that youtube does something inadvisable like deleting all the nonprofitable channels
[00:21] * joepie91 is grabbing nopefully
[00:22] Microguru: the only easy thing for me to share is a list of channels+users+playlists I'm scheduled to grab, and a list of all video IDs that I have
[00:23] marvinw, that's likely enough info for now.
[00:31] *** HCrossSRV has joined #archiveteam
[00:34] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[00:41] in related news here is the god-awful software I am using to archive users/channels/playlists https://www.refheap.com/0a0da5376faa8984131d83f83/raw
[00:45] definitely looks like quite a hack. how do the proxies work? are they different servers in different countries that you are ssh port forwarding to?
[00:48] Microguru: spiped to a polipo proxy
[00:48] yes, different countries
[00:49] when it comes to setting up download servers, do you think that being geographically diverse is a good idea? many hosting providers let you set up servers in several countries. Digital ocean, for example, lets me have a server in the netherlands, USA, singapore, UK, germany and canada
[00:49] digitalocean is not very good for archiving because you don't get unmetered bandwidth
[00:50] *** philpem has quit IRC (Ping timeout: 252 seconds)
[00:50] true.
I think that my VPN server gets 1 TB a month, and it's the $5 package
[00:50] being geographically diverse is good but beware of crawling the web from countries that censor websites
[00:51] are datacenters generally subject to national filters?
[00:51] I don't know
[00:52] my tests from online.net suggested that French censorship did not apply
[00:52] best to test it anyway
[00:52] in Singapore, for example, "Censorship of sexual, political and racially or religiously sensitive content is extensive."
[00:52] let's move to -bs
[00:52] they might be a bad place for archiving if they apply that to datacenters and not just to citizens
[00:52] ok
[00:52] to -bs it is
[00:54] looks like IA groks youtube videos now? https://web.archive.org/web/20130303114655if_/http://www.youtube.com/embed/dU1xS07N-FA
[00:54] so does that mean that all we need to do is scrape for videos and let IA do the rest? sounds more bandwidth efficient
[00:54] joepie91: they implemented this a while ago when they grabbed ~1PB of YouTube but I think they mostly stopped grabbing YouTube
[00:55] aha
[00:56] it also works on the /watch?v= pages
[00:56] if you're lucky Chrome will manage to load some kind of replacement Flash (?) player
[00:57] your /embed/ link isn't working for me
[00:57] no video playback, I think that's just the YouTube player failing to load YouTube content?
[00:57] "when they grabbed ~1PB of YouTube" what did they all grab?
[00:58] #-bs
[01:00] *** Sanqui has quit IRC (Read error: Operation timed out)
[01:01] *** wp494_ has joined #archiveteam
[01:02] *** stevieo has joined #archiveteam
[01:04] *** cloudmons has quit IRC (Ping timeout: 506 seconds)
[01:05] *** SiBurning has quit IRC (Ping timeout: 506 seconds)
[01:05] *** wp494 has quit IRC (Ping timeout: 506 seconds)
[01:05] *** robink has quit IRC (Ping timeout: 506 seconds)
[01:09] *** BlueMaxim has joined #archiveteam
[01:21] *** wp494_ is now known as wp494
[01:27] *** aaaaaaaaa has quit IRC (Leaving)
[01:33] *** cloudmons has joined #archiveteam
[01:34] *** robink has joined #archiveteam
[01:39] *** Sanqui has joined #archiveteam
[01:45] *** cloudmons has quit IRC (Read error: Connection reset by peer)
[01:45] *** cloudmons has joined #archiveteam
[02:01] *** aaaaaaaaa has joined #archiveteam
[02:04] *** godane has quit IRC (Ping timeout: 268 seconds)
[02:12] *** cloudmons has quit IRC (Read error: Connection reset by peer)
[02:12] *** cloudmons has joined #archiveteam
[02:13] *** robink has quit IRC (Read error: Connection reset by peer)
[02:14] *** robink has joined #archiveteam
[02:23] *** Coderjoe_ has quit IRC (Ping timeout: 252 seconds)
[02:25] *** Coderjoe has joined #archiveteam
[02:49] Microguru: 56,302 copies of "Shake it Off"
[02:50] Back home!
[02:50] ....for two days
[02:56] *** stevieo has quit IRC (Read error: Connection reset by peer)
[02:57] *** godane has joined #archiveteam
[03:32] Uploading of Gamefront begins.
[03:48] *** zenguy_pc has quit IRC (Read error: Connection reset by peer)
[04:04] *** zenguy_pc has joined #archiveteam
[04:18] *** Froggypwn has joined #archiveteam
[04:25] *** aaaaaaaaa has quit IRC (Leaving)
[04:44] *** zenguy_pc has quit IRC (Read error: Connection reset by peer)
[04:44] *** matthusb- has quit IRC (Read error: Operation timed out)
[04:44] *** matthusby has joined #archiveteam
[05:01] *** zenguy_pc has joined #archiveteam
[05:38] *** swebb has quit IRC (ny.us.hub irc.colosolutions.net)
[05:52] *** Infreq has joined #archiveteam
[06:00] *** pokeball9 has quit IRC (Quit: Connection closed for inactivity)
[06:46] *** bzc6p has joined #archiveteam
[06:48] *** scyther has joined #archiveteam
[06:58] *** Ungstein1 has quit IRC (Ping timeout: 252 seconds)
[06:59] *** Ungstein1 has joined #archiveteam
[07:16] *** vitzli has joined #archiveteam
[07:18] *** asfd has joined #archiveteam
[07:23] *** JesseW has joined #archiveteam
[07:29] *** JesseW has quit IRC (Read error: Operation timed out)
[07:40] *** primus104 has joined #archiveteam
[07:47] *** diskozap has joined #archiveteam
[07:47] *** diskozap has quit IRC (Client Quit)
[08:10] http://forum.vgcw.net/single/?p=8640299&t=11352108
[08:21] *** oli has quit IRC (Ping timeout: 252 seconds)
[08:26] *** philpem has joined #archiveteam
[08:37] *** schbirid has joined #archiveteam
[08:46] *** oli has joined #archiveteam
[08:49] *** scyther has quit IRC (Quit: Leaving)
[08:54] *** insane_al has joined #archiveteam
[09:04] SketchCows: thanks!
[09:06] oops, without the s
[09:11] the world cannot handle more than one SketchCow. it's just not ready.
[09:11] there would be a bovine ignition movement from scared citizens
[09:21] *** Ungstein1 has quit IRC (Quit: Leaving.)
[09:24] yuku grab is started!
[09:26] *** philpem has quit IRC (Ping timeout: 252 seconds)
[09:29] arkiver: I'm getting rate limited?
[09:29] yeah, currently going at 1 item/min
[09:30] the site is very unstable, I'm not sure if that's due to bandwidth
[09:30] if it is, we should go very very slow
[09:30] I'll raise the limit if the site remains stable
[09:31] alright
[10:16] *** Ghost_of_ has joined #archiveteam
[10:22] *** asfd has quit IRC (Quit: Leaving)
[11:22] *** Dark_Star has quit IRC ()
[11:33] *** zerkalo has quit IRC (Ping timeout: 186 seconds)
[11:33] *** thefinn93 has quit IRC (Ping timeout: 186 seconds)
[11:33] *** winr4r has quit IRC (Ping timeout: 186 seconds)
[11:33] *** jmtd is now known as Jon
[11:34] *** Coderjoe has quit IRC (Ping timeout: 186 seconds)
[11:34] *** Coderjoe has joined #archiveteam
[11:34] *** Nemo_bis has quit IRC (Ping timeout: 186 seconds)
[11:34] *** Nemo_bis has joined #archiveteam
[11:35] *** thefinn93 has joined #archiveteam
[11:36] *** zerkalo has joined #archiveteam
[11:38] *** winr4r has joined #archiveteam
[11:45] *** primus104 has quit IRC (Leaving.)
[11:57] *** VADemon has joined #archiveteam
[12:20] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[12:21] *** WinterFox has quit IRC (Remote host closed the connection)
[12:30] *** HCrossSRV has quit IRC (Read error: Operation timed out)
[12:31] *** HCrossSRV has joined #archiveteam
[12:49] *** Ghost_of_ has quit IRC (Quit: Leaving)
[12:50] *** Elegance has quit IRC (Read error: Connection reset by peer)
[12:59] *** Elegance has joined #archiveteam
[13:16] *** zenguy_pc has quit IRC (Read error: Connection reset by peer)
[13:30] *** z00nx has quit IRC (Quit: WeeChat 1.2)
[13:30] *** z00nx has joined #archiveteam
[13:33] *** zenguy_pc has joined #archiveteam
[13:36] *** z00nx has quit IRC (Quit: WeeChat 1.2)
[13:36] *** z00nx has joined #archiveteam
[13:38] *** z00nx has quit IRC (Client Quit)
[13:40] *** z00nx has joined #archiveteam
[13:49] *** primus104 has joined #archiveteam
[14:19] *** VADemon_ has joined #archiveteam
[14:21] Transfers going fine.
[14:22] Yes, already 14 items uploaded.
[14:22] *** VADemon has quit IRC (Read error: Operation timed out)
[14:25] SketchCow: so you also have a presentation on the 31st? It's not listed on the IMPAKT website. When will it be?
[14:26] I speak in Brussels on the 31st and Utrecht on the 1st.
[14:27] Sounds like a busy week
[14:27] PACKED is the Brussels event
[14:28] *** Ghost_of_ has joined #archiveteam
[14:28] http://www.packed.be/
[14:29] Me and midas will be there the 1st of november
[14:29] Looking forward to it :)
[14:43] anything like that in the London area?
[14:45] I've spoken in the London area in the past, but nothing scheduled this year forward.
[14:45] http://devslovebacon.com/conferences/bacon-2014/talks/from-colo-to-yolo-confessions-of-the-angriest-archivist (2014)
[14:46] http://www.thinkingdigital.co.uk might be a good one to look at
[14:46] I get flown to them. I don't fly to them.
[14:46] ah xD
[14:47] There's like six new active Archive Team members, like this HCross guy here, and I probably should say hi to them.
[14:48] Hi SketchCow
[14:48] I mostly notice them (and you, HCross) because they come in demanding answers
[14:48] Potentially good AT material
[14:48] ah
[14:48] First ask "why", then go "what the fuck", then "fuck you", then "I built a tool, good luck negotiating with your ISP next month for overages"
[14:49] haha. Unlimited BW servers here
[14:51] SketchCow: talks being recorded?
[14:51] / published
[14:51] I have no idea.
[14:51] Probably.
[14:52] I could PROBABLY sneak you into the second one, you cheapshit
[14:52] Since it's down the way
[14:53] SketchCow, a very enjoyable talk.
Thanks
[14:53] lol
[14:53] "If you see a cat change colour, RUNe
[14:53] "If you see a cat change colour, RUN"
[14:53] SketchCow: would be appreciated, but I still have to take into account travel cost as well, especially since I'm presently behind on rent :P
[14:53] well
[14:53] not technically behind yet
[14:53] but going to be
[14:54] SO BONE POOR
[14:54] I could certainly see about it
[14:54] (it's like 27 euro roundtrip by train, from Dordrecht <-> Utrecht)
[14:57] Cling to the bottom of the train
[15:00] lol
[15:02] *** pokeball9 has joined #archiveteam
[15:13] * midas fires up google maps
[15:13] arkiver: whereabouts are you?
[15:14] amsterdam
[15:14] you?
[15:14] naarden
[15:14] so im in between utrecht / amsterdam
[15:14] yes
[15:15] are you going all day or only to SketchCow's talk?
[15:15] probably just the talk, i might go all day
[15:15] joepie91: if in need, i can pick you up and bring you home again. yadayada companycar
[15:17] *** midas sets mode: +o joepie91
[15:17] midas: that'd be a viable option :P cc SketchCow
[15:17] *** midas sets mode: +oo ersi Nemo_bis
[15:18] it's just a 160k roundtrip :p
[15:18] but companycar!
it's free
[15:18] midas: calculate in ~10-20 minutes of getting completely lost, though, this city is a nightmare to navigate by car
[15:18] and I have seen exactly one satnav get it right
[15:18] lol
[15:18] just made the Yuku grab work on ARM
[15:19] apple maps will fail at it, badly
[15:19] but ill give it a try
[15:19] midas: it's all one-way streets everywhere, you make a single wrong turn and you end up having to make a full circle around the city center to get back where you were
[15:19] :P
[15:19] thats why im using apple maps, i like to see the city a couple of times
[15:19] dutch-bs
[15:20] daww
[15:38] *** Ghost_of_ has quit IRC (Quit: Leaving)
[16:06] *** nertzy has joined #archiveteam
[16:18] *** JesseW has joined #archiveteam
[16:18] *** jmad980 has quit IRC (Ping timeout: 252 seconds)
[16:30] *** JesseW has quit IRC (Read error: Operation timed out)
[16:34] *** jmad980 has joined #archiveteam
[16:45] *** Start has quit IRC (Read error: Connection reset by peer)
[16:46] *** Start has joined #archiveteam
[16:47] bzc6p: available for myvip questions?
[16:49] Yes. But we should probably set up a channel for that.
[16:49] let's think of a channel then
[16:50] #byevip ?
[16:50] yes, let's do #byevip
[16:51] #nosovip
[16:51] notsovip
[16:51] *** philpem has joined #archiveteam
[16:53] everyone: #byevip or #notsovip ?
[16:55] In fact, byevip may sound better only in Hungarian, as we pronounce it [maivip] instead of [maiviaipi:].
[16:55] SketchCow: you talked about archiving external links from mediawiki's some time ago.
[16:55] Yes
[16:55] IO
[16:55] I think we'll archive those with the upcoming wikis project
[16:55] I'd like that very very much, starting with fileformats.archiveteam.org
[16:56] Basically that project will grab all wikis into WARCs, I'll also make sure we can grab all external links from wikis into WARCs
[16:56] So those will be done through the warrior I think
[16:56] That ok?
[16:57] arkiver: I'm away for like 20 minutes, don't you mind?
[16:57] Nope
[16:58] You may list your questions in the appropriate channel until then. Thanks for your time, by the way!
[16:58] ;)
[17:01] *** JesseW has joined #archiveteam
[17:03] *** gibigiana has quit IRC (Ping timeout: 252 seconds)
[17:03] So, I've been handed nearly the entire metadata (grabbed HTML and images) of MP3.COM.
[17:03] It's safely ensconced behind a dark object at the archive now. About 5gb
[17:04] *** Lord_Nigh has quit IRC (Ping timeout: 252 seconds)
[17:04] *** vtyl has quit IRC (Ping timeout: 252 seconds)
[17:04] nice
[17:05] *** is- has quit IRC (Ping timeout: 252 seconds)
[17:05] *** lukeman has quit IRC (Ping timeout: 252 seconds)
[17:06] *** JesseW has quit IRC (Read error: Operation timed out)
[17:06] *** is- has joined #archiveteam
[17:07] *** dan- has quit IRC (Ping timeout: 252 seconds)
[17:07] *** lytv has joined #archiveteam
[17:07] *** wp494_ has joined #archiveteam
[17:08] *** balrog has quit IRC (Ping timeout: 252 seconds)
[17:08] *** zhongfu has quit IRC (Ping timeout: 252 seconds)
[17:08] *** tephra has quit IRC (Ping timeout: 252 seconds)
[17:08] *** tephra has joined #archiveteam
[17:08] *** wp494 has quit IRC (Ping timeout: 252 seconds)
[17:09] *** Baljem_ has joined #archiveteam
[17:09] *** Selanda has quit IRC (Ping timeout: 252 seconds)
[17:10] *** vitzli has quit IRC (Leaving)
[17:11] *** goekesmi has quit IRC (Ping timeout: 252 seconds)
[17:11] *** bai has quit IRC (Ping timeout: 252 seconds)
[17:11] *** gibigiana has joined #archiveteam
[17:11] *** goekesmi has joined #archiveteam
[17:11] *** Selanda has joined #archiveteam
[17:12] *** bai has joined #archiveteam
[17:12] *** diacope has quit IRC (Ping timeout: 252 seconds)
[17:13] *** wacky_ has quit IRC (Ping timeout: 252 seconds)
[17:13] *** Baljem has quit IRC (Ping timeout: 252 seconds)
[17:14] *** Kenshin has quit IRC (Ping timeout: 252 seconds)
[17:15] *** wacky has joined #archiveteam
[17:15] *** Lord_Nigh has joined #archiveteam
[17:17] *** Kenshin has joined
#archiveteam
[17:17] *** zhongfu has joined #archiveteam
[17:17] *** lukeman has joined #archiveteam
[17:18] *** dan- has joined #archiveteam
[17:23] *** balrog has joined #archiveteam
[17:24] *** Kenshin has quit IRC (Quit: ZNC - http://znc.in)
[17:24] *** Kenshin has joined #archiveteam
[17:29] *** diacope has joined #archiveteam
[17:36] *** arkiver2 has joined #archiveteam
[17:36] *** wp494_ is now known as wp494
[17:42] So, my plan is to work with Archive Team members to convert it back into WARCS and shove it into the archive.
[17:42] *** diacope has quit IRC (Ping timeout: 252 seconds)
[17:43] *** arkiver2 has quit IRC (Ping timeout: 252 seconds)
[17:43] *** Fletcher has quit IRC (Ping timeout: 252 seconds)
[17:43] *** dan- has quit IRC (Ping timeout: 252 seconds)
[17:44] *** Famicoman has quit IRC (Ping timeout: 252 seconds)
[17:44] *** sivoais has quit IRC (Ping timeout: 252 seconds)
[17:44] *** balrog has quit IRC (Ping timeout: 252 seconds)
[17:44] *** wacky has quit IRC (Ping timeout: 252 seconds)
[17:44] *** joepie91 has quit IRC (Ping timeout: 252 seconds)
[17:44] *** WubTheCap has quit IRC (Ping timeout: 252 seconds)
[17:44] *** wacky has joined #archiveteam
[17:44] *** Sue_ has quit IRC (Ping timeout: 252 seconds)
[17:45] *** sivoais has joined #archiveteam
[17:51] *** joepie91 has joined #archiveteam
[17:52] *** aaaaaaaaa has joined #archiveteam
[17:55] *** WubTheCap has joined #archiveteam
[17:56] *** dan- has joined #archiveteam
[17:56] *** diacope has joined #archiveteam
[17:57] This brings the classic issue
[17:57] Of creating WARCs from nowhere
[17:57] *** balrog has joined #archiveteam
[18:00] *** Sue_ has joined #archiveteam
[18:04] *** bzc6p_ has joined #archiveteam
[18:09] SketchCow: Main problem is the recreation of request and response headers
[18:09] *** bzc6p has quit IRC (Read error: Operation timed out)
[18:13] Agreed, a host of issues.
[18:13] I've started with the foundation, of course. Would this break wayback?
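Editor's note on the 18:09 point about recreating request and response headers: the following is a minimal sketch, not the approach anyone in the channel confirmed using. It assumes only the page bytes and the original URL survive, so the HTTP status line and headers are synthesized rather than recovered. The function name `build_warc_response` and its parameters are hypothetical, and only the Python standard library is used.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def build_warc_response(url: str, body: bytes,
                        content_type: str = "text/html",
                        captured_at: Optional[datetime] = None) -> bytes:
    """Wrap saved page bytes in a single WARC/1.0 'response' record.

    Assumption: the original server response was never stored, so the
    "200 OK" status line and the HTTP headers below are invented.
    captured_at should be a timezone-aware datetime.
    """
    when = (captured_at or datetime.now(timezone.utc)).astimezone(timezone.utc)

    # Synthesized HTTP response: status line, headers, blank line, payload.
    http_block = (
        b"HTTP/1.1 200 OK\r\n"
        + f"Content-Type: {content_type}\r\n".encode("utf-8")
        + f"Content-Length: {len(body)}\r\n".encode("utf-8")
        + b"\r\n"
        + body
    )

    # WARC record header; its Content-Length counts the whole HTTP block.
    warc_header = "\r\n".join([
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {url}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_block)}",
    ]).encode("utf-8")

    # A record is header, blank line, block, then two CRLFs.
    return warc_header + b"\r\n\r\n" + http_block + b"\r\n\r\n"
```

A record built this way cannot be told apart from a real capture by its structure alone, which is the trust problem raised later in the log. Whether replay tools accept such records is the open question asked at 18:13, and this sketch does not answer it.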
[18:13] (Asking our engineers)
[18:17] This dubstep is making me forget everybody sucks
[18:18] *** Fletcher has joined #archiveteam
[18:18] i kinda want to sneak in some fake warc data some day
[18:18] it's crazy that we can do it
[18:19] This entire situation is built on a whole set of trust built on my name and reputation with the archive.
[18:19] Violate it and we won't even remember where you once stood
[18:19] the problem is that you cannot trust people, who knows who sneaks in stuff :(
[18:20] I find they tend to be little blabbermouths who mention it in a general channel.
[18:20] So the external URL grabbing script is working
[18:21] The fileformats wii has 17381 external URLs
[18:22] wiki*
[18:24] Yep
[18:25] chfoo: can you please create a FOS rsync target for 'wikis'?
[18:26] chfoo: we'll be grabbing full wikis and external links from wikis
[18:54] *** Famicoman has joined #archiveteam
[18:57] *** insane_al has quit IRC (Leaving)
[19:03] *** bzc6p_ is now known as bzc6p
[19:17] didn't you already upload that mp3.com stuff ages ago
[19:18] https://archive.org/details/mp3com-skeleton
[19:20] *** insane_al has joined #archiveteam
[19:23] *** Ghost_of_ has joined #archiveteam
[19:39] *** Start has quit IRC (Quit: Disconnected.)
[19:39] *** Start has joined #archiveteam
[19:40] ha ha
[19:40] mmmaybe
[19:42] The IA is now getting a copy of "Justin Bieber OS"
[19:44] *** Start_ has joined #archiveteam
[19:45] Thank god
[19:45] it needs to run on all the IA servers now
[19:46] *** Start has quit IRC (Read error: Operation timed out)
[19:46] [19:46:20] <Major> HCross: Your job for http://biebian.sourceforge.net/ has finished.
[19:47] arkiver: ok, done
[19:47] *** Start has joined #archiveteam
[19:51] *** Start_ has quit IRC (Ping timeout: 310 seconds)
[19:52] It's Linux, at least.
[19:54] By the way, what's the news with sourceforge?
[19:55] no reply yet
[19:56] then there won't be any
[19:57] Assume they're all dead
[19:57] http://i.ytimg.com/vi/PE-CfJ190x4/maxresdefault.jpg
[19:57] *** SimpBrain has quit IRC (Read error: Operation timed out)
[19:58] *** Start has quit IRC (Quit: Disconnected.)
[20:10] *** Start has joined #archiveteam
[20:16] *** schbirid has quit IRC (Quit: Leaving)
[20:19] *** insane_al has quit IRC (Leaving)
[20:23] with both sourceforge and google code, for the source code items, what exactly is the plan for them?
[20:24] I mean one item per repo seems a little much. But I don't know if randomly stuffed in packs is that useful
[20:25] I think alphabetical packing
[20:26] the problem with that is they would need to be grabbed alphabetically. Plus some letters will have way more than others, like l, x, and p
[20:27] 2.7tb of gamefront being uploaded at the moment.
[20:30] *** scyther has joined #archiveteam
[20:33] *** SimpBrain has joined #archiveteam
[21:02] Where going to start the wikis project for external URLs!
[21:02] Who wants to take the (big) fileformats archiveteam wiki item?
[21:03] ooops, we're I mean
[21:04] sorry, I'm a bit tired
[21:05] the wikis project should be able to grab referenced youtube videos, imho
[21:05] what is the most urgent one?
[21:05] currently that grab is not able to grab youtube videos
[21:05] although youtube is stable...
[21:06] HCross: you want to fileformats wiki?
[21:06] *** Meeh has quit IRC (Remote host closed the connection)
[21:06] I meant, is the wiki project more important vs the Yuku
[21:07] yuku is more important, but we currently don't need a lot more concurrent on that grab
[21:07] I will stay on Yuku with my 40 workers
[21:08] you already have 40 on it?
yeah, please keep them on yuku then
[21:08] 2x20
[21:20] *** arkiver2 has joined #archiveteam
[21:20] *** arkiver2 has quit IRC (Client Quit)
[21:41] *** WubTheCap has quit IRC (Quit: Leaving)
[21:59] *** scyther has quit IRC (Read error: Connection reset by peer)
[22:03] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[22:10] yuku-discovery: PR, added DNSdumpster and penetration-tools https://github.com/chpwssn/yuku-discovery/pull/1
[22:27] VADemon_: thanks, but we found a sitemap a while ago, so we have all sites
[22:27] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
[22:27] *** aaaaaaaaa has joined #archiveteam
[22:44] *** RichardG has quit IRC (Read error: Connection reset by peer)
[22:47] *** RichardG has joined #archiveteam
[22:51] ah yea, it has the sitemap.xml too, but the script still reported added entries
[22:56] *** dashcloud has joined #archiveteam
[22:58] arkiver, amazonklubben.yuku.com is not in the sitemap.xml for example
[23:04] VADemon_: nice find!
[23:04] well, will have a look at that later, afk now for the night
[23:25] *** dashcloud has quit IRC (Read error: Operation timed out)
[23:27] *** dashcloud has joined #archiveteam
[23:55] *** dan- has quit IRC (Ping timeout: 252 seconds)
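Editor's note on the 22:27 to 22:58 exchange: the check described there (are the hostnames found by other discovery methods already in sitemap.xml?) might look roughly like the sketch below. It assumes the sitemap follows the standard sitemaps.org `urlset`/`loc` layout, which the log does not confirm for Yuku. The function names and the sample sitemap are hypothetical; only `amazonklubben.yuku.com` comes from the log.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace used by standard sitemaps.org files (assumed here).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def hosts_in_sitemap(sitemap_xml):
    """Return the set of lowercase hostnames found in <loc> elements."""
    root = ET.fromstring(sitemap_xml)
    hosts = set()
    for loc in root.iter(SITEMAP_NS + "loc"):
        if loc.text:
            host = urlparse(loc.text.strip()).hostname
            if host:
                hosts.add(host.lower())
    return hosts


def missing_from_sitemap(candidates, sitemap_xml):
    """Return candidate hostnames the sitemap does not list, sorted."""
    known = hosts_in_sitemap(sitemap_xml)
    return sorted(h.lower() for h in candidates if h.lower() not in known)


# Hypothetical sitemap content, for illustration only.
sample_sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://exampleboard.yuku.com/</loc></url>"
    "</urlset>"
)

found_elsewhere = ["exampleboard.yuku.com", "amazonklubben.yuku.com"]
print(missing_from_sitemap(found_elsewhere, sample_sitemap))
```

Anything this reports would be a candidate for the item list in addition to what the sitemap already provides.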