[00:11] *** Start has joined #archiveteam [00:23] *** maseck_ has quit IRC (Quit: No Ping reply in 180 seconds.) [00:23] *** maseck has joined #archiveteam [00:42] *** hive-mind has quit IRC (Ping timeout: 260 seconds) [00:43] *** hive-mind has joined #archiveteam [00:45] *** wyatt8740 has quit IRC (Read error: Operation timed out) [00:45] *** wyatt8740 has joined #archiveteam [00:52] *** Ghost_of_ has joined #archiveteam [00:55] *** Atom__ has quit IRC (Read error: Connection reset by peer) [00:56] *** Atom__ has joined #archiveteam [01:00] *** wyatt8740 has quit IRC (Read error: Operation timed out) [01:01] *** wyatt8740 has joined #archiveteam [01:28] *** kyan has joined #archiveteam [01:51] *** vitzli has joined #archiveteam [02:01] *** Stiletto has quit IRC (Read error: Connection reset by peer) [02:09] *** philpem has quit IRC (Ping timeout: 260 seconds) [02:13] *** JesseW has joined #archiveteam [02:51] *** Ghost_of_ has quit IRC (Quit: Leaving) [03:20] *** vtyl has joined #archiveteam [03:21] *** lytv has quit IRC (Read error: Operation timed out) [03:21] *** einstein9 has joined #archiveteam [03:45] *** kyan has quit IRC (This computer has gone to sleep) [03:48] *** vitzli has quit IRC (Leaving) [04:06] *** lukeman has joined #archiveteam [04:13] Christopher Rush, one of the early Magic: the Gathering card artists, has died [04:17] *** ianweller has left [04:25] *** mutoso has quit IRC (Read error: Operation timed out) [04:34] *** mutoso has joined #archiveteam [04:43] *** mutoso has quit IRC (Ping timeout: 252 seconds) [04:50] *** mutoso has joined #archiveteam [04:51] *** Coderjoe has quit IRC (Read error: Operation timed out) [05:17] *** mutoso has quit IRC (Read error: Operation timed out) [05:21] *** Coderjoe has joined #archiveteam [05:22] *** ndiddy has quit IRC (Quit: Leaving) [05:41] *** Sk1d has quit IRC (Ping timeout: 200 seconds) [05:47] *** Sk1d has joined #archiveteam [06:03] *** vitzli has joined #archiveteam [06:11] *** vitzli has quit IRC 
(Leaving) [06:22] *** WinterFox has joined #archiveteam [06:22] *** vitzli has joined #archiveteam [07:26] *** megaminxw has joined #archiveteam [07:56] Are any projects hitting FOS particularly hard? I've dropped to 1MB/s upload [07:57] GameTrailers is ~1GB per video [07:58] ah, that may be where the problem is [07:59] (my archivebot upload directory has ballooned to 600G) [07:59] *** schbirid has joined #archiveteam [08:03] Fletcher: which pipelines do you run, anyway? [08:04] F_* [08:05] ah, cool [08:06] thank you for running those [08:06] Do you know who runs the aupipe ones? [08:07] no idea I'm afraid :/ [08:08] should be able to trace it back to when the ssh key was submitted though [08:09] hm, that's not public info, though, right? [08:10] and what about nico-only_at_home (which seems to have been stuck for a couple of weeks)? [08:11] *** signius has quit IRC (Remote host closed the connection) [08:13] *** signius has joined #archiveteam [08:18] yipdw would be the only one with key info [08:18] *** JesseW has quit IRC (Quit: Leaving.) [08:18] trs80 runs aupipe [08:19] I think I have SSH access to it from the control node [08:32] *** kyan has joined #archiveteam [08:48] *** kyan has quit IRC (Leaving) [09:03] hmm? yeah, I have aupipe [09:03] jessew, fletcher: ^^ [09:08] *** Stiletto has joined #archiveteam [09:23] *** Sk1d has left [09:34] *** Stilett0 has joined #archiveteam [09:39] *** Stiletto has quit IRC (Read error: Operation timed out) [09:41] *** Stilett0 is now known as Stiletto [09:51] *** lukeman has quit IRC (Quit: My MacBook Pro has gone to sleep. 
ZZZzzz…) [09:57] *** Stilett0 has joined #archiveteam [09:57] *** Stiletto has quit IRC (Read error: Connection reset by peer) [10:01] *** Stilett0 is now known as Stiletto [10:01] *** Stiletto has quit IRC (Remote host closed the connection) [10:01] *** Stiletto has joined #archiveteam [10:04] *** mutoso has joined #archiveteam [10:13] Gah [10:13] OK [10:38] Load average of FOS is now up to 40, what could go wrong [10:39] could go up to 80 [10:41] Could go down to 0 [10:43] *** vtyl has quit IRC (Quit: Leaving) [10:45] *** Cameron_D has joined #archiveteam [10:49] *** lytv has joined #archiveteam [10:56] *** bzc6p has joined #archiveteam [10:56] *** swebb sets mode: +o bzc6p [10:58] *** Stiletto is now known as Stilett0 [11:02] *** alberto has joined #archiveteam [11:06] lol [11:06] Chorca: you said 40 terabytes? That's awesome. [11:07] einstein9: Not in Hungary. [11:08] k [11:13] *** bzc6p sets mode: +oooo achip aliz chazchaz chfoo [11:13] *** bzc6p sets mode: +oooo chfoo- closure Coderjoe Ctrl-S___ [11:13] *** bzc6p sets mode: +oooo dashcloud Fletcher Fletcher_ Fusl [11:13] *** bzc6p sets mode: +oooo GLaDOS godane HCross2 Infreq_ [11:14] *** bzc6p sets mode: +oooo ivan` joepie91 Kazzy Kenshin [11:14] *** bzc6p sets mode: +oooo midas Muad-Dib Nemo_bis ohhdemgir [11:14] *** bzc6p sets mode: +oooo phuzion PurpleSym Sanqui schbirid [11:14] *** bzc6p sets mode: +oooo SimpBrain SmileyG Start trs80 [11:14] *** bzc6p sets mode: +ooo vitzli wp494 wyatt8740 [11:43] *** bzc6p_ has joined #archiveteam [11:43] *** swebb sets mode: +o bzc6p_ [11:44] *** bzc6p has quit IRC (Ping timeout: 250 seconds) [11:55] *** WinterFox has quit IRC (Remote host closed the connection) [12:01] *** einstein9 has quit IRC (Read error: Operation timed out) [12:05] *** bzc6p_ has quit IRC (Ping timeout: 250 seconds) [12:39] *** bzc6p has joined #archiveteam [12:39] *** swebb sets mode: +o bzc6p [12:43] *** signius_ has joined #archiveteam [12:44] *** arkiver3 has joined #archiveteam [12:45] *** 
signius has quit IRC (Read error: Operation timed out) [12:51] *** arkiver3 has quit IRC (Ping timeout: 252 seconds) [12:51] *** arkiver3 has joined #archiveteam [12:55] *** arkiver3 has quit IRC (Ping timeout: 252 seconds) [12:56] *** arkiver3 has joined #archiveteam [12:56] *** signius_ has quit IRC (Remote host closed the connection) [12:58] *** weles has joined #archiveteam [12:58] *** signius has joined #archiveteam [13:15] *** bzc6p has quit IRC (Ping timeout: 250 seconds) [13:17] *** Atom-- has joined #archiveteam [13:20] *** arkiver3 has quit IRC (Ping timeout: 252 seconds) [13:20] *** Atom__ has quit IRC (Ping timeout: 252 seconds) [13:20] *** Atom__ has joined #archiveteam [13:21] *** Atom-- has quit IRC (Ping timeout: 252 seconds) [13:22] *** arkiver3 has joined #archiveteam [13:22] *** bzc6p has joined #archiveteam [13:22] *** swebb sets mode: +o bzc6p [13:27] *** Stilett0 has quit IRC (Read error: Operation timed out) [13:28] http://www.factmag.com/2016/02/11/soundcloud-financial-report-44m-losses/ [13:36] https://soundcloud.com/people/directory [13:37] track ids go up to ~300 million [13:43] *** Stiletto has joined #archiveteam [13:46] *** arkiver3 has quit IRC (Ping timeout: 252 seconds) [14:10] *** megaminxw has quit IRC (Quit: Leaving.) [14:15] *** arkiver3 has joined #archiveteam [14:37] *** [phire] has quit IRC (Quit: ZNC - http://znc.in) [14:38] *** vOYtEC_ has quit IRC (Quit: rm -r *) [14:54] *** [phire] has joined #archiveteam [14:59] *** Start has quit IRC (Quit: Disconnected.) [15:00] that's a whole lot of data.. [15:03] shit. 
http://www.factmag.com/2016/02/11/soundcloud-financial-report-44m-losses/ [15:03] I see people posted it [15:03] ugh the copyright cartel has been after them [15:04] even though they are mostly user-generated content [15:04] That's like how the RIAA requires any public place that allows bands to play to get a license even if they require the bands to only play original music "just in case" they accidentally play a cover [15:06] There was a discussion about SoundCloud back in August: http://archive.fart.website/bin/irclogger_log/archiveteam?date=2015-08-27,Thu&sel=180#l176 [15:06] SketchCow There's no way we can get Soundcloud [15:06] "2.5 PB of data" [15:06] yikes. [15:07] (it was a quote) [15:07] but how do you identify the original content? [15:07] is there a way to get certain channels? [15:08] youtube-dl supports SoundCloud [15:08] So if you have any artists you follow now is the time to rip them [15:08] grr lots of podcasts host there [15:09] yep, going after startalk radio [15:09] thank you [15:22] *** bzc6p has left [15:24] If we want a project for SoundCloud and SketchCow agrees we can start a project [15:24] maybe do census over wikipedia external links to soundcloud [15:24] & [15:24] dammit, it was a question mark, sorry [15:27] Who is recording the gravitational waves livestream?(!!) [15:28] start in 2 minutes! [15:28] can't record it from where I am, so someone please record [15:29] URL? [15:31] https://www.youtube.com/watch?v=c7293kAiPZw [15:33] arkiver: Can I just throw the URL into youtube-dl? [15:34] I don't think it records livestreams [15:34] What's the recommended method then? [15:34] But I think the YouTube stream will also be online as a video after the stream [15:34] and I'm totally sure someone else is recording this [15:35] With youtube-dl mpegts complains about continuity errors, sigh. 
[15:36] arkiver3: see #archivebot re: al jazeera [15:37] MrRadar: arkiver3: yes, youtube-dl records livestreams in theory, but it has never worked for me [15:38] Hmm, youtube-dl is grabbing a bunch of segments for me though I don't know if it will continue [15:39] *** Start has joined #archiveteam [15:46] *** Zei-Pii has joined #archiveteam [15:51] *** zzqw has quit IRC (Ping timeout: 252 seconds) [15:52] *** z00nx has quit IRC (Ping timeout: 252 seconds) [15:52] *** Fletcher has quit IRC (Ping timeout: 252 seconds) [15:52] *** goekesmi has quit IRC (Ping timeout: 250 seconds) [15:53] soundcloud STORING 2.5 PB will be including original formats (wav etc) and private tracks [15:53] *** Rickster has quit IRC (Ping timeout: 260 seconds) [15:54] *** HCross has quit IRC (Ping timeout: 250 seconds) [15:55] *** rduser has quit IRC (Ping timeout: 260 seconds) [15:55] *** Famicoman has quit IRC (Ping timeout: 260 seconds) [15:56] *** midas has quit IRC (Ping timeout: 260 seconds) [15:56] *** ivan` has quit IRC (Read error: Operation timed out) [15:57] *** arkiver3 has quit IRC (Quit: Nettalk6 - www.ntalk.de) [15:57] *** Test__ has joined #archiveteam [15:58] *** sevs44936 has quit IRC (Ping timeout: 633 seconds) [15:59] *** Test__ has quit IRC (Client Quit) [16:02] *** Rickster has joined #archiveteam [16:03] *** hawc145 has joined #archiveteam [16:03] *** z00nx has joined #archiveteam [16:04] *** sevs44936 has joined #archiveteam [16:05] *** rduser has joined #archiveteam [16:05] *** midas has joined #archiveteam [16:09] *** zzqw has joined #archiveteam [16:09] *** marvinw has joined #archiveteam [16:11] *** godane has quit IRC (Read error: Operation timed out) [16:14] *** goekesmi has joined #archiveteam [16:16] "youtube-dl http://api.soundcloud.com/tracks/182804938" works by id. gets best audio format. 
this one is wav (and a fine mix) [16:16] *** sevs44936 has quit IRC (Read error: Operation timed out) [16:17] *** sevs44936 has joined #archiveteam [16:19] *** K4k has joined #archiveteam [16:21] *** Start has quit IRC (Quit: Disconnected.) [16:21] *** Famicoman has joined #archiveteam [16:24] *** Start has joined #archiveteam [16:25] *** Start has quit IRC (Remote host closed the connection) [16:25] *** Start has joined #archiveteam [16:34] *** godane has joined #archiveteam [16:36] 1. Please go after Al-Jazeera America. [16:36] 2. Please at least go after the most popular soundcloud stuff [16:40] *** Fletcher has joined #archiveteam [16:42] I can send out what I built with regard to the soundcloud grab. [16:42] Userlists, etc ... [16:44] *** hawc145 is now known as HCross [16:51] Might be worth trying to grab all the CBC podcasts off there, as they have a history of quietly disappearing over time, elsewhere. [16:52] Please do [16:56] Can anyone think of other gov't-run stuff on Soundcloud? I see the Voice of America has several thousand programs in Chinese, and Germany's Deutsche Welle has almost 2300 items going back to at least 2012. [16:57] Surely they'll preserve "user data" online! http://www.theguardian.com/media/2016/feb/11/time-inc-buys-what-is-left-of-myspace-for-its-user-data [17:03] arkiver, looks like the NSF livestream ended. I have 1.3GB of it, though I'm not sure if it's viewable or not. [17:04] #soundbutt ? :) [17:04] *** vOYtEC has joined #archiveteam [17:05] LOL [17:05] During the last soundcloud scare we used #soundclown [17:06] *** Fletcher_ sets mode: +o Fletcher [17:06] probably should reuse that for consistency [17:06] yeah [17:07] *** Start has quit IRC (Quit: Disconnected.) [17:07] *** marvinw is now known as ivan` [17:11] *** JesseW has joined #archiveteam [17:20] FWIW, youtube-dl will (eventually...) 
grab all of a Soundcloud user's tracks if you point it at soundcloud.com/username/tracks [17:25] Can whoever is running a newsbuddy instance without asking please stop. [17:27] Yikes.. got a gametrailers item here that's going to take 116 hours to rsync! [17:27] all good fun. :) [17:27] My archive of the soundcloud crawl that I did in November: https://www.amazon.com/clouddrive/share/7thOzbwVF2hwD5iVfXYESu1DhaksPGHDFZBJWs9FQIU?ref_=cd_ph_share_link_copy [17:28] It's a mysqldump of the user data that I accumulated - hundreds of millions of users - I targetted the most followed and the people with the most followers. 2.5G bzipped. Includes the url to their profile, their user-id, username, follower number, followings number & avatar url. From that data, you can crawl their content pretty easily. Soundcloud eventually blocked me from crawling them, but I was able to crawl them for 5-6 months before they [17:28] found and blocked my IP. [17:30] *** JesseW has quit IRC (Quit: Leaving.) [17:32] *** JesseW has joined #archiveteam [17:33] *** schbirid has quit IRC (Quit: Leaving) [17:35] *** JesseW has quit IRC (Client Quit) [17:36] ... 1.5MB/sec from amazon cloud drive [17:36] *** schbirid has joined #archiveteam [17:48] *** Fletcher_ has quit IRC (Quit: WeeChat 0.4.3) [17:49] *** Tomcat_ has joined #archiveteam [17:50] *** Fletcher_ has joined #archiveteam [17:53] Yea, ACD isn't the most high-speed thing in the world. :) [17:53] but I got unlimited storage for a year for $5. :) [17:55] Hard to beat that. :) [17:58] What offer is that :o [18:01] It was a Christmas thing I think. The normal unlimited ACD plan is $50/yr [18:01] Nemo_bis: Amazon had a promotional offer for 1 year of ACD for $5 [18:01] or a black friday thing [18:01] yeah [18:01] *** Fletcher sets mode: +o Fletcher_ [18:02] black friday deal, got one too [18:02] They may do it again this year - it was pretty popular with data hoarders [18:04] how big is the database when imported? 7-8GB? 
[18:05] Not sure, sorry. [18:05] Probably larger than 10GB [18:21] *** rduser has quit IRC (Read error: Operation timed out) [18:34] *** hive-mind has quit IRC (Ping timeout: 260 seconds) [18:36] *** hive-mind has joined #archiveteam [18:37] *** Tomcat_ has quit IRC (Ping timeout: 362 seconds) [18:42] *** Start has joined #archiveteam [18:43] *** vitzli has quit IRC (Leaving) [18:44] http://www.factmag.com/2016/02/11/soundcloud-financial-report-44m-losses/ [18:45] what's ACD? [18:46] Amazon Cloud Drive [18:49] *** Tomcat_ has joined #archiveteam [19:04] FOS is holding up [19:04] 69% full [19:04] But stuff is going out [19:12] man, GT uploads going so slow [19:16] Well, you're hammering the literal hell out of the machine. [19:16] haha, i assumed. When we first started i hit like 25MB/s [19:19] *** Start has quit IRC (Quit: Disconnected.) [19:23] *** Start has joined #archiveteam [19:25] *** rduser has joined #archiveteam [19:40] *** Tomcat_ has quit IRC (Ping timeout: 362 seconds) [19:43] Oh, and here's my crawler code, BTW: https://github.com/scumola/soundcloud-crawler If you want to duplicate my crawl setup, you'll need mysql, rabbitmq and couchbase (or memcache) running somewhere on your network. [19:44] I would choose Al Jazeera over Soundcloud [19:44] But first we need to get through these two others. [19:44] We also maybe need another rsync target [19:44] That then pushes to FOS before I upload [19:46] *** RichardG has quit IRC (Ping timeout: 250 seconds) [19:52] *** RichardG has joined #archiveteam [19:54] *** scyther has joined #archiveteam [20:04] many sites failing, fos needs to breathe :P [20:05] Yeah, it's been crazy lately [20:05] SketchCow: FOS can handle them, GameTrailers is almost done [20:12] *** LibreWulf has joined #archiveteam [20:12] chfoo: can you please send me the logs of gametrailers? [20:14] *** Sk1d has joined #archiveteam [20:14] arkiver: ok, give me a few minutes [20:14] thanks! 
I'd like to make sure everything went well [20:15] I'm not sure if anyone had heard of this, but there are rumors that soundcloud is having tough financial times. [20:15] #soundclown [20:16] yeah essentially. what on earth will I do without my shitty joke mixes [20:16] But I did figure I'd stop in and at least mention it. Several sites and blogs and whatnot are doubting they'll survive a lot longer [20:17] no, they meant, we literally have channel #soundclown for it [20:17] Oh, really? I had no clue, thanks [20:18] *** megaminxw has joined #archiveteam [20:19] *** Sk1d has quit IRC (Ping timeout: 250 seconds) [20:25] *** Sk1d has joined #archiveteam [20:27] *** dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.) [20:27] *** ploopkazo has joined #archiveteam [20:28] *** dashcloud has joined #archiveteam [20:33] *** LibreWulf has quit IRC (Quit: changing clients) [20:44] *** Tomcat_ has joined #archiveteam [20:45] *** Start has quit IRC (Quit: Disconnected.) [20:46] *** Zei-Pii has quit IRC (Read error: Connection reset by peer) [20:52] *** megaminxw has quit IRC (Quit: Leaving.) [20:58] *** useretail has quit IRC (Ping timeout: 252 seconds) [21:01] *** nickname_ has joined #archiveteam [21:01] Atluxity: are you coming back to the gametrailers grab? [21:01] soundclown is the name of a joke service, parodying soundcloud. [21:02] arkiver: I am unable to... the items are too big [21:03] Seems BnAboyZ has you covered though [21:03] I restart the grab a couple times a day, and they go through a pile of items, then get filled [21:03] :\ [21:04] if the pipeline could recognize a full disk, remove it, and ask for a new item, that would help me a lot [21:04] *remove the current item from disk [21:08] How would you use youtube-dl to download json info from soundcloud? 
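On the 21:08 question about getting JSON info out of youtube-dl: a minimal sketch, assuming youtube-dl is installed and on PATH. The helper names here are made up for illustration; `--dump-json` is the youtube-dl option that simulates a download and prints metadata as JSON instead of fetching the media. The track URL form is the one quoted at 16:16.

```python
import json
import subprocess


def track_url(track_id):
    """API-style SoundCloud track URL, in the form quoted earlier in the log."""
    return "http://api.soundcloud.com/tracks/{}".format(track_id)


def metadata_command(url):
    """Build a youtube-dl command line that prints metadata only."""
    # --dump-json: simulate, stay quiet, print one JSON object per entry.
    return ["youtube-dl", "--dump-json", url]


def fetch_metadata(url):
    """Run youtube-dl and parse its output (needs network access)."""
    result = subprocess.run(
        metadata_command(url), check=True, capture_output=True, text=True
    )
    # A user page or playlist yields several lines, one JSON document each.
    return [json.loads(line) for line in result.stdout.splitlines() if line.strip()]
```

Calling `fetch_metadata(track_url(182804938))` would return a one-element list for a single track; pointing it at a `soundcloud.com/username/tracks` page should give one entry per track, though that can take a while.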
[21:09] *** weles has quit IRC (Read error: Operation timed out) [21:09] Ignore it, it's an offtopic question [21:18] *** useretail has joined #archiveteam [21:19] SketchCow: FOS can handle them, GameTrailers is almost done [21:19] lol, I only just found out :") [21:19] about GT dying I mean [21:33] The uploads are going slow, but not that slow. [21:39] SketchCow: we're discussing what to do in #soundclown [21:39] we're thinking discovery to get the most popular tracks [21:39] then we'll decide on what to grab [21:41] Is there a channel for discussing Al Jazeera? [21:41] We'll do a small grab of all links of al jazeera from the sitemap [21:41] that should have all articles [21:41] but the project will start at the weekend [21:41] There's also the ArchiveBot grab that's been running for about a month [21:42] Can't archive.org just archive Al Jazeera America? [21:42] It looks to be a pretty normal website. [21:42] I think we're the contributors to IA that go into specific sites [21:42] IA goes more into wide crawls [21:42] Ahh. [21:43] SketchCow would know best though [21:43] It would also be nice for future researchers to be able to download WARCs with the entire site [21:43] I could just spin up an archive crawler myself - I've done that in the past. [21:43] Instead of having to trudge through the Wayback Machine [21:43] Generally, they do focused crawls here and there but now they check first to see if Archive Team isn't already on the case. [21:43] I love the newbies in here [21:43] Never gets old [21:43] * SketchCow is working on a wiki entry for how to make items run in the Archive [21:43] Heritrix - I've used that before. [21:44] I could just fire it up and point it at Al Jazeera America. [21:44] It creates WARCs [21:45] *** dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.) [21:47] *** dashcloud has joined #archiveteam [21:50] Who you callin' a noob? :) [21:50] You love the newbies who don't speak of themselves in the third person, anyway... 
[21:51] lol [21:51] also, archivebot has been on AJA since mid-january [21:52] Is archivebot Heritrix? [21:53] nope [21:53] i think it's wpull [21:54] snape: oh don't start you ' [21:55] Wouldn't dare. :) [21:57] *** Tomcat_ has quit IRC (Remote host closed the connection) [22:04] I started a Heritrix crawl of AlJazeeraAmerica [22:04] https://www.evernote.com/l/ACms9qHxNSRPjIJLcdAHBUIGWHEJMNKI3zY [22:07] yep, archivebot is wpull [22:19] *** JetBalsa has joined #archiveteam [22:19] *** SN4T14 has quit IRC (Quit: Leaving) [22:21] *** scyther has quit IRC (Quit: Leaving) [22:42] *** nickname_ has quit IRC (Read error: Operation timed out) [22:47] Is Gametrailers still going? [22:47] *** nickname_ has joined #archiveteam [22:48] SketchCow, all items are out, just uploads are slow [22:49] Yes, I've got items that probably won't finish uploading until Saturday [22:49] At current upload speeds [22:49] ^^ [22:50] Ive got over 100GB to upload [22:52] at 50kB/s [23:03] Whenever I use Heritrix for crawling archiveteam.org stuff, I always use the useragent: Mozilla/5.0 (compatible; heritrix/1.14.4 +http://archiveteam.org) [23:09] *** brayden has joined #archiveteam [23:09] *** swebb sets mode: +o brayden [23:11] *** schbirid has quit IRC (Quit: Leaving) [23:14] SketchCow: yes [23:15] *** swebb sets mode: +o arkiver [23:15] *** Start has joined #archiveteam [23:15] *** brayden_ has quit IRC (Read error: Operation timed out) [23:51] *** mutoso has quit IRC (Ping timeout: 260 seconds)
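The upload figures quoted at 22:50 and 22:52 (over 100GB to upload, at 50kB/s) can be turned into a rough time estimate. A sketch only: it assumes decimal units (1 GB = 10^9 bytes, 1 kB = 10^3 bytes) and a constant rate, neither of which the log states, and the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86400


def upload_days(size_bytes, rate_bytes_per_second):
    """Days needed to push size_bytes at a constant rate (an idealised assumption)."""
    if rate_bytes_per_second <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_second / SECONDS_PER_DAY


# Figures from the log, read as decimal units (assumption): 100 GB at 50 kB/s.
days = upload_days(100 * 10**9, 50 * 10**3)
print("about {:.1f} days".format(days))
```

Under those assumptions the transfer takes 2,000,000 seconds, a little over 23 days, which is why a second rsync target in front of FOS was being floated earlier in the day.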