[00:09] *** bwn has quit IRC (Ping timeout: 960 seconds)
[00:12] *** j08nY has quit IRC (Quit: Leaving)
[00:16] *** j08nY has joined #archiveteam-bs
[00:31] JAA: scrape still chugging along, 155GB.
[00:35] tammy_: scrape of?
[00:35] so i only have about 3gb of redeye chicago magazine left
[00:36] everything before 2016 should be uploaded
[00:36] Odd0002: https://interfacelift.com/
[00:36] i think i may have screwed up an upload of one issue from 2012
[00:37] other than that everything is there
[00:38] tammy_: ah, are you downloading it by yourself or using the warrior thing?
[00:39] on my own
[00:39] JAA wrote the grab
[00:39] I have the storage
[00:39] ah ok
[00:39] you can review it if I can dig up where he posted the git
[00:39] how much is the whole site?
[00:39] no idea
[00:40] oh
[00:40] they claim to have about 4000 images, in about every resolution imaginable
[00:40] are you uploading it anywhere or?
[00:40] I will upload it any/every where
[00:41] you want me to jot yer name down so I make sure you get a copy?
[00:44] no, I was thinking of helping out
[00:44] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[00:45] *** Aranje has joined #archiveteam-bs
[00:46] this is what's running: https://gist.github.com/anonymous/c752b52901d6688d8b677e759c694896
[00:48] but it would start another instance from the beginning, not continue or add to your work
[00:49] correct
[00:50] so it wouldn't help
[00:51] I don't even know what the website is, I just want to help archive anything. I have OK bandwidth and the warriors are not using any significant portion of it
[00:52] bingo. Not sure if you can work in reverse or something. I don't really know python. I just offered to run this for JAA as it was relevant to my interests. I like to have wallpapers. :)
[00:52] ah
[00:53] I haven't changed my wallpaper since I installed Arch on here last year, and so I'm still using the single default one...
[00:53] I have 7 screens and rarely are any of them empty, so it's kinda even a waste here too.
[00:57] am looking to see if wget can scrape in reverse alphabetical order
[00:57] if it can, I got a thing you can help with by simply starting at the other end
[00:58] but then when do I stop?
[00:58] when we check in periodically to see if we've passed each other in each direction
[00:58] nothing fancy here
[00:59] I'm just doing a wget scrape of this open directory: https://sdo.gsfc.nasa.gov/assets/img/browse/
[01:20] not a thing built into wget it seems
[01:38] *** GE has joined #archiveteam-bs
[02:14] *** GE has quit IRC (Remote host closed the connection)
[02:22] looks like i have to wait to upload stuff
[02:22] i'm getting the slowdown error
[02:58] well, the archive was down earlier today due to a power outage so...
[02:58] I wonder if it would be feasible to archive all of YouTube...
[03:00] *** pizzaiolo has quit IRC (Remote host closed the connection)
[03:29] JAA: regarding public/semi-public archives of project channels -- I think the general reason not to do so is that it provides a place for discussions of the specific tactics ...
[03:29] ... of archiving a (sometimes unwilling) website, in a manner that is at least semi-private.
[03:30] I strongly suspect that people wouldn't object if you kept local logs, and made them public in a decade or so. But that is probably not really what you were thinking of.
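
The reverse-alphabetical idea from 00:57–01:20 has no direct wget switch, but it can be approximated: scrape the open directory's index page, sort the links in reverse, and hand them to wget via --input-file. A minimal sketch under those assumptions; the output filename and the exact wget flags are illustrative, not what was actually run.

import urllib.request
from html.parser import HTMLParser

BASE = "https://sdo.gsfc.nasa.gov/assets/img/browse/"  # open directory from the log

class LinkParser(HTMLParser):
    """Collect href targets from an Apache-style index page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value != "../":
                    self.links.append(value)

with urllib.request.urlopen(BASE) as resp:
    parser = LinkParser()
    parser.feed(resp.read().decode("utf-8", errors="replace"))

# One URL per line, reverse alphabetical, for the second worker to consume.
with open("urls-reversed.txt", "w") as f:
    for link in sorted(set(parser.links), reverse=True):
        f.write(BASE + link + "\n")

# Then, roughly: wget --recursive --no-parent --continue --input-file=urls-reversed.txt
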
[04:15] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:17] *** j08nY has quit IRC (Quit: Leaving)
[04:58] Odd0002: no
[05:34] youtube adds an internet archive's worth of video data every few days
[05:42] it must cost them so fucking much
[06:29] ok
[06:49] it costs them $10 billion a year
[06:49] actually?
[06:50] maybe only 5
[06:50] total data center costs
[07:13] *** bwn has joined #archiveteam-bs
[08:03] *** GE has joined #archiveteam-bs
[09:05] *** fenn_ is now known as fenn
[09:06] *** schbirid has joined #archiveteam-bs
[10:00] *** JAA has joined #archiveteam-bs
[10:22] i'm gonna push all jamendo flac into ACD D:
[10:41] *** GE has quit IRC (Remote host closed the connection)
[11:02] *** BartoCH has quit IRC (Quit: WeeChat 1.7)
[11:02] *** BartoCH has joined #archiveteam-bs
[11:48] Somebody2: Thanks. That makes a lot of sense.
[11:56] tammy_, Odd0002: I've been wondering whether there is a way to distribute wget/wpull across multiple machines. It should be possible in principle.
[11:56] *** odemg has joined #archiveteam-bs
[12:25] *** GE has joined #archiveteam-bs
[13:23] *** BlueMaxim has quit IRC (Quit: Leaving)
[13:29] *** pizzaiolo has joined #archiveteam-bs
[14:20] *** arkiver2 has joined #archiveteam-bs
[14:29] *** arkiver2 has quit IRC (Remote host closed the connection)
[15:13] sure - I'd look at the warrior VM code, because you'll need a server component to tell the machines what they should be pulling, and then what to get next/where to send the finished data (or you can have people self-select portions, but that gets messy quickly and can easily lead to duplicates or things being missed)
[15:29] Yeah, to do it without duplicates or misses, you'd need to do one URL = one item and then upload the retrieved data and any newly discovered resources (sublinks, images, etc.) to the central server. It seems messy and inefficient, but I guess every other way is doomed to fail entirely.
[15:30] But maybe something could be done with wpull using a centralised database, similar to the --database option.
[15:31] schbirid: why ACD?
[15:31] schbirid: space constraints?
[15:49] *** j08nY has joined #archiveteam-bs
[15:52] *** Aranje has quit IRC (Three sheets to the wind)
[15:52] *** ndiddy has joined #archiveteam-bs
[15:54] *** Aranje has joined #archiveteam-bs
[17:11] joepie91: because acd triggered my hoarding instinct :\
[17:12] i can rsync to anyone who wants it
[17:19] lol
[17:19] schbirid: how big do you expect it to be?
[17:19] 140k tracks at ~24MB each, just ~4TB
[17:19] schbirid: also, you might want to drop by #datahoarder then... <.<
[17:19] ah
[17:19] I only have 1TB of idle space atm
[17:19] will do for sure
[17:19] soundcloud next!
[17:20] lol
[17:20] schbirid: will you be uploading the jamendo flacs to IA?
[17:20] if they're not already there
[17:20] maaaayybe
[17:20] any particular reason not to? :p
[17:20] i have $id.flac and $id.json
[17:20] effort...
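
The "one URL = one item" scheme described at 15:29–15:30 is, at its core, a shared work queue: workers claim a URL, fetch it, and report back the data plus any newly discovered URLs. A minimal sketch of such a queue backed by SQLite, with made-up table and function names (a real setup would also need an HTTP layer, authentication, and deduplication of the uploaded data; this is not how the warrior tracker works):

import sqlite3

def open_queue(path="queue.db"):
    # Central database of URLs and their crawl state.
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS urls (
                      url TEXT PRIMARY KEY,
                      status TEXT NOT NULL DEFAULT 'todo')""")
    return db

def add_urls(db, urls):
    # New resources (sublinks, images, etc.) reported back by workers;
    # INSERT OR IGNORE keeps already-known URLs from being queued twice.
    db.executemany("INSERT OR IGNORE INTO urls (url) VALUES (?)",
                   [(u,) for u in urls])
    db.commit()

def claim_url(db):
    # Hand out exactly one pending URL and mark it claimed, so two
    # workers never fetch the same item.
    row = db.execute("SELECT url FROM urls WHERE status = 'todo' LIMIT 1").fetchone()
    if row is None:
        return None
    db.execute("UPDATE urls SET status = 'claimed' WHERE url = ?", (row[0],))
    db.commit()
    return row[0]

def mark_done(db, url):
    db.execute("UPDATE urls SET status = 'done' WHERE url = ?", (url,))
    db.commit()
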
[17:22] lol
[17:22] do eet
[17:23] somehow it feels way too little btw
[17:23] i had 2tb of vorbis iirc
[17:25] but maybe they just decided to delete all albums and only keep singles
[17:25] would not surprise me one bit
[17:27] schbirid: oh uh, iirc singles are accessed differently from albums
[17:27] so that may be why
[17:27] yeah
[17:27] but each track has a unique id
[17:27] all of which i tried \o/
[17:31] schbirid: yeah but I'm pretty sure the single track IDs are totally separate from the album track IDs
[17:31] schbirid: or at least the way to access them
[17:31] oh great, i found a better way to grab them all now
[17:31] with proper titles, not just id as name
[17:31] maybe
[17:31] https://mp3d.jamendo.com/download/track/1368703/flac/
[17:32] the track json metadata references album IDs (which are indeed different)
[17:32] schbirid: this used to work: https://gist.github.com/joepie91/9ce4032812c649bf5bc370adbf755d92
[17:33] unsure if it still works
[17:33] client_id is the API key I think
[17:33] so I removed that from the gist
[17:33] wtf, what weird-ass language is that...
[17:33] coffeescript
[17:33] not important
[17:33] :x
[17:33] just look at the URLs
[17:33] :p
[17:33] no i like mine
[17:35] schbirid: storage. still works
[17:35] :)
[17:36] no API key required for that either
[17:36] :p
[17:36] gah, with --content-disposition filenames i cannot directly download the files into first or last character directories :(
[17:36] not for the mp3d ones either!
[17:37] also you can just use their demo key until they ban/renew it ;)
[17:38] lol
[18:01] *** kristian_ has joined #archiveteam-bs
[18:11] *** r3c0d3x has quit IRC (Ping timeout: 260 seconds)
[18:16] *** r3c0d3x has joined #archiveteam-bs
[18:26] *** r3c0d3x has quit IRC (Read error: Connection timed out)
[18:26] *** r3c0d3x has joined #archiveteam-bs
[18:57] *** RichardG has joined #archiveteam-bs
[19:06] *** ndiddy has quit IRC (Ping timeout: 260 seconds)
[19:14] *** icedice has joined #archiveteam-bs
[19:15] *** GE has quit IRC (Remote host closed the connection)
[19:28] *** kristian_ has quit IRC (Quit: Leaving)
[19:45] *** icedice has quit IRC (Ping timeout: 250 seconds)
[19:57] *** icedice has joined #archiveteam-bs
[19:58] *** antomati_ is now known as antomatic
[20:22] SketchCow: kpfa is up to 2017-03-31
[20:48] *** odemg has quit IRC (Read error: Operation timed out)
[20:49] *** GE has joined #archiveteam-bs
[21:17] *** logchfoo3 starts logging #archiveteam-bs at Sat Apr 22 21:17:00 2017
[21:17] *** logchfoo3 has joined #archiveteam-bs
[21:23] *** schbirid has quit IRC (Quit: Leaving)
[21:23] *** bwn has joined #archiveteam-bs
[21:26] *** Fletcher has joined #archiveteam-bs
[21:35] *** ndiddy has joined #archiveteam-bs
[21:58] *** GE has quit IRC (Remote host closed the connection)
[22:00] *** dashcloud has quit IRC (Ping timeout: 260 seconds)
[22:06] *** Rai-chan has joined #archiveteam-bs
[22:06] *** purplebot has joined #archiveteam-bs
[22:06] *** JensRex has joined #archiveteam-bs
[22:08] *** i0npulse has joined #archiveteam-bs
[22:21] *** tuluut has joined #archiveteam-bs
[23:02] *** tammy_ has joined #archiveteam-bs
[23:02] test message, had a strange disconnection issue :(
[23:06] Yeah, looks like there was a netsplit.
[23:07] I guess efnet doesn't reconnect as nicely as other servers I'm used to.
[23:07] JAA: you interested in writing a new scrape that I'd be happy to host?
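
The per-track grab discussed at 17:31–17:37 boils down to: hit the mp3d download URL for each track ID, honour the server-supplied filename (what wget's --content-disposition does), and only then sort the file into a first-character directory, since wget can't pick the target directory from the final filename. A rough sketch under those assumptions; the Content-Disposition parsing, fallback naming, and directory layout are guesses, not Jamendo-documented behaviour.

import os
import re
import shutil
import urllib.request

def download_track(track_id, out_root="flac"):
    # mp3d URL pattern seen in the log.
    url = f"https://mp3d.jamendo.com/download/track/{track_id}/flac/"
    with urllib.request.urlopen(url) as resp:
        # Mimic wget --content-disposition: take the server-supplied
        # filename if present, otherwise fall back to "<id>.flac".
        cd = resp.headers.get("Content-Disposition", "")
        match = re.search(r'filename="?([^";]+)"?', cd)
        name = match.group(1) if match else f"{track_id}.flac"
        # Bucket into a first-character directory after the fact,
        # since wget cannot do this step directly.
        bucket = os.path.join(out_root, name[0].lower())
        os.makedirs(bucket, exist_ok=True)
        path = os.path.join(bucket, name)
        with open(path, "wb") as f:
            shutil.copyfileobj(resp, f)
    return path

# e.g. loop over candidate track IDs, with error handling for deleted tracks,
# and store the matching $id.json metadata alongside each file.
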
[23:08] https://www.reddit.com/r/DataHoarder/comments/66wgks/uhq_tvmovie_poster_sources/
[23:11] JAA: interfacelift scrape is at 160GB
[23:22] tammy_: I only clicked through a few pages on http://www.impawards.com/, but I didn't see anything that would require special treatment. It seems that a simple recursive wget/wpull (or ArchiveBot job) should be enough to grab it.
[23:23] *** BlueMaxim has joined #archiveteam-bs
[23:24] JAA: any interest in using this as a chance to play with spreading wget amongst multiple users?
[23:25] I'd be willing to be those multiple users, even.
[23:26] *** Hecatz has joined #archiveteam-bs
[23:33] tammy_: I don't think it's possible to do that properly with wget. (Different ignore sets per machine/user don't count, in my opinion. It would be impossible to avoid some duplication, and adding more machines would be very painful.)
[23:34] With wpull, it might be worth a try to just share the database between the machines. Specifically, store the database in a separate directory, then mount that directory on the other machine(s) via sshfs or whatever, and run another wpull process there.
[23:35] I have no idea whether that would work though.
[23:35] that's fine. I was just presenting the opportunity for science.
[23:38] :-)
[23:38] For proper testing, I think it'd be better to use a synthetic website with known contents anyway. That makes it easier to verify that the thing is working.
[23:41] I'm happy to help you in pursuing such an endeavour
[23:53] Yeah, I think it would be really nice to have something like that. It certainly isn't easy to implement this securely though if it's supposed to work with multiple users. Then again, the (warrior and script-based) projects aren't anywhere close to "secure" either from what I've seen so far, so... But I'll definitely think about this a bit more in detail.
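
The shared-database experiment JAA outlines at 23:34 would look roughly like this on the second machine: mount the first machine's wpull state directory via sshfs, then point another wpull process at the same --database file. Hostnames, paths, and the target URL are placeholders, and whether SQLite tolerates two writers over sshfs is exactly the open question from the log.

import subprocess

# Beforehand, on the second machine (placeholder host and paths):
#   sshfs user@first-machine:/srv/wpull-state /mnt/wpull-shared
SHARED = "/mnt/wpull-shared"

subprocess.run([
    "wpull", "http://www.impawards.com/",
    "--recursive",
    "--database", f"{SHARED}/crawl.db",   # same DB file the first wpull process uses
    "--warc-file", "impawards-worker2",   # per-worker WARC prefix to avoid clobbering
])
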