#archiveteam-bs 2017-04-22,Sat

Time Nickname Message
00:09 🔗 bwn has quit IRC (Ping timeout: 960 seconds)
00:12 🔗 j08nY has quit IRC (Quit: Leaving)
00:16 🔗 j08nY has joined #archiveteam-bs
00:31 🔗 tammy_ JAA: scrape still chugging along, 155GB.
00:35 🔗 Odd0002 tammy_: scrape of?
00:35 🔗 godane so i only have about 3gb of redeye chicago magazine left
00:36 🔗 godane everything before 2016 should be uploaded
00:36 🔗 tammy_ Odd0002: https://interfacelift.com/
00:36 🔗 godane i think i may have screwed up an upload of one issue from 2012
00:37 🔗 godane other than that everything is there
00:38 🔗 Odd0002 tammy_: ah, are you downloading it by yourself or using the warrior thing?
00:39 🔗 tammy_ on my own
00:39 🔗 tammy_ JAA wrote the grab
00:39 🔗 tammy_ I have the storage
00:39 🔗 Odd0002 ah ok
00:39 🔗 tammy_ you can review it if I can dig up where he posted the git
00:39 🔗 Odd0002 how much is the whole site?
00:39 🔗 tammy_ no idea
00:40 🔗 Odd0002 oh
00:40 🔗 tammy_ they claim to have about 4000 images, in about every resolution imaginable
00:40 🔗 Odd0002 are you uploading it anywhere or?
00:40 🔗 tammy_ I will upload it any/every where
00:41 🔗 tammy_ you want me to jot yer name down so I make sure you get a copy?
00:44 🔗 Odd0002 no, I was thinking of helping out
00:44 🔗 Aranje has quit IRC (Quit: Three sheets to the wind)
00:45 🔗 Aranje has joined #archiveteam-bs
00:46 🔗 tammy_ this is what's running: https://gist.github.com/anonymous/c752b52901d6688d8b677e759c694896
00:48 🔗 Odd0002 but it would start another instance from the beginning, not continue or add to your work
00:49 🔗 tammy_ correct
00:50 🔗 Odd0002 so it wouldn't help
00:51 🔗 Odd0002 I don't even know what the website is; I just want to help archive anything. I have OK bandwidth, and the warriors aren't using any significant portion of it
00:52 🔗 tammy_ bingo. Not sure if you can work in reverse or something. I don't really know python. I just offered to run this for JAA as it was relevant to my interests. I like to have wallpapers. :)
00:52 🔗 Odd0002 ah
00:53 🔗 Odd0002 I haven't changed my wallpaper since I installed Arch on here last year, and so I'm still using the single default one...
00:53 🔗 tammy_ I have 7 screens and rarely are any of them empty, so it's kinda even a waste here too.
00:57 🔗 tammy_ am looking if wget can scrape in reverse alphabetical order
00:57 🔗 tammy_ if it can, I got a thing you can help with by simply starting at the other end
00:58 🔗 Odd0002 but then when do I stop?
00:58 🔗 tammy_ when we check in periodically to see if we've passed each other in each direction
00:58 🔗 tammy_ nothing fancy here
00:59 🔗 tammy_ I'm just doing a wget scrape of this open directory: https://sdo.gsfc.nasa.gov/assets/img/browse/
01:20 🔗 tammy_ not a thing built into wget it seems
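[A minimal sketch of the workaround implied above: since wget has no built-in reverse-order crawl, the auto-index page can be fetched once, its subdirectory links reversed externally, and the result fed back to wget. The href-extraction pattern and the assumption of a simple Apache-style listing are guesses about the SDO pages, not tested details; a second person starting from the other end would simply drop the -r from sort.]

    # Build the URL list from the auto-index page, reverse it, then crawl.
    BASE=https://sdo.gsfc.nasa.gov/assets/img/browse/
    wget -qO- "$BASE" \
      | grep -oE 'href="[^"?]+/"' \
      | sed -e 's/^href="//' -e 's/"$//' \
      | grep -v -e '^\.\.' -e '^/' \
      | sort -r \
      | while read -r dir; do
          # -np keeps each crawl inside its subdirectory, -nc skips files
          # already downloaded on a previous pass
          wget -r -np -nc "$BASE$dir"
        done
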
01:38 🔗 GE has joined #archiveteam-bs
02:14 🔗 GE has quit IRC (Remote host closed the connection)
02:22 🔗 godane looks like i have to wait to upload stuff
02:22 🔗 godane i'm getting the slowdown error
02:58 🔗 Odd0002 well, the archive was down earlier today due to a power outage so...
02:58 🔗 Odd0002 I wonder if it would be feasible to archive all of YouTube...
03:00 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
03:29 🔗 Somebody2 JAA: regarding public/semi-public archives of project channels -- I think the general reason not to do so is that it provides a place for discussions of the specific tactics ...
03:29 🔗 Somebody2 ... of archiving a (sometimes unwilling) website, in a manner that is at least semi-private.
03:30 🔗 Somebody2 I strongly suspect that people wouldn't object if you kept local logs, and made them public in a decade or so. But that is probably not really what you were thinking of.
04:15 🔗 Sk1d has quit IRC (Ping timeout: 250 seconds)
04:17 🔗 j08nY has quit IRC (Quit: Leaving)
04:58 🔗 tammy_ Odd0002: no
05:34 🔗 espes__ youtube adds an Internet Archive's worth of video data every few days
05:42 🔗 Frogging it must cost them so fucking much
06:29 🔗 Odd0002 ok
06:49 🔗 espes__ it costs them $10 billion a year
06:49 🔗 Frogging actually?
06:50 🔗 espes__ maybe only 5
06:50 🔗 espes__ total data center costs
07:13 🔗 bwn has joined #archiveteam-bs
08:03 🔗 GE has joined #archiveteam-bs
09:05 🔗 fenn_ is now known as fenn
09:06 🔗 schbirid has joined #archiveteam-bs
10:00 🔗 JAA has joined #archiveteam-bs
10:22 🔗 schbirid i'm gonna push all jamendo flac into ACD D:
10:41 🔗 GE has quit IRC (Remote host closed the connection)
11:02 🔗 BartoCH has quit IRC (Quit: WeeChat 1.7)
11:02 🔗 BartoCH has joined #archiveteam-bs
11:48 🔗 JAA Somebody2: Thanks. That makes a lot of sense.
11:56 🔗 JAA tammy_, Odd0002: I've been wondering whether there is a way to distribute wget/wpull across multiple machines. It should be possible in principle.
11:56 🔗 odemg has joined #archiveteam-bs
12:25 🔗 GE has joined #archiveteam-bs
13:23 🔗 BlueMaxim has quit IRC (Quit: Leaving)
13:29 🔗 pizzaiolo has joined #archiveteam-bs
14:20 🔗 arkiver2 has joined #archiveteam-bs
14:29 🔗 arkiver2 has quit IRC (Remote host closed the connection)
15:13 🔗 dashcloud sure- I'd look at the warrior vm code, because you'll need a server component to tell the machines what they should be pulling, and then what to get next/where to send the finished data (or you can have people self-select portions, but that gets messy quickly, and can easily lead to duplicates or things being missed)
15:29 🔗 JAA Yeah, to do it without duplicates or misses, you'd need to do one URL = one item and then upload the retrieved data and any new resources (sublinks, images, etc.) to the central server. It seems messy and inefficient, but I guess every other way is doomed to fail entirely.
15:30 🔗 JAA But maybe something could be done with wpull using a centralised database similar to the --database option.
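[For reference, a sketch of what the --database option mentioned above looks like in practice: wpull persists its URL table to a SQLite file, so the crawl state lives outside the process and survives restarts. The URL and file names here are placeholders, not part of any actual project.]

    # Crawl state is kept in crawl.db; a restarted wpull resumes from it.
    wpull --recursive \
          --database crawl.db \
          --warc-file example-grab \
          http://www.example.com/
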
15:31 🔗 joepie91 schbirid: why ACD?
15:31 🔗 joepie91 schbirid: space constraints?
15:49 🔗 j08nY has joined #archiveteam-bs
15:52 🔗 Aranje has quit IRC (Three sheets to the wind)
15:52 🔗 ndiddy has joined #archiveteam-bs
15:54 🔗 Aranje has joined #archiveteam-bs
17:11 🔗 schbirid joepie91: because acd triggered my hoarding instinct :\
17:12 🔗 schbirid i can rsync to anyone who wants
17:19 🔗 joepie91 lol
17:19 🔗 joepie91 schbirid: how big do you expect it to be?
17:19 🔗 schbirid 140k tracks at ~24MB each, just ~4TB
17:19 🔗 joepie91 schbirid: also, you might want to drop by #datahoarder then... <.<
17:19 🔗 joepie91 ah
17:19 🔗 joepie91 I only have 1TB of idle space atm
17:19 🔗 schbirid will do for sure
17:19 🔗 schbirid soundcloud next!
17:20 🔗 joepie91 lol
17:20 🔗 joepie91 schbirid: will you be uploading the jamendo flacs to IA?
17:20 🔗 joepie91 if they're not already there
17:20 🔗 schbirid maaaayybe
17:20 🔗 joepie91 any particular reason not to? :p
17:20 🔗 schbirid i have $id.flac and $id.json
17:20 🔗 schbirid effort...
17:22 🔗 joepie91 lol
17:22 🔗 joepie91 do eet
17:23 🔗 schbirid somehow it feels way too little btw
17:23 🔗 schbirid i had 2tb of vorbis iirc
17:25 🔗 schbirid but maybe they just decided to delete all albums and only keep singles
17:25 🔗 schbirid would not surprise me one bit
17:27 🔗 joepie91 schbirid: oh uh, iirc singles are accessed differently from albums
17:27 🔗 joepie91 so that may be why
17:27 🔗 schbirid yeah
17:27 🔗 schbirid but each track has a unique id
17:27 🔗 schbirid which i all tried \o/
17:31 🔗 joepie91 schbirid: yeah but I'm pretty sure the single track IDs are totally separate from the album track IDs
17:31 🔗 joepie91 schbirid: or at least the way to access them
17:31 🔗 schbirid oh great, i found a better way to grab them all now
17:31 🔗 schbirid with proper titles, not just id as name
17:31 🔗 schbirid maybe
17:31 🔗 schbirid https://mp3d.jamendo.com/download/track/1368703/flac/
17:32 🔗 schbirid the track json metadata references album IDs (which are indeed different)
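[A sketch of the ID enumeration schbirid describes: every track has a numeric ID, so the FLAC endpoint can simply be tried for each candidate. The upper bound of the range is a guess; missing IDs are expected to fail harmlessly and be skipped.]

    # Brute-force the per-track FLAC endpoint over a guessed ID range.
    # --content-disposition saves files under their proper titles.
    for id in $(seq 1 1500000); do
        wget -nc --content-disposition \
            "https://mp3d.jamendo.com/download/track/$id/flac/" || true
    done
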
17:32 🔗 joepie91 schbirid: this used to work: https://gist.github.com/joepie91/9ce4032812c649bf5bc370adbf755d92
17:33 🔗 joepie91 unsure if it still works
17:33 🔗 joepie91 client_id is the API key I think
17:33 🔗 joepie91 so I removed that from the gist
17:33 🔗 schbirid wtf weird ass language is that...
17:33 🔗 joepie91 coffeescript
17:33 🔗 joepie91 not important
17:33 🔗 schbirid :x
17:33 🔗 joepie91 just look at the URLs
17:33 🔗 joepie91 :p
17:33 🔗 schbirid no i like mine
17:35 🔗 joepie91 schbirid: storage. still works
17:35 🔗 schbirid :)
17:36 🔗 joepie91 no API key required for that either
17:36 🔗 joepie91 :p
17:36 🔗 schbirid gah, with --content-disposition filenames i cannot directly download the files into first or last character directories :(
17:36 🔗 schbirid not for the mp3d ones either!
17:37 🔗 schbirid also you can just use their demo key until they ban/renew it ;)
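[A sketch of one way around the problem schbirid hit above: wget cannot compute a per-file target directory from a --content-disposition filename on the fly, but the files can be downloaded flat and bucketed by first character afterwards. The .flac glob is an assumption about the download directory's contents.]

    # Bucket already-downloaded files into first-character directories.
    for f in *.flac; do
        first=$(printf '%s' "$f" | cut -c1 | tr '[:upper:]' '[:lower:]')
        mkdir -p "$first"
        mv -- "$f" "$first/"
    done
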
17:38 🔗 joepie91 lol
18:01 🔗 kristian_ has joined #archiveteam-bs
18:11 🔗 r3c0d3x has quit IRC (Ping timeout: 260 seconds)
18:16 🔗 r3c0d3x has joined #archiveteam-bs
18:26 🔗 r3c0d3x has quit IRC (Read error: Connection timed out)
18:26 🔗 r3c0d3x has joined #archiveteam-bs
18:57 🔗 RichardG has joined #archiveteam-bs
19:06 🔗 ndiddy has quit IRC (Ping timeout: 260 seconds)
19:14 🔗 icedice has joined #archiveteam-bs
19:15 🔗 GE has quit IRC (Remote host closed the connection)
19:28 🔗 kristian_ has quit IRC (Quit: Leaving)
19:45 🔗 icedice has quit IRC (Ping timeout: 250 seconds)
19:57 🔗 icedice has joined #archiveteam-bs
19:58 🔗 antomati_ is now known as antomatic
20:22 🔗 godane SketchCow: kpfa is up to 2017-03-31
20:48 🔗 odemg has quit IRC (Read error: Operation timed out)
20:49 🔗 GE has joined #archiveteam-bs
21:17 🔗 logchfoo3 starts logging #archiveteam-bs at Sat Apr 22 21:17:00 2017
21:17 🔗 logchfoo3 has joined #archiveteam-bs
21:23 🔗 schbirid has quit IRC (Quit: Leaving)
21:23 🔗 bwn has joined #archiveteam-bs
21:26 🔗 Fletcher has joined #archiveteam-bs
21:35 🔗 ndiddy has joined #archiveteam-bs
21:58 🔗 GE has quit IRC (Remote host closed the connection)
22:00 🔗 dashcloud has quit IRC (Ping timeout: 260 seconds)
22:06 🔗 Rai-chan has joined #archiveteam-bs
22:06 🔗 purplebot has joined #archiveteam-bs
22:06 🔗 JensRex has joined #archiveteam-bs
22:08 🔗 i0npulse has joined #archiveteam-bs
22:21 🔗 tuluut has joined #archiveteam-bs
23:02 🔗 tammy_ has joined #archiveteam-bs
23:02 🔗 tammy_ test message, had a strange disconnection issue :(
23:06 🔗 JAA Yeah, looks like there was a netsplit.
23:07 🔗 tammy_ I guess efnet doesn't reconnect as nicely as other servers I'm used to.
23:07 🔗 tammy_ JAA: you interested in writing a new scrape that I'd be happy to host?
23:08 🔗 tammy_ https://www.reddit.com/r/DataHoarder/comments/66wgks/uhq_tvmovie_poster_sources/
23:11 🔗 tammy_ JAA: interfacelift scrape is at 160GB
23:22 🔗 JAA tammy_: I only clicked through a few pages on http://www.impawards.com/, but I didn't see anything that would require special treatment. It seems that a simple recursive wget/wpull (or ArchiveBot job) should be enough to grab it.
23:23 🔗 BlueMaxim has joined #archiveteam-bs
23:24 🔗 tammy_ JAA: any interest in using this as a chance to play with spreading wget amongst multiple users?
23:25 🔗 tammy_ I'd be willing to be those multiple users even.
23:26 🔗 Hecatz has joined #archiveteam-bs
23:33 🔗 JAA tammy_: I don't think it's possible to do that properly with wget. (Different ignore sets per machine/user don't count, in my opinion. It would be impossible to avoid some duplication, and adding more machines would be very painful.)
23:34 🔗 JAA With wpull, it might be worth a try to just share the database between the machines. Specifically, store the database in a separate directory, then mount that directory on the other machine(s) via sshfs or whatever, and run another wpull process there.
23:35 🔗 JAA I have no idea whether that would work though.
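[An untested sketch of the shared-database idea, with hypothetical host and path names: a second machine mounts the directory holding the first machine's wpull database over sshfs and points another wpull at the same file. Whether SQLite's locking behaves correctly over sshfs is exactly the open question JAA raises.]

    # On machine B: mount machine A's crawl-state directory, then run a
    # second wpull against the same SQLite database. Entirely untested.
    mkdir -p /mnt/crawlstate
    sshfs userA@machineA:/home/userA/crawlstate /mnt/crawlstate
    wpull --recursive \
          --database /mnt/crawlstate/crawl.db \
          --warc-file impawards-b \
          http://www.impawards.com/
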
23:35 🔗 tammy_ that's fine. I was just presenting the opportunity for science.
23:38 🔗 JAA :-)
23:38 🔗 JAA For proper testing, I think it'd be better to use a synthetic website with known contents anyway. That makes it easier to verify that the thing is working.
23:41 🔗 tammy_ I'm happy to help you in pursuing such an endeavour
23:53 🔗 JAA Yeah, I think it would be really nice to have something like that. It certainly isn't easy to implement securely, though, if it's supposed to work with multiple users. Then again, the (warrior and script-based) projects aren't anywhere close to "secure" either from what I've seen so far, so... But I'll definitely think about this a bit more in detail.
