00:09 -- bwn has quit IRC (Ping timeout: 260 seconds)
00:12 -- j08nY has quit IRC (Quit: Leaving)
00:16 -- j08nY has joined #archiveteam-bs
00:31 <tammy_> JAA: scrape still chugging along, 155GB.
00:35 <Odd0002> tammy_: scrape of?
00:35 <godane> so i only have about 3gb of redeye chicago magazine left
00:36 <godane> everything before 2016 should be uploaded
00:36 <tammy_> Odd0002: https://interfacelift.com/
00:36 <godane> i think i may have screwed up an upload of one issue from 2012
00:37 <godane> other than that everything is there
00:38 <Odd0002> tammy_: ah, are you downloading it by yourself or using the warrior thing?
00:39 <tammy_> on my own
00:39 <tammy_> JAA wrote the grab
00:39 <tammy_> I have the storage
00:39 <Odd0002> ah ok
00:39 <tammy_> you can review it if I can dig up where he posted the git
00:39 <Odd0002> how much is the whole site?
00:39 <tammy_> no idea
00:40 <Odd0002> oh
00:40 <tammy_> they claim to have about 4000 images, in about every resolution imaginable
00:40 <Odd0002> are you uploading it anywhere or?
00:40 <tammy_> I will upload it any/every where
00:41 <tammy_> you want me to jot yer name down so I make sure you get a copy?
00:44 <Odd0002> no, I was thinking of helping out
00:44 -- Aranje has quit IRC (Quit: Three sheets to the wind)
00:45 -- Aranje has joined #archiveteam-bs
00:46 <tammy_> this is what's running: https://gist.github.com/anonymous/c752b52901d6688d8b677e759c694896
00:48 <Odd0002> but it would start another instance from the beginning, not continue or add to your work
00:49 <tammy_> correct
00:50 <Odd0002> so it wouldn't help
00:51 <Odd0002> I don't even know what the website is, I just want to help archive anything, I have OK bandwidth and the warriors are not using any significant portion of my bandwidth
00:52 <tammy_> bingo. Not sure if you can work in reverse or something. I don't really know python. I just offered to run this for JAA as it was relevant to my interests. I like to have wallpapers. :)
00:52 <Odd0002> ah
00:53 <Odd0002> I haven't changed my wallpaper since I installed Arch on here last year, and so I'm still using the single default one...
00:53 <tammy_> I have 7 screens and rarely are any of them empty, so it's kinda even a waste here too.
00:57 <tammy_> am looking if wget can scrape in reverse alphabetical order
00:57 <tammy_> if it can, I got a thing you can help with by simply starting at the other end
00:58 <Odd0002> but then when do I stop?
00:58 <tammy_> when we check in periodically to see if we've passed each other in each direction
00:58 <tammy_> nothing fancy here
00:59 <tammy_> I'm just doing a wget scrape of this open directory: https://sdo.gsfc.nasa.gov/assets/img/browse/
01:20 <tammy_> not a thing built into wget it seems
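Since wget has no built-in reverse-order mode, one workaround is to pull the index page yourself, reverse the link order, and feed the list back into wget. A rough, untested sketch (it assumes the SDO directory serves a plain Apache-style index with relative links):

    base='https://sdo.gsfc.nasa.gov/assets/img/browse/'
    # grab the index, keep relative hrefs (skipping sort links and parent
    # directories), reverse-sort them, and recurse into each from the far end
    wget -qO- "$base" \
        | grep -oP 'href="\K[^"/?.][^"]*' \
        | sort -r \
        | sed "s|^|$base|" \
        | wget -r -np -nc -i -

One machine running the normal forward crawl and another running this reversed one could then stop once the two runs meet in the middle, as suggested above.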
01:38 -- GE has joined #archiveteam-bs
02:14 -- GE has quit IRC (Remote host closed the connection)
02:22 <godane> looks like i have to wait to upload stuff
02:22 <godane> i'm getting the slowdown error
02:58 <Odd0002> well, the archive was down earlier today due to a power outage so...
02:58 <Odd0002> I wonder if it would be feasible to archive all of YouTube...
03:00 -- pizzaiolo has quit IRC (Remote host closed the connection)
03:29 <Somebody2> JAA: regarding public/semi-public archives of project channels -- I think the general reason not to do so is that it provides a place for discussions of the specific tactics ...
03:29 <Somebody2> ... of archiving a (sometimes unwilling) website, in a manner that is at least semi-private.
03:30 <Somebody2> I strongly suspect that people wouldn't object if you kept local logs, and made them public in a decade or so. But that is probably not really what you were thinking of.
04:15 -- Sk1d has quit IRC (Ping timeout: 250 seconds)
04:17 -- j08nY has quit IRC (Quit: Leaving)
04:58 <tammy_> Odd0002: no
05:34 <espes__> youtube adds an Internet Archive's worth of video data every few days
05:42 <Frogging> it must cost them so fucking much
06:29 <Odd0002> ok
06:49 <espes__> it costs them $10 billion a year
06:49 <Frogging> actually?
06:50 <espes__> maybe only 5
06:50 <espes__> total data center costs
07:13 -- bwn has joined #archiveteam-bs
08:03 -- GE has joined #archiveteam-bs
09:05 -- fenn_ is now known as fenn
09:06 -- schbirid has joined #archiveteam-bs
10:00 -- JAA has joined #archiveteam-bs
10:22 <schbirid> i'm gonna push all jamendo flac into ACD D:
10:41 -- GE has quit IRC (Remote host closed the connection)
11:02 -- BartoCH has quit IRC (Quit: WeeChat 1.7)
11:02 -- BartoCH has joined #archiveteam-bs
11:48 <JAA> Somebody2: Thanks. That makes a lot of sense.
11:56 <JAA> tammy_, Odd0002: I've been wondering whether there is a way to distribute wget/wpull across multiple machines. It should be possible in principle.
11:56 -- odemg has joined #archiveteam-bs
12:25 -- GE has joined #archiveteam-bs
13:23 -- BlueMaxim has quit IRC (Quit: Leaving)
13:29 -- pizzaiolo has joined #archiveteam-bs
14:20 -- arkiver2 has joined #archiveteam-bs
14:29 -- arkiver2 has quit IRC (Remote host closed the connection)
15:13 <dashcloud> sure - I'd look at the warrior vm code, because you'll need a server component to tell the machines what they should be pulling, and then what to get next/where to send the finished data (or you can have people self-select portions, but that gets messy quickly, and can easily lead to duplicates or things being missed)
15:29 <JAA> Yeah, to do it without duplicates or misses, you'd need to do one URL = one item and then upload the retrieved data and any new resources (sublinks, images, etc.) to the central server. It seems messy and inefficient, but I guess every other way is doomed to fail entirely.
15:30 <JAA> But maybe something could be done with wpull using a centralised database similar to the --database option.
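As a very rough sketch of that "one URL = one item" flow (the tracker endpoints below are hypothetical placeholders, not an existing Archive Team API):

    tracker='http://tracker.example'    # hypothetical coordination server
    mkdir -p data
    # claim one URL at a time; an empty queue ends the loop
    while url=$(curl -fs "$tracker/claim") && [ -n "$url" ]; do
        name=$(printf '%s' "$url" | md5sum | cut -d' ' -f1)
        curl -fs "$url" -o "data/$name" || continue
        # hand any newly discovered links back to the central queue
        grep -oP 'href="\Khttps?://[^"]+' "data/$name" \
            | curl -fs -X POST --data-binary @- "$tracker/discovered"
    done

The point is simply that the central server owns the URL frontier, so no two workers fetch the same page and nothing is missed; the fetched data would then be uploaded separately.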
15:31 <joepie91> schbirid: why ACD?
15:31 <joepie91> schbirid: space constraints?
15:49 -- j08nY has joined #archiveteam-bs
15:52 -- Aranje has quit IRC (Quit: Three sheets to the wind)
15:52 -- ndiddy has joined #archiveteam-bs
15:54 -- Aranje has joined #archiveteam-bs
17:11 <schbirid> joepie91: because acd triggered my hoarding instinct :\
17:12 <schbirid> i can rsync to anyone who wants
17:19 <joepie91> lol
17:19 <joepie91> schbirid: how big do you expect it to be?
17:19 <schbirid> 140k tracks at ~24MB each, just ~4TB
17:19 <joepie91> schbirid: also, you might want to drop by #datahoarder then... <.<
17:19 <joepie91> ah
17:19 <joepie91> I only have 1TB of idle space atm
17:19 <schbirid> will do for sure
17:20 <schbirid> soundcloud next!
17:20 <joepie91> lol
17:20 <joepie91> schbirid: will you be uploading the jamendo flacs to IA?
17:20 <joepie91> if they're not already there
17:20 <schbirid> maaaayybe
17:20 <joepie91> any particular reason not to? :p
17:20 <schbirid> i have $id.flac and $id.json
17:22 <schbirid> effort...
17:22 <joepie91> lol
17:23 <joepie91> do eet
17:23 <schbirid> somehow it feels way too little btw
17:25 <schbirid> i had 2tb of vorbis iirc
17:25 <schbirid> but maybe they just decided to delete all albums and only keep singles
17:27 <schbirid> would not surprise me one bit
17:27 <joepie91> schbirid: oh uh, iirc singles are accessed differently from albums
17:27 <joepie91> so that may be why
17:27 <schbirid> yeah
17:27 <schbirid> but each track has a unique id
17:27 <schbirid> which i all tried \o/
17:31 <joepie91> schbirid: yeah but I'm pretty sure the single track IDs are totally separate from the album track IDs
17:31 <joepie91> schbirid: or at least the way to access them
17:31 <schbirid> oh great, i found a better way to grab them all now
17:31 <schbirid> with proper titles, not just id as name
17:31 <schbirid> maybe
17:32 <schbirid> https://mp3d.jamendo.com/download/track/1368703/flac/
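Going by that URL, a brute-force grab can just walk the numeric track IDs. Sketch only: the upper bound is a guess, and IDs that never existed will simply 404:

    # --content-disposition lets the server supply the real filename
    for id in $(seq 1 1500000); do
        wget -nc --content-disposition \
            "https://mp3d.jamendo.com/download/track/$id/flac/"
    done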
17:32 <schbirid> the track json metadata references album IDs (which are indeed different)
17:33 <joepie91> schbirid: this used to work: https://gist.github.com/joepie91/9ce4032812c649bf5bc370adbf755d92
17:33 <joepie91> unsure if it still works
17:33 <joepie91> client_id is the API key I think
17:33 <joepie91> so I removed that from the gist
17:33 <schbirid> wtf weird ass language is that...
17:33 <joepie91> coffeescript
17:33 <joepie91> not important
17:33 <schbirid> :x
17:33 <joepie91> just look at the URLs
17:33 <joepie91> :p
17:33
🔗
|
schbirid |
no i like mine |
17:35
🔗
|
joepie91 |
schbirid: storage. still works |
17:35
🔗
|
schbirid |
:) |
17:36
🔗
|
joepie91 |
no API key required for that either |
17:36
🔗
|
joepie91 |
:p |
17:36
🔗
|
schbirid |
gah, with --content-disposition filenames i cannot directly download the files into first or last character directories :( |
17:36
🔗
|
schbirid |
not for the mp3d ones either! |
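One workaround (untested): since the final name is only known once --content-disposition resolves, download into a staging directory first and bucket the files afterwards. $url here stands for any of the download URLs above:

    mkdir -p staging
    wget -nc --content-disposition -P staging "$url"
    # move each finished file into a directory named after its first character
    for f in staging/*; do
        b=$(basename "$f")
        mkdir -p "${b:0:1}" && mv "$f" "${b:0:1}/"
    done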
17:37
🔗
|
schbirid |
also you can just use their demo key until they ban/renew it ;) |
17:38
🔗
|
joepie91 |
lol |
18:01
🔗
|
|
kristian_ has joined #archiveteam-bs |
18:11
🔗
|
|
r3c0d3x has quit IRC (Ping timeout: 260 seconds) |
18:16
🔗
|
|
r3c0d3x has joined #archiveteam-bs |
18:26
🔗
|
|
r3c0d3x has quit IRC (Read error: Connection timed out) |
18:26
🔗
|
|
r3c0d3x has joined #archiveteam-bs |
18:57
🔗
|
|
RichardG has joined #archiveteam-bs |
19:06
🔗
|
|
ndiddy has quit IRC (Ping timeout: 260 seconds) |
19:14
🔗
|
|
icedice has joined #archiveteam-bs |
19:15
🔗
|
|
GE has quit IRC (Remote host closed the connection) |
19:28
🔗
|
|
kristian_ has quit IRC (Quit: Leaving) |
19:45
🔗
|
|
icedice has quit IRC (Ping timeout: 250 seconds) |
19:57
🔗
|
|
icedice has joined #archiveteam-bs |
19:58
🔗
|
|
antomati_ is now known as antomatic |
20:22
🔗
|
godane |
SketchCow: kpfa is up to 2017-03-31 |
20:48
🔗
|
|
odemg has quit IRC (Read error: Operation timed out) |
20:49
🔗
|
|
GE has joined #archiveteam-bs |
21:17
🔗
|
|
logchfoo3 starts logging #archiveteam-bs at Sat Apr 22 21:17:00 2017 |
21:17
🔗
|
|
logchfoo3 has joined #archiveteam-bs |
21:23
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
21:23
🔗
|
|
bwn has joined #archiveteam-bs |
21:26
🔗
|
|
Fletcher has joined #archiveteam-bs |
21:35
🔗
|
|
ndiddy has joined #archiveteam-bs |
21:58
🔗
|
|
GE has quit IRC (Remote host closed the connection) |
22:00
🔗
|
|
dashcloud has quit IRC (Ping timeout: 260 seconds) |
22:06
🔗
|
|
Rai-chan has joined #archiveteam-bs |
22:06
🔗
|
|
purplebot has joined #archiveteam-bs |
22:06
🔗
|
|
JensRex has joined #archiveteam-bs |
22:08
🔗
|
|
i0npulse has joined #archiveteam-bs |
22:21
🔗
|
|
tuluut has joined #archiveteam-bs |
23:02
🔗
|
|
tammy_ has joined #archiveteam-bs |
23:02
🔗
|
23:02 <tammy_> test message, had a strange disconnection issue :(
23:06 <JAA> Yeah, looks like there was a netsplit.
23:07 <tammy_> I guess efnet doesn't reconnect as nicely as other servers I'm used to.
23:07 <tammy_> JAA: you interested in writing a new scrape that I'd be happy to host?
23:08 <tammy_> https://www.reddit.com/r/DataHoarder/comments/66wgks/uhq_tvmovie_poster_sources/
23:11 <tammy_> JAA: interfacelift scrape is at 160GB
23:22 <JAA> tammy_: I only clicked through a few pages on http://www.impawards.com/, but I didn't see anything that would require special treatment. It seems that a simple recursive wget/wpull (or ArchiveBot job) should be enough to grab it.
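For reference, the kind of plain recursive grab meant here would look roughly like this (illustrative flags, not a tested recipe for impawards.com):

    wget --recursive --no-parent --page-requisites --adjust-extension \
        --wait 1 --warc-file=impawards \
        http://www.impawards.com/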
23:23 -- BlueMaxim has joined #archiveteam-bs
23:24 <tammy_> JAA: any interest in using this as a chance to play with spreading wget amongst multiple users?
23:25 <tammy_> I'd be willing to be those multiple users even.
23:26 -- Hecatz has joined #archiveteam-bs
23:33 <JAA> tammy_: I don't think it's possible to do that properly with wget. (Different ignore sets per machine/user don't count, in my opinion. It would be impossible to avoid some duplication, and adding more machines would be very painful.)
23:34 <JAA> With wpull, it might be worth a try to just share the database between the machines. Specifically, store the database in a separate directory, then mount that directory on the other machine(s) via sshfs or whatever, and run another wpull process there.
23:35 <JAA> I have no idea whether that would work though.
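Spelled out, that untested idea looks something like this (wpull does take a --database path; https://example.com/ is a placeholder target, and the big caveat is that SQLite locking over sshfs/network filesystems is known to be unreliable, so this could easily corrupt the database):

    # machine A: keep wpull's SQLite database in a shared directory
    wpull --recursive --database /srv/crawl/crawl.db https://example.com/

    # machine B: mount that directory and point a second wpull at the same DB
    sshfs userA@machineA:/srv/crawl /mnt/crawl
    wpull --recursive --database /mnt/crawl/crawl.db https://example.com/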
23:35 <tammy_> that's fine. I was just presenting the opportunity for science.
23:38 <JAA> :-)
23:38 <JAA> For proper testing, I think it'd be better to use a synthetic website with known contents anyway. That makes it easier to verify that the thing is working.
23:41 <tammy_> I'm happy to help you in pursuing such an endeavour
23:53 <JAA> Yeah, I think it would be really nice to have something like that. It certainly isn't easy to implement this securely though if it's supposed to work with multiple users. Then again, the (warrior and script-based) projects aren't anywhere close to "secure" either from what I've seen so far, so... But I'll definitely think about this a bit more in detail.