#archiveteam 2014-09-08,Mon


Time Nickname Message
00:24 🔗 godane Arkiver2: the only subtitles i found are in dutch
03:42 🔗 dx hey #archiveteam! bukkit, a minecraft server / modding platform, is dead/dying due to a licensing conflict. the conflicting software is all gone due to DMCA, but their websites might still have plenty of data useful for the rest of the modding community
03:44 🔗 yipdw dx: we got a copy of the bukkit wiki via archivebot
03:44 🔗 yipdw if you've got other sites, please feel free to drop by #archivebot
03:44 🔗 dx neat
03:44 🔗 dx what about this? http://dev.bukkit.org/bukkit-plugins/
03:45 🔗 dx pages in there could be archived through archivebot, but what about plugin .jars?
03:45 🔗 dx pages would be useful because there's plenty of documentation in that site
03:45 🔗 yipdw one sec, checking
03:46 🔗 yipdw if the jars are linked, we can get them
03:47 🔗 dx hmm, they seem like direct links http://dev.bukkit.org/bukkit-plugins/nametags/files/
03:49 🔗 balrog the forums
03:49 🔗 balrog the plugins would be useful
03:49 🔗 balrog I have clones of the DMCA'd repos
03:49 🔗 balrog but I don't know what the legality on that is :p
03:50 🔗 dx as illegal as it always was
03:50 🔗 dx it's only down now because they are acting on it
07:35 🔗 dx so uhhh, there's an api for dev.bukkit.org, i scraped all the plugin download links with it
07:36 🔗 dx so now i've got this huge json file with some metadata and a bunch of direct download links, sample of the first 100: http://dump.dequis.org/mF7xj.txt
07:37 🔗 dx no filesize info or uploaded timestamps, could grab that with HEAD requests maybe.
07:37 🔗 dx full dump http://dequis.org/bukkitdev-releases.json.gz - 3.4mb gzipped, 27mb uncompressed
07:41 🔗 dx all file urls one per line, no filtering at all, http://dequis.org/bukkitdev_all_file_urls.txt - 4.5mb, 75494 lines
07:42 🔗 dx it's an absurd amount of files, most of them not worth saving
07:43 🔗 dx how do i reduce this? for some projects it's not as trivial as grabbing the latest release, they have several parallel releases for a single version
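A minimal sketch of the HEAD-request size check dx mentions at 07:37, assuming Python with the `requests` library and the URL list linked at 07:41; it only counts files whose server reports a Content-Length:

```python
# Minimal sketch: sum Content-Length over the URL list
# (bukkitdev_all_file_urls.txt, 75494 lines) to estimate total size.
# Assumes the `requests` library; URLs without a usable header are skipped.
import requests

total_bytes = 0
with open("bukkitdev_all_file_urls.txt") as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        try:
            resp = requests.head(url, allow_redirects=True, timeout=30)
            total_bytes += int(resp.headers.get("Content-Length", 0))
        except (requests.RequestException, ValueError):
            continue  # unreachable file or missing/garbled header

print("approximate total: %.1f GB" % (total_bytes / 1e9))
```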
07:46 🔗 Litus1960 Q: just installed Virtualbox and want to import a fine (appliance) how do I get the files?
07:46 🔗 Litus1960 a file
08:47 🔗 midas Litus1960: http://tracker.archiveteam.org/ click the download link
08:55 🔗 Litus1960 @Midas I got the machine how do I import the files?
08:58 🔗 Rotab http://archiveteam.org/index.php?title=Warrior
09:00 🔗 Litus1960 In VirtualBox, click File > Import Appliance and open the file. -> what file do I open?
09:00 🔗 Rotab ....?
09:00 🔗 Rotab the file you just downloaded
09:01 🔗 Litus1960 I import the file that I just installed the virtual machine with?
09:03 🔗 Rotab you import the .ova
09:04 🔗 midas Litus1960: http://archive.org/download/archiveteam-warrior/archiveteam-warrior-v2-20121008.ova
09:04 🔗 midas then in virtualbox > file > import
09:07 🔗 Litus1960 ok did that now how do I run a job?
09:07 🔗 midas it shows an IP address in the console
09:08 🔗 midas it explains it here: http://tracker.archiveteam.org/
09:11 🔗 Litus1960 got it
09:11 🔗 schbirid Litus1960: welcome and thanks for helping out :)
09:15 🔗 Litus1960 sorry bud, thank you for helping out :D was so busy trying to comprehend the poop not a nerdy person ;-)
09:18 🔗 schbirid http://www.classiccomputing.com/CC/Blog/Entries/2014/9/7_Computer_History_Nostalgia_Podcasts.html
09:23 🔗 Litus1960 followed with @FlexMind
09:26 🔗 Litus1960 bye for now
09:50 🔗 danneh_ y'all have probably seen this, but: http://blog.twitpic.com/2014/09/twitpic-is-shutting-down/
09:51 🔗 ersi danneh_: Yeah. There's a project channel at #quitpic
09:52 🔗 danneh_ ersi: Ah, fair enough. Thanks for being so quick and awesome!
09:53 🔗 ersi np!
09:53 🔗 Muad-Dib is anyone going to those Dutch meetups about digital preservation tomorrow? http://www.dezwijger.nl/115739/nl/tegenlicht-meet-up-23-digitaal-geheugenverlies
09:54 🔗 Muad-Dib joepie91, midas ^
09:58 🔗 midas I wish i could, but work and job interviews
10:03 🔗 Muad-Dib I might be able to go, but I'm a bit reluctant about walking around there and going "Hey guys, I'm with Archive Team!"
10:04 🔗 Muad-Dib I don't want to be mistaken for a representative, as I'm not really a good PR person
10:05 🔗 midas simple answer, don't :p
10:07 🔗 Muad-Dib but I also want to stir up discussion about digital preservation, and if people ask "how are you connected to this stuff?"
10:09 🔗 Muad-Dib Shall I stick with Jason's "Teenagers and crazy people" description for bands of rogue archivists like AT? https://decorrespondent.nl/1695/Waarom-deze-man-het-hele-internet-downloadt/56475705-f10825bc
10:12 🔗 antomatic I believe he said "maniacs" :)
10:12 🔗 Muad-Dib antomatic: figures, it was translated into dutch in that interview
10:13 🔗 Muad-Dib havent watched the entire documentary yet
14:36 🔗 Arkiver2 Guys, I'm going to the meetup tomorrow evening in Amsterdam
14:37 🔗 Arkiver2 ^Muad-Dib
14:39 🔗 Muad-Dib If I can free the time, I'll be there too
14:41 🔗 midas i'd like to be there, but the day after that i have a job interview abroad.. so ill pass
14:42 🔗 midas but do remind them that we grabbed the announcement already Muad-Dib ;)
14:43 🔗 Arkiver2 midas: I'd like to tell people about the projects we are currently doing
14:43 🔗 Arkiver2 and the problems the archiveteam usually experiences when archiving websites with warrior
14:44 🔗 Arkiver2 However, it would be great if someone of you can also be there
14:44 🔗 Arkiver2 :)
14:51 🔗 vantec I'd like to go but I'm 6.5k km away
14:51 🔗 phuzion dx: I'm getting the sizes of everything right now, just so you know
14:53 🔗 Arkiver2 dx: There is nothing that is not worth saving!!
14:54 🔗 Arkiver2 If the list of files is below, say, 200 GB we can do it with archivebot
14:58 🔗 Arkiver2 added it to archivebot
14:58 🔗 Arkiver2 should be done in a day
14:59 🔗 dx whoa!
14:59 🔗 dx just came home and you're already archiving the whole thing :D
14:59 🔗 Arkiver2 dx: http://archivebot.com/ first one
15:00 🔗 dx Arkiver2: thanks :D
15:00 🔗 Arkiver2 if you have any other list of files/pages please give it and it will be saved
15:01 🔗 dx Arkiver2: all the pages under http://dev.bukkit.org/bukkit-plugins/ - mostly documentation of those plugins, they also link to the jars in that list so you'll want to exclude that.
15:02 🔗 Arkiver2 dx: all of dev.bukkit.org is being saved: http://archivebot.com/ (third one)
15:02 🔗 dx Arkiver2: it won't download jars twice, right?
15:03 🔗 dx also, :D
15:04 🔗 yipdw dx: if the JARs have N URLs, they will be downloaded N times
15:05 🔗 dx yipdw: each url is a different version, but they are being downloaded from both my file list and the dev.bukkit.org recursive task
15:06 🔗 Arkiver2 I'd say just leave it as it's going now, it won't be too big
15:06 🔗 dx hmm
15:08 🔗 jules hi
15:08 🔗 Arkiver2 jules: hello
15:08 🔗 dx yesterday i took a random sample of 50 of them, got average 140kbytes, max 4mbytes, so uhhh, somewhere between 10gb and 300gb
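For scale: the list has 75,494 URLs, so the sample average of ~140 kB gives roughly 75,494 × 140 kB ≈ 10.6 GB, while the sample maximum of 4 MB gives roughly 75,494 × 4 MB ≈ 300 GB, which is where the 10 GB to 300 GB range comes from.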
15:08 🔗 dx but i have no idea what's "too big" for you guys :P
15:09 🔗 Arkiver2 nothing is too big for us
15:09 🔗 Arkiver2 well, there is a limit kind of
15:09 🔗 dx yeah i saw the twitch wiki page
15:09 🔗 Arkiver2 MobileMe was more than 200 TB
15:09 🔗 dx whoa.
15:09 🔗 Arkiver2 so it should be fine ;)
15:10 🔗 dx :D
15:10 🔗 dx thanks a lot!
15:10 🔗 yipdw dx: if that's the case, please ignore all jars from the dev.bukkit.org job
15:10 🔗 dx Arkiver2: ^
15:10 🔗 yipdw there's "having lots of space" and there's "not being brain-dead", and keeping two identical copies fails the latter
15:10 🔗 dx yup!
15:10 🔗 Arkiver2 we can do that too
15:11 🔗 yipdw hopefully dev.bukkit.org doesn't use Java applets
15:11 🔗 dx haha no
15:11 🔗 yipdw that'll make the obvious ignore pattern a bit trickier
15:11 🔗 dx it's not a website from the 90s luckily
15:13 🔗 Arkiver2 yipdw: maybe ignore everything from servermods.cursecdn.com/
15:13 🔗 yipdw done
15:13 🔗 Arkiver2 as all files are from there
15:13 🔗 Arkiver2 ah that's even better
15:13 🔗 Arkiver2 thanks
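The ignore Arkiver2 and yipdw settle on, expressed as a regex in a small Python sketch; the pattern and the example URLs are illustrative, not the exact ones used in archivebot:

```python
import re

# Sketch of the ignore discussed above: the recursive dev.bukkit.org job skips
# anything served from servermods.cursecdn.com, since those files are already
# covered by the explicit file-list job.
IGNORE = re.compile(r"^https?://(?:[^/]*\.)?servermods\.cursecdn\.com/")

def should_fetch(url):
    """True if the recursive job should download this URL."""
    return IGNORE.match(url) is None

assert not should_fetch("http://servermods.cursecdn.com/files/123/456/some-plugin.jar")
assert should_fetch("http://dev.bukkit.org/bukkit-plugins/nametags/files/")
```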
17:53 🔗 pluesch i hate sites that are impossible to archive...
18:01 🔗 joepie91 nothing is impossible ;)
18:01 🔗 joepie91 it's just a challenge!
18:02 🔗 dx unless they are already dead
18:03 🔗 joepie91 ... okay, point taken.
18:03 🔗 joepie91 :P
18:17 🔗 pluesch joepie91: tumblr.com :D
18:17 🔗 pluesch they limit everything
18:17 🔗 pluesch for example: notes are limited to 4980 and I don't know why
18:18 🔗 pluesch so if there is a post with, let's say, 10000 notes, it is impossible to find out all likes, reblogs
18:19 🔗 xmc sounds like they tried to limit it to 250 pages of 20, but got it wrong
18:23 🔗 joepie91 off-by-one, whoo
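If the intended cap really were 250 pages of 20 notes each, that would be 5,000; the observed 4,980 works out to 249 × 20, exactly one page short, which is the off-by-one being joked about.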
18:23 🔗 joepie91 pluesch: how is the remainder normally visible?
18:23 🔗 joepie91 surely they don't *completely* hide them
18:40 🔗 pluesch joepie91: what do you mean?
18:40 🔗 pluesch it's not visible, the data just gets lost in their database
18:40 🔗 pluesch uhhh I hate it when data gets lost
18:43 🔗 midas did someone say mongodb on 32bit?
18:49 🔗 joepie91 pluesch: I suspect it can still be accessed
18:49 🔗 joepie91 just through a different method
18:51 🔗 pluesch joepie91: I've tried many ways ... api v1, api v2, normal page
18:52 🔗 pluesch but yeah ... still have to check some things (tumblr android app api calls, "undocumented" api functions)
19:12 🔗 espes__ of general interest: https://github.com/espes/jdget
19:12 🔗 espes__ if anyone ever needs to pull a bunch of links from file lockers
19:13 🔗 espes__ ping me and maybe I'll update / maintain it
19:23 🔗 pluesch even with tumblr android app the 4980 limit is there....
19:23 🔗 pluesch -.-
19:54 🔗 joepie91 pluesch: is it possible that the notes displayed differ depending on what reblog you're looking at
19:54 🔗 xmc I think they don't
19:54 🔗 joepie91 similar to how twitter sometimes shows an entirely different conversation flow on a different reply
19:55 🔗 xmc tumblr notes are attached to the thread root
19:55 🔗 joepie91 bah
19:55 🔗 * joepie91 crumples up paper and aims at circular filing bin
20:05 🔗 DFJustin espes__: ooooh
20:06 🔗 DFJustin does it do youtube? jdownloader2 is one of the few tools I've found that will correctly grab 1080p videos
20:06 🔗 pluesch joepie91: nope, checked
20:06 🔗 pluesch so
20:06 🔗 pluesch what should i do now?
20:06 🔗 DFJustin but I can't really recommend it to people because it's such a heap
20:06 🔗 pluesch ask tumblr to improve there api? XD
20:07 🔗 pluesch their*
20:07 🔗 pluesch DFJustin: youtube-dl can 1080p too?
20:07 🔗 pluesch bestvideo+bestaudio
20:12 🔗 joepie91 DFJustin: huh? youtube-dl does 1080p fine
20:12 🔗 DFJustin yeah but it's not the default and you need ffmpeg
20:12 🔗 joepie91 also, so does freerapid afaik
20:12 🔗 joepie91 ...?
20:12 🔗 joepie91 DFJustin: link me a 1080p video?
20:13 🔗 xmc youtube-dl usually works for me
20:15 🔗 pluesch tumblr has just blocked my ip :/
20:15 🔗 pluesch damn.
20:15 🔗 pluesch I just wanted to mirror 160 blogs
20:16 🔗 RedType did you try saying you're googlebot?
20:16 🔗 DFJustin https://www.youtube.com/watch?v=t8YXut6_56c
20:17 🔗 RedType "oh what's that, you need me to register before viewing or my IP is 2much4u?" -A "Googlebot/2.1 (+http://www.google.com/bot.html)"
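RedType's suggestion as a minimal Python sketch with the `requests` library; the URL is illustrative, and nothing in the log confirms whether Tumblr actually honours a spoofed Googlebot user agent:

```python
import requests

# Minimal sketch of the user-agent spoof suggested above: present the request
# as Googlebot and see whether the block / rate limit still applies.
# The URL is illustrative; success against Tumblr is not guaranteed.
HEADERS = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}

resp = requests.get("http://example.tumblr.com/", headers=HEADERS, timeout=30)
print(resp.status_code)
```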
20:18 🔗 DFJustin youtube-dl with default parameters gets 720p (silently!)
20:20 🔗 pluesch will try that
20:23 🔗 DFJustin ah --format bestvideo+bestaudio looks like it works, must be new
20:24 🔗 DFJustin before you had to manually specify the correct numbers
20:25 🔗 danneh_ yeah, that'll get 1080p with youtube-dl, but you need to merge the audio/video streams with some other tool later
20:25 🔗 DFJustin the dependency on ffmpeg is also problematic; when I tried to get this working with sketchcow before, his system ffmpeg was broken somehow
20:25 🔗 DFJustin and in my case (windows) it doesn't seem to like unicode filenames
20:26 🔗 danneh_ I believe, last time I checked you had to merge it manually at least
20:26 🔗 DFJustin it does call out to ffmpeg to merge them
20:26 🔗 danneh_ oh, that's shiny
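The flag DFJustin and danneh_ are discussing, as a minimal sketch using youtube-dl's embedded Python API instead of the CLI; it still needs ffmpeg (or avconv) on the PATH to merge the separate 1080p video and audio streams, and the output template shown is just a default-style one:

```python
import youtube_dl

# Equivalent of `youtube-dl --format bestvideo+bestaudio <url>`: download the
# best video-only and audio-only streams and let ffmpeg merge them.
ydl_opts = {
    "format": "bestvideo+bestaudio",
    "outtmpl": "%(title)s-%(id)s.%(ext)s",
}

with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=t8YXut6_56c"])
```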
20:27 🔗 danneh_ pluesch: what're you using to download tumblr blogs?
20:27 🔗 danneh_ I've downloaded a few and haven't really had any issues, just going through and wgetting the thing
20:28 🔗 danneh_ had to modify the wget source code to grab another tag properly though (could probably be done via the lua extensions, but I haven't looked into that yet)
20:29 🔗 pluesch danneh_: https://github.com/bbolli/tumblr-utils tumblr_backup.py with a few modifications
20:30 🔗 pluesch will release it soon
20:30 🔗 danneh_ ah, fair enough
20:31 🔗 pluesch and i have created a script that goes through all notes to create a list of blogs
20:31 🔗 pluesch ^^
20:31 🔗 pluesch "all notes" == the 4980 notes per post
20:32 🔗 danneh_ aha, nice
20:32 🔗 danneh_ if you don't hit the api they let you download at full speed in my experience
20:33 🔗 pluesch it hits the old api (blogname.tumblr.com/api/read) and that's the problem i guess
20:33 🔗 pluesch media download isn't a problem yeah :)
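A minimal sketch of paging through the old v1 endpoint that tumblr_backup.py hits, assuming Python with the `requests` library; the blog name, page size, and sleep interval are illustrative, and the sleep is just the naive way to stay under whatever rate limit triggered the IP block:

```python
import time
import requests

BLOG = "example.tumblr.com"   # hypothetical blog name

# Page through http://<blog>/api/read with start/num until no more <post>
# elements come back; real code would parse the XML and save posts/media.
start = 0
while True:
    resp = requests.get("http://%s/api/read" % BLOG,
                        params={"start": start, "num": 50}, timeout=30)
    resp.raise_for_status()
    if "<post " not in resp.text:
        break
    # ... parse resp.text (XML) and queue media URLs here ...
    start += 50
    time.sleep(1)   # crude politeness delay
```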
20:34 🔗 RedType hmm... is there a tumblr screenreading based api or something?
20:34 🔗 RedType or just hitting their ajax endpoint?
20:34 🔗 RedType err what im asking is if anyone has written such a thing
20:35 🔗 danneh_ I'll try to clean up my script and throw it online sometime
20:35 🔗 danneh_ also includes a little webserver to host blogs after they've been downloaded for fun
20:57 🔗 raylee LOL
20:57 🔗 raylee http://partners.disney.com/throwback
