#archiveteam 2014-09-08,Mon


Time Nickname Message
00:24 🔗 godane Arkiver2: the only subtitles i found are in dutch
03:42 🔗 dx hey #archiveteam! bukkit, a minecraft server / modding platform, is dead/dying due to a licensing conflict. the conflicting software is all gone due to DMCA, but their websites might still have plenty of data useful for the rest of the modding community
03:44 🔗 yipdw dx: we got a copy of the bukkit wiki via archivebot
03:44 🔗 yipdw if you've got other sites, please feel free to drop by #archivebot
03:44 🔗 dx neat
03:44 🔗 dx what about this? http://dev.bukkit.org/bukkit-plugins/
03:45 🔗 dx pages in there could be archived through archivebot, but what about plugin .jars?
03:45 🔗 dx pages would be useful because there's plenty of documentation in that site
03:45 🔗 yipdw one sec, checking
03:46 🔗 yipdw if the jars are linked, we can get them
03:47 🔗 dx hmm, they seem like direct links http://dev.bukkit.org/bukkit-plugins/nametags/files/
03:49 🔗 balrog the forums
03:49 🔗 balrog the plugins would be useful
03:49 🔗 balrog I have clones of the DMCA'd repos
03:49 🔗 balrog but I don't know what the legality on that is :p
03:50 🔗 dx as illegal as it always was
03:50 🔗 dx it's only down now because they are acting on it
07:35 🔗 dx so uhhh, there's an api for dev.bukkit.org, i scraped all the plugin download links with it
07:36 🔗 dx so now i've got this huge json file with some metadata and a bunch of direct download links, sample of the first 100: http://dump.dequis.org/mF7xj.txt
07:37 🔗 dx no filesize info or uploaded timestamps, could grab that with HEAD requests maybe.
07:37 🔗 dx full dump http://dequis.org/bukkitdev-releases.json.gz - 3.4mb gzipped, 27mb uncompressed
07:41 🔗 dx all file urls one per line, no filtering at all, http://dequis.org/bukkitdev_all_file_urls.txt - 4.5mb, 75494 lines
07:42 🔗 dx it's an absurd amount of files, most of them not worth saving
07:43 🔗 dx how do i reduce this? for some projects it's not as trivial as grabbing the latest release, they have several parallel releases for a single version
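A minimal sketch of the HEAD-request size check dx mentions at 07:37, assuming Python with the `requests` library and the URL list linked at 07:41; it only counts files whose server reports a Content-Length:

```python
# Minimal sketch: sum Content-Length over the URL list
# (bukkitdev_all_file_urls.txt, 75494 lines) to estimate total size.
# Assumes the `requests` library; URLs without a usable header are skipped.
import requests

total_bytes = 0
with open("bukkitdev_all_file_urls.txt") as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        try:
            resp = requests.head(url, allow_redirects=True, timeout=30)
            total_bytes += int(resp.headers.get("Content-Length", 0))
        except (requests.RequestException, ValueError):
            continue  # unreachable file or missing/garbled header

print("approximate total: %.1f GB" % (total_bytes / 1e9))
```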
07:46 🔗 Litus1960 Q: just installed Virtualbox and want to import a fine (appliance) how do I get the files?
07:46 🔗 Litus1960 a file
08:47 🔗 midas Litus1960: http://tracker.archiveteam.org/ click the download link
08:55 🔗 Litus1960 @Midas I got the machine how do I import the files?
08:58 🔗 Rotab http://archiveteam.org/index.php?title=Warrior
09:00 🔗 Litus1960 In VirtualBox, click File > Import Appliance and open the file. -> what file do I open?
09:00 🔗 Rotab ....?
09:00 🔗 Rotab the file you just downloaded
09:01 🔗 Litus1960 I import the file that I just installed the virtual machine with?
09:03 🔗 Rotab you import the .ova
09:04 🔗 midas Litus1960: http://archive.org/download/archiveteam-warrior/archiveteam-warrior-v2-20121008.ova
09:04 🔗 midas then in virtualbox > file > import
09:07 🔗 Litus1960 ok did that now how do I run a job?
09:07 🔗 midas it shows an IP address in the console
09:08 🔗 midas it explains it here: http://tracker.archiveteam.org/
09:11 🔗 Litus1960 got it
09:11 🔗 schbirid Litus1960: welcome and thanks for helping out :)
09:15 🔗 Litus1960 sorry bud, thank you for helping out :D was so busy trying to comprehend the poop not a nerdy person ;-)
09:18 🔗 schbirid http://www.classiccomputing.com/CC/Blog/Entries/2014/9/7_Computer_History_Nostalgia_Podcasts.html
09:23 🔗 Litus1960 followed with @FlexMind
09:26 🔗 Litus1960 bye for now
09:50 🔗 danneh_ y'all have probably seen this, but: http://blog.twitpic.com/2014/09/twitpic-is-shutting-down/
09:51 🔗 ersi danneh_: Yeah. There's a project channel at #quitpic
09:52 🔗 danneh_ ersi: Ah, fair enough. Thanks for being so quick and awesome!
09:53 🔗 ersi np!
09:53 🔗 Muad-Dib is anyone going to those Dutch meetups about digital preservation tomorrow? http://www.dezwijger.nl/115739/nl/tegenlicht-meet-up-23-digitaal-geheugenverlies
09:54 🔗 Muad-Dib joepie91, midas ^
09:58 🔗 midas I wish i could, but work and job interviews
10:03 🔗 Muad-Dib I might be able to go, but I'm a bit reluctant about walking around there and going "Hey guys, I'm with Archive Team!"
10:04 🔗 Muad-Dib I don't want to be mistaken for a representative, as I'm not really a good PR person
10:05 🔗 midas simple answer, don't :p
10:07 🔗 Muad-Dib but I also want to stir up discussion about digital preservation, and if people ask "how are you connected to this stuff?"
10:09 🔗 Muad-Dib Shall I stick with Jason's "Teenagers and crazy people" description for bands of rogue archivists like AT? https://decorrespondent.nl/1695/Waarom-deze-man-het-hele-internet-downloadt/56475705-f10825bc
10:12 🔗 antomatic I believe he said "maniacs" :)
10:12 🔗 Muad-Dib antomatic: figures, it was translated into dutch in that interview
10:13 🔗 Muad-Dib havent watched the entire documentary yet
14:36 🔗 Arkiver2 Guys, I'm going to the meetup tomorrow evening in Amsterdam
14:37 🔗 Arkiver2 ^Muad-Dib
14:39 🔗 Muad-Dib If I can free the time, I'll be there too
14:41 🔗 midas i'd like to be there, but the day after that i have a job interview abroad.. so ill pass
14:42 🔗 midas but do remind them that we grabbed the announcement already Muad-Dib ;)
14:43 🔗 Arkiver2 midas: I'd like to tell people about the projects we are currently doing
14:43 🔗 Arkiver2 and the problems the archiveteam usually experiences when archiving websites with warrior
14:44 🔗 Arkiver2 However, it would be great if someone of you can also be there
14:44 🔗 Arkiver2 :)
14:51 🔗 vantec I'd like to go but I'm 6.5k km away
14:51 🔗 phuzion dx: I'm getting the sizes of everything right now, just so you know
14:53 🔗 Arkiver2 dx: There is nothing that is not worth saving!!
14:54 🔗 Arkiver2 If the list of files is below, say, 200 GB we can do it with archivebot
14:58 🔗 Arkiver2 added it to archivebot
14:58 🔗 Arkiver2 should be done in a day
14:59 🔗 dx whoa!
14:59 🔗 dx just came home and you're already archiving the whole thing :D
14:59 🔗 Arkiver2 dx: http://archivebot.com/ first one
15:00 🔗 dx Arkiver2: thanks :D
15:00 🔗 Arkiver2 if you have any other list of files/pages please give it and it will be saved
15:01 🔗 dx Arkiver2: all the pages under http://dev.bukkit.org/bukkit-plugins/ - mostly documentation of those plugins, they also link to the jars in that list so you'll want to exclude that.
15:02 🔗 Arkiver2 dx: all of dev.bukkit.org is being saved: http://archivebot.com/ (third one)
15:02 🔗 dx Arkiver2: it won't download jars twice, right?
15:03 🔗 dx also, :D
15:04 🔗 yipdw dx: if the JARs have N URLs, they will be downloaded N times
15:05 🔗 dx yipdw: each url is a different version, but they are being downloaded from both my file list and the dev.bukkit.org recursive task
15:06 🔗 Arkiver2 I'd say just leave it as it's going now, it won't be too big
15:06 🔗 dx hmm
15:08 🔗 jules hi
15:08 🔗 Arkiver2 jules: hello
15:08 🔗 dx yesterday i took a random sample of 50 of them, got average 140kbytes, max 4mbytes, so uhhh, somewhere between 10gb and 300gb
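For scale: the list has 75,494 URLs, so the sample average of ~140 kB gives roughly 75,494 × 140 kB ≈ 10.6 GB, while the sample maximum of 4 MB gives roughly 75,494 × 4 MB ≈ 300 GB, which is where the 10 GB to 300 GB range comes from.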
15:08 🔗 dx but i have no idea what's "too big" for you guys :P
15:09 🔗 Arkiver2 nothing is too big for us
15:09 🔗 Arkiver2 well, there is a limit kind of
15:09 🔗 dx yeah i saw the twitch wiki page
15:09 🔗 Arkiver2 MobileMe was more than 200 TB
15:09 🔗 dx whoa.
15:09 🔗 Arkiver2 so it should be fine ;)
15:10 🔗 dx :D
15:10 🔗 dx thanks a lot!
15:10 🔗 yipdw dx: if that's the case, please ignore all jars from the dev.bukkit.org job
15:10 🔗 dx Arkiver2: ^
15:10 🔗 yipdw there's "having lots of space" and there's "not being brain-dead", and keeping two identical copies fails the latter
15:10 🔗 dx yup!
15:10 🔗 Arkiver2 we can do that too
15:11 🔗 yipdw hopefully dev.bukkit.org doesn't use Java applets
15:11 🔗 dx haha no
15:11 🔗 yipdw that'll make the obvious ignore pattern a bit trickier
15:11 🔗 dx it's not a website from the 90s luckily
15:13 🔗 Arkiver2 yipdw: maybe ignore everything from servermods.cursecdn.com/
15:13 🔗 yipdw done
15:13 🔗 Arkiver2 as all files are from there
15:13 🔗 Arkiver2 ah that's even better
15:13 🔗 Arkiver2 thanks
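The ignore Arkiver2 and yipdw settle on, expressed as a regex in a small Python sketch; the pattern and the example URLs are illustrative, not the exact ones used in archivebot:

```python
import re

# Sketch of the ignore discussed above: the recursive dev.bukkit.org job skips
# anything served from servermods.cursecdn.com, since those files are already
# covered by the explicit file-list job.
IGNORE = re.compile(r"^https?://(?:[^/]*\.)?servermods\.cursecdn\.com/")

def should_fetch(url):
    """True if the recursive job should download this URL."""
    return IGNORE.match(url) is None

assert not should_fetch("http://servermods.cursecdn.com/files/123/456/some-plugin.jar")
assert should_fetch("http://dev.bukkit.org/bukkit-plugins/nametags/files/")
```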
17:53 🔗 pluesch i hate sites that are impossible to archive...
18:01 🔗 joepie91 nothing is impossible ;)
18:01 🔗 joepie91 it's just a challenge!
18:02 🔗 dx unless they are already dead
18:03 🔗 joepie91 ... okay, point taken.
18:03 🔗 joepie91 :P
18:17 🔗 pluesch joepie91: tumblr.com :D
18:17 🔗 pluesch they limit everything
18:17 🔗 pluesch for example: notes are limited to 4980 and I don't know why
18:18 🔗 pluesch so if there is a post with, let's say, 10000 notes, it is impossible to find out all likes, reblogs
18:19 🔗 xmc sounds like they tried to limit it to 250 pages of 20, but got it wrong
18:23 🔗 joepie91 off-by-one, whoo
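If the intended cap really were 250 pages of 20 notes each, that would be 5,000; the observed 4,980 works out to 249 × 20, exactly one page short, which is the off-by-one being joked about.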
18:23 🔗 joepie91 pluesch: how is the remainder normally visible?
18:23 🔗 joepie91 surely they don't *completely* hide them
18:40 🔗 pluesch joepie91: what do you mean?
18:40 🔗 pluesch it's not visible, the data just gets lost in their database
18:40 🔗 pluesch uhhh I hate it when data gets lost
18:43 🔗 midas did someone say mongodb on 32bit?
18:49 🔗 joepie91 pluesch: I suspect it can still be accessed
18:49 🔗 joepie91 just through a different method
18:51 🔗 pluesch joepie91: I've tried many ways ... api v1, api v2, normal page
18:52 🔗 pluesch but yeah ... still have to check some things (tumblr android app api calls, "undocumented" api functions)
19:12 🔗 espes__ of general interest: https://github.com/espes/jdget
19:12 🔗 espes__ if anyone ever needs to pull a bunch of links from file lockers
19:13 🔗 espes__ ping me and maybe I'll update / maintain it
19:23 🔗 pluesch even with tumblr android app the 4980 limit is there....
19:23 🔗 pluesch -.-
19:54 🔗 joepie91 pluesch: is it possible that the notes displayed differ depending on what reblog you're looking at
19:54 🔗 xmc I think they don't
19:54 🔗 joepie91 similar to how twitter sometimes shows an entirely different conversation flow on a different reply
19:55 🔗 xmc tumblr notes are attached to the thread root
19:55 🔗 joepie91 bah
19:55 🔗 * joepie91 crumples up paper and aims at circular filing bin
20:05 🔗 DFJustin espes__: ooooh
20:06 🔗 DFJustin does it do youtube? jdownloader2 is one of the few tools I've found that will correctly grab 1080p videos
20:06 🔗 pluesch joepie91: nope, checked
20:06 🔗 pluesch so
20:06 🔗 pluesch what should i do now?
20:06 🔗 DFJustin but I can't really recommend it to people because it's such a heap
20:06 🔗 pluesch ask tumblr to improve there api? XD
20:07 🔗 pluesch their*
20:07 🔗 pluesch DFJustin: youtube-dl can 1080p too?
20:07 🔗 pluesch bestvideo+bestaudio
20:12 🔗 joepie91 DFJustin: huh? youtube-dl does 1080p fine
20:12 🔗 DFJustin yeah but it's not the default and you need ffmpeg
20:12 🔗 joepie91 also, so does freerapid afaik
20:12 🔗 joepie91 ...?
20:12 🔗 joepie91 DFJustin: link me a 1080p video?
20:13 🔗 xmc youtube-dl usually works for me
20:15 🔗 pluesch tumblr has just blocked my ip :/
20:15 🔗 pluesch damn.
20:15 🔗 pluesch I just wanted to mirror 160 blogs
20:16 🔗 RedType did you try saying you're googlebot?
20:16 🔗 DFJustin https://www.youtube.com/watch?v=t8YXut6_56c
20:17 🔗 RedType "oh what's that, you need me to register before viewing or my IP is 2much4u?" -A "Googlebot/2.1 (+http://www.google.com/bot.html)"
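RedType's suggestion as a minimal Python sketch with the `requests` library; the URL is illustrative, and nothing in the log confirms whether Tumblr actually honours a spoofed Googlebot user agent:

```python
import requests

# Minimal sketch of the user-agent spoof suggested above: present the request
# as Googlebot and see whether the block / rate limit still applies.
# The URL is illustrative; success against Tumblr is not guaranteed.
HEADERS = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}

resp = requests.get("http://example.tumblr.com/", headers=HEADERS, timeout=30)
print(resp.status_code)
```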
20:18 🔗 DFJustin youtube-dl with default parameters gets 720p (silently!)
20:20 🔗 pluesch will try that
20:23 🔗 DFJustin ah --format bestvideo+bestaudio looks like it works, must be new
20:24 🔗 DFJustin before you had to manually specify the correct numbers
20:25 🔗 danneh_ yeah, that'll get 1080p with youtube-dl, but you need to merge the audio/video streams with some other tool later
20:25 🔗 DFJustin the dependency on ffmpeg is also problematic; when I tried to get this working with sketchcow before, his system ffmpeg was broken somehow
20:25 🔗 DFJustin and in my case (windows) it doesn't seem to like unicode filenames
20:26 🔗 danneh_ I believe, last time I checked you had to merge it manually at least
20:26 🔗 DFJustin it does call out to ffmpeg to merge them
20:26 🔗 danneh_ oh, that's shiny
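The flag DFJustin and danneh_ are discussing, as a minimal sketch using youtube-dl's embedded Python API instead of the CLI; it still needs ffmpeg (or avconv) on the PATH to merge the separate 1080p video and audio streams, and the output template shown is just a default-style one:

```python
import youtube_dl

# Equivalent of `youtube-dl --format bestvideo+bestaudio <url>`: download the
# best video-only and audio-only streams and let ffmpeg merge them.
ydl_opts = {
    "format": "bestvideo+bestaudio",
    "outtmpl": "%(title)s-%(id)s.%(ext)s",
}

with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=t8YXut6_56c"])
```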
20:27 🔗 danneh_ pluesch: what're you using to download tumblr blogs?
20:27 🔗 danneh_ I've downloaded a few and haven't really had any issues, just going through and wgetting the thing
20:28 🔗 danneh_ had to modify the wget source code to grab another tag properly though (could probably be done via the lua extensions, but I haven't looked into that yet)
20:29 🔗 pluesch danneh_: https://github.com/bbolli/tumblr-utils tumblr_backup.py with a few modifications
20:30 🔗 pluesch will release it soon
20:30 🔗 danneh_ ah, fair enough
20:31 🔗 pluesch and i have created a script that goes through all notes to create a list of blogs
20:31 🔗 pluesch ^^
20:31 🔗 pluesch "all notes" == the 4980 notes per post
20:32 🔗 danneh_ aha, nice
20:32 🔗 danneh_ if you don't hit the api they let you download at full speed in my experience
20:33 🔗 pluesch it hits the old api (blogname.tumblr.com/api/read) and that's the problem i guess
20:33 🔗 pluesch media download isn't a problem yeah :)
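A minimal sketch of paging through the old v1 endpoint that tumblr_backup.py hits, assuming Python with the `requests` library; the blog name, page size, and sleep interval are illustrative, and the sleep is just the naive way to stay under whatever rate limit triggered the IP block:

```python
import time
import requests

BLOG = "example.tumblr.com"   # hypothetical blog name

# Page through http://<blog>/api/read with start/num until no more <post>
# elements come back; real code would parse the XML and save posts/media.
start = 0
while True:
    resp = requests.get("http://%s/api/read" % BLOG,
                        params={"start": start, "num": 50}, timeout=30)
    resp.raise_for_status()
    if "<post " not in resp.text:
        break
    # ... parse resp.text (XML) and queue media URLs here ...
    start += 50
    time.sleep(1)   # crude politeness delay
```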
20:34 🔗 RedType hmm... is there a tumblr screenreading based api or something?
20:34 🔗 RedType or just hitting their ajax endpoint?
20:34 🔗 RedType err what im asking is if anyone has written such a thing
20:35 🔗 danneh_ I'll try to clean up my script and throw it online sometime
20:35 🔗 danneh_ also includes a little webserver to host blogs after they've been downloaded for fun
20:57 🔗 raylee LOL
20:57 🔗 raylee http://partners.disney.com/throwback
