[00:24] Arkiver2: the only subtitles i found are in dutch
[03:42] hey #archiveteam! bukkit, a minecraft server / modding platform, is dead/dying due to a licensing conflict. the conflicting software is all gone due to DMCA, but their websites might still have plenty of data useful for the rest of the modding community
[03:44] dx: we got a copy of the bukkit wiki via archivebot
[03:44] if you've got other sites, please feel free to drop by #archivebot
[03:44] neat
[03:44] what about this? http://dev.bukkit.org/bukkit-plugins/
[03:45] pages in there could be archived through archivebot, but what about plugin .jars?
[03:45] pages would be useful because there's plenty of documentation in that site
[03:45] one sec, checking
[03:46] if the jars are linked, we can get them
[03:47] hmm, they seem like direct links http://dev.bukkit.org/bukkit-plugins/nametags/files/
[03:49] the forums
[03:49] the plugins would be useful
[03:49] I have clones of the DMCA'd repos
[03:49] but I don't know what the legality on that is :p
[03:50] as illegal as it always was
[03:50] it's only down now because they are acting on it
[07:35] so uhhh, there's an api for dev.bukkit.org, i scraped all the plugin download links with it
[07:36] so now i've got this huge json file with some metadata and a bunch of direct download links, sample of the first 100: http://dump.dequis.org/mF7xj.txt
[07:37] no filesize info or uploaded timestamps, could grab that with HEAD requests maybe.
[07:37] full dump http://dequis.org/bukkitdev-releases.json.gz - 3.4mb gzipped, 27mb uncompressed
[07:41] all file urls one per line, no filtering at all, http://dequis.org/bukkitdev_all_file_urls.txt - 4.5mb, 75494 lines
[07:42] it's an absurd amount of files, most of them not worth saving
[07:43] how do i reduce this? for some projects it's not as trivial as grabbing the latest release, they have several parallel releases for a single version
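(editor's note: dx's idea at 07:37 — filling in the missing filesize and timestamp info with HEAD requests — is straightforward to sketch. The snippet below is a minimal illustration, not something actually run against the site: it assumes Python 3 with the requests library, the URL list from 07:41 saved locally as bukkitdev_all_file_urls.txt, and a made-up output filename.)

    import csv
    import time
    import requests

    # For each URL in the flat list, record Content-Length and Last-Modified
    # so the dump can later be filtered by size/date before archiving.
    with open("bukkitdev_all_file_urls.txt") as urls, \
         open("file_metadata.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["url", "content_length", "last_modified"])
        for url in (line.strip() for line in urls):
            if not url:
                continue
            try:
                head = requests.head(url, allow_redirects=True, timeout=30)
                writer.writerow([url,
                                 head.headers.get("Content-Length", ""),
                                 head.headers.get("Last-Modified", "")])
            except requests.RequestException as exc:
                writer.writerow([url, "error", exc])
            time.sleep(0.2)  # ~75k URLs; don't hammer the CDN

(with size and date per file in hand, the "how do i reduce this?" question at 07:43 becomes a filtering job over that CSV, e.g. keeping only the newest file per plugin.)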
[07:46] Q: just installed Virtualbox and want to import a fine (appliance) how do I get the files?
[07:46] a file
[08:47] Litus1960: http://tracker.archiveteam.org/ click the download link
[08:55] @Midas I got the machine how do I import the files?
[08:58] http://archiveteam.org/index.php?title=Warrior
[09:00] In VirtualBox, click File > Import Appliance and open the file. -> what file do I open?
[09:00] ....?
[09:00] the file you just downloaded
[09:01] I import the file that I just installed the virtual machine with?
[09:03] you import the .ova
[09:04] Litus1960: http://archive.org/download/archiveteam-warrior/archiveteam-warrior-v2-20121008.ova
[09:04] then in virtualbox > file > import
[09:07] ok did that, now how do I run a job?
[09:07] it shows an IP address in the console
[09:08] it explains it here: http://tracker.archiveteam.org/
[09:11] got it
[09:11] Litus1960: welcome and thanks for helping out :)
[09:15] sorry bud, thank you for helping out :D was so busy trying to comprehend the poop not a nerdy person ;-)
[09:18] http://www.classiccomputing.com/CC/Blog/Entries/2014/9/7_Computer_History_Nostalgia_Podcasts.html
[09:23] followed with @FlexMind
[09:26] bye for now
[09:50] y'all have probably seen this, but: http://blog.twitpic.com/2014/09/twitpic-is-shutting-down/
[09:51] danneh_: Yeah. There's a project channel at #quitpic
[09:52] ersi: Ah, fair enough. Thanks for being so quick and awesome!
[09:53] np!
[09:53] is anyone going to those Dutch meetups about digital preservation tomorrow? http://www.dezwijger.nl/115739/nl/tegenlicht-meet-up-23-digitaal-geheugenverlies
[09:54] joepie91, midas ^
[09:58] I wish i could, but work and job interviews
[10:03] I might be able to go, but I'm a bit reluctant about walking around there and going "Hey guys, I'm with Archive Team!"
[10:04] I don't want to be mistaken for a representative, as I'm not really a good PR person
[10:05] simple answer, dont :p
[10:07] but I also want to stir up discussion about digital preservation, and if people ask "how are you connected to this stuff?"
[10:09] Shall I stick with Jason's "Teenagers and crazy people" description for bands of rogue archivists like AT? https://decorrespondent.nl/1695/Waarom-deze-man-het-hele-internet-downloadt/56475705-f10825bc
[10:12] I believe he said "maniacs" :)
[10:12] antomatic: figures, it was translated into dutch in that interview
[10:13] havent watched the entire documentary yet
[14:36] Guys, I'm going to the meetup tomorrow evening in Amsterdam
[14:37] ^Muad-Dib
[14:39] If I can free the time, I'll be there too
[14:41] i'd like to be there, but the day after that i have a job interview abroad.. so ill pass
[14:42] but do remind them that we grabbed the announcement already Muad-Dib ;)
[14:43] midas: I'd like to tell people about the projects we are currently doing
[14:43] and the problems the archiveteam is usually experiencing when archiving websites with warrior
[14:44] However, it would be great if someone of you can also be there
[14:44] :)
[14:51] I'd like to go but I'm 6.5k km away
[14:51] dx: I'm getting the sizes of everything right now, just so you know
[14:53] dx: There is nothing that is not worth saving!!
[14:54] If the list of files is below, say, 200 GB we can do it with archivebot
[14:58] added it to archivebot
[14:58] should be done in a day
[14:59] whoa!
[14:59] just came home and you're already archiving the whole thing :D
[14:59] dx: http://archivebot.com/ first one
[15:00] Arkiver2: thanks :D
[15:00] if you have any other list of files/pages please give it and it will be saved
[15:01] Arkiver2: all the pages under http://dev.bukkit.org/bukkit-plugins/ - mostly documentation of those plugins, they also link to the jars in that list so you'll want to exclude that.
[15:02] dx: all of dev.bukkit.org is being saved: http://archivebot.com/ (third one)
[15:02] Arkiver2: it won't download jars twice, right?
[15:03] also, :D
[15:04] dx: if the JARs have N URLs, they will be downloaded N times
[15:05] yipdw: each url is a different version, but they are being downloaded from both my file list and the dev.bukkit.org recursive task
[15:06] I'd say just leave it as it's going now, it won't be too big
[15:06] hmm
[15:08] hi
[15:08] jules: hello
[15:08] yesterday i took a random sample of 50 of them, got average 140kbytes, max 4mbytes, so uhhh, somewhere between 10gb and 300gb
[15:08] but i have no idea what's "too big" for you guys :P
[15:09] nothing is too big for us
[15:09] well, there is a limit kind of
[15:09] yeah i saw the twitch wiki page
[15:09] MobileMe was more than 200 TB
[15:09] whoa.
[15:09] so it should be fine ;)
[15:10] :D
[15:10] thanks a lot!
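(editor's note: the back-of-envelope numbers at 15:08 check out: 75,494 files × 140 KB ≈ 10 GiB at the sampled average, and × 4 MB ≈ 295 GiB at the sampled maximum, hence "somewhere between 10gb and 300gb". A minimal sketch of that sampling estimate, again assuming Python 3 with requests and the locally saved URL list; the sample size of 50 matches dx's, everything else is illustrative.)

    import random
    import requests

    with open("bukkitdev_all_file_urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    # HEAD a random sample and extrapolate from the observed sizes.
    sizes = []
    for url in random.sample(urls, 50):
        head = requests.head(url, allow_redirects=True, timeout=30)
        length = head.headers.get("Content-Length")
        if length:
            sizes.append(int(length))

    mean = sum(sizes) / len(sizes)
    print("sampled mean: %d bytes" % mean)
    print("estimated total: %.1f GiB" % (mean * len(urls) / 1024**3))
    print("pessimistic total: %.1f GiB" % (max(sizes) * len(urls) / 1024**3))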
[15:10] dx: if that's the case, please ignore all jars from the dev.bukkit.org job
[15:10] Arkiver2: ^
[15:10] there's "having lots of space" and "not being brain-dead" and two identical copies are the latter
[15:10] yup!
[15:10] we can do that too
[15:11] hopefully dev.bukkit.org doesn't use Java applets
[15:11] haha no
[15:11] that'll make the obvious ignore pattern a bit trickier
[15:11] it's not a website from the 90s luckily
[15:13] yipdw: maybe ignore everything from servermods.cursecdn.com/
[15:13] done
[15:13] as all files are from there
[15:13] ah that's even better
[15:13] thanks
[17:53] i hate sites that are impossible to archive...
[18:01] nothing is impossible ;)
[18:01] it's just a challenge!
[18:02] unless they are already dead
[18:03] ... okay, point taken.
[18:03] :P
[18:17] joepie91: tumblr.com :D
[18:17] they limit everything
[18:17] for example: notes are limited to 4980 and I don't know why
[18:18] so if there is a post with lets say 10000 notes, it is impossible to find out all likes, reblogs
[18:19] sounds like they tried to limit it to 250 pages of 20, but got it wrong
[18:23] off-by-one, whoo
[18:23] pluesch: how is the remainder normally visible?
[18:23] surely they don't *completely* hide them
[18:40] joepie91: what do you mean?
[18:40] it's not visible, the data just gets lost in their database
[18:40] uhhh I hate it when data gets lost
[18:43] did someone say mongodb on 32bit?
[18:49] pluesch: I suspect it can still be accessed
[18:49] just through a different method
[18:51] joepie91: I've tried many ways ... api v1, api v2, normal page
[18:52] but yeah ... still have to check some things (tumblr android app api calls, "undocumented" api functions)
[19:12] of general interest: https://github.com/espes/jdget
[19:12] if anyone ever needs to pull a bunch of links from file lockers
[19:13] ping me and maybe I'll update / maintain it
[19:23] even with tumblr android app the 4980 limit is there....
[19:23] -.-
[19:54] pluesch: is it possible that the notes displayed differ depending on what reblog you're looking at
[19:54] I think they don't
[19:54] similar to how twitter sometimes shows an entirely different conversation flow on a different reply
[19:55] tumblr notes are attached to the thread root
[19:55] bah
[19:55] * joepie91 crumples up paper and aims at circular filing bin
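(editor's note: the arithmetic fits joepie91's off-by-one guess at 18:19/18:23: notes pages show 20 entries each, so a 250-page cap would expose 250 × 20 = 5,000 notes, while the observed ceiling of 4,980 is exactly 249 × 20 — one page short. The snippet below only illustrates how such a bug could arise; it is an inference from the numbers in the log, not anything confirmed about tumblr's code.)

    # hypothetical: a cap meant to allow note pages 1..250, but written with an
    # exclusive bound, stops one page early
    PER_PAGE = 20
    intended = sum(PER_PAGE for page in range(1, 251))  # 5000 notes
    buggy    = sum(PER_PAGE for page in range(1, 250))  # 4980 notes, the observed limit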
[20:05] espes__: ooooh
[20:06] does it do youtube? jdownloader2 is one of the few tools I've found that will correctly grab 1080p videos
[20:06] joepie91: nope, checked
[20:06] so
[20:06] what should i do now?
[20:06] but I can't really recommend it to people because it's such a heap
[20:06] ask tumblr to improve there api? XD
[20:07] their*
[20:07] DFJustin: youtube-dl can 1080p too?
[20:07] bestvideo+bestaudio
[20:12] DFJustin: huh? youtube-dl does 1080p fine
[20:12] yeah but it's not the default and you need ffmpeg
[20:12] also, so does freerapid afaik
[20:12] ...?
[20:12] DFJustin: link me a 1080p video?
[20:13] youtube-dl usually works for me
[20:15] tumblr has just blocked my ip :/
[20:15] damn.
[20:15] I just wanted to mirror 160 blogs
[20:16] did you try saying you're googlebot?
[20:16] https://www.youtube.com/watch?v=t8YXut6_56c
[20:17] "oh what's that, you need me to register before viewing or my IP is 2much4u?" -A "Googlebot/2.1 (+http://www.google.com/bot.html)"
[20:18] youtube-dl with default parameters gets 720p (silently!)
[20:20] will try that
[20:23] ah --format bestvideo+bestaudio looks like it works, must be new
[20:24] before you had to manually specify the correct numbers
[20:25] yeah, that'll get 1080p with youtube-dl, but you need to merge the audio/video streams with some other tool later
[20:25] the dependency on ffmpeg is also problematic; when I tried to get this working with sketchcow before, his system ffmpeg was broken somehow
[20:25] and in my case (windows) it doesn't seem to like unicode filenames
[20:26] I believe, last time I checked, you had to merge it manually at least
[20:26] it does call out to ffmpeg to merge them
[20:26] oh, that's shiny
[20:27] pluesch: what're you using to download tumblr blogs?
[20:27] I've downloaded a few and haven't really had any issues, just going through and wgetting the thing
[20:28] had to modify the wget source code to grab another tag properly though (could probably be done via the lua extensions, but I haven't looked into that yet)
[20:29] danneh_: https://github.com/bbolli/tumblr-utils tumblr_backup.py with a few modifications
[20:30] will release it soon
[20:30] ah, fair enough
[20:31] and i have created a script that goes through all notes to create a list of blogs
[20:31] ^^
[20:31] "all notes" == the 4980 notes per post
[20:32] aha, nice
[20:32] if you don't hit the api they let you download at full speed in my experience
[20:33] it hits the old api (blogname.tumblr.com/api/read) and that's the problem i guess
[20:33] media download isn't a problem yeah :)
[20:34] hmm... is there a tumblr screenreading based api or something?
[20:34] or just hitting their ajax endpoint?
[20:34] err what im asking is if anyone has written such a thing
[20:35] I'll try to clean up my script and throw it online sometime
[20:35] also includes a little webserver to host blogs after they've been downloaded for fun
[20:57] LOL
[20:57] http://partners.disney.com/throwback
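(editor's note: pluesch's setup above goes through the old v1 read API (blogname.tumblr.com/api/read), the same endpoint tumblr_backup.py pages through. The sketch below shows roughly what that paging looks like; it is not pluesch's script or tumblr_backup.py itself — the start/num parameters and the XML layout are recalled from the old v1 API documentation and may have changed, the example blog name is arbitrary, and the delay is there because, as at 20:15, hitting this endpoint too fast gets an IP blocked.)

    import time
    import requests
    import xml.etree.ElementTree as ET

    def read_api_posts(blog, page_size=50, delay=2.0):
        """Yield <post> elements from the old http://<blog>.tumblr.com/api/read endpoint."""
        start = 0
        while True:
            resp = requests.get("http://%s.tumblr.com/api/read" % blog,
                                params={"start": start, "num": page_size},
                                timeout=30)
            resp.raise_for_status()
            posts_el = ET.fromstring(resp.content).find("posts")
            posts = posts_el.findall("post") if posts_el is not None else []
            if not posts:
                break
            for post in posts:
                yield post
            start += page_size
            time.sleep(delay)  # the endpoint is throttled; going too fast risks an IP block

    for post in read_api_posts("staff"):  # "staff" is just an example blog
        print(post.get("id"), post.get("type"), post.get("url"))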