[00:01] nico_: I fed those all to archivebot already
[00:01] (pretty much right when the news broke)
[00:01] if you want to do it anyway feel free
[00:02] What news?
[00:02] he's dead
[00:03] oh
[00:03] (french) https://linuxfr.org/news/deces-de-cedric-blancher-chercheur-en-securite-informatique
[00:04] (english) http://www.theregister.co.uk/2013/11/12/cdric_sid_blancher_dead_at_37/
[00:04] if you want more, we need to go to -bs
[05:46] SketchCow: i found out the webstock 2013 videos were released
[05:47] going to make a collection for them
[07:48] 13.15 <@Nemo_bis> where is odie5533 when one needs him :) https://code.google.com/p/wikiteam/issues/detail?id=78
[07:49] uh he's not here, stupid page up :)
[07:52] Page-up has been broken on my laptop for a while, irssi has been stressful
[08:12] Nemo_bis: hmm?
[08:13] I didn't write the wikiteam scripts.
[08:14] Nemo_bis: I think the wikiteam script works by using urllib to get a list of urls, then wget for the actual download. If I were writing it, I'd write the url-grabbing part as a Scrapy project that outputs urls.
[08:15] Nemo_bis: well, it's hitting a redirect loop. Can you give me a url that this occurs on?
[08:16] And in any case, the grabber should catch the HTTPError and just continue grabbing the other urls.
[08:18] odie5533_: I know. Some days ago you asked me to provide URLs where the problems happen, here you go. :)
[08:18] They are in the bug report, just add some dots to the domain names in dirs
[08:18] eh... can you give me a link?
[08:19] I'm really not sure what you mean, since that won't give me a url to one of the images which seem to be having the problem
[08:20] None of the domains seem to exist. e.g. http://amzwikkii.com/
[08:21] Nemo_bis: let's talk in #wikiteam
[08:44] Question about vimeo
[08:45] i'm looking at the vimeo webstock archive and you can't download the original video file even though there is a link
[08:47] what's funny is if you hover over the link you get this message: "This source file has been lovingly stored in our archive."
[08:47] but you can freaking download it
[08:47] *can't
[08:57] i also found out that more videos from webstock 2011 were released more recently
[08:57] maybe they moved them all to tapes in order to save money :D
[09:02] maybe, but it's very weird
[09:03] cause some videos in that area still have the original links working just fine
[09:04] anyways d-addicts.com wiki dump is done downloading
[09:04] making a 7z file of it
[09:12] still uploading at 50 KB/s average to s3...
[09:16] That seems really low
[10:36] I'm really pounding s3.
[11:45] SketchCow: and derivers too now? :)
[11:46] 2,391,921,500 KB so far, I hope we'll have some slice of s3 available for us too soon :P
[13:09] AAAAAAND WE'RE OFF: http://tracker.archiveteam.org/hyves/
[13:49] Is FOS getting this fun?
[14:16] SketchCow: I think so
[14:16] S[h]O[r]T has offered space too
[15:02] SketchCow: I'm not sure if FOS has been added as a target yet
[15:02] the initial target was icefortress, awaiting FOS space to be freed
[15:12] GLaDOS: VERSION = "20131116.02_" + subprocess.check_output(["git", "rev-parse", "HEAD"]).strip()[:7]
[15:12] nico_: I saw it (i'm also twrist)
[15:12] (needs an import subprocess before it)
[18:14] http://bpaste.net/show/gWTMl6R6j3bdSgAuFHZj/ I used this for ripping a table from Wikipedia as CSV. Maybe someone else here might find this useful.
[18:15] It doesn't cope with there being two tables matching the selector on the table, but a minor set of modifications could make it do that.
[18:16] *matching the selector on the page
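A minimal sketch of the table-ripping idea from 18:14, since the bpaste paste has expired and the original script isn't shown here. It pulls the first HTML table matching a CSS selector and writes it out as CSV; the example page URL, the selector, and the choice of requests + BeautifulSoup are assumptions, not necessarily what the original used.

    # Hypothetical sketch, not the original bpaste script.
    # Grabs the first table matching `selector` and prints it as CSV.
    import csv
    import sys

    import requests
    from bs4 import BeautifulSoup

    url = "https://en.wikipedia.org/wiki/List_of_HTTP_status_codes"  # example page
    selector = "table.wikitable"                                     # example selector

    html = requests.get(url).text
    table = BeautifulSoup(html, "html.parser").select_one(selector)

    writer = csv.writer(sys.stdout)
    for row in table.find_all("tr"):
        cells = row.find_all(["th", "td"])
        writer.writerow(cell.get_text(strip=True) for cell in cells)

Like the script described above, this only handles the first table matching the selector on the page.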
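On the 08:16 point about the grabber catching the HTTPError and moving on: a minimal sketch of that pattern, assuming urllib is doing the fetching as described at 08:14. The URL list and the save-to-disk step are placeholders, not wikiteam code.

    # Sketch only: skip failing URLs instead of aborting the whole grab.
    from urllib.error import HTTPError, URLError
    from urllib.request import urlopen

    urls = ["http://example.com/img/a.png", "http://example.com/img/b.png"]  # placeholders

    for url in urls:
        try:
            data = urlopen(url, timeout=30).read()
        except (HTTPError, URLError) as err:
            print("skipping %s: %s" % (url, err))
            continue
        # ...write `data` to disk here...
        print("fetched %s (%d bytes)" % (url, len(data)))

A redirect loop like the one reported in the wikiteam issue would typically surface as one of these errors and be skipped rather than killing the run.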
[18:55] uploaded: https://archive.org/details/wikid_addictscom_static-20131115-wikidump
[22:10] hmpf, people disabling the ability to submit Issues to stuff on GitHub...
[22:23] Does anyone know what a CDX warcinfo/request entry is supposed to look like when the filename has a space in it?
[22:27] The Python program CDX-Writer formats the massaged url as 'warcinfo:/output file.warc.gz/version'. Should there really be a space in the middle of a field in an otherwise space-separated file? None of the test cases for CDX-Writer have spaces in the file names, so perhaps it was overlooked.
[22:31] And the author disabled the Issues section on GitHub and doesn't list an email, so I can't even contact them.
[22:37] you can usually get an email by cloning the repo and looking at their commits
[22:43] Archive Team Bot GO! The Third: https://archive.org/details/archiveteam_archivebot_go_003
[22:46] SketchCow: did the plugins.jetbrains.com warc get nuked or are you specially handling it?
[22:49] it's from around Oct 23
[22:51] it ran into the 40GB limit, so maybe something went wrong with the rsync
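On the 22:37 suggestion, a minimal sketch of pulling author addresses out of a cloned repo's commit history. The repo path is a placeholder, and using Python's subprocess here just mirrors the VERSION snippet from 15:12; running git log --format="%an <%ae>" directly inside the clone does the same thing.

    # Sketch: list distinct commit author names/emails from a local clone.
    # Assumes git is on PATH; the repo path is a placeholder.
    import subprocess

    repo = "/path/to/cloned/repo"
    out = subprocess.check_output(["git", "log", "--format=%an <%ae>"], cwd=repo)
    for author in sorted(set(out.decode().splitlines())):
        print(author)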