[00:05] Do warriors receive a list of urls to download, or do they hunt for urls themselves? [00:18] pretty sure they get a list from the tracker. That way, everybody is trying different URLs [00:41] but then the tracker would need to have already crawled the site, right? [00:41] it seems like the site would be crawled twice then. how does the warrior help? [00:43] Would anyone happen to have an archived copy of the media files here? [00:43] https://web.archive.org/web/20040209025641/http://www.skycycleonline.com/media.html [01:15] odie5533_: usually we do a quick surface crawl to get valid id numbers and url formats, then fill in the tracker with things we've seen and things we've extrapolated [03:32] did anyone archive the video of that dude knocking over the boulder? there's lots of DMCA takedowns going around [03:38] what video? [03:38] this boy scout decided to knock over some million-year-old boulder to save children [03:38] I know of the one [03:40] Lord_Nigh: http://www.liveleak.com/view?i=727_1382054402 [03:40] I'm surprised he didn't somehow manage to crush himself. [03:41] yay glenn! [03:43] Lord_Nigh: magnet:?xt=urn:btih:C49EFD4BE3FBFA7FEB8C4ABF18FAE5A5ADEAB61D&dn=jackass%20topples%20200-million-year%20rock%20formation.mp4.mp4&tr=udp%3a%2f%2ftracker.openbittorrent.com%3a80%2fannounce&tr=udp%3a%2f%2ftracker.publicbt.com%3a80%2fannounce&tr=udp%3a%2f%2ftracker.ccc.de%3a80%2fannounce [03:46] wow [03:48] I archived it [04:07] hmm who had that handy script to reupload youtube-dl output to ia [04:08] sounds like something I'd write, but it isn't [04:09] I kinda wish I had a question-and-answer script to upload files to IA [04:11] does there even need to be question and answer? [04:11] found it http://code.google.com/p/emijrp/source/browse/trunk/scrapers/youtube2internetarchive.py [04:38] Why isn't that on github!? [04:40] Does emijrp ever come in here?
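[editor's note] The workflow described above — surface-crawl for valid IDs and URL formats, then extrapolate the rest of the keyspace for the tracker — can be sketched roughly like this. The URL pattern and the assumption that IDs are dense between the observed extremes are both hypothetical:

```python
import re

def extrapolate_ids(seen_urls):
    """Pull numeric IDs out of URLs found during a surface crawl,
    then fill in the gaps so the tracker can hand out the full range.
    The /item/<n> pattern is a made-up example, not a real site's format."""
    ids = sorted(int(m.group(1))
                 for u in seen_urls
                 for m in [re.search(r"/item/(\d+)", u)] if m)
    if not ids:
        return []
    # Assume IDs are dense between the lowest and highest we observed.
    return list(range(ids[0], ids[-1] + 1))

seen = ["http://example.com/item/3", "http://example.com/item/7"]
print(extrapolate_ids(seen))  # [3, 4, 5, 6, 7]
```

Each warrior then claims a slice of that list from the tracker, so the site is only crawled once in full even though a cheap surface pass happened first.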
[04:40] yeah [04:40] and probably just decided not to use github [04:41] How often does he come in here? [04:41] not sure [04:52] DFJustin: Did you use that script? And if so, to upload what? [05:01] so cause i'm nuts i found another tech podcast [05:02] called The Tech Report Podcast [05:02] good news is the rss feed looks like it has all the mp3s [05:02] makes downloading and pushing it easier [05:04] A New WikiDump has been made for the following Projects: https://archive.org/details/wiki-ftlwikicom https://archive.org/details/wiki-letsplaywikicom https://archive.org/details/wiki-lptwikicom and the big stuff https://archive.org/details/wiki-pcgamingwikicom [05:30] i also just found a podcast called hacker public radio [05:42] godane: and it's not on ia [05:42] Sounds like a project! [05:45] I haven't used it yet [05:45] would need to adapt it to upload already-downloaded things rather than pulling fresh [05:46] Do you upload every podcast you find? [05:49] Why not? [05:54] i will work on the tech report podcast for the moment [05:55] hacker public radio is released in mp3, spx and ogg [05:55] i'm grabbing the mp3 version since archive.org will make an ogg of that [05:55] JRWR: Sounds like a lot of work for stuff that's usually pretty low quality... but if you want to, I wouldn't stop you [05:58] well this is odd [05:58] Why does the wiki team's batch downloader do POST on images [05:59] that breaks NGINX [05:59] What do you mean? [05:59] 2607:5300:60:ad1::1 - - [24/Oct/2013:01:52:53 -0400] [pcgamingwiki.com] "POST /images/2/2e/Zen_Puzzle_Garden_cover.png HTTP/1.0" 405 166 "-" "Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0" [06:00] That's bad. [06:00] What are you using to dump the wiki? [06:00] https://code.google.com/p/wikiteam/source/browse/trunk/dumpgenerator.py [06:00] Who wrote it? [06:00] wow that's a long script.
[06:01] look at the revisions [06:01] nemo and emijrp [06:02] line 671 [06:02] line 671/1195 [06:02] perhaps [06:02] JRWR: What command did you use? [06:04] launcher.py wiki.txt [06:04] what's funny is that episode 1364 of hacker public radio talks about a vintage tech icon, the pay phone coin box [06:04] https://code.google.com/p/wikiteam/source/browse/trunk/batchdownload/launcher.py [06:05] i will go after the website for stuff that's not .ogg, spx, and mp3 just so we have the other stuff [06:07] so the launcher.py calls the dumpgenerator.py? [06:07] crazy. [06:07] Yep [06:07] it's meant for a big ol' list of wikis [06:08] JRWR: well [06:09] for a quick fix, just delete the ", data=..." stuff [06:09] so that the line reads: urllib.urlretrieve(url=url, filename='%s/%s' % (imagepath, filename2)) [06:09] might break other stuff though! :D [06:10] but that code is hackish since he's overriding urllib internals. bad bad bad! But I've done similar stuff before heh [06:11] http://bap.ece.cmu.edu/download/bap-0.8/ was released on oct 17 and taken down on oct 22; unsure why; it was also stored in a git repo at https://github.com/cmubap/bap which was taken down simultaneously; i'm in communication with someone who has a checkout of that git [06:11] there is some lawyer-related crap why it was taken down [06:11] oh my [06:11] sounds like a bittorrent mirror is in order [06:11] exactly [06:12] I'll be happy to seed it for some time :) [06:12] don't these lawyers know that code wants to be free?
:) [06:13] especially since the 0.7 code is still up at http://bap.ece.cmu.edu/download/bap-0.7/ though it looks like it may have been modified when everything else was taken down [06:13] listing: http://webcache.googleusercontent.com/search?q=cache:http://bap.ece.cmu.edu/download/bap-0.8/&ie=utf-8&oe=utf-8&rls=org.mozilla:en-US:official&client=firefox-a&gws_rd=cr&ei=TLpoUuKaC8v5kQfvpoBY [06:14] damn, who thought it was a good idea to do POSTs to get data [06:14] the code WAS released as gplv2... so once i get a copy i'm pretty sure i'm allowed to further distribute it... [06:14] JRWR: looks like it was done to fix the GET not working, oddly enough [06:14] Lord_Nigh: sort of. [06:14] lol [06:14] Not if it's illegal code [06:14] I'll watch the logs [06:15] afaik it's not illegal [06:15] If it's illegal to begin with, and they had no right to release it, then you have no right either [06:15] true [06:15] dat user agent [06:16] JRWR: yeah, I'm not sure why they didn't just use URLopenerUserAgent().retriever(...) [06:16] *retrieve [06:16] sounds like a rewrite is in order [06:17] Perhaps just a fix. If it were rewritten, I'd say change from urllib to Twisted. [06:17] also, I noticed it's borderline a DoS [06:18] it spams the fuck out of the webserver [06:18] that's not good. [06:18] also, it looks like the _urlopener, while looking a bit hackish, is actually recommended by the API docs. [06:18] same network, I'm getting 40 req/s [06:19] with Twisted I always use delays and set a max number of requests [06:19] I don't mind, but adding random delays and maybe some better user agents would work [06:19] JRWR: What is it doing, exactly? You give it a list of image urls? [06:19] no [06:20] it dumps the ENTIRE contents of a wiki [06:20] XML + Images [06:21] It does the XML first, right? [06:21] Yes [06:21] uses the API to pull it all [06:22] Do you do a lot of wiki archiving? [06:22] I own a Very LARGE wiki farm [06:22] What does that mean?
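[editor's note] The GET-first behavior proposed in the discussion above (nginx answers POST on static files with 405, so POST should only ever be a fallback) could look something like this. This is a sketch, not the actual dumpgenerator.py patch; the helper name and fallback shape are assumptions:

```python
import urllib.error
import urllib.request

def fetch_image(url, filename, post_data=b""):
    """Fetch an image with a plain GET, falling back to POST on failure.

    dumpgenerator.py defaulted to POST, which nginx rejects with 405 when
    the target is a plain static file; trying GET first avoids that.
    """
    try:
        with urllib.request.urlopen(url) as resp:  # plain GET
            data = resp.read()
    except urllib.error.HTTPError:
        # Some wikis reportedly only answered POST, hence this fallback.
        req = urllib.request.Request(url, data=post_data)
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
    with open(filename, "wb") as f:
        f.write(data)
```

The original script passed `data=` unconditionally, which forces every request to be a POST; dropping that argument on the first attempt is the whole fix.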
[06:22] and I hate messing with the database, my caches love me :) [06:23] and well, I broke my own dump scripts, the ones they include with mediawiki [06:23] even now I'm dumping a 2G XML file [06:25] Yay! it's working [06:25] all 2200 images [06:30] oh god FTLWiki is Huge [06:38] ah, that's more like it https://archive.org/details/wiki-pcgamingwikicom [06:39] odie5533: no, only emijrp; I just do some small changes [06:40] JRWR: what do you mean that it breaks nginx? [06:40] it 405s on "true" files [06:40] if you try and do a POST on them [06:40] perhaps we should try both then [06:41] I would try GETs first, then POSTs [06:41] apparently POST was used because in some cases GET requests didn't work, according to the comment [06:41] yeah, sure; wanna submit a patch? :) [06:41] uhhh..... me + python = bwhahah [06:54] this sucks [06:54] looks like there is already a collection [06:55] but it was done badly and is out of date [06:55] this is about hacker public radio [06:57] i may have to redo the first two items i have uploaded [06:57] add a _mp3 to the item names just so they will upload [06:58] some of the way this collection was done is sort of half-assed [06:58] like this item: https://archive.org/details/hpr1282 [06:58] Nemo_bis: just do GET. If GET doesn't work somewhere, someone can fix it to fall back to POST [06:58] it should only be hpr1282 in it [06:58] I don't think POST should ever be the default behavior. [06:58] but hpr1284 is also in it [07:00] Nemo_bis: have you read through all the code of dumpgenerator.py? Or have you only made tiny fixes to it? [07:01] odie5533: yes, I guess I read it all at some point in time [07:02] aren't there other scripts to generate backups of mediawiki sites? [07:03] there are, but this set is very nice as it does all the heavy lifting for you when it works [07:04] I just submitted three bugs [07:08] JRWR: It would probably help your issues if you gave the specific commands you used to reproduce the problem [07:08] "1.
Do a normal API based Full XML+Image Dump using SVN Trunk " [07:11] odie5533 added a comment [07:12] looks better [07:12] Is dumping wikis popular? [07:12] Or is dumping other stuff more popular? [07:12] somewhat [07:13] it's more common to find a wiki [07:13] since mediawikis are easy to set up and allow for content to be stored [07:13] I run PCGamingWiki.com (their servers) and well, 47k a day in visits is nice [07:28] JRWR: What do you use to view warc files? [07:36] http://www.magicthegatheringtactics.com/ is already down. I assume no one got a grab of it? [07:40] odie5533: it's not down for me [07:41] oh. won't load for me. someone should probably grab it since the game is shutting down [09:39] Does WARC support HTTP/1.1? [09:41] I guess it does by splitting up the requests/responses. [09:41] HTTP/1.1 makes things more complicated... [12:05] odie5533: so long as there's one or more responses to a given request, WARC/1.0 should be able to handle any such version of HTTP [12:06] correction, zero or more responses per one request [12:06] WARC will correctly capture a "no response received" situation [16:14] paging sketchcow / undersco2 - rsync to fos failing for lack of space on device [18:10] please bang on this and make sure you don't see any breakage or errors [18:10] http://archive.org/details/historicalsoftware [18:37] in-browser emulators? lynx won't touch it :P [18:39] undersco2: it's a bit weird pointing at the Spectrum version of Elite -- isn't the BBC version (the original) in the archive? [18:39] (Ian Bell actually recommends the NES version as the best 8-bit one...)
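[editor's note] The WARC point made above — each capture is stored as separate request and response records, with zero or more responses per request — can be illustrated with a toy grouping over already-parsed record headers. The dicts below are stand-ins for real WARC records, and the assumption that the response record carries `WARC-Concurrent-To` pointing at its request is one common arrangement, not the only one the spec allows:

```python
from collections import defaultdict

def pair_records(records):
    """Group parsed WARC records so each request keeps its responses.

    A request may legitimately end up with zero, one, or several
    responses; a request with an empty list is the "no response
    received" case WARC captures correctly.
    """
    by_request = defaultdict(list)
    requests = {}
    for rec in records:
        rid = rec["WARC-Record-ID"]
        if rec["WARC-Type"] == "request":
            requests[rid] = rec
            by_request.setdefault(rid, [])
        elif rec["WARC-Type"] == "response":
            # Assumed convention: the response names its request.
            by_request[rec["WARC-Concurrent-To"]].append(rec)
    return {rid: by_request[rid] for rid in requests}

recs = [
    {"WARC-Type": "request", "WARC-Record-ID": "<urn:uuid:1>"},
    {"WARC-Type": "response", "WARC-Record-ID": "<urn:uuid:2>",
     "WARC-Concurrent-To": "<urn:uuid:1>"},
    {"WARC-Type": "request", "WARC-Record-ID": "<urn:uuid:3>"},  # no response
]
paired = pair_records(recs)
print(len(paired["<urn:uuid:1>"]), len(paired["<urn:uuid:3>"]))  # 1 0
```

Because the pairing lives in headers rather than in byte layout, HTTP/1.1 features like persistent connections don't change the model: each exchange still becomes its own request/response record pair.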
[18:40] unsure, would be a SketchCow question [18:40] he picked the things [18:41] it's kind of pot luck currently as to which computer systems are working [18:42] bbc is in MESS, ought to work, but there may be some silly issue with the compile [18:44] * ats launches his Z80-equipped Cobra MkIII and goes for a spin [18:47] new elite coming 2014, can't wait [18:51] Spectrum, Apple ][ and Osborne I all seem to work OK for me, and the text looks good [18:53] * ats idly ponders a "focus on British games" page along similar lines to point his students at... [19:42] Any weirdness, let me know [19:44] https://docs.google.com/spreadsheet/ccc?key=0ApQeH7pQrcBWdDgtQmxhQS1ibEJua1JRYlJScWt2dWc&usp=sharing [19:57] So look. [19:57] I shifted data off the filling partition [20:02] SketchCow: my coworkers saw that software archive, they love it [20:07] Great [20:46] I might have a new project to do [20:46] http://community.eveonline.com/news/news-channels/eve-online-news/old-portrait-services-temporarily-re-enabled/ [20:47] eve has re-enabled their old portrait server, I'm already running a script right now that is brute forcing it, since the id for the avatar can be 1 to 9000000 [20:47] the old docs are here for it http://oldportraits.eveonline.com/ [20:47] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD [20:48] yahoosucks, good sir [20:48] I have the greatest question ever. [20:48] https://archive.org/details/VisiCalc_1979_SoftwareArts [20:48] I can't get it to do a second row of data [20:48] Any ideas? [20:53] there's probably an easier way but you can type >A2 [20:53] source https://archive.org/stream/atariusersguide00fyls#page/16/mode/2up [20:56] Nemo_bis: on hp ftp.. did a compression test on "hpdesignjet.zip" to see what's possible.. nothing much came out of it. "Compression Ratio: 1.010.", a couple of hundred meg savings. Not useful at all to upload it I guess..
[21:04] SketchCow: Ha, I was *just* wondering the same thing [21:05] Oh huh, I entered something that made left/right do vertical scrolling instead [21:12] deathy: with what settings? [21:13] unless you have over 20 GB RAM, you'd need -U for that one :) [21:18] Nemo_bis: ran with "-lU" since that's what you mentioned yesterday. Just got a server with 48 GB of ram today :) [21:20] I wonder if I should submit this project to the warriors [21:20] this is taking forever, I've got 9 million IDs to find [21:25] deathy: wow, so you don't even need to use -U :D how long did it take? maybe you can even remove -l [21:26] I suspect the piping done by lrztar has worse effects than lrzip directly on a tar on disk [21:28] Nemo_bis: 21 minutes for the lrzip. I actually unarchived, created a tar and then ran lrzip. Well..sleep now. Let me know if you want me to try it on any other big archives [21:29] JRWR, I'd imagine their servers can handle a nice number of connections. Got threading? [21:33] deathy: impressive :) a test without -lU would be fun [21:33] maybe that's the wrong test case, it's possible there isn't as much duplication as in others [21:37] what if archive.org goes down [21:37] do we archive archive.org [21:46] TSwift: yes, for instance I ask people to mirror my https://archive.org/details/wikimediacommons collection; I'd also like to know more about the Alexandria mirror [21:47] I wonder if some researcher is downloading huge datasets; usually the link to Internet2 is much less busy, iirc.
https://monitor.archive.org/weathermap/weathermap.html Maybe someone I asked to mirror Commons files :) https://en.wikipedia.org/wiki/Category:Internet_mirror_services [21:47] Also fun: http://www.internet2.edu/news/pr/2013.04.24.first-100G-transcontinental-transmission-rande-link.html [21:48] TSwift: http://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/ [21:48] cool, ty [21:49] also, while you're at it: http://www.newegg.com/Product/Product.aspx?Item=N82E16840995035 ;) [21:49] I've been meaning to write a gui leech tool with the new ia python stuff but someone will probably beat me to it [21:52] DFJustin: which new stuff? https://pypi.python.org/pypi/internetarchive (which has quite impressive stats btw) [21:56] that's the one [22:04] update of the eve project atm: http://pcgamingwiki.com/eve [22:21] ugh, efnet seriously doesn't even partially mask people's IP addresses after all these years? [22:22] never did, never will [22:29] ^ likely accurate [22:35] the things that never change are never good things [22:56] lol [23:08] I like freenode's system [23:08] :) [23:08] why are we not on freenode anyway? [23:09] I like EFNet [23:10] freenode is too structured for a band of rogue archivists :) [23:30] man this is going to take forever, anyone have ideas? the eve online project I'm working on, I've contacted the devs with no response so far [23:31] here is the code I'm using for the worker right now: http://hastebin.com/tonamaxovu.php [23:34] what problem are you having? [23:35] it's an image every 0.5 [23:35] the keyspace is 9 million [23:35] they close the server on the 28th [23:35] 0.5s [23:36] like they're throttling your connection? [23:36] na, more like ccp being slow [23:37] when you say "worker" does that mean you have a pool of multiple of those things going at once?
[23:37] it's an IIS server with a backend to MSSQL (I think) [23:37] nope, just one ATM [23:37] didn't want to kill it, but I didn't expect it to be this slow [23:37] I'd run about 100 of those at once and see if that improves things :) [23:38] I'll give that a try, I hope they don't get mad at me [23:38] if they're closing down anyway... [23:38] they probably won't care/notice [23:39] what's "ccp" ? [23:40] oh n/m [23:41] don't know much about the game :) [23:42] it's all good, CCP are redditors and I have already made a post [23:42] http://www.reddit.com/r/Eve/comments/1p5hrq/in_light_of_the_old_portrait_server_being_nuked/