#archiveteam 2013-10-24,Thu

Time Nickname Message
00:05 🔗 odie5533_ Do warriors receive a list of urls to download, or do they hunt for urls themselves?
00:18 🔗 phillipsj pretty sure they get a list from the tracker. That way, everybody is trying different URLs
00:41 🔗 odie5533_ but then the tracker needed to already have crawled the site, right?
00:41 🔗 odie5533_ it seems like the site would be crawled twice then. how does the warrior help?
00:43 🔗 drfsite Would anyone happen to have an archived copy of the media files here?
00:43 🔗 drfsite https://web.archive.org/web/20040209025641/http://www.skycycleonline.com/media.html
01:15 🔗 xmc odie5533_: usually we do a quick surface crawl to get valid id numbers and url formats, then fill in the tracker with things we've seen and things we've extrapolated
03:32 🔗 Lord_Nigh did anyone archive the video of that dude knocking over the boulder? theres lots of dmca takedowns going around
03:38 🔗 drfsite what video?
03:38 🔗 odie5533_ this boy scout decided to knock over some million year old boulder to save children
03:38 🔗 JRWR I know of the one
03:40 🔗 odie5533_ Lord_Nigh: http://www.liveleak.com/view?i=727_1382054402
03:40 🔗 odie5533_ I'm surprised he didn't somehow manage to crush himself.
03:41 🔗 odie5533_ yay glenn!
03:43 🔗 JRWR Lord_Nigh: magnet:?xt=urn:btih:C49EFD4BE3FBFA7FEB8C4ABF18FAE5A5ADEAB61D&dn=jackass%20topples%20200-million-year%20rock%20formation.mp4.mp4&tr=udp%3a%2f%2ftracker.openbittorrent.com%3a80%2fannounce&tr=udp%3a%2f%2ftracker.publicbt.com%3a80%2fannounce&tr=udp%3a%2f%2ftracker.ccc.de%3a80%2fannounce
03:46 🔗 drfsite wow
03:48 🔗 DFJustin I archived it
04:07 🔗 DFJustin hmm who had that handy script to reupload youtube-dl output to ia
04:08 🔗 joepie91 sounds like something I'd write, but it isn't
04:09 🔗 JRWR I kinda wish I had a question-and-answer script to upload files to IA
04:11 🔗 BlueMax does there even need to be question and answer?
04:11 🔗 DFJustin found it http://code.google.com/p/emijrp/source/browse/trunk/scrapers/youtube2internetarchive.py
04:38 🔗 odie5533_ Why isn't that on github!?
04:40 🔗 odie5533_ Does emijrp ever come in here?
04:40 🔗 yipdw yeah
04:40 🔗 yipdw and probably just didn't decide to use github
04:41 🔗 odie5533_ How often does he come in here?
04:41 🔗 yipdw not sure
04:52 🔗 odie5533_ DFJustin: Did you use that script? And if so, to upload what?
05:01 🔗 godane so cause i'm nuts i found another tech podcast
05:02 🔗 godane called The Tech Report Podcast
05:02 🔗 godane good news is the rss feed looks like it has all the mp3s
05:02 🔗 godane makes downloading and pushing it easier
05:04 🔗 JRWR A New WikiDump has been made for the following Projects: https://archive.org/details/wiki-ftlwikicom https://archive.org/details/wiki-letsplaywikicom https://archive.org/details/wiki-lptwikicom and the big stuff https://archive.org/details/wiki-pcgamingwikicom
05:30 🔗 godane i also just found a podcast called hacker public radio
05:42 🔗 JRWR godane and its not on ia
05:42 🔗 JRWR Sounds like a project!
05:45 🔗 DFJustin I haven't used it yet
05:45 🔗 DFJustin would need to adapt it to upload already-downloaded things rather than pulling fresh
05:46 🔗 odie5533_ Do you upload every podcast you find?
05:49 🔗 JRWR Why not?
05:54 🔗 godane i will work on tech report podcast for the moment
05:55 🔗 godane the hacker public radio is released in mp3, spx and ogg
05:55 🔗 godane i'm grabbing the mp3 version since archive.org will make a ogg of that
05:55 🔗 odie5533_ JRWR: Sounds like a lot of work for stuff that's usually pretty low quality... but if you want to, I wouldn't stop you
05:58 🔗 JRWR well this is odd
05:58 🔗 JRWR Why does the wiki teams batch downloader do POST on images
05:59 🔗 JRWR that breaks NGINX
05:59 🔗 odie5533_ What do you mean?
05:59 🔗 JRWR 2607:5300:60:ad1::1 - - [24/Oct/2013:01:52:53 -0400] [pcgamingwiki.com] "POST /images/2/2e/Zen_Puzzle_Garden_cover.png HTTP/1.0" 405 166 "-" "Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0"
06:00 🔗 odie5533_ That's bad.
06:00 🔗 odie5533_ What are you using to dump the wiki?
06:00 🔗 JRWR https://code.google.com/p/wikiteam/source/browse/trunk/dumpgenerator.py
06:00 🔗 odie5533_ Who wrote it?
06:00 🔗 odie5533_ wow that's a long script.
06:01 🔗 JRWR look at the revisions
06:01 🔗 odie5533_ nemo and emijrp
06:02 🔗 odie5533_ line 671
06:02 🔗 JRWR line 671/1195
06:02 🔗 odie5533_ perhaps
06:02 🔗 odie5533_ JRWR: What command did you use?
06:04 🔗 JRWR launcher.py wiki.txt
06:04 🔗 godane whats funny is that episode 1364 of hacker public radio talks about vintage tech icon pay phone coin box
06:04 🔗 JRWR https://code.google.com/p/wikiteam/source/browse/trunk/batchdownload/launcher.py
06:05 🔗 godane i will go after the website for stuff thats not .org, spx, and mp3 just so we have the other stuff
06:07 🔗 odie5533 so the launcher.py calls the dumpgenerator.py?
06:07 🔗 odie5533 crazy.
06:07 🔗 JRWR Yep
06:07 🔗 JRWR its meant for a big ol list of wikis
06:08 🔗 odie5533 JRWR: well
06:09 🔗 odie5533 for a quick fix, just delete the ", data=...") stuff
06:09 🔗 odie5533 so that the line reads: urllib.urlretrieve(url=url, filename='%s/%s' % (imagepath, filename2))
06:09 🔗 odie5533 might break other stuff though! :D
06:10 🔗 odie5533 but that code is hacking since he's overriding urllib internals. bad bad bad! But I've done similar stuff before heh
06:11 🔗 Lord_Nigh http://bap.ece.cmu.edu/download/bap-0.8/ was released on oct 17 and taken down on oct 22; unsure why; it was also stored at a git repo at https://github.com/cmubap/bap which was taken down simultaneously; i'm in communications with someone who has a checkout of that git
06:11 🔗 Lord_Nigh there is some lawyer related crap why it was taken down
06:11 🔗 JRWR oh my
06:11 🔗 JRWR sounds like a bittorrent mirror is in order
06:11 🔗 Lord_Nigh exactly
06:12 🔗 JRWR ill be happy to seed it for some time :)
06:12 🔗 odie5533 don't these lawyers know that code wants to be free? :)
06:13 🔗 Lord_Nigh especially since the 0.7 code is still up at http://bap.ece.cmu.edu/download/bap-0.7/ though it looks like it may have been modified when everything else was taken down
06:13 🔗 odie5533 listing: http://webcache.googleusercontent.com/search?q=cache:http://bap.ece.cmu.edu/download/bap-0.8/&ie=utf-8&oe=utf-8&rls=org.mozilla:en-US:official&client=firefox-a&gws_rd=cr&ei=TLpoUuKaC8v5kQfvpoBY
06:14 🔗 JRWR damn, who thought it was a good idea to do POSTs to get data
06:14 🔗 Lord_Nigh the code WAS released as gplv2... so once i get a copy i'm pretty sure i'm allowed to further distribute it...
06:14 🔗 odie5533 JRWR: looks like it was done to fix the GET not working oddly enough
06:14 🔗 odie5533 Lord_Nigh: sort of.
06:14 🔗 JRWR lol
06:14 🔗 odie5533 Not if it's illegal code
06:14 🔗 JRWR ill watch the logs
06:15 🔗 Lord_Nigh afaik its not illegal
06:15 🔗 odie5533 If it's illegal to begin with, and they had no right to release it, then you have no right either
06:15 🔗 Lord_Nigh true
06:15 🔗 JRWR dat user agent
06:16 🔗 odie5533 JRWR: yeah, I'm not sure why they didn't just use URLopenerUserAgent().retriever(...)
06:16 🔗 odie5533 *retrieve
06:16 🔗 JRWR sounds like a rewrite is in order
06:17 🔗 odie5533 Perhaps just a fix. If it were rewritten, I'd say change from urllib to Twisted.
06:17 🔗 JRWR also, I noticed its borderline a DoS
06:18 🔗 JRWR it spams the fuck out of the webserver
06:18 🔗 odie5533 that's not good.
06:18 🔗 odie5533 also, it looks like the _urlopener, while looking a bit hackish, is actually recommended by the API docs.
06:18 🔗 JRWR same network, im getting 40req/s
06:19 🔗 odie5533 with Twisted I always use delays and set a max number of requests
06:19 🔗 JRWR I dont mind, but adding random requests and maybe some better user agents would work
06:19 🔗 odie5533 JRWR: What is it doing, exactly? You give it a list of image urls?
06:19 🔗 JRWR no
06:20 🔗 JRWR it dumps the ENTIRE contents of a wiki
06:20 🔗 JRWR XML + Images
06:21 🔗 odie5533 First does XML right?
06:21 🔗 JRWR Yes
06:21 🔗 JRWR uses the API to pull it all
06:22 🔗 odie5533 Do you do a lot of wiki archiving?
06:22 🔗 JRWR I own a Very LARGE wiki farm
06:22 🔗 odie5533 What does that mean?
06:22 🔗 JRWR and I hate messing with the database, my caches love me :)
06:23 🔗 JRWR and well, I broke my own dump scripts that they include with mediawiki
06:23 🔗 JRWR even now im dumping a 2G XML file
06:25 🔗 JRWR Yay! its working
06:25 🔗 JRWR all 2200 images
06:30 🔗 JRWR oh god FTLWiki is Huge
06:38 🔗 JRWR ah, thats more like it https://archive.org/details/wiki-pcgamingwikicom
06:39 🔗 Nemo_bis odie5533: no, only emijrp; I just do some small changes
06:40 🔗 Nemo_bis JRWR: what do you mean that it breaks nginx?
06:40 🔗 JRWR it 405s on "true" files
06:40 🔗 JRWR if you try and do a POST on them
06:40 🔗 Nemo_bis perhaps we should try both then
06:41 🔗 JRWR I would try GETs first, then POSTs
06:41 🔗 Nemo_bis apparently POST was used because in some cases GET requests didn't work, according to the comment
06:41 🔗 Nemo_bis yeah, sure; wanna submit a patch? :)
06:41 🔗 JRWR uhhh..... me + python = bwhahah
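The "GETs first, then POSTs" idea discussed above could look roughly like the sketch below. It is not the dumpgenerator.py code and not a patch for it; `fetch_image`, its `data` argument, and the injectable `opener` are hypothetical names chosen for illustration, and it uses Python 3's `urllib.request` rather than the Python 2 `urllib` the script was written against.

```python
import urllib.error
import urllib.request


def fetch_image(url, data=None, opener=urllib.request.urlopen):
    """Try a plain GET first; retry as POST only if the GET fails.

    `data` is the (hypothetical) form body to send on the POST retry.
    `opener` is injectable so the logic can be exercised without a network.
    """
    try:
        # A Request without a body is sent as GET.
        with opener(urllib.request.Request(url)) as resp:
            return resp.read()
    except urllib.error.HTTPError:
        if data is None:
            raise  # nothing to fall back to, so surface the GET error
    # Fallback: a Request with a body is sent as POST by urllib.
    with opener(urllib.request.Request(url, data=data)) as resp:
        return resp.read()
```

With this ordering a server that answers 405 to POST on static files (as in the nginx log line earlier) would never see the POST unless the GET had already failed.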
06:54 🔗 godane this sucks
06:54 🔗 godane looks like there is already a collection
06:55 🔗 godane but it was done badly and out of date
06:55 🔗 godane this is about hacker public radio
06:57 🔗 godane i may have to redo the first two items i have uploaded
06:57 🔗 godane add a _mp3 to item names just so they will upload
06:58 🔗 godane some of the way this collection was done is sort of half ass
06:58 🔗 godane like this item: https://archive.org/details/hpr1282
06:58 🔗 odie5533 Nemo_bis: just do GET. Leave POST for if the GET didn't work someone can fix to that
06:58 🔗 godane it should only be hpr1282 in it
06:58 🔗 odie5533 I don't think POST should ever be the default behavior.
06:58 🔗 godane but hpr1284 is also in it
07:00 🔗 odie5533 Nemo_bis: have you read through all the code of dumpgenerator.py? Or have you only made tiny fixes to it?
07:01 🔗 Nemo_bis odie5533: yes, I guess I read it all at some point in time
07:02 🔗 odie5533 aren't there other scripts to generate backups of mediawiki sites?
07:03 🔗 JRWR there are, but this set is very nice as it does all the heavy lifting for you when it works
07:04 🔗 JRWR I just submitted three bugs
07:08 🔗 odie5533 JRWR: It would probably help your issues if you gave the specific commands you used to reproduce the problem
07:08 🔗 odie5533 "1. Do a normal API based Full XML+Image Dump using SVN Trunk "
07:11 🔗 JRWR odie5533 added a comment
07:12 🔗 odie5533 looks better
07:12 🔗 odie5533 Is dumping wikis popular?
07:12 🔗 odie5533 Or is dumping other stuff more popular?
07:12 🔗 JRWR somewhat
07:13 🔗 JRWR its more common to find a wiki
07:13 🔗 JRWR since mediawikis are easy to setup and allow for content to be stored
07:13 🔗 JRWR I run PCGamingWiki.com (Their servers) and well 47k a day in visits is nice
07:28 🔗 odie5533 JRWR: What do you use to view warc files?
07:36 🔗 odie5533 http://www.magicthegatheringtactics.com/ is already down. I assume no one got a grab of it?
07:40 🔗 godane odie5533: its not down for me
07:41 🔗 odie5533 oh. won't load for me. someone should probably grab it since the game is shutting down
09:39 🔗 odie5533 Does WARC support HTTP1.1?
09:41 🔗 odie5533 I guess it does by splitting up the request/responses.
09:41 🔗 odie5533 HTTP1.1 makes things more complicated...
12:05 🔗 yipdw odie5533: so long as there's one or more responses to a given request, WARC/1.0 should be able to handle any such version of HTTP
12:06 🔗 yipdw correction, zero or more responses per one request
12:06 🔗 yipdw WARC will correctly capture a "no response received" situation
16:14 🔗 DFJustin paging sketchcow / undersco2 - rsync to fos failing for lack of space on device
18:10 🔗 undersco2 please bang on this and make sure you don't see any breakage or errors
18:10 🔗 undersco2 http://archive.org/details/historicalsoftware
18:37 🔗 phillipsj in-browser emulators? lynx won't touch it :P
18:39 🔗 ats undersco2: it's a bit weird pointing at the Spectrum version of Elite -- isn't the BBC version (the original) in the archive?
18:39 🔗 ats (Ian Bell actually recommends the NES version as the best 8-bit one...)
18:40 🔗 undersco2 unsure, would be a SketchCow question
18:40 🔗 undersco2 he picked the things
18:41 🔗 DFJustin it's kind of pot luck currently as to what computer systems are working
18:42 🔗 DFJustin bbc is in mess so it ought to work, but there may be some silly issue with the compile
18:44 🔗 * ats launches his Z80-equipped Cobra MkIII and goes for a spin
18:47 🔗 touya new elite coming 2014, can't wait
18:51 🔗 ats Spectrum, Apple ][ and Osborne I all seem to work OK for me, and the text looks good
18:53 🔗 * ats idly ponders a "focus on British games" page along similar lines to point his students at...
19:42 🔗 SketchCow Any weirdness, let me know
19:44 🔗 SketchCow https://docs.google.com/spreadsheet/ccc?key=0ApQeH7pQrcBWdDgtQmxhQS1ibEJua1JRYlJScWt2dWc&usp=sharing
19:57 🔗 SketchCow So look.
19:57 🔗 SketchCow I shifted data off the filling partition
20:02 🔗 yipdw SketchCow: my coworkers saw that software archive, they love it
20:07 🔗 SketchCow Great
20:46 🔗 JRWR I might have a new project to do
20:46 🔗 JRWR http://community.eveonline.com/news/news-channels/eve-online-news/old-portrait-services-temporarily-re-enabled/
20:47 🔗 JRWR eve has re-enabled their old portrait server, Im already running a script right now that is brute forcing it, since the id for the avatar can be 1 to 9000000
20:47 🔗 JRWR the old docs are here for it http://oldportraits.eveonline.com/
20:47 🔗 JRWR WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
20:48 🔗 SketchCow yahoosucks, good sir
20:48 🔗 SketchCow I have the greatest question ever.
20:48 🔗 SketchCow https://archive.org/details/VisiCalc_1979_SoftwareArts
20:48 🔗 SketchCow I can't get it to do a second row of data
20:48 🔗 SketchCow Any ideas?
20:53 🔗 DFJustin there's probably an easier way but you can type >A2
20:53 🔗 DFJustin source https://archive.org/stream/atariusersguide00fyls#page/16/mode/2up
20:56 🔗 deathy Nemo_bis: on hp ftp.. did a compression test on "hpdesignjet.zip" to see what's possible.. nothing much came out of it. "Compression Ratio: 1.010.", couple of hundred meg savings. Not useful at all to upload it I guess..
21:04 🔗 mistym SketchCow: Ha, I was *just* wondering the same thing
21:05 🔗 mistym Oh huh, I entered something that made left/right do vertical scrolling instead
21:12 🔗 Nemo_bis deathy: with what settings?
21:13 🔗 Nemo_bis unless you have over 20 GB RAM, you'd need -U for that one :)
21:18 🔗 deathy Nemo_bis: ran with "-lU" since that's what you mentioned yesterday. Just got a server with 48 GB of ram today :)
21:20 🔗 JRWR I wonder if I should submit this project to the warriors
21:20 🔗 JRWR this is taking forever, Ive got 9 million IDs to find
21:25 🔗 Nemo_bis deathy: wow, so you don't even need to use -U :D how long did it take? maybe you can remove even -l
21:26 🔗 Nemo_bis I suspect the piping done by lrztar has worse effects than lrzip directly on a tar on disk
21:28 🔗 deathy Nemo_bis: 21 minutes for the lrzip. I actually unarchived, created a tar and then ran lrzip. Well..sleep now. Let me know if you want me to try it on any other big archives
21:29 🔗 Jacek JRWR, I'd imagine their servers can handle a nice number of connections. Got threading?
21:33 🔗 Nemo_bis deathy: impressive :) a test without -lU would be fun
21:33 🔗 Nemo_bis maybe that's the wrong testcase, it's possible there isn't as much duplication as in others
21:37 🔗 TSwift what if archive.org goes down
21:37 🔗 TSwift do we archive archive.org
21:46 🔗 Nemo_bis TSwift: yes, for instance I ask people to mirror my https://archive.org/details/wikimediacommons collection; I'd also like to know more about the Alexandria mirror
21:47 🔗 Nemo_bis I wonder if some researcher is downloading huge datasets; usually the link to Internet2 is much less busy, iirc. https://monitor.archive.org/weathermap/weathermap.html Maybe someone I asked to mirror Commons files :) https://en.wikipedia.org/wiki/Category:Internet_mirror_services
21:47 🔗 Nemo_bis Also fun: http://www.internet2.edu/news/pr/2013.04.24.first-100G-transcontinental-transmission-rande-link.html
21:48 🔗 DFJustin TSwift: http://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/
21:48 🔗 TSwift cool, ty
21:49 🔗 Nemo_bis also, while you're at it: http://www.newegg.com/Product/Product.aspx?Item=N82E16840995035 ;)
21:49 🔗 DFJustin I've been meaning to write a gui leech tool with the new ia python stuff but someone will probably beat me to it
21:52 🔗 Nemo_bis DFJustin: which new stuff? https://pypi.python.org/pypi/internetarchive (which has quite impressive stats btw)
21:56 🔗 DFJustin that's the one
22:04 🔗 JRWR update of the eve project atm: http://pcgamingwiki.com/eve
22:21 🔗 dzne ugh, efnet seriously doesn't even partially mask people's IP address after all these years?
22:22 🔗 touya never did, never will
22:29 🔗 joepie91 ^ likely accurate
22:35 🔗 dzne the things that never change are never good things
22:56 🔗 JRWR lol
23:08 🔗 JRWR I like freenodes system
23:08 🔗 JRWR :)
23:08 🔗 JRWR why are we not on freenode anyway?
23:09 🔗 SketchCow I like EFNet
23:10 🔗 balrog freenode is too structured for a band of rogue archivists :)
23:30 🔗 JRWR man this is going to take forever, anyone have ideas? the eve online project im working on, I've contacted the devs with no response so far
23:31 🔗 JRWR here is the code Im using for the worker right now: http://hastebin.com/tonamaxovu.php
23:34 🔗 dzne what problem are you having?
23:35 🔗 JRWR its a image every 0.5
23:35 🔗 JRWR the keyspace is 9 million
23:35 🔗 JRWR they close on the 28th, the server
23:35 🔗 JRWR 0.5s
23:36 🔗 dzne like they're throttling your connection?
23:36 🔗 JRWR na, more like ccp being slow
23:37 🔗 dzne when you say "worker" does that mean you have a pool of multiple of those things going at once?
23:37 🔗 JRWR its a IIS server with a backend to MSSQL (I think)
23:37 🔗 JRWR nope, just one ATM
23:37 🔗 JRWR didnt want to kill it, but I didnt expect for it to be this slow
23:37 🔗 dzne I'd run about 100 of those at once and see if that improves things :)
23:38 🔗 JRWR ill give that a try, I hope they dont get mad at me
23:38 🔗 dzne if they're closing down anyway...
23:38 🔗 dzne they probably won't care/notice
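For scale: 9,000,000 IDs at 0.5 s each is 4,500,000 s, or about 52 days with a single worker. 100 concurrent workers would bring that to roughly 12.5 hours, but only if the server keeps the same per-request latency under load, which is an assumption. A minimal sketch of running many fetches at once follows. It is not the hastebin worker script linked above; `grab_range` and the `fetch_one(id)` callable are hypothetical, and the portrait URL format is deliberately left out because it is not given in the log.

```python
from concurrent.futures import ThreadPoolExecutor


def grab_range(fetch_one, first_id, last_id, workers=100):
    """Call fetch_one(id) for every id in [first_id, last_id] on a thread pool.

    fetch_one is a hypothetical callable that downloads one portrait and
    returns its bytes, or raises if the id has no portrait.
    Returns a dict mapping id -> bytes for the ids that succeeded.
    """
    def safe(i):
        try:
            return i, fetch_one(i)
        except Exception:
            return i, None  # treat any failure as "nothing at this id"

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() submits the whole range up front, so call this per chunk
        # (for example 100k ids at a time) rather than on all 9 million.
        results = pool.map(safe, range(first_id, last_id + 1))
        return {i: r for i, r in results if r is not None}
```

Threads fit here because the work is waiting on the network, not computing; keeping `workers` adjustable makes it easy to back off if the server starts struggling.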
23:39 🔗 dzne what's "ccp" ?
23:40 🔗 dzne oh n/m
23:41 🔗 dzne don't know much about the game :)
23:42 🔗 JRWR its all good, CCP are redditors and I have already made a post
23:42 🔗 JRWR http://www.reddit.com/r/Eve/comments/1p5hrq/in_light_of_the_old_portrait_server_being_nuked/
