[00:15] any other very near future projects we know about but haven't been started yet? or have things calmed down for now?
[02:53] so, how do you archive twitter accounts? if someone knows, Dan Kaminsky mentioned that https://twitter.com/conorpotpie died recently
[03:05] don't know anything first hand, but there are some tools mentioned on the wiki: http://www.archiveteam.org/index.php?title=Twitter
[03:06] out of 3 tools, one is a dead/broken link and another requires your login/pass (obviously not possible here..)
[03:10] and the 3rd one is also useless. URL params and classes used in the page are not the same anymore.
[03:15] updated the wiki, made it clear the 3rd tool is not usable anymore
[03:16] perhaps it would be good to add one there that actually works..
[03:31] deathy, dashcloud: archive.is seems to expand the whole page of a twitter account
[03:33] nope
[03:33] not really
[03:34] tried it on that account dashcloud mentioned. It was doing at least a few URL requests which I recognized from monitoring the twitter infinite scroll thing
[03:34] but it maybe got 1-3 additional pages/scrolls
[03:34] from more than 50 I think on that user at least
[03:36] (used a lot of PgDn keys while looking at it.. )
[03:36] ah
[03:38] from archive.is: "There is 5 minutes timeout, if page is not fully loaded in 5 minutes, the saving considered failed. It is not often, but it happens."
[03:38] and maybe for extreme cases this could be an issue (if lots of twitter pics): "The stored page with all images must be smaller than 50Mb"
[03:42] and double/triple-confirmed, from the archive.is blog, issues with twitter: http://blog.archive.is/post/51400352393/it-seems-that-twitter-feeds-with-a-lot-of-tweets-500
[04:13] mm
[07:02] Where's the hug.
[07:05] * BlueMaxim hugs SketchCow
[08:35] * Nemo_bis read "bug"
[09:07] why are wikipedia pages saved so badly in the wayback machine?
[09:07] http://web.archive.org/save/http://nl.wikipedia.org/wiki/Hoofdpagina
[10:19] hi=) what about getting more yahoo blogs from google cache (and gigablast cache) (or a cdn available for this if any)?
[10:45] jonas_: join #shipwretched and update your y!b code for new grabs
[11:31] looks like the archive.is saves are made like ####
[11:31] would make it possible to save
[11:31] 14,776,336 combinations
[11:31] then run a program on them to only keep the existing ones and discover all the other urls
[11:31] and then download
[11:44] looks like they also have ##### (with 5) now...
[11:45] the #### is full, all of them are used, so that would be good to archive
[11:45] will try to start a grab on that... :)
[12:09] generated all urls from aaaa-0000
[12:10] now starting the url discovery of the first batch: aaaa-a000
[12:10] or nah, gonna do aaaa-d000
[12:37] jonas_: Apparently google cache is really hard to archive from, they rate-limit so aggressively that it's almost impossible to do on any scale
[12:41] pretty much need a massive block of IPs and randomly scatter requests across the block
[12:56] so far I'm not having problems with concurrency 4
[12:57] there's an old piece of software in Perl that does Google Cache extraction pretty well afaik
[13:06] I waited 2 minutes between requests to google cache and it worked fine
[13:06] is that wretch or blogs or both, Nemo_bis ?
[13:06] sounds encouraging
[13:07] blogs
[13:07] cool
[13:07] but I see I uploaded only 14 items so far, dunno what's going on for real
[13:07] are you talking about getting the yahoo things from google cache?
[13:15] archive.is is blocked from the IA for some reason...
[13:15] :(
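
The short-code enumeration described in the 11:31-12:10 messages might look roughly like the sketch below. It is not jonas_'s actual script, and its assumptions are not confirmed in the log: that codes draw on the 62-character alphabet [A-Za-z0-9] (62^4 = 14,776,336, the figure quoted above), that snapshots live at https://archive.is/<code>, and that a 200 response to a HEAD request means the code is in use; the batch prefix and per-request delay are placeholders.

    # Sketch only: enumerate candidate archive.is short codes and record
    # which ones exist. Alphabet, URL pattern, and success test are
    # assumptions, not confirmed behaviour.
    import itertools
    import string
    import time

    import requests  # third-party: pip install requests

    ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits  # 62 characters

    def codes(prefix="", length=4):
        """Yield every code of the given length that starts with `prefix`."""
        for tail in itertools.product(ALPHABET, repeat=length - len(prefix)):
            yield prefix + "".join(tail)

    def discover(prefix, delay=2.0):
        """Print each code in this batch that resolves to an existing snapshot."""
        for code in codes(prefix):
            resp = requests.head("https://archive.is/" + code,
                                 allow_redirects=True, timeout=30)
            if resp.status_code == 200:  # assumed to mean "snapshot exists"
                print(code)
            time.sleep(delay)  # be gentle; see the rate-limit discussion above

    if __name__ == "__main__":
        discover("a")  # one prefix batch, roughly 62**3 = 238,328 candidates

With a 2-second delay a single worker covers only about 43,000 codes per day, which is one reason to split the space into prefix batches as the log describes.
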
[15:46] anyone else blocked from dropbox for too much BW, even when you know it is not true?
[16:47] Cowering: like blocked completely, or one file blocked?
[17:32] etsi.org/deliver/ save almost complete
[17:32] working on wikileaks website
[18:21] don't know if anyone would want to keep this, but it's here in case: https://bui.pm/ded
[18:21] it was a /b/ archive, dead now, images and database stuff is up for grabs
[18:23] and with that, it's time for me to leave
[18:23] later all o/
[18:40] being archived
[19:51] hello
[19:51] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[19:52] looks like the bot is not responding - secret word is yahoosucks
[19:52] thank you
[19:53] also I might as well ask this over here too
[19:53] I want to help archive myopera
[19:53] how can I help?
[19:54] info page is here - http://archiveteam.org/index.php?title=My_Opera
[19:54] yes I know
[19:54] but what can I do
[19:55] It doesn't seem set up for use with the warrior
[19:55] looks like it's not a warrior project, would need to check with Mithrandir who is running that project
[19:55] oh
[19:56] BiggieJo1: wait, we have a bot responding with yahoosucks?
[19:57] I see, hmm there seems to be no contact info on his wiki page
[19:57] irony, a bot responding to a question for the anti-bot system...
[19:57] just a public key
[19:58] does anyone know if he is regularly on IRC?
[19:58] joepie91: where ?
[19:58] nico_32: see what BiggieJo1 said
[19:58] arrgh, nick broken again
[19:59] not broken, just in an alternative state of functioning
[19:59] :)
[20:00] or should I write in the discussion page for myopera or for him?
[20:02] try to write on his discussion page
[20:02] it "should" make mediawiki show him/her/hir a message at next login
[20:03] ok, thanks for the advice
[20:13] Just found a webhost with an insane amount of datasheets
[20:13] DeVan: url ?
[20:14] I don't know
[20:15] ...
[20:15] afraid the swarm will make him close it
[20:16] send it privately and i will make a very slow download
[20:17] with --delay 20s
[20:17] electronicsandbooks?
[20:17] yeah
[20:17] people have archived that before; it's very very slow
[20:20] balrog: not that site
[20:22] last IA crawling: 2011/2012
[21:37] would be good maybe to back up xbins (xbox-scene.com file downloads)
[21:39] http://www.xbins.org/ and the actual files obtained via IRC and FTP
[21:39] http://www.xbox-scene.com/articles/xbins.php
[21:39] tutorial.
[21:41] I downloaded a whole bunch about ten years ago but all the files probably have newer releases/versions and i never got the whole lot.
[21:42] Also, I think there was a limit on how many you could get a day. or hour. Can't remember though.
[21:42] arkhive: I will run a URL discovery program tomorrow on those and see how big the sites are and how many files they contain
[21:43] will then start a grab
[21:43] okay. i think they are hosted off site. like not on xbins.org. can't remember though. I was like 13 lol
[21:43] joepie91: oh are you grabbing that /b/ thing, I just emailed jason about it
[21:43] but i'll get those CDs i put the xbins files on.
[21:44] DFJustin: yes, one of my boxes is downloading it atm
[21:46] \o/
[21:46] I still need to continue my saving of old apps/programs from dead/zombied mobile platforms. good article if i remember right lol. http://www.visionmobile.com/blog/2012/01/the-dead-platform-graveyard-lessons-learned-2/
[21:46] but to -bs for me. :)
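
A "URL discovery" pass of the kind jonas_ mentions for xbins/xbox-scene could be as simple as the breadth-first crawl sketched below. It is a hedged illustration, not the program actually being run; the start URL, politeness delay, page limit, and use of the third-party requests library are all assumptions.

    # Sketch only: breadth-first same-host link discovery, to get a rough
    # page/file count before committing to a full grab.
    import time
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    import requests  # third-party: pip install requests

    class LinkParser(HTMLParser):
        """Collect href values from every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def discover(start, delay=5.0, limit=1000):
        """Return the set of same-host URLs reachable from `start`."""
        host = urlparse(start).netloc
        seen, queue = {start}, deque([start])
        while queue and len(seen) < limit:
            url = queue.popleft()
            try:
                resp = requests.get(url, timeout=30, stream=True)
            except requests.RequestException:
                continue
            if "html" in resp.headers.get("Content-Type", ""):
                parser = LinkParser()
                parser.feed(resp.text)  # downloads the page body
                for href in parser.links:
                    absolute = urljoin(url, href).split("#")[0]
                    if urlparse(absolute).netloc == host and absolute not in seen:
                        seen.add(absolute)
                        queue.append(absolute)
            resp.close()  # skip the body of non-HTML files
            time.sleep(delay)  # stay well under any rate limit
        return seen

    if __name__ == "__main__":
        urls = discover("http://www.xbins.org/")
        print(len(urls), "URLs discovered")

Counting the discovered URLs gives a rough idea of a site's shape before committing to a full grab.
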
[21:56] for the record, I have several servers with 500G disk now
[21:56] so anything up to that size, I can fetch
[21:56] (ping me when necessary)
[21:59] joepie91: !a https://bui.pm/ded
[21:59] yipdw: it's too big for that
[22:00] I should add that to ArchiveBot
[22:00] "Sorry, this item is too big"
[22:04] http://bofh.nikhef.nl/events/ this seems to be a mirror for... everything con related for tech things. how can i get a size, without just dl-ing all of it?
[22:06] arkhive: I will do the xbox-scene website first, since those downloads are on-site
[22:06] and then I'll take a look at the other one
[22:06] :)
[22:11] bsmith093: not, probably
[22:11] unless you script a bit
[22:13] joepie91: I'm gonna run a wget spider, then dump the log to a url extractor, then dump that into jdownloader, so i at least know how big it is.
[22:13] oh man :P
[22:15] bsmith093: I will try to get the size and the number of urls for you tomorrow
[22:16] ark i'm already running the spider, that's faster, isn't it?
[22:16] mentioning jdownloader to joepie91 is like talking to a TEA Party member about taxes
[22:16] the cool thing about US-centric similes is that they're always worse than you intend
[22:16] :P
[22:17] joepie91: what's wrong with this workflow? seriously I'd love any suggestions :)
[22:17] bsmith093: it just seems a bit... duct-tapey :)
[22:17] aside from the jdownloader bit
[22:19] hmm, we can see how accurate jdownloader is, please let me know the size of the site tomorrow
[22:20] * m1das moves to #archiveteam-bs and opens the popcorn
[22:23] bsmith093: I will tell you in a few minutes
[22:24] I am using HTTrack and find . -name 'index.html' | xargs cat | grep 'alt="\[ \]">' | perl -p -i -e 's/ +/ /g' | python -c "exec 'import sys\nfor line in sys.stdin: print line.strip().split()[-1]'" | sed 's/G/*1024*1024*1024/g' | sed 's/M/*1024*1024/g' | sed 's/K/*1024/g'
[22:25] buggy, ask me for a fixed version later if you want it
[22:31] regex ftw
[22:40] 400GB so far but it's still grabbing indexes
[22:40] find . -name 'index.html' | xargs cat | grep 'alt="\[...\]">' | grep -v 'alt="\[DIR\]">' | perl -p -i -e 's/ +/ /g' | python -c "exec 'import sys\nfor line in sys.stdin: print line.strip().split()[-1]'" | sed 's/G/*1024*1024*1024/g' | sed 's/M/*1024*1024/g' | sed 's/K/*1024/g' | python -c "exec 'import sys\nprint sum(int(eval(line.strip(), {\'__builtins__\': None})) for line in sys.stdin)'"
[23:57] well ok then, ivan` you can grab it if you eant, holy crap that's big
[23:57] *want
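
For readers squinting at the one-liners above: they walk a local HTTrack mirror, pull the human-readable size column out of Apache-style directory listings (index.html), convert the K/M/G suffixes to byte multipliers, and sum everything. Below is a more readable Python sketch of the same idea; it assumes the stock Apache fancy-index layout (file rows carry an alt="[...]" icon, directories use alt="[DIR]", and the last column is a size such as 1.2M), so like the original it only yields an estimate from the listed sizes rather than a byte-exact total.

    # Sketch only: approximate what the shell pipeline above computes by
    # summing the size column of every Apache-style index.html in a mirror.
    import os

    MULT = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

    def listed_bytes(root="."):
        """Sum the sizes shown in every index.html below `root`."""
        total = 0.0
        for dirpath, _dirs, files in os.walk(root):
            if "index.html" not in files:
                continue
            with open(os.path.join(dirpath, "index.html"), errors="replace") as fh:
                for line in fh:
                    # keep file rows, skip directory and parent-directory rows
                    if ('alt="[' not in line or 'alt="[DIR]"' in line
                            or 'alt="[PARENTDIR]"' in line):
                        continue
                    size = line.split()[-1]        # e.g. "437", "1.2M", "3.4G"
                    unit = MULT.get(size[-1], 1)
                    try:
                        total += float(size.rstrip("KMGT")) * unit
                    except ValueError:
                        pass                       # row without a parseable size
        return int(total)

    if __name__ == "__main__":
        print(listed_bytes())
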