#archiveteam 2013-12-29,Sun


Time Nickname Message
00:15 🔗 deathy any other very near future projects we know about but haven't been started yet? or have things calmed down for now?
02:53 🔗 dashcloud so, how do you archive twitter accounts? if someone knows, Dan Kaminsky mentioned that https://twitter.com/conorpotpie died recently
03:05 🔗 deathy don't know anything first hand, but there are some tools mentioned on the wiki: http://www.archiveteam.org/index.php?title=Twitter
03:06 🔗 deathy out of the 3 tools, one is a dead/broken link and another requires your login/pass (obviously not possible here..)
03:10 🔗 deathy and the 3rd one is also useless. The URL params and classes used in the page are not the same anymore.
03:15 🔗 deathy updated wiki, made it clear 3rd tool is not usable anymore
03:16 🔗 deathy perhaps it would be good to add one there that actually works..
03:31 🔗 xmc deathy, dashcloud: archive.is seems to expand the whole page of a twitter account
03:33 🔗 deathy nope
03:33 🔗 deathy not really
03:34 🔗 deathy tried on that account dashcloud mentioned. It was doing at least a few URL requests which I recognized from monitoring the twitter infinite scroll thing
03:34 🔗 deathy but it maybe got 1-3 additional pages/scrolls
03:34 🔗 deathy from more than 50 I think on that user at least
03:36 🔗 deathy (used a lot of PgDn keys while looking at it.. )
03:36 🔗 xmc ah
03:38 🔗 deathy from archive.is: "There is 5 minutes timeout, if page is not fully loaded in 5 minutes, the saving considered failed. It is not often, but it happens."
03:38 🔗 deathy and maybe for extreme cases this could be an issue (if lots of twitter pics): "The stored page with all images must be smaller than 50Mb"
03:42 🔗 deathy and double/triple-confirmed, from archive.is blog, issues with twitter: http://blog.archive.is/post/51400352393/it-seems-that-twitter-feeds-with-a-lot-of-tweets-500
04:13 🔗 xmc mm
07:02 🔗 SketchCow Where's the hug.
07:05 🔗 * BlueMaxim hugs SketchCow
08:35 🔗 * Nemo_bis read "bug"
09:07 🔗 arkiver why are wikipedia pages saved so badly in the wayback machine?
09:07 🔗 arkiver http://web.archive.org/save/http://nl.wikipedia.org/wiki/Hoofdpagina
10:19 🔗 jonas_ hi =) what about getting more yahoo blogs from google cache (and gigablast cache) (or a cdn available for this, if any)?
10:45 🔗 m1das jonas_: join #shipwretched and update your y!b code for new grabs
11:31 🔗 arkiver looks like the archive.is saves are made like ####
11:31 🔗 arkiver would make it possible to save
11:31 🔗 arkiver 14,776,336 combinations
11:31 🔗 arkiver then run a program on them to keep only the existing ones and discover all the other urls
11:31 🔗 arkiver and then download
11:44 🔗 arkiver looks like they also have ##### (with 5) now...
11:45 🔗 arkiver the #### is full, all of them are used, so that would be good to archive
11:45 🔗 arkiver will try to start a grab on that... :)
12:09 🔗 arkiver generated all urls from aaaa-0000
12:10 🔗 arkiver now starting the url discovery of the first batch: aaaa-a000
12:10 🔗 arkiver or nah, gonna do aaaa-d000
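
(For context, a minimal sketch of the brute-force ID enumeration arkiver describes above, assuming the archive.is short IDs are 4 characters drawn from [a-zA-Z0-9] — which is where 62^4 = 14,776,336 comes from — and that unused IDs answer with a 404. The base URL, HEAD check, and delay are illustrative, not the script actually used.)

# Sketch of the short-ID enumeration described above.
# Assumptions: 4-character IDs over [a-zA-Z0-9] (62^4 = 14,776,336)
# and that http://archive.is/<id> answers 404 for unused IDs.
import itertools
import string
import time
import urllib.error
import urllib.request

ALPHABET = string.ascii_letters + string.digits  # 62 characters

def candidate_ids():
    """Yield every 4-character ID: aaaa, aaab, ... 9999."""
    for chars in itertools.product(ALPHABET, repeat=4):
        yield "".join(chars)

def id_exists(short_id, base="http://archive.is/"):
    """HEAD the short URL; treat 404 as unused, anything else as existing."""
    req = urllib.request.Request(base + short_id, method="HEAD")
    try:
        urllib.request.urlopen(req, timeout=30)
        return True
    except urllib.error.HTTPError as err:
        return err.code != 404
    except urllib.error.URLError:
        return False  # network trouble; a real grab would retry here

if __name__ == "__main__":
    for short_id in candidate_ids():
        if id_exists(short_id):
            print(short_id)
        time.sleep(1)  # illustrative politeness delay
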
12:37 🔗 antomatic jonas_: Apparently google cache is really hard to archive from, they rate-limit so aggressively that it's almost impossible to do on any scale
12:41 🔗 BiggieJo1 pretty much need a massive block of IPs and randomly scatter requests across the block
12:56 🔗 Nemo_bis so far I'm not having problems with concurrency 4
12:57 🔗 joepie91 there's an old piece of software in Perl that does Google Cache extraction pretty well afaik
13:06 🔗 ivan` I waited 2 minutes between requests to google cache and it worked fine
13:06 🔗 antomatic is that wretch or blogs or both, Nemo_bis ?
13:06 🔗 antomatic sounds encouraging
13:07 🔗 Nemo_bis blogs
13:07 🔗 antomatic cool
13:07 🔗 Nemo_bis but I see I uploaded only 14 items so far, dunno what's going on for real
13:07 🔗 arkiver are you talking about getting the yahoo things from google cache?
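
(A minimal sketch of the slow Google Cache fetching discussed above — one request roughly every two minutes, per ivan`'s observation. The webcache endpoint, User-Agent, and example URLs are assumptions for illustration, not the code any of these grabs actually used.)

# Sketch only: fetch Google Cache copies one at a time with a long delay,
# following the "2 minutes between requests" spacing mentioned above.
# The webcache endpoint format and the 120 s delay are assumptions.
import time
import urllib.parse
import urllib.request

CACHE_ENDPOINT = "http://webcache.googleusercontent.com/search?q=cache:"
DELAY_SECONDS = 120

def fetch_cached(url):
    """Return the cached copy of a single URL, or None if the fetch fails."""
    cache_url = CACHE_ENDPOINT + urllib.parse.quote(url, safe="")
    req = urllib.request.Request(cache_url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            return resp.read()
    except Exception:
        return None

if __name__ == "__main__":
    urls = [  # illustrative list; a real run would read these from a file
        "http://example.com/blog/post1",
        "http://example.com/blog/post2",
    ]
    for url in urls:
        body = fetch_cached(url)
        print(url, "ok" if body is not None else "failed")
        time.sleep(DELAY_SECONDS)  # stay well under the rate limit
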
13:15 🔗 arkiver archive.is is blocked from the IA for some reason...
13:15 🔗 arkiver :(
15:46 🔗 Cowering anyone else blocked from dropbox for too much BW, even when you know it is not true?
16:47 🔗 balrog Cowering: like blocked completely, or one file blocked?
17:32 🔗 arkiver etsi.org/deliver/ save almost complete
17:32 🔗 arkiver working on wikileaks website
18:21 🔗 chavezery don't know if anyone would want to keep this, but it's here in case: https://bui.pm/ded
18:21 🔗 chavezery it was a /b/ archive, dead now, images and database stuff is up for grabs
18:23 🔗 chavezery and with that, it's time for me to leave
18:23 🔗 chavezery later all o/
18:40 🔗 joepie91 being archived
19:51 🔗 alexvoda hello
19:51 🔗 alexvoda WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
19:52 🔗 BiggieJo1 looks like the bot is not responding - secret word is yahoosucks
19:52 🔗 alexvoda thank you
19:53 🔗 alexvoda also I might as well ask this over here too
19:53 🔗 alexvoda I want to help archive myopera
19:53 🔗 alexvoda how can I help?
19:54 🔗 BiggieJo1 info page is here - http://archiveteam.org/index.php?title=My_Opera
19:54 🔗 alexvoda yes I know
19:54 🔗 alexvoda but what can I do
19:55 🔗 alexvoda It doesn't seem set up for use with the warrior
19:55 🔗 BiggieJo1 looks like it's not a warrior project, would need to check with Mithrandir who is running that project
19:55 🔗 alexvoda oh
19:56 🔗 joepie91 BiggieJo1: wait, we have a bot responding with yahoosucks?
19:57 🔗 alexvoda I see, hmm there seems to be no contact info on his wiki page
19:57 🔗 joepie91 irony, a bot responding to a question for the anti-bot system...
19:57 🔗 alexvoda just a public key
19:58 🔗 alexvoda does anyone know if he is regularly on IRC?
19:58 🔗 nico_32 joepie91: where ?
19:58 🔗 joepie91 nico_32: see what BiggieJo1 said
19:58 🔗 BiggieJo1 arrgh, nick broken again
19:59 🔗 joepie91 not broken, just in an alternative state of functioning
19:59 🔗 joepie91 :)
20:00 🔗 alexvoda or should I write on the discussion page for myopera or for him?
20:02 🔗 nico_32 try to write on his discussion page
20:02 🔗 nico_32 it "should" make mediawiki show him/her/hir a message at next login
20:03 🔗 alexvoda ok, thanks for the advice
20:13 🔗 DeVan Just found a webhost with an insane amount of datasheets
20:13 🔗 nico_32 DeVan: url ?
20:14 🔗 DeVan I dont know
20:15 🔗 nico_32 ...
20:15 🔗 DeVan afraid the swarm will make him close it
20:16 🔗 nico_32 send it privately and i will make a very slow download
20:17 🔗 nico_32 with --delay 20s
20:17 🔗 balrog electronicsandbooks?
20:17 🔗 DeVan yeah
20:17 🔗 balrog people have archived that before; it's very very slow
20:20 🔗 DeVan balrog: not that site
20:22 🔗 nico_32 last IA crawling: 2011/2012
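
(nico_32's "last IA crawling" check can be reproduced against the Wayback Machine's CDX API; a sketch under the assumption that the public endpoint at web.archive.org/cdx/search/cdx is queried, with an illustrative hostname.)

# Sketch: ask the Wayback CDX API when a site was last captured,
# reproducing the "last IA crawling: 2011/2012" style of check above.
# Endpoint and parameters follow the public CDX server; the hostname is illustrative.
import json
import urllib.parse
import urllib.request

CDX = "http://web.archive.org/cdx/search/cdx"

def last_capture(host):
    """Return the timestamp (YYYYMMDDhhmmss) of the newest capture under host/*, or None."""
    params = urllib.parse.urlencode({
        "url": host + "/*",
        "output": "json",
        "fl": "timestamp",
        "collapse": "timestamp:6",  # at most one row per month, to keep the response small
    })
    with urllib.request.urlopen(CDX + "?" + params, timeout=60) as resp:
        rows = json.load(resp)
    # The first row is the field-name header; the rest are captures in chronological order.
    return rows[-1][0] if len(rows) > 1 else None

if __name__ == "__main__":
    print(last_capture("example.com"))  # illustrative hostname
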
21:37 🔗 arkhive would be good maybe to back up xbins (xbox-scene.com file downloads)
21:39 🔗 arkhive http://www.xbins.org/ and the actual files obtained via IRC and FTP
21:39 🔗 arkhive http://www.xbox-scene.com/articles/xbins.php
21:39 🔗 arkhive tutorial.
21:41 🔗 arkhive I downloaded a whole bunch about ten years ago but all the files probably have newer releases/versions and i never got the whole lot.
21:42 🔗 arkhive Also, I think there was a limit on how many you could get a day. or hour. Can't remember though.
21:42 🔗 arkiver arkhive: I will run a url discovery program tomorrow on those and see how big the sites are and how many files they contain
21:43 🔗 arkiver will then start a grab
21:43 🔗 arkhive okay. i think they are hosted off site. like not on xbins.org. can't remember though. I was like 13 lol
21:43 🔗 DFJustin joepie91: oh are you grabbing that /b/ thing, I just emailed jason about it
21:43 🔗 arkhive but i'll get those CD's i put the xbins files on.
21:44 🔗 joepie91 DFJustin: yes, one of my boxes is downloading it atm
21:46 🔗 DFJustin \o/
21:46 🔗 arkhive I still need to continue my saving of old apps/programs from dead/zombied mobile platforms. good article if i remember right lol. http://www.visionmobile.com/blog/2012/01/the-dead-platform-graveyard-lessons-learned-2/
21:46 🔗 arkhive but to -bs for me. :)
21:56 🔗 joepie91 for the record, I have several servers with 500G disk now
21:56 🔗 joepie91 so anything up to that size, I can fetch
21:56 🔗 joepie91 (ping me when necessary)
21:59 🔗 yipdw joepie91: !a https://bui.pm/ded
21:59 🔗 joepie91 yipdw: it's too big for that
22:00 🔗 yipdw I should add that to ArchiveBot
22:00 🔗 yipdw "Sorry, this item is too big"
22:04 🔗 bsmith093 http://bofh.nikhef.nl/events/ this seems to be a mirror for... everything con related for tech things. how can i get a size, without just dl-ing all of it?
22:06 🔗 arkiver arkhive: I will do the xbox-scene website first, since those downloads are on-site
22:06 🔗 arkiver and then I'll take a look at the other one
22:06 🔗 arkiver :)
22:11 🔗 joepie91 bsmith093: not, probably
22:11 🔗 joepie91 unless you script a bit
22:13 🔗 bsmith093 joepie91: I'm gonna run a wget spider, then dump the log to a url extractor, then dump that into jdownloader, so i at least know how big it is.
22:13 🔗 joepie91 oh man :P
22:15 🔗 arkiver bsmith093: I will try to get the size and the number of urls for you tomorrow
22:16 🔗 bsmith093 ark i'm already running the spider, that's faster, isn't it?
22:16 🔗 yipdw mentioning jdownloader to joepie91 is like talking to a TEA Party member about taxes
22:16 🔗 yipdw the cool thing about US-centric similes is that they're always worse than you intend
22:16 🔗 yipdw :P
22:17 🔗 bsmith093 joepie91: whats wrong with this workflow? seriously I'd love any suggestions :)
22:17 🔗 joepie91 bsmith093: it just seems a bit... duct-tapey :)
22:17 🔗 joepie91 aside from the jdownloader bit
22:19 🔗 arkiver hmm, we can see how accurate jdownloader is, please let me know the size of the site tomorrow
22:20 🔗 * m1das moves to #archiveteam-bs and opens the popcorn
22:23 🔗 ivan` bsmith093: I will tell you in a few minutes
22:24 🔗 ivan` I am using HTTrack and find . -name 'index.html' | xargs cat | grep 'alt="\[ \]">' | perl -p -i -e 's/ +/ /g' | python -c "exec 'import sys\nfor line in sys.stdin: print line.strip().split()[-1]'" | sed 's/G/*1024*1024*1024/g' | sed 's/M/*1024*1024/g' | sed 's/K/*1024/g'
22:25 🔗 ivan` buggy, ask me for fixed version later if you want it
22:31 🔗 Nemo_bis regex ftw
22:40 🔗 ivan` 400GB so far but it's still grabbing indexes
22:40 🔗 ivan` find . -name 'index.html' | xargs cat | grep 'alt="\[...\]">' | grep -v 'alt="\[DIR\]">' | perl -p -i -e 's/ +/ /g' | python -c "exec 'import sys\nfor line in sys.stdin: print line.strip().split()[-1]'" | sed 's/G/*1024*1024*1024/g' | sed 's/M/*1024*1024/g' | sed 's/K/*1024/g' | python -c "exec 'import sys\nprint sum(int(eval(line.strip(), {\'__builtins__\': None})) for line in sys.stdin)'"
23:57 🔗 bsmith093 well ok then, ivan` you can grab it if you eant, holy crap thats big
23:57 🔗 bsmith093 *want
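
(A Python restatement of the size-summing idea in ivan`'s pipeline above, assuming the usual Apache mod_autoindex listing where file rows carry an icon alt tag and end with a human-readable size such as 4.0K, 12M or 1.3G; it is a sketch, not the fixed version ivan` offered.)

# Sketch: sum the sizes listed in mirrored Apache directory-index pages,
# the same idea as the find/grep/sed pipeline above, restated in Python.
# Assumes each file row's last whitespace-separated field is a size like 4.0K, 12M, 1.3G.
import os
import re

SIZE = re.compile(r"^([\d.]+)([KMG]?)$")
MULTIPLIER = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def total_bytes(root="."):
    total = 0.0
    for dirpath, _, filenames in os.walk(root):
        if "index.html" not in filenames:
            continue
        with open(os.path.join(dirpath, "index.html"), errors="replace") as f:
            for line in f:
                # File rows carry an icon alt tag; skip directory rows.
                if 'alt="[' not in line or 'alt="[DIR]' in line:
                    continue
                fields = line.split()
                match = SIZE.match(fields[-1]) if fields else None
                if match:
                    total += float(match.group(1)) * MULTIPLIER[match.group(2)]
    return int(total)

if __name__ == "__main__":
    print(total_bytes())  # run from the top of the HTTrack/wget mirror
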
