#archiveteam-bs 2013-06-21,Fri


Time Nickname Message
00:53 🔗 omf_ The first episode of 'Ray Donovan' is free on youtube
01:53 🔗 omf_ best unicode tweet ever https://twitter.com/Wu_Tang_Finance/status/347793126234148864
02:27 🔗 dashcloud so, what does everyone use to keep a single program from accidentally eating up all the CPU time?
02:39 🔗 winr4r dashcloud: nice
02:40 🔗 dashcloud I'll look into that- thanks!
02:41 🔗 dashcloud ever used cpulimit? that seemed to be the preferred choice over nice
02:43 🔗 winr4r nope!
04:58 🔗 godane g4tv.com-video56930-flvhd: Internet Goes On Strike Against SOPA - AOTS Loops In Reddit's Ohanian: https://archive.org/details/g4tv.com-video56930-flvhd
04:59 🔗 godane just a random video from my g4 video grabs
05:08 🔗 omf_ http://www.technologyreview.com/news/516156/a-popular-ad-blocker-also-helps-the-ad-industry/
05:11 🔗 * omf_ pokes Smiley in the eyeball
05:21 🔗 * BlueMax pokes omf_ with an anvil
05:22 🔗 Coderjoe nice and ionice. generally, it is fine to let a program use all spare CPU time as long as higher-priority tasks can get in front of it properly
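A few illustrative invocations of the tools mentioned above (the script name and PID are placeholders):

    # run a new job at the lowest CPU and I/O scheduling priority
    nice -n 19 ionice -c3 ./long_job.sh
    # cap an already-running process (hypothetical PID 12345) at roughly 50% of one core
    cpulimit -l 50 -p 12345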
06:06 🔗 omf_ yes BlueMax
06:18 🔗 godane so i found this: http://web.gbtv.com/gen/multimedia/detail/7/0/1/19968701.xml
06:18 🔗 godane Glenn Beck learns what may be ahead in a worst-case-scenario roundtable discussion.
06:19 🔗 godane the best part is this is an hour and 56 mins long
06:36 🔗 godane of course its not that
06:38 🔗 godane it looks to be him explaining how he's going to build the network gbtv now
08:10 🔗 Smiley GLaDOS: awaken!!!!
08:11 🔗 winr4r hi Smiley
08:11 🔗 Smiley hey winr4r
08:14 🔗 arrith1 g'morning. i really should be heading to bed but i'm slowly chipping away at this perfect python script
08:15 🔗 Smiley what does it do?
08:15 🔗 winr4r keeps arrith1 awake
08:16 🔗 winr4r meta, bitches
08:16 🔗 arrith1 haha
08:16 🔗 arrith1 Smiley: well eventually it should work on multiple sites, but right now it's just to crawl livejournal.com and get a big textfile of usernames
08:17 🔗 Smiley nice
08:17 🔗 Smiley you have seen my bash right?
08:17 🔗 Smiley https://github.com/djsmiley2k/smileys-random-tools/blob/master/get_xanga_users
08:17 🔗 arrith1 i haven't hmm
08:17 🔗 arrith1 mine is for the Google Reader archiving effort which just needs lists of usernames from a range of sites, listed out on http://archiveteam.org/index.php?title=Google_Reader
08:18 🔗 arrith1 Smiley: oh btw, your wikipage is very helpful with wget-warc
08:18 🔗 Smiley no worries.
08:18 🔗 arrith1 Smiley: oh actually i have seen that script. i forgot about it though.
08:18 🔗 Smiley arrith1: well it's my own way of crawling any numbered site, to grab all the usernames on each page...
08:18 🔗 winr4r oh, talking of which
08:19 🔗 Smiley I'm not a programmer at all, no idea if it's actually good :D
08:19 🔗 Smiley but it works \o/
08:19 🔗 winr4r i just realised i still have greader-directory-grab running
08:19 🔗 arrith1 Smiley: yeah looks good
08:19 🔗 arrith1 winr4r: nice
08:19 🔗 * winr4r lets it be
08:19 🔗 arrith1 yeah we can use all the help we can get running greader-grab and greader-directory-grab
08:19 🔗 winr4r i think he still needs moar people on the job
08:19 🔗 Smiley yah, need help crawling these usernames too D:
08:19 🔗 arrith1 yeah
08:19 🔗 arrith1 i set concurrent to 32 on greader-directory-grab >:D
08:19 🔗 arrith1 Smiley: xanga usernames?
08:20 🔗 Smiley so much to grab, so little time
08:20 🔗 Smiley yup
08:20 🔗 winr4r arrith1: what is it by default?
08:20 🔗 arrith1 winr4r: the instructions had it not specifying, so i think 1. instructions were updated to 3. i ran 8 for a while without any problems, and then 16, then 32
08:21 🔗 arrith1 Smiley: awk is awesome btw. also is the xanga thing using ArchiveTeam Warrior? or is it some other script? i can help out if there's a thing delegating to clients
08:21 🔗 Smiley arrith1: both
08:21 🔗 Smiley actual _Grab_ for xanga is in warrior
08:21 🔗 Smiley for the username grabbing, it's separate for now
08:22 🔗 Smiley http://www.archiveteam.org/index.php?title=Xanga << how can i help has instructions for username grab if you want to run some
08:22 🔗 Smiley you can run plenty concurrently, it's pretty slow tho
08:23 🔗 Smiley tomorrow I might run some from work ;)
08:25 🔗 arrith1 Smiley: what START and END should i use?
08:25 🔗 Smiley http://pad.archivingyoursh.it/p/xanga-ranges << take your pick
08:25 🔗 Smiley feel free to & them and run multiple too
08:25 🔗 Smiley and redirect the output/remove it if unwanted
08:26 🔗 arrith1 ahhh nice. that's what i'm looking for. nice big list to claim
08:26 🔗 arrith1 i'll claim a few then let run over night
08:28 🔗 Smiley if you have the spare bandwidth, remove the sleep
08:29 🔗 arrith1 will do. i'm basically cpu limited, none of this stuff maxes out the bandwidth on this seedbox of mine so far
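For illustration, claiming a couple of ranges and backgrounding them as Smiley suggests might look like this (the ranges shown are hypothetical; drop the redirect if you want the output):

    ./get_xanga_users 30000 40000 > /dev/null 2>&1 &
    ./get_xanga_users 40000 50000 > /dev/null 2>&1 &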
08:30 🔗 arrith1 Smiley: btw in line 18 of your script, you can optionally use "seq" instead of that eval deal
08:30 🔗 Smiley nice
08:30 🔗 Smiley no you can't
08:30 🔗 Smiley nope
08:30 🔗 Smiley at least I don't think it'll let you
08:30 🔗 arrith1 should be like for i in $(seq 1 $max_pages)
08:31 🔗 arrith1 or wait
08:31 🔗 Smiley hmmm
08:31 🔗 Smiley feel free to check :)
08:31 🔗 Smiley might work
08:31 🔗 Smiley I just know {1..$x} doesn't expand
08:31 🔗 arrith1 yeah, {} doesn't work
08:32 🔗 Smiley {1..1000} does D:
08:32 🔗 winr4r `seq 1 $x`
08:32 🔗 arrith1 $ foo=3; for i in $(seq 1 $foo); do echo "$i"; done
08:32 🔗 arrith1 1
08:32 🔗 arrith1 2
08:32 🔗 arrith1 3
08:32 🔗 winr4r (won't work on bsd)
08:32 🔗 arrith1 winr4r: supposed to use $() over ``
08:32 🔗 winr4r arrith1: really?
08:32 🔗 winr4r i've always used backticks
08:32 🔗 arrith1 yeah, bsd/osx instead uses 'jot'
08:32 🔗 winr4r ah :)
08:33 🔗 Smiley isn't the $( ) because it handles spaces etc in returned values better?
08:33 🔗 arrith1 diff syntax though, jot vs seq is wacky
08:33 🔗 arrith1 what i heard about backticks vs $() is readability, iirc
08:33 🔗 arrith1 people in #bash on freenode are very pro $()
08:33 🔗 winr4r $() isn't really that much more intuitive or obvious than ` `
08:34 🔗 Smiley `'`'``
08:34 🔗 arrith1 i think generally parens are used more for grouping. i don't know where else backticks are used
08:34 🔗 arrith1 Smiley: seq should work, but it's linux specific i guess. the eval/echo stuff is more platform independent. dunno if there's a performance benefit for using seq
08:35 🔗 Smiley prob is, but for this script it hardly matters.
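For reference, a portable alternative to both seq and the eval trick is a plain arithmetic while loop; it works in any POSIX shell, whereas brace expansion such as {1..$max_pages} fails because braces expand before variables are substituted ($max_pages as in the script):

    i=1
    while [ "$i" -le "$max_pages" ]; do
        echo "$i"        # or do the per-page work here
        i=$((i + 1))
    done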
08:37 🔗 arrith1 yeah. i'd be curious what the bottlenecks are to make it go faster though
08:37 🔗 Smiley remove the sleep
08:37 🔗 Smiley and wget ..... &
08:37 🔗 arrith1 Smiley: how much time is left to get the xanga stuff?
08:37 🔗 Smiley then it'll FLY
08:37 🔗 Smiley arrith1: not sure.
08:38 🔗 Smiley a month maybe? Need to ask SketchCow
08:38 🔗 Smiley the actual grabbing of blogs is more important
08:39 🔗 Smiley from my testing, we already have like 95% of the usernames, but as I don't know how they were collected, I can't be sure what I'm testing against is a "full" set
08:39 🔗 Smiley so that percentage may drop in the future
08:41 🔗 arrith1 Smiley: alright. wait so remove the sleep, and remove some wget line?
08:50 🔗 arrith1 hm
08:51 🔗 arrith1 Smiley: i'll assume you mean to run with & to do multiple concurrently
08:53 🔗 Smiley yes remove sleep
08:53 🔗 winr4r the 15th of july is the last day of xanga as we know it
08:53 🔗 winr4r after that they either die, or go to a paid account model
08:53 🔗 Smiley but if you do the wget -v --directory-prefix=_$y -a wget.log "http://www.xanga.com/groups/subdir.aspx?id=$y&uni-72-pg=$x" &
08:54 🔗 Smiley that won't wait for each wget to finish before continuing the loop
08:54 🔗 Smiley be warned, it'll fire up thousands
08:54 🔗 Smiley so you might want to try with just ./get_xanga_users x x+1
08:54 🔗 winr4r Smiley: teaching people how to forkbomb themselves? :P
08:54 🔗 Smiley winr4r: it came with a warning
08:59 🔗 arrith1 eh
08:59 🔗 arrith1 Smiley: yeah i'd rather not do that much
09:00 🔗 arrith1 Got value for group 90016; Max pages =
09:00 🔗 arrith1 Grabbing page {1..}
09:00 🔗 arrith1 grabbing pages 1 -
09:00 🔗 Smiley errr
09:00 🔗 arrith1 that's the output i get btw, but seems to be working
09:00 🔗 Smiley I mean like 1 2
09:00 🔗 Smiley or 10 11
09:00 🔗 Smiley not actual x :P
09:01 🔗 arrith1 Smiley: which line is this on?
09:01 🔗 arrith1 oh
09:02 🔗 arrith1 add & after that line
09:03 🔗 arrith1 then run get_xanga_users with a really low number?
09:03 🔗 Smiley not low
09:03 🔗 Smiley the numbers are normally the range you're doing
09:03 🔗 Smiley so like from 30000 to 40000
09:03 🔗 Smiley but try it with like 30001 30002
09:04 🔗 arrith1 erm, i think that'd spawn like 10,000
09:04 🔗 Smiley as it'll open as many connections as there are pages.
09:04 🔗 arrith1 yeah
09:04 🔗 Smiley well biggest one I've seen is 2000
09:04 🔗 Smiley grabbing pages 1 - 2144
09:06 🔗 Smiley there are other ways of doing it....
09:06 🔗 arrith1 well i'm doing 8, the ones i claimed. seems to be going about one per second or a little over.
09:06 🔗 Smiley sleeping for smaller amounts of time, passing wget a collection of a few urls per spawn, but it'll be awhile before I can get around to looking into that
09:06 🔗 Smiley Got a party to plan and run
09:06 🔗 Smiley and I'm no coder.
09:07 🔗 arrith1 spawning a few wgets would be good i think
09:07 🔗 arrith1 i can help next month probably
09:07 🔗 Smiley you could do something like z=$(y)
09:07 🔗 arrith1 by my calculations the 8 i'm doing should finish in around 3 hours
09:08 🔗 Smiley wget z, wget z+1, wget z+2, wget z+3; end loop, y+4; repeat
09:08 🔗 Smiley So grabbing 4 per loop run
09:08 🔗 * Smiley realises he appears to be thinking like a coder
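A rough sketch of that batching idea, reusing the $y group and $max_pages variables and the URL shape from Smiley's line above (four background wgets per pass, then wait for them before the next pass):

    x=1
    while [ "$x" -le "$max_pages" ]; do
        for offset in 0 1 2 3; do
            page=$((x + offset))
            [ "$page" -le "$max_pages" ] && \
                wget -v --directory-prefix=_$y -a wget.log "http://www.xanga.com/groups/subdir.aspx?id=$y&uni-72-pg=$page" &
        done
        wait                # block until all four finish
        x=$((x + 4))
    done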
09:09 🔗 arrith1 yeah. there's also xargs and gnu parallel
09:09 🔗 arrith1 echo urls | xargs -P 4 wget
09:09 🔗 Smiley i'm not well versed in them yet.
09:09 🔗 Smiley I only really got the hang of awk yesterday :D
09:09 🔗 arrith1 xargs is pretty neat, i'm gonna use it with wget warc this week
09:09 🔗 arrith1 heh
09:09 🔗 arrith1 well i'm all for concurrency
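To spell out the xargs idea: given a file of URLs (urls.txt here is hypothetical), this runs up to four wgets at a time, one URL per invocation; GNU parallel can do the same with -j:

    xargs -n 1 -P 4 wget -a wget.log < urls.txt
    # parallel -j 4 wget -a wget.log < urls.txt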
09:12 🔗 arrith1 seems there's about 30 or so sets left, so max time it'll take is 30 items * 3 hours/item = 90 hours, or about 3.75 days. but with people running them at the same time that'll go way faster
09:13 🔗 arrith1 i'd say at most a day or two. assuming there's no ratelimiting that comes up
09:13 🔗 Smiley i've seen none so far at my current speeds of 1 url per second-ish.
09:14 🔗 Smiley those sets near the end will take longer tho, lots of 404s
09:14 🔗 arrith1 Smiley: i did remove that sleep, but it's not really going all that fast
09:14 🔗 arrith1 which is fine, there's time i think
09:15 🔗 arrith1 alright, gtg. bbl
09:15 🔗 Smiley o/
09:16 🔗 Smiley grabbing pages 1 - 14649
09:16 🔗 Smiley So much for 2000 being the highest ;D
09:46 🔗 Schbirid i think i once had a tr or sed line to make IA compatible filenames, ring a bell for anyone? http://archive.org/about/faqs.php#216
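One such one-liner, assuming the FAQ's rule that identifiers should stick to ASCII letters, digits, period, hyphen, and underscore, could be:

    echo "$filename" | sed 's/[^A-Za-z0-9._-]/_/g'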
10:01 🔗 godane my dream is alive: http://hardware.slashdot.org/story/13/06/21/0255241/new-technique-for-optical-storage-claims-1-petabyte-on-a-single-dvd/
10:17 🔗 godane also someone should grab this: http://www.guardian.co.uk/world/interactive/2013/jun/20/exhibit-b-nsa-procedures-document
14:36 🔗 DFJustin godane:
14:36 🔗 DFJustin wget http://s3.amazonaws.com/s3.documentcloud.org/documents/716633/pages/exhibit-a-p{1..9}-normal.gif
14:36 🔗 DFJustin wget http://s3.amazonaws.com/s3.documentcloud.org/documents/716634/pages/exhibit-b-p{1..9}-normal.gif
15:18 🔗 DFJustin or actually, replace normal with large
17:41 🔗 winr4r so like
17:41 🔗 winr4r with greader-directory-grab
17:41 🔗 winr4r is it grabbing the feeds themselves or just crawling the directory
17:42 🔗 ivan` it's just querying the directory
17:42 🔗 ivan` you can upload querylists to the OPML collector if you wish
17:43 🔗 winr4r oh gotcha
20:07 🔗 arrith1 Smiley: hmm seems my estimates were a bit off. in my grab they're all around 3000
20:16 🔗 arrith1 Smiley: i have approx 25k, so ~3.1k for each of the 8. so i guess i'm a little under a third done.
20:18 🔗 arrith1 11 hrs for 3.1k, means ~35.5 hours for 10k items
20:21 🔗 arrith1 Smiley: so should be done in ~24 hours
20:33 🔗 Smiley arrith1: k
20:33 🔗 Smiley we have a new script too that someone else has written
20:33 🔗 Smiley you should join #jenga
20:56 🔗 arrith1 Smiley: ah alright, just joined
21:06 🔗 Smiley hey
23:26 🔗 joepie91 https://keenot.es/read/cause-and-infect-why-people-get-hacked
