[00:53] The first episode of 'Ray Donovan' is free on youtube
[01:53] best unicode tweet ever https://twitter.com/Wu_Tang_Finance/status/347793126234148864
[02:27] so, what does everyone use to keep a single program from accidentally eating up all the CPU time?
[02:39] dashcloud: nice
[02:40] I'll look into that- thanks!
[02:41] ever used cpulimit? that seemed to be the preferred choice over nice
[02:43] nope!
[04:58] g4tv.com-video56930-flvhd: Internet Goes On Strike Against SOPA - AOTS Loops In Reddit's Ohanian: https://archive.org/details/g4tv.com-video56930-flvhd
[04:59] just a random video from my g4 video grabs
[05:08] http://www.technologyreview.com/news/516156/a-popular-ad-blocker-also-helps-the-ad-industry/
[05:11] * omf_ pokes Smiley in the eyeball
[05:21] * BlueMax pokes omf_ with an anvil
[05:22] nice and ionice. generally, it is fine to let a program use all spare CPU time as long as higher-priority tasks can get in front of it properly
[06:06] yes BlueMax
[06:18] so i found this: http://web.gbtv.com/gen/multimedia/detail/7/0/1/19968701.xml
[06:18] Glenn Beck learns what may be ahead in a worst-case-scenario roundtable discussion.
[06:19] the best part is this is an hour and 56 mins long
[06:36] of course it's not that
[06:38] it looks to be him explaining how he's going to build the network gbtv now
[08:10] GLaDOS: awaken!!!!
[08:11] hi Smiley
[08:11] hey winr4r
[08:14] g'morning. i really should be heading to bed but i'm slowly chipping away at this perfect python script
[08:15] what does it do?
[08:15] keeps arrith1 awake
[08:16] meta, bitches
[08:16] haha
[08:16] Smiley: well eventually it should work on multiple sites, but right now it's just to crawl livejournal.com and get a big textfile of usernames
[08:17] nice
[08:17] you have seen my bash right?
[08:17] https://github.com/djsmiley2k/smileys-random-tools/blob/master/get_xanga_users
[08:17] i haven't hmm
[08:17] mine is for the Google Reader archiving effort which just needs lists of usernames from a range of sites, listed out on http://archiveteam.org/index.php?title=Google_Reader
[08:18] Smiley: oh btw, your wikipage is very helpful with wget-warc
[08:18] no worries.
[08:18] Smiley: oh actually i have seen that script. i forgot about it though.
[08:18] arrith1: well it's my own way of crawling any numbered site, to grab all the usernames on each page...
[08:18] oh, talking of which
[08:19] I'm not a programmer at all, no idea if it's actually good :D
[08:19] but it works \o/
[08:19] i just realised i still have greader-directory-grab running
[08:19] Smiley: yeah looks good
[08:19] winr4r: nice
[08:19] * winr4r lets it be
[08:19] yeah we can use all the help we can get running greader-grab and greader-directory-grab
[08:19] i think he still needs moar people on the job
[08:19] yah, need help crawling these usernames too D:
[08:19] yeah
[08:19] i set concurrent to 32 on greader-directory-grab >:D
[08:19] Smiley: xanga usernames?
[08:20] so much to grab, so little time
[08:20] yup
[08:20] arrith1: what is it by default?
[08:20] winr4r: the instructions didn't specify, so i think 1. the instructions were updated to 3. i ran 8 for a while without any problems, and then 16, then 32
[08:21] Smiley: awk is awesome btw. also is the xanga thing using ArchiveTeam Warrior? or is it some other script? i can help out if there's a thing delegating to clients
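A minimal sketch of the "crawl a numbered site and pull the usernames off each page" approach Smiley describes above. This is not his actual get_xanga_users script; the URL and the href pattern are made-up placeholders.

    #!/bin/bash
    # usage: ./crawl_usernames START END
    # fetches listing pages START..END and appends anything that looks
    # like a username to usernames.txt -- URL and pattern are hypothetical
    start=$1
    end=$2
    for ((id = start; id <= end; id++)); do
        wget -q -O - "http://example.com/groups/page.aspx?id=$id" \
            | grep -o 'href="/user/[^"/]*"' \
            | cut -d/ -f3 \
            | tr -d '"' >> usernames.txt
        sleep 1   # be polite; remove if you have the bandwidth
    done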
[08:21] arrith1: both
[08:21] actual _Grab_ for xanga is in warrior
[08:21] for the username grabbing, it's separate for now
[08:22] http://www.archiveteam.org/index.php?title=Xanga << "how can i help" has instructions for the username grab if you want to run some
[08:22] you can run plenty concurrently, it's pretty slow tho
[08:23] tomorrow I might run some from work ;)
[08:25] Smiley: what START and END should i use?
[08:25] http://pad.archivingyoursh.it/p/xanga-ranges << take your pick
[08:25] feel free to & them and run multiple too
[08:25] and redirect the output/remove it if unwanted
[08:26] ahhh nice. that's what i'm looking for. nice big list to claim
[08:26] i'll claim a few then let them run overnight
[08:28] if you have the spare bandwidth, remove the sleep
[08:29] will do. i'm basically cpu limited, none of this stuff maxes out the bandwidth on this seedbox of mine so far
[08:30] Smiley: btw in line 18 of your script, you can optionally use "seq" instead of that eval deal
[08:30] nice
[08:30] mp ypi cam
[08:30] nope
[08:30] at least I don't think it'll let you
[08:30] should be like for i in $(seq 1 $max_pages)
[08:31] or wait
[08:31] hmmm
[08:31] feel free to check :)
[08:31] might work
[08:31] I just know {1..$x} doesn't expand
[08:31] yeah, {} doesn't work
[08:32] {1..1000} does D:
[08:32] `seq 1 $x`
[08:32] $ foo=3; for i in $(seq 1 $foo); do echo "$i"; done
[08:32] 1
[08:32] 2
[08:32] 3
[08:32] (won't work on bsd)
[08:32] winr4r: supposed to use $() over ``
[08:32] arrith1: really?
[08:32] i've always used backticks
[08:32] yeah, bsd/osx instead uses 'jot'
[08:32] ah :)
[08:33] isn't the $( ) because it handles spaces etc in returned values better?
[08:33] diff syntax though, jot vs seq is wacky
[08:33] what i heard about backticks vs $() is readability, iirc
[08:33] people in #bash on freenode are very pro $()
[08:33] $() isn't really that much more intuitive or obvious than ` `
[08:34] `'`'``
[08:34] i think generally parens are used more for grouping. i don't know where else backticks are used
[08:34] Smiley: seq should work, but it's linux specific i guess. the eval/echo stuff is more platform independent. dunno if there's a performance benefit for using seq
[08:35] prob is, but for this script it hardly matters.
[08:37] yeah. i'd be curious what the bottlenecks are to make it go faster though
[08:37] remove the sleep
[08:37] and wget ..... &
[08:37] Smiley: how much time is left to get the xanga stuff?
[08:37] then it'll FLY
[08:37] arrith1: not sure.
[08:38] a month maybe? Need to ask SketchCow
[08:38] the actual grabbing of blogs is more important
[08:39] from my testing, we already have like 95% of the usernames, but as I don't know how they were collected, I can't be sure what I'm testing against is a "full" set
[08:39] so that percentage may drop in the future
[08:41] Smiley: alright. wait so remove the sleep, and remove some wget line?
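For the {1..$x} / seq exchange above, a minimal sketch of two ways to loop over a variable upper bound: brace expansion happens before variable expansion in bash, which is why {1..$x} never works. The max_pages value is just an example.

    max_pages=2144          # example value

    # GNU coreutils (linux); bsd/osx would need jot instead
    for i in $(seq 1 "$max_pages"); do
        echo "$i"
    done

    # plain bash arithmetic loop, no seq or jot needed anywhere
    for ((i = 1; i <= max_pages; i++)); do
        echo "$i"
    done

Either form should be a drop-in for the eval/echo construction mentioned at [08:30], with the arithmetic loop being the more portable of the two.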
[08:50] hm
[08:51] Smiley: i'll assume you mean to run with & to do multiple concurrently
[08:53] yes remove sleep
[08:53] the 15th of july is the last day of xanga as we know it
[08:53] after that they either die, or go to a paid account model
[08:53] but if you do the wget -v --directory-prefix=_$y -a wget.log "http://www.xanga.com/groups/subdir.aspx?id=$y&uni-72-pg=$x" & line
[08:54] that won't wait for each wget to finish before continuing the loop
[08:54] be warned, it'll fire up thousands
[08:54] so you might want to try with just ./get_xanga_users x x+1
[08:54] Smiley: teaching people how to forkbomb themselves? :P
[08:54] winr4r: it came with a warning
[08:59] eh
[08:59] Smiley: yeah i'd rather not do that much
[09:00] Got value for group 90016; Max pages =
[09:00] Grabbing page {1..}
[09:00] grabbing pages 1 -
[09:00] errr
[09:00] that's the output i get btw, but seems to be working
[09:00] I mean like 1 2
[09:00] or 10 11
[09:00] not actual x :P
[09:01] Smiley: which line is this on?
[09:01] oh
[09:02] add & after that line
[09:03] then run get_xanga_users with a really low number?
[09:03] not low
[09:03] the numbers are normally the range you're doing
[09:03] so like from 30000 to 40000
[09:03] but try it with like 30001 30002
[09:04] erm, i think that'd spawn like 10,000
[09:04] as it'll open as many connections as there are pages.
[09:04] yeah
[09:04] well biggest one I've seen is 2000
[09:04] grabbing pages 1 - 2144
[09:06] there are other ways of doing it....
[09:06] well i'm doing 8, the ones i claimed. seems to be going about one per second or a little over.
[09:06] sleeping for smaller amounts of time, passing wget a collection of a few urls per spawn, but it'll be awhile before I can get around to looking into that
[09:06] Got a party to plan and run
[09:06] and I'm no coder.
[09:07] spawning a few wgets would be good i think
[09:07] i can help next month probably
[09:07] you could do something like z=$(y)
[09:07] by my calculations the 8 i'm doing should finish in around 3 hours
[09:08] wget z, wget z+1, wget z+2, wget z+3; end loop, y+4; repeat
[09:08] So grabbing 4 per loop run
[09:08] * Smiley realises he appears to be thinking like a coder
[09:09] yeah. there's also xargs and gnu parallel
[09:09] echo urls | xargs -P 4 wget
[09:09] i'm not well versed in them yet.
[09:09] I only really got the hang of awk yesterday :D
[09:09] xargs is pretty neat, i'm gonna use it with wget warc this week
[09:09] heh
[09:09] well i'm all for concurrency
[09:12] seems there's about 30 or so sets left, so max time it'll take is 30 items * 3 hours/item = 90 hours, or about 3.75 days. but with people running them at the same time that'll go way faster
[09:13] i'd say at most a day or two. assuming there's no ratelimiting that comes up
[09:13] i've seen none so far at my current speeds of 1 url per secondish.
[09:14] those sets near the end will take longer tho, lots of 404s
[09:14] Smiley: i did remove that sleep, but it's not really going all that fast
[09:14] which is fine, there's time i think
[09:15] alright, gtg. bbl
[09:15] o/
[09:16] grabbing pages 1 - 14649
[09:16] So much for 2000 being the highest ;D
[09:46] i think i once had a tr or sed line to make IA compatible filenames, ring a bell for anyone?
http://archive.org/about/faqs.php#216
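A hedged sketch of the kind of tr one-liner being asked about here. The allowed character set (letters, digits, period, underscore, dash) is an assumption based on archive.org's identifier rules, so check the FAQ link above before relying on it.

    # squeeze anything outside A-Za-z0-9._- into single underscores;
    # ASSUMPTION: that character set is what archive.org actually accepts
    sanitize() {
        printf '%s' "$1" | tr -cs 'A-Za-z0-9._-' '_'
    }

    name=$(sanitize "some blog: weird/name (2013)!")
    echo "$name"   # -> some_blog_weird_name_2013_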
[10:01] my dream is alive: http://hardware.slashdot.org/story/13/06/21/0255241/new-technique-for-optical-storage-claims-1-petabyte-on-a-single-dvd/
[10:17] also someone should grab this: http://www.guardian.co.uk/world/interactive/2013/jun/20/exhibit-b-nsa-procedures-document
[14:36] godane:
[14:36] wget http://s3.amazonaws.com/s3.documentcloud.org/documents/716633/pages/exhibit-a-p{1..9}-normal.gif
[14:36] wget http://s3.amazonaws.com/s3.documentcloud.org/documents/716634/pages/exhibit-b-p{1..9}-normal.gif
[15:18] or actually, replace normal with large
[17:41] so like
[17:41] with greader-directory-grab
[17:41] is it grabbing the feeds themselves or just crawling the directory
[17:42] it's just querying the directory
[17:42] you can upload querylists to the OPML collector if you wish
[17:43] oh gotcha
[20:07] Smiley: hmm seems my estimates were a bit off. in my grab they're all around 3000
[20:16] Smiley: i have approx 25k, so ~3.1k for each of the 8. so i guess i'm a little under a third done.
[20:18] 11 hrs for 3.1k, means ~35.5 hours for 10k items
[20:21] Smiley: so should be done in ~24 hours
[20:33] arrith1: k
[20:33] we have a new script too that someone else has written
[20:33] you should join #jenga
[20:56] Smiley: ah alright, just joined
[21:06] hey
[23:26] https://keenot.es/read/cause-and-infect-why-people-get-hacked
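Going back to the "echo urls | xargs -P 4 wget" suggestion and Smiley's "grab 4 per loop run" idea from earlier: a minimal sketch of bounded concurrency with xargs, using the group id and page count seen in the log as example values.

    # build the per-page URLs for one group, then fetch them 4 at a time
    y=90016                 # example group id from the log
    max=2144                # example page count from the log
    for x in $(seq 1 "$max"); do
        echo "http://www.xanga.com/groups/subdir.aspx?id=$y&uni-72-pg=$x"
    done > urls.txt

    # -n 1 hands each wget a single URL; without it xargs packs the whole
    # list into one wget call and -P 4 has nothing to run in parallel
    xargs -n 1 -P 4 wget -q --directory-prefix="_$y" < urls.txt

Unlike backgrounding every wget with &, this never has more than 4 downloads in flight, so it avoids the "fire up thousands" problem warned about above; gnu parallel, also mentioned in the log, can do the same job.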