00:53 <omf_> The first episode of 'Ray Donovan' is free on youtube
01:53 <omf_> best unicode tweet ever https://twitter.com/Wu_Tang_Finance/status/347793126234148864
02:27 <dashcloud> so, what does everyone use to keep a single program from accidentally eating up all the CPU time?
02:39 <winr4r> dashcloud: nice
02:40 <dashcloud> I'll look into that- thanks!
02:41 <dashcloud> ever used cpulimit? that seemed to be the preferred choice over nice
02:43 <winr4r> nope!
04:58 <godane> g4tv.com-video56930-flvhd: Internet Goes On Strike Against SOPA - AOTS Loops In Reddit's Ohanian: https://archive.org/details/g4tv.com-video56930-flvhd
04:59 <godane> just a random video from my g4 video grabs
05:08 <omf_> http://www.technologyreview.com/news/516156/a-popular-ad-blocker-also-helps-the-ad-industry/
05:11 * omf_ pokes Smiley in the eyeball
05:21 * BlueMax pokes omf_ with an anvil
05:22 <Coderjoe> nice and ionice. generally, it is fine to let a program use all spare CPU time as long as higher-priority tasks can get in front of it properly
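To make the distinction concrete: nice and ionice lower a job's CPU and I/O scheduling priority so other tasks can get in front of it, while cpulimit enforces a hard percentage cap on a running process. A minimal sketch of both approaches (the tar job and the PID are just placeholders):

    # start a job at the lowest CPU priority and idle I/O priority
    nice -n 19 ionice -c 3 tar czf backup.tar.gz /home/user &

    # or cap an already-running process (PID 1234 here) at roughly 50% of one core
    cpulimit --pid 1234 --limit 50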
06:06 <omf_> yes BlueMax
06:18 <godane> so i found this: http://web.gbtv.com/gen/multimedia/detail/7/0/1/19968701.xml
06:18 <godane> Glenn Beck learns what may be ahead in a worst-case-scenario roundtable discussion.
06:19 <godane> the best part is this is an hour and 56 mins long
06:36 <godane> of course it's not that
06:38 <godane> it looks to be him explaining how he's going to build the network gbtv now
08:10 <Smiley> GLaDOS: awaken!!!!
08:11 <winr4r> hi Smiley
08:11 <Smiley> hey winr4r
08:14 <arrith1> g'morning. i really should be heading to bed but i'm slowly chipping away at this perfect python script
08:15 <Smiley> what does it do?
08:15 <winr4r> keeps arrith1 awake
08:16 <winr4r> meta, bitches
08:16 <arrith1> haha
08:16 <arrith1> Smiley: well eventually it should work on multiple sites, but right now it's just to crawl livejournal.com and get a big textfile of usernames
08:17 <Smiley> nice
08:17 <Smiley> you have seen my bash right?
08:17 <Smiley> https://github.com/djsmiley2k/smileys-random-tools/blob/master/get_xanga_users
08:17 <arrith1> i haven't hmm
08:17 <arrith1> mine is for the Google Reader archiving effort which just needs lists of usernames from a range of sites, listed out on http://archiveteam.org/index.php?title=Google_Reader
08:18 <arrith1> Smiley: oh btw, your wikipage is very helpful with wget-warc
08:18 <Smiley> no worries.
08:18 <arrith1> Smiley: oh actually i have seen that script. i forgot about it though.
08:18 <Smiley> arrith1: well it's my own way of crawling any numbered site, to grab all the usernames on each page...
08:19 <winr4r> oh, talking of which
08:19 <Smiley> I'm not a programmer at all, no idea if it's actually good :D
08:19 <Smiley> but it works \o/
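The approach Smiley describes — walking a numbered site page by page and scraping usernames out of each page — boils down to a loop like the following sketch. The URL pattern, the grep expression and the file names here are illustrative placeholders, not the actual get_xanga_users code:

    #!/bin/bash
    # sketch: crawl pages 1..max_pages of a numbered listing and collect usernames
    group=$1        # numeric id of the listing being walked
    max_pages=$2    # how many pages that listing has
    for page in $(seq 1 "$max_pages"); do
        wget -q -O - "http://example.com/groups/list.aspx?id=$group&pg=$page" \
            | grep -o 'href="http://[A-Za-z0-9_-]*\.example\.com' \
            | sed 's|href="http://||' \
            | cut -d. -f1 >> "usernames_$group.txt"
        sleep 1     # be polite; drop this if bandwidth allows
    done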
08:19 <winr4r> i just realised i still have greader-directory-grab running
08:19 <arrith1> Smiley: yeah looks good
08:19 <arrith1> winr4r: nice
08:19 * winr4r lets it be
08:19 <arrith1> yeah we can use all the help we can get running greader-grab and greader-directory-grab
08:19 <winr4r> i think he still needs moar people on the job
08:19 <Smiley> yah, need help crawling these usernames too D:
08:19 <arrith1> yeah
08:19 <arrith1> i set concurrent to 32 on greader-directory-grab >:D
08:19 <arrith1> Smiley: xanga usernames?
08:20 <Smiley> so much to grab, so little time
08:20 <Smiley> yup
08:20 <winr4r> arrith1: what is it by default?
08:20 <arrith1> winr4r: the instructions had it not specifying, so i think 1. instructions were updated to 3. i ran 8 for a while without any problems, and then 16, then 32
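For reference, these ArchiveTeam grabs are normally run through the seesaw kit, and the concurrency being discussed is a command-line option. The invocation usually looks something like the line below — this is from memory, so check the greader-directory-grab README for the exact form:

    # hypothetical example; flag names and order per the project's README
    cd greader-directory-grab
    run-pipeline pipeline.py --concurrent 32 YOURNICKNAME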
08:21 <arrith1> Smiley: awk is awesome btw. also is the xanga thing using ArchiveTeam Warrior? or is it some other script? i can help out if there's a thing delegating to clients
08:21 <Smiley> arrith1: both
08:21 <Smiley> actual _Grab_ for xanga is in warrior
08:21 <Smiley> for the username grabbing, it's separate for now
08:22 <Smiley> http://www.archiveteam.org/index.php?title=Xanga << the "how can I help" section has instructions for the username grab if you want to run some
08:22 <Smiley> you can run plenty concurrently, it's pretty slow tho
08:23 <Smiley> tomorrow I might run some from work ;)
08:25 <arrith1> Smiley: what START and END should i use?
08:25 <Smiley> http://pad.archivingyoursh.it/p/xanga-ranges << take your pick
08:25 <Smiley> feel free to & them and run multiple too
08:25 <Smiley> and redirect the output/remove it if unwanted
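Putting those suggestions together, running a couple of claimed ranges in the background with their output redirected looks roughly like this (the range numbers are placeholders — use whatever you claimed on the pad):

    # one backgrounded run per claimed START END range, output kept in a log
    ./get_xanga_users 30000 31000 > xanga_30000-31000.log 2>&1 &
    ./get_xanga_users 31000 32000 > xanga_31000-32000.log 2>&1 &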
08:26 <arrith1> ahhh nice. that's what i'm looking for. nice big list to claim
08:26 <arrith1> i'll claim a few then let run over night
08:28 <Smiley> if you have the spare bandwidth, remove the sleep
08:29 <arrith1> will do. i'm basically cpu limited, none of this stuff maxes out the bandwidth on this seedbox of mine so far
08:30 <arrith1> Smiley: btw in line 18 of your script, you can optionally use "seq" instead of that eval deal
08:30 <Smiley> nice
08:30 <Smiley> no you can't
08:30 <Smiley> nope
08:30 <Smiley> at least I don't think it'll let you
08:31 <arrith1> should be like for i in $(seq 1 $max_pages)
08:31 <arrith1> or wait
08:31 <Smiley> hmmm
08:31 <Smiley> feel free to check :)
08:31 <Smiley> might work
08:31 <Smiley> I just know {1..$x} doesn't expand
08:32 <arrith1> yeah, {} doesn't work
08:32 <Smiley> {1..1000} does D:
08:32 <winr4r> `seq 1 $x`
08:32 <arrith1> $ foo=3; for i in $(seq 1 $foo); do echo "$i"; done
08:32 <arrith1> 1
08:32 <arrith1> 2
08:32 <arrith1> 3
08:32 <winr4r> (won't work on bsd)
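The underlying issue: bash performs brace expansion before parameter expansion, so {1..$x} never sees the value of x. seq works because the variable is expanded before the external command runs, and bash's C-style for loop avoids the external command entirely. A quick comparison, assuming x=5:

    x=5
    for i in {1..$x}; do echo "$i"; done             # prints the literal "{1..5}" once
    for i in $(seq 1 "$x"); do echo "$i"; done       # 1 2 3 4 5 (needs seq; systems without it have jot)
    for ((i = 1; i <= x; i++)); do echo "$i"; done   # 1 2 3 4 5, pure bash, no external command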
08:32 <arrith1> winr4r: supposed to use $() over ``
08:32 <winr4r> arrith1: really?
08:32 <winr4r> i've always used backticks
08:32 <arrith1> yeah, bsd/osx instead uses 'jot'
08:33 <winr4r> ah :)
08:33 <Smiley> isn't the $( ) because it handles spaces etc in returned values better?
08:33 <arrith1> diff syntax though, jot vs seq is wacky
08:33 <arrith1> what i heard about backticks vs $() is readability, iirc
08:33 <arrith1> people in #bash on freenode are very pro $()
08:34 <winr4r> $() isn't really that much more intuitive or obvious than ` `
08:34 <Smiley> `'`'``
08:34 <arrith1> i think generally parens are used more for grouping. i don't know where else backticks are used
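The readability argument is mostly about nesting: $() nests without any escaping, while nested backticks have to be escaped, which is easy to misread:

    # nested command substitution with $() ...
    parent=$(basename $(dirname /usr/local/bin))        # -> "local"
    # ... and the same thing with backticks, inner ones escaped
    parent=`basename \`dirname /usr/local/bin\``        # same result, harder to read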
08:34
🔗
|
arrith1 |
Smiley: seq should work, but it's linux specific i guess. the eval/echo stuff is more platform independent. dunno if there's a performance benefit for using seq |
08:35
🔗
|
Smiley |
prob is, but for this script it hardly matters. |
08:37
🔗
|
arrith1 |
yeah. i'd be curious what the bottlenecks are to make it go faster though |
08:37
🔗
|
Smiley |
remove the sleep |
08:37
🔗
|
Smiley |
and wget ..... & |
08:37
🔗
|
arrith1 |
Smiley: how much time is left to get the xanga stuff? |
08:37
🔗
|
Smiley |
then it'll FLY |
08:37
🔗
|
Smiley |
arrith1: not sure. |
08:38
🔗
|
Smiley |
a month maybe? Need to ask SketchCow |
08:38
🔗
|
Smiley |
the actual grabbing of blogs is more important |
08:39
🔗
|
Smiley |
from my testing, we already have like 95% of the usernames, but as I don't know how they were collected, I can't be sure what I'm testing against is a "full" set |
08:39
🔗
|
Smiley |
so that percentage may drop in the future |
08:41
🔗
|
arrith1 |
Smiley: alright. wait so remove the sleep, and remove some wget line? |
08:50
🔗
|
arrith1 |
hm |
08:51
🔗
|
arrith1 |
Smiley: i'll assume you mean to run with & to do multiple concurrently |
08:53
🔗
|
Smiley |
yes remove sleep |
08:53
🔗
|
winr4r |
the 15th of july is the last day of xanga as we know it |
08:53
🔗
|
winr4r |
after that they either die, or go to a paid account model |
08:53
🔗
|
Smiley |
but if you do the "wget -v --directory-prefix=_$y -a wget.log "http://www.xanga.com/groups/subdir.aspx?id=$y&uni-72-pg=$x" &; |
08:54
🔗
|
Smiley |
that won't wait for each wget to finish before continuing the loop |
08:54
🔗
|
Smiley |
be warned, it'll fire up thousands |
08:54
🔗
|
Smiley |
so you might want to try with just ./get_xanga_users x x+1 |
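A middle ground between the slow sequential loop and backgrounding every page at once is to background the wgets but cap how many run at a time. A sketch, reusing the variables from the snippet above ($y is the group id, $x the page number, $max_pages assumed from the script's earlier "Max pages" lookup):

    max_jobs=16
    for x in $(seq 1 "$max_pages"); do
        wget -v --directory-prefix="_$y" -a wget.log \
            "http://www.xanga.com/groups/subdir.aspx?id=$y&uni-72-pg=$x" &
        # throttle: pause whenever $max_jobs fetches are already in flight
        while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
            sleep 0.2
        done
    done
    wait    # let the final batch finish before moving on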
08:54 <winr4r> Smiley: teaching people how to forkbomb themselves? :P
08:59 <Smiley> winr4r: it came with a warning
08:59 <arrith1> eh
08:59 <arrith1> Smiley: yeah i'd rather not do that much
09:00 <arrith1> Got value for group 90016; Max pages =
09:00 <arrith1> Grabbing page {1..}
09:00 <arrith1> grabbing pages 1 -
09:00 <Smiley> errr
09:00 <arrith1> that's the output i get btw, but seems to be working
09:00 <Smiley> I mean like 1 2
09:00 <Smiley> or 10 11
09:01 <Smiley> not actual x :P
09:01 <arrith1> Smiley: which line is this on?
09:02 <arrith1> oh
09:03 <arrith1> add & after that line
09:03 <arrith1> then run get_xanga_users with a really low number?
09:03 <Smiley> not low
09:03 <Smiley> the numbers are normally the range you're doing
09:03 <Smiley> so like from 30000 to 40000
09:04 <Smiley> but try it with like 30001 30002
09:04 <arrith1> erm, i think that'd spawn like 10,000
09:04 <Smiley> as it'll open as many connections as there is pages.
09:04 <arrith1> yeah
09:04 <Smiley> well biggest one I've seen is 2000
09:06 <Smiley> grabbing pages 1 - 2144
09:06 <Smiley> there are other ways of doing it....
09:06 <arrith1> well i'm doing 8, the ones i claimed. seems to be going about one per second or a little over.
09:06 <Smiley> sleeping for smaller amounts of time, passing wget a collection of a few urls per spawn, but it'll be a while before I can get around to looking into that
09:06 <Smiley> Got a party to plan and run
09:07 <Smiley> And I'm no coder.
09:07 <arrith1> spawning a few wgets would be good i think
09:07 <arrith1> i can help next month probably
09:07 <Smiley> you could do something like z=$(y)
09:08 <arrith1> by my calculations the 8 i'm doing should finish in around 3 hours
09:08 <Smiley> wget z, wget z+1, wget z+2, wget z+3; end loop, y+4; repeat
09:08 <Smiley> So grabbing 4 per loop run
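In bash, "grab four per loop run" comes out roughly as below — illustrative only, with $base_url and $max_pages standing in for whatever the real script builds (a real version would also stop exactly at max_pages instead of overshooting by up to three):

    y=1
    while [ "$y" -le "$max_pages" ]; do
        # fire off four page fetches, then block until all four have returned
        wget -q "$base_url&pg=$y"       &
        wget -q "$base_url&pg=$((y+1))" &
        wget -q "$base_url&pg=$((y+2))" &
        wget -q "$base_url&pg=$((y+3))" &
        wait
        y=$((y + 4))
    done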
09:08
🔗
|
* |
Smiley realises he appears to be thinking like a coder |
09:09
🔗
|
arrith1 |
yeah. there's also xargs and gnu parallel |
09:09
🔗
|
arrith1 |
echo urls | xargs -P 4 wget |
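That one-liner is shorthand; in practice the URLs would come from a file, one per invocation, with the parallelism set by -P. Something like the following, where urls.txt is a placeholder file with one URL per line:

    # up to four wget processes at a time, one URL each
    xargs -n 1 -P 4 wget -q < urls.txt

    # roughly equivalent with GNU parallel
    parallel -j 4 wget -q {} < urls.txt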
09:09 <Smiley> i'm not well versed in them yet.
09:09 <Smiley> I only really got the hang of awk yesterday :D
09:09 <arrith1> xargs is pretty neat, i'm gonna use it with wget warc this week
09:09 <arrith1> heh
09:12 <arrith1> well i'm all for concurrency
09:13 <arrith1> seems there's about 30 or so sets left, so max time it'll take is 30 items * 3 hours/item = 90 hours, or about 3.75 days. but with people running them at the same time that'll go way faster
09:13 <arrith1> i'd say at most a day or two. assuming there's no ratelimiting that comes up
09:14 <Smiley> i've seen none so far at my current speeds of 1 url per second-ish.
09:14 <Smiley> those sets near the end will take longer tho, lots of 404s
09:14 <arrith1> Smiley: i did remove that sleep, but it's not really going all that fast
09:15 <arrith1> which is fine, there's time i think
09:15 <arrith1> alright, gtg. bbl
09:16 <Smiley> o/
09:16 <Smiley> grabbing pages 1 - 14649
09:46 <Smiley> So much for 2000 being the highest ;D
10:01 <Schbirid> i think i once had a tr or sed line to make IA-compatible filenames, ring a bell for anyone? http://archive.org/about/faqs.php#216
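No one answered in-channel, but the usual shape of that one-liner is a substitution that squashes everything outside a safe ASCII set; something like the sketch below, with $filename as a placeholder and the allowed character set double-checked against that FAQ entry:

    # replace anything that isn't alphanumeric, dot, dash or underscore with "_"
    clean=$(printf '%s' "$filename" | sed 's/[^A-Za-z0-9._-]/_/g')

    # same idea with tr
    clean=$(printf '%s' "$filename" | tr -c 'A-Za-z0-9._-' '_')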
10:01
🔗
|
godane |
my dream is a alive: http://hardware.slashdot.org/story/13/06/21/0255241/new-technique-for-optical-storage-claims-1-petabyte-on-a-single-dvd/ |
10:17
🔗
|
godane |
also someone should grab this: http://www.guardian.co.uk/world/interactive/2013/jun/20/exhibit-b-nsa-procedures-document |
14:36
🔗
|
DFJustin |
godane: |
14:36
🔗
|
DFJustin |
wget http://s3.amazonaws.com/s3.documentcloud.org/documents/716633/pages/exhibit-a-p{1..9}-normal.gif |
14:36
🔗
|
DFJustin |
wget http://s3.amazonaws.com/s3.documentcloud.org/documents/716634/pages/exhibit-b-p{1..9}-normal.gif |
15:18
🔗
|
DFJustin |
or actually, replace normal with large |
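With that substitution applied, the two commands become:

    wget http://s3.amazonaws.com/s3.documentcloud.org/documents/716633/pages/exhibit-a-p{1..9}-large.gif
    wget http://s3.amazonaws.com/s3.documentcloud.org/documents/716634/pages/exhibit-b-p{1..9}-large.gif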
17:41 <winr4r> so like
17:41 <winr4r> with greader-directory-grab
17:41 <winr4r> is it grabbing the feeds themselves or just crawling the directory
17:42 <ivan`> it's just querying the directory
17:42 <ivan`> you can upload querylists to the OPML collector if you wish
17:43 <winr4r> oh gotcha
20:07 <arrith1> Smiley: hmm seems my estimates were a bit off. in my grab they're all around 3000
20:16 <arrith1> Smiley: i have approx 25k, so ~3.1k for each of the 8. so i guess i'm a little under a third done.
20:18 <arrith1> 11 hrs for 3.1k, means ~35.5 hours for 10k items
20:21 <arrith1> Smiley: so should be done in ~24 hours
20:33 <Smiley> arrith1: k
20:33 <Smiley> we have a new script too that someone else has written
20:33 <Smiley> you should join #jenga
20:56 <arrith1> Smiley: ah alright, just joined
21:06 <Smiley> hey
23:26 <joepie91> https://keenot.es/read/cause-and-infect-why-people-get-hacked