Time |
Nickname |
Message |
00:01
|
godane |
shaqfu: I'm getting the ?page now |
00:02
|
godane |
i only need --post-data instead of --post-data --user=blah --password=blah |
00:02
|
godane |
otherwise i will get ?page=2.html?user=blah.html or something |
00:43
|
shaqfu |
Ah, clever |
00:51
|
instence |
shaqfu, if you have a gun shoot me in the brain |
00:51
|
instence |
or give me temporary amnesia |
00:53
|
instence |
i just wish during archiving there was a way to de-stress the brain somehow so you could start fresh |
00:53
|
instence |
i guess that is what naps are for |
00:53
|
instence |
but time is always of the essence so its like *fuck* |
00:59
|
Coderjoe |
woo |
00:59
|
Coderjoe |
infocube 2.0 is now at 221% |
01:00
|
balrog_ |
wow. |
03:10
|
godane |
Coderjoe: i thought we was doing in -bs |
03:11
|
godane |
*talking |
03:13
|
godane |
looks like starfinder is in avgeeks |
03:16
|
godane |
looks like a ton of nasa videos were saved by avgeeks too |
03:16
|
Coderjoe |
i don't need a running tally of what is there |
04:39
|
godane |
just found something funny |
04:40
|
godane |
a torrent from kat.ph was removed at the request of the copyright owner |
04:42
|
shaqfu |
Which? |
04:43
|
godane |
http://kat.ph/keri-hilson-pretty-girl-rock-2010-single-sw-t4672360.html |
12:58
|
Schbirid |
hm, "q2l\#354ft.map": Invalid or incomplete multibyte or wide character. would that be an ascii ì ? |
12:58
|
Schbirid |
any idea how i can find out? |
12:58
|
Schbirid |
my fs are utf8 but no idea what the source was |
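One possible way to approach the question above, as a minimal Python sketch. It assumes the `\#354` in the error message is an octal escape for a single raw byte in the file name; the list of candidate encodings is an illustrative guess and does not come from the log.

```python
# Assumption: "\#354" is an octal escape for one raw byte in the file name.
raw = bytes([0o354])  # 0o354 == 0xEC

def decode_candidates(data, encodings=("utf-8", "latin-1", "cp1252", "cp437")):
    """Return {encoding: decoded text, or None if the bytes are invalid there}."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results

candidates = decode_candidates(raw)
for enc, text in candidates.items():
    print(enc, repr(text))
```

A lone 0xEC byte is not valid UTF-8, which would be consistent with the error on a utf8 filesystem, while in latin-1 it decodes to ì. Which encoding the source really used still has to be judged by eye from which candidate gives a plausible name.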
13:21
|
ersi |
http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/ [HN discussion: http://news.ycombinator.com/item?id=4367933] |
13:35
|
Schbirid |
On December 19, 2008, BusinessWeek listed Cuil as one of the most successful U.S. startups of 2008 |
13:35
|
Schbirid |
, based on the amount of money they raised. |
13:36
|
godane |
my kat.ph-community is still going |
13:39
|
winr4r |
Schbirid: lol, cuil |
13:49
|
Schbirid |
wicked, i mounted that forumplanet bz2 again and now cpu usage is no problem. i wonder what went wrong the other time |
13:49
|
Schbirid |
s |
13:49
|
Schbirid |
this rock |
14:01
|
ersi |
I've encountered Common Crawl before, but the Everything-Amazon-tech-and-Cloud stuff scares me away |
14:15
|
alard |
Can't you just download the data and use it somewhere else? |
14:17
|
ersi |
yeah, but you need an Amazon account and pay for the download etc |
14:17
|
ersi |
I mean, sure - that's fair. But it makes me reluctant to take a look at it |
14:20
|
alard |
https://aws-publicdatasets.s3.amazonaws.com/?prefix=common-crawl/crawl-002 |
14:21
|
alard |
I think you can download everything for free, no account needed. |
14:22
|
alard |
https://s3.amazonaws.com/aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz |
14:23
|
ersi |
oh, cool |
15:18
|
godane |
almost 13000 forum posts from kat.ph/community have been downloaded |
15:32
|
godane |
i'm getting a lot of 404s in my kat.ph/community dump |
15:33
|
godane |
there is also stuff like this that needs to be backed up: http://kat.ph/blog/TheBatman/ |
15:35
|
godane |
i just have no idea how other than scanning my newer dump with http://kat.ph/user/[[:alnum:]]* or something to get user name urls |
15:36
|
godane |
then change the user part to blog and start grabbing |
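The plan in the lines above could be sketched roughly like this in Python. It assumes user names are purely alphanumeric (as the `[[:alnum:]]` pattern suggests) and that blog URLs follow the `http://kat.ph/blog/<name>/` shape seen in the log; the sample HTML and the `blog_urls` helper are hypothetical, for illustration only.

```python
import re

# Assumption: user names are alphanumeric, as in the [[:alnum:]] pattern above.
USER_URL = re.compile(r"http://kat\.ph/user/([A-Za-z0-9]+)")

def blog_urls(text):
    """Collect unique user names from text and map each to a blog URL."""
    names = sorted(set(USER_URL.findall(text)))
    return [f"http://kat.ph/blog/{name}/" for name in names]

# Hypothetical snippet standing in for the contents of the dump.
sample = (
    '<a href="http://kat.ph/user/TheBatman/">x</a> '
    '<a href="http://kat.ph/user/Nemesis43/">y</a> '
    '<a href="http://kat.ph/user/TheBatman/">z</a>'
)
urls = blog_urls(sample)
print("\n".join(urls))
```

The resulting list could be written to a file, one URL per line, and handed to whatever downloader is doing the grab.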
15:36
|
godane |
i also have to look at images from all urls in this dump |
16:01
|
godane |
blog posts like this need to be saved for them: http://kat.ph/blog/Nemesis43/post/5200/ |
17:11
|
godane |
just updated my linux journal collection |
17:12
|
ersi |
linux journal collection? |
17:12
|
godane |
you can get some here: http://www.missoulapubliclibrary.org/online-resources/317-linux |
17:12
|
godane |
whats funny is that its a library |
17:13
|
ersi |
ah |
17:14
|
godane |
also here: www.iar.unlp.edu.ar/biblio/htdocs/artic/bajad/linuxj/linuxj.htm |
17:15
|
godane |
the library has some pdfs that are indexed |
17:15
|
godane |
so i grab those indexed ones too |
18:59
|
arkhive |
I'm picking up 'hundreds' of 5.25" floppies Monday. Will be dumping like crazy. |
19:03
|
winr4r |
arkhive: excellent |
19:04
|
balrog_ |
arkhive: what sort of floppies? |
19:19
|
winr4r |
good evening, btw |
19:28
|
arkhive |
Not sure yet. |
19:28
|
arkhive |
:) |
19:28
|
arkhive |
evenin' |
19:33
|
godane |
hey winr4r |
19:33
|
winr4r |
:) |
19:33
|
winr4r |
been busy, godane? |
19:34
|
godane |
my kat.ph/community still is |
19:34
|
godane |
thanks to alard i will be able to grab all images off of kat.ph/community dump |
19:35
|
godane |
still pulling new images from it |
19:37
|
godane |
so doing sort and uniq works, not just uniq |
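The likely reason, shown as a small Python sketch: uniq only collapses adjacent duplicate lines, so unsorted input can keep repeats. The `uniq` helper and the file names below are hypothetical stand-ins for the real URL list.

```python
from itertools import groupby

def uniq(lines):
    """Mimic uniq: collapse only runs of adjacent identical lines."""
    return [key for key, _ in groupby(lines)]

# Hypothetical image names, with a duplicate that is not adjacent.
lines = ["a.jpg", "b.jpg", "a.jpg"]

only_uniq = uniq(lines)            # the duplicate survives
sorted_uniq = uniq(sorted(lines))  # sorting first makes duplicates adjacent

print(only_uniq)
print(sorted_uniq)
```

Sorting first puts identical lines next to each other, so the second pass removes every repeat.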
20:07
|
godane |
its in a url loop |
20:09
|
godane |
i think i got most of it anyway |
20:10
|
godane |
i should have blocked ?p_id paths |
20:11
|
godane |
and blocked 26799 post |
20:42
|
godane |
getting a ton of user pictures now |
20:44
|
godane |
there are 5000+ user pics |
20:44
|
godane |
from kastatic.com/i2/u/# path |
20:45
|
godane |
then there is kastatic.com/i2/userpics/# |
20:55
|
godane |
the kastatic.com image dump is very big |
20:55
|
godane |
and i have not got to kastatic.com/i2/userpics/ |
20:55
|
godane |
yet |
21:05
|
godane |
my eyes |
21:05
|
godane |
a fat guy took picture of himself naked |
21:06
|
godane |
that is what is in the data dump |
23:47
|
godane |
i'm downloading 8-bit theatre |