Time | Nickname | Message
03:42 | kyan | Once I have a WARC/megawarc, how best to extract outbound URLs from it?
03:43 | kyan | (and internal links)
03:52 | dashcloud | is there anything in the .cdx file of use for you?
03:52 | ivan` | use hanzo warc-tools to get the response bodies, then parse them with html5lib/lxml/beautifulsoup/whatever the pythonistas are using now
03:54 | ivan` | or if you want something super-terrible to extract <a href="blah where the full target is on the same line, you can use zgrep -o on the .warc.gz
03:55 | ivan` | (will not actually work across chunk boundaries, don't rely on it)
03:56 | kyan | (that would get resources & some javascript links, which would be a plus)
03:57 | ivan` | I don't know about it
03:59 | kyan | Thanks for the warc-tools pointer, that's definitely handy :)
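The pipeline ivan` describes, pulling response bodies out of the WARC and then parsing them, can be sketched with the Python stdlib alone; html5lib/lxml/BeautifulSoup are more forgiving on messy real-world markup. The HTML body below is a hypothetical stand-in for a WARC response payload:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect link targets: <a href=...> plus resource URLs (src attributes)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        for name, value in attrs:
            if value and name in ("href", "src"):
                self.links.append(value)

# Hypothetical HTML body, standing in for a decoded WARC response payload.
body = '<a href="http://example.com/out">x</a><img src="/logo.png">'
parser = LinkExtractor()
parser.feed(body)
print(parser.links)  # → ['http://example.com/out', '/logo.png']
```

Collecting `src` as well as `href` also picks up the resource links kyan mentions; the zgrep trick, by contrast, only catches targets that happen to sit entirely on one line.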
04:04 | dashcloud | there's a nice wiki page here: http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem on various tools for WARCs
04:14 | kyan | dashcloud, thanks, sweet :)
04:14 | kyan | this is very useful
09:21 | * | midas1 stabs magento in the face
09:25 | BlueMax | ...there's no magneto here
09:36 | midas1 | there is on my servers
09:36 | midas1 | and i prefer to stab it in the face, that way it sees who is stabbing it
11:02 | joepie91 | I'm thinking of improving my pastebin scraper so that A. it will live-crawl multiple pastebins and B. it will offer websockets/0mq streams of pastes as they are crawled
11:02 | joepie91 | like, a realtime feed of pastes
11:02 | joepie91 | shouldn't be too hard
11:03 | joepie91 | and I'd imagine you can do a lot of fun things with that :)
11:09 | Smiley | could be fun
11:34 | SketchCow | The Hilton did a $600 charge against my card. Not cool, Hilton
11:39 | joepie91 | SketchCow: :(
11:55 | SketchCow | Bad, bad Hilton
12:30 | midas1 | did you empty the minibar?
12:31 | midas1 | if not, do it anyway
12:41 | joepie91 | lol
12:41 | joepie91 | "I paid for it - now I'll make sure that I make use of it"
13:16 | midas1 | indeed joepie91
13:16 | midas1 | "fuck this, im getting drunk @ 600 dollar"
15:14 | GLaDOS | http://scr.terrywri.st/1385996531.png i have no idea what im doing
15:17 | midas1 | wot wot GLaDOS
15:18 | GLaDOS | the faces, oh god the faces
15:19 | BiggieJon | glad that was tiny, cuz I'm guessing it was NSFW :)
15:19 | GLaDOS | http://scr.terrywri.st/yolo.png here you go, full size at 10mb
15:20 | BiggieJon | they all look sooo happy :)
15:21 | midas1 | 10MB...wtf
15:21 | midas1 | MY LORD THE FACES!
15:21 | GLaDOS | and this is why you never let me down a pepsi when im sleep deprived
15:21 | GLaDOS | faceswap ALL the people.
15:22 | midas1 | i'm no person that believes in god, but hell, this is horrible, i would almost start to pray if i knew how
15:22 | GLaDOS | lets play this fun game called spot the original!
15:23 | midas1 | the girl 4th from the right? :P
15:23 | GLaDOS | nope
15:23 | midas1 | 3rd from the right, second row?
15:24 | midas1 | was it a girl? :P
15:24 | GLaDOS | nope, and yep
15:25 | midas1 | ah yes! i got her
15:25 | midas1 | it was the 4th from the left
15:25 | GLaDOS | nope.
15:25 | midas1 | hahaha
15:25 | midas1 | the one on the right
15:25 | midas1 | last one
15:25 | GLaDOS | yeah, its her
15:26 | midas1 | i missed her, im viewing this picture on a potato of a screen
15:26 | GLaDOS | ah, that'd be why
15:27 | midas1 | it's a 19" screen with a res of 1024x768
15:27 | midas1 | so yeah, i can see 2 faces when it's full size
15:29 | midas1 | thank god for 100mbit@home
15:29 | GLaDOS | and now, to fix the sleep deprivation, i sleep.
15:29 | GLaDOS | o7
15:31 | midas1 | good night!
17:01 | arkiver | if a website is blocked from being archived in the robots.txt
17:01 | arkiver | is it then still downloaded by the Wayback Machine, but not shown?
17:01 | arkiver | or is it not downloaded at all
17:04 | balrog | it's not downloaded at all while it is blocked via robots.txt
17:05 | balrog | old versions are retained but not shown
17:06 | arkiver | hmm oke
17:06 | arkiver | I'm going to search for robots.txt blocked pages as well then
17:06 | arkiver | for archival for the IA
17:07 | arkiver | oke = ok*
18:03 | ivan` | arkiver: it would be interesting to grab robots.txt for every domain on the 'net and search for those that block IA or block all unknown bots
18:05 | DFJustin | someone here was downloading all robots.txt
18:07 | balrog | IA *does* grab robots.txt
18:11 | Schbirid | <@DFJustin> someone here was downloading all robots.txt
18:11 | Schbirid | you rang
18:12 | Schbirid | only top 10000 alexa sites
18:23 | DFJustin | can you easily filter for ones that block ia_archiver or *
18:25 | Schbirid | sure, let's see
18:25 | Schbirid | err, well, i cant
18:25 | Schbirid | only grep
18:25 | Schbirid | i never found a good parser so i never did anything with them
18:30 | Schbirid | i am running: grep -ER -A 1 "(ia_archiver|User-agent: \*)" *
18:31 | Schbirid | for "some" hits
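Short of the good parser Schbirid never found, the stdlib's `urllib.robotparser` can answer the actual question: does a given robots.txt block ia_archiver, either by name or via a `User-agent: *` rule? A sketch over saved robots.txt bodies (the sample rules are made up):

```python
from urllib import robotparser

def blocks_ia(robots_txt: str) -> bool:
    """True if these robots.txt rules forbid ia_archiver from fetching the site root."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return not rp.can_fetch("ia_archiver", "/")

# Hypothetical saved robots.txt bodies.
print(blocks_ia("User-agent: *\nDisallow: /"))            # → True (blocks all unknown bots)
print(blocks_ia("User-agent: ia_archiver\nDisallow: /"))  # → True (blocks IA by name)
print(blocks_ia("User-agent: *\nDisallow: /private/"))    # → False (root still allowed)
```

Unlike `grep -A 1`, this respects grouping of `User-agent` blocks and multiple `Disallow` lines per block.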
18:34 | Schbirid | https://pastee.org/88nu6
18:45 | DFJustin | 95 hyves.nl
20:01 | ivan` | I'm submitting a few of those to archivebot
20:02 | arkiver | so
20:02 | ivan` | if you see anything remotely interesting please do the same
20:02 | arkiver | do we have a list of blocked websites?
20:02 | arkiver | I got a few terabytes free here
20:02 | arkiver | so I can still download quite some websites
20:02 | arkiver | and then upload them
20:02 | arkiver | :)
20:02 | ivan` | not all of https://pastee.org/88nu6 are blocked but there's a lot
20:03 | ivan` | arkiver: do you have upstream?
20:03 | arkiver | ?
20:03 | arkiver | nope, what is it, upstream?
20:03 | ivan` | how fast can you upload?
20:03 | arkiver | well
20:03 | arkiver | let's see
20:03 | arkiver | download speed: 7 - 8 Megabyte per second
20:04 | arkiver | upload speed: 700 - 800 Kilobyte per second
20:04 | arkiver | so I think that should be ok
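Rough arithmetic on that "ok": at the quoted 700-800 KB/s upstream, even a single terabyte takes about two weeks to push, so "a few terabytes" is a months-long upload.

```python
# Rough transfer-time estimate at the quoted upstream rate.
upstream_bps = 750 * 1000     # ~750 KB/s, midpoint of 700-800 KB/s
terabyte = 10**12             # 1 TB in bytes (decimal)
seconds = terabyte / upstream_bps
days = seconds / 86400        # 86400 seconds per day
print(round(days, 1))         # → 15.4
```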
20:04 | arkiver | buuut
20:04 | arkiver | what's upstream?
20:05 | ivan` | "In computer networking, upstream refers to the direction in which data can be transferred from the client to the server (uploading)."
20:05 | arkiver | ah nope
20:05 | ersi | upstream == upload
20:06 | ivan` | how much RAM do you have? wondering if you could run an archivebot pipeline to do archivebot jobs
20:06 | arkiver | I am downloading everything with heritrix 3.1.1, then uploading it to the archive and then sending an email to jason to move the files to the wayback machine
20:06 | arkiver | my ram?
20:06 | arkiver | 4 gb right now
20:06 | arkiver | but
20:06 | arkiver | soon I'm going to buy a new computer, which will have around 16GB SDRAM
20:06 | arkiver | (like 70% sure I'm going to buy it)
20:06 | arkiver | also
20:07 | ivan` | you know you want 32GB ;)
20:07 | arkiver | I don't have my computer on 24/7
20:07 | ivan` | heh
20:07 | arkiver | you got 32??
20:07 | arkiver | O.o
20:07 | ivan` | I have 96GB in a box but my upstream is 160KB/s
20:07 | arkiver | ah
20:07 | * | ersi pokes the VM host machine with 256GB RAM
20:07 | arkiver | my ram is lower but upstream faster
20:07 | arkiver | :P
20:07 | arkiver | -.-
20:07 | arkiver | ok ok ok
20:07 | ivan` | why do you turn off your computer?
20:08 | ersi | I turn off most of my shit as well
20:08 | arkiver | I now know that my ram isn't high guys... -.-
20:08 | arkiver | yep
20:08 | arkiver | at night
20:08 | arkiver | it's in my room
20:08 | ersi | that machine isn't mine
20:08 | arkiver | and making noise
20:08 | ersi | it's a machine at work
20:08 | arkiver | and it's irritating then...
20:08 | arkiver | yeah
20:08 | arkiver | as ersi says
20:08 | ersi | My laptop got 8GB and my workstation got 4GB
20:08 | ivan` | do you have a closet? perfect place for a computer
20:08 | arkiver | -.-
20:08 | ersi | I do have a closet "server" machine though ^_^
20:09 | arkiver | not gonna place my pc in there
20:09 | arkiver | in my closet...
20:09 | arkiver | so oke
20:09 | arkiver | I'm going through that list right now and looking at the robots.txt
20:09 | arkiver | and then selecting the websites to download
20:10 | ivan` | closet blocks like 30dB
20:10 | arkiver | yeah well
20:10 | arkiver | nah
20:10 | arkiver | I'm happy like this
20:10 | arkiver | maybe some other time
20:10 | arkiver | so
20:10 | arkiver | which sites from the list are already downloaded?
20:10 | arkiver | or downloading
20:13 | arkiver | http://www.insideview.com/robots.txt
20:14 | arkiver | # bad crawlers
20:14 | arkiver | Disallow: /
20:14 | arkiver | User-agent: *
20:14 | arkiver | "bad crawlers" O.o :'(
20:14 | Schbirid | :D
20:15 | Schbirid | is there any value in keeping the daily 1m top sites zip from alexa? i want to clean up
20:15 | arkiver | I don't know
20:15 | arkiver | but did you create that list of websites?
20:15 | ersi | Schbirid: How large is the data?
20:15 | ersi | arkiver: alexa.com provides a list of 1m top sites
20:16 | arkiver | yes
20:16 | arkiver | but can we automatically check the robots.txt?
20:16 | arkiver | also
20:16 | arkiver | this site is also blocked:
20:16 | arkiver | http://svs.gsfc.nasa.gov/
20:16 | ersi | "Free download" from http://www.alexa.com/topsites -> http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
20:16 | ersi | Well, sure..
20:16 | arkiver | many GBs of great visualisation videos
20:16 | arkiver | just blocked... :(
20:17 | arkiver | if gone, everything is gone
20:17 | Schbirid | y
20:17 | Schbirid | ~10M per day
20:17 | Schbirid | i got 8G here
20:18 | ersi | ~10MB/day?
20:18 | arkiver | 8GB of robots.txt's?
20:18 | Schbirid | nah, 8G of the 1m file
20:19 | Schbirid | 3G of robots files :D
20:19 | arkiver | ah
20:19 | Schbirid | 365 7z files in one item sound idiotic or ok? i want to dump them to IA
20:19 | arkiver | and have they already been checked to see if IA is blocked?
20:19 | Schbirid | no
20:20 | Schbirid | dumb daily downloading
20:21 | Schbirid | https://github.com/ArchiveTeam/robots-relapse is some version, not sure what exactly that one does
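The daily snapshot being hoarded here is a zip containing a single CSV of `rank,domain` rows; the parse step of a robots-relapse-style loop might look like the sketch below. The zip is built in memory for illustration, standing in for a download of the top-1m.csv.zip linked above:

```python
import csv
import io
import zipfile

def top_domains(zip_bytes: bytes, limit: int = 10):
    """Yield domains from an Alexa-style top-1m.csv.zip (rows of rank,domain)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # The archive holds a single CSV member
        with zf.open(zf.namelist()[0]) as f:
            reader = csv.reader(io.TextIOWrapper(f, encoding="utf-8"))
            for rank, domain in reader:
                if int(rank) > limit:
                    break
                yield domain

# Build a tiny stand-in for the real top-1m.csv.zip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("top-1m.csv", "1,example.com\n2,example.org\n")
print(list(top_domains(buf.getvalue())))  # → ['example.com', 'example.org']
```

Each yielded domain could then be fed to a robots.txt fetch and a blocked-or-not check.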
20:22 | arkiver | already downloaded 2.5 GB of www.webmonkey.com/
20:22 | arkiver | hmm
20:22 | arkiver | maybe it would be helpful to create a list of websites people from the Archiveteam are currently downloading at home?
20:22 | arkiver | maybe in the wiki?
20:22 | arkiver | and that we regularly update it?
20:26 | ivan` | better to just get everything to IA
20:27 | arkiver | I mean that we'd have a more organised list of what is done and what still needs to be done?
20:28 | arkiver | and that we take some websites and put our name behind them if we are working on them
20:28 | arkiver | know what I mean
20:28 | arkiver | ?
20:29 | ivan` | once you're grabbing many domains per day, I don't think you'll be motivated to keep it in sync
20:30 | arkiver | hmm oke then
20:30 | arkiver | but some domains will be grabbed twice or more maybe...
20:30 | arkiver | well
20:30 | ersi | so? :)
20:30 | arkiver | :P
20:30 | ersi | disk is cheap
20:30 | arkiver | and everything is going to IA
20:30 | arkiver | so not on our disk
20:46 | w0rp | Redundancy is good for archiving.
21:11 | arkiver | is anyone else here using heritrix?
21:11 | arkiver | I'm having a problem right now... :(
21:17 | ersi | I always recommend the following: Write about the problem instead of asking to ask about asking to ask
21:17 | ersi | I'm not running heritrix. What's the problem?
21:18 | godane | i think my bluray player may hate me
21:18 | godane | *bluray burner
21:18 | arkiver | I keep getting this error:
21:18 | arkiver | 2013-12-02T21:15:01.002Z SEVERE Failed to start bean 'bdb'; nested exception is java.lang.UnsatisfiedLinkError: Error looking up function 'link': Kan opgegeven procedure niet vinden.
21:18 | godane | i set it to burn at speed 4x
21:18 | arkiver | when I try to start a job from the checkpoint
21:18 | godane | and now its trying to burn at 10x
21:19 | ersi | What does the Dutch error message mean?
21:20 | arkiver | Can't find given procedure
21:20 | ersi | Hm. Has it worked before?
21:20 | arkiver | well
21:20 | arkiver | It suddenly worked one time
21:20 | arkiver | but then not
21:21 | arkiver | and before that time also not
21:21 | arkiver | if I try it once I get the error above, and if I then try a second time it gives me this error:
21:21 | arkiver | 2013-12-02T21:20:48.918Z SEVERE Failed to start bean 'bdb'; nested exception is java.lang.IllegalStateException: com.sleepycat.je.EnvironmentFailureException: (JE 4.1.6) Environment must be closed, caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 4.1.6) K:\Internet-Archive\heritrix-3.1.1\bin\.\jobs\test\state fetchTarget of 0x0/0xbf parent IN=2 IN class=com.sleepycat.je.tree.BIN lastFullVersion=0x1/0x540 parent.getDirty()=false state=0 LOG_FILE_NOT_FOUND: Log file missing, log is likely invalid. Environment is invalid and must be closed. (in thread 'test launchthread')
21:21 | arkiver | and yeah
21:21 | arkiver | the problem is that a log file is missing
21:21 | arkiver | so I checked it
21:21 | arkiver | it is missing a 00000000.jdb file
21:21 | arkiver | now
21:22 | arkiver | I opened that folder, and right before I click to start the job again 00000000.jdb is still there
21:22 | arkiver | but right after I click, 00000000.jdb disappears and then I get the error
21:22 | arkiver | as if it is first deleting it and then trying to open it...
21:23 | arkiver | instead of first opening it and then deleting it
21:53 | godane | so it looks like this disc is doing better than the last one
21:54 | godane | not saying everything is ok yet
21:55 | godane | the video is still like last time
21:55 | godane | but the filesystem can be viewed
21:56 | godane | and the video does play
21:56 | godane | just fastforwarding is slower than normal
22:01 | yipdw | oh neat, DigitalOcean has a second datacenter in Amsterdam
22:18 | godane | so good news
22:18 | godane | turns out i mistyped my burning script
22:18 | ersi | http://fortvv2.capitex.se/beskrivning.aspx?guid=46PQ44OP65VBJM2B&typ=CMFastighet
22:19 | ersi | want
22:19 | ersi | so bad
22:19 | godane | it had --speed=4 instead of -speed=4
22:19 | ersi | (Old Swedish Military fortification/base with tunnels and everything)
23:17 | godane | SketchCow: at some point i will be uploading all the pdfs i got from ftp.qmags.com to my godaneinbox
23:17 | godane | that way we can make tons of collections for them
23:19 | godane | there are like magazines about clean rooms
23:19 | godane | for making computer chips
23:20 | godane | this doesn't take off the table anyone wanting to put a full ftp tar of it up
23:33 | godane | i copied an html file of the ftp root index and made a list of pdf files to grab
23:33 | godane | this way i don't have to download every exe and sea.hqx file