#archiveteam-bs 2013-12-02,Mon


Time Nickname Message
03:42 🔗 kyan Once I have a WARC/megawarc, how best to extract outbound URLs from it??
03:43 🔗 kyan (and internal links)
03:52 🔗 dashcloud is there anything in the .cdx file of use for you?
03:52 🔗 ivan` use hanzo warc-tools to get the response bodies, then parse them with html5lib/lxml/beautifulsoup/whatever the pythonistas are using now
03:54 🔗 ivan` or if you want something super-terrible to extract <a href="blah where the full target is on the same line, you can use zgrep -o on the .warc.gz
03:55 🔗 ivan` (will not actually work across chunk boundaries, don't rely on it)
03:56 🔗 kyan (that would get resources & some javascript links, which would be a plus)
03:57 🔗 ivan` I don't know about it
03:59 🔗 kyan Thanks for the warc-tools pointer, that's definitely handy :)
04:04 🔗 dashcloud there's a nice wiki page here: http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem on various tools for WARCs
04:14 🔗 kyan dashcloud, thanks, sweet :)
04:14 🔗 kyan this is very useful
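(Once warc-tools, or any WARC reader, has handed you the response bodies, pulling out link targets is a few lines of stdlib. A minimal sketch using Python's built-in html.parser; html5lib/lxml as mentioned above would be more robust against broken markup. The tag/attribute coverage here is illustrative, not exhaustive.)

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href/src targets from start tags (a, link, img, script, ...)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # grabs both outbound/internal <a href> links and resource URLs
            if name in ('href', 'src') and value:
                self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```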
09:21 🔗 * midas1 stabs magento in the face
09:25 🔗 BlueMax ...there's no magneto here
09:36 🔗 midas1 there is on my servers
09:36 🔗 midas1 and i prefer to stab it in the face, that way it sees who is stabbing it
11:02 🔗 joepie91 I'm thinking of improving my pastebin scraper so that A. it will live-crawl multiple pastebins and B. it will offer websockets/0mq streams of pastes as they are crawled
11:02 🔗 joepie91 like, a realtime feed of pastes
11:02 🔗 joepie91 shouldn't be too hard
11:03 🔗 joepie91 and I'd imagine you can do a lot of fun things with that :)
11:09 🔗 Smiley could be fun
11:34 🔗 SketchCow The Hilton did a $600 charge against my card. Not cool, Hilton
11:39 🔗 joepie91 SketchCow: :(
11:55 🔗 SketchCow Bad, bad Hilton
12:30 🔗 midas1 did you empty the minibar?
12:31 🔗 midas1 if not, do it anyway
12:41 🔗 joepie91 lol
12:41 🔗 joepie91 "I paid for it - now I'll make sure that I make use of it"
13:16 🔗 midas1 indeed joepie91
13:16 🔗 midas1 "fuck this, im getting drunk @ 600 dollar"
15:14 🔗 GLaDOS http://scr.terrywri.st/1385996531.png i have no idea what im doing
15:17 🔗 midas1 wot wot GLaDOS
15:18 🔗 GLaDOS the faces, oh god the faces
15:19 🔗 BiggieJon glad that was tiny, cuz I'm guessing it was NSFW :)
15:19 🔗 GLaDOS http://scr.terrywri.st/yolo.png here you go, full size at 10mb
15:20 🔗 BiggieJon they all look sooo happy :)
15:21 🔗 midas1 10MB...wtf
15:21 🔗 midas1 MY LORD THE FACES!
15:21 🔗 GLaDOS and this is why you never let me down a pepsi when im sleep deprived
15:21 🔗 GLaDOS faceswap ALL the people.
15:22 🔗 midas1 i'm no person that believes in god, but hell, this is horrible, i would almost start to pray if i knew how
15:22 🔗 GLaDOS lets play this fun game called spot the original!
15:23 🔗 midas1 the girl 4th of the right? :P
15:23 🔗 GLaDOS nope
15:23 🔗 midas1 3rd of the right second row?
15:24 🔗 midas1 was it a girl? :P
15:24 🔗 GLaDOS nope, and yep
15:25 🔗 midas1 ah yes! i got her
15:25 🔗 midas1 it was the 4th from the left
15:25 🔗 GLaDOS nope.
15:25 🔗 midas1 hahaha
15:25 🔗 midas1 the one on the right
15:25 🔗 midas1 last one
15:25 🔗 GLaDOS yeah, its her
15:26 🔗 midas1 i missed her, im watching this picture on a potato as screen
15:26 🔗 GLaDOS ah, that'd be why
15:27 🔗 midas1 it's a 19" screen with a res of 1024x768
15:27 🔗 midas1 so yeah, i can see 2 faces when it's full size
15:29 🔗 midas1 thank god for 100mbit@home
15:29 🔗 GLaDOS and now, to fix the sleep deprivation, i sleep.
15:29 🔗 GLaDOS o7
15:31 🔗 midas1 good night!
17:01 🔗 arkiver if a website is blocked from being archived in the robots.txt
17:01 🔗 arkiver is it then still downloaded by the Wayback machine, but not shown?
17:01 🔗 arkiver or is it not downloaded at all
17:04 🔗 balrog it's not downloaded at all while it is blocked via robots.txt
17:05 🔗 balrog old versions are retained but not shown
17:06 🔗 arkiver hmm oke
17:06 🔗 arkiver I'm going to search for robots.txt blocked pages as well then
17:07 🔗 arkiver for archival for the IA
17:07 🔗 arkiver oke = ok*
18:03 🔗 ivan` arkiver: it would be interesting to grab robots.txt for every domain on the 'net and search for those that block IA or block all unknown bots
18:05 🔗 DFJustin someone here was downloading all robots.txt
18:07 🔗 balrog IA *does* grab robots.txt
18:11 🔗 Schbirid <@DFJustin> someone here was downloading all robots.txt
18:11 🔗 Schbirid you rang
18:12 🔗 Schbirid only top 10000 alexa sites
18:23 🔗 DFJustin can you easily filter for ones that block ia_archiver or *
18:25 🔗 Schbirid sure, let's see
18:25 🔗 Schbirid err, well, i cant
18:25 🔗 Schbirid only grep
18:25 🔗 Schbirid i never found a good parser so i never did anything with them
18:30 🔗 Schbirid i am running: grep -ER -A 1 "(ia_archiver|User-agent: \*)" *
18:31 🔗 Schbirid for "some" hits
18:34 🔗 Schbirid https://pastee.org/88nu6
18:45 🔗 DFJustin 95 hyves.nl
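(Re: "i never found a good parser" -- Python's stdlib actually ships a workable one in urllib.robotparser. A sketch of checking whether a saved robots.txt blocks ia_archiver from the site root; the example URL is just a placeholder.)

```python
from urllib.robotparser import RobotFileParser

def blocks_ia(robots_txt, site='http://example.com/'):
    """True if this robots.txt denies ia_archiver access to the site root."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return not rp.can_fetch('ia_archiver', site)
```

Unknown bots fall through to the `User-agent: *` group, so this also catches sites that block all crawlers they don't recognize.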
20:01 🔗 ivan` I'm submitting a few of those to archivebot
20:02 🔗 arkiver so
20:02 🔗 ivan` if you see anything remotely interesting please do the same
20:02 🔗 arkiver do we have a list of blocked websites?
20:02 🔗 arkiver I got a few terabytes free here
20:02 🔗 arkiver so I can still download quite some websites
20:02 🔗 arkiver and then upload them
20:02 🔗 arkiver :)
20:02 🔗 ivan` not all of https://pastee.org/88nu6 are blocked but there's a lot
20:02 🔗 ivan` arkiver: do you have upstream?
20:03 🔗 arkiver ?
20:03 🔗 arkiver nope, what is upstream?
20:03 🔗 ivan` how fast can you upload?
20:03 🔗 arkiver well
20:03 🔗 arkiver let's see
20:03 🔗 arkiver download speed: 7 - 8 Megabyte per second
20:04 🔗 arkiver upload speed: 700 - 800 Kilobyte per second
20:04 🔗 arkiver so I think that should be ok
20:04 🔗 arkiver buuut
20:04 🔗 arkiver what's upstream?
20:05 🔗 ivan` "In computer networking, upstream refers to the direction in which data can be transferred from the client to the server (uploading)."
20:05 🔗 arkiver ah nope
20:05 🔗 ersi upstream == upload
20:05 🔗 ivan` how much RAM do you have? wondering if you could run an archivebot pipeline to do archivebot jobs
20:06 🔗 arkiver I am downloading everything with heritrix 3.1.1, then uploading it to the archive and then sending an email to jason to move the files to the wayback machine
20:06 🔗 arkiver my ram?
20:06 🔗 arkiver 4 gb right now
20:06 🔗 arkiver but
20:06 🔗 arkiver soon I'm going to buy a new computer, which will have around 16GB SDRAM
20:06 🔗 arkiver (like 70% sure I'm going to buy it)
20:06 🔗 arkiver also
20:06 🔗 ivan` you know you want 32GB ;)
20:07 🔗 arkiver I don't have my computer on 24/7
20:07 🔗 ivan` heh
20:07 🔗 arkiver you got 32??
20:07 🔗 arkiver O.o
20:07 🔗 ivan` I have 96GB in a box but my upstream is 160KB/s
20:07 🔗 arkiver ah
20:07 🔗 * ersi pokes the VM host machine with 256GB RAM
20:07 🔗 arkiver my ram is lower but upstream faster
20:07 🔗 arkiver :P
20:07 🔗 arkiver -.-
20:07 🔗 arkiver ok ok ok
20:07 🔗 ivan` why do you turn off your computer?
20:07 🔗 ersi I turn off most of my shit as well
20:08 🔗 arkiver I now know that my ram isn't high guys... -.-
20:08 🔗 arkiver yep
20:08 🔗 arkiver at night
20:08 🔗 arkiver it's in my room
20:08 🔗 ersi that machine isn't mine
20:08 🔗 arkiver and making noise
20:08 🔗 ersi it's a machine at work
20:08 🔗 arkiver and it's irritating then...
20:08 🔗 arkiver yeah
20:08 🔗 arkiver as ersi says
20:08 🔗 ersi My laptop got 8GB and my workstation got 4GB
20:08 🔗 ivan` do you have a closet? perfect place for a computer
20:08 🔗 arkiver -.-
20:08 🔗 ersi I do have a closet "server" machine though ^_^
20:08 🔗 arkiver not gonna place my pc in there
20:09 🔗 arkiver in my closet...
20:09 🔗 arkiver so oke
20:09 🔗 arkiver I'm going through that list right now and looking at the robots.txt
20:09 🔗 arkiver and then selecting the websites to download
20:09 🔗 ivan` closet blocks like 30dB
20:10 🔗 arkiver yeah well
20:10 🔗 arkiver nah
20:10 🔗 arkiver I'm happy like this
20:10 🔗 arkiver maybe some other time
20:10 🔗 arkiver so
20:10 🔗 arkiver which sites from the list are already downloaded?
20:10 🔗 arkiver or downloading
20:13 🔗 arkiver http://www.insideview.com/robots.txt
20:14 🔗 arkiver # bad crawlers
20:14 🔗 arkiver Disallow: /
20:14 🔗 arkiver User-agent: *
20:14 🔗 arkiver "bad crawlers" O.o :'(
20:14 🔗 Schbirid :D
20:15 🔗 Schbirid is there any value in keeping the daily 1m top sites zip from alexa? i want to clean up
20:15 🔗 arkiver I don't know
20:15 🔗 arkiver but did you create that list of websites?
20:15 🔗 ersi Schbirid: How large is the data?
20:15 🔗 ersi arkiver: alexa.com provides a list of 1m top sites
20:16 🔗 arkiver yes
20:16 🔗 arkiver but can we automatically check the robots.txt?
20:16 🔗 arkiver also
20:16 🔗 arkiver this site is also blocked:
20:16 🔗 arkiver http://svs.gsfc.nasa.gov/
20:16 🔗 ersi "Free download" from http://www.alexa.com/topsites -> http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
20:16 🔗 ersi Well, sure..
20:16 🔗 arkiver many GBs of great visualisation videos
20:16 🔗 arkiver just blocked... :(
20:16 🔗 arkiver if gone everything is gone
20:17 🔗 Schbirid y
20:17 🔗 Schbirid ~10M per day
20:17 🔗 Schbirid i got 8G here
20:17 🔗 ersi ~10MB/day?
20:18 🔗 arkiver 8GB of robots.txt's?
20:18 🔗 Schbirid nah, 8G of the 1m file
20:18 🔗 Schbirid 3G of robots files :D
20:19 🔗 arkiver ah
20:19 🔗 Schbirid 365 7z files in one item sound idiotic or ok? i want to dump them to IA
20:19 🔗 arkiver and are they already looked at for if IA is blocked?
20:19 🔗 Schbirid no
20:19 🔗 Schbirid dumb daily downloading
20:20 🔗 Schbirid https://github.com/ArchiveTeam/robots-relapse is some version, not sure what exactly that one does
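(A sketch of the "dumb daily downloading" approach robots-relapse takes: fetch each domain's robots.txt and file it under a dated directory, one snapshot per day. The output directory name and domain list are made up for illustration; unreachable hosts are simply skipped.)

```python
import os
import datetime
import urllib.request

def robots_url(domain):
    return 'http://%s/robots.txt' % domain

def snapshot(domains, outdir='robots-snapshots'):
    """Save today's robots.txt for each domain under outdir/YYYY-MM-DD/."""
    dest = os.path.join(outdir, datetime.date.today().isoformat())
    os.makedirs(dest, exist_ok=True)
    for domain in domains:
        try:
            with urllib.request.urlopen(robots_url(domain), timeout=10) as r:
                data = r.read()
        except OSError:
            continue  # dead or blocking hosts: skip, try again tomorrow
        with open(os.path.join(dest, domain + '.txt'), 'wb') as f:
            f.write(data)
```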
20:21 🔗 arkiver already downloaded 2.5 GB of www.webmonkey.com/
20:22 🔗 arkiver hmm
20:22 🔗 arkiver maybe it would be helpful to create a list of websites people from the Archiveteam are currently downloading at home?
20:22 🔗 arkiver maybe in the wiki?
20:22 🔗 arkiver and that we regularly update it?
20:26 🔗 ivan` better to just get everything to IA
20:27 🔗 arkiver I mean that we have a more organised list of what is done and what still needs to be done?
20:28 🔗 arkiver and that we take some website, put our name behind them if we are working on them
20:28 🔗 arkiver know what I mean
20:28 🔗 arkiver ?
20:29 🔗 ivan` once you're grabbing many domains per day, I don't think you'll be motivated to keep it in sync
20:30 🔗 arkiver hmm oke then
20:30 🔗 arkiver but some domains will be grabbed twice or more maybe...
20:30 🔗 arkiver well
20:30 🔗 ersi so? :)
20:30 🔗 arkiver :P
20:30 🔗 ersi disk is cheap
20:30 🔗 arkiver and everything is going to IA
20:30 🔗 arkiver so not on our disk
20:46 🔗 w0rp Redundancy is good for archiving.
21:11 🔗 arkiver is anyone else here using heritrix?
21:11 🔗 arkiver I'm having a problem right now... :(
21:17 🔗 ersi I always recommend the following: Write about the problem instead of asking to ask about asking to ask
21:17 🔗 ersi I'm not running heritrix. What's the problem?
21:18 🔗 godane i think my bluray player may hate me
21:18 🔗 godane *bluray burner
21:18 🔗 arkiver I keep getting this error:
21:18 🔗 arkiver 2013-12-02T21:15:01.002Z SEVERE Failed to start bean 'bdb'; nested exception is java.lang.UnsatisfiedLinkError: Error looking up function 'link': Kan opgegeven procedure niet vinden.
21:18 🔗 godane i see it to burn at speed 4x
21:18 🔗 arkiver when I try to start a job from the checkpoint
21:18 🔗 godane and now its trying to burn at 10x
21:18 🔗 ersi What does the Dutch error message mean?
21:19 🔗 arkiver Can't find given procedure
21:20 🔗 ersi Hm. Has it worked before?
21:20 🔗 arkiver well
21:20 🔗 arkiver It suddenly worked one time
21:20 🔗 arkiver but then not
21:20 🔗 arkiver and before that time also not
21:21 🔗 arkiver if I try it one time and I get the before error and then try it a second it gives me this error:
21:21 🔗 arkiver 2013-12-02T21:20:48.918Z SEVERE Failed to start bean 'bdb'; nested exception is java.lang.IllegalStateException: com.sleepycat.je.EnvironmentFailureException: (JE 4.1.6) Environment must be closed, caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 4.1.6) K:\Internet-Archive\heritrix-3.1.1\bin\.\jobs\test\state fetchTarget of 0x0/0xbf parent IN=2 IN
21:21 🔗 arkiver class=com.sleepycat.je.tree.BIN lastFullVersion=0x1/0x540 parent.getDirty()=false state=0 LOG_FILE_NOT_FOUND: Log file missing, log is likely invalid. Environment is invalid and must be closed. (in thread 'test launchthread')
21:21 🔗 arkiver and yeah
21:21 🔗 arkiver the problem is that a log file is missing
21:21 🔗 arkiver so I checked it
21:21 🔗 arkiver it is missing a 00000000.jdb file
21:21 🔗 arkiver now
21:22 🔗 arkiver I opened that folder and right before I click to start the job again 00000000.jdb is still there
21:22 🔗 arkiver but right after I click it 00000000.jdb disappears and then I get the error
21:22 🔗 arkiver as if it is first deleting it and then trying to open it...
21:23 🔗 arkiver instead of first opening and then deleting it
21:53 🔗 godane so it looks like this disc is doing better than the last one
21:54 🔗 godane not saying everything is ok yet
21:55 🔗 godane the video is still like last time
21:55 🔗 godane but filesystem can be viewed
21:56 🔗 godane and the video does play
21:56 🔗 godane just fast-forwarding is slower than normal
22:01 🔗 yipdw oh neat, DigitalOcean has a second datacenter in Amsterdam
22:18 🔗 godane so good news
22:18 🔗 godane turns out i mistyped my burning script
22:18 🔗 ersi http://fortvv2.capitex.se/beskrivning.aspx?guid=46PQ44OP65VBJM2B&typ=CMFastighet
22:19 🔗 ersi want
22:19 🔗 ersi so bad
22:19 🔗 godane it had --speed=4 instead of -speed=4
22:19 🔗 ersi (Old Swedish Military fortification/base with tunnels and everything)
23:17 🔗 godane SketchCow: at some point i will be uploading all the pdfs i got from ftp.qmags.com to my godaneinbox
23:17 🔗 godane that way we can make tons of collections for them
23:19 🔗 godane there are like magazines about clean rooms
23:19 🔗 godane for making computer chips
23:20 🔗 godane this doesn't take off the table anyone wanting to put up a full ftp tar of it
23:33 🔗 godane i copied a html file of the ftp root index and made a list of pdf files to grab
23:33 🔗 godane this way i don't have to download every exe and sea.hqx file
