#archiveteam 2015-11-26,Thu


Time Nickname Message
00:13 🔗 Start has joined #archiveteam
00:13 🔗 cvb has quit IRC (Ping timeout: 255 seconds)
00:17 🔗 bwn_ has quit IRC (Read error: Operation timed out)
00:32 🔗 remsen has quit IRC (Read error: Operation timed out)
00:39 🔗 xk_id has quit IRC (Remote host closed the connection)
01:08 🔗 jleclanch https://web.archive.org/web/20101204061054/http://www.worldofwarcraft.com/info/burningcrusade/index.xml yay websites that used xsl...
01:09 🔗 joepie91 heh
01:20 🔗 Ymgve has quit IRC (Read error: Connection reset by peer)
01:22 🔗 zenguy_pc has joined #archiveteam
01:23 🔗 Ymgve has joined #archiveteam
01:37 🔗 philpem has quit IRC (Ping timeout: 252 seconds)
01:53 🔗 remsen has joined #archiveteam
02:06 🔗 vitzli has joined #archiveteam
02:18 🔗 primus104 has quit IRC (Leaving.)
02:27 🔗 vitzli has quit IRC (Ping timeout: 255 seconds)
02:41 🔗 vitzli has joined #archiveteam
02:47 🔗 dashcloud no ongoing projects, but there's a very complete archive of HP's stuff (pre-split HP), and piles of driver CDs on archive.org; I think there's also a pretty complete Dell driver FTP set as well
03:02 🔗 SketchCow Needs more.
03:02 🔗 SketchCow Site needs tons of curation, too, obviously.
03:05 🔗 remsen has quit IRC (Read error: Operation timed out)
03:13 🔗 Stiletto has joined #archiveteam
03:33 🔗 vtyl has joined #archiveteam
03:33 🔗 phuzion SketchCow: sorry about the ping earlier without a message to follow up with, I was gonna ask about the google code rsync target on FOS.
03:34 🔗 remsen has joined #archiveteam
03:37 🔗 lytv has quit IRC (Read error: Operation timed out)
03:39 🔗 godane SketchCow: looks like some mp3s are incomplete from kpfa
03:40 🔗 godane i put like 2 of the mp3s into archivebot so we have full proof that their server is hosting the incomplete file
03:41 🔗 godane look at the Wed0700 hour here: https://archive.org/details/kpfa-archives-radio-podcast-2005-08-10
03:41 🔗 godane it should be closer to 2 hours
03:41 🔗 godane cause thats the morning show
03:42 🔗 SketchCow Got it
04:20 🔗 vOYtEC has quit IRC (Read error: Connection reset by peer)
04:24 🔗 vOYtEC has joined #archiveteam
04:35 🔗 aaaaaaaaa has quit IRC (Leaving)
04:37 🔗 icedice has joined #archiveteam
04:45 🔗 icedice has quit IRC (Ping timeout: 360 seconds)
04:47 🔗 chfoo has quit IRC (Quit: quit)
04:52 🔗 vitzli has quit IRC (Quit: Leaving)
05:04 🔗 jleclanch SketchCow: hey there, you wanted some stuff
05:07 🔗 Sk1d has quit IRC (Read error: Operation timed out)
05:07 🔗 SketchCow I want so much fucking stuff.
05:07 🔗 jleclanch SketchCow: you like xml,?
05:08 🔗 SketchCow Like, as friends, or dating?
05:08 🔗 jleclanch well, you know, it's flexible and extensible
05:08 🔗 jleclanch SketchCow: https://leclan.ch/public/armory-dump-warning-2gb-uncompressed.tar.xz (18mb compressed, 1.5gb uncompressed)
05:09 🔗 jleclanch lotsa dumps i found on my hd
05:09 🔗 jleclanch SketchCow: 3 full scans from the wowarmory website back when it was fully xml-based. structured wow data basically. it's stuff!
05:12 🔗 jleclanch SketchCow: and with that out of the way, i have 40 gigabytes of virtual ticket videos and id like to put them up somewhere, but my upload is really poor =(
05:13 🔗 WinterFox has joined #archiveteam
05:14 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
05:19 🔗 RichardG has joined #archiveteam
05:20 🔗 SketchCow What are virtual ticket videos.
05:22 🔗 jleclanch SketchCow: sorry, shouldve said. paywalled videos. http://sprunge.us/gRIf
05:26 🔗 SketchCow Can you upload this xml
05:26 🔗 SketchCow Do the description and context. Otherwise, I'd just be making stuff up.
05:28 🔗 jleclanch SketchCow: xml's in there -> https://leclan.ch/public/armory-dump-warning-2gb-uncompressed.tar.xz
05:28 🔗 jleclanch it's like 300k files
05:37 🔗 nightpool has joined #archiveteam
05:45 🔗 nightpool has quit IRC (Read error: Operation timed out)
05:46 🔗 DFJustin http://archive.org/upload/
05:47 🔗 jleclanch DFJustin: im aware and it's not ideal at all, I can't maintain a good upload over time, ideally id like to torrent it out to someone who can upload it
05:53 🔗 SketchCow I came across your website and was really intruiged behind the whole concept of it.
05:53 🔗 SketchCow I was wondering, are there any current bypasses for Yahoo 2-step verfication, I would be willing to pay (if necessary).
05:53 🔗 SketchCow Fine, I'll do it.
05:54 🔗 phuzion jleclanch: If you torrent it, that data would be uploaded from your connection anyways. IA supports ingesting data with bittorrent, but it's less than ideal. If that's the route you'd like to go, then check this out: https://archive.org/about/faqs.php#321
05:55 🔗 jleclanch phuzion: yeah uploading is not a problem in and of itself. it's maintaining a connection for 40gb over the web page
05:55 🔗 jleclanch phuzion: i didnt know it supported bittorrent ill check it out
05:55 🔗 phuzion jleclanch: I don't know whether IA times out those bittorrent sessions after a while or anything, but it's worth a shot.
05:57 🔗 phuzion Either way, you're still talking about pushing 40GB of data up on what I'm assuming is a rather small residential internet connection.
05:57 🔗 phuzion It's going to take a long time no matter what protocol you use.
05:57 🔗 jleclanch yeah but torrent is easier :)
05:57 🔗 jleclanch i mean
05:58 🔗 jleclanch this is already technically split in 1-4gb videos
06:10 🔗 xmc if they are a bunch of 1-4gb files, you can upload them individually to the same item
06:11 🔗 xmc i think the website uploader works with that
06:12 🔗 jleclanch ill figure sth out, there's no rush
06:12 🔗 jleclanch if someone wants the dump though let me know
06:18 🔗 icedice has joined #archiveteam
06:27 🔗 phuzion jleclanch: are you sending the data to IA with a torrent?
06:28 🔗 phuzion Or are you uploading using the web interface?
06:28 🔗 jleclanch phuzion: im not sending anything yet, got other stuff to upload first
06:28 🔗 phuzion Do you have the torrent created yet?
06:30 🔗 jleclanch phuzion: no, I won't be doing this today. I'll call my isp to bump my upload speed first
06:30 🔗 phuzion Ok
06:30 🔗 jleclanch phuzion: why, you interested?
06:32 🔗 phuzion My thoughts were that I could download the torrent in the background and hang onto it.
06:38 🔗 nightpool has joined #archiveteam
06:49 🔗 nightpool has quit IRC (Ping timeout: 606 seconds)
06:57 🔗 SketchCow Length: 18956704 (18M) [application/octet-stream]
06:57 🔗 SketchCow Saving to: 'armory-dump-warning-2gb-uncompressed.tar.xz'
06:57 🔗 SketchCow 100%[=====================================================================================================>] 18,956,704 83.4KB/s in 3m 47s
06:57 🔗 SketchCow 2015-11-26 05:57:12 (81.6 KB/s) - 'armory-dump-warning-2gb-uncompressed.tar.xz' saved [18956704/18956704]
07:02 🔗 BlueMaxim has quit IRC (Leaving)
07:03 🔗 bwn_ has joined #archiveteam
07:18 🔗 cvb has joined #archiveteam
07:18 🔗 cvb has quit IRC (Connection closed)
07:22 🔗 cvb has joined #archiveteam
07:25 🔗 primus104 has joined #archiveteam
07:52 🔗 vitzli has joined #archiveteam
07:54 🔗 remsen has quit IRC (Read error: Operation timed out)
07:54 🔗 GLaDOS has quit IRC (Read error: Operation timed out)
07:54 🔗 BlueMaxim has joined #archiveteam
07:59 🔗 GLaDOS has joined #archiveteam
08:01 🔗 icedice has quit IRC (Quit: Leaving)
08:03 🔗 godane has left
08:08 🔗 godane has joined #archiveteam
08:10 🔗 SketchCow https://archive.org/details/negativland coming nicely.
08:14 🔗 godane you maybe getting this as a collection of videos: https://www.youtube.com/channel/UCFlf_u19WYW0ftOuxuWfLKQ/videos
08:15 🔗 godane alot of Macy's Thanksgiving Day Parades
08:17 🔗 cvb has quit IRC (Quit: Leaving)
08:27 🔗 rolfb has joined #archiveteam
08:33 🔗 atomotic has joined #archiveteam
08:33 🔗 nightpool has joined #archiveteam
08:35 🔗 bwn_ has quit IRC (Read error: Operation timed out)
08:36 🔗 rolfb has quit IRC (Linkinus - http://linkinus.com)
08:39 🔗 nightpool has quit IRC (Read error: Operation timed out)
08:42 🔗 primus104 has quit IRC (Leaving.)
09:03 🔗 xk_id has joined #archiveteam
09:43 🔗 schbirid has joined #archiveteam
09:54 🔗 BlueMaxim has quit IRC (Quit: Leaving)
10:20 🔗 primus104 has joined #archiveteam
10:26 🔗 bwn has joined #archiveteam
10:46 🔗 vOYtEC has quit IRC (rm -r *)
10:54 🔗 Sk1d has joined #archiveteam
10:54 🔗 vOYtEC has joined #archiveteam
11:44 🔗 bwn has quit IRC (Read error: Connection reset by peer)
11:57 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
11:58 🔗 remsen has joined #archiveteam
12:03 🔗 remsen2 has joined #archiveteam
12:04 🔗 antomatic The docstoc grab seems to be occasionally getting caught up in loops of URLs like /images/images/images/images/images/images/ or /docs/../images/../content/../docs/../images/../content/ and such... any word on a fix yet?
12:08 🔗 remsen has quit IRC (Read error: Operation timed out)
12:29 🔗 PrincessK has joined #archiveteam
12:30 🔗 PrincessK is now known as Knoeki
12:30 🔗 Knoeki \o
12:35 🔗 nightpool has joined #archiveteam
12:43 🔗 nightpool has quit IRC (Ping timeout: 483 seconds)
12:44 🔗 antomatic another popular one is http://embed.docstoc.com/handlers/downloadfilefromflash.ashx?docid=10971830&ref_url=http://www.docstoc.com/docs/chrome://skype_ff_toolbar_win/content/flags/chrome://skype_ff_toolbar_win/content/chrome://skype_ff_toolbar_win/content/chrome://skype_ff_toolbar_win/content/chrome://skype_ff_toolbar_win/content/flags/chrome://skype_ff_toolbar_win/content/chrome://skype_ff_toolbar_
12:44 🔗 antomatic win/content/flags/chrome://skype_ff_toolbar_win/content/chrome://skype_ff_toolbar_win/content/chrome://skype_ff_toolbar_win/content/flags/chrome://skype_ff_toolbar_win/content/chrome://skype_ff_toolbar_win/content/arrow.gif
12:45 🔗 Knoeki I can't even click that anymore :')
12:46 🔗 antomatic I doubt if it does anything. :)
12:48 🔗 luckcolor has joined #archiveteam
12:51 🔗 atomotic has joined #archiveteam
12:51 🔗 Knoeki antomatic: haha, well, I was refering to the fact that it's split up over 2 lines :P
12:51 🔗 Knoeki https://twitter.com/knoeki/status/669860020931141632
12:54 🔗 antomatic ah, always had a soft spot for those commodore monitors. :)
12:56 🔗 luckcolor hello guys
12:57 🔗 antomatic hlloo!
12:57 🔗 Knoeki antomatic: I've got 2 here, they were both dead before I moved a couple weeks ago
12:58 🔗 Knoeki now one of them magically works perfectly again :')
12:58 🔗 Knoeki must've bumped it the right way
13:10 🔗 WinterFox has quit IRC (Remote host closed the connection)
13:22 🔗 Knoeki is now known as PrincessK
13:26 🔗 arkiver antomatic: fix is in
13:27 🔗 antomatic [[applause]]
13:29 🔗 antomatic nice one arkiver
13:35 🔗 arkiver chfoo: sorry to ping you again. Can you please recreate the googlecode rsync target on FOS?
13:39 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
13:43 🔗 vitzli_ has joined #archiveteam
13:45 🔗 vitzli has quit IRC (Read error: Operation timed out)
13:47 🔗 primus104 has quit IRC (Leaving.)
13:52 🔗 Atluxity whats up with docstoc?
13:53 🔗 arkiver hmm looks like items/min went down dramatically
13:54 🔗 Atluxity yeah
13:54 🔗 arkiver looks like our useragent isn't blocked
13:54 🔗 arkiver Atluxity: can you check your IPs?
13:54 🔗 arkiver or just some of them?
13:54 🔗 arkiver might be IP bans
13:54 🔗 Atluxity besides some of them trying to get some ridiculus loop urls, I only saw tracker limit error
13:55 🔗 arkiver yeah I just paused the grab
13:55 🔗 arkiver restarted now, do you see anything?
13:56 🔗 Atluxity I do not seem to be ip banned
13:58 🔗 arkiver Atluxity: I recently set the new version of the scripts in the tracker
13:59 🔗 arkiver Though, that version was released some time ago already. I wanted to give some time to update
13:59 🔗 Ghost_of_ has joined #archiveteam
14:01 🔗 arkiver Hi Ghost_of_
14:01 🔗 Ghost_of_ hi, arkiver ... 'sup?
14:01 🔗 arkiver We had someone from yuku here yesterday
14:01 🔗 arkiver our useragent was blocked due to the high traffic, but unblocked now
14:02 🔗 arkiver Basically they have some problems with advertising income due to our non-human traffic
14:02 🔗 arkiver But that'll be fixed
14:03 🔗 Ghost_of_ so, they're basically OK with the archiving?
14:03 🔗 arkiver yes
14:03 🔗 Ghost_of_ cool
14:03 🔗 arkiver Log starting from here http://archive.fart.website/bin/irclogger_log/archiveteam?date=2015-11-25,Wed&sel=435#l431
14:03 🔗 arkiver person is yukundali
14:03 🔗 Ghost_of_ "a bad script" :)
14:03 🔗 arkiver let's talk further in #archiveteam-bs though
14:19 🔗 atomotic has joined #archiveteam
14:19 🔗 icedice has joined #archiveteam
14:33 🔗 antomatic aw, man.. got a docstoc grab running through recursive URLs including /../../../../Local%20Settings/Temporary%20%Internet%20Files/varun_/varun?AMFI%20-%20Association%20of%20Mutual%20Fund%20in%20India1_files/mutualfundind_files//../../../../Local%20Settings/Temporary%20%Internet%20Files/varun_/varun?AMFI%20-%20Association%20of%20Mutual%20Fund%20in%20India1_files/mutualfundind_files//../../../../Loca
14:33 🔗 antomatic l%20Settings/Temporary%20%Internet%20Files/varun_/varun?AMFI%20-%20Association%20of%20Mutual%20Fund%20in%20India1_files/mutualfundind_files//../../../../Local%20Settings/Temporary%20%Internet%20Files/varun_/varun?AMFI%20-%20Association%20of%20Mutual%20Fund%20in%20India1_files/mutualfundind_files/
14:36 🔗 vitzli_ has quit IRC (Quit: Leaving)
14:36 🔗 arkiver is that with updated scripts?
14:36 🔗 antomatic yes
14:39 🔗 arkiver do you have a full log for me?
14:39 🔗 arkiver scripts do mark the url as a loop and skip the URL, so something else keeps queueing more URLs
14:39 🔗 arkiver does*
14:39 🔗 arkiver and which item is it?
14:39 🔗 antomatic not sure, I'm trying to stop the scripts at the moment
14:39 🔗 antomatic I only see this stuff going past on the screen
14:40 🔗 Atluxity arkiver: when did you update script?
14:40 🔗 arkiver yesterday (for me)
14:42 🔗 arkiver I returned the version to the previous now and the tracker has a lot more requests coming in.
14:42 🔗 arkiver so the problem is not updated scripts
14:42 🔗 vitzli has joined #archiveteam
14:52 🔗 Atluxity to make debugging easier I thought it was a good time for me to restart my hose, kill some of the long looping urls
14:53 🔗 Atluxity I had some urls counting 60 000 chars in length
14:54 🔗 Atluxity it seemed to affect my memory
14:55 🔗 Atluxity unfortunatly I did not consider you would have use for those logs
14:55 🔗 zerkalo has quit IRC (Remote host closed the connection)
15:05 🔗 antomatic arkiver: here's one looper - http://pastebin.com/VNA4CsUH
15:06 🔗 antomatic the other one is a URL so long that it's bigger than my scrollback buffer - http://pastebin.com/zZk9Yh2q
15:08 🔗 antomatic how can I run-pipeline and log everything too?
15:08 🔗 antomatic [I realise that's a silly question but I just haven't done it before]
15:11 🔗 antomatic Once things start to loop it looks like they go on for a while though: http://pastebin.com/wuWPP8Yt
15:12 🔗 luckcolor if you want too do a pipeline
15:13 🔗 luckcolor cat hello.txt > txt.log
15:13 🔗 luckcolor using tee
15:13 🔗 luckcolor command should both print out and pipe into a file
15:14 🔗 luckcolor https://en.wikipedia.org/wiki/Tee_%28command%29
15:15 🔗 antomatic ah, interesting - thanks
15:15 🔗 luckcolor np
15:15 🔗 luckcolor so yeah it will probably be ./run-pipeline | tee log.txt
15:16 🔗 luckcolor no wait
15:16 🔗 luckcolor the example i made is probably wrong
15:16 🔗 luckcolor :P
15:16 🔗 luckcolor no it's fine
15:16 🔗 luckcolor lint program.c | tee program.lint
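A note on the tee suggestion above: a plain `| tee log.txt` only captures standard output, so anything the scripts print on standard error (which may include runtime error messages) would not reach the log unless it is redirected as well. The following is a minimal Python sketch of the same idea, not part of the Archive Team tooling; the function name `run_and_tee` and the log path are assumptions made for illustration.

```python
import subprocess
import sys


def run_and_tee(command, log_path):
    """Run `command`, echoing every output line to the terminal and to `log_path`.

    Hypothetical helper: behaves roughly like `command 2>&1 | tee -a log_path`.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        proc = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so error lines are logged too
            encoding="utf-8",
            errors="replace",          # very long or odd URLs should not crash the logger
        )
        for line in proc.stdout:
            sys.stdout.write(line)     # still visible on screen
            log.write(line)            # and kept for later debugging
            log.flush()
        return proc.wait()
```

It would be called with the usual command as a list, for example `run_and_tee(["./run-pipeline"] + your_usual_arguments, "log.txt")`, where `your_usual_arguments` stands for whatever is normally passed on the command line.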
15:30 🔗 DFJustin https://twitter.com/wikileaks/status/669900131777576960
15:36 🔗 luckcolor well i'm not downloading that :P
15:36 🔗 luckcolor too shitty internet speed
15:41 🔗 SketchCow HAPPY THANKSGIVING FROM THE USANIANS TO ALL
15:42 🔗 arkiver SketchCow: unfortunately chfoo has not yet responded to the googlecode rsync target problem
15:43 🔗 arkiver can you please create an rsync target in /chfoo/ for googlecode?
15:43 🔗 nightpool has joined #archiveteam
15:44 🔗 scyther has joined #archiveteam
15:49 🔗 Start has quit IRC (Quit: Disconnected.)
15:49 🔗 nightpool has quit IRC (Read error: Operation timed out)
15:51 🔗 SketchCow Done?
15:51 🔗 SketchCow googlecode?
15:52 🔗 icedice Can !yahoo be used at Blogspot and WordPress or will Google and Automattic detect that and block it?
15:53 🔗 arkiver SketchCow: we are about to start the project, but we need a rsync target.
15:53 🔗 arkiver We had a target on FOS, but it's not available anymore
15:55 🔗 SketchCow I think I just added one
15:55 🔗 arkiver Yes! looks like it's working!
15:55 🔗 arkiver I'll start the google code project
15:57 🔗 SketchCow ANything else before I go take care of my thanksgiving lady
16:01 🔗 arkiver Nothing very important at the moment
16:05 🔗 primus104 has joined #archiveteam
16:10 🔗 SketchCow I'll be on and off today.
16:12 🔗 nertzy has quit IRC (Read error: Connection reset by peer)
16:17 🔗 SketchCow Heading down
16:18 🔗 SketchCow One last thing: Of course all my automatic pushers had run out, so those are running, already took FOS's drive from 49% to 41% and dropping
16:18 🔗 SketchCow All gamefront, of course
16:18 🔗 SketchCow And Yuku continues uploading, and adrive went in
16:19 🔗 Start has joined #archiveteam
16:25 🔗 antomatic arkiver: can't tell you what caused it, but just seen "Lua runtime error: docstoc.lua:122: invalid capture index." scroll by
16:26 🔗 arkiver something with a loop?
16:26 🔗 antomatic i don't know - i don't think so
16:27 🔗 zenguy_pc has quit IRC (Ping timeout: 252 seconds)
16:30 🔗 xk_id has quit IRC (Remote host closed the connection)
16:31 🔗 dashcloud has quit IRC (Read error: Operation timed out)
16:34 🔗 dashcloud has joined #archiveteam
16:39 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
16:39 🔗 nightpool has joined #archiveteam
16:44 🔗 Jonimus has quit IRC (Read error: Operation timed out)
16:46 🔗 vitzli has quit IRC (Quit: Leaving)
16:47 🔗 nightpool has quit IRC (Read error: Operation timed out)
16:59 🔗 luckcolor arkiver is still the googlecode project missing the target?
17:03 🔗 Atluxity no
17:03 🔗 Atluxity it has a target now
17:03 🔗 Atluxity now we are waiting for a git merge
17:03 🔗 arkiver luckcolor: yeah
17:04 🔗 arkiver I mean no, it's not missing the target
17:04 🔗 luckcolor ok
17:04 🔗 luckcolor i'll do some items of docstoc in the meantime
17:04 🔗 luckcolor also i had a random bug
17:04 🔗 luckcolor when i loaded the warrior page it froze
17:04 🔗 luckcolor and a script error occurred on the browser
17:05 🔗 luckcolor i rebooted the warrior and it's fixed
17:06 🔗 HCross luckcolor, did it have items with lots of URLS in it?
17:08 🔗 Start has quit IRC (Quit: Disconnected.)
17:08 🔗 luckcolor maybe
17:08 🔗 luckcolor i couldn't check
17:09 🔗 luckcolor the webpage didn't render
17:09 🔗 HCross if it has lots it will do that
17:09 🔗 luckcolor definetely the web interface has some bugs
17:09 🔗 luckcolor like not applying the web login password
17:10 🔗 remsen has joined #archiveteam
17:13 🔗 remsen2 has quit IRC (Read error: Operation timed out)
17:29 🔗 jleclanch SketchCow: hey idk if this is useful to you, or anyone in the chan, but i found some scraping scripts for gog/steam/metacritic reviews i made a year ago for a thing. https://github.com/jleclanche/scrape-scripts
17:36 🔗 arkiver So I'm now indexing some FTP servers
17:36 🔗 arkiver Basically we'll also check every now and then if the FTPs have new files and grab those too
17:51 🔗 icedice Do aborted ArchiveBot jobs still get uploaded to Archive.org?
17:52 🔗 icedice I have an !a archivation running that I think got everything I need and now it has started archiving the blogs of the people that commented on the page.
17:53 🔗 icedice So I think it's time to pull the plug on that once
17:53 🔗 icedice *one
17:54 🔗 DFJustin yes
17:54 🔗 DFJustin you can just use agressive ignores though which is usually better
17:54 🔗 icedice DO you have any guide for that?
17:55 🔗 xk_id has joined #archiveteam
17:55 🔗 icedice Because 20 000 pages and counting is a bit much for one blog post archival
17:55 🔗 DFJustin oh yeah just abort that
17:55 🔗 icedice clarification: ca 1200 / 20 000+ archived
17:58 🔗 Start has joined #archiveteam
18:12 🔗 DopefishJ has joined #archiveteam
18:12 🔗 swebb sets mode: +o DopefishJ
18:13 🔗 DFJustin has quit IRC (Ping timeout: 310 seconds)
18:29 🔗 xk_id has quit IRC (Remote host closed the connection)
18:30 🔗 xk_id has joined #archiveteam
18:34 🔗 DFJustin has joined #archiveteam
18:34 🔗 swebb sets mode: +o DFJustin
18:35 🔗 DopefishJ has quit IRC (Read error: Operation timed out)
18:37 🔗 philpem has joined #archiveteam
18:40 🔗 Start has quit IRC (Quit: Disconnected.)
18:46 🔗 xk_id_ has joined #archiveteam
18:46 🔗 xk_id has quit IRC (Read error: Connection reset by peer)
18:48 🔗 Start has joined #archiveteam
18:58 🔗 Ghost_of_ has quit IRC (Quit: Leaving)
19:06 🔗 Start has quit IRC (Read error: Connection reset by peer)
19:06 🔗 Start has joined #archiveteam
19:14 🔗 remsen has quit IRC (Read error: Operation timed out)
19:16 🔗 bwn has joined #archiveteam
19:16 🔗 Start has quit IRC (Quit: Disconnected.)
19:25 🔗 xk_id_ has quit IRC (Remote host closed the connection)
19:32 🔗 bwn has quit IRC (Read error: Operation timed out)
19:56 🔗 bwn has joined #archiveteam
19:57 🔗 luckcolor has quit IRC (Quit: Leaving)
19:58 🔗 icedice How do I add the no-offsite-links ignore pattern to a running archivation job?
19:59 🔗 Atluxity define running archivation job
19:59 🔗 icedice I am archiving a Photobucket album and currently it's archiving Pintrest images
20:00 🔗 Atluxity how is this archivation done?
20:01 🔗 icedice I used !a http://smg.photobucket.com/user/BlackjackGabbiani/library/Snatcher%20Leo/ on ArchiveBot
20:01 🔗 Atluxity ah
20:02 🔗 icedice I figured that I wouldn't be archiving 3300 items just for seven images
20:03 🔗 Atluxity I have never seen such an igset
20:03 🔗 Atluxity are you looking for !ao ?
20:04 🔗 icedice I've been told that !ao wouldn't get the fullsize images
20:05 🔗 icedice and that it'd just get the thumbnails
20:05 🔗 icedice https://archivebot.readthedocs.org/en/latest/commands.html#ignore
20:07 🔗 joepie91 I don't believe it is possible to ignore offsite links later on
20:08 🔗 joepie91 (currently)
20:08 🔗 joepie91 icedice: you can hack something together with a regex with a negative lookahead, if you're feeling adventurous, but that will also prevent static assets from being downloaded
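To illustrate the negative-lookahead idea joepie91 mentions: the sketch below builds a pattern that matches every http(s) URL whose host is not photobucket.com or one of its subdomains, so it could serve as an "ignore everything off-site" rule. It assumes the ignore patterns are ordinary Python-style regular expressions; the names `OFFSITE` and `is_offsite` are made up for this example and are not ArchiveBot identifiers.

```python
import re

# Hypothetical ignore pattern: matches when the host is anything other than
# photobucket.com or a subdomain of it. The (?!...) group is the negative lookahead.
OFFSITE = re.compile(r"^https?://(?!(?:[^/]+\.)?photobucket\.com(?:[:/]|$))")


def is_offsite(url):
    """Return True when `url` would be ignored by the pattern above."""
    return OFFSITE.search(url) is not None
```

As noted in the channel, this has a cost: stylesheets, scripts and images served from other domains match the pattern too, so they would be skipped along with the unwanted pages.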
20:09 🔗 icedice I think I'll just abort it
20:09 🔗 icedice I have the pages archived since long ago by now
20:09 🔗 Atluxity icedice: --no-offsite-links needs to be a parameter to !a
20:09 🔗 Atluxity it can not be added with igset, it seems
20:09 🔗 icedice and I don't see any appeal in archiving random Pintrest images
20:09 🔗 icedice ok
20:10 🔗 joepie91 you'd be surprised :P
20:10 🔗 joepie91 heh
20:10 🔗 icedice If I was doing a complete siterip, sure
20:11 🔗 icedice but just archiving a few thousand Pinterest images seems more like a bump in the road than a good archivation effort
20:11 🔗 icedice I mean, if it was the target that would be another thing
20:16 🔗 icedice Can the no-offsite-links pattern be added to the list of commands that can be used during archivation processes in a future update of ArchiveBot?
20:19 🔗 icedice 130 items instead of 3300 when using no-offsite-links on a small Photobucket album
20:19 🔗 icedice Not bad
20:20 🔗 icedice *edit: 255 (it hadn't counted all the items at that time, it seems)
20:24 🔗 joepie91 icedice: best file a bug on the repo :)
20:25 🔗 joepie91 icedice: https://github.com/ArchiveTeam/ArchiveBot
20:25 🔗 icedice Ok, I'll do that
20:25 🔗 xk_id has joined #archiveteam
20:28 🔗 icedice has left Leaving
20:32 🔗 xk_id has quit IRC (Read error: Operation timed out)
20:50 🔗 Start has joined #archiveteam
21:10 🔗 antomatic arkiver: docstoc doc number 18063834 goes into an /images/images/images/images/... loop, if that helps any?
21:11 🔗 antomatic arkiver: and docis 6588886 has recursive &ref_url= s like http://embed.docstoc.com/handlers/downloadfilefromflash.ashx?docid=6588886&ref_url=http://www.docstoc.com/docs/6588886/../../../images/../../../skins-1.5/common/images/../../../images/../../../images/../../../skins-1.5/common/images/../../../images/../../../skins-1.5/common/images/../../../skins-1.5/common/images/../../../images/wikimedia-bu
21:11 🔗 antomatic tton.png
21:11 🔗 antomatic *docid
21:11 🔗 arkiver it's always with the downloadfilefromflash urls
21:12 🔗 antomatic mm, these last two certainly were
21:15 🔗 antomatic then again, what creates those recursive URLs to serve as the referrer in the first place, I wonder.
21:17 🔗 antomatic the bare referrer URL does seem to redirect to something valid-looking (although it's not there, so then redirects again to a 404)
21:18 🔗 xk_id has joined #archiveteam
21:18 🔗 antomatic but I wonder if there's something on the underlying pages giving out bad links which then get innocently followed and which then trigger the recursion
21:19 🔗 atomotic has joined #archiveteam
21:24 🔗 antomatic hm, wonder if the &ref_url element is even needed at all, come to think of it
21:28 🔗 arkiver I'll block urls with a nuber of / i them
21:28 🔗 arkiver in*
21:29 🔗 cvb has joined #archiveteam
21:38 🔗 Ungstein1 has quit IRC (Quit: Leaving.)
21:42 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
21:49 🔗 Start has quit IRC (Read error: Operation timed out)
21:49 🔗 nertzy has joined #archiveteam
21:50 🔗 Start has joined #archiveteam
21:50 🔗 nertzy has quit IRC (Client Quit)
21:54 🔗 nertzy has joined #archiveteam
21:56 🔗 arkiver SketchCow: I sent you a mail
21:59 🔗 ersi I'll call you to make you aware of my notification on IRC where I notify you about the e-mail I sent you
22:01 🔗 aaaaaaaaa has joined #archiveteam
22:01 🔗 swebb sets mode: +o aaaaaaaaa
22:02 🔗 joepie91 ersi: I'll make sure to pass it along
22:02 🔗 joepie91 ;)
22:07 🔗 BlueMaxim has joined #archiveteam
22:13 🔗 HCross What is the docstock channel?
22:17 🔗 scyther has quit IRC (Read error: Connection reset by peer)
22:17 🔗 Start has quit IRC (Quit: Disconnected.)
22:18 🔗 arkiver #docstop
22:21 🔗 HCross has quit IRC (Read error: Operation timed out)
22:26 🔗 godane has quit IRC (Read error: Operation timed out)
22:28 🔗 schbirid has quit IRC (Quit: Leaving)
22:29 🔗 schbirid has joined #archiveteam
22:48 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
22:49 🔗 BlueMaxim has joined #archiveteam
22:51 🔗 JSharp___ has quit IRC (Remote host closed the connection)
22:51 🔗 zyphlar__ has quit IRC (Remote host closed the connection)
23:03 🔗 HarryCros has joined #archiveteam
23:05 🔗 arkiver antomatic: sorry, I was busy with google code
23:05 🔗 arkiver I'll add a ignore pattern for the loops tomorrow morning
23:05 🔗 * arkiver is afk for the night
23:06 🔗 arkiver loops always suck in these kind of projects. it's hard to find out in scripts if a url is a loop
23:07 🔗 arkiver basically you can never be 100% sure an url is a loop without a human looking at it
23:07 🔗 arkiver but we can be 90% sure with certain ignore patterns, so let's do that
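As a rough illustration of the kind of heuristic arkiver describes, here is a small Python sketch that flags URLs as probable loops. It is not the fix that went into the docstoc scripts (those are Lua and are not shown in this log); the function name and both thresholds are assumptions picked for the example, and like any such rule it can misjudge a legitimate URL.

```python
from collections import Counter
from urllib.parse import unquote, urlsplit

MAX_URL_LENGTH = 2000     # assumed cutoff; URLs of 60 000 characters were reported above
MAX_SEGMENT_REPEATS = 3   # assumed cutoff for one path segment repeating


def looks_like_loop(url):
    """Guess whether `url` is a recursive loop URL (heuristic, not certain)."""
    if len(url) > MAX_URL_LENGTH:
        return True
    parts = urlsplit(url)
    # Look at the query string as well, because the ref_url= parameter
    # carried the repeated paths in the downloadfilefromflash URLs.
    text = unquote(parts.path + "/" + parts.query)
    segments = [s for s in text.split("/") if s and s != ".."]
    if not segments:
        return False
    most_repeated = Counter(segments).most_common(1)[0][1]
    return most_repeated > MAX_SEGMENT_REPEATS
```

With these thresholds, a URL such as `http://www.docstoc.com/images/images/images/images/images/images/` is flagged, while an ordinary document URL with no repeated segments is not. Loops that repeat each segment only a couple of times would slip through, which fits the point that a pattern can only be mostly right.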
23:11 🔗 JSharp___ has joined #archiveteam
23:19 🔗 zerkalo has joined #archiveteam
23:25 🔗 zyphlar__ has joined #archiveteam
23:27 🔗 maseck has quit IRC (Read error: Operation timed out)
23:28 🔗 maseck has joined #archiveteam
23:50 🔗 Start has joined #archiveteam
23:58 🔗 bwn has quit IRC (Read error: Operation timed out)
