#archiveteam 2016-02-12,Fri


Time Nickname Message
00:03 🔗 mutoso has joined #archiveteam
00:05 🔗 antomatic I've got GameTrailers items which will finish uploading in just under two weeks. [Which is not actually a problem, obvs.]
00:11 🔗 kyan has joined #archiveteam
00:37 🔗 RichardG has quit IRC (Ping timeout: 250 seconds)
00:54 🔗 n00b972 has joined #archiveteam
00:55 🔗 n00b972 not sure if this has been posted already but http://www.factmag.com/2016/02/11/soundcloud-financial-report-44m-losses/
00:56 🔗 snape Being discussed in #soundclowns, come on by
00:56 🔗 n00b972 ah, thanks!
00:56 🔗 n00b972 is now known as mhazinsk_
01:00 🔗 joepie91 snape: mhazinsk_: #soundclown, not #soundclowns :)
01:01 🔗 wyatt8740 has quit IRC (Read error: Operation timed out)
01:02 🔗 wyatt8740 has joined #archiveteam
01:06 🔗 mhazinsk has joined #archiveteam
01:06 🔗 mhazinsk_ has quit IRC (Quit: http://chat.efnet.org )
01:38 🔗 RichardG has joined #archiveteam
01:46 🔗 superkuh has quit IRC (Ping timeout: 252 seconds)
01:50 🔗 RichardG has quit IRC (Ping timeout: 633 seconds)
01:52 🔗 RichardG has joined #archiveteam
01:52 🔗 JesseW has joined #archiveteam
02:07 🔗 lunG has quit IRC (Ping timeout: 250 seconds)
02:12 🔗 brayden_ has joined #archiveteam
02:12 🔗 swebb sets mode: +o brayden_
02:14 🔗 Coderjoe since gametrailers is (currently) exhausted, where should i direct my warrior?
02:15 🔗 vitzli has joined #archiveteam
02:17 🔗 brayden has quit IRC (Read error: Operation timed out)
02:26 🔗 JesseW Coderjoe: fotolog
04:08 🔗 TheKiwi I have a rather large proxmox dedicated server sitting idle right now, hmmmm
04:08 🔗 TheKiwi it's not mine, but I admin it
04:19 🔗 kyan has quit IRC (Leaving)
04:27 🔗 kyan has joined #archiveteam
04:41 🔗 megaminxw has joined #archiveteam
04:41 🔗 Microguru has quit IRC (Read error: Connection reset by peer)
04:51 🔗 JesseW TheKiwi: feel free to point it at urlteam or fotolog...
04:54 🔗 kyan has quit IRC (Leaving)
04:59 🔗 superkuh has joined #archiveteam
05:06 🔗 JetBalsa has quit IRC (Read error: Connection reset by peer)
05:08 🔗 Atom__ has quit IRC (Read error: Connection reset by peer)
05:11 🔗 Infreq_ has quit IRC (Ping timeout: 252 seconds)
05:12 🔗 Infreq has joined #archiveteam
05:24 🔗 acridAxid has quit IRC (Quit: marauder)
05:26 🔗 acridAxid has joined #archiveteam
05:40 🔗 Sk1d has quit IRC (Ping timeout: 200 seconds)
05:46 🔗 Sk1d has joined #archiveteam
05:52 🔗 vitzli has quit IRC (Leaving)
05:56 🔗 WinterFox has joined #archiveteam
06:31 🔗 nickname_ has quit IRC (Ping timeout: 300 seconds)
06:40 🔗 GLaDOS has quit IRC (Read error: Operation timed out)
06:41 🔗 GLaDOS has joined #archiveteam
07:15 🔗 fpoee has joined #archiveteam
07:20 🔗 plog99 has quit IRC (Read error: Operation timed out)
07:27 🔗 vitzli has joined #archiveteam
07:39 🔗 yipdw has quit IRC (Ping timeout: 633 seconds)
07:57 🔗 JesseW has quit IRC (Quit: Leaving.)
08:06 🔗 yipdw has joined #archiveteam
08:12 🔗 yipdw has quit IRC (Ping timeout: 246 seconds)
08:19 🔗 JesseW has joined #archiveteam
08:43 🔗 SketchCow Load on FOS is now up to 250
08:43 🔗 SketchCow This is what happens when I increase rsync connections.
08:43 🔗 RichardG has quit IRC (Ping timeout: 250 seconds)
09:01 🔗 schbirid has joined #archiveteam
09:01 🔗 yipdw has joined #archiveteam
10:35 🔗 JesseW has quit IRC (Quit: Leaving.)
10:36 🔗 zino_ is now known as zino
11:19 🔗 ersi Faster, harder, scooter
11:35 🔗 dan- has quit IRC (Read error: Operation timed out)
11:49 🔗 oldcad has joined #archiveteam
12:15 🔗 dserodio has quit IRC (Read error: Operation timed out)
12:18 🔗 dserodio has joined #archiveteam
12:19 🔗 dserodio has quit IRC (Read error: Operation timed out)
12:22 🔗 dserodio has joined #archiveteam
12:26 🔗 megaminxw has quit IRC (Quit: Leaving.)
12:29 🔗 megaminxw has joined #archiveteam
13:00 🔗 HCross2 Can someone do an ao on http://www.bbc.co.uk/news/uk-35561645 for me please
13:03 🔗 dan- has joined #archiveteam
13:07 🔗 ersi done
13:08 🔗 HCross2 Thanks. Normally NewsBuddy would get it
13:11 🔗 WinterFox has quit IRC (Remote host closed the connection)
13:13 🔗 weles has joined #archiveteam
13:35 🔗 nickname_ has joined #archiveteam
13:42 🔗 Stilett0 has joined #archiveteam
13:44 🔗 Stiletto has quit IRC (Read error: Operation timed out)
13:56 🔗 Atom__ has joined #archiveteam
14:00 🔗 RichardG has joined #archiveteam
14:09 🔗 Jon has quit IRC (Ping timeout: 260 seconds)
14:10 🔗 GLaDOS has quit IRC (Ping timeout: 260 seconds)
14:10 🔗 GLaDOS has joined #archiveteam
14:12 🔗 jmtd has joined #archiveteam
14:29 🔗 philpem has joined #archiveteam
14:30 🔗 bzc6p has joined #archiveteam
14:30 🔗 swebb sets mode: +o bzc6p
14:34 🔗 bzc6p SketchCow: I don't know what you've been doing exactly with the rsync limit, but a bit after your last announcement the limit started to visibly affect all projects.
14:35 🔗 bzc6p A suggestion of mine would be suspending the myVIP project, even though it's not the core of the problem.
14:35 🔗 bzc6p arkiver: ^
14:35 🔗 GLaDOS has quit IRC (Read error: Operation timed out)
14:37 🔗 GLaDOS has joined #archiveteam
14:44 🔗 megaminxw has quit IRC (Quit: Leaving.)
14:54 🔗 zyphlar_ has quit IRC (Quit: Connection closed for inactivity)
14:59 🔗 Start has quit IRC (Quit: Disconnected.)
15:02 🔗 redlob has quit IRC (Ping timeout: 260 seconds)
15:05 🔗 rspn has joined #archiveteam
15:06 🔗 rspn If I wanted to make sure to mirror an entire website (but no outgoing links), would I simply use wget -m or something else?
15:07 🔗 MrRadar rspn: We recommend using the WARC format for archiving web sites
15:07 🔗 MrRadar It saves the HTTP headers in addition to the content
15:07 🔗 MrRadar And there are many programs like pywb that let you browse the WARC like archived pages in the Wayback Machine
15:07 🔗 rspn MrRadar: so this http://www.archiveteam.org/index.php?title=Wget#Creating_WARC_with_wget
15:07 🔗 rspn WARC is what I want anyway
15:08 🔗 MrRadar Yes, exactly
15:08 🔗 MrRadar I personally use wpull but wget is good too
15:08 🔗 rspn Alright, great! Thanks
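
For reference, the wget-with-WARC approach discussed above could be sketched roughly as follows. This is a minimal illustration, not the exact command from the wiki page: the helper name `build_wget_warc_command`, the example URL, and the one-second wait are assumptions made here. The flags themselves (`--mirror`, `--page-requisites`, `--wait`, `--warc-file`, `--warc-cdx`) are standard wget options, and wget stays on the seed host unless `--span-hosts` is given, which covers the "no outgoing links" requirement.

```python
import shlex
from urllib.parse import urlparse


def build_wget_warc_command(seed_url, warc_name, wait_seconds=1):
    """Build a wget argument list that mirrors one site into a WARC file.

    Hypothetical helper for illustration only; it just assembles arguments.
    """
    parsed = urlparse(seed_url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"not an http(s) URL: {seed_url!r}")
    return [
        "wget",
        "--mirror",                  # recursive download with timestamping
        "--page-requisites",         # also fetch images, CSS and scripts
        f"--wait={wait_seconds}",    # pause between requests (assumed value)
        f"--warc-file={warc_name}",  # write a WARC alongside the mirror
        "--warc-cdx",                # also write a CDX index for the WARC
        seed_url,                    # --span-hosts is omitted, so wget stays on this host
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    command = build_wget_warc_command("http://example.com/", "example.com-2016-02-12")
    print(shlex.join(command))
```

The printed command can be pasted into a shell, or the list can be passed to `subprocess.run` once the options look right for the site in question.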
15:11 🔗 redlob has joined #archiveteam
15:11 🔗 morbus_ has quit IRC (Quit: http://www.disobey.com/)
15:30 🔗 PuppyCock has quit IRC (Ping timeout: 300 seconds)
15:32 🔗 PuppyCock has joined #archiveteam
15:32 🔗 rspn has quit IRC (Ping timeout: 255 seconds)
15:48 🔗 Start has joined #archiveteam
15:48 🔗 PuppyCock has quit IRC (Ping timeout: 300 seconds)
15:54 🔗 WapCapLet has joined #archiveteam
16:56 🔗 vitzli has quit IRC (Leaving)
17:01 🔗 JesseW has joined #archiveteam
17:04 🔗 VADemon has joined #archiveteam
17:05 🔗 swebb My AlJazeraAmerica crawl is 4GB in and still rolling.
17:07 🔗 Start has quit IRC (Quit: Disconnected.)
17:08 🔗 MrRadar How many requests has it captured so far?
17:09 🔗 JesseW has quit IRC (Quit: Leaving.)
17:11 🔗 swebb https://www.evernote.com/l/ACnalPqDNFNIu73LL1oFZR5COgQgEbyEp64
17:12 🔗 MrRadar Nice
17:12 🔗 swebb Yup. Nicely chugging along. :)
17:12 🔗 arkiver I'm not sure if it's grabbing everything
17:13 🔗 arkiver what are your seed URLs?
17:13 🔗 swebb http://america.aljazeera.com/
17:14 🔗 MrRadar Does Heritrix also use the sitemap?
17:14 🔗 swebb I believe so - and it honors robots.txt
17:14 🔗 swebb Here's the 3rd party domains that it's also hitting from the crawl report: https://www.evernote.com/l/ACkp16aTeyxLxJEABrMhAXIhWR5EbjgC9hc
17:38 🔗 JesseW has joined #archiveteam
17:57 🔗 JesseW has quit IRC (Quit: Leaving.)
17:58 🔗 SketchCow FOS is on track to fill.
17:59 🔗 SketchCow I killed off a few things, and the load average is (slowly) dropping, but in the time it takes to make and upload a chunk, people are just uploading so much.
18:01 🔗 notjack has joined #archiveteam
18:05 🔗 notjack Hey everyone, I'm back for the time being because work has ended.
18:06 🔗 notjack Just wondering if anyone's tracking SoundCloud, because it's been reported as in financial trouble, and contains lots of music, a large amount of which is not available elsewhere on the internet.
18:07 🔗 MrRadar Yep, head over to #soundclown
18:07 🔗 MrRadar We're planning to do a discovery crawl and possibly grab the most popular tracks
18:07 🔗 MrRadar Since SoundCloud is too big even for us to grab everything
18:08 🔗 ersi "even for us" - cute :)
18:09 🔗 SmileyG lol
18:09 🔗 antomatic We should develop a new scheme to turn data files into high speed barcodes, which can be uploaded and stored forever on the Infinite Storage System that is YouTube. :)
18:10 🔗 arkiver hmm
18:10 🔗 HCross antomatic, encoding data into video - interesting
18:11 🔗 SmileyG hmmm :D
18:11 🔗 MrRadar Back in the 80s and 90s there were cards that let you back up data to VHS tapes
18:11 🔗 SmileyG can we just upload all the audio to youtube, privately? :D
18:12 🔗 antomatic I had one, MrRadar. Was an interesting old thing.
18:12 🔗 ersi Maybe time to move over to #archiveteam-bs
18:12 🔗 SmileyG Nod
18:12 🔗 antomatic SmileyG: Doesn't stop it being contentID flagged, sadly. :)
18:12 🔗 antomatic [nods]
18:14 🔗 lunG has joined #archiveteam
18:24 🔗 SketchCow I took the rsync connections count for FOS down to 100 for a while.
18:35 🔗 xekc has joined #archiveteam
18:36 🔗 xekc WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
18:36 🔗 schbirid why
18:37 🔗 SketchCow Oh, let him in.
18:37 🔗 Atluxity xekc: yahoosucks
18:37 🔗 SketchCow HUZZAH, KNAVE, THE WORD IS YAHOOSUCKS
18:37 🔗 xekc thanks! :)
18:45 🔗 Stilett0 has quit IRC (Read error: Operation timed out)
18:46 🔗 SketchCow Do us proud
18:46 🔗 phuzion SketchCow: How long has the secret word been yahoosucks?
18:47 🔗 xmc since forever
18:47 🔗 xmc why?
18:47 🔗 arkiver because yahoo sucked forever
18:47 🔗 phuzion Because they're still living up to that.
18:47 🔗 xmc it's been that since we started having antispam on the wiki
18:47 🔗 xmc so, basically forever
18:48 🔗 xmc remember, archiveteam more or less started with geocities
18:48 🔗 Start has joined #archiveteam
18:48 🔗 xmc we have always been at war with eurasia
18:48 🔗 phuzion Yeah, true.
18:48 🔗 phuzion -> -bs
18:50 🔗 beardicus has quit IRC (bye)
18:54 🔗 beardicus has joined #archiveteam
18:58 🔗 beardicus has quit IRC (Client Quit)
19:00 🔗 mismatch has joined #archiveteam
19:01 🔗 beardicus has joined #archiveteam
19:01 🔗 phuzion has quit IRC (Remote host closed the connection)
19:02 🔗 phuzion has joined #archiveteam
19:02 🔗 Start has quit IRC (Quit: Disconnected.)
19:11 🔗 HCross NewsBuddy is back, and ERROR FREE!
19:12 🔗 bzc6p has left
19:17 🔗 joepie91 famous last words
19:25 🔗 Start has joined #archiveteam
19:28 🔗 xekc added a section in the wiki on FreeFeed.net - hope didn't write anything too stupid there http://www.archiveteam.org/index.php?title=FriendFeed#Archiving
19:36 🔗 RichardG has quit IRC (Ping timeout: 492 seconds)
19:41 🔗 SketchCow I deleted a 330gb project I was working on. That puts it at 84% filled, with 1.2tb of space free.
19:46 🔗 Stiletto has joined #archiveteam
19:54 🔗 signius chfoo, it's looking like the rsync server is maxing out on connections, preventing the friendsreunited grab from being able to upload data
19:55 🔗 signius @ERROR: max connections (100) reached -- try again later
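
A client that hits this error can wait and retry rather than give up. The sketch below shows one common way to do that, exponential backoff. It is an assumption-laden illustration and not how the warrior's own upload step works: `upload_with_backoff`, its parameters, and the `attempt` callable (which stands in for running rsync and returning its exit code and stderr text) are all hypothetical names.

```python
import time

# Substring of the rsync daemon error quoted above.
MAX_CONN_MARKER = "max connections"


def upload_with_backoff(attempt, max_tries=5, base_delay=30.0, sleep=time.sleep):
    """Call attempt() until it succeeds or max_tries is used up.

    attempt: hypothetical callable returning (exit_code, stderr_text).
    Returns True on success, False if the server stayed full every time.
    Any failure other than the connection limit is raised immediately.
    """
    for n in range(max_tries):
        exit_code, stderr_text = attempt()
        if exit_code == 0:
            return True
        if MAX_CONN_MARKER not in stderr_text:
            raise RuntimeError(f"upload failed: {stderr_text.strip()}")
        if n < max_tries - 1:
            # Double the wait after each refusal: 30s, 60s, 120s, ...
            sleep(base_delay * (2 ** n))
    return False
```

Passing `sleep` in as a parameter keeps the function easy to exercise without real waiting. In practice some random jitter on the delay would help avoid many clients retrying at the same moment.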
20:02 🔗 RichardG has joined #archiveteam
20:12 🔗 SketchCow HI.
20:12 🔗 SketchCow --------------------------------------------------------------------------------------
20:12 🔗 SketchCow HERE IS THE DEAL. FOS IS OVERLOADED BECAUSE WE ARE DOING 5 MAJOR THINGS TO IT.
20:12 🔗 SketchCow RSYNC IS SET TO 100 CONNECTIONS WHILE (HOPEFULLY) IT DOESN'T RUN OUT OF DISK SPACE
20:13 🔗 SketchCow BUT HOO BOY, WE ARE AT 84 PERCENT FULL AND 1.2T LEFT ON THE DRIVE AND YOU GUYS ARE
20:13 🔗 SketchCow BLOWING THAT SHIT OUT OF THE WATER
20:13 🔗 SketchCow --------------------------------------------------------------------------------------
20:14 🔗 SketchCow I have to travel now. I will periodically check but the chances of a nightmare scenario are non-zero
20:14 🔗 SketchCow In the future, we need to see about a buffer box
20:14 🔗 SketchCow (again)
20:14 🔗 SketchCow And then it either does the work itself or it gives it to FOS later.
20:14 🔗 SketchCow FOS is an OK machine, but it's a VM and it hates everybody
20:15 🔗 phuzion SketchCow: Want me to see if I can track down someone who can offer a few TB of space in the meantime as an alternative rsync target?
20:15 🔗 SketchCow It better be a crapload
20:15 🔗 SketchCow coordinate with arkiver
20:15 🔗 phuzion okie dokie
20:18 🔗 xekc has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
20:18 🔗 weles I'm new to running ATWarrior.. I'm getting No item received... after selecting "current project". is that normal?
20:20 🔗 Atluxity weles: WELCOME!
20:20 🔗 Atluxity what project is current project?
20:21 🔗 SimpBrain sometimes there is downtime and projects will either end or get paused, so sometimes you'll hit no items received
20:21 🔗 Atluxity yeah
20:21 🔗 weles looks like gametrailers is selected
20:21 🔗 Atluxity weles: http://tracker.archiveteam.org/gametrailers/
20:21 🔗 Atluxity this is the tracker view
20:22 🔗 Atluxity top left it says "items"
20:22 🔗 Atluxity on the "to do" is 0
20:22 🔗 Atluxity so no items left in tracker
20:23 🔗 weles thanks.. is any other project currently active?
20:23 🔗 weles or what's the priority
20:23 🔗 Atluxity fotolog, it looks to me
20:23 🔗 Atluxity no...
20:23 🔗 Atluxity thats not running, it looks
20:24 🔗 Atluxity I dont know, weles
20:24 🔗 Atluxity hang around
20:24 🔗 SimpBrain fos has been capped so some projects have slowed down
20:24 🔗 phuzion weles: If you're looking to put your warrior to use for the time being, feel free to set it to URLTeam
20:25 🔗 weles phuzion: i tried that but i'm getting 404 , no items available :)
20:26 🔗 phuzion oh right
20:26 🔗 * phuzion checks the dashboard
20:28 🔗 weles btw. does the archiveteam archive its irc channel? are there any logs anywhere?
20:29 🔗 HCross http://www.pcworld.com/article/3032490/internet/new-sourceforge-owners-kill-contentious-devshare-bloatware-program.html oooh err
20:31 🔗 jut has joined #archiveteam
20:38 🔗 MrRadar weles: http://archive.fart.website/bin/irclogger_logs
20:38 🔗 weles MrRadar: thanks
20:39 🔗 SimpBrain i wouldnt trust new sourceforge owners
20:40 🔗 swebb I have the old archive here: https://badcheese.com/~steve/atlogs/
20:40 🔗 BubuAnabe has joined #archiveteam
20:43 🔗 atlogbot has joined #archiveteam
20:44 🔗 MrRadar SimpBrain: I wouldn't either. They also own SlashDot which we should probably grab as well if we ever decide to go after SF
20:44 🔗 SimpBrain we partly did, till we were told not to
20:45 🔗 Start has quit IRC (Quit: Disconnected.)
20:46 🔗 swebb BTW, there is already an archive.org archive in the wayback machine for america.aljazeera.com: https://web.archive.org/web/*/america.aljazeera.com
20:46 🔗 swebb and it's updated daily from the looks of things.
20:46 🔗 swebb Same with slashdot: https://web.archive.org/web/*/http://slashdot.org
20:48 🔗 signius has quit IRC (Read error: Operation timed out)
20:49 🔗 swebb Same with Soundcloud.
20:51 🔗 swebb aljazeera and slashdot look fantastically covered in the wayback machine, but soundcloud doesn't - probably because of the amount of javascript and whatnot - soundcloud is probably a single-page-app which is not easy for a traditional crawler to grab.
20:55 🔗 BubuAnabe Google's closing Picasa in favor of Google Photos. http://googlephotos.blogspot.com.ar/2016/02/moving-on-from-picasa.html
20:56 🔗 SimpBrain oh joy
20:56 🔗 antomatic ruh oh
20:57 🔗 swebb If you have photos or videos in a Picasa Web Album today, the easiest way to still access, modify and share most of that content is to log in to Google Photos, and all your photos and videos will already be there.
20:57 🔗 SimpBrain nothing being nuked
20:57 🔗 swebb Looks like they're merging
20:58 🔗 swebb Which is what they did when they shut down Google Video too, so nothing to worry about.
20:58 🔗 antomatic ISTR that they didn't have a migration plan for Google Video at first, though..
20:58 🔗 MrRadar Didn't it take a fair bit of cajoling to get them to fold the Google Video library into YouTube instead of deleting it?
20:59 🔗 antomatic Wasn't it all "Get your shit before this date, no we won't upload it to youtube for you"?
20:59 🔗 signius has joined #archiveteam
20:59 🔗 swebb Yea, initially they were going to nuke it all, but there was enough pushback that they switched gears.
20:59 🔗 antomatic I'm certain Sketchcow makes that exact point in one of his talks
20:59 🔗 weles has quit IRC (Read error: Operation timed out)
20:59 🔗 antomatic ArchiveTeam made Google look bad, questions raised internally, policy changed
20:59 🔗 antomatic result.
21:00 🔗 swebb Google customers probably complained too.
21:00 🔗 MrRadar That's what the AT page for Google Video says http://archiveteam.org/index.php?title=Google_Video
21:00 🔗 antomatic [nods]
21:04 🔗 Nemo_bis SimpBrain: still, Picasa Web is a source for freely licensed images which 1) will disappear in most cases, 2) once migrated to Google Photos lose their copyright license marking.
21:06 🔗 mismatch has quit IRC (Ping timeout: 260 seconds)
21:11 🔗 WinterFox has joined #archiveteam
21:11 🔗 W1nterFox has joined #archiveteam
21:11 🔗 WinterFox has quit IRC (Read error: Connection reset by peer)
21:11 🔗 W1nterFox has quit IRC (Client Quit)
21:12 🔗 WinterFox has joined #archiveteam
21:31 🔗 megaminxw has joined #archiveteam
21:46 🔗 jut has quit IRC (Read error: Connection reset by peer)
22:08 🔗 PuppyCock has joined #archiveteam
22:09 🔗 WapCapLet has quit IRC (Read error: Operation timed out)
22:15 🔗 espes__ has quit IRC (Read error: Operation timed out)
22:15 🔗 fpoee does anyone know what's up with archiving Quora?
22:22 🔗 SN4T14 has joined #archiveteam
22:22 🔗 SN4T14 has quit IRC (Connection closed)
22:25 🔗 Stiletto has quit IRC (Read error: Connection reset by peer)
22:26 🔗 Stiletto has joined #archiveteam
22:32 🔗 casdr has joined #archiveteam
22:47 🔗 wyatt8740 has quit IRC (Read error: Operation timed out)
23:02 🔗 ivan` fpoee: it's pretty annoying because they deploy IP bans for crawling too many pages/sec, where it isn't very many pages at all
23:02 🔗 ivan` they might also do them manually, it's hard to tell
23:10 🔗 Stiletto has quit IRC (Read error: Operation timed out)
23:11 🔗 fpoee can it even be crawled properly? it seems pages are loaded in parts
23:13 🔗 fpoee maybe i'm lucky that my ips didnt get banned yet. my crawl is running at about 30GB now
23:14 🔗 fpoee does anyone crawl it?
23:14 🔗 fpoee *did or does anyone else?
23:19 🔗 ivan` ArchiveBot has partially crawled it a few times
23:19 🔗 ivan` yeah, you need some kind of browser engine to get all the content and nobody has done that afaik
23:20 🔗 ivan` but there is plenty of static HTML
23:20 🔗 ivan` you can try it in Firefox with NoScript blocking JS
23:24 🔗 fpoee its a very messy site to grab. how big have those crawls gotten?
23:25 🔗 fpoee i guess that was a long time ago
23:27 🔗 nickname_ has quit IRC (Ping timeout: 300 seconds)
23:39 🔗 nickname_ has joined #archiveteam
23:41 🔗 [phire] has quit IRC (Quit: ZNC - http://znc.in)
23:50 🔗 SN4T14 has joined #archiveteam
23:50 🔗 SN4T14 has quit IRC (Connection closed)
23:53 🔗 [phire] has joined #archiveteam
