[00:02] I think we need more noise on Twitter. RT #IATelethon. Let's send them to the YouTube live page until they fix telethon.archive.org
[00:12] https://www.youtube.com/watch?v=UM71NPrb5iM
[00:27] Martini: I'm trying to post links to neat things on the archive...
[00:27] along with the hashtag
[00:35] telethon.archive.org is fixed
[00:40] Thanks.
[00:40] http://telethon.archive.org/ is working again.
[00:55] *** Ghost_of_ has joined #archiveteam
[01:13] *** asdf has joined #archiveteam
[01:22] *** aaaaaaaaa has joined #archiveteam
[01:22] *** swebb sets mode: +o aaaaaaaaa
[02:04] *** parker_ has quit IRC (Remote host closed the connection)
[02:05] *** parker_ has joined #archiveteam
[02:19] *** Froggypwn has quit IRC (Ping timeout: 311 seconds)
[02:29] *** nertzy has joined #archiveteam
[02:38] *** parker_ has quit IRC (Remote host closed the connection)
[02:38] *** parker_ has joined #archiveteam
[02:43] *** parker_ has quit IRC (Remote host closed the connection)
[02:44] *** parker_ has joined #archiveteam
[02:46] *** nd1ddy has quit IRC (Read error: Connection reset by peer)
[02:48] *** parker_ has quit IRC (Remote host closed the connection)
[02:49] *** parker_ has joined #archiveteam
[02:59] *** ndiddy has joined #archiveteam
[03:04] *** asdf has quit IRC (Ping timeout: 378 seconds)
[03:09] *** Martini has quit IRC (Quit: ChatZilla 0.9.92 [Firefox 43.0.1/20151216175450])
[03:15] *** Froggypwn has joined #archiveteam
[03:44] *** godane has quit IRC (Ping timeout: 311 seconds)
[03:46] *** godane has joined #archiveteam
[03:50] *** DDR has quit IRC (Remote host closed the connection)
[03:55] *** godane has quit IRC (Leaving.)
[03:55] *** godane has joined #archiveteam
[04:09] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[04:09] *** Ghost_of_ has quit IRC (Quit: Leaving)
[04:24] *** nertzy has joined #archiveteam
[04:28] *** aaaaaaaaa has quit IRC (Leaving)
[04:39] *** ndiddy has quit IRC (Read error: Connection reset by peer)
[05:56] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[06:09] *** nertzy has joined #archiveteam
[06:30] *** asdf has joined #archiveteam
[07:22] *** Ungstein has quit IRC (Quit: Leaving.)
[07:39] *** vitzli has joined #archiveteam
[08:03] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[08:11] *** VADemon has quit IRC (left4dead)
[08:19] *** Boppen has quit IRC (Read error: Connection reset by peer)
[08:19] *** Boppen has joined #archiveteam
[08:37] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[08:37] *** JesseW has quit IRC (Leaving.)
[09:18] *** schbirid has joined #archiveteam
[09:25] *** asdf has quit IRC (Ping timeout: 252 seconds)
[14:15] *** Muad-Dib has joined #archiveteam
[14:16] *** WinterFox has quit IRC (Remote host closed the connection)
[14:41] *** Froggypwn has quit IRC (Ping timeout: 483 seconds)
[14:45] *** Froggypwn has joined #archiveteam
[15:08] *** signius has quit IRC (Ping timeout: 364 seconds)
[15:15] *** VADemon has joined #archiveteam
[15:17] *** Atom__ has quit IRC (Atom__)
[15:23] *** Froggypwn has quit IRC (Ping timeout: 483 seconds)
[15:26] *** Froggypwn has joined #archiveteam
[15:57] *** alberto has joined #archiveteam
[16:00] *** vitzli has quit IRC (Quit: Leaving)
[16:21] HCross and I have been working for some days on a newsgrabber.
[16:21] The dashboard can be viewed here: http://newsgrabber.harrycross.me:29000/
[16:21] Sites can be submitted here: https://github.com/ArchiveTeam/NewsGrabber
[16:30] So feel free to read the readme and make a pull request for your news websites!
[16:30] At the moment it doesn't automagically sync to the server for archival, but ping me when you add one and I'll copy it down
[16:43] *** Ghost_of_ has joined #archiveteam
[16:47] you can watch it underway now
[16:49] Basically, what the system does:
[16:49] for every news site you want to add, you have to add a small Python file
[16:50] this file contains the URLs it will recheck at a specified interval for new URLs
[16:51] the file also contains some regexes to match whether a URL is a news article or a video URL
[16:51] if it's a video URL it will be downloaded with youtube-dl
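A minimal sketch of what such a per-site service file might look like, based only on the description above. The attribute names (refresh, urls, articleregex, videoregex) and the example.com URLs are assumptions; mirror an existing file under services/ in the repo for the real format.

```python
# Hypothetical service file, e.g. services/web__example_com.py.
# Structure and names are illustrative only; copy an existing
# service in the ArchiveTeam/NewsGrabber repo rather than this sketch.

refresh = 300  # seconds between rechecks of the seed URLs (assumed name)

# Seed URLs that are periodically polled for new links.
urls = [
    'http://www.example.com/',
    'http://www.example.com/rss.xml',
]

# Regexes classifying discovered URLs.
articleregex = [
    r'^https?://www\.example\.com/news/\d{4}/\d{2}/[^/]+$',
]
videoregex = [
    r'^https?://tv\.example\.com/',  # video URLs are handed to youtube-dl
]
```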
[17:11] does the newsgrabber have its own channel?
[17:11] Not yet
[17:12] the news site I am trying to submit has both RSS for "top items" and "latest". Include both or just "latest"?
[17:13] That would be just latest
[17:13] ok
[17:13] Just add a good refresh time so it won't miss any articles
[17:13] The grabber has gone down for a second to update the script
[17:28] this freaking site has no structure! grrrr
[17:29] "latest" is small news bulletins... articles are "top items" only
[17:30] nothing in the URL tells you whether the page has video in it or not
[17:31] Do most of the pages on that site have videos?
[17:34] nah
[17:34] that would be a stretch
[17:35] If there are multiple URLs it has to check for new URLs, you can add multiple
[17:36] Always try to add as few URLs as possible, but still get all articles
[17:36] yeah, I understand
[17:51] *** JesseW has joined #archiveteam
[17:53] *** ndiddy has joined #archiveteam
[17:59] *** signius has joined #archiveteam
[18:03] *** atomotic has joined #archiveteam
[18:03] arkiver: HCross: been thinking for a while about something like that, good to see it happening
[18:03] :p
[18:04] joepie91: feel free to add as many websites as you can :)
[18:04] *** Amitari has joined #archiveteam
[18:04] *** RichardG has quit IRC (Read error: Connection reset by peer)
[18:05] Hey, anyone who knows wget that can help me?
[18:05] arkiver: how does one test it?
[18:05] also, dashboard shows nothing
[18:05] joepie91: it checks for new links every now and then
[18:05] and downloads the list of found new links every hour
[18:06] There aren't many websites yet, so that's why it often doesn't show downloads
[18:06] joepie91: read the instructions please
[18:07] Instructions and looking at other items show how everything works, I think
[18:07] scripts will be made public later, maybe
[18:07] arkiver: yes, I've read the instructions. it does not answer my question :)
[18:08] and eh, scripts should be public straight away
[18:08] joepie91, we are changing the code every half an hour at this point
[18:08] (also, checks every hour? it's not uncommon for controversial articles to be removed faster than that)
[18:08] HCross: ok?
[18:09] Yeah. When it's more developed we are going to consider releasing
[18:09] "consider releasing"?
[18:09] and why does that have to wait until "when it's more developed"?
[18:09] yeah, I'll put it online
[18:09] I do want to keep this on one server for now though
[18:10] HCross: see also https://web.archive.org/web/20150429004351/http://blog.civiccommons.org/2011/01/be-open-from-day-one
[18:10] *** RichardG has joined #archiveteam
[18:10] So we don't get overlap. We don't want 100 people all archiving BBC News at the same time, for example
[18:10] I need help with a regex for the newsgrabber
[18:10] HCross: that is unrelated to releasing code.
[18:10] videoregex should match on subdomain "tv"
[18:11] if you don't want people doing that, then put in the readme that you don't want people doing that
[18:11] making the code available, in this case, is a safety mechanism so that if you get hit by a bus, somebody can pick it up
[18:11] True
[18:12] 3 North Korean websites added!
[18:12] When the scripts get updated. - doing that now
[18:12] basically, if you want people to use it carefully, just *ask* them to do so. don't immediately resort to the option of "force" (i.e. keeping the code unavailable to them)
[18:15] True, it's in very early days right now
[18:15] godane, do we have any news on the CryEngine stuff?
[18:15] joepie91: yeah, we get it
[18:16] Anyone who can help me with wget? When I try to save a cookie before archiving a phpBB forum, I get the message "Remote file exists and could contain further links, but recursion is disabled -- not retrieving."
[18:19] Atluxity: I'm off for some time now, can I help you later?
[18:20] Well, the North Korean websites crashed on me
[18:20] arkiver: sure
[18:23] https://github.com/atluxity/NewsGrabber/blob/master/services/web_nrk_no.py
[18:23] they split up into so many URLs :\
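Atluxity's ask above — a videoregex that only matches the "tv" subdomain — comes down to a small anchored pattern. A quick way to sanity-check it before opening a pull request, as a sketch; the sample URLs below are made up for illustration, not real NRK pages:

```python
import re

# Illustrative videoregex matching only the "tv" subdomain of nrk.no.
videoregex = re.compile(r'^https?://tv\.nrk\.no/')

# Made-up sample URLs paired with the expected match result.
samples = [
    ('http://tv.nrk.no/serie/some-series/episode-1', True),
    ('http://www.nrk.no/norge/some-article-1.1234567', False),
]
for url, expected in samples:
    assert bool(videoregex.match(url)) == expected, url
print('videoregex behaves as expected on the samples')
```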
[18:42] HCross: arkiver: do you want example URLs for some of the BBC's older and newer formats?
[18:42] some are still in use for specials
[18:42] others only for historical articles
[18:42] (they don't migrate - they just leave the old content where it is)
[18:43] we have the BBC News stuff already, we are more about going after the breaking news. I don't see why not though
[18:43] HCross: the BBC uses more than one format
[18:43] including very fancy, highly multimedia ones
[18:43] ah. Go on then
[18:43] :p
[18:44] Hey, could anyone here possibly help me with wget?
[18:45] HCross: http://news.bbc.co.uk/2/hi/health/406713.stm, http://www.bbc.co.uk/news/resources/idt-07eeeebb-d450-4e4b-98d4-755369be7855 / http://www.bbc.com/news/special/2014/newsspec_7617/index.html, http://www.bbc.com/news/world-europe-25190119, http://www.bbc.co.uk/newsbeat/24449861, http://www.bbc.com/future/story/20131112-potato-power-to-light-the-world, http://www.bbc.co.uk/blogs/adamcurtis/posts/BUGGER, http://news.bbc.co.uk/2/hi/science/nature/
[18:45] 630961.stm, http://news.bbc.co.uk/2/hi/uk_news/england/manchester/3758209.stm, http://www.bbc.co.uk/music/reviews/9gvh
[18:45] err
[18:46] the cut-off one is http://news.bbc.co.uk/2/hi/science/nature/630961.stm
[18:46] these are all slightly different URL/content formats
[18:46] for different types of content
[18:46] most of these are still in use
[18:46] the .stm ones are legacy, no longer in use but still referenced
[18:47] the news/resources, news/special and BBC Future ones are likely to have JS-loaded content
[18:47] Amitari: probably best to ask in #archiveteam-bs
[18:47] Thanks!
[18:47] *** Amitari has left Leaving
[18:48] joepie91, thanks. cc arkiver
[18:48] HCross: arkiver: also, keep in mind that nutech is on a different domain from nu.nl, and their articles are not consistently listed on nu.nl
[18:48] idem for rtlz/editienl and rtl.nl
[18:48] *** SN4T14 has quit IRC (Read error: Operation timed out)
[18:48] *** SN4T14 has joined #archiveteam
[18:49] webwereld is also one worth looking into, but they also cross-post across multiple sites, not reliably
[18:49] same for infoworld/pcworld
[18:49] urlteam tracker seems to be borked for now
[18:50] joepie91: https://github.com/ArchiveTeam/NewsGrabber/blob/master/services/web__bbc_com.py
[18:50] please have a look at those services
[18:51] and if you want anything added you can write a Python file for it
[18:52] arkiver: I don't have much time right now (or rather, until after 32C3), hence sharing the knowledge :)
[18:52] plus I need some way to test things
[18:52] just test whether the regex matches the URLs you want to extract from your seed URLs
[18:53] arkiver: could you look at the server logs on the urlteam tracker -- it seems to be broken
[18:53] regardless, no time for PRs atm
[19:01] Atluxity: commented
[19:03] JesseW: I think chfoo has to do that
[19:04] ah, ok
[19:04] xmc: do you have access?
[19:10] *** scyther has joined #archiveteam
[19:38] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[19:38] *** schbirid has quit IRC (Quit: Leaving)
[19:50] *** brayden_ has quit IRC (Read error: Connection reset by peer)
[19:50] *** brayden has joined #archiveteam
[19:50] *** swebb sets mode: +o brayden
[19:51] arkiver: ack
[20:00] it seems that rather than having 1 RSS feed, CBC has a whole bunch: http://www.cbc.ca/rss/
[20:01] *** maseck has quit IRC (Remote host closed the connection)
[20:04] joepie91: i'm saving those bbc news urls
[20:05] example: http://news.bbc.co.uk/2/hi/630961.stm
[20:05] you can just brute force
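godane's "just brute force" suggestion works because the trailing number in the legacy .stm URLs looks like a sequential story ID. A rough sketch of what that enumeration might look like; the URL pattern is taken from the example above, but the ID range is a tiny demo slice, not a verified bound:

```python
import urllib.request
import urllib.error

# Sketch of brute-forcing legacy news.bbc.co.uk .stm articles by
# enumerating candidate story IDs and probing each URL.
BASE = 'http://news.bbc.co.uk/2/hi/{}.stm'

def exists(story_id):
    """Return True if the candidate URL answers a HEAD request with 200."""
    req = urllib.request.Request(BASE.format(story_id), method='HEAD')
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.getcode() == 200
    except (urllib.error.HTTPError, urllib.error.URLError):
        return False

for story_id in range(630955, 630965):  # demo slice around the example ID
    if exists(story_id):
        print(BASE.format(story_id))
```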
[20:11] *** schbirid has joined #archiveteam
[20:19] *** JesseW has quit IRC (Leaving.)
[20:25] *** alberto has quit IRC (Ping timeout: 250 seconds)
[20:25] *** JesseW has joined #archiveteam
[20:34] *** Ghost_of_ has quit IRC (Quit: Leaving)
[20:38] *** JesseW has quit IRC (Leaving.)
[20:41] *** maseck has joined #archiveteam
[21:02] *** xXx_ndidd has joined #archiveteam
[21:08] *** Coderjoe has quit IRC (Read error: Connection reset by peer)
[21:09] *** ndiddy has quit IRC (Read error: Operation timed out)
[21:14] *** Coderjoe has joined #archiveteam
[21:33] *** schbirid has quit IRC (Quit: Leaving)
[21:50] *** Ghost_of_ has joined #archiveteam
[21:55] arkiver: updated
[21:56] *** JesseW has joined #archiveteam
[22:26] *** JesseW has quit IRC (Leaving.)
[22:30] *** scyther has quit IRC (Read error: Connection reset by peer)
[22:44] *** closure has joined #archiveteam
[22:45] *** nertzy has joined #archiveteam
[23:05] *** err3 has joined #archiveteam
[23:05] hello
[23:07] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[23:10] GREETINGS!
[23:10] I've got an idea for an archiving project
[23:10] just in case anyone likes it
[23:11] lay it on us
[23:11] there are some good forums where people post math problems and solutions, e.g. artofproblemsolving
[23:11] just went to it after a long time and it had totally changed, I got a shock that maybe they removed all of the old stuff - apparently they haven't
[23:11] but it might be good to somehow make an archive of it
[23:12] I'm not sure if it would need some special scripting to do
[23:12] got some URLs?
[23:14] https://www.artofproblemsolving.com/community is it now
[23:14] https://web.archive.org/web/20130201150755/http://www.artofproblemsolving.com/Forum/index.php used to look like this
[23:15] let me get a better one
[23:15] wonder how big these sites are... probably not too big
[23:16] they might not be too large, the important thing is the text (although sometimes equations get rendered into images)
[23:16] https://web.archive.org/web/20130510031806/http://www.artofproblemsolving.com/Forum/index.php
[23:16] that's how I remember it
[23:17] https://web.archive.org/web/20140331091424/http://www.artofproblemsolving.com/Forum/viewforum.php?f=56
[23:17] I think a lot of the posts are not archived
[23:29] *** RichardG_ has joined #archiveteam
[23:29] *** RichardG has quit IRC (Read error: Connection reset by peer)
[23:35] *** Ghost_of_ has quit IRC (Quit: Leaving)
[23:42] *** WinterFox has joined #archiveteam
[23:44] For the newsgrab, when you submit, please check the file naming.
[23:48] it's web__foo_bar_com.py
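arkiver's closing note pins down the naming rule: dots in the hostname become underscores, with a web__ prefix. A tiny sketch of that rule; the helper function is hypothetical, not part of the repo:

```python
# Hypothetical helper illustrating the naming rule stated above:
# the service file for foo.bar.com is web__foo_bar_com.py.
def service_filename(domain):
    return 'web__{}.py'.format(domain.replace('.', '_'))

assert service_filename('foo.bar.com') == 'web__foo_bar_com.py'
assert service_filename('bbc.com') == 'web__bbc_com.py'  # matches services/web__bbc_com.py
```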