[00:02] I think we need more noise on Twitter. RT #IATelethon. Let's send them to the YouTube live page until they fix telethon.archive.org
[00:12] https://www.youtube.com/watch?v=UM71NPrb5iM
[00:27] Martini: I'm trying to post links to neat things on the archive...
[00:27] along with the hashtag
[00:35] telethon.archive.org is fixed
[00:40] Thanks.
[00:40] http://telethon.archive.org/ is working again.
[00:55] *** Ghost_of_ has joined #archiveteam
[01:13] *** asdf has joined #archiveteam
[01:22] *** aaaaaaaaa has joined #archiveteam
[01:22] *** swebb sets mode: +o aaaaaaaaa
[02:04] *** parker_ has quit IRC (Remote host closed the connection)
[02:05] *** parker_ has joined #archiveteam
[02:19] *** Froggypwn has quit IRC (Ping timeout: 311 seconds)
[02:29] *** nertzy has joined #archiveteam
[02:38] *** parker_ has quit IRC (Remote host closed the connection)
[02:38] *** parker_ has joined #archiveteam
[02:43] *** parker_ has quit IRC (Remote host closed the connection)
[02:44] *** parker_ has joined #archiveteam
[02:46] *** nd1ddy has quit IRC (Read error: Connection reset by peer)
[02:48] *** parker_ has quit IRC (Remote host closed the connection)
[02:49] *** parker_ has joined #archiveteam
[02:59] *** ndiddy has joined #archiveteam
[03:04] *** asdf has quit IRC (Ping timeout: 378 seconds)
[03:09] *** Martini has quit IRC (Quit: ChatZilla 0.9.92 [Firefox 43.0.1/20151216175450])
[03:15] *** Froggypwn has joined #archiveteam
[03:44] *** godane has quit IRC (Ping timeout: 311 seconds)
[03:46] *** godane has joined #archiveteam
[03:50] *** DDR has quit IRC (Remote host closed the connection)
[03:55] *** godane has quit IRC (Leaving.)
[03:55] *** godane has joined #archiveteam
[04:09] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[04:09] *** Ghost_of_ has quit IRC (Quit: Leaving)
[04:24] *** nertzy has joined #archiveteam
[04:28] *** aaaaaaaaa has quit IRC (Leaving)
[04:39] *** ndiddy has quit IRC (Read error: Connection reset by peer)
[05:56] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[06:09] *** nertzy has joined #archiveteam
[06:30] *** asdf has joined #archiveteam
[07:22] *** Ungstein has quit IRC (Quit: Leaving.)
[07:39] *** vitzli has joined #archiveteam
[08:03] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[08:11] *** VADemon has quit IRC (left4dead)
[08:19] *** Boppen has quit IRC (Read error: Connection reset by peer)
[08:19] *** Boppen has joined #archiveteam
[08:37] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[08:37] *** JesseW has quit IRC (Leaving.)
[09:18] *** schbirid has joined #archiveteam
[09:25] *** asdf has quit IRC (Ping timeout: 252 seconds)
[14:15] *** Muad-Dib has joined #archiveteam
[14:16] *** WinterFox has quit IRC (Remote host closed the connection)
[14:41] *** Froggypwn has quit IRC (Ping timeout: 483 seconds)
[14:45] *** Froggypwn has joined #archiveteam
[15:08] *** signius has quit IRC (Ping timeout: 364 seconds)
[15:15] *** VADemon has joined #archiveteam
[15:17] *** Atom__ has quit IRC (Atom__)
[15:23] *** Froggypwn has quit IRC (Ping timeout: 483 seconds)
[15:26] *** Froggypwn has joined #archiveteam
[15:57] *** alberto has joined #archiveteam
[16:00] *** vitzli has quit IRC (Quit: Leaving)
[16:21] HCross and I have been working for some days on a newsgrabber.
[16:21] The dashboard can be viewed here: http://newsgrabber.harrycross.me:29000/
[16:21] Sites can be submitted here: https://github.com/ArchiveTeam/NewsGrabber
[16:30] So feel free to read the readme and make a pull request for your news websites!
[16:30] At the moment it doesn't automagically sync to the server for archival, but ping me when you add one and I'll copy it down
[16:43] *** Ghost_of_ has joined #archiveteam
[16:47] you can watch it underway now
[16:49] Basically, what the system does:
[16:49] for every news site you want to add, you have to add a small Python file
[16:50] this file contains the URLs it will recheck at a specified interval for new URLs
[16:51] the file also contains some regexes to match whether a URL is a news article or a video URL
[16:51] if it's a video URL it will be downloaded with youtube-dl
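A minimal sketch of what such a per-site service file might look like, based only on the description above. The attribute names (refresh, urls, articleregex, videoregex) and the example.com URLs are assumptions; mirror an existing file under services/ in the repo for the real format.

```python
# Hypothetical service file, e.g. services/web__example_com.py.
# Structure and names are illustrative only; copy an existing
# service in the ArchiveTeam/NewsGrabber repo rather than this sketch.

refresh = 300  # seconds between rechecks of the seed URLs (assumed name)

# Seed URLs that are periodically polled for new links.
urls = [
    'http://www.example.com/',
    'http://www.example.com/rss.xml',
]

# Regexes classifying discovered URLs.
articleregex = [
    r'^https?://www\.example\.com/news/\d{4}/\d{2}/[^/]+$',
]
videoregex = [
    r'^https?://tv\.example\.com/',  # video URLs are handed to youtube-dl
]
```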
[17:11] does the newsgrabber have its own channel?
[17:11] Not yet
[17:12] the news site I am trying to submit has both RSS for "top items" and "latest". Include both or just "latest"?
[17:13] That would be just latest
[17:13] ok
[17:13] Just add a good refresh time so it won't miss any articles
[17:13] The grabber has gone down for a second to update the script
[17:28] this freaking site has no structure! grrrr
[17:29] "latest" is small news bulletins... articles are "top items" only
[17:30] nothing in the URL tells you whether the page has video in it or not
[17:31] Do most of the pages on that site have videos?
[17:34] nah
[17:34] that would be a stretch
[17:35] If there are multiple URLs it has to check for new URLs, you can add multiple
[17:36] Always try to add as few URLs as possible, but still get all articles
[17:36] yeah, I understand
[17:51] *** JesseW has joined #archiveteam
[17:53] *** ndiddy has joined #archiveteam
[17:59] *** signius has joined #archiveteam
[18:03] *** atomotic has joined #archiveteam
[18:03] arkiver: HCross: been thinking for a while about something like that, good to see it happening
[18:03] :p
[18:04] joepie91: feel free to add as many websites as you can :)
[18:04] *** Amitari has joined #archiveteam
[18:04] *** RichardG has quit IRC (Read error: Connection reset by peer)
[18:05] Hey, anyone who knows wget that can help me?
[18:05] arkiver: how does one test it?
[18:05] also, dashboard shows nothing
[18:05] joepie91: it checks for new links every now and then
[18:05] and downloads the list of found new links every hour
[18:06] There aren't many websites yet, so that's why it often doesn't show downloads
[18:06] joepie91: read the instructions please
[18:07] Instructions and looking at other items show how everything works, I think
[18:07] scripts will be made public later, maybe
[18:07] arkiver: yes, I've read the instructions. it does not answer my question :)
[18:08] and eh, scripts should be public straight away
[18:08] joepie91, we are changing the code every half an hour at this point
[18:08] (also, checks every hour? it's not uncommon for controversial articles to be removed faster than that)
[18:08] HCross: ok?
[18:09] Yeah. When it's more developed we are going to consider releasing
[18:09] "consider releasing"?
[18:09] and why does that have to wait until "when it's more developed"?
[18:09] yeah, I'll put it online
[18:09] I do want to keep this on one server for now though
[18:10] HCross: see also https://web.archive.org/web/20150429004351/http://blog.civiccommons.org/2011/01/be-open-from-day-one
[18:10] *** RichardG has joined #archiveteam
[18:10] So we don't get overlap. We don't want 100 people all archiving BBC News at the same time, for example
[18:10] I need help with a regex for the newsgrabber
[18:10] HCross: that is unrelated to releasing code.
[18:10] videoregex should match on subdomain "tv"
[18:11] if you don't want people doing that, then put in the readme that you don't want people doing that
[18:11] making the code available, in this case, is a safety mechanism so that if you get hit by a bus, somebody can pick it up
[18:11] True
[18:12] 3 North Korean websites added!
[18:12] When the scripts get updated. - doing that now
[18:12] basically, if you want people to use it carefully, just *ask* them to do so. don't immediately resort to the option of "force" (i.e. keeping the code unavailable to them)
[18:15] True, it's in very early days right now
[18:15] godane, do we have any news on the CryEngine stuff?
[18:15] joepie91: yeah, we get it
[18:16] Anyone who can help me with wget? When I try to save a cookie before archiving a phpBB forum, I get the message "Remote file exists and could contain further links, but recursion is disabled -- not retrieving."
[18:19] Atluxity: I'm off for some time now, can I help you later?
[18:20] Well, the North Korean websites crashed on me
[18:20] arkiver: sure
[18:23] https://github.com/atluxity/NewsGrabber/blob/master/services/web_nrk_no.py
[18:23] they split up into so many URLs :\
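Atluxity's ask above — a videoregex that only matches the "tv" subdomain — comes down to a small anchored pattern. A quick way to sanity-check it before opening a pull request, as a sketch; the sample URLs below are made up for illustration, not real NRK pages:

```python
import re

# Illustrative videoregex matching only the "tv" subdomain of nrk.no.
videoregex = re.compile(r'^https?://tv\.nrk\.no/')

# Made-up sample URLs paired with the expected match result.
samples = [
    ('http://tv.nrk.no/serie/some-series/episode-1', True),
    ('http://www.nrk.no/norge/some-article-1.1234567', False),
]
for url, expected in samples:
    assert bool(videoregex.match(url)) == expected, url
print('videoregex behaves as expected on the samples')
```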
[18:42] HCross: arkiver: do you want example URLs for some of the BBC's older and newer formats?
[18:42] some are still in use for specials
[18:42] others only for historical articles
[18:42] (they don't migrate - they just leave the old content where it is)
[18:43] we have the BBC News stuff already, we are more about going after the breaking news. I don't see why not though
[18:43] HCross: the BBC uses more than one format
[18:43] including very fancy, highly multimedia ones
[18:43] ah. Go on then
[18:43] :p
[18:44] Hey, could anyone here possibly help me with wget?
[18:45] HCross: http://news.bbc.co.uk/2/hi/health/406713.stm, http://www.bbc.co.uk/news/resources/idt-07eeeebb-d450-4e4b-98d4-755369be7855 / http://www.bbc.com/news/special/2014/newsspec_7617/index.html, http://www.bbc.com/news/world-europe-25190119, http://www.bbc.co.uk/newsbeat/24449861, http://www.bbc.com/future/story/20131112-potato-power-to-light-the-world, http://www.bbc.co.uk/blogs/adamcurtis/posts/BUGGER, http://news.bbc.co.uk/2/hi/science/nature/
[18:45] 630961.stm, http://news.bbc.co.uk/2/hi/uk_news/england/manchester/3758209.stm, http://www.bbc.co.uk/music/reviews/9gvh
[18:45] err
[18:46] the cut-off one is http://news.bbc.co.uk/2/hi/science/nature/630961.stm
[18:46] these are all slightly different URL/content formats
[18:46] for different types of content
[18:46] most of these are still in use
[18:46] the .stm ones are legacy, no longer in use but still referenced
[18:47] the news/resources, news/special and BBC Future ones are likely to have JS-loaded content
[18:47] Amitari: probably best to ask in #archiveteam-bs
[18:47] Thanks!
[18:47] *** Amitari has left Leaving
[18:48] joepie91, thanks. cc arkiver
[18:48] HCross: arkiver: also, keep in mind that nutech is on a different domain from nu.nl, and their articles are not consistently listed on nu.nl
[18:48] idem for rtlz/editienl and rtl.nl
[18:48] *** SN4T14 has quit IRC (Read error: Operation timed out)
[18:48] *** SN4T14 has joined #archiveteam
[18:49] webwereld is also one worth looking into, but they also cross-post across multiple sites, not reliably
[18:49] same for infoworld/pcworld
[18:49] urlteam tracker seems to be borked for now
[18:50] joepie91: https://github.com/ArchiveTeam/NewsGrabber/blob/master/services/web__bbc_com.py
[18:50] please have a look at those services
[18:51] and if you want anything added you can write a Python file for it
[18:52] arkiver: I don't have much time right now (or rather, until after 32C3), hence sharing the knowledge :)
[18:52] plus I need some way to test things
[18:52] just test whether the regex matches the URLs you want to extract from your seed URLs
[18:53] arkiver: could you look at the server logs on the urlteam tracker -- it seems to be broken
[18:53] regardless, no time for PRs atm
[19:01] Atluxity: commented
[19:03] JesseW: I think chfoo has to do that
[19:04] ah, ok
[19:04] xmc: do you have access?
[19:10] *** scyther has joined #archiveteam
[19:38] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[19:38] *** schbirid has quit IRC (Quit: Leaving)
[19:50] *** brayden_ has quit IRC (Read error: Connection reset by peer)
[19:50] *** brayden has joined #archiveteam
[19:50] *** swebb sets mode: +o brayden
[19:51] arkiver: ack
[20:00] it seems that rather than having 1 RSS feed, CBC has a whole bunch: http://www.cbc.ca/rss/
[20:01] *** maseck has quit IRC (Remote host closed the connection)
[20:04] joepie91: i'm saving those bbc news urls
[20:05] example: http://news.bbc.co.uk/2/hi/630961.stm
[20:05] you can just brute force
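godane's "just brute force" suggestion works because the trailing number in the legacy .stm URLs looks like a sequential story ID. A rough sketch of what that enumeration might look like; the URL pattern is taken from the example above, but the ID range is a tiny demo slice, not a verified bound:

```python
import urllib.request
import urllib.error

# Sketch of brute-forcing legacy news.bbc.co.uk .stm articles by
# enumerating candidate story IDs and probing each URL.
BASE = 'http://news.bbc.co.uk/2/hi/{}.stm'

def exists(story_id):
    """Return True if the candidate URL answers a HEAD request with 200."""
    req = urllib.request.Request(BASE.format(story_id), method='HEAD')
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.getcode() == 200
    except (urllib.error.HTTPError, urllib.error.URLError):
        return False

for story_id in range(630955, 630965):  # demo slice around the example ID
    if exists(story_id):
        print(BASE.format(story_id))
```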
[20:11] *** schbirid has joined #archiveteam
[20:19] *** JesseW has quit IRC (Leaving.)
[20:25] *** alberto has quit IRC (Ping timeout: 250 seconds)
[20:25] *** JesseW has joined #archiveteam
[20:34] *** Ghost_of_ has quit IRC (Quit: Leaving)
[20:38] *** JesseW has quit IRC (Leaving.)
[20:41] *** maseck has joined #archiveteam
[21:02] *** xXx_ndidd has joined #archiveteam
[21:08] *** Coderjoe has quit IRC (Read error: Connection reset by peer)
[21:09] *** ndiddy has quit IRC (Read error: Operation timed out)
[21:14] *** Coderjoe has joined #archiveteam
[21:33] *** schbirid has quit IRC (Quit: Leaving)
[21:50] *** Ghost_of_ has joined #archiveteam
[21:55] arkiver: updated
[21:56] *** JesseW has joined #archiveteam
[22:26] *** JesseW has quit IRC (Leaving.)
[22:30] *** scyther has quit IRC (Read error: Connection reset by peer)
[22:44] *** closure has joined #archiveteam
[22:45] *** nertzy has joined #archiveteam
[23:05] *** err3 has joined #archiveteam
[23:05] hello
[23:07] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[23:10] GREETINGS!
[23:10] I've got an idea for an archiving project
[23:10] just in case anyone likes it
[23:11] lay it on us
[23:11] there are some good forums where people post math problems and solutions, e.g. artofproblemsolving
[23:11] just went to it after a long time and it had totally changed, I got a shock that maybe they removed all of the old stuff - apparently they haven't
[23:11] but it might be good to somehow make an archive of it
[23:12] I'm not sure if it would need some special scripting to do
[23:12] got some URLs?
[23:14] https://www.artofproblemsolving.com/community is it now
[23:14] https://web.archive.org/web/20130201150755/http://www.artofproblemsolving.com/Forum/index.php used to look like this
[23:15] let me get a better one
[23:15] wonder how big these sites are... probably not too big
[23:16] they might not be too large, the important thing is the text (although sometimes equations get rendered into images)
[23:16] https://web.archive.org/web/20130510031806/http://www.artofproblemsolving.com/Forum/index.php
[23:16] that's how I remember it
[23:17] https://web.archive.org/web/20140331091424/http://www.artofproblemsolving.com/Forum/viewforum.php?f=56
[23:17] I think a lot of the posts are not archived
[23:29] *** RichardG_ has joined #archiveteam
[23:29] *** RichardG has quit IRC (Read error: Connection reset by peer)
[23:35] *** Ghost_of_ has quit IRC (Quit: Leaving)
[23:42] *** WinterFox has joined #archiveteam
[23:44] For the newsgrab, when you submit, please check the file naming.
[23:48] it's web__foo_bar_com.py
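arkiver's closing note pins down the naming rule: dots in the hostname become underscores, with a web__ prefix. A tiny sketch of that rule; the helper function is hypothetical, not part of the repo:

```python
# Hypothetical helper illustrating the naming rule stated above:
# the service file for foo.bar.com is web__foo_bar_com.py.
def service_filename(domain):
    return 'web__{}.py'.format(domain.replace('.', '_'))

assert service_filename('foo.bar.com') == 'web__foo_bar_com.py'
assert service_filename('bbc.com') == 'web__bbc_com.py'  # matches services/web__bbc_com.py
```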