#archiveteam 2015-12-20,Sun


Time Nickname Message
00:02 🔗 Martini I think we need more noise on Twitter. RT #IATelethon. Let's send them to the YouTube live page until they fix telethon.archive.org
00:12 🔗 Martini https://www.youtube.com/watch?v=UM71NPrb5iM
00:27 🔗 JesseW Martini: I'm trying to post links to neat things on the archive...
00:27 🔗 JesseW along with the hashtag
00:35 🔗 DFJustin telethon.archive.org is fixed
00:40 🔗 Martini Thanks.
00:40 🔗 Martini http://telethon.archive.org/ is working again.
00:55 🔗 Ghost_of_ has joined #archiveteam
01:13 🔗 asdf has joined #archiveteam
01:22 🔗 aaaaaaaaa has joined #archiveteam
01:22 🔗 swebb sets mode: +o aaaaaaaaa
02:04 🔗 parker_ has quit IRC (Remote host closed the connection)
02:05 🔗 parker_ has joined #archiveteam
02:19 🔗 Froggypwn has quit IRC (Ping timeout: 311 seconds)
02:29 🔗 nertzy has joined #archiveteam
02:38 🔗 parker_ has quit IRC (Remote host closed the connection)
02:38 🔗 parker_ has joined #archiveteam
02:43 🔗 parker_ has quit IRC (Remote host closed the connection)
02:44 🔗 parker_ has joined #archiveteam
02:46 🔗 nd1ddy has quit IRC (Read error: Connection reset by peer)
02:48 🔗 parker_ has quit IRC (Remote host closed the connection)
02:49 🔗 parker_ has joined #archiveteam
02:59 🔗 ndiddy has joined #archiveteam
03:04 🔗 asdf has quit IRC (Ping timeout: 378 seconds)
03:09 🔗 Martini has quit IRC (Quit: ChatZilla 0.9.92 [Firefox 43.0.1/20151216175450])
03:15 🔗 Froggypwn has joined #archiveteam
03:44 🔗 godane has quit IRC (Ping timeout: 311 seconds)
03:46 🔗 godane has joined #archiveteam
03:50 🔗 DDR has quit IRC (Remote host closed the connection)
03:55 🔗 godane has quit IRC (Leaving.)
03:55 🔗 godane has joined #archiveteam
04:09 🔗 nertzy has quit IRC (Quit: This computer has gone to sleep)
04:09 🔗 Ghost_of_ has quit IRC (Quit: Leaving)
04:24 🔗 nertzy has joined #archiveteam
04:28 🔗 aaaaaaaaa has quit IRC (Leaving)
04:39 🔗 ndiddy has quit IRC (Read error: Connection reset by peer)
05:56 🔗 nertzy has quit IRC (Quit: This computer has gone to sleep)
06:09 🔗 nertzy has joined #archiveteam
06:30 🔗 asdf has joined #archiveteam
07:22 🔗 Ungstein has quit IRC (Quit: Leaving.)
07:39 🔗 vitzli has joined #archiveteam
08:03 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
08:11 🔗 VADemon has quit IRC (left4dead)
08:19 🔗 Boppen has quit IRC (Read error: Connection reset by peer)
08:19 🔗 Boppen has joined #archiveteam
08:37 🔗 nertzy has quit IRC (Quit: This computer has gone to sleep)
08:37 🔗 JesseW has quit IRC (Leaving.)
09:18 🔗 schbirid has joined #archiveteam
09:25 🔗 asdf has quit IRC (Ping timeout: 252 seconds)
14:15 🔗 Muad-Dib has joined #archiveteam
14:16 🔗 WinterFox has quit IRC (Remote host closed the connection)
14:41 🔗 Froggypwn has quit IRC (Ping timeout: 483 seconds)
14:45 🔗 Froggypwn has joined #archiveteam
15:08 🔗 signius has quit IRC (Ping timeout: 364 seconds)
15:15 🔗 VADemon has joined #archiveteam
15:17 🔗 Atom__ has quit IRC (Atom__)
15:23 🔗 Froggypwn has quit IRC (Ping timeout: 483 seconds)
15:26 🔗 Froggypwn has joined #archiveteam
15:57 🔗 alberto has joined #archiveteam
16:00 🔗 vitzli has quit IRC (Quit: Leaving)
16:21 🔗 arkiver Me and HCross have been working for some days on a newsgrabber.
16:21 🔗 arkiver The dashboard can be viewed here http://newsgrabber.harrycross.me:29000/
16:21 🔗 HCross Sites can be submitted here: https://github.com/ArchiveTeam/NewsGrabber
16:30 🔗 arkiver So feel free to read the readme and make a pull request for your news websites!
16:30 🔗 HCross At the moment it doesn't automagically sync to the server for archiving, but ping me when you add one and I'll copy it down
16:43 🔗 Ghost_of_ has joined #archiveteam
16:47 🔗 HCross you can watch it underway now
16:49 🔗 arkiver Basically what the system does
16:49 🔗 arkiver For every newssite you want to add you have to add a small python file
16:50 🔗 arkiver this file contains the URLs it will recheck with a specified interval for new URLs
16:51 🔗 arkiver the file also contains some regexes to match whether the URL is a news article or a video URL
16:51 🔗 arkiver if it's a videoURL it will be downloaded with youtube-dl
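
A minimal sketch of what such a per-site file could contain, based only on arkiver's description above. The field names (`refresh`, `urls`, `regex`, `videoregex`), the unit of the refresh interval, the example.com URLs and the `classify` helper are all assumptions for illustration, not the actual NewsGrabber format; the README in the repository is the authority.

```python
import re

# Hypothetical service definition; every name below is an assumption.
refresh = 300  # interval between rechecks of the seed URLs (seconds assumed)
urls = ["http://www.example.com/latest"]               # seed URLs to recheck
regex = [r"^https?://www\.example\.com/news/[0-9]+"]   # news article URLs
videoregex = [r"^https?://www\.example\.com/video/"]   # video URLs


def classify(url):
    """Return 'video', 'article', or None for a newly discovered URL."""
    if any(re.search(pattern, url) for pattern in videoregex):
        return "video"    # would be handed to youtube-dl
    if any(re.search(pattern, url) for pattern in regex):
        return "article"  # would be queued for a normal page grab
    return None           # not something this service wants


print(classify("http://www.example.com/news/12345"))
print(classify("http://www.example.com/video/clip"))
```

The video patterns are checked first here so that a URL matching both lists is treated as video; whether the real grabber orders the checks this way is not stated in the log.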
17:11 🔗 Atluxity does the newsgrabber have its own channel?
17:11 🔗 HCross Not yet
17:12 🔗 Atluxity the news-site I am trying to submit has both rss for "top items" and "latest". Include both or just "latest"?
17:13 🔗 arkiver That would be just latest
17:13 🔗 Atluxity ok
17:13 🔗 arkiver Just add a good refresh time so it won't miss any articles
17:13 🔗 HCross The grabber has gone down for a second to update the script
17:28 🔗 Atluxity this freaking site has no structure! grrrr
17:29 🔗 Atluxity "latest" is small news bulletins... articles are "top items" only
17:30 🔗 Atluxity nothing in the URL tells whether the page has video in it or not
17:31 🔗 HCross Do most of the pages in that site have videos?
17:34 🔗 Atluxity nah
17:34 🔗 Atluxity that would be a stretch
17:35 🔗 arkiver If you have multiple URLs it has to check for new URLs, you can add multiple
17:36 🔗 arkiver Always try to add as few URLs as possible, but still get all articles
17:36 🔗 Atluxity yeah, I understand
17:51 🔗 JesseW has joined #archiveteam
17:53 🔗 ndiddy has joined #archiveteam
17:59 🔗 signius has joined #archiveteam
18:03 🔗 atomotic has joined #archiveteam
18:03 🔗 joepie91 arkiver: HCross: been thinking for a while about something like that, good to see it happening
18:03 🔗 joepie91 :p
18:04 🔗 arkiver joepie91: feel free to add as many websites as you can :)
18:04 🔗 Amitari has joined #archiveteam
18:04 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
18:05 🔗 Amitari Hey, anyone who knows wget that can help me?
18:05 🔗 joepie91 arkiver: how does one test it?
18:05 🔗 joepie91 also, dashboard shows nothing
18:05 🔗 arkiver joepie91: it checks for new links every now and then
18:05 🔗 arkiver and downloads the list of found new links every hour
18:06 🔗 arkiver There's not many websites, so that's why it often doesn't show downloads
18:06 🔗 arkiver joepie91: read the instructions please
18:07 🔗 arkiver Instructions and looking at other items shows how everything works I think
18:07 🔗 arkiver scripts will be made public later maybe
18:07 🔗 joepie91 arkiver: yes, I've read the instructions. it does not answer my question :)
18:08 🔗 joepie91 and eh, scripts should be public straightaway
18:08 🔗 HCross joepie91, we are changing the code every half an hour at this point
18:08 🔗 joepie91 (also, checks every hour? it's not uncommon for controversial articles to be removed faster than that)
18:08 🔗 joepie91 HCross: ok?
18:09 🔗 HCross Ye. When its more developed we are going to consider releasing
18:09 🔗 joepie91 "consider releasing"?
18:09 🔗 joepie91 and why does that have to wait until "when its more developed"?
18:09 🔗 arkiver yeah I'll put it online
18:09 🔗 arkiver I do want to keep this on one server for now though
18:10 🔗 joepie91 HCross: see also https://web.archive.org/web/20150429004351/http://blog.civiccommons.org/2011/01/be-open-from-day-one
18:10 🔗 RichardG has joined #archiveteam
18:10 🔗 HCross So we don't get overlap. We don't want 100 people all archiving BBC news at the same time, for example
18:10 🔗 Atluxity I need help with a regex for the newsgrabber
18:10 🔗 joepie91 HCross: that is unrelated to releasing code.
18:10 🔗 Atluxity videoregex should match on subdomain "tv"
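
One possible shape for such a pattern, as a sketch only. The host `tv.nrk.no` is an assumption inferred from the "tv" subdomain mentioned here and the `web_nrk_no.py` service file linked later in the log; the example paths are made up.

```python
import re

# Assumed pattern: match only URLs whose host is exactly tv.nrk.no.
# The trailing (/|$) stops hosts like tv.nrk.no.example.org from matching.
videoregex = re.compile(r"^https?://tv\.nrk\.no(/|$)")

candidates = [
    "https://tv.nrk.no/serie/example",    # hypothetical video page
    "https://www.nrk.no/norge/example",   # hypothetical regular article
]
for url in candidates:
    print(url, bool(videoregex.match(url)))
```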
18:11 🔗 joepie91 if you don't want people doing that, then put in the readme that you don't want people doing that
18:11 🔗 joepie91 making the code available, in this case, is a safety mechanism so that if you get hit by a bus, somebody can pick it up
18:11 🔗 HCross True
18:12 🔗 arkiver 3 north korean websites added!
18:12 🔗 HCross When the scripts get updated. - doing that now
18:12 🔗 joepie91 basically, if you want people to use it carefully, just *ask* them to do so. don't immediately resort to the option of "force" (ie. keeping the code unavailable to them)
18:15 🔗 HCross True, its in very early days right now
18:15 🔗 HCross godane, do we have any news on the Cryengine stuff?
18:15 🔗 arkiver joepie91: yeah, we get it
18:16 🔗 Amitari Anyone who can help me with wget? When I try to save a cookie before archiving a PhpBB-forum, I get the message "Remote file exists and could contain further links, but recursion is disabled -- not retrieving."
18:19 🔗 arkiver Atluxity: I'm off for some time now, can I help you later?
18:20 🔗 HCross Well, the north korean websites crashed on me
18:20 🔗 Atluxity arkiver: sure
18:23 🔗 Atluxity https://github.com/atluxity/NewsGrabber/blob/master/services/web_nrk_no.py
18:23 🔗 Atluxity they split up in so many urls :\
18:42 🔗 joepie91 HCross: arkiver: do you want example URLs for some of the BBC's older and newer formats?
18:42 🔗 joepie91 some are still in use for specials
18:42 🔗 joepie91 others only for historical articles
18:42 🔗 joepie91 (they don't migrate - they just leave the old content where it is)
18:43 🔗 HCross we have the BBC news stuff already, we are more about going after the breaking news. I don't see why not though
18:43 🔗 joepie91 HCross: the BBC uses more than one format
18:43 🔗 joepie91 including very fancy highly multimedial ones
18:43 🔗 HCross ah. Go on then
18:43 🔗 joepie91 :p
18:44 🔗 Amitari Hey, could anyone here possibly help me with wget?
18:45 🔗 joepie91 HCross: http://news.bbc.co.uk/2/hi/health/406713.stm, http://www.bbc.co.uk/news/resources/idt-07eeeebb-d450-4e4b-98d4-755369be7855 / http://www.bbc.com/news/special/2014/newsspec_7617/index.html, http://www.bbc.com/news/world-europe-25190119, http://www.bbc.co.uk/newsbeat/24449861, http://www.bbc.com/future/story/20131112-potato-power-to-light-the-world, http://www.bbc.co.uk/blogs/adamcurtis/posts/BUGGER, http://news.bbc.co.uk/2/hi/science/nature/
18:45 🔗 joepie91 630961.stm, http://news.bbc.co.uk/2/hi/uk_news/england/manchester/3758209.stm, http://www.bbc.co.uk/music/reviews/9gvh
18:45 🔗 joepie91 err
18:46 🔗 joepie91 the cut-off one is http://news.bbc.co.uk/2/hi/science/nature/630961.stm
18:46 🔗 joepie91 these are all slightly different URL/content formats
18:46 🔗 joepie91 for different types of content
18:46 🔗 joepie91 most of these are still in use
18:46 🔗 joepie91 the .stm ones are legacy, no longer in use but still referenced
18:47 🔗 joepie91 the news/resources, news/special and BBC future ones are likely to have JS-loaded content
18:47 🔗 joepie91 Amitari: probably best to ask in #archiveteam-bs
18:47 🔗 Amitari Thanks!
18:47 🔗 Amitari has left Leaving
18:48 🔗 HCross joepie91, thanks. cc arkiver
18:48 🔗 joepie91 HCross: arkiver: also, keep in mind that nutech is on a different domain from nu.nl, and their articles are not consistently listed on nu.nl
18:48 🔗 joepie91 idem for rtlz/editienl and rtl.nl
18:48 🔗 SN4T14 has quit IRC (Read error: Operation timed out)
18:48 🔗 SN4T14 has joined #archiveteam
18:49 🔗 joepie91 webwereld is also one worth looking into, but they cross-post across multiple sites too, and not reliably
18:49 🔗 joepie91 same for infoworld/pcworld
18:49 🔗 JesseW urlteam tracker seems to be borked for now
18:50 🔗 arkiver joepie91: https://github.com/ArchiveTeam/NewsGrabber/blob/master/services/web__bbc_com.py
18:50 🔗 arkiver please have a look at those services
18:51 🔗 arkiver and if you want anything added you can write a python file for it
18:52 🔗 joepie91 arkiver: I don't have much time right now (or rather, until after 32C3), hence sharing the knowledge :)
18:52 🔗 joepie91 plus I need some way to test things
18:52 🔗 arkiver just test if the regex matches the URLs you want to extract from your seed URLs
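
A rough way to run that check locally, as a sketch: pull the links out of a saved copy of a seed page and see which ones the regex keeps. The helper names, the sample HTML and the pattern are assumptions for illustration; only the BBC article URL comes from the examples earlier in the log.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href=...> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def matching_links(html, base_url, pattern):
    """Return the links found in html that match the given regex."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [url for url in collector.links if re.search(pattern, url)]


# Stand-in for a downloaded seed page.
sample = '<a href="/news/world-europe-25190119">a</a><a href="/sport">b</a>'
found = matching_links(sample, "http://www.bbc.com/", r"/news/[a-z-]+-[0-9]+$")
print(found)
```

If an article you expected is missing from the printed list, the pattern is too strict; if section pages show up, it is too loose.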
18:53 🔗 JesseW arkiver: could you look at the server logs on the urlteam tracker -- it seems to be broken
18:53 🔗 joepie91 regardless, no time for PRs atm
19:01 🔗 arkiver Atluxity: commented
19:03 🔗 arkiver JesseW: I think chfoo has to do that
19:04 🔗 JesseW ah, ok
19:04 🔗 JesseW xmc: do you have access?
19:10 🔗 scyther has joined #archiveteam
19:38 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
19:38 🔗 schbirid has quit IRC (Quit: Leaving)
19:50 🔗 brayden_ has quit IRC (Read error: Connection reset by peer)
19:50 🔗 brayden has joined #archiveteam
19:50 🔗 swebb sets mode: +o brayden
19:51 🔗 Atluxity arkiver: ack
20:00 🔗 Start it seems that rather than having 1 rss feed cbc has a whole bunch: http://www.cbc.ca/rss/
20:01 🔗 maseck has quit IRC (Remote host closed the connection)
20:04 🔗 godane joepie91: i'm saving those bbc news urls
20:05 🔗 godane example: http://news.bbc.co.uk/2/hi/630961.stm
20:05 🔗 godane you can just brute force
20:11 🔗 schbirid has joined #archiveteam
20:19 🔗 JesseW has quit IRC (Leaving.)
20:25 🔗 alberto has quit IRC (Ping timeout: 250 seconds)
20:25 🔗 JesseW has joined #archiveteam
20:34 🔗 Ghost_of_ has quit IRC (Quit: Leaving)
20:38 🔗 JesseW has quit IRC (Leaving.)
20:41 🔗 maseck has joined #archiveteam
21:02 🔗 xXx_ndidd has joined #archiveteam
21:08 🔗 Coderjoe has quit IRC (Read error: Connection reset by peer)
21:09 🔗 ndiddy has quit IRC (Read error: Operation timed out)
21:14 🔗 Coderjoe has joined #archiveteam
21:33 🔗 schbirid has quit IRC (Quit: Leaving)
21:50 🔗 Ghost_of_ has joined #archiveteam
21:55 🔗 Atluxity arkiver: updated
21:56 🔗 JesseW has joined #archiveteam
22:26 🔗 JesseW has quit IRC (Leaving.)
22:30 🔗 scyther has quit IRC (Read error: Connection reset by peer)
22:44 🔗 closure has joined #archiveteam
22:45 🔗 nertzy has joined #archiveteam
23:05 🔗 err3 has joined #archiveteam
23:05 🔗 err3 hello
23:07 🔗 nertzy has quit IRC (Quit: This computer has gone to sleep)
23:10 🔗 Atluxity GREETINGS!
23:10 🔗 err3 I've got an idea for archiving project
23:10 🔗 err3 just in case anyone likes it
23:11 🔗 Atluxity lay it on us
23:11 🔗 err3 there's some good forums where people post math problems and solutions, e.g. artofproblemsolving
23:11 🔗 err3 just went to it after a long time and it had totally changed, I got a shock that maybe they removed all of the old stuff - apparently they haven't
23:11 🔗 err3 but it might be good to somehow make an archive of it
23:12 🔗 err3 I'm not sure if it would need some special scripting to do
23:12 🔗 Atluxity got some urls?
23:14 🔗 err3 https://www.artofproblemsolving.com/community is it now
23:14 🔗 err3 https://web.archive.org/web/20130201150755/http://www.artofproblemsolving.com/Forum/index.php used to look like this
23:15 🔗 err3 let me get a better one
23:15 🔗 Atluxity wonder how big these sites are... probably not too big
23:16 🔗 err3 they might not be too large, the important thing is the text (although sometimes equations get rendered into images)
23:16 🔗 err3 https://web.archive.org/web/20130510031806/http://www.artofproblemsolving.com/Forum/index.php
23:16 🔗 err3 thats how i remember it
23:17 🔗 err3 https://web.archive.org/web/20140331091424/http://www.artofproblemsolving.com/Forum/viewforum.php?f=56
23:17 🔗 err3 i think a lot of the posts are not archived
23:29 🔗 RichardG_ has joined #archiveteam
23:29 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
23:35 🔗 Ghost_of_ has quit IRC (Quit: Leaving)
23:42 🔗 WinterFox has joined #archiveteam
23:44 🔗 HCross For the newsgrab, when you submit, please check the file naming.
23:48 🔗 HCross its web__foo_bar_com.py
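
A small sketch of that naming pattern, assuming the rule is simply a `web__` prefix, dots replaced by underscores, and a `.py` suffix. The `service_filename` helper is hypothetical, and the rule is inferred from this example and from `web__bbc_com.py` linked earlier; how hyphens or a leading `www` should be handled is not stated in the log.

```python
def service_filename(domain):
    """Build a service file name from a domain (assumed naming rule)."""
    return "web__" + domain.lower().replace(".", "_") + ".py"


print(service_filename("foo.bar.com"))
print(service_filename("bbc.com"))
```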
