#archiveteam 2015-12-20,Sun

Time	Nickname	Message
00:02 ^🔗	Martini	I think we need more noise on Twitter. RT #IATelethon. Let's send them to the YouTube live page until they fix telethon.archive.org
00:12 ^🔗	Martini	https://www.youtube.com/watch?v=UM71NPrb5iM
00:27 ^🔗	JesseW	Martini: I'm trying to post links to neat things on the archive...
00:27 ^🔗	JesseW	along with the hashtag
00:35 ^🔗	DFJustin	telethon.archive.org is fixed
00:40 ^🔗	Martini	Thanks.
00:40 ^🔗	Martini	http://telethon.archive.org/ is working again.
00:55 ^🔗		Ghost_of_ has joined #archiveteam
01:13 ^🔗		asdf has joined #archiveteam
01:22 ^🔗		aaaaaaaaa has joined #archiveteam
01:22 ^🔗		swebb sets mode: +o aaaaaaaaa
02:04 ^🔗		parker_ has quit IRC (Remote host closed the connection)
02:05 ^🔗		parker_ has joined #archiveteam
02:19 ^🔗		Froggypwn has quit IRC (Ping timeout: 311 seconds)
02:29 ^🔗		nertzy has joined #archiveteam
02:38 ^🔗		parker_ has quit IRC (Remote host closed the connection)
02:38 ^🔗		parker_ has joined #archiveteam
02:43 ^🔗		parker_ has quit IRC (Remote host closed the connection)
02:44 ^🔗		parker_ has joined #archiveteam
02:46 ^🔗		nd1ddy has quit IRC (Read error: Connection reset by peer)
02:48 ^🔗		parker_ has quit IRC (Remote host closed the connection)
02:49 ^🔗		parker_ has joined #archiveteam
02:59 ^🔗		ndiddy has joined #archiveteam
03:04 ^🔗		asdf has quit IRC (Ping timeout: 378 seconds)
03:09 ^🔗		Martini has quit IRC (Quit: ChatZilla 0.9.92 [Firefox 43.0.1/20151216175450])
03:15 ^🔗		Froggypwn has joined #archiveteam
03:44 ^🔗		godane has quit IRC (Ping timeout: 311 seconds)
03:46 ^🔗		godane has joined #archiveteam
03:50 ^🔗		DDR has quit IRC (Remote host closed the connection)
03:55 ^🔗		godane has quit IRC (Leaving.)
03:55 ^🔗		godane has joined #archiveteam
04:09 ^🔗		nertzy has quit IRC (Quit: This computer has gone to sleep)
04:09 ^🔗		Ghost_of_ has quit IRC (Quit: Leaving)
04:24 ^🔗		nertzy has joined #archiveteam
04:28 ^🔗		aaaaaaaaa has quit IRC (Leaving)
04:39 ^🔗		ndiddy has quit IRC (Read error: Connection reset by peer)
05:56 ^🔗		nertzy has quit IRC (Quit: This computer has gone to sleep)
06:09 ^🔗		nertzy has joined #archiveteam
06:30 ^🔗		asdf has joined #archiveteam
07:22 ^🔗		Ungstein has quit IRC (Quit: Leaving.)
07:39 ^🔗		vitzli has joined #archiveteam
08:03 ^🔗		BlueMaxim has quit IRC (Read error: Connection reset by peer)
08:11 ^🔗		VADemon has quit IRC (left4dead)
08:19 ^🔗		Boppen has quit IRC (Read error: Connection reset by peer)
08:19 ^🔗		Boppen has joined #archiveteam
08:37 ^🔗		nertzy has quit IRC (Quit: This computer has gone to sleep)
08:37 ^🔗		JesseW has quit IRC (Leaving.)
09:18 ^🔗		schbirid has joined #archiveteam
09:25 ^🔗		asdf has quit IRC (Ping timeout: 252 seconds)
14:15 ^🔗		Muad-Dib has joined #archiveteam
14:16 ^🔗		WinterFox has quit IRC (Remote host closed the connection)
14:41 ^🔗		Froggypwn has quit IRC (Ping timeout: 483 seconds)
14:45 ^🔗		Froggypwn has joined #archiveteam
15:08 ^🔗		signius has quit IRC (Ping timeout: 364 seconds)
15:15 ^🔗		VADemon has joined #archiveteam
15:17 ^🔗		Atom__ has quit IRC (Atom__)
15:23 ^🔗		Froggypwn has quit IRC (Ping timeout: 483 seconds)
15:26 ^🔗		Froggypwn has joined #archiveteam
15:57 ^🔗		alberto has joined #archiveteam
16:00 ^🔗		vitzli has quit IRC (Quit: Leaving)
16:21 ^🔗	arkiver	Me and HCross have been working for some days on a newsgrabber.
16:21 ^🔗	arkiver	The dashboard can be viewed here http://newsgrabber.harrycross.me:29000/
16:21 ^🔗	HCross	Sites can be submitted here: https://github.com/ArchiveTeam/NewsGrabber
16:30 ^🔗	arkiver	So feel free to read the readme and make a pull request for your news websites!
16:30 ^🔗	HCross	At the moment it doesn't automagically sync to the server for archive, but ping me when you add one and I'll copy it down
16:43 ^🔗		Ghost_of_ has joined #archiveteam
16:47 ^🔗	HCross	you can watch it underway now
16:49 ^🔗	arkiver	Basically what the system does
16:49 ^🔗	arkiver	For every newssite you want to add you have to add a small python file
16:50 ^🔗	arkiver	this file contains the URLs it will recheck with a specified interval for new URLs
16:51 ^🔗	arkiver	the file also contains some regexes to match whether the URL is a news article or a video URL
16:51 ^🔗	arkiver	if it's a video URL it will be downloaded with youtube-dl
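As a rough illustration of the per-site file arkiver describes above, a service module might look something like the sketch below. Only the general idea (seed URLs, a recheck interval, article and video regexes) and the name `videoregex` appear in this log; `refresh`, `urls`, `regex`, the example domain, and the helper `classify` are assumptions made for illustration, not the confirmed NewsGrabber format.

```python
import re

# Hypothetical service file for an imaginary site, news.example.com.
# Field names other than `videoregex` are assumptions, not a confirmed schema.
refresh = 300  # seconds between rechecks of the seed URLs (assumed unit)
urls = ["http://news.example.com/latest"]  # seed pages checked for new links
regex = [r"^https?://news\.example\.com/article/\d+"]  # news article URLs
videoregex = [r"^https?://news\.example\.com/video/\d+"]  # video URLs


def classify(url):
    """Return 'video', 'article', or None for a discovered URL."""
    if any(re.search(p, url) for p in videoregex):
        return "video"  # per the log, these would be handed to youtube-dl
    if any(re.search(p, url) for p in regex):
        return "article"
    return None
```

Checking video patterns first means a URL that matches both lists is treated as a video, which is one possible design choice rather than documented behaviour.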
17:11 ^🔗	Atluxity	does the newsgrabber have its own channel?
17:11 ^🔗	HCross	Not yet
17:12 ^🔗	Atluxity	the news-site I am trying to submit has both rss for "top items" and "latest". Include both or just "latest"?
17:13 ^🔗	arkiver	That would be just latest
17:13 ^🔗	Atluxity	ok
17:13 ^🔗	arkiver	Just add a good refresh time so it won't miss any articles
17:13 ^🔗	HCross	The grabber has gone down for a second to update the script
17:28 ^🔗	Atluxity	this freaking site has no structure! grrrr
17:29 ^🔗	Atluxity	"latest" is small news bulletins... articles are "top items" only
17:30 ^🔗	Atluxity	nothing in the URL tells whether the page has video in it or not
17:31 ^🔗	HCross	Do most of the pages in that site have videos?
17:34 ^🔗	Atluxity	nah
17:34 ^🔗	Atluxity	that would be a stretch
17:35 ^🔗	arkiver	If you have multiple URLs it has to check for new URLs, you can add multiple
17:36 ^🔗	arkiver	Always try to add as few URLs as possible, but still get all articles
17:36 ^🔗	Atluxity	yeah, I understand
17:51 ^🔗		JesseW has joined #archiveteam
17:53 ^🔗		ndiddy has joined #archiveteam
17:59 ^🔗		signius has joined #archiveteam
18:03 ^🔗		atomotic has joined #archiveteam
18:03 ^🔗	joepie91	arkiver: HCross: been thinking for a while about something like that, good to see it happening
18:03 ^🔗	joepie91	:p
18:04 ^🔗	arkiver	joepie91: feel free to add as many websites as you can :)
18:04 ^🔗		Amitari has joined #archiveteam
18:04 ^🔗		RichardG has quit IRC (Read error: Connection reset by peer)
18:05 ^🔗	Amitari	Hey, anyone who knows wget that can help me?
18:05 ^🔗	joepie91	arkiver: how does one test it?
18:05 ^🔗	joepie91	also, dashboard shows nothing
18:05 ^🔗	arkiver	joepie91: it checks for new links every now and then
18:05 ^🔗	arkiver	and downloads the list of found new links every hour
18:06 ^🔗	arkiver	There's not many websites, so that's why it often doesn't show downloads
18:06 ^🔗	arkiver	joepie91: read the instructions please
18:07 ^🔗	arkiver	Instructions and looking at other items shows how everything works I think
18:07 ^🔗	arkiver	scripts will be made public later maybe
18:07 ^🔗	joepie91	arkiver: yes, I've read the instructions. it does not answer my question :)
18:08 ^🔗	joepie91	and eh, scripts should be public straightaway
18:08 ^🔗	HCross	joepie91, we are changing the code every half an hour at this point
18:08 ^🔗	joepie91	(also, checks every hour? it's not uncommon for controversial articles to be removed faster than that)
18:08 ^🔗	joepie91	HCross: ok?
18:09 ^🔗	HCross	Ye. When it's more developed we are going to consider releasing
18:09 ^🔗	joepie91	"consider releasing"?
18:09 ^🔗	joepie91	and why does that have to wait until "when it's more developed"?
18:09 ^🔗	arkiver	yeah I'll put it online
18:09 ^🔗	arkiver	I do want to keep this on one server for now though
18:10 ^🔗	joepie91	HCross: see also https://web.archive.org/web/20150429004351/http://blog.civiccommons.org/2011/01/be-open-from-day-one
18:10 ^🔗		RichardG has joined #archiveteam
18:10 ^🔗	HCross	So we don't get overlap. We don't want 100 people all archiving BBC news at the same time for example
18:10 ^🔗	Atluxity	I need help with a regex for the newsgrabber
18:10 ^🔗	joepie91	HCross: that is unrelated to releasing code.
18:10 ^🔗	Atluxity	videoregex should match on subdomain "tv"
18:11 ^🔗	joepie91	if you don't want people doing that, then put in the readme that you don't want people doing that
18:11 ^🔗	joepie91	making the code available, in this case, is a safety mechanism so that if you get hit by a bus, somebody can pick it up
18:11 ^🔗	HCross	True
18:12 ^🔗	arkiver	3 north korean websites added!
18:12 ^🔗	HCross	When the scripts get updated. - doing that now
18:12 ^🔗	joepie91	basically, if you want people to use it carefully, just ask them to do so. don't immediately resort to the option of "force" (ie. keeping the code unavailable to them)
18:15 ^🔗	HCross	True, it's in very early days right now
18:15 ^🔗	HCross	godane, do we have any news on the Cryengine stuff?
18:15 ^🔗	arkiver	joepie91: yeah, we get it
18:16 ^🔗	Amitari	Anyone who can help me with wget? When I try to save a cookie before archiving a PhpBB-forum, I get the message "Remote file exists and could contain further links,
18:16 ^🔗	Amitari	but recursion is disabled -- not retrieving.
18:16 ^🔗	Amitari	"
18:19 ^🔗	arkiver	Atluxity: I'm off for some time now, can I help you later?
18:20 ^🔗	HCross	Well, the north korean websites crashed on me
18:20 ^🔗	Atluxity	arkiver: sure
18:23 ^🔗	Atluxity	https://github.com/atluxity/NewsGrabber/blob/master/services/web_nrk_no.py
18:23 ^🔗	Atluxity	they split up in so many urls :\
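For the `videoregex` question above, one possible pattern that matches only URLs on a `tv` subdomain is sketched here. The host `tv.nrk.no` is inferred from the service file name and is an assumption, as are the names `VIDEO_PATTERN` and `is_video_url`.

```python
import re

# Assumed pattern: matches http(s) URLs whose host is exactly tv.nrk.no.
# The trailing (/|$) stops look-alike hosts such as tv.nrk.no.example.com.
VIDEO_PATTERN = re.compile(r"^https?://tv\.nrk\.no(/|$)")


def is_video_url(url):
    """Return True if the URL sits on the assumed tv subdomain."""
    return VIDEO_PATTERN.search(url) is not None
```

Escaping the dots matters: an unescaped `.` would also match any other character in that position.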
18:42 ^🔗	joepie91	HCross: arkiver: do you want example URLs for some of the BBC's older and newer formats?
18:42 ^🔗	joepie91	some are still in use for specials
18:42 ^🔗	joepie91	others only for historical articles
18:42 ^🔗	joepie91	(they don't migrate - they just leave the old content where it is)
18:43 ^🔗	HCross	we have the BBC news stuff already, we are more about going after the breaking news. I don't see why not though
18:43 ^🔗	joepie91	HCross: the BBC uses more than one format
18:43 ^🔗	joepie91	including very fancy highly multimedial ones
18:43 ^🔗	HCross	ah. Go on then
18:43 ^🔗	joepie91	:p
18:44 ^🔗	Amitari	Hey, could anyone here possibly help me with wget?
18:45 ^🔗	joepie91	HCross: http://news.bbc.co.uk/2/hi/health/406713.stm, http://www.bbc.co.uk/news/resources/idt-07eeeebb-d450-4e4b-98d4-755369be7855 / http://www.bbc.com/news/special/2014/newsspec_7617/index.html, http://www.bbc.com/news/world-europe-25190119, http://www.bbc.co.uk/newsbeat/24449861, http://www.bbc.com/future/story/20131112-potato-power-to-light-the-world, http://www.bbc.co.uk/blogs/adamcurtis/posts/BUGGER, http://news.bbc.co.uk/2/hi/science/nature/
18:45 ^🔗	joepie91	630961.stm, http://news.bbc.co.uk/2/hi/uk_news/england/manchester/3758209.stm, http://www.bbc.co.uk/music/reviews/9gvh
18:45 ^🔗	joepie91	err
18:46 ^🔗	joepie91	the cut-off one is http://news.bbc.co.uk/2/hi/science/nature/630961.stm
18:46 ^🔗	joepie91	these are all slightly different URL/content formats
18:46 ^🔗	joepie91	for different types of content
18:46 ^🔗	joepie91	most of these are still in use
18:46 ^🔗	joepie91	the .stm ones are legacy, no longer in use but still referenced
18:47 ^🔗	joepie91	the news/resources, news/special and BBC future ones are likely to have JS-loaded content
18:47 ^🔗	joepie91	Amitari: probably best to ask in #archiveteam-bs
18:47 ^🔗	Amitari	Thanks!
18:47 ^🔗		Amitari has left Leaving
18:48 ^🔗	HCross	joepie91, thanks. cc arkiver
18:48 ^🔗	joepie91	HCross: arkiver: also, keep in mind that nutech is on a different domain from nu.nl, and their articles are not consistently listed on nu.nl
18:48 ^🔗	joepie91	idem for rtlz/editienl and rtl.nl
18:48 ^🔗		SN4T14 has quit IRC (Read error: Operation timed out)
18:48 ^🔗		SN4T14 has joined #archiveteam
18:49 ^🔗	joepie91	webwereld is also one worth looking into, but they also cross-post across multiple sites but not reliably
18:49 ^🔗	joepie91	same for infoworld/pcworld
18:49 ^🔗	JesseW	urlteam tracker seems to be borked for now
18:50 ^🔗	arkiver	joepie91: https://github.com/ArchiveTeam/NewsGrabber/blob/master/services/web__bbc_com.py
18:50 ^🔗	arkiver	please have a look at those services
18:51 ^🔗	arkiver	and if you want anything added you can write a python file for it
18:52 ^🔗	joepie91	arkiver: I don't have much time right now (or rather, until after 32C3), hence sharing the knowledge :)
18:52 ^🔗	joepie91	plus I need some way to test things
18:52 ^🔗	arkiver	just test if the regex matches the URLs you want to extract from your seed URLs
18:53 ^🔗	JesseW	arkiver: could you look at the server logs on the urlteam tracker -- it seems to be broken
18:53 ^🔗	joepie91	regardless, no time for PRs atm
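The testing approach arkiver suggests above (check that the regex matches the URLs you want to extract from the seed pages) can be tried offline with a sketch like the following. It runs against a hard-coded HTML snippet rather than a live page; the pattern, the sample markup, and the names `LinkCollector` and `matching_links` are assumptions for illustration and are not part of the NewsGrabber scripts.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the seed URL.
                    self.links.append(urljoin(self.base_url, value))


def matching_links(html, base_url, patterns):
    """Return the links in `html` that match at least one regex."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [u for u in collector.links
            if any(re.search(p, u) for p in patterns)]


# Stand-in for a downloaded seed page (hypothetical markup).
sample_html = ('<a href="/news/world-europe-25190119">story</a> '
               '<a href="/sport">sport</a>')
# Assumed article pattern, modelled on one of the BBC URLs quoted earlier.
patterns = [r"^https?://www\.bbc\.com/news/[a-z-]+-\d+$"]
print(matching_links(sample_html, "http://www.bbc.com/news", patterns))
```

Swapping `sample_html` for the saved source of a real seed page gives a quick view of which links a candidate regex would pick up and which it would miss.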
19:01 ^🔗	arkiver	Atluxity: commented
19:03 ^🔗	arkiver	JesseW: I think chfoo has to do that
19:04 ^🔗	JesseW	ah, ok
19:04 ^🔗	JesseW	xmc: do you have access?
19:10 ^🔗		scyther has joined #archiveteam
19:38 ^🔗		atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
19:38 ^🔗		schbirid has quit IRC (Quit: Leaving)
19:50 ^🔗		brayden_ has quit IRC (Read error: Connection reset by peer)
19:50 ^🔗		brayden has joined #archiveteam
19:50 ^🔗		swebb sets mode: +o brayden
19:51 ^🔗	Atluxity	arkiver: ack
20:00 ^🔗	Start	it seems that rather than having 1 rss feed cbc has a whole bunch: http://www.cbc.ca/rss/
20:01 ^🔗		maseck has quit IRC (Remote host closed the connection)
20:04 ^🔗	godane	joepie91: i'm saving those bbc news urls
20:05 ^🔗	godane	example: http://news.bbc.co.uk/2/hi/630961.stm
20:05 ^🔗	godane	you can just brute force
20:11 ^🔗		schbirid has joined #archiveteam
20:19 ^🔗		JesseW has quit IRC (Leaving.)
20:25 ^🔗		alberto has quit IRC (Ping timeout: 250 seconds)
20:25 ^🔗		JesseW has joined #archiveteam
20:34 ^🔗		Ghost_of_ has quit IRC (Quit: Leaving)
20:38 ^🔗		JesseW has quit IRC (Leaving.)
20:41 ^🔗		maseck has joined #archiveteam
21:02 ^🔗		xXx_ndidd has joined #archiveteam
21:08 ^🔗		Coderjoe has quit IRC (Read error: Connection reset by peer)
21:09 ^🔗		ndiddy has quit IRC (Read error: Operation timed out)
21:14 ^🔗		Coderjoe has joined #archiveteam
21:33 ^🔗		schbirid has quit IRC (Quit: Leaving)
21:50 ^🔗		Ghost_of_ has joined #archiveteam
21:55 ^🔗	Atluxity	arkiver: updated
21:56 ^🔗		JesseW has joined #archiveteam
22:26 ^🔗		JesseW has quit IRC (Leaving.)
22:30 ^🔗		scyther has quit IRC (Read error: Connection reset by peer)
22:44 ^🔗		closure has joined #archiveteam
22:45 ^🔗		nertzy has joined #archiveteam
23:05 ^🔗		err3 has joined #archiveteam
23:05 ^🔗	err3	hello
23:07 ^🔗		nertzy has quit IRC (Quit: This computer has gone to sleep)
23:10 ^🔗	Atluxity	GREETINGS!
23:10 ^🔗	err3	I've got an idea for an archiving project
23:10 ^🔗	err3	just in case anyone likes it
23:11 ^🔗	Atluxity	lay it on us
23:11 ^🔗	err3	there's some good forums where people post math problems and solutions, e.g. artofproblemsolving
23:11 ^🔗	err3	just went to it after a long time and it had totally changed, I got a shock that maybe they removed all of the old stuff - apparently they haven't
23:11 ^🔗	err3	but it might be good to somehow make an archive of it
23:12 ^🔗	err3	I'm not sure if it would need some special scripting to do
23:12 ^🔗	Atluxity	got some urls?
23:14 ^🔗	err3	https://www.artofproblemsolving.com/community is it now
23:14 ^🔗	err3	https://web.archive.org/web/20130201150755/http://www.artofproblemsolving.com/Forum/index.php used to look like this
23:15 ^🔗	err3	let me get a better one
23:15 ^🔗	Atluxity	wonder how big these sites are... probably not too big
23:16 ^🔗	err3	they might not be too large, the important thing is the text (although sometimes equations get rendered into images)
23:16 ^🔗	err3	https://web.archive.org/web/20130510031806/http://www.artofproblemsolving.com/Forum/index.php
23:16 ^🔗	err3	that's how I remember it
23:17 ^🔗	err3	https://web.archive.org/web/20140331091424/http://www.artofproblemsolving.com/Forum/viewforum.php?f=56
23:17 ^🔗	err3	i think a lot of the posts are not archived
23:29 ^🔗		RichardG_ has joined #archiveteam
23:29 ^🔗		RichardG has quit IRC (Read error: Connection reset by peer)
23:35 ^🔗		Ghost_of_ has quit IRC (Quit: Leaving)
23:42 ^🔗		WinterFox has joined #archiveteam
23:44 ^🔗	HCross	For the newsgrab, when you submit, please check the file naming.
23:48 ^🔗	HCross	it's web__foo_bar_com.py
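The naming pattern HCross gives above could be checked before submitting with a small helper along these lines. The rule used here (a `web__` prefix, dots in the domain replaced by underscores, a `.py` suffix) is inferred from `web__foo_bar_com.py` and `web__bbc_com.py` in this log; the function names are hypothetical, and how domains containing hyphens should be named is not covered by the log.

```python
import re


def service_filename(domain):
    """Build a service file name from a domain, under the assumed rule."""
    return "web__" + domain.lower().replace(".", "_") + ".py"


def looks_well_named(filename):
    """Check a file name against the assumed web__foo_bar_com.py shape."""
    pattern = r"web__[a-z0-9]+(_[a-z0-9]+)+\.py"
    return re.fullmatch(pattern, filename) is not None
```

A name with a single underscore after `web`, such as `web_nrk_no.py`, fails this check.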
