Time |
Nickname |
Message |
00:02
🔗
|
Martini |
I think we need more noise on Twitter. RT #IATelethon. Let's send them to the YouTube live page until they fix telethon.archive.org |
00:12
🔗
|
Martini |
https://www.youtube.com/watch?v=UM71NPrb5iM |
00:27
🔗
|
JesseW |
Martini: I'm trying to post links to neat things on the archive... |
00:27
🔗
|
JesseW |
along with the hashtag |
00:35
🔗
|
DFJustin |
telethon.archive.org is fixed |
00:40
🔗
|
Martini |
Thanks. |
00:40
🔗
|
Martini |
http://telethon.archive.org/ is working again. |
00:55
🔗
|
|
Ghost_of_ has joined #archiveteam |
01:13
🔗
|
|
asdf has joined #archiveteam |
01:22
🔗
|
|
aaaaaaaaa has joined #archiveteam |
01:22
🔗
|
|
swebb sets mode: +o aaaaaaaaa |
02:04
🔗
|
|
parker_ has quit IRC (Remote host closed the connection) |
02:05
🔗
|
|
parker_ has joined #archiveteam |
02:19
🔗
|
|
Froggypwn has quit IRC (Ping timeout: 311 seconds) |
02:29
🔗
|
|
nertzy has joined #archiveteam |
02:38
🔗
|
|
parker_ has quit IRC (Remote host closed the connection) |
02:38
🔗
|
|
parker_ has joined #archiveteam |
02:43
🔗
|
|
parker_ has quit IRC (Remote host closed the connection) |
02:44
🔗
|
|
parker_ has joined #archiveteam |
02:46
🔗
|
|
nd1ddy has quit IRC (Read error: Connection reset by peer) |
02:48
🔗
|
|
parker_ has quit IRC (Remote host closed the connection) |
02:49
🔗
|
|
parker_ has joined #archiveteam |
02:59
🔗
|
|
ndiddy has joined #archiveteam |
03:04
🔗
|
|
asdf has quit IRC (Ping timeout: 378 seconds) |
03:09
🔗
|
|
Martini has quit IRC (Quit: ChatZilla 0.9.92 [Firefox 43.0.1/20151216175450]) |
03:15
🔗
|
|
Froggypwn has joined #archiveteam |
03:44
🔗
|
|
godane has quit IRC (Ping timeout: 311 seconds) |
03:46
🔗
|
|
godane has joined #archiveteam |
03:50
🔗
|
|
DDR has quit IRC (Remote host closed the connection) |
03:55
🔗
|
|
godane has quit IRC (Leaving.) |
03:55
🔗
|
|
godane has joined #archiveteam |
04:09
🔗
|
|
nertzy has quit IRC (Quit: This computer has gone to sleep) |
04:09
🔗
|
|
Ghost_of_ has quit IRC (Quit: Leaving) |
04:24
🔗
|
|
nertzy has joined #archiveteam |
04:28
🔗
|
|
aaaaaaaaa has quit IRC (Leaving) |
04:39
🔗
|
|
ndiddy has quit IRC (Read error: Connection reset by peer) |
05:56
🔗
|
|
nertzy has quit IRC (Quit: This computer has gone to sleep) |
06:09
🔗
|
|
nertzy has joined #archiveteam |
06:30
🔗
|
|
asdf has joined #archiveteam |
07:22
🔗
|
|
Ungstein has quit IRC (Quit: Leaving.) |
07:39
🔗
|
|
vitzli has joined #archiveteam |
08:03
🔗
|
|
BlueMaxim has quit IRC (Read error: Connection reset by peer) |
08:11
🔗
|
|
VADemon has quit IRC (left4dead) |
08:19
🔗
|
|
Boppen has quit IRC (Read error: Connection reset by peer) |
08:19
🔗
|
|
Boppen has joined #archiveteam |
08:37
🔗
|
|
nertzy has quit IRC (Quit: This computer has gone to sleep) |
08:37
🔗
|
|
JesseW has quit IRC (Leaving.) |
09:18
🔗
|
|
schbirid has joined #archiveteam |
09:25
🔗
|
|
asdf has quit IRC (Ping timeout: 252 seconds) |
14:15
🔗
|
|
Muad-Dib has joined #archiveteam |
14:16
🔗
|
|
WinterFox has quit IRC (Remote host closed the connection) |
14:41
🔗
|
|
Froggypwn has quit IRC (Ping timeout: 483 seconds) |
14:45
🔗
|
|
Froggypwn has joined #archiveteam |
15:08
🔗
|
|
signius has quit IRC (Ping timeout: 364 seconds) |
15:15
🔗
|
|
VADemon has joined #archiveteam |
15:17
🔗
|
|
Atom__ has quit IRC (Atom__) |
15:23
🔗
|
|
Froggypwn has quit IRC (Ping timeout: 483 seconds) |
15:26
🔗
|
|
Froggypwn has joined #archiveteam |
15:57
🔗
|
|
alberto has joined #archiveteam |
16:00
🔗
|
|
vitzli has quit IRC (Quit: Leaving) |
16:21
🔗
|
arkiver |
Me and HCross have been working for some days on a newsgrabber. |
16:21
🔗
|
arkiver |
The dashboard can be viewed here http://newsgrabber.harrycross.me:29000/ |
16:21
🔗
|
HCross |
Sites can be submitted here: https://github.com/ArchiveTeam/NewsGrabber |
16:30
🔗
|
arkiver |
So feel free to read the readme and make a pull request for your news websites! |
16:30
🔗
|
HCross |
At the moment it doesn't automagically sync to the server for archive, but ping me when you add one and I'll copy it down |
16:43
🔗
|
|
Ghost_of_ has joined #archiveteam |
16:47
🔗
|
HCross |
you can watch it underway now |
16:49
🔗
|
arkiver |
Basically what the system does |
16:49
🔗
|
arkiver |
For every newssite you want to add you have to add a small python file |
16:50
🔗
|
arkiver |
this file contains the URLs it will recheck at a specified interval for new URLs |
16:51
🔗
|
arkiver |
the file also contains some regexes to match if the URL is a news article or if it is a video URL |
16:51
🔗
|
arkiver |
if it's a videoURL it will be downloaded with youtube-dl |
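A minimal sketch of what such a per-site file and the article/video check might look like. The field names (`refresh_seconds`, `seed_urls`, `article_regexes`, `video_regexes`), the `classify` helper, and the example.com URLs are hypothetical and chosen for illustration only; the actual layout is whatever the NewsGrabber readme specifies.

```python
import re

# Hypothetical service definition. Field names are assumptions for this
# sketch, not the project's actual schema.
SERVICE = {
    # How often (in seconds) the seed URLs are rechecked for new links.
    "refresh_seconds": 300,
    # Pages that are rechecked for newly published URLs.
    "seed_urls": ["http://news.example.com/latest"],
    # Patterns that identify a discovered URL as a news article.
    "article_regexes": [r"^https?://news\.example\.com/\d{4}/\d{2}/[\w-]+$"],
    # Patterns that identify a discovered URL as a video page.
    "video_regexes": [r"^https?://news\.example\.com/video/"],
}


def classify(url, service=SERVICE):
    """Return 'video', 'article', or None for a discovered URL."""
    # Video patterns are checked first so a URL matching both counts as video.
    if any(re.search(p, url) for p in service["video_regexes"]):
        return "video"
    if any(re.search(p, url) for p in service["article_regexes"]):
        return "article"
    return None
```

In this sketch a URL classified as `video` would be handed to youtube-dl, while an `article` URL would go on the list of found links for the regular download.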
17:11
🔗
|
Atluxity |
does the newsgrabber have its own channel? |
17:11
🔗
|
HCross |
Not yet |
17:12
🔗
|
Atluxity |
the news-site I am trying to submit has both rss for "top items" and "latest". Include both or just "latest"? |
17:13
🔗
|
arkiver |
That would be just latest |
17:13
🔗
|
Atluxity |
ok |
17:13
🔗
|
arkiver |
Just add a good refresh time so it won't miss any articles |
17:13
🔗
|
HCross |
The grabber has gone down for a second to update the script |
17:28
🔗
|
Atluxity |
this freaking site has no structure! grrrr |
17:29
🔗
|
Atluxity |
"latest" is small news bulletins... articles are "top items" only |
17:30
🔗
|
Atluxity |
nothing in the URL tells whether the page has video in it or not |
17:31
🔗
|
HCross |
Do most of the pages in that site have videos? |
17:34
🔗
|
Atluxity |
nah |
17:34
🔗
|
Atluxity |
that would be a stretch |
17:35
🔗
|
arkiver |
If you have multiple URLs it has to check for new URLs, you can add multiple |
17:36
🔗
|
arkiver |
Always try to add as few URLs as possible, but still get all articles |
17:36
🔗
|
Atluxity |
yeah, I understand |
17:51
🔗
|
|
JesseW has joined #archiveteam |
17:53
🔗
|
|
ndiddy has joined #archiveteam |
17:59
🔗
|
|
signius has joined #archiveteam |
18:03
🔗
|
|
atomotic has joined #archiveteam |
18:03
🔗
|
joepie91 |
arkiver: HCross: been thinking for a while about something like that, good to see it happening |
18:03
🔗
|
joepie91 |
:p |
18:04
🔗
|
arkiver |
joepie91: feel free to add as many websites as you can :) |
18:04
🔗
|
|
Amitari has joined #archiveteam |
18:04
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
18:05
🔗
|
Amitari |
Hey, anyone who knows wget that can help me? |
18:05
🔗
|
joepie91 |
arkiver: how does one test it? |
18:05
🔗
|
joepie91 |
also, dashboard shows nothing |
18:05
🔗
|
arkiver |
joepie91: it checks for new links every now and then |
18:05
🔗
|
arkiver |
and downloads the list of found new links every hour |
18:06
🔗
|
arkiver |
There's not many websites, so that's why it often doesn't show downloads |
18:06
🔗
|
arkiver |
joepie91: read the instructions please |
18:07
🔗
|
arkiver |
Instructions and looking at other items shows how everything works I think |
18:07
🔗
|
arkiver |
scripts will be made public later maybe |
18:07
🔗
|
joepie91 |
arkiver: yes, I've read the instructions. it does not answer my question :) |
18:08
🔗
|
joepie91 |
and eh, scripts should be public straightaway |
18:08
🔗
|
HCross |
joepie91, we are changing the code every half an hour at this point |
18:08
🔗
|
joepie91 |
(also, checks every hour? it's not uncommon for controversial articles to be removed faster than that) |
18:08
🔗
|
joepie91 |
HCross: ok? |
18:09
🔗
|
HCross |
Ye. When its more developed we are going to consider releasing |
18:09
🔗
|
joepie91 |
"consider releasing"? |
18:09
🔗
|
joepie91 |
and why does that have to wait until "when its more developed"? |
18:09
🔗
|
arkiver |
yeah I'll put it online |
18:09
🔗
|
arkiver |
I do want to keep this on one server for now though |
18:10
🔗
|
joepie91 |
HCross: see also https://web.archive.org/web/20150429004351/http://blog.civiccommons.org/2011/01/be-open-from-day-one |
18:10
🔗
|
|
RichardG has joined #archiveteam |
18:10
🔗
|
HCross |
So we don't get overlap. We don't want 100 people all archiving BBC news at the same time, for example |
18:10
🔗
|
Atluxity |
I need help with a regex for the newsgrabber |
18:10
🔗
|
joepie91 |
HCross: that is unrelated to releasing code. |
18:10
🔗
|
Atluxity |
videoregex should match on subdomain "tv" |
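One way to write such a pattern is to anchor it on the host name, so that "tv" appearing elsewhere in the URL does not count. This is only a sketch: `tv.example.com` stands in for the real site, and `VIDEO_REGEX` and `is_video_url` are hypothetical names, not part of the grabber.

```python
import re

# Matches only when the host is exactly tv.example.com (placeholder domain).
# The trailing group requires a path, port, query, fragment, or end of string,
# which rejects look-alike hosts such as tv.example.com.evil.net.
VIDEO_REGEX = re.compile(r"^https?://tv\.example\.com(?:[/:?#]|$)")


def is_video_url(url):
    """Return True if the URL lives on the 'tv' subdomain."""
    return VIDEO_REGEX.search(url) is not None
```

Escaping the dots matters here: an unescaped `.` would also match any other character in that position.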
18:11
🔗
|
joepie91 |
if you don't want people doing that, then put in the readme that you don't want people doing that |
18:11
🔗
|
joepie91 |
making the code available, in this case, is a safety mechanism so that if you get hit by a bus, somebody can pick it up |
18:11
🔗
|
HCross |
True |
18:12
🔗
|
arkiver |
3 north korean websites added! |
18:12
🔗
|
HCross |
When the scripts get updated. - doing that now |
18:12
🔗
|
joepie91 |
basically, if you want people to use it carefully, just *ask* them to do so. don't immediately resort to the option of "force" (ie. keeping the code unavailable to them) |
18:15
🔗
|
HCross |
True, its in very early days right now |
18:15
🔗
|
HCross |
godane, do we have any nres on the Cryengine stuff? |
18:15
🔗
|
arkiver |
joepie91: yeah, we get it |
18:16
🔗
|
Amitari |
Anyone who can help me with wget? When I try to save a cookie before archiving a PhpBB-forum, I get the message "Remote file exists and could contain further links, |
18:16
🔗
|
Amitari |
but recursion is disabled -- not retrieving. |
18:16
🔗
|
Amitari |
" |
18:19
🔗
|
arkiver |
Atluxity: I'm off for some time now, can I help you later? |
18:20
🔗
|
HCross |
Well, the north korean websites crashed on me |
18:20
🔗
|
Atluxity |
arkiver: sure |
18:23
🔗
|
Atluxity |
https://github.com/atluxity/NewsGrabber/blob/master/services/web_nrk_no.py |
18:23
🔗
|
Atluxity |
they split up in so many urls :\ |
18:42
🔗
|
joepie91 |
HCross: arkiver: do you want example URLs for some of the BBC's older and newer formats? |
18:42
🔗
|
joepie91 |
some are still in use for specials |
18:42
🔗
|
joepie91 |
others only for historical articles |
18:42
🔗
|
joepie91 |
(they don't migrate - they just leave the old content where it is) |
18:43
🔗
|
HCross |
we have the BBC news stuff already, we are more about going after the breaking news. I don't see why not though |
18:43
🔗
|
joepie91 |
HCross: the BBC uses more than one format |
18:43
🔗
|
joepie91 |
including very fancy highly multimedial ones |
18:43
🔗
|
HCross |
ah. Go on then |
18:43
🔗
|
joepie91 |
:p |
18:44
🔗
|
Amitari |
Hey, could anyone here possibly help me with wget? |
18:45
🔗
|
joepie91 |
HCross: http://news.bbc.co.uk/2/hi/health/406713.stm, http://www.bbc.co.uk/news/resources/idt-07eeeebb-d450-4e4b-98d4-755369be7855 / http://www.bbc.com/news/special/2014/newsspec_7617/index.html, http://www.bbc.com/news/world-europe-25190119, http://www.bbc.co.uk/newsbeat/24449861, http://www.bbc.com/future/story/20131112-potato-power-to-light-the-world, http://www.bbc.co.uk/blogs/adamcurtis/posts/BUGGER, http://news.bbc.co.uk/2/hi/science/nature/ |
18:45
🔗
|
joepie91 |
630961.stm, http://news.bbc.co.uk/2/hi/uk_news/england/manchester/3758209.stm, http://www.bbc.co.uk/music/reviews/9gvh |
18:45
🔗
|
joepie91 |
err |
18:46
🔗
|
joepie91 |
the cut-off one is http://news.bbc.co.uk/2/hi/science/nature/630961.stm |
18:46
🔗
|
joepie91 |
these are all slightly different URL/content formats |
18:46
🔗
|
joepie91 |
for different types of content |
18:46
🔗
|
joepie91 |
most of these are still in use |
18:46
🔗
|
joepie91 |
the .stm ones are legacy, no longer in use but still referenced |
18:47
🔗
|
joepie91 |
the news/resources, news/special and BBC future ones are likely to have JS-loaded content |
18:47
🔗
|
joepie91 |
Amitari: probably best to ask in #archiveteam-bs |
18:47
🔗
|
Amitari |
Thanks! |
18:47
🔗
|
|
Amitari has left Leaving |
18:48
🔗
|
HCross |
joepie91, thanks. cc arkiver |
18:48
🔗
|
joepie91 |
HCross: arkiver: also, keep in mind that nutech is on a different domain from nu.nl, and their articles are not consistently listed on nu.nl |
18:48
🔗
|
joepie91 |
idem for rtlz/editienl and rtl.nl |
18:48
🔗
|
|
SN4T14 has quit IRC (Read error: Operation timed out) |
18:48
🔗
|
|
SN4T14 has joined #archiveteam |
18:49
🔗
|
joepie91 |
webwereld is also one worth looking into, but they also cross-post across multiple sites, though not reliably |
18:49
🔗
|
joepie91 |
same for infoworld/pcworld |
18:49
🔗
|
JesseW |
urlteam tracker seems to be borked for now |
18:50
🔗
|
arkiver |
joepie91: https://github.com/ArchiveTeam/NewsGrabber/blob/master/services/web__bbc_com.py |
18:50
🔗
|
arkiver |
please have a look at those services |
18:51
🔗
|
arkiver |
and if you want anything added you can write a python file for it |
18:52
🔗
|
joepie91 |
arkiver: I don't have much time right now (or rather, until after 32C3), hence sharing the knowledge :) |
18:52
🔗
|
joepie91 |
plus I need some way to test things |
18:52
🔗
|
arkiver |
just test if the regex matches the URLs you want to extract from your seed URLs |
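A rough offline way to do that check, assuming you have saved the HTML of a seed page: collect every link on the page, resolve it against the seed URL, and see which ones the regex keeps. `LinkCollector`, `matching_links`, the sample HTML, and the pattern are all made up for this sketch and use a placeholder domain.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the seed URL.
                    self.links.append(urljoin(self.base_url, value))


def matching_links(html, base_url, pattern):
    """Return the links found in html that match the given regex."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [url for url in collector.links if re.search(pattern, url)]


# Stand-in for a downloaded seed page.
SAMPLE_HTML = (
    '<a href="/news/world-europe-25190119">story</a> '
    '<a href="/sport">sport</a>'
)
# Hypothetical article pattern: /news/<slug>-<numeric id>
PATTERN = r"^https?://www\.example\.com/news/[a-z-]+-\d+$"

print(matching_links(SAMPLE_HTML, "http://www.example.com/", PATTERN))
```

If the printed list contains the articles you expect and nothing else, the regex is probably good enough to submit.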
18:53
🔗
|
JesseW |
arkiver: could you look at the server logs on the urlteam tracker -- it seems to be broken |
18:53
🔗
|
joepie91 |
regardless, no time for PRs atm |
19:01
🔗
|
arkiver |
Atluxity: commented |
19:03
🔗
|
arkiver |
JesseW: I think chfoo has to do that |
19:04
🔗
|
JesseW |
ah, ok |
19:04
🔗
|
JesseW |
xmc: do you have access? |
19:10
🔗
|
|
scyther has joined #archiveteam |
19:38
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
19:38
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
19:50
🔗
|
|
brayden_ has quit IRC (Read error: Connection reset by peer) |
19:50
🔗
|
|
brayden has joined #archiveteam |
19:50
🔗
|
|
swebb sets mode: +o brayden |
19:51
🔗
|
Atluxity |
arkiver: ack |
20:00
🔗
|
Start |
it seems that rather than having 1 rss feed cbc has a whole bunch: http://www.cbc.ca/rss/ |
20:01
🔗
|
|
maseck has quit IRC (Remote host closed the connection) |
20:04
🔗
|
godane |
joepie91: i'm saving those bbc news urls |
20:05
🔗
|
godane |
example: http://news.bbc.co.uk/2/hi/630961.stm |
20:05
🔗
|
godane |
you can just brute force |
20:11
🔗
|
|
schbirid has joined #archiveteam |
20:19
🔗
|
|
JesseW has quit IRC (Leaving.) |
20:25
🔗
|
|
alberto has quit IRC (Ping timeout: 250 seconds) |
20:25
🔗
|
|
JesseW has joined #archiveteam |
20:34
🔗
|
|
Ghost_of_ has quit IRC (Quit: Leaving) |
20:38
🔗
|
|
JesseW has quit IRC (Leaving.) |
20:41
🔗
|
|
maseck has joined #archiveteam |
21:02
🔗
|
|
xXx_ndidd has joined #archiveteam |
21:08
🔗
|
|
Coderjoe has quit IRC (Read error: Connection reset by peer) |
21:09
🔗
|
|
ndiddy has quit IRC (Read error: Operation timed out) |
21:14
🔗
|
|
Coderjoe has joined #archiveteam |
21:33
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
21:50
🔗
|
|
Ghost_of_ has joined #archiveteam |
21:55
🔗
|
Atluxity |
arkiver: updated |
21:56
🔗
|
|
JesseW has joined #archiveteam |
22:26
🔗
|
|
JesseW has quit IRC (Leaving.) |
22:30
🔗
|
|
scyther has quit IRC (Read error: Connection reset by peer) |
22:44
🔗
|
|
closure has joined #archiveteam |
22:45
🔗
|
|
nertzy has joined #archiveteam |
23:05
🔗
|
|
err3 has joined #archiveteam |
23:05
🔗
|
err3 |
hello |
23:07
🔗
|
|
nertzy has quit IRC (Quit: This computer has gone to sleep) |
23:10
🔗
|
Atluxity |
GREETINGS! |
23:10
🔗
|
err3 |
I've got an idea for an archiving project |
23:10
🔗
|
err3 |
just in case anyone likes it |
23:11
🔗
|
Atluxity |
lay it on us |
23:11
🔗
|
err3 |
there's some good forums where people post math problems and solutions, e.g. artofproblemsolving |
23:11
🔗
|
err3 |
just went to it after a long time and it had totally changed, I got a shock that maybe they removed all of the old stuff - apparently they haven't |
23:11
🔗
|
err3 |
but it might be good to somehow make an archive of it |
23:12
🔗
|
err3 |
I'm not sure if it would need some special scripting to do |
23:12
🔗
|
Atluxity |
got some urls? |
23:14
🔗
|
err3 |
https://www.artofproblemsolving.com/community is it now |
23:14
🔗
|
err3 |
https://web.archive.org/web/20130201150755/http://www.artofproblemsolving.com/Forum/index.php used to look like this |
23:15
🔗
|
err3 |
let me get a better one |
23:15
🔗
|
Atluxity |
wonder how big these sites are... probably not too big |
23:16
🔗
|
err3 |
they might not be too large, the important thing is the text (although sometimes equations get rendered into images) |
23:16
🔗
|
err3 |
https://web.archive.org/web/20130510031806/http://www.artofproblemsolving.com/Forum/index.php |
23:16
🔗
|
err3 |
thats how i remember it |
23:17
🔗
|
err3 |
https://web.archive.org/web/20140331091424/http://www.artofproblemsolving.com/Forum/viewforum.php?f=56 |
23:17
🔗
|
err3 |
i think a lot of the posts are not archived |
23:29
🔗
|
|
RichardG_ has joined #archiveteam |
23:29
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
23:35
🔗
|
|
Ghost_of_ has quit IRC (Quit: Leaving) |
23:42
🔗
|
|
WinterFox has joined #archiveteam |
23:44
🔗
|
HCross |
For the newsgrab, when you submit, please check the file naming. |
23:48
🔗
|
HCross |
its web__foo_bar_com.py |
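A small helper can make the naming harder to get wrong. This sketch assumes the convention is the prefix `web__` followed by the host name with dots turned into underscores, which is only inferred from `web__foo_bar_com.py` and `web__bbc_com.py`; `service_filename` is a hypothetical name, so check the readme before relying on it.

```python
def service_filename(hostname):
    """Build a service file name such as web__foo_bar_com.py.

    Assumed convention (not confirmed by the project): 'web__' prefix,
    lower-cased host name, dots replaced by underscores, '.py' suffix.
    """
    cleaned = hostname.strip().lower().strip(".")
    if not cleaned:
        raise ValueError("hostname must not be empty")
    return "web__" + cleaned.replace(".", "_") + ".py"


print(service_filename("foo.bar.com"))
```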